Manage LLM Applications with UpTrain + Langfuse


It’s a well-known fact that building an LLM application’s prototype is easy, but upgrading it to production-grade quality is extremely hard. Due to the non-deterministic nature of LLMs, it is very difficult to control their behavior in the wild and prevent silent failures. This calls for a robust evaluation and observability framework. We are excited to announce UpTrain’s recent integration with Langfuse, giving LLM developers best-in-class tools to manage and improve their LLM applications in development and production. If you wish to skip right ahead to the tutorial, check it out here.

What is UpTrain?

UpTrain [github || website || docs] is an open-source platform to evaluate and improve Generative AI applications. It provides grades for 20+ preconfigured checks (covering language, code, and embedding use cases), performs root cause analysis on failure cases, and provides guidance on how to resolve them.

Key Highlights:

  • Data Security: As an open-source solution, UpTrain conducts all evaluations and analyses locally, ensuring that your data remains within your secure environment (except for the LLM calls).
  • Custom Evaluator LLMs: UpTrain allows you to customise your evaluator LLM, with options spanning several endpoints, including OpenAI, Anthropic, Llama, Mistral, and Azure (see the configuration sketch after this list).
  • Insights that help with model improvement: Beyond mere evaluation, UpTrain offers deep insights by pinpointing the specific components of your LLM pipeline that are underperforming and identifying common patterns among failure cases, thereby helping you resolve them.
  • Diverse Experimentations: The platform enables experimentation with different prompts, LLM models, RAG modules, embedding models, etc. and helps you find the best fit for your specific use case.
  • Compare open-source LLMs: With UpTrain, you can compare your fine-tuned open-source LLMs against proprietary ones (such as GPT-4), helping you to find the most cost-effective model without compromising quality.
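For example, here is a minimal sketch of pointing UpTrain at a specific evaluator LLM. The exact field names (model, openai_api_key) are assumptions based on UpTrain’s Settings object; check the docs for your provider.

from uptrain import EvalLLM, Evals, Settings

# choose which LLM performs the evaluations; the field names below
# (model, openai_api_key) are assumptions based on UpTrain's Settings object
settings = Settings(model="gpt-4", openai_api_key="sk-...")
eval_llm = EvalLLM(settings=settings)

results = eval_llm.evaluate(
    data=[{"question": "What is UpTrain?", "response": "An open-source evaluation tool."}],
    checks=[Evals.RESPONSE_COMPLETENESS],
)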

What is Langfuse?

Langfuse [github || website || docs] is an open-source LLM engineering platform to help teams collaboratively debug, analyze and iterate on their LLM Applications.

Tracing: At the core of Langfuse lies its observability capabilities. Its tracing features allow teams to instrument their application through async SDKs (Python & TS) or integrations (OpenAI, Langchain, Llama-Index…) and start collecting a rich dataset of LLM traces. This data is the corpus on top of which users can run analyses and workflows.
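As a minimal sketch, assuming the Python SDK’s @observe decorator integration (the function names and placeholder logic below are illustrative):

from langfuse.decorators import observe

@observe()  # creates a span for this function, nested under the current trace
def retrieve_context(question: str) -> str:
    return "…retrieved documents…"  # placeholder retrieval logic

@observe()  # the outermost decorated call becomes the trace
def answer(question: str) -> str:
    context = retrieve_context(question)
    return f"Answer based on: {context}"  # placeholder generation logic

answer("What does UpTrain evaluate?")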

The Langfuse platform allows users to visually inspect and debug their applications. Common use cases include investigating costs and latencies and drilling into user feedback and bugs.

Evals: Langfuse allows users to score the quality of their application. This can be done through human input (user feedback, manual scoring) or model-based evaluations (now including uptrain.ai evals). 

Prompt Management: Langfuse effectively manages and versions prompts. Engineered for optimal performance, the feature incorporates SDK-level caching for enhanced efficiency. It can be thought of as a Prompt CMS (Content Management System).
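A minimal sketch of fetching and filling a managed prompt via the Python SDK (the prompt name "qa-answer" and its variable are hypothetical examples):

from langfuse import Langfuse

langfuse = Langfuse()

# fetch the current production version of a prompt managed in Langfuse
# ("qa-answer" and its {{question}} variable are hypothetical examples)
prompt = langfuse.get_prompt("qa-answer")
compiled_prompt = prompt.compile(question="What does UpTrain evaluate?")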

Metrics: Langfuse analytics derives actionable insights from production data. Langfuse tracks rich data on quality, cost, latency and volume. Analyze this data via dashboards and use powerful filtering and export functions. All data in Langfuse can be exported at any point for analysis or fine-tuning.

More: Langfuse is an open source project (GitHub). It can be easily self-hosted or used through Langfuse Cloud with a generous free tier. See the Langfuse Documentation for more information.

Evaluation with UpTrain x Observability with Langfuse

With this integration, you can seamlessly use UpTrain to evaluate the quality of your LLM applications and add those scores to the traces captured by Langfuse for observability. If you want to jump into the code directly, try the tutorial here.

Step 1: Log your query-response pairs with Langfuse

Use one of the many integrations (Python, JS, Langchain, LlamaIndex, LiteLLM, …) of Langfuse to capture production or development data of your LLM application. Check out the docs for a full list of all integrations and a quickstart.
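For example, with the OpenAI drop-in integration, capturing a generation might look like this sketch (the model and prompt are placeholders):

from langfuse.openai import openai  # drop-in replacement that traces OpenAI calls

completion = openai.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder model
    messages=[{"role": "user", "content": "What does UpTrain evaluate?"}],
)
# the call above is logged to Langfuse as a generation, with inputs, outputs and usage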

Step 2: Retrieve the traces to evaluate with UpTrain

from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY from the environment

for interaction in data:
    trace = langfuse.trace(name="uptrain batch")
    trace.span(
        name="retrieval",
        input={'question': interaction['question']},
        output={'context': interaction['context']}
    )
    trace.span(
        name="generation",
        input={'question': interaction['question'], 'context': interaction['context']},
        output={'response': interaction['response']}
    )

# wait until the Langfuse SDK has processed all events before retrieving them in the next step
langfuse.flush()
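Step 4 below attaches scores to individual traces, so each row needs its trace ID. The snippet above does not show that bookkeeping; one way (an assumption, not necessarily the tutorial’s exact code) is to record trace.id while creating the traces:

trace_ids = []  # one Langfuse trace ID per interaction, reused in Step 4

for interaction in data:
    trace = langfuse.trace(name="uptrain batch")
    trace_ids.append(trace.id)  # remember which trace this interaction belongs to
    # ... create the "retrieval" and "generation" spans as shown above ...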

Step 3: Evaluate the quality of retrieved-context and generated response

from uptrain import EvalLLM, Evals

eval_llm = EvalLLM(openai_api_key=OPENAI_API_KEY)  # evaluator LLM; requires an OpenAI API key

res = eval_llm.evaluate(
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.FACTUAL_ACCURACY, Evals.RESPONSE_COMPLETENESS]
)

Step 4: Log the scores back to the Langfuse platform for visualization and analysis
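The loop below assumes df is a pandas DataFrame with one row per interaction, containing UpTrain’s scores and the matching Langfuse trace ID. A minimal sketch of building it from the res returned in Step 3 and the trace_ids recorded earlier:

import pandas as pd

# each result dict from UpTrain includes score columns such as
# "score_context_relevance", "score_factual_accuracy", "score_response_completeness"
df = pd.DataFrame(res)
df["trace_id"] = trace_ids  # align each row with the trace it was generated from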

# log each UpTrain score back to the trace it belongs to
for _, row in df.iterrows():
    for metric_name in ["context_relevance", "factual_accuracy", "response_completeness"]:
        langfuse.score(
            name=metric_name,
            value=row["score_" + metric_name],
            trace_id=row["trace_id"]
        )

And voila! You can now observe the quality of your application along with full traces in these 4 simple steps. By incorporating UpTrain’s evaluation metrics into Langfuse, you can track and compare the effectiveness of different experiments, gaining valuable insights into the strengths and weaknesses of your LLM models and applications in development and production.

Creating a trace on LLM data using Langfuse

20+ pre-configured checks at your disposal

UpTrain provides 20+ pre-configured checks to evaluate the quality of your final response, retrieved context, and all the interim steps.

You can find the complete list of pre-configured evaluations here. Further, you can use these evaluation and observability stats to understand how your systems are performing in production, identify bottlenecks, and iteratively improve them.

Conclusion

In this blog, we introduced the recent integration between UpTrain’s evaluation platform and Langfuse’s observability platform. With this integration, you can seamlessly track your applications’ latency, cost and quality, all in one place. Happy building!

