Databricks for Playground and Evaluations
This guide walks you through integrating Databricks language model endpoints with Langfuse. By doing so, you can quickly experiment with prompts and debug interactions using the Langfuse Playground, as well as benchmark your models systematically with Evaluations.
With Langfuse, you can:
- Experiment in the Playground: The interactive Playground lets you test your language models in real time. You can send custom prompts, review detailed responses, and add prompts to your Prompt Library.
- Benchmark with Evaluations: LLM-as-a-Judge evaluations provide a way to benchmark your application’s performance. You can run pre-defined test templates, analyze metrics like latency and accuracy, and refine your models based on measurable outcomes.
Step 1: Set Up a Serving Endpoint in Databricks
Begin by setting up a serving endpoint in Databricks. This lets you query your own fine-tuned models as well as external models from providers such as OpenAI or Anthropic through a single gateway. For advanced configuration options, refer to the Databricks docs.
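Before adding the endpoint to Langfuse, you can sanity-check it from Python. The sketch below assumes the endpoint exposes Databricks' OpenAI-compatible chat API and authenticates with a Databricks personal access token; the workspace host and endpoint name are placeholders.

```python
import os
from openai import OpenAI

# Databricks serving endpoints are reachable via an OpenAI-compatible API
# under https://<workspace-host>/serving-endpoints, authenticated with a
# Databricks personal access token. Host and endpoint name are placeholders.
client = OpenAI(
    api_key=os.environ["DATABRICKS_TOKEN"],
    base_url="https://<your-workspace>.cloud.databricks.com/serving-endpoints",
)

response = client.chat.completions.create(
    model="databricks-endpoint-name",  # name of the serving endpoint you created
    messages=[{"role": "user", "content": "Say hello from Databricks."}],
)
print(response.choices[0].message.content)
```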
Step 2: Add the Model in Your Project Settings
Next, add your Databricks model endpoint to your Langfuse project settings.
Make sure you’ve entered the correct endpoint URL and authentication details. The model name
is the name of the serving endpoint you created in Databricks.
Step 3: Use the Model in the Playground
The Langfuse Playground offers an interactive interface where you can:
- Send prompts and review responses instantly.
- Add prompts to your Prompt Library.
Select Databricks as your LLM provider and choose the endpoint you configured earlier.
Step 4: Use the Model for Evaluations
LLM-as-a-Judge is a technique for evaluating the quality of LLM applications by using another LLM as the evaluator. The judge model is given a trace or a dataset entry and asked to score the output and explain its reasoning. Both the score and the explanation are stored as scores in Langfuse.
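Langfuse's managed LLM-as-a-Judge evaluators run inside Langfuse, so no code is required; the sketch below only illustrates the underlying mechanics, using your Databricks endpoint as the judge. It assumes the v2 Python SDK's `score()` method, and the workspace host, endpoint name, and prompt format are placeholders.

```python
import os
from openai import OpenAI
from langfuse import Langfuse

# Judge model served by Databricks (host and endpoint name are placeholders).
judge = OpenAI(
    api_key=os.environ["DATABRICKS_TOKEN"],
    base_url="https://<your-workspace>.cloud.databricks.com/serving-endpoints",
)
langfuse = Langfuse()  # reads LANGFUSE_* credentials from the environment

def judge_correctness(trace_id: str, question: str, answer: str) -> None:
    """Ask the judge for a 0-1 correctness score plus reasoning,
    then attach both to the trace as a Langfuse score."""
    verdict = judge.chat.completions.create(
        model="databricks-endpoint-name",
        messages=[{
            "role": "user",
            "content": (
                "Rate the correctness of the answer from 0 to 1 and explain "
                "briefly. Reply exactly as '<score>|<reasoning>'.\n"
                f"Question: {question}\nAnswer: {answer}"
            ),
        }],
    ).choices[0].message.content
    score, _, reasoning = verdict.partition("|")
    langfuse.score(  # v2 SDK; newer SDK versions may name this differently
        trace_id=trace_id,
        name="correctness",
        value=float(score.strip()),
        comment=reasoning.strip(),
    )
```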
Next Steps
Visit the example notebook for a step-by-step guide on tracing Databricks models with LangChain, LlamaIndex, and the OpenAI SDK:
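As a quick preview of what the notebook covers, tracing calls to your Databricks endpoint with Langfuse's OpenAI drop-in replacement looks roughly like this (workspace host and endpoint name are placeholders):

```python
import os
# Drop-in replacement for the OpenAI client that sends traces to Langfuse.
from langfuse.openai import OpenAI

client = OpenAI(
    api_key=os.environ["DATABRICKS_TOKEN"],
    base_url="https://<your-workspace>.cloud.databricks.com/serving-endpoints",
)

# This call is captured as a trace in Langfuse (model, input, output, latency).
completion = client.chat.completions.create(
    model="databricks-endpoint-name",  # your serving endpoint name
    messages=[{"role": "user", "content": "What does Databricks Model Serving do?"}],
)
print(completion.choices[0].message.content)
```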