
Commit f00aa2a

Merge pull request #543 from future-agi/fix/rewrite-observe-evals
rewrite observability evals
2 parents: 3c0cc01 + 92f3028

2 files changed

Lines changed: 33 additions & 32 deletions


src/lib/navigation.ts

Lines changed: 2 additions & 2 deletions
@@ -204,8 +204,8 @@ export const tabNavigation: NavTab[] = [
     title: 'Features',
     items: [
       { title: 'Set Up Observability', href: '/docs/observe/features/quickstart' },
-      { title: 'Evals', href: '/docs/observe/features/evals' },
-      { title: 'Group Traces by Session', href: '/docs/observe/features/session' },
+      { title: 'Run Evals on Traces', href: '/docs/observe/features/evals' },
+      { title: 'Sessions', href: '/docs/observe/features/session' },
       { title: 'Users', href: '/docs/observe/features/users' },
       { title: 'Alerts & Monitors', href: '/docs/observe/features/alerts' },
       { title: 'Voice Observability', href: '/docs/observe/features/voice' },

src/pages/docs/observe/features/evals.mdx

Lines changed: 31 additions & 30 deletions
@@ -1,79 +1,80 @@
 ---
-title: "Evaluations"
-description: "Create and run eval tasks on Observe project data: filter spans, choose historic or continuous runs, set sampling and limits, and attach preset or custom evaluations."
+title: "Run Evals on Traces"
+description: "Run automated quality checks on your traced spans in Observe: filter spans, choose historic or continuous runs, set sampling and limits, and attach preset or custom evaluations."
 ---
 
 ## About
 
-**Evaluations** in Observe are automated quality checks run on your traced spans, e.g. hallucination, bias, context adherence, toxicity. The feature scores LLM (or other) outputs so you can see pass/fail and numeric results per span in the dashboard, track quality over time, and trigger alerts when scores cross a threshold. Evals can run on existing data (historic) or on new spans as they arrive (continuous), and results are stored on the span and available for filtering, export, and monitors.
+Evals run automated quality checks on your production traces, scoring every LLM response for hallucination, tone, bias, toxicity, and more. You configure which checks to run, filter which spans they apply to, and choose whether to evaluate historical data or new spans as they arrive. Results appear per span in the Observe dashboard and can trigger alerts when quality drops.
+
 {/* ARCADE EMBED START */}
 <script>{` function onArcadeIframeMessage(e) { if (e.origin !== 'https://demo.arcade.software' || !e.isTrusted) return; const arcadeIframe = document.querySelector(\`iframe[src*=\${e.data.id}]\`); if (!arcadeIframe || !arcadeIframe.contentWindow) return; if (e.data.event === 'arcade-init') { arcadeIframe.contentWindow.postMessage({event: 'register-popout-handler'}, '*'); } if (e.data.event === 'arcade-popout-open') { arcadeIframe.style['position'] = 'fixed'; arcadeIframe.style['z-index'] = '9999999'; } if (e.data.event === 'arcade-popout-close') { arcadeIframe.style['position'] = 'absolute'; arcadeIframe.style['z-index'] = 'auto'; } } window.addEventListener('message', onArcadeIframeMessage); `}</script>
 <div style={{position: 'relative', paddingBottom: 'calc(57.1875% + 100px)', height: 0, minWidth: '600px', width: '100%'}}><iframe src="https://demo.arcade.software/Yu4mABONU00uVaeC2NKP?embed&embed_mobile=inline&embed_desktop=inline&show_copy_link=true" title="Datasets Evaluations" frameBorder="0" loading="lazy" allowFullScreen allow="clipboard-write" style={{position: 'absolute', top: 0, left: 0, width: '100%', height: '100%', colorScheme: 'light'}} ></iframe></div>
 {/* ARCADE EMBED END */}
+
+---
+
 ## When to use
 
-- **Historic batch**: Run evals on a time range of existing spans to score quality (e.g. hallucination, bias, context adherence) after the fact.
-- **Continuous monitoring**: Run evals automatically on new spans as they arrive so you catch regressions in production.
-- **Cost and volume control**: Use sampling rate (e.g. 10%) and max spans per run so you don’t evaluate every span and can control cost.
-- **Targeted evaluation**: Filter by observation type (e.g. LLM only), session, or span attributes so only relevant spans are evaluated.
-- **Multiple evals per task**: Attach several eval configs to one task so each span gets multiple scores in a single run.
+- **Scoring production output quality**: Run historic evals after a release to check for hallucinations, bias, or unsafe content across real traffic.
+- **Catching regressions in production**: Set up a continuous eval task so new spans are scored automatically and you see quality drops before users report them.
+- **Spot-checking a specific time window**: Filter by date range or session to evaluate only the spans from an incident or a specific user flow.
+- **Controlling eval cost**: Use sampling rate and span limits to evaluate a representative subset instead of every span.
+- **Running multiple quality checks at once**: Attach several evals to one task so each span gets scored for tone, safety, and accuracy in a single run.
 
 ---
 
 ## How to
 
 <Steps>
 <Step title="Set filters">
-Define filters so the task runs only on the spans you care about. Supported filter keys:
+Define filters so the task runs only on the spans you care about.
 
 ![Set filters](/images/docs/observe/1.png)
 
-- **observation_type**: Node/span type (e.g. `llm`, `chain`, `agent`). Pass a string or list of types.
-- **date_range**: Time range: a two-element list `[start_date, end_date]` (applied to `created_at`).
-- **created_at**: Minimum creation time (spans created at or after this value).
-- **project_id**: Restrict to a specific Observe project.
-- **session_id**: Restrict to traces in a given session.
-- **span_attributes_filters**: List of span-attribute conditions (same structure as in the Observe UI filters).
+| Filter | Description |
+|--------|-------------|
+| `observation_type` | Node/span type (e.g. `llm`, `chain`, `agent`). |
+| `date_range` | Time range: `[start_date, end_date]` applied to `created_at`. |
+| `created_at` | Minimum creation time (spans at or after this value). |
+| `project_id` | Restrict to a specific Observe project. |
+| `session_id` | Restrict to traces in a given session. |
+| `span_attributes_filters` | List of span-attribute conditions. |
 
-Filters are stored in the tasks `filters` JSON and applied when the task runs.
+Filters are stored in the task's `filters` field and applied when the task runs.
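As a concrete illustration of the filter keys above, a `filters` payload combining several of them might look like the following sketch. The exact JSON shape, the condition structure inside `span_attributes_filters`, and the IDs are illustrative assumptions, not the documented API contract:

```python
import json

# Hypothetical sketch of an eval task's `filters` field, combining the
# documented filter keys. The span_attributes_filters entry shape and
# the project ID are illustrative assumptions.
filters = {
    "observation_type": ["llm"],                 # evaluate LLM spans only
    "date_range": ["2025-01-01", "2025-01-31"],  # applied to created_at
    "project_id": "observe-project-123",         # hypothetical project ID
    "span_attributes_filters": [
        {"key": "llm.model_name", "operator": "eq", "value": "gpt-4o"}
    ],
}

print(json.dumps(filters, indent=2))
```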
 </Step>
 
 <Step title="Choose run type">
-Set **run_type**:
+Set the **run type**:
 
 ![Choose run type](/images/docs/observe/2.png)
 
-- **Historical**: Run on existing spans that match the filters (optionally within a time range). The task processes up to the sampling cap and span limit, then can complete.
-- **Continuous**: Run on new spans as they arrive. Each run only considers spans created after the last run; the task stays active for ongoing evaluation.
+- **Historical**: Run on existing spans matching the filters, up to the sampling cap and span limit. The task completes after processing.
+- **Continuous**: Run on new spans as they arrive. Each run only processes spans created after the last run; the task stays active for ongoing evaluation.
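The difference between the two run types can be sketched as a cutoff on span creation time. This is a simplified model for illustration; the actual scheduler logic is not documented here:

```python
from datetime import datetime, timezone

def spans_for_run(spans, run_type, last_run_at=None):
    """Simplified model: historical runs see all matching spans;
    continuous runs only see spans created after the previous run."""
    if run_type == "continuous" and last_run_at is not None:
        return [s for s in spans if s["created_at"] > last_run_at]
    return list(spans)

spans = [
    {"id": "a", "created_at": datetime(2025, 1, 1, tzinfo=timezone.utc)},
    {"id": "b", "created_at": datetime(2025, 1, 2, tzinfo=timezone.utc)},
    {"id": "c", "created_at": datetime(2025, 1, 3, tzinfo=timezone.utc)},
]
cutoff = datetime(2025, 1, 1, 12, tzinfo=timezone.utc)

print([s["id"] for s in spans_for_run(spans, "historical")])          # all spans
print([s["id"] for s in spans_for_run(spans, "continuous", cutoff)])  # only spans after the cutoff
```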
 </Step>
 
 <Step title="Set sampling rate and span limit">
 ![Set sampling rate and span limit](/images/docs/observe/3.png)
 
-- **sampling_rate**: Percentage of matching spans to evaluate (0–100). Example: `50` means 50% of filtered spans are sampled per run. Helps control cost and volume.
-- **spans_limit**: Maximum number of spans to process per run (default is 1000). For historical runs, the task stops when either the sampled count or this limit is reached.
+- **sampling_rate**: Percentage of matching spans to evaluate (0-100). For example, `50` evaluates 50% of filtered spans per run.
+- **spans_limit**: Maximum number of spans to process per run (default 1000). The task stops when either the sampled count or this limit is reached.
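How the two settings might combine can be sketched as follows. The interaction (sample first, then cap) is an assumption consistent with the bullets above, not an exact reproduction of the backend:

```python
import random

def select_for_run(matching_spans, sampling_rate, spans_limit=1000):
    """Sample a percentage of matching spans, then cap at the per-run limit."""
    k = int(len(matching_spans) * sampling_rate / 100)
    sampled = random.sample(matching_spans, k)
    return sampled[:spans_limit]  # whichever bound is hit first wins

# 50% of 4000 spans is 2000, but the default limit caps the run at 1000.
selected = select_for_run(list(range(4000)), sampling_rate=50)
print(len(selected))  # 1000
```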
 </Step>
 
 <Step title="Select evals to run">
-
-Attach one or more **CustomEvalConfig** IDs to the task (the evals you’ve already created for the project). The task runs each selected eval on every span it processes. For evals that need an input (e.g. Bias Detection), configure the **input key** to a span attribute path (e.g. `llm.output_messages.0.message.content`) so the eval reads the right field from each span. See [built-in evals](/docs/evaluation/builtin) for supported evaluations and their inputs.
+Attach one or more eval configs to the task. The task runs each selected eval on every span it processes. For evals that need an input (e.g. Bias Detection), set the **input key** to a span attribute path (e.g. `gen_ai.output.messages.0.message.content`) so the eval reads the right field from each span. See [built-in evals](/docs/evaluation/builtin) for supported evaluations and their required inputs.
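To make the input key concrete, here is a sketch of resolving a dotted attribute path against a span's attributes. It assumes attributes are stored as a nested structure in which numeric segments index into lists; whether the backend stores them nested or as flat dotted keys is an assumption:

```python
def resolve_input_key(attributes, path):
    """Walk a dotted path like 'gen_ai.output.messages.0.message.content';
    numeric segments index into lists, other segments into dicts."""
    current = attributes
    for segment in path.split("."):
        if isinstance(current, list):
            current = current[int(segment)]
        else:
            current = current[segment]
    return current

# Hypothetical span attributes, shaped to match the example path above.
span_attributes = {
    "gen_ai": {"output": {"messages": [
        {"message": {"content": "Paris is the capital of France."}}
    ]}}
}

text = resolve_input_key(span_attributes, "gen_ai.output.messages.0.message.content")
print(text)  # the field the eval would read
```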
 </Step>
 
 <Step title="Run the task">
 ![run](/images/docs/observe/4.png)
-Create or update the eval task via the API or UI, then run it. You can test the configuration (filters and evals) before saving. Task status values: `pending`, `running`, `completed`, `failed`, `paused`, `deleted`. Results appear on the spans in the Observe dashboard and can be used for alerts.
+
+Create or update the eval task via the API or UI, then run it. You can test the configuration before saving. Task status values: `pending`, `running`, `completed`, `failed`, `paused`, `deleted`. Results appear on the spans in the Observe dashboard and can be used for alerts.
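The documented status values can be modeled as a simple enum. Treating `pending` and `running` as the only states in which a task is still picked up for work is an illustrative assumption:

```python
from enum import Enum

class TaskStatus(str, Enum):
    """The task status values listed in the docs."""
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    PAUSED = "paused"
    DELETED = "deleted"

def is_schedulable(status: TaskStatus) -> bool:
    # Assumption: only pending/running tasks are picked up by the next run.
    return status in (TaskStatus.PENDING, TaskStatus.RUNNING)

print(is_schedulable(TaskStatus.RUNNING), is_schedulable(TaskStatus.PAUSED))
```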
 </Step>
 </Steps>
 
 <Note>
-Eval tasks are processed asynchronously (e.g. by a cron). Status and results update as runs complete. For continuous tasks, new spans are picked up on subsequent runs.
+Eval tasks are processed asynchronously. Status and results update as runs complete. For continuous tasks, new spans are picked up on subsequent runs.
 </Note>
 
-## Key concepts
-
-- **Span attributes**: Spans store data in key-value form (e.g. `llm.output_messages.0.message.content`). When an eval needs an input, you point it to one of these attribute paths. See [spans](/docs/tracing/concepts/spans) and [span attributes](/docs/tracing/concepts/spans#span-attributes) for the schema.
-- **Bias Detection example**: Set the eval’s input key to a span attribute that holds the text to check (e.g. `llm.output_messages.0.message.content`). The eval returns Passed (neutral) or Failed (bias detected).
-
 ---
 
 ## Next Steps
