
Commit f00aa2a

Merge pull request #543 from future-agi/fix/rewrite-observe-evals
rewrite observability evals
2 parents: 3c0cc01 + 92f3028

2 files changed

Lines changed: 33 additions & 32 deletions


src/lib/navigation.ts

Lines changed: 2 additions & 2 deletions
@@ -204,8 +204,8 @@ export const tabNavigation: NavTab[] = [
     title: 'Features',
     items: [
       { title: 'Set Up Observability', href: '/docs/observe/features/quickstart' },
-      { title: 'Evals', href: '/docs/observe/features/evals' },
-      { title: 'Group Traces by Session', href: '/docs/observe/features/session' },
+      { title: 'Run Evals on Traces', href: '/docs/observe/features/evals' },
+      { title: 'Sessions', href: '/docs/observe/features/session' },
       { title: 'Users', href: '/docs/observe/features/users' },
       { title: 'Alerts & Monitors', href: '/docs/observe/features/alerts' },
       { title: 'Voice Observability', href: '/docs/observe/features/voice' },

src/pages/docs/observe/features/evals.mdx

Lines changed: 31 additions & 30 deletions
@@ -1,79 +1,80 @@
 ---
-title: "Evaluations"
-description: "Create and run eval tasks on Observe project data: filter spans, choose historic or continuous runs, set sampling and limits, and attach preset or custom evaluations."
+title: "Run Evals on Traces"
+description: "Run automated quality checks on your traced spans in Observe: filter spans, choose historic or continuous runs, set sampling and limits, and attach preset or custom evaluations."
 ---
 
 ## About
 
-**Evaluations** in Observe are automated quality checks run on your traced spans, e.g. hallucination, bias, context adherence, toxicity. The feature scores LLM (or other) outputs so you can see pass/fail and numeric results per span in the dashboard, track quality over time, and trigger alerts when scores cross a threshold. Evals can run on existing data (historic) or on new spans as they arrive (continuous), and results are stored on the span and available for filtering, export, and monitors.
+Evals run automated quality checks on your production traces, scoring every LLM response for hallucination, tone, bias, toxicity, and more. You configure which checks to run, filter which spans they apply to, and choose whether to evaluate historical data or new spans as they arrive. Results appear per span in the Observe dashboard and can trigger alerts when quality drops.
+
 {/* ARCADE EMBED START */}
 <script>{` function onArcadeIframeMessage(e) { if (e.origin !== 'https://demo.arcade.software' || !e.isTrusted) return; const arcadeIframe = document.querySelector(\`iframe[src*=\${e.data.id}]\`); if (!arcadeIframe || !arcadeIframe.contentWindow) return; if (e.data.event === 'arcade-init') { arcadeIframe.contentWindow.postMessage({event: 'register-popout-handler'}, '*'); } if (e.data.event === 'arcade-popout-open') { arcadeIframe.style['position'] = 'fixed'; arcadeIframe.style['z-index'] = '9999999'; } if (e.data.event === 'arcade-popout-close') { arcadeIframe.style['position'] = 'absolute'; arcadeIframe.style['z-index'] = 'auto'; } } window.addEventListener('message', onArcadeIframeMessage); `}</script>
 <div style={{position: 'relative', paddingBottom: 'calc(57.1875% + 100px)', height: 0, minWidth: '600px', width: '100%'}}><iframe src="https://demo.arcade.software/Yu4mABONU00uVaeC2NKP?embed&embed_mobile=inline&embed_desktop=inline&show_copy_link=true" title="Datasets Evaluations" frameBorder="0" loading="lazy" allowFullScreen allow="clipboard-write" style={{position: 'absolute', top: 0, left: 0, width: '100%', height: '100%', colorScheme: 'light'}} ></iframe></div>
 {/* ARCADE EMBED END */}
+
+---
+
 ## When to use
 
-- **Historic batch**: Run evals on a time range of existing spans to score quality (e.g. hallucination, bias, context adherence) after the fact.
-- **Continuous monitoring**: Run evals automatically on new spans as they arrive so you catch regressions in production.
-- **Cost and volume control**: Use sampling rate (e.g. 10%) and max spans per run so you don’t evaluate every span and can control cost.
-- **Targeted evaluation**: Filter by observation type (e.g. LLM only), session, or span attributes so only relevant spans are evaluated.
-- **Multiple evals per task**: Attach several eval configs to one task so each span gets multiple scores in a single run.
+- **Scoring production output quality**: Run historic evals after a release to check for hallucinations, bias, or unsafe content across real traffic.
+- **Catching regressions in production**: Set up a continuous eval task so new spans are scored automatically and you see quality drops before users report them.
+- **Spot-checking a specific time window**: Filter by date range or session to evaluate only the spans from an incident or a specific user flow.
+- **Controlling eval cost**: Use sampling rate and span limits to evaluate a representative subset instead of every span.
+- **Running multiple quality checks at once**: Attach several evals to one task so each span gets scored for tone, safety, and accuracy in a single run.
 
 ---
 
 ## How to
 
 <Steps>
 <Step title="Set filters">
-Define filters so the task runs only on the spans you care about. Supported filter keys:
+Define filters so the task runs only on the spans you care about.
 
 ![Set filters](/images/docs/observe/1.png)
 
-- **observation_type**: Node/span type (e.g. `llm`, `chain`, `agent`). Pass a string or list of types.
-- **date_range**: Time range: a two-element list `[start_date, end_date]` (applied to `created_at`).
-- **created_at**: Minimum creation time (spans created at or after this value).
-- **project_id**: Restrict to a specific Observe project.
-- **session_id**: Restrict to traces in a given session.
-- **span_attributes_filters**: List of span-attribute conditions (same structure as in the Observe UI filters).
+| Filter | Description |
+|--------|-------------|
+| `observation_type` | Node/span type (e.g. `llm`, `chain`, `agent`). |
+| `date_range` | Time range: `[start_date, end_date]` applied to `created_at`. |
+| `created_at` | Minimum creation time (spans at or after this value). |
+| `project_id` | Restrict to a specific Observe project. |
+| `session_id` | Restrict to traces in a given session. |
+| `span_attributes_filters` | List of span-attribute conditions. |
 
-Filters are stored in the tasks `filters` JSON and applied when the task runs.
+Filters are stored in the task's `filters` field and applied when the task runs.
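As a concrete illustration of the filter keys above, a `filters` payload combining several of them might look like the following sketch. The exact JSON shape, the condition structure inside `span_attributes_filters`, and the IDs are illustrative assumptions, not the documented API contract:

```python
import json

# Hypothetical sketch of an eval task's `filters` field, combining the
# documented filter keys. The span_attributes_filters entry shape and
# the project ID are illustrative assumptions.
filters = {
    "observation_type": ["llm"],                 # evaluate LLM spans only
    "date_range": ["2025-01-01", "2025-01-31"],  # applied to created_at
    "project_id": "observe-project-123",         # hypothetical project ID
    "span_attributes_filters": [
        {"key": "llm.model_name", "operator": "eq", "value": "gpt-4o"}
    ],
}

print(json.dumps(filters, indent=2))
```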
 </Step>
 
 <Step title="Choose run type">
-Set **run_type**:
+Set the **run type**:
 
 ![Choose run type](/images/docs/observe/2.png)
 
-- **Historical**: Run on existing spans that match the filters (optionally within a time range). The task processes up to the sampling cap and span limit, then can complete.
-- **Continuous**: Run on new spans as they arrive. Each run only considers spans created after the last run; the task stays active for ongoing evaluation.
+- **Historical**: Run on existing spans matching the filters, up to the sampling cap and span limit. The task completes after processing.
+- **Continuous**: Run on new spans as they arrive. Each run only processes spans created after the last run; the task stays active for ongoing evaluation.
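The difference between the two run types can be sketched as a cutoff on span creation time. This is a simplified model for illustration; the actual scheduler logic is not documented here:

```python
from datetime import datetime, timezone

def spans_for_run(spans, run_type, last_run_at=None):
    """Simplified model: historical runs see all matching spans;
    continuous runs only see spans created after the previous run."""
    if run_type == "continuous" and last_run_at is not None:
        return [s for s in spans if s["created_at"] > last_run_at]
    return list(spans)

spans = [
    {"id": "a", "created_at": datetime(2025, 1, 1, tzinfo=timezone.utc)},
    {"id": "b", "created_at": datetime(2025, 1, 2, tzinfo=timezone.utc)},
    {"id": "c", "created_at": datetime(2025, 1, 3, tzinfo=timezone.utc)},
]
cutoff = datetime(2025, 1, 1, 12, tzinfo=timezone.utc)

print([s["id"] for s in spans_for_run(spans, "historical")])          # all spans
print([s["id"] for s in spans_for_run(spans, "continuous", cutoff)])  # only spans after the cutoff
```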
 </Step>
 
 <Step title="Set sampling rate and span limit">
 ![Set sampling rate and span limit](/images/docs/observe/3.png)
 
-- **sampling_rate**: Percentage of matching spans to evaluate (0–100). Example: `50` means 50% of filtered spans are sampled per run. Helps control cost and volume.
-- **spans_limit**: Maximum number of spans to process per run (default is 1000). For historical runs, the task stops when either the sampled count or this limit is reached.
+- **sampling_rate**: Percentage of matching spans to evaluate (0-100). For example, `50` evaluates 50% of filtered spans per run.
+- **spans_limit**: Maximum number of spans to process per run (default 1000). The task stops when either the sampled count or this limit is reached.
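How the two settings might combine can be sketched as follows. The interaction (sample first, then cap) is an assumption consistent with the bullets above, not an exact reproduction of the backend:

```python
import random

def select_for_run(matching_spans, sampling_rate, spans_limit=1000):
    """Sample a percentage of matching spans, then cap at the per-run limit."""
    k = int(len(matching_spans) * sampling_rate / 100)
    sampled = random.sample(matching_spans, k)
    return sampled[:spans_limit]  # whichever bound is hit first wins

# 50% of 4000 spans is 2000, but the default limit caps the run at 1000.
selected = select_for_run(list(range(4000)), sampling_rate=50)
print(len(selected))  # 1000
```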
 </Step>
 
 <Step title="Select evals to run">
-
-Attach one or more **CustomEvalConfig** IDs to the task (the evals you’ve already created for the project). The task runs each selected eval on every span it processes. For evals that need an input (e.g. Bias Detection), configure the **input key** to a span attribute path (e.g. `llm.output_messages.0.message.content`) so the eval reads the right field from each span. See [built-in evals](/docs/evaluation/builtin) for supported evaluations and their inputs.
+Attach one or more eval configs to the task. The task runs each selected eval on every span it processes. For evals that need an input (e.g. Bias Detection), set the **input key** to a span attribute path (e.g. `gen_ai.output.messages.0.message.content`) so the eval reads the right field from each span. See [built-in evals](/docs/evaluation/builtin) for supported evaluations and their required inputs.
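To make the input key concrete, here is a sketch of resolving a dotted attribute path against a span's attributes. It assumes attributes are stored as a nested structure in which numeric segments index into lists; whether the backend stores them nested or as flat dotted keys is an assumption:

```python
def resolve_input_key(attributes, path):
    """Walk a dotted path like 'gen_ai.output.messages.0.message.content';
    numeric segments index into lists, other segments into dicts."""
    current = attributes
    for segment in path.split("."):
        if isinstance(current, list):
            current = current[int(segment)]
        else:
            current = current[segment]
    return current

# Hypothetical span attributes, shaped to match the example path above.
span_attributes = {
    "gen_ai": {"output": {"messages": [
        {"message": {"content": "Paris is the capital of France."}}
    ]}}
}

text = resolve_input_key(span_attributes, "gen_ai.output.messages.0.message.content")
print(text)  # the field the eval would read
```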
 </Step>
 
 <Step title="Run the task">
 ![run](/images/docs/observe/4.png)
-Create or update the eval task via the API or UI, then run it. You can test the configuration (filters and evals) before saving. Task status values: `pending`, `running`, `completed`, `failed`, `paused`, `deleted`. Results appear on the spans in the Observe dashboard and can be used for alerts.
+
+Create or update the eval task via the API or UI, then run it. You can test the configuration before saving. Task status values: `pending`, `running`, `completed`, `failed`, `paused`, `deleted`. Results appear on the spans in the Observe dashboard and can be used for alerts.
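The documented status values can be modeled as a simple enum. Treating `pending` and `running` as the only states in which a task is still picked up for work is an illustrative assumption:

```python
from enum import Enum

class TaskStatus(str, Enum):
    """The task status values listed in the docs."""
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    PAUSED = "paused"
    DELETED = "deleted"

def is_schedulable(status: TaskStatus) -> bool:
    # Assumption: only pending/running tasks are picked up by the next run.
    return status in (TaskStatus.PENDING, TaskStatus.RUNNING)

print(is_schedulable(TaskStatus.RUNNING), is_schedulable(TaskStatus.PAUSED))
```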
 </Step>
 </Steps>
 
 <Note>
-Eval tasks are processed asynchronously (e.g. by a cron). Status and results update as runs complete. For continuous tasks, new spans are picked up on subsequent runs.
+Eval tasks are processed asynchronously. Status and results update as runs complete. For continuous tasks, new spans are picked up on subsequent runs.
 </Note>
 
-## Key concepts
-
-- **Span attributes**: Spans store data in key-value form (e.g. `llm.output_messages.0.message.content`). When an eval needs an input, you point it to one of these attribute paths. See [spans](/docs/tracing/concepts/spans) and [span attributes](/docs/tracing/concepts/spans#span-attributes) for the schema.
-- **Bias Detection example**: Set the eval’s input key to a span attribute that holds the text to check (e.g. `llm.output_messages.0.message.content`). The eval returns Passed (neutral) or Failed (bias detected).
-
 ---
 
 ## Next Steps
