The trace to eval to gate workflow
EvalGate is built around one operating loop: collect traces from real AI behavior, turn reviewed failures into reusable evaluation cases, and gate regressions in CI before they reach production. Start with the local gate if you only need regression protection. Add the platform when you need shared traces, review history, LLM judges, costs, benchmarks, annotations, and governance.

LLMs drift silently. A single prompt change can degrade quality across thousands of responses before anyone notices. The trace to eval to gate loop closes that gap: real failures become reviewed test cases, and promoted test cases become merge gates so the same issue is caught before it ships again.

Product layers
| Layer | Use it when | What it proves |
|---|---|---|
| Local gate | You want the first trustworthy workflow | Test or eval regressions fail before merge |
| Traces | You need production signal | Eval cases can come from real behavior |
| Reviews and judges | Humans need to inspect quality decisions | Scores and approvals are explainable |
| Governance | Multiple teams ship AI changes | Quality, cost, and release evidence are auditable |
Collect traces from real behavior
Traces are structured records of your AI system’s actual behavior in production or staging. Every time your application reports a trace, EvalGate stores the context you send: inputs and outputs, tool calls, token usage, latency, errors, and any metadata you attach.

EvalGate uses asymmetric sampling for server-side failure analysis by default: errors and thumbs-down feedback are always analyzed, while successful traces are sampled at 10%. SDK trace reporting sends traces by default unless you configure client-side sampling.

From the Traces page in your dashboard you can search by metadata, inspect nested spans, and identify the exact inputs that caused a failure before converting them into permanent test coverage.
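The asymmetric sampling rule described above can be sketched in a few lines. This is an illustration only, not EvalGate's implementation: the `TraceSummary` shape, the `shouldAnalyze` function, and the injectable `random` parameter are all hypothetical names chosen for this example.

```typescript
// Hypothetical trace shape for illustration; not the EvalGate SDK's real types.
type Feedback = "thumbs_up" | "thumbs_down";

interface TraceSummary {
  id: string;
  hasError: boolean;
  feedback?: Feedback;
}

// Sampling rate for successful traces, matching the 10% default described above.
const SUCCESS_SAMPLE_RATE = 0.1;

// Decide whether a trace is sent to failure analysis.
// `random` is injectable so the decision can be checked deterministically.
function shouldAnalyze(
  trace: TraceSummary,
  random: () => number = Math.random,
): boolean {
  // Errors and thumbs-down feedback are always analyzed.
  if (trace.hasError || trace.feedback === "thumbs_down") {
    return true;
  }
  // Every other trace is sampled at the configured rate.
  return random() < SUCCESS_SAMPLE_RATE;
}

const traces: TraceSummary[] = [
  { id: "t1", hasError: true },
  { id: "t2", hasError: false, feedback: "thumbs_down" },
  { id: "t3", hasError: false, feedback: "thumbs_up" },
];

// With a fixed "random" value of 0.5, only the two failures are selected.
const selected = traces
  .filter((t) => shouldAnalyze(t, () => 0.5))
  .map((t) => t.id);
console.log(selected); // [ 't1', 't2' ]
```

Keeping every failure while sampling successes concentrates analysis on the traces most likely to become eval cases, at the price of a less complete picture of normal traffic.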
Turn failures into eval coverage
Raw traces are observations. Evaluations are commitments. Once you have a set of traced failures, you promote them into labeled test cases that run on every change.

EvalGate gives you three tools for building eval coverage from traces and run artifacts:

- Label: Use the interactive CLI to mark each case pass or fail and assign a failure mode. Each saved label becomes part of your golden dataset.
- Cluster: Group similar failures by behavior and workflow shape so you can spot patterns instead of triaging individual cases.
- Synthesize: Generate additional golden case drafts from labeled failures to expand coverage beyond what production traffic alone produces.

The result is a reusable evaluation suite that reflects the failure modes your users actually encounter, not hypothetical ones you invented in advance.
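To make the label and cluster steps concrete, here is a rough sketch of what a labeled case might look like and how failures could be grouped. The `LabeledCase` and `FailureCluster` shapes are assumptions made for this example, and the grouping is a deliberately simple stand-in: it buckets failures by their assigned failure mode label rather than by behavior and workflow shape.

```typescript
// Hypothetical shapes for illustration; the real label format may differ.
interface LabeledCase {
  traceId: string;
  input: string;
  verdict: "pass" | "fail";
  failureMode?: string; // expected only on failed cases
}

interface FailureCluster {
  failureMode: string;
  cases: LabeledCase[];
}

// Group failed cases by failure mode so patterns stand out from single cases.
function clusterByFailureMode(cases: LabeledCase[]): FailureCluster[] {
  const clusters: FailureCluster[] = [];
  for (const c of cases) {
    if (c.verdict !== "fail") continue; // passing cases are not clustered
    const mode = c.failureMode ?? "unlabeled";

    // Find the existing cluster for this failure mode, if there is one.
    let cluster: FailureCluster | undefined;
    for (const existing of clusters) {
      if (existing.failureMode === mode) {
        cluster = existing;
        break;
      }
    }
    if (!cluster) {
      cluster = { failureMode: mode, cases: [] };
      clusters.push(cluster);
    }
    cluster.cases.push(c);
  }
  // Largest clusters first, so the most common failure mode leads the list.
  return clusters.sort((a, b) => b.cases.length - a.cases.length);
}
```

Sorting by cluster size is one way to decide which failure mode to write eval cases for first.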
The local `synthesize` command writes synthetic case drafts to JSONL. Platform-generated candidate eval cases start quarantined and do not gate releases until they are promoted by review or by an explicit promotion workflow.

Gate regressions in CI
With a labeled golden dataset in place, you can enforce quality in CI. The local gate compares your test/eval command against `evals/baseline.json`. The eval CI path writes run artifacts and can compare a head run against a base run.

In a GitHub Actions workflow, when a PR introduces a regression, the CI command fails, writes a GitHub step summary, and emits annotations for failed or regressed specs when `--format github` is used. Reviewers can inspect the run artifacts uploaded from `.evalgate/`.

The full loop
The three steps above map onto a longer operating cycle that runs continuously as your AI system evolves:

| Stage | What happens |
|---|---|
| `trace` | Collect workflow runs, tool use, and trajectory data from production |
| `cluster` | Group failures and coverage gaps by behavior and workflow shape |
| `synthesize` | Generate candidate eval drafts and experiment plans from real gaps |
| `gate` | Score outcomes, behavior, trajectory, integrity, and judge evidence |
| `review` | Inspect cases, disagreement, provenance, and human feedback |
| `auto` | Run bounded autonomous experiments against the eval suite |
| `ship` | Promote only when the evidence clears the gate |
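The gate and ship stages come down to comparing a head run against a baseline and blocking on regressions. The sketch below shows one way that comparison could work. It assumes that a regression means "passed in the baseline, fails now" and that quarantined cases are skipped. The `CaseResult` and `GateReport` shapes and the `runGate` function are hypothetical and do not reflect the actual format of the baseline file or run artifacts.

```typescript
// Hypothetical shapes; the real baseline and artifact formats are not shown here.
interface CaseResult {
  caseId: string;
  status: "promoted" | "quarantined";
  passed: boolean;
}

interface GateReport {
  regressions: string[]; // case IDs that passed in the baseline but fail now
  ok: boolean; // true when the change may ship
}

// Compare a head run against the IDs of cases that passed in the baseline.
function runGate(baselinePassed: string[], head: CaseResult[]): GateReport {
  const regressions: string[] = [];
  for (const result of head) {
    // Quarantined candidates never gate a release.
    if (result.status !== "promoted") continue;
    const passedBefore = baselinePassed.indexOf(result.caseId) !== -1;
    if (passedBefore && !result.passed) {
      regressions.push(result.caseId);
    }
  }
  return { regressions, ok: regressions.length === 0 };
}
```

Under this definition a newly added case that fails is not a regression, because it never passed in the baseline. A stricter gate could fail on any failing promoted case instead.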
The `auto` stage lets you run bounded prompt-improvement experiments and keep only changes that pass the gate.
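A bounded experiment loop of this kind might look like the sketch below. Everything in it is an assumption made for illustration: the `Gate` function type, the `runBoundedExperiments` name, and the rule that a candidate is kept only when it passes the gate and beats the current best score.

```typescript
// Hypothetical gate: evaluates a prompt and reports pass/fail plus a score.
type Gate = (prompt: string) => { ok: boolean; score: number };

interface ExperimentResult {
  kept: string; // the prompt that survives the loop
  attempts: number; // how many candidates were actually evaluated
}

function runBoundedExperiments(
  current: string,
  candidates: string[],
  gate: Gate,
  maxAttempts: number,
): ExperimentResult {
  let kept = current;
  let bestScore = gate(current).score;
  let attempts = 0;

  for (const candidate of candidates) {
    // The bound: stop once the attempt budget is spent.
    if (attempts >= maxAttempts) break;
    attempts++;

    const result = gate(candidate);
    // Keep a change only if it clears the gate and improves on the best score.
    if (result.ok && result.score > bestScore) {
      kept = candidate;
      bestScore = result.score;
    }
  }
  return { kept, attempts };
}
```

The attempt budget keeps cost predictable, and because each candidate runs through the same gate as a human-authored change, a rejected experiment leaves the current prompt untouched.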
Start with `trace`, `gate`, and `review` before using `auto`. The autonomous loop works best when you already have stable traces, labeled failure modes, and a trusted baseline to gate against.