The trace to eval to gate workflow

EvalGate is built around one operating loop: collect traces from real AI behavior, turn reviewed failures into reusable evaluation cases, and gate regressions in CI before they reach production. Start with the local gate if you only need regression protection. Add the platform when you need shared traces, review history, LLM judges, costs, benchmarks, annotations, and governance.

LLMs drift silently. A single prompt change can degrade quality across thousands of responses before anyone notices. The trace to eval to gate loop closes that gap: real failures become reviewed test cases, and promoted test cases become merge gates so the same issue is caught before it ships again.

Product layers

Layer | Use it when | What it proves
Local gate | You want the first trustworthy workflow | Test or eval regressions fail before merge
Traces | You need production signal | Eval cases can come from real behavior
Reviews and judges | Humans need to inspect quality decisions | Scores and approvals are explainable
Governance | Multiple teams ship AI changes | Quality, cost, and release evidence are auditable
1. Collect traces from real behavior

Traces are structured records of your AI system’s actual behavior in production or staging. Every time your application reports a trace, EvalGate stores the context you send: inputs and outputs, tool calls, token usage, latency, errors, and any metadata you attach.
EvalGate uses asymmetric sampling for server-side failure analysis by default: errors and thumbs-down feedback are always analyzed, while successful traces are sampled at 10%. SDK trace reporting is separate from this sampling: the SDK sends traces by default unless you configure client-side sampling.
TypeScript
// Assumes `client` is an already-initialized EvalGate SDK client.
const trace = await client.traces.create({
  name: 'Chat Completion',
  traceId: 'trace-' + Date.now(),
  metadata: { model: 'gpt-4' }
});

await client.traces.createSpan(trace.id, {
  name: 'OpenAI API Call',
  type: 'llm',
  input: 'What is AI?',
  output: 'AI stands for Artificial Intelligence...',
  metadata: { tokens: 150, latency_ms: 1200 }
});
From the Traces page in your dashboard you can search by metadata, inspect nested spans, and identify the exact inputs that caused a failure before converting them into permanent test coverage.
Use descriptive trace names like customer-support-query instead of generic labels like llm-call. Specific names make it faster to find relevant failures when you’re building eval coverage.
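The asymmetric sampling rule described earlier in this step can be pictured with a small sketch. This is not EvalGate's implementation: the `TraceSummary` shape, the `shouldAnalyze` function, and the injectable `random` parameter are hypothetical names used only to illustrate "always analyze failures, sample successes at 10%".

```typescript
// Hypothetical trace summary; EvalGate's real trace schema may differ.
interface TraceSummary {
  hasError: boolean;
  feedback?: 'thumbs_up' | 'thumbs_down';
}

// Decide whether a trace is selected for failure analysis.
// `random` is injectable so the decision can be tested deterministically.
function shouldAnalyze(
  trace: TraceSummary,
  successSampleRate: number = 0.1,
  random: () => number = Math.random,
): boolean {
  // Errors and thumbs-down feedback are always analyzed.
  if (trace.hasError || trace.feedback === 'thumbs_down') {
    return true;
  }
  // Successful traces are sampled at the configured rate (10% by default).
  return random() < successSampleRate;
}

// Usage: a failed trace is always selected, regardless of the random draw.
const selected = shouldAnalyze({ hasError: true });
```

The asymmetry keeps analysis cost proportional to failures rather than to total traffic, while still sampling enough successes to provide a comparison set.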
2. Turn failures into eval coverage

Raw traces are observations. Evaluations are commitments. Once you have a set of traced failures, you promote them into labeled test cases that run on every change.
EvalGate gives you three tools for building eval coverage from traces and run artifacts:
Label - Use the interactive CLI to mark each case pass or fail and assign a failure mode. Each saved label becomes part of your golden dataset.
npx @evalgate/sdk label
# Arrow-key menu, u to undo, Ctrl-C saves progress
Cluster - Group similar failures by behavior and workflow shape so you can spot patterns instead of triaging individual cases.
npx @evalgate/sdk cluster --run .evalgate/runs/latest.json
Synthesize - Generate additional golden case drafts from labeled failures to expand coverage beyond what production traffic alone produces.
npx @evalgate/sdk synthesize \
  --dataset .evalgate/golden/labeled.jsonl \
  --output .evalgate/golden/synthetic.jsonl
The local synthesize command writes synthetic case drafts to JSONL. Platform-generated candidate eval cases start quarantined and do not gate releases until they are promoted by review or by an explicit promotion workflow.
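The quarantine rule can be illustrated with a minimal sketch. The `EvalCase` shape, the status values, and `evaluateGate` are assumptions made for illustration, not the platform's data model; the only behavior taken from the text is that quarantined cases do not gate releases until they are promoted.

```typescript
// Hypothetical case model; the platform's real schema may differ.
type CaseStatus = 'quarantined' | 'promoted';

interface EvalCase {
  id: string;
  status: CaseStatus;
  passed: boolean;
}

interface GateVerdict {
  pass: boolean;
  blockingFailures: string[];    // promoted cases that failed
  quarantinedFailures: string[]; // reported for review, but never blocking
}

function evaluateGate(cases: EvalCase[]): GateVerdict {
  const failedIds = (status: CaseStatus): string[] =>
    cases.filter((c) => c.status === status && !c.passed).map((c) => c.id);

  const blockingFailures = failedIds('promoted');
  return {
    // Only promoted cases can fail the gate.
    pass: blockingFailures.length === 0,
    blockingFailures,
    quarantinedFailures: failedIds('quarantined'),
  };
}
```

Keeping unreviewed candidates out of the blocking set means a noisy generated case cannot fail a release before a reviewer has confirmed it.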
The result is a reusable evaluation suite that reflects the failure modes your users actually encounter, not hypothetical ones you invented in advance.
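To make the idea of a labeled dataset concrete, here is a minimal sketch that tallies failure modes from JSONL text. The field names (`id`, `label`, `failureMode`) are assumptions for illustration; the actual layout of `.evalgate/golden/labeled.jsonl` is not specified here and may differ.

```typescript
// Hypothetical labeled-case record; real field names may differ.
interface LabeledCase {
  id: string;
  label: 'pass' | 'fail';
  failureMode?: string;
}

// Parse JSONL: one JSON object per non-empty line.
function parseJsonl(text: string): LabeledCase[] {
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as LabeledCase);
}

// Count failed cases per failure mode so recurring patterns stand out.
function countFailureModes(cases: LabeledCase[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const c of cases) {
    if (c.label !== 'fail') continue;
    const mode = c.failureMode ?? 'unlabeled';
    counts.set(mode, (counts.get(mode) ?? 0) + 1);
  }
  return counts;
}

// Usage with inline sample data instead of a file read.
const sample = [
  '{"id":"c1","label":"fail","failureMode":"tone_mismatch"}',
  '{"id":"c2","label":"pass"}',
  '{"id":"c3","label":"fail","failureMode":"tone_mismatch"}',
  '{"id":"c4","label":"fail"}',
].join('\n');
const modeCounts = countFailureModes(parseJsonl(sample));
```

A tally like this is a quick way to see which failure modes dominate before deciding where to add coverage.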
3. Gate regressions in CI

With a labeled golden dataset in place, you can enforce quality in CI. The local gate compares the results of your test/eval command against evals/baseline.json. The eval CI path writes run artifacts and can compare a head run against a base run.
# Run locally
npx @evalgate/sdk gate

# With GitHub step summary output
npx @evalgate/sdk gate --format github

# Or as a full CI workflow
npx @evalgate/sdk ci --format github --write-results --base main
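Conceptually, comparing a run against a baseline amounts to finding specs that passed before and no longer pass. The sketch below assumes a simplified result shape (spec name mapped to a pass flag); it does not reflect the actual structure of evals/baseline.json or the run artifacts, and `findRegressions` is a hypothetical helper.

```typescript
// Hypothetical result shape: spec name -> whether it passed.
type RunResults = Record<string, boolean>;

// A regression is a spec that passed in the baseline but does not pass in
// the head run. A spec missing from the head run is treated as regressed.
// Specs that are new in the head run are not regressions in this sketch.
function findRegressions(baseline: RunResults, head: RunResults): string[] {
  return Object.keys(baseline)
    .filter((spec) => baseline[spec] === true && head[spec] !== true)
    .sort();
}

// Usage: 'refund-policy' passed in the baseline and fails in the head run.
const regressions = findRegressions(
  { 'greeting-tone': true, 'refund-policy': true, 'known-flaky': false },
  { 'greeting-tone': true, 'refund-policy': false, 'known-flaky': false },
);
const gatePasses = regressions.length === 0;
```

Comparing against a baseline rather than requiring every spec to pass lets a team adopt gating on an imperfect suite: known failures stay visible without blocking, while new failures do block.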
A minimal GitHub Actions workflow looks like this:
name: EvalGate CI
on: [push, pull_request]
jobs:
  evalgate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm ci
      - run: npx @evalgate/sdk ci --format github --write-results --base main
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: evalgate-results
          path: .evalgate/
When a PR introduces a regression, the CI command fails, writes a GitHub step summary, and emits annotations for failed or regressed specs when --format github is used. Reviewers can inspect the run artifacts uploaded from .evalgate/.
Judge credibility is enforced at gate time. If true positive rate (TPR) or true negative rate (TNR) falls below the configured thresholds, the platform gate fails rather than silently using a biased score. The separate weak-discriminative-power case exits with warning code 8. Set tprMin and tnrMin in evalgate.config.json to control the hard thresholds.
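The credibility check can be sketched as follows. This is an illustration under stated assumptions, not the platform's scoring code: "positive" is taken to mean a human-approved pass, the `JudgedCase` shape and the helper functions are hypothetical, and the 0.8 thresholds in the usage lines are example values rather than documented defaults. Only the `tprMin` and `tnrMin` names come from the text above.

```typescript
// One case scored by both a human reviewer and the LLM judge (true = pass).
interface JudgedCase {
  human: boolean;
  judge: boolean;
}

interface JudgeRates {
  tpr: number; // share of human-pass cases the judge also passed
  tnr: number; // share of human-fail cases the judge also failed
}

function judgeRates(cases: JudgedCase[]): JudgeRates {
  let tp = 0, fn = 0, tn = 0, fp = 0;
  for (const c of cases) {
    if (c.human && c.judge) tp += 1;
    else if (c.human && !c.judge) fn += 1;
    else if (!c.human && !c.judge) tn += 1;
    else fp += 1;
  }
  // With no evidence for a class, report 0 so the check fails conservatively.
  const rate = (hits: number, total: number): number =>
    total === 0 ? 0 : hits / total;
  return { tpr: rate(tp, tp + fn), tnr: rate(tn, tn + fp) };
}

function judgeIsCredible(rates: JudgeRates, tprMin: number, tnrMin: number): boolean {
  return rates.tpr >= tprMin && rates.tnr >= tnrMin;
}

// Usage: the judge misses one of four human-approved passes (TPR = 0.75).
const rates = judgeRates([
  { human: true, judge: true },
  { human: true, judge: true },
  { human: true, judge: true },
  { human: true, judge: false },
  { human: false, judge: false },
  { human: false, judge: false },
  { human: false, judge: false },
  { human: false, judge: false },
]);
const credible = judgeIsCredible(rates, 0.8, 0.8); // example thresholds
```

Checking both rates matters because a judge that passes everything has a perfect TPR and a TNR of zero; a single accuracy figure can hide that bias.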

The full loop

The three steps above map onto a longer operating cycle that runs continuously as your AI system evolves:
trace -> cluster -> synthesize -> gate -> review -> auto -> ship
Stage | What happens
trace | Collect workflow runs, tool use, and trajectory data from production
cluster | Group failures and coverage gaps by behavior and workflow shape
synthesize | Generate candidate eval drafts and experiment plans from real gaps
gate | Score outcomes, behavior, trajectory, integrity, and judge evidence
review | Inspect cases, disagreement, provenance, and human feedback
auto | Run bounded autonomous experiments against the eval suite
ship | Promote only when the evidence clears the gate
The auto stage lets you run bounded prompt-improvement experiments and keep only changes that pass the gate:
npx @evalgate/sdk auto --objective tone_mismatch --prompt prompts/support.md --autonomous --budget 3
Start with trace, gate, and review before using auto. The autonomous loop works best when you already have stable traces, labeled failure modes, and a trusted baseline to gate against.
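The shape of a bounded experiment loop can be sketched generically. This is not how the `auto` command is implemented: `runBoundedExperiments`, the candidate type, and the `passesGate` callback are hypothetical, and the sketch only illustrates the two properties named above, a fixed budget and keeping only changes that pass the gate.

```typescript
interface ExperimentSummary<T> {
  tried: number; // how many candidates were evaluated
  kept: T[];     // candidates that passed the gate
}

// Evaluate at most `budget` candidates and keep only those that pass.
function runBoundedExperiments<T>(
  candidates: T[],
  budget: number,
  passesGate: (candidate: T) => boolean,
): ExperimentSummary<T> {
  const kept: T[] = [];
  let tried = 0;
  for (const candidate of candidates) {
    if (tried >= budget) break; // stop once the budget is spent
    tried += 1;
    if (passesGate(candidate)) kept.push(candidate);
  }
  return { tried, kept };
}

// Usage: five hypothetical prompt variants with made-up scores, budget of 3.
const variants = [
  { name: 'v1', score: 0.62 },
  { name: 'v2', score: 0.81 },
  { name: 'v3', score: 0.74 },
  { name: 'v4', score: 0.95 },
  { name: 'v5', score: 0.9 },
];
const summary = runBoundedExperiments(variants, 3, (v) => v.score >= 0.75);
```

The budget caps cost and limits how far an unattended loop can wander; the gate callback is where a trusted baseline comes in, which is why stable traces and labels should exist first.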

Why this order matters

Running evaluations without traces means testing scenarios you imagined, not the ones users hit. Gating without evaluations means blocking on metrics that don’t reflect real failures. Tracing without gating means collecting signal you never act on. The trace to eval to gate order ensures that every stage feeds the next with real evidence. Issues that reach production can be captured as traces, reviewed into eval coverage, promoted into a gate, and blocked from shipping again.