How EvalGate captures regression data

EvalGate does not store one generic “regression” object. It records evidence at three layers, then compares that evidence against a baseline or review decision.

Local gate data

The local gate is the no-account path created by npx @evalgate/sdk init. The init command runs your existing package test script, captures a baseline in evals/baseline.json, and installs a GitHub Actions workflow at .github/workflows/evalgate-gate.yml. The built-in baseline records:
| Field | What it means |
| --- | --- |
| confidenceTests.passed | Whether the test command passed when the baseline was created |
| confidenceTests.total | Parsed test count, when the test runner prints one |
| commitSha, generatedAt, updatedAt | Provenance for the baseline snapshot |
| tolerance | Configurable thresholds used by the local gate |
When you run npx @evalgate/sdk gate, EvalGate runs your test command again, compares the current pass state and test count to the baseline, and writes evals/regression-report.json.
The built-in local gate protects whatever your existing test script measures. If your test script runs AI evals, it gates AI behavior. If it only runs unit tests, it gates test health until you add eval specs or a custom eval:regression-gate script.
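The comparison step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not EvalGate's implementation: the field names mirror the baseline table, but the JSON nesting, the maxTestCountDrop tolerance key, and the compareToBaseline helper are hypothetical names chosen for the example.

```typescript
// Hypothetical shapes: field names follow the baseline table, but the exact
// JSON layout and the tolerance key are assumptions for illustration only.
interface ConfidenceTests {
  passed: boolean;
  total: number | null; // null when the test runner printed no parsable count
}

interface Baseline {
  confidenceTests: ConfidenceTests;
  commitSha: string;
  generatedAt: string;
  updatedAt: string;
  tolerance: { maxTestCountDrop: number }; // hypothetical threshold name
}

interface GateResult {
  ok: boolean;
  reasons: string[];
}

function compareToBaseline(baseline: Baseline, current: ConfidenceTests): GateResult {
  const reasons: string[] = [];

  // A run that fails where the baseline passed counts as a regression.
  if (baseline.confidenceTests.passed && !current.passed) {
    reasons.push("test command passed at baseline but fails now");
  }

  // Compare counts only when both runs produced a parsable test count.
  const before = baseline.confidenceTests.total;
  const after = current.total;
  if (before !== null && after !== null) {
    const drop = before - after;
    if (drop > baseline.tolerance.maxTestCountDrop) {
      reasons.push(`test count dropped by ${drop} (from ${before} to ${after})`);
    }
  }

  return { ok: reasons.length === 0, reasons };
}

// Placeholder values, not real EvalGate output.
const exampleBaseline: Baseline = {
  confidenceTests: { passed: true, total: 42 },
  commitSha: "abc1234",
  generatedAt: "2024-01-01T00:00:00Z",
  updatedAt: "2024-01-01T00:00:00Z",
  tolerance: { maxTestCountDrop: 0 },
};

const exampleGate = compareToBaseline(exampleBaseline, { passed: true, total: 37 });
console.log(exampleGate.ok, exampleGate.reasons);
```

In this sketch the gate fails because five tests disappeared while the tolerance allows none. A real gate would also write its findings to evals/regression-report.json, which is omitted here.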

Eval run artifacts

The eval runner records behavior at the spec and case level. When you run with --write-results, EvalGate writes:
.evalgate/last-run.json
.evalgate/runs/<runId>.json
.evalgate/runs/latest.json
.evalgate/runs/index.json
Each run artifact contains the run ID, timestamp, spec results, summary counts, pass rate, score, budget/cost data when present, and optional input, expected, actual, and metadata fields supplied by the spec executor.

The evalgate diff command compares a base run and a head run. It classifies changes as new failures, fixed failures, score drops, score improvements, execution errors, added specs, or removed specs.
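To make the change categories concrete, here is a minimal sketch of a base/head comparison. The SpecResult shape, the status values, the category identifiers, and the precedence between categories (for example, treating any head-side error as an execution error) are assumptions for illustration; the actual evalgate diff logic may differ.

```typescript
// Hypothetical per-spec result shape; the real run artifact may differ.
type SpecStatus = "passed" | "failed" | "error";

interface SpecResult {
  specId: string;
  status: SpecStatus;
  score: number;
}

type ChangeKind =
  | "new_failure"
  | "fixed_failure"
  | "score_drop"
  | "score_improvement"
  | "execution_error"
  | "added_spec"
  | "removed_spec";

interface Change {
  specId: string;
  kind: ChangeKind;
}

function diffRuns(base: SpecResult[], head: SpecResult[]): Change[] {
  const baseById = new Map<string, SpecResult>();
  for (const spec of base) baseById.set(spec.specId, spec);

  const headIds = new Set<string>();
  const changes: Change[] = [];

  for (const h of head) {
    headIds.add(h.specId);
    const b = baseById.get(h.specId);
    if (b === undefined) {
      changes.push({ specId: h.specId, kind: "added_spec" });
    } else if (h.status === "error") {
      // Assumption: a head-side error takes precedence over other categories.
      changes.push({ specId: h.specId, kind: "execution_error" });
    } else if (b.status === "passed" && h.status === "failed") {
      changes.push({ specId: h.specId, kind: "new_failure" });
    } else if (b.status !== "passed" && h.status === "passed") {
      changes.push({ specId: h.specId, kind: "fixed_failure" });
    } else if (h.score < b.score) {
      changes.push({ specId: h.specId, kind: "score_drop" });
    } else if (h.score > b.score) {
      changes.push({ specId: h.specId, kind: "score_improvement" });
    }
    // Same status and same score: no change is recorded.
  }

  // Specs present in the base run but missing from the head run.
  for (const spec of base) {
    if (!headIds.has(spec.specId)) {
      changes.push({ specId: spec.specId, kind: "removed_spec" });
    }
  }
  return changes;
}

const baseRun: SpecResult[] = [
  { specId: "a", status: "passed", score: 1.0 },
  { specId: "b", status: "failed", score: 0.2 },
  { specId: "c", status: "passed", score: 0.9 },
  { specId: "d", status: "passed", score: 0.8 },
];
const headRun: SpecResult[] = [
  { specId: "a", status: "failed", score: 0.0 },
  { specId: "b", status: "passed", score: 1.0 },
  { specId: "c", status: "passed", score: 0.7 },
  { specId: "e", status: "passed", score: 1.0 },
];

console.log(diffRuns(baseRun, headRun));
```

Each spec lands in at most one category here, which keeps the report easy to read; a tool could equally report several categories per spec.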

Platform trace data

The platform captures production and staging behavior through traces and spans. POST /api/collector ingests one trace with spans. POST /api/collector/batch ingests up to 100 traces. The collector stores:
| Record | Captured fields |
| --- | --- |
| Trace | status, source, environment, duration, repo SHA, content hash, metadata |
| Span | input, output, model, vendor, token counts, latency, parameters, errors, metadata |
| Feedback | thumbs up/down or other feedback attached to a trace or span |
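Because the batch endpoint accepts at most 100 traces per request, a client holding more than that needs to split its queue first. The helper below is a hypothetical sketch of that chunking step only; it is not part of the SDK, and the request body format is deliberately left out because this page does not specify it.

```typescript
// Limit taken from the text above: POST /api/collector/batch ingests up to 100 traces.
const MAX_BATCH_SIZE = 100;

// Hypothetical helper: splits a list of traces into batches no larger than `size`.
function chunkTraces<T>(traces: T[], size: number = MAX_BATCH_SIZE): T[][] {
  if (!Number.isInteger(size) || size < 1 || size > MAX_BATCH_SIZE) {
    throw new RangeError("batch size must be an integer between 1 and " + MAX_BATCH_SIZE);
  }
  const batches: T[][] = [];
  for (let i = 0; i < traces.length; i += size) {
    batches.push(traces.slice(i, i + size));
  }
  return batches;
}

// 250 placeholder traces are split into batches of 100, 100, and 50.
const pendingTraces: number[] = [];
for (let i = 0; i < 250; i++) pendingTraces.push(i);
const traceBatches = chunkTraces(pendingTraces);
console.log(traceBatches.map((batch) => batch.length));
```

Each resulting batch would then be sent in its own request to the batch endpoint.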
After ingestion, EvalGate decides whether to enqueue the trace for failure analysis. Errors and thumbs-down feedback are always analyzed. Successful traces are sampled for analysis at the server default rate of 10%.
Sampling controls failure-analysis work, not whether the SDK sends a trace. The SDK collector helpers send traces by default unless you configure client-side sampling.
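The enqueue rule above can be expressed as a small decision function. This is a sketch of the described behavior, not the server's code: the TraceSummary shape, the status values, and the shouldAnalyze name are assumptions, and the random source is injectable so the rule can be exercised deterministically.

```typescript
// Hypothetical summary of an ingested trace; field names are assumptions.
interface TraceSummary {
  status: "success" | "error";
  hasThumbsDown: boolean;
}

// Server default stated in the text: 10% of successful traces are analyzed.
const DEFAULT_ANALYSIS_SAMPLE_RATE = 0.1;

function shouldAnalyze(
  trace: TraceSummary,
  sampleRate: number = DEFAULT_ANALYSIS_SAMPLE_RATE,
  random: () => number = Math.random,
): boolean {
  // Errors and thumbs-down feedback always go to failure analysis.
  if (trace.status === "error" || trace.hasThumbsDown) {
    return true;
  }
  // Successful traces are sampled at the configured rate.
  return random() < sampleRate;
}

// With a fixed random source the outcome is predictable.
console.log(shouldAnalyze({ status: "error", hasThumbsDown: false }, 0.1, () => 0.99));
console.log(shouldAnalyze({ status: "success", hasThumbsDown: false }, 0.1, () => 0.5));
```

Note that this decision happens after ingestion: every trace in the example has already been stored, and only the analysis work is being sampled.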

From trace to regression coverage

When a trace is analyzed, EvalGate looks for failure signals in the trace output, groups repeated failures by a stable hash, and stores the result as a failure_report. From that report it can create a candidate_eval_case with the source trace IDs, minimized input, expected constraints, quality score, and review status. Candidate cases start quarantined. They do not gate releases until they are promoted by review or by an explicit promotion workflow. Promoted cases become regression coverage that can be used by eval runs and release gates.
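The grouping and quarantine steps can be illustrated with a simplified model. Everything below is a sketch under assumptions: the FailureSignal and CandidateEvalCase shapes are hypothetical, the status strings are invented for the example, and the FNV-1a-style hash over UTF-16 code units is only a dependency-free stand-in for whatever stable hash EvalGate actually uses.

```typescript
// Stand-in stable hash (FNV-1a style, 32-bit); not EvalGate's actual hash.
function stableHash(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return ("00000000" + (hash >>> 0).toString(16)).slice(-8);
}

// Hypothetical shapes for illustration.
interface FailureSignal {
  traceId: string;
  signal: string; // e.g. a normalized error message found in the trace output
}

type ReviewStatus = "quarantined" | "promoted";

interface CandidateEvalCase {
  failureHash: string;
  sourceTraceIds: string[];
  reviewStatus: ReviewStatus;
}

// Groups repeated failures by hash and creates one quarantined candidate per group.
function buildCandidates(failures: FailureSignal[]): CandidateEvalCase[] {
  const byHash = new Map<string, CandidateEvalCase>();
  const ordered: CandidateEvalCase[] = [];
  for (const failure of failures) {
    const key = stableHash(failure.signal.trim().toLowerCase());
    let candidate = byHash.get(key);
    if (candidate === undefined) {
      candidate = { failureHash: key, sourceTraceIds: [], reviewStatus: "quarantined" };
      byHash.set(key, candidate);
      ordered.push(candidate);
    }
    candidate.sourceTraceIds.push(failure.traceId);
  }
  return ordered;
}

// Only promoted cases count as regression coverage for a gate.
function gatingCases(cases: CandidateEvalCase[]): CandidateEvalCase[] {
  return cases.filter((c) => c.reviewStatus === "promoted");
}

const candidateCases = buildCandidates([
  { traceId: "t1", signal: "Refused valid request" },
  { traceId: "t2", signal: "  refused valid request " },
  { traceId: "t3", signal: "Missing citation" },
]);
console.log(candidateCases.length, gatingCases(candidateCases).length);
```

Two of the three failures share a normalized signal, so they collapse into one candidate with two source trace IDs. Nothing gates a release until a reviewer or promotion workflow changes a candidate's status to promoted.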

What blocks a release

Different commands block on different evidence:
| Command | Blocks on |
| --- | --- |
| npx @evalgate/sdk gate | Local test baseline deltas in evals/regression-report.json |
| npx @evalgate/sdk ci | Eval spec failures and optional base/head run diff |
| npx @evalgate/sdk check | Platform quality score, baseline drop, budget/latency thresholds, policy, and judge credibility |
The most accurate way to describe the loop is: EvalGate captures raw behavior, turns reviewed failures into eval coverage, and blocks releases when current evidence regresses against the selected baseline.