How EvalGate captures regression data
EvalGate does not store one generic “regression” object. It records evidence at three layers, then compares that evidence against a baseline or review decision.
Local gate data
The local gate is the no-account path created by npx @evalgate/sdk init.
init runs your existing package test script, captures a baseline in evals/baseline.json, and installs a GitHub Actions workflow at .github/workflows/evalgate-gate.yml. The built-in baseline records:
| Field | What it means |
|---|---|
| confidenceTests.passed | Whether the test command passed when the baseline was created |
| confidenceTests.total | Parsed test count, when the test runner prints one |
| commitSha, generatedAt, updatedAt | Provenance for the baseline snapshot |
| tolerance | Configurable thresholds used by the local gate |
When you run npx @evalgate/sdk gate, EvalGate runs your test command again, compares the current pass state and test count to the baseline, and writes evals/regression-report.json.
The built-in local gate protects whatever your existing test script measures. If your test script runs AI evals, it gates AI behavior. If it only runs unit tests, it gates test health until you add eval specs or a custom eval:regression-gate script.
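The comparison the local gate performs can be sketched roughly as follows. This is a minimal illustration, not EvalGate's implementation: only the `confidenceTests.passed`, `confidenceTests.total`, and `tolerance` field names come from the table above, while `maxTestCountDrop`, `CurrentRun`, `GateResult`, and `compareToBaseline` are hypothetical names chosen for the example.

```typescript
// Hypothetical shapes. Only confidenceTests.passed, confidenceTests.total,
// and tolerance are named in the text; everything else is an assumption.
interface Baseline {
  confidenceTests: { passed: boolean; total?: number };
  tolerance: { maxTestCountDrop: number }; // assumed threshold name
}

interface CurrentRun {
  passed: boolean;
  total?: number; // only present when the test runner prints a count
}

interface GateResult {
  ok: boolean;
  reasons: string[];
}

function compareToBaseline(baseline: Baseline, current: CurrentRun): GateResult {
  const reasons: string[] = [];

  // Pass state: a run that passed at baseline time must still pass.
  if (baseline.confidenceTests.passed && !current.passed) {
    reasons.push("tests passed at baseline but fail now");
  }

  // Test count: only comparable when both runs reported a count.
  const before = baseline.confidenceTests.total;
  const after = current.total;
  if (before !== undefined && after !== undefined) {
    const drop = before - after;
    if (drop > baseline.tolerance.maxTestCountDrop) {
      reasons.push(`test count dropped by ${drop}`);
    }
  }

  return { ok: reasons.length === 0, reasons };
}
```

A result with `ok: false` is the kind of outcome that would be recorded in a regression report and fail the gate; the actual report format is not specified here.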
Eval run artifacts
The eval runner records behavior at the spec and case level. When you run with --write-results, EvalGate writes:
.evalgate/last-run.json
.evalgate/runs/<runId>.json
.evalgate/runs/latest.json
.evalgate/runs/index.json
Each run artifact contains the run ID, timestamp, spec results, summary counts, pass rate, score, budget/cost data when present, and optional input, expected, actual, and metadata fields supplied by the spec executor.
evalgate diff compares a base run and a head run. It classifies changes as new failures, fixed failures, score drops, score improvements, execution errors, added specs, or removed specs.
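To make the change categories concrete, here is a small sketch of how a base run and a head run might be classified. The `SpecResult` shape, the status values, and the precedence between categories are assumptions made for illustration; the real artifact schema and `evalgate diff` logic may differ.

```typescript
// Hypothetical per-spec result; the real run artifact schema may differ.
interface SpecResult {
  specId: string;
  status: "passed" | "failed" | "error";
  score?: number;
}

type Change =
  | "new_failure"
  | "fixed_failure"
  | "score_drop"
  | "score_improvement"
  | "execution_error"
  | "added"
  | "removed";

// Returns one change per spec that differs between runs; unchanged specs are omitted.
function diffRuns(base: SpecResult[], head: SpecResult[]): Map<string, Change> {
  const baseById = new Map<string, SpecResult>();
  for (const result of base) baseById.set(result.specId, result);

  const changes = new Map<string, Change>();
  const seenInHead = new Map<string, boolean>();

  for (const h of head) {
    seenInHead.set(h.specId, true);
    const b = baseById.get(h.specId);
    if (b === undefined) {
      changes.set(h.specId, "added");
    } else if (h.status === "error") {
      // Assumed precedence: a head-side error is reported before anything else.
      changes.set(h.specId, "execution_error");
    } else if (b.status === "passed" && h.status === "failed") {
      changes.set(h.specId, "new_failure");
    } else if (b.status !== "passed" && h.status === "passed") {
      changes.set(h.specId, "fixed_failure");
    } else if (b.score !== undefined && h.score !== undefined && h.score !== b.score) {
      changes.set(h.specId, h.score < b.score ? "score_drop" : "score_improvement");
    }
  }

  // Specs present in base but missing from head were removed.
  for (const b of base) {
    if (!seenInHead.has(b.specId)) changes.set(b.specId, "removed");
  }
  return changes;
}
```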
Platform trace data
The platform captures production and staging behavior through traces and spans.
POST /api/collector ingests one trace with spans. POST /api/collector/batch ingests up to 100 traces. The collector stores:
| Record | Captured fields |
|---|---|
| Trace | status, source, environment, duration, repo SHA, content hash, metadata |
| Span | input, output, model, vendor, token counts, latency, parameters, errors, metadata |
| Feedback | thumbs up/down or other feedback attached to a trace or span |
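Because the batch endpoint accepts at most 100 traces per request, a client with more traces has to split them. The sketch below shows one way to plan those requests. The two paths and the limit of 100 come from the text above; the helper names `toBatches` and `planCollectorRequests` are hypothetical, and the request body format is deliberately left out because it is not specified here.

```typescript
// Limit stated in the text: POST /api/collector/batch ingests up to 100 traces.
const MAX_BATCH_SIZE = 100;

// Split a list into consecutive chunks of at most `size` items.
function toBatches<T>(items: T[], size: number = MAX_BATCH_SIZE): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical plan entry: which endpoint to call and which traces to include.
interface PlannedRequest<T> {
  path: string;
  traces: T[];
}

// A single trace goes to the single-trace endpoint; anything more is batched.
function planCollectorRequests<T>(traces: T[]): PlannedRequest<T>[] {
  if (traces.length === 1) {
    return [{ path: "/api/collector", traces }];
  }
  return toBatches(traces).map((batch) => ({
    path: "/api/collector/batch",
    traces: batch,
  }));
}
```

Keeping the planning step separate from the HTTP call makes the batch limit easy to test without a network connection.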
After ingestion, EvalGate decides whether to enqueue the trace for failure analysis. Errors and thumbs-down feedback are always analyzed. Successful traces are sampled for analysis at the server default rate of 10%.
Sampling controls failure-analysis work, not whether the SDK sends a trace. The SDK collector helpers send traces by default unless you configure client-side sampling.
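The enqueue rule described above can be expressed as a small predicate. This is a sketch under assumptions: `TraceSummary`, its field values, and `shouldAnalyze` are hypothetical names, and only the rule itself (errors and thumbs-down always analyzed, successes sampled at a 10% default) comes from the text.

```typescript
// Hypothetical trace summary; field names and values are illustrative only.
interface TraceSummary {
  status: "success" | "error";
  feedback?: "thumbs_up" | "thumbs_down";
}

// Server default sampling rate for successful traces, as stated in the text.
const DEFAULT_SAMPLE_RATE = 0.1;

// `random` is injectable so the sampling branch can be tested deterministically.
function shouldAnalyze(
  trace: TraceSummary,
  sampleRate: number = DEFAULT_SAMPLE_RATE,
  random: () => number = Math.random
): boolean {
  // Errors and thumbs-down feedback bypass sampling entirely.
  if (trace.status === "error") return true;
  if (trace.feedback === "thumbs_down") return true;
  // Everything else is analyzed with probability `sampleRate`.
  return random() < sampleRate;
}
```

Note that this predicate only decides whether analysis work is queued. It runs after ingestion, so it has no effect on whether a trace is sent or stored.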
From trace to regression coverage
When a trace is analyzed, EvalGate looks for failure signals in the trace output, groups repeated failures by a stable hash, and stores the result as a failure_report. From that report it can create a candidate_eval_case with the source trace IDs, minimized input, expected constraints, quality score, and review status.
Candidate cases start quarantined. They do not gate releases until they are promoted by review or by an explicit promotion workflow. Promoted cases become regression coverage that can be used by eval runs and release gates.
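The grouping and quarantine steps can be illustrated with the sketch below. Everything in it is an assumption made for the example: the type and function names are hypothetical, the normalization step (trim and lowercase) is invented, and FNV-1a is used only as a simple stand-in for whatever stable hash EvalGate actually computes.

```typescript
// Hypothetical shapes loosely modeled on the failure_report and
// candidate_eval_case records described in the text.
interface FailureSignal { traceId: string; signal: string }
interface FailureReport { hash: string; signal: string; traceIds: string[] }
interface CandidateEvalCase {
  sourceTraceIds: string[];
  reviewStatus: "quarantined" | "promoted";
}

// FNV-1a (32-bit) over UTF-16 code units: a stand-in stable hash, not EvalGate's.
function stableHash(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16);
}

// Group repeated failures so each distinct signal yields one report.
function groupFailures(signals: FailureSignal[]): FailureReport[] {
  const byHash = new Map<string, FailureReport>();
  const reports: FailureReport[] = [];
  for (const s of signals) {
    const normalized = s.signal.trim().toLowerCase(); // assumed normalization
    const hash = stableHash(normalized);
    let report = byHash.get(hash);
    if (report === undefined) {
      report = { hash, signal: normalized, traceIds: [] };
      byHash.set(hash, report);
      reports.push(report);
    }
    report.traceIds.push(s.traceId);
  }
  return reports;
}

// New candidates always start quarantined.
function toCandidate(report: FailureReport): CandidateEvalCase {
  return { sourceTraceIds: report.traceIds.slice(), reviewStatus: "quarantined" };
}

// Only promoted cases count as regression coverage for gating.
function gatingCases(cases: CandidateEvalCase[]): CandidateEvalCase[] {
  return cases.filter((c) => c.reviewStatus === "promoted");
}
```

The point of the filter in `gatingCases` is that an unreviewed candidate cannot block a release by itself; it only starts gating once something explicitly flips its status.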
What blocks a release
Different commands block on different evidence:
| Command | Blocks on |
|---|---|
| npx @evalgate/sdk gate | Local test baseline deltas in evals/regression-report.json |
| npx @evalgate/sdk ci | Eval spec failures and optional base/head run diff |
| npx @evalgate/sdk check | Platform quality score, baseline drop, budget/latency thresholds, policy, and judge credibility |
The most accurate way to describe the loop is: EvalGate captures raw behavior, turns reviewed failures into eval coverage, and blocks releases when current evidence regresses against the selected baseline.