How EvalGate captures regression data

EvalGate does not store one generic “regression” object. It records evidence at three layers, then compares that evidence against a baseline or review decision.

Local gate data

The local gate is the no-account path created by `npx @evalgate/sdk init`. Running `init` executes your existing package test script, captures a baseline in `evals/baseline.json`, and installs a GitHub Actions workflow at `.github/workflows/evalgate-gate.yml`. The built-in baseline records:

| Field | What it means |
| --- | --- |
| `confidenceTests.passed` | Whether the test command passed when the baseline was created |
| `confidenceTests.total` | Parsed test count, when the test runner prints one |
| `commitSha`, `generatedAt`, `updatedAt` | Provenance for the baseline snapshot |
| `tolerance` | Configurable thresholds used by the local gate |

When you run `npx @evalgate/sdk gate`, EvalGate runs your test command again, compares the current pass state and test count to the baseline, and writes `evals/regression-report.json`.
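The comparison described above can be sketched in a few lines. This is an illustration only, not EvalGate's implementation: `GateSnapshot`, `compare_to_baseline`, and the `max_test_drop` tolerance are hypothetical names, and the real `tolerance` keys may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GateSnapshot:
    """Hypothetical stand-in for the confidenceTests fields in a baseline."""
    passed: bool           # did the test command succeed?
    total: Optional[int]   # parsed test count, or None if the runner printed none


def compare_to_baseline(baseline: GateSnapshot, current: GateSnapshot,
                        max_test_drop: int = 0) -> List[str]:
    """Return regression reasons; an empty list means the gate would pass."""
    reasons: List[str] = []
    if baseline.passed and not current.passed:
        reasons.append("tests passed at baseline but fail now")
    # Counts are only comparable when both runs produced one.
    if baseline.total is not None and current.total is not None:
        drop = baseline.total - current.total
        if drop > max_test_drop:
            reasons.append(
                f"test count dropped by {drop} (allowed: {max_test_drop})"
            )
    return reasons
```

In this sketch a missing test count is treated as "nothing to compare" rather than as a failure, which matches the table's note that the count is only recorded when the test runner prints one.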

The built-in local gate protects whatever your existing test script measures. If your test script runs AI evals, it gates AI behavior. If it only runs unit tests, it gates test health until you add eval specs or a custom `eval:regression-gate` script.

Eval run artifacts

The eval runner records behavior at the spec and case level. When you run with `--write-results`, EvalGate writes:

.evalgate/last-run.json
.evalgate/runs/<runId>.json
.evalgate/runs/latest.json
.evalgate/runs/index.json

Each run artifact contains the run ID, timestamp, spec results, summary counts, pass rate, score, budget/cost data when present, and optional input, expected, actual, and metadata fields supplied by the spec executor.

The `evalgate diff` command compares a base run and a head run. It classifies changes as new failures, fixed failures, score drops, score improvements, execution errors, added specs, or removed specs.
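To make the diff categories concrete, here is a minimal sketch of how a base run and a head run could be classified. It assumes each run is a mapping from spec ID to a result with hypothetical `status` and `score` fields; the real artifact schema and the precedence between categories are not specified here, so treat both as assumptions.

```python
from typing import Any, Dict, List, Optional

# Hypothetical per-spec result: {"status": "passed" | "failed" | "error", "score": float}
SpecResult = Dict[str, Any]


def classify_change(base: Optional[SpecResult],
                    head: Optional[SpecResult]) -> Optional[str]:
    """Label one spec's change between runs, or return None if unchanged."""
    if base is None and head is None:
        return None
    if base is None:
        return "added_spec"
    if head is None:
        return "removed_spec"
    if head["status"] == "error":
        return "execution_error"
    if base["status"] == "passed" and head["status"] == "failed":
        return "new_failure"
    if base["status"] == "failed" and head["status"] == "passed":
        return "fixed_failure"
    # Status is unchanged (or not covered above), so fall back to the score.
    if head["score"] < base["score"]:
        return "score_drop"
    if head["score"] > base["score"]:
        return "score_improvement"
    return None


def diff_runs(base_run: Dict[str, SpecResult],
              head_run: Dict[str, SpecResult]) -> Dict[str, List[str]]:
    """Group spec IDs by change category across the union of both runs."""
    changes: Dict[str, List[str]] = {}
    for spec_id in sorted(set(base_run) | set(head_run)):
        label = classify_change(base_run.get(spec_id), head_run.get(spec_id))
        if label is not None:
            changes.setdefault(label, []).append(spec_id)
    return changes
```

Status changes take priority over score changes in this sketch, so a spec that flips from passing to failing is reported once as a new failure rather than also as a score drop.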

Platform trace data

The platform captures production and staging behavior through traces and spans. `POST /api/collector` ingests one trace with spans. `POST /api/collector/batch` ingests up to 100 traces. The collector stores:

| Record | Captured fields |
| --- | --- |
| Trace | status, source, environment, duration, repo SHA, content hash, metadata |
| Span | input, output, model, vendor, token counts, latency, parameters, errors, metadata |
| Feedback | thumbs up/down or other feedback attached to a trace or span |
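Because the batch endpoint accepts at most 100 traces per request, a client holding more than that has to split its payload first. The helper below is a hedged sketch of that chunking step only; `chunk_traces` is a hypothetical name, and the request body shape is not documented here, so the sketch stops short of sending anything.

```python
from typing import Iterator, List, Sequence

MAX_BATCH_SIZE = 100  # limit stated for POST /api/collector/batch


def chunk_traces(traces: Sequence[dict],
                 size: int = MAX_BATCH_SIZE) -> Iterator[List[dict]]:
    """Yield consecutive slices of `traces`, each small enough for one batch call."""
    if size < 1 or size > MAX_BATCH_SIZE:
        raise ValueError(f"size must be between 1 and {MAX_BATCH_SIZE}")
    for start in range(0, len(traces), size):
        yield list(traces[start:start + size])
```

Each yielded list would then go into one batch request, in order, so trace ordering is preserved across requests.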

After ingestion, EvalGate decides whether to enqueue the trace for failure analysis. Errors and thumbs-down feedback are always analyzed. Successful traces are sampled for analysis at the server default rate of 10%.

Sampling controls failure-analysis work, not whether the SDK sends a trace. The SDK collector helpers send traces by default unless you configure client-side sampling.
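The enqueue rule above reduces to a small decision function. This is an illustrative sketch, not the server's code: the `status` values and the `should_analyze` signature are assumptions, and only the 10% default comes from the text.

```python
import random
from typing import Optional

DEFAULT_SAMPLE_RATE = 0.10  # server default rate described above


def should_analyze(status: str, has_thumbs_down: bool,
                   sample_rate: float = DEFAULT_SAMPLE_RATE,
                   rng: Optional[random.Random] = None) -> bool:
    """Decide whether an ingested trace is enqueued for failure analysis."""
    # Errors and negative feedback bypass sampling entirely.
    if status == "error" or has_thumbs_down:
        return True
    # Successful traces are sampled; random() returns a float in [0.0, 1.0).
    roll = (rng or random).random()
    return roll < sample_rate
```

Passing a seeded `random.Random` makes the sampling reproducible, which is convenient when testing the rule.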

From trace to regression coverage

When a trace is analyzed, EvalGate looks for failure signals in the trace output, groups repeated failures by a stable hash, and stores the result as a `failure_report`. From that report it can create a `candidate_eval_case` with the source trace IDs, minimized input, expected constraints, quality score, and review status.

Candidate cases start quarantined. They do not gate releases until they are promoted by review or by an explicit promotion workflow. Promoted cases become regression coverage that can be used by eval runs and release gates.
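The grouping and quarantine steps can be sketched as follows. Everything here is hypothetical: the text does not say which fields feed the stable hash, so the example hashes a normalized failure signal plus the model name, and the dictionary keys only loosely mirror the fields listed above.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def failure_key(signal: str, model: str) -> str:
    """Derive a stable grouping key; the choice of inputs is an assumption."""
    normalized = f"{signal.strip().lower()}|{model.strip().lower()}"
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


def group_failures(failures: List[dict]) -> Dict[str, List[str]]:
    """Map each failure key to the trace IDs that produced it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for failure in failures:
        key = failure_key(failure["signal"], failure["model"])
        groups[key].append(failure["trace_id"])
    return dict(groups)


def to_candidate(key: str, trace_ids: List[str]) -> dict:
    """Build a candidate case; new candidates always start quarantined."""
    return {
        "failure_hash": key,
        "source_trace_ids": list(trace_ids),
        "review_status": "quarantined",
    }


def gating_cases(candidates: List[dict]) -> List[dict]:
    """Only promoted cases count as regression coverage."""
    return [c for c in candidates if c["review_status"] == "promoted"]
```

Hashing a normalized signal means the same failure seen in many traces collapses into one group, so a reviewer sees a single candidate with several source trace IDs instead of one candidate per trace.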

What blocks a release

Different commands block on different evidence:

| Command | Blocks on |
| --- | --- |
| `npx @evalgate/sdk gate` | Local test baseline deltas in `evals/regression-report.json` |
| `npx @evalgate/sdk ci` | Eval spec failures and optional base/head run diff |
| `npx @evalgate/sdk check` | Platform quality score, baseline drop, budget/latency thresholds, policy, and judge credibility |

In short, EvalGate captures raw behavior, turns reviewed failures into eval coverage, and blocks releases when current evidence regresses against the selected baseline.
