EvalGate CLI command reference

Complete reference for all EvalGate CLI commands: setup, gates, CI integration, trace labeling, failure analysis, judge orchestration, and auto loops.

The EvalGate CLI is the fastest way to run regression gates, analyze failure patterns, and automate prompt improvement without leaving your terminal. Run TypeScript CLI commands with npx @evalgate/sdk <command> for zero-install usage, or add @evalgate/sdk to your project and run the same commands through your package manager.

TypeScript (npx)

npm install @evalgate/sdk
npx @evalgate/sdk <command>

Python

pip install "evalgate-sdk[cli]"
evalgate <command>

Exit codes

Code	Meaning
`0`	Clean — no regressions
`1`	Regressions detected
`2`	Configuration error
`8`	`WARN` — judge credibility low (discriminative power ≤ 0.05)
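If you wrap the CLI in your own script, the table above can be turned into a small lookup. The following is a minimal sketch in Python, not part of the EvalGate SDK: the function names and the `treat_warn_as_failure` policy flag are hypothetical, and only the four exit codes come from the table.

```python
import subprocess

# Exit codes and meanings copied from the table above.
EXIT_CODE_MEANINGS = {
    0: "clean: no regressions",
    1: "regressions detected",
    2: "configuration error",
    8: "WARN: judge credibility low",
}


def describe_exit_code(code: int) -> str:
    """Return a human-readable meaning for a CLI exit code."""
    return EXIT_CODE_MEANINGS.get(code, f"unrecognized exit code {code}")


def should_block_merge(code: int, treat_warn_as_failure: bool = False) -> bool:
    """Hypothetical wrapper policy: fail on anything except 0, and on 8 only if asked."""
    if code == 0:
        return False
    if code == 8:
        return treat_warn_as_failure
    return True


def run_gate() -> int:
    """Run the local gate and return its exit code without raising on failure."""
    result = subprocess.run(["npx", "@evalgate/sdk", "gate"], check=False)
    return result.returncode
```

A wrapper could call `run_gate()`, log `describe_exit_code(code)`, and fail the build only when `should_block_merge(code)` is true. Whether a WARN should block a merge is a team decision, which is why the sketch leaves it as a flag.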

Setup and initialization

npx @evalgate/sdk init — scaffold a new project

Detects your repository, runs your test script to create an initial baseline, installs .github/workflows/evalgate-gate.yml, and prints what to commit. Works with any Node.js project and requires no account.

npx @evalgate/sdk init

After running init, commit the generated files and push to trigger your first CI gate:

git add evals/ .github/workflows/evalgate-gate.yml evalgate.config.json
git commit -m "Add EvalGate baseline and CI gate"
git push

npx @evalgate/sdk verify — pre-flight validation

Runs six checks before your first gate or CI run: API key, baseline file, config schema, test runner detection, CI workflow presence, and network reachability. Use this to catch setup problems before they cause silent CI failures.

npx @evalgate/sdk verify

npx @evalgate/sdk doctor — environment diagnostics

Diagnoses your full environment — Node version, SDK version, config file validity, API key scope, and connection to the EvalGate platform. Run this when a gate command produces unexpected output.

npx @evalgate/sdk doctor

Gate and CI

npx @evalgate/sdk gate — run the regression gate locally

Compares your current test results against the stored baseline and exits 1 if any metric regresses.

npx @evalgate/sdk gate

Output format options:

npx @evalgate/sdk gate --format github   # CI step summary and job annotations
npx @evalgate/sdk gate --format json     # Machine-readable JSON output

npx @evalgate/sdk ci — one-command CI gate

Discovers eval specs, runs them, writes results, and compares against a base run when --base is provided. Add --impacted-only to run only specs affected by the current diff.

npx @evalgate/sdk ci --format github --write-results --base main

Full GitHub Actions workflow:

name: EvalGate CI
on: [push, pull_request]
jobs:
  evalgate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm ci
      - run: npx @evalgate/sdk ci --format github --write-results --base main
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: evalgate-results
          path: .evalgate/

npx @evalgate/sdk check — platform gate

Runs the regression gate against the EvalGate platform (requires EVALGATE_API_KEY). Use --onFail import to upload failed run context to the dashboard for review.

npx @evalgate/sdk check --format github --onFail import

npx @evalgate/sdk baseline update — refresh the baseline

Re-runs your tests and overwrites the stored baseline with the new results. Run this after you intentionally change model behavior or fix a known issue.

npx @evalgate/sdk baseline update

npx @evalgate/sdk upgrade --full — full metric gate

Runs the gate with all available metrics enabled, including judge credibility checks. Use this before promoting a new prompt or model version.

npx @evalgate/sdk upgrade --full

Labeling and analysis

npx @evalgate/sdk failure-modes — define app-specific failure modes

Opens an interactive prompt to define the failure categories specific to your application (for example: hallucination, off_topic, wrong_format). Run this once before labeling.

npx @evalgate/sdk failure-modes

Failure modes are stored in your config and used by analyze and the judge credibility system.

npx @evalgate/sdk label — interactive trace labeling

Steps through your unlabeled traces one by one. Use arrow keys to select pass/fail and pick a failure mode. Press u to undo the previous label. Press Ctrl-C to save progress and exit.

npx @evalgate/sdk label

Each label you save becomes part of the golden dataset that can be used by later eval runs and gates.

npx @evalgate/sdk analyze — failure-mode frequency report

Aggregates all labeled traces and prints a frequency report showing which failure modes are most common, their weighted impact, and whether any exceed the thresholds you defined in evalgate.config.json.

npx @evalgate/sdk analyze
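To make the report concrete, here is a rough Python sketch of how such a frequency report could be computed. It is illustrative only and does not reproduce the CLI's implementation. The `weight`, `maxPercent`, and `maxCount` keys come from the `failureModeAlerts` config, but the formula for weighted impact (count times weight) and the choice of all labeled traces as the percentage denominator are assumptions.

```python
from collections import Counter
from typing import Dict, List, Optional


def failure_mode_report(labels: List[Optional[str]],
                        modes: Dict[str, dict]) -> List[dict]:
    """Summarize labeled traces per failure mode.

    labels holds one entry per labeled trace: a failure-mode name,
    or None when the trace passed.
    """
    total = len(labels)
    counts = Counter(label for label in labels if label is not None)
    report = []
    for name, rule in modes.items():
        count = counts.get(name, 0)
        percent = 100.0 * count / total if total else 0.0
        exceeded = (
            percent > rule.get("maxPercent", float("inf"))
            or count > rule.get("maxCount", float("inf"))
        )
        report.append({
            "mode": name,
            "count": count,
            "percent": percent,
            # Assumed definition of weighted impact: count scaled by weight.
            "weighted": count * rule.get("weight", 1.0),
            "exceeded": exceeded,
        })
    # Most impactful failure mode first.
    report.sort(key=lambda row: row["weighted"], reverse=True)
    return report


modes = {
    "hallucination": {"weight": 1.5, "maxPercent": 10},
    "off_topic": {"weight": 1.0, "maxPercent": 20, "maxCount": 5},
    "wrong_format": {"weight": 0.8, "maxPercent": 15},
}
labels = ["hallucination"] * 3 + ["off_topic"] * 2 + ["wrong_format"] + [None] * 14
report = failure_mode_report(labels, modes)
```

With these 20 toy labels, hallucination accounts for 3 of 20 traces (15%), which is above its `maxPercent` of 10, so it is the only mode flagged as exceeding a threshold.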

npx @evalgate/sdk replay-decision — compare two runs

Loads two saved run artifacts and emits a keep/discard decision for each case — useful for reviewing whether a prompt change improved or regressed specific failure modes.

npx @evalgate/sdk replay-decision \
  --previous .evalgate/runs/run-prev.json \
  --current  .evalgate/runs/run-latest.json

Advanced

npx @evalgate/sdk cluster — group similar failures

Reads a saved run artifact and groups cases with similar failure patterns. Use the output to prioritize which failure mode to fix first.

npx @evalgate/sdk cluster --run .evalgate/runs/latest.json

npx @evalgate/sdk synthesize — generate synthetic golden cases

Reads your labeled failure dataset and generates deterministic synthetic test cases to expand coverage of underrepresented failure modes.

npx @evalgate/sdk synthesize \
  --dataset .evalgate/golden/labeled.jsonl \
  --output  .evalgate/golden/synthetic.jsonl

npx @evalgate/sdk auto — bounded autonomous prompt-improvement loop

Reads labeled failures and prior prompt history, generates the next candidate prompt edit, evaluates it against impacted specs, and keeps the edit only if it does not regress any existing case. The loop terminates on explicit guard conditions rather than running open-ended.

npx @evalgate/sdk auto \
  --objective tone_mismatch \
  --prompt prompts/support.md \
  --autonomous \
  --budget 3

To repeat bounded cycles unattended (for example, overnight):

npx @evalgate/sdk auto daemon --cycles 5

auto and auto daemon are currently available only in the TypeScript CLI. If your application runs on Python, keep using the Python SDK at runtime and invoke these two commands through npx @evalgate/sdk auto.
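The keep-or-discard behavior of the loop can be pictured with a short Python sketch. This is a conceptual model under stated assumptions, not the CLI's code: `propose` and `evaluate` are hypothetical callables standing in for prompt generation and spec evaluation, and the two guards shown (an iteration budget and "nothing left to fix") are examples of guard conditions rather than the documented set.

```python
from typing import Callable, Dict, List, Tuple

Results = Dict[str, bool]  # case id -> True if the case passed
History = List[Tuple[str, bool]]  # (candidate prompt, kept?)


def regresses(before: Results, after: Results) -> bool:
    """True if any case that passed before fails (or is missing) after."""
    return any(ok and not after.get(case, False) for case, ok in before.items())


def bounded_loop(prompt: str,
                 propose: Callable[[str, History], str],
                 evaluate: Callable[[str], Results],
                 budget: int) -> Tuple[str, History]:
    best_prompt = prompt
    best_results = evaluate(best_prompt)
    history: History = []
    for _ in range(budget):               # guard: iteration budget
        if all(best_results.values()):    # guard: nothing left to fix
            break
        candidate = propose(best_prompt, history)
        results = evaluate(candidate)
        kept = not regresses(best_results, results)
        history.append((candidate, kept))
        if kept:                          # keep only non-regressing edits
            best_prompt, best_results = candidate, results
    return best_prompt, history


# Toy stand-ins so the sketch runs without a model or the CLI.
TOY_RESULTS = {
    "v0": {"a": True, "b": False},
    "v1": {"a": False, "b": True},   # fixes b but breaks a
    "v2": {"a": True, "b": True},
}
final_prompt, history = bounded_loop(
    "v0",
    propose=lambda current, past: f"v{len(past) + 1}",
    evaluate=lambda p: TOY_RESULTS[p],
    budget=3,
)
```

In the toy run, the first candidate is discarded because it breaks a case that previously passed, the second is kept, and the loop stops early once every case passes.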

npx @evalgate/sdk discover --manifest — refresh the spec manifest

Scans your project for eval spec files, refreshes the manifest, and reports any redundant or overlapping specs.

npx @evalgate/sdk discover --manifest

Judge commands

npx @evalgate/sdk judge registry — list available judges

Prints all judges available in the EvalGate registry for your organization.

npx @evalgate/sdk judge registry

npx @evalgate/sdk judge presets — list judge presets

Prints the built-in judge presets (pre-configured provider + model + prompt combinations).

npx @evalgate/sdk judge presets

npx @evalgate/sdk judge test — test a judge configuration

Runs a judge against a single input/output pair and prints the score, reasoning, and signals. Use this to validate a judge configuration before wiring it into your gate.

npx @evalgate/sdk judge test \
  --provider openai \
  --model gpt-5.2-chat-latest \
  --judge support_quality \
  --input "Cancel my subscription" \
  --output "I've canceled your plan effective today."

Example output:

{
  "score": 0.92,
  "passed": true,
  "reasoning": "The response directly addresses the user's request with a clear confirmation.",
  "signals": ["direct", "action_confirmed", "professional_tone"]
}
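If you want to consume this output from a script, one option is to parse it into a typed record. The Python sketch below assumes the command prints exactly the JSON object shown above to standard output; the `JudgeResult` class and `parse_judge_result` function are hypothetical helpers, not part of either SDK.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class JudgeResult:
    score: float
    passed: bool
    reasoning: str
    signals: List[str]


def parse_judge_result(raw: str) -> JudgeResult:
    """Parse the JSON text of a judge result into a typed record."""
    data = json.loads(raw)
    return JudgeResult(
        score=float(data["score"]),
        passed=bool(data["passed"]),
        reasoning=str(data["reasoning"]),
        signals=list(data.get("signals", [])),
    )


# The example output from above, pasted in as a string.
example = """
{
  "score": 0.92,
  "passed": true,
  "reasoning": "The response directly addresses the user's request with a clear confirmation.",
  "signals": ["direct", "action_confirmed", "professional_tone"]
}
"""
result = parse_judge_result(example)
```

Failing fast on a missing `score` or `passed` key (a `KeyError` here) is deliberate: a judge result without those fields is more likely a configuration problem than something to default silently.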

npx @evalgate/sdk judge compare — compare two outputs

Runs a judge against two candidate outputs for the same input and returns a preference decision with reasoning. Useful for A/B prompt comparisons.

npx @evalgate/sdk judge compare \
  --config-id 42 \
  --input "Cancel my subscription" \
  --output-a "I've canceled your plan effective today." \
  --output-b "Please visit billing settings to make changes."

Judge credibility config

Configure judge credibility thresholds and failure-mode alerts in evalgate.config.json at the root of your project:

{
  "judge": {
    "bootstrapSeed": 42,
    "tprMin": 0.70,
    "tnrMin": 0.70,
    "minLabeledSamples": 30
  },
  "failureModeAlerts": {
    "modes": {
      "hallucination": { "weight": 1.5, "maxPercent": 10 },
      "off_topic":     { "weight": 1.0, "maxPercent": 20, "maxCount": 5 },
      "wrong_format":  { "weight": 0.8, "maxPercent": 15 }
    }
  }
}

Set bootstrapSeed to a fixed value (for example, 42) to make judge credibility calculations deterministic across CI runs. Without a fixed seed, bootstrap confidence intervals may vary slightly between runs.
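The effect of a fixed seed can be seen in a generic percentile bootstrap. The sketch below is a textbook-style illustration in Python, not EvalGate's actual credibility calculation: the function name, the resample count, and the use of a rate such as TPR as the statistic are all assumptions.

```python
import random
from typing import List, Tuple


def bootstrap_rate_ci(outcomes: List[bool],
                      seed: int,
                      resamples: int = 1000,
                      alpha: float = 0.05) -> Tuple[float, float]:
    """Percentile bootstrap interval for the fraction of True outcomes."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    rng = random.Random(seed)  # fixed seed -> identical resamples every run
    n = len(outcomes)
    rates = []
    for _ in range(resamples):
        sample = [outcomes[rng.randrange(n)] for _ in range(n)]
        rates.append(sum(sample) / n)
    rates.sort()
    low = rates[int((alpha / 2) * resamples)]
    high = rates[int((1 - alpha / 2) * resamples) - 1]
    return low, high


# 30 labeled positives, 24 of which the judge got right (toy data).
outcomes = [True] * 24 + [False] * 6
first = bootstrap_rate_ci(outcomes, seed=42)
second = bootstrap_rate_ci(outcomes, seed=42)
```

Here `first` and `second` are identical because both calls draw the same resamples. With a different or unset seed, the interval endpoints would typically shift slightly from run to run, which is the variation the fixed seed removes.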

When a judge's discriminative power (TPR + TNR − 1, where TPR is the true positive rate and TNR is the true negative rate) is at or below 0.05, the gate skips score correction and exits with code 8 (WARN) instead of using a potentially biased score. When the labeled sample count is below minLabeledSamples, bootstrap confidence intervals are skipped as well. Both conditions emit reason codes into the judgeCredibility block of the JSON report.
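The two rules above can be expressed as a small decision function. This Python sketch is only a restatement of the described behavior: the 0.05 floor and the `minLabeledSamples` check come from the text, while the function name, the returned field names, and the reason-code strings are hypothetical.

```python
from typing import Dict, List, Union


def judge_credibility(tpr: float,
                      tnr: float,
                      labeled_samples: int,
                      min_labeled_samples: int = 30,
                      power_floor: float = 0.05) -> Dict[str, Union[float, bool, List[str]]]:
    """Decide which credibility steps run for a judge."""
    power = tpr + tnr - 1  # discriminative power
    low_power = power <= power_floor
    too_few = labeled_samples < min_labeled_samples
    reasons: List[str] = []
    if low_power:
        reasons.append("low_discriminative_power")      # hypothetical code
    if too_few:
        reasons.append("insufficient_labeled_samples")  # hypothetical code
    return {
        "discriminativePower": power,
        "applyScoreCorrection": not low_power,
        "computeBootstrapCi": not too_few,
        "warn": low_power,  # corresponds to exit code 8
        "reasons": reasons,
    }


weak_judge = judge_credibility(tpr=0.52, tnr=0.51, labeled_samples=40)
small_sample = judge_credibility(tpr=0.9, tnr=0.8, labeled_samples=12)
```

In the first call the judge is barely better than chance (power of about 0.03), so correction is skipped and the gate would warn. In the second the judge is credible, but only 12 labels are available, so only the bootstrap interval is skipped.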
