EvalGate: CI for AI behavior
EvalGate helps AI teams stop the same failure from shipping twice. Start with a local regression gate, then add traces, evaluation history, LLM judges, reviews, cost controls, and governance when your AI system reaches production scale.
The product is one operating loop: trace to eval to gate. Real behavior produces evidence. Reviewed failures become test cases. Promoted test cases become CI gates that give reviewers evidence before a prompt, model, retriever, or agent change ships.
Quick start
Set up your first local eval gate in under 5 minutes, with no account required.
Trace to eval to gate
Understand the operating loop before you wire in platform features.
SDK and CLI
Install the TypeScript or Python SDK and use the same assertions locally, in app code, and in CI.
API reference
Integrate directly with the EvalGate platform for traces, runs, projects, and keys.
The adoption path
Start with one local gate
Install the SDK, snapshot your current test and eval results as a baseline, and add a CI step that fails when results regress against that baseline. This proves the workflow before anyone has to adopt another dashboard.
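To make the baseline idea concrete, here is a minimal sketch of the comparison a local gate performs. It is plain Python and does not use the EvalGate SDK; the function name `find_regressions`, the score dictionaries, and the `tolerance` parameter are all hypothetical and chosen for illustration only.

```python
def find_regressions(baseline, current, tolerance=0.0):
    """Return the names of evals that regressed against the baseline.

    An eval regresses when its current score falls more than `tolerance`
    below its baseline score, or when it is missing from the current run.
    """
    regressions = []
    for name, base_score in baseline.items():
        score = current.get(name)
        if score is None or score < base_score - tolerance:
            regressions.append(name)
    return sorted(regressions)


# Hypothetical snapshot taken when the gate was first set up.
baseline = {"refund_policy": 1.0, "tone_check": 0.9}
# Hypothetical scores from the change under review.
current = {"refund_policy": 1.0, "tone_check": 0.7}

failed = find_regressions(baseline, current, tolerance=0.05)
# A CI step would pass this value to sys.exit() so the job fails on regression.
exit_code = 1 if failed else 0
print("Regressed:" if failed else "Gate passed", ", ".join(failed))
```

In a CI job, a non-zero exit code is what blocks the merge. The tolerance is a design choice: a small allowance avoids failing on scoring noise, while zero makes the gate strict.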
Capture real failures
Add tracing when local gates are not enough. EvalGate captures production and staging behavior with inputs, outputs, tool calls, latency, token usage, cost, and metadata.
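As an illustration of what such a trace might carry, the sketch below models one record with the fields listed above. The class and field names (`TraceRecord`, `ToolCall`, `latency_ms`, `cost_usd`, and so on) are assumptions made for this example, not the EvalGate trace schema.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """One tool invocation made while producing the output (hypothetical shape)."""
    name: str
    arguments: dict[str, Any]
    result: Any = None


@dataclass
class TraceRecord:
    """One captured request/response pair with its operational metadata."""
    input: str
    output: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    latency_ms: float = 0.0
    prompt_tokens: int = 0
    completion_tokens: int = 0
    cost_usd: float = 0.0
    metadata: dict[str, Any] = field(default_factory=dict)

    @property
    def total_tokens(self) -> int:
        # Token usage is the sum of prompt and completion tokens.
        return self.prompt_tokens + self.completion_tokens


trace = TraceRecord(
    input="Can I get a refund after 45 days?",
    output="Refunds are not available.",
    tool_calls=[ToolCall(name="lookup_policy", arguments={"topic": "refunds"})],
    latency_ms=820.0,
    prompt_tokens=310,
    completion_tokens=12,
    cost_usd=0.0011,
    metadata={"environment": "staging"},
)
```

Keeping the environment in metadata makes it possible to separate production behavior from staging behavior when deciding which failures to review.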
Promote failures into coverage
Convert repeated or high-risk failures into reusable eval cases. Label, cluster, synthesize, and review cases so your suites track actual user pain.
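The promotion step can be sketched as follows, assuming a reviewed failure arrives as a simple dictionary and the reviewer supplies the phrases a correct answer must contain. `EvalCase`, `promote`, and `passes` are hypothetical names for illustration and are not part of the EvalGate SDK; a real suite would likely use richer assertions than substring checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """A reusable test case derived from a reviewed failure (hypothetical shape)."""
    case_id: str
    input: str
    must_contain: tuple[str, ...]
    source_trace_id: str


def promote(trace: dict, must_contain: list[str]) -> EvalCase:
    """Turn a reviewed failing trace into an eval case, keeping a link to its origin."""
    return EvalCase(
        case_id=f"case-{trace['trace_id']}",
        input=trace["input"],
        must_contain=tuple(must_contain),
        source_trace_id=trace["trace_id"],
    )


def passes(case: EvalCase, output: str) -> bool:
    """Check that every required phrase appears in the output, ignoring case."""
    lowered = output.lower()
    return all(phrase.lower() in lowered for phrase in case.must_contain)


# A failure a reviewer flagged: the answer omitted the refund window.
failing_trace = {
    "trace_id": "t-001",
    "input": "Can I get a refund?",
    "output": "Refunds are not available.",
}
case = promote(failing_trace, must_contain=["30 days"])

still_failing = passes(case, failing_trace["output"])
fixed = passes(case, "You can request a refund within 30 days.")
```

Storing `source_trace_id` on the case keeps the link back to the real behavior that motivated it, which is the evidence a reviewer needs when the gate later fails.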
What to use first
| Stage | Use this | Outcome |
|---|---|---|
| First repo | Local gate | Block regressions without creating an account |
| Production AI feature | Traces and eval runs | Turn real behavior into coverage |
| Team rollout | Reviews, judges, and PR annotations | Make AI quality reviewable |
| Governed rollout | Costs, benchmarks, annotations, and audit history | Track quality, spend, and release evidence across projects |
Explore next
CI/CD integration
Wire EvalGate into GitHub Actions or GitLab CI to gate every PR.
Tracing setup
Capture the real AI behavior that should become eval coverage.
LLM judge
Add judge-backed scoring when assertions alone are not enough.
Agent governance
Scale from one gate to governed AI release workflows.