eval-gate — Sam Latino

Eval scores wobble between identical runs, so a CI gate that diffs point estimates will either page your team for noise or get muted within a month — eval-gate fails builds only when a regression clears the measured noise floor.

what it is

A regression gate for LLM behavior, shaped like the rest of your CI: golden sets in JSONL, deterministic scorers, a drift check with a statistical floor, and a GitHub Action that posts one sticky PR comment and sets the exit code. The core is Python; the Action is TypeScript. The other five projects on this site run it on their own pull requests.

golden sets

One case per line: the input, the expectation, and which scorer judges it. JSONL because line-oriented diffs review well and blame works per-case — a golden set you cannot code-review decays into folklore.
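As a minimal sketch of what "one case per line" can look like in practice, the loader below parses a small JSONL golden set. The field names (`id`, `input`, `expected`, `scorer`) and the `GoldenCase` type are assumptions made for illustration; the real eval-gate schema may differ.

```python
import json
from dataclasses import dataclass
from typing import Any

# Hypothetical schema: these field names are illustrative, not eval-gate's own.
SAMPLE_JSONL = """\
{"id": "greet-1", "input": "Say hello", "expected": "hello", "scorer": "exact"}
{"id": "sum-1", "input": "What is 2+2?", "expected": 4, "scorer": "numeric_tolerance"}
"""


@dataclass(frozen=True)
class GoldenCase:
    id: str
    input: str
    expected: Any
    scorer: str


def load_golden_set(text: str) -> list[GoldenCase]:
    """Parse one JSON object per non-blank line, reporting the failing line number."""
    cases = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between cases
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {lineno}: invalid JSON ({exc.msg})") from exc
        # Missing or unknown keys raise TypeError here, so schema drift is loud.
        cases.append(GoldenCase(**record))
    return cases


cases = load_golden_set(SAMPLE_JSONL)
```

Because each case sits on its own line, adding or changing a case shows up as a one-line diff in review.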

scorers

scorer	verdict basis
`exact`	string equality
`regex`	pattern match
`json_subset`	expected keys/values present in candidate JSON
`numeric_tolerance`	within a configured tolerance
`text_similarity`	token-level F1
`judge` (optional)	model-graded — see degradation below

Every scorer except the judge is embedding-free and fully deterministic: same inputs, same verdict, on any machine, at any time. That property is what makes a baseline from three weeks ago still meaningful today.
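To make the determinism claim concrete, here is a rough sketch of two of the scorers in the table: token-level F1 and a JSON subset check. The function names, the lowercase whitespace tokenization, and the recursion rules are assumptions for illustration, not eval-gate's actual implementation.

```python
from collections import Counter
from typing import Any


def token_f1(expected: str, candidate: str) -> float:
    """Token-level F1 over lowercase whitespace tokens (tokenization is an assumption)."""
    exp = expected.lower().split()
    cand = candidate.lower().split()
    if not exp or not cand:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(exp == cand)
    # Multiset intersection: a repeated token only matches as often as it appears in both.
    overlap = sum((Counter(exp) & Counter(cand)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(exp)
    return 2 * precision * recall / (precision + recall)


def json_subset(expected: Any, candidate: Any) -> bool:
    """True if every expected key/value appears in the candidate; extra keys are ignored."""
    if isinstance(expected, dict):
        return isinstance(candidate, dict) and all(
            key in candidate and json_subset(value, candidate[key])
            for key, value in expected.items()
        )
    # Non-dict values (including lists) are compared by plain equality in this sketch.
    return expected == candidate
```

Neither function touches a model, a clock, or a random source, so the same pair of strings yields the same verdict on any machine.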

drift detection

Run a baseline k times and you have a distribution, not a number. eval-gate compares the candidate’s mean score against the baseline’s confidence interval:

inside the interval — noise; the gate stays quiet
below it, past a configured threshold — regression; the build fails and the comment names the threshold that tripped

The k-repeat cost is real (k× eval spend on the baseline) and it buys the one property a gate cannot live without: failures the team believes.
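The comparison above can be sketched with the standard library alone. This is an illustration under stated assumptions, not eval-gate's code: the interval is a normal-approximation confidence interval on the baseline mean (for small k a Student's t interval would be wider), and the names `baseline_interval`, `is_regression`, and the default `threshold` are hypothetical.

```python
import statistics
from statistics import NormalDist


def baseline_interval(scores: list[float], confidence: float = 0.95) -> tuple[float, float]:
    """Normal-approximation confidence interval for the mean of k baseline runs."""
    if len(scores) < 2:
        raise ValueError("need at least two baseline runs to estimate noise")
    mean = statistics.fmean(scores)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(scores) / len(scores) ** 0.5
    # Two-sided critical value, about 1.96 at 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * sem, mean + z * sem


def is_regression(baseline_scores: list[float], candidate_mean: float,
                  threshold: float = 0.01) -> bool:
    """Fail only when the candidate falls below the interval by more than `threshold`."""
    lower, _upper = baseline_interval(baseline_scores)
    return candidate_mean < lower - threshold


baseline = [0.80, 0.82, 0.81, 0.79, 0.83]  # k = 5 made-up baseline scores
low, high = baseline_interval(baseline)
```

With these made-up numbers, a candidate mean of 0.80 sits inside the interval and stays quiet, while 0.70 falls well below it and would fail the build. How a real gate should treat a candidate that lands below the interval but within the threshold is a policy choice; this sketch does not fail the build for it.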

Figure (docs/gate-flow.svg): golden set + candidate → scorers → drift gate (mean vs noise floor) → Action → sticky comment + exit code

the Action

The Action posts a sticky comment — one comment per PR, edited in place, not a fresh wall of text per push — with per-scorer deltas against baseline, and enforces the fail-the-build thresholds. Gate output you have to go hunting for is gate output nobody reads.
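One common way to get "edited in place" behavior is to embed a hidden marker in the comment body and look for it on later pushes. The sketch below shows only that decision logic, in Python to match the core rather than the Action's TypeScript. The marker string, `render_comment`, and `plan_upsert` are hypothetical names; actually posting would go through GitHub's issue-comment create and update endpoints, which this sketch does not call.

```python
# Hypothetical marker: an HTML comment is invisible in rendered Markdown.
MARKER = "<!-- eval-gate:sticky -->"


def render_comment(deltas: dict[str, float]) -> str:
    """Render per-scorer deltas against baseline as a small Markdown table."""
    lines = [MARKER, "scorer | delta vs baseline", "--- | ---"]
    for name in sorted(deltas):
        lines.append(f"{name} | {deltas[name]:+.3f}")
    return "\n".join(lines)


def plan_upsert(existing: list[dict], body: str) -> dict:
    """Decide whether to update the existing sticky comment or create the first one.

    `existing` is assumed to be a list of {"id": ..., "body": ...} dicts,
    e.g. as listed from the PR's comments.
    """
    for comment in existing:
        if MARKER in comment["body"]:
            return {"action": "update", "comment_id": comment["id"], "body": body}
    return {"action": "create", "body": body}


body = render_comment({"exact": -0.04, "regex": 0.0})
first_push = plan_upsert([], body)
later_push = plan_upsert([{"id": 7, "body": MARKER + "\nold table"}], body)
```

The first push plans a create; every later push finds the marker and plans an update, so the PR keeps exactly one comment.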

degradation, not silence

If no judge endpoint is configured, judge-scored cases report `skipped (no endpoint)` in the comment: visible, counted, and never folded into a pass. A gate that silently skips part of its golden set is a gate that lies about coverage, which is worse than no gate.
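A small sketch of what "counted, and never folded into a pass" can mean when results are tallied. The verdict strings and the `summarize` helper are assumptions for illustration; the point is only that skipped cases get their own count and are left out of the pass rate's denominator rather than treated as passes.

```python
from collections import Counter
from typing import Optional, Union


def summarize(verdicts: list[str]) -> dict[str, Union[int, Optional[float]]]:
    """Tally verdicts, keeping skipped cases visible and out of the pass rate.

    Verdicts are assumed to be one of "pass", "fail", or "skipped".
    """
    counts = Counter(verdicts)
    scored = counts["pass"] + counts["fail"]
    return {
        "pass": counts["pass"],
        "fail": counts["fail"],
        "skipped": counts["skipped"],
        # None rather than 1.0 when nothing was scored: no evidence is not a pass.
        "pass_rate": counts["pass"] / scored if scored else None,
    }


summary = summarize(["pass", "pass", "fail", "skipped"])
```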

numbers

tests: 112 (core + Action)
scorers: 6 (5 deterministic + judge)
dogfooders: 5 (the other projects here)
PR comments: 1 (sticky, edited in place)

bench/gate-overhead: pending

No performance claims yet. What belongs here: wall-clock gate overhead per PR at typical golden-set sizes, and scorer throughput — measured, not asserted. Until those runs exist, the table stays empty.