sam@latino:~/projects/eval-gate$ cat README.md

eval-gate

Regression-eval CI gate — embedding-free scorers, drift detection with a k-repeat noise floor, and a sticky PR comment via a GitHub Action. Dogfooded by every other repo here.

lang    Python + TypeScript
status  active
tests   112
repo    github.com/slatino-dev/eval-gate

  • [evals]
  • [CI]
  • [drift detection]
  • [GitHub Actions]
  • [regression testing]

Eval scores wobble between identical runs, so a CI gate that diffs point estimates will either page your team for noise or get muted within a month — eval-gate fails builds only when a regression clears the measured noise floor.

what it is

A regression gate for LLM behavior, shaped like the rest of your CI: golden sets in JSONL, deterministic scorers, a drift check with a statistical floor, and a GitHub Action that posts one sticky PR comment and sets the exit code. The core is Python; the Action is TypeScript. The other five projects on this site run it on their own pull requests.

golden sets

One case per line: the input, the expectation, and which scorer judges it. JSONL because line-oriented diffs review well and blame works per-case — a golden set you cannot code-review decays into folklore.
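A minimal sketch of what loading such a file could look like. The field names (`id`, `input`, `expected`, `scorer`) and the `Case` and `load_golden_set` names are assumptions made for illustration, not eval-gate's documented schema or API.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Case:
    # Hypothetical field names; the real schema may differ.
    id: str
    input: str
    expected: Any
    scorer: str


def load_golden_set(text: str) -> list[Case]:
    """Parse JSONL text: one JSON object per non-blank line."""
    cases = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between cases
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            # Report the line number so a bad case is easy to find in review.
            raise ValueError(f"line {lineno}: invalid JSON: {exc.msg}") from exc
        cases.append(Case(obj["id"], obj["input"], obj["expected"], obj["scorer"]))
    return cases


# Two illustrative cases, one per line.
SAMPLE = """\
{"id": "greet-1", "input": "Say hello", "expected": "hello", "scorer": "exact"}
{"id": "sum-1", "input": "What is 2+2?", "expected": 4, "scorer": "numeric_tolerance"}
"""

cases = load_golden_set(SAMPLE)
```

Because each case is a single line, adding or changing one case shows up as a one-line diff.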

scorers

scorer             verdict basis
exact              string equality
regex              pattern match
json_subset        expected keys/values present in candidate JSON
numeric_tolerance  within a configured tolerance
text_similarity    token-level F1
judge (optional)   model-graded; see degradation below

Every scorer except the judge is embedding-free and fully deterministic: same inputs, same verdict, on any machine, at any time. That property is what makes a baseline from three weeks ago still meaningful today.
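The deterministic scorers in the table can be sketched as plain functions with no model calls and no randomness. The signatures, the whitespace tokenization, and the choice to compare non-dict JSON values by equality are all assumptions for illustration, not eval-gate's actual implementation.

```python
import re
from collections import Counter
from typing import Any


def exact(expected: str, candidate: str) -> bool:
    return expected == candidate


def regex(pattern: str, candidate: str) -> bool:
    return re.search(pattern, candidate) is not None


def json_subset(expected: Any, candidate: Any) -> bool:
    """True if every expected key/value appears in the candidate.

    Dicts are compared recursively; any other value (including lists)
    must be equal. Extra keys in the candidate are ignored.
    """
    if isinstance(expected, dict):
        return isinstance(candidate, dict) and all(
            key in candidate and json_subset(value, candidate[key])
            for key, value in expected.items()
        )
    return expected == candidate


def numeric_tolerance(expected: float, candidate: float, tol: float) -> bool:
    return abs(expected - candidate) <= tol


def token_f1(expected: str, candidate: str) -> float:
    """Token-level F1 over whitespace-split tokens (multiset overlap)."""
    exp, cand = expected.split(), candidate.split()
    if not exp or not cand:
        return float(exp == cand)  # 1.0 only when both are empty
    overlap = sum((Counter(exp) & Counter(cand)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(exp)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the cat sat", "the cat ran")` shares two of three tokens on each side, so precision and recall are both 2/3 and the F1 is 2/3.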

drift detection

Run a baseline k times and you have a distribution, not a number. eval-gate compares the candidate’s mean score against the baseline’s confidence interval:

  • inside the interval — noise; the gate stays quiet
  • below it, past a configured threshold — regression; the build fails and the comment names the threshold that tripped

The k-repeat cost is real (k× eval spend on the baseline) and it buys the one property a gate cannot live without: failures the team believes.
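A sketch of the comparison described above, under stated assumptions: the interval is a normal-approximation confidence interval on the baseline mean (mean ± z·s/√k), and "past a configured threshold" is read as a margin below the interval's lower bound. The source does not specify either choice, and the names `drift_check`, `DriftResult`, `threshold`, and `z` are hypothetical.

```python
import math
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class DriftResult:
    lower: float       # lower bound of the baseline interval
    upper: float       # upper bound of the baseline interval
    regression: bool   # True when the candidate clears the noise floor


def drift_check(
    baseline_runs: list[float],
    candidate_mean: float,
    threshold: float = 0.0,
    z: float = 1.96,
) -> DriftResult:
    """Compare a candidate mean against the baseline's k-repeat interval.

    Assumption: a normal-approximation interval. For small k a Student-t
    critical value would be wider and more conservative.
    """
    k = len(baseline_runs)
    if k < 2:
        raise ValueError("need at least 2 baseline runs to estimate noise")
    mean = statistics.fmean(baseline_runs)
    half_width = z * statistics.stdev(baseline_runs) / math.sqrt(k)
    lower, upper = mean - half_width, mean + half_width
    # Fail only when the candidate sits below the interval by more than
    # the configured threshold; anything else is treated as not failing.
    return DriftResult(lower, upper, candidate_mean < lower - threshold)


baseline = [0.80, 0.82, 0.78, 0.81, 0.79]  # k = 5 illustrative scores
quiet = drift_check(baseline, candidate_mean=0.79)
failed = drift_check(baseline, candidate_mean=0.70, threshold=0.02)
```

With these illustrative numbers the interval's lower bound is about 0.786, so a candidate at 0.79 is inside the noise and a candidate at 0.70 fails the gate.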

docs/gate-flow.svg
golden set + candidate → scorers → drift gate (mean vs noise floor) → Action → sticky comment + exit code

the Action

The Action posts a sticky comment — one comment per PR, edited in place, not a fresh wall of text per push — with per-scorer deltas against baseline, and enforces the fail-the-build thresholds. Gate output you have to go hunting for is gate output nobody reads.
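One common way to get "one comment per PR, edited in place" is to embed a hidden marker in the comment body and look for it on each run. The sketch below shows only that decision logic, in Python for consistency with the other examples here; the real Action is TypeScript, and the marker string, the function name, and the comment dict shape are assumptions rather than eval-gate's code.

```python
MARKER = "<!-- eval-gate:sticky -->"  # hypothetical hidden marker


def plan_sticky_comment(existing: list[dict], body: str) -> dict:
    """Decide whether to update the existing gate comment or create one.

    `existing` is the PR's current comments, each assumed to be a dict
    with at least "id" and "body" keys.
    """
    marked_body = f"{MARKER}\n{body}"
    for comment in existing:
        if MARKER in comment.get("body", ""):
            # A previous run already posted: edit that comment in place.
            return {"op": "update", "comment_id": comment["id"], "body": marked_body}
    # First run on this PR: create the one sticky comment.
    return {"op": "create", "comment_id": None, "body": marked_body}


first = plan_sticky_comment([{"id": 1, "body": "LGTM"}], "exact: +0.00")
later = plan_sticky_comment(
    [{"id": 1, "body": "LGTM"}, {"id": 7, "body": first["body"]}],
    "exact: -0.03",
)
```

Keeping the decision a pure function makes it testable without calling the GitHub API.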

degradation, not silence

If no judge endpoint is configured, judge-scored cases report skipped (no endpoint) in the comment: visible, counted, and never folded into a pass. A gate that silently skips part of its golden set is a gate that lies about coverage, which is worse than no gate.
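A minimal sketch of that accounting, assuming verdicts are plain strings and that the pass rate is computed over scored cases only. The function names and the summary shape are hypothetical.

```python
from collections import Counter
from typing import Optional


def judge_verdict(judge_endpoint: Optional[str]) -> str:
    """Return an explicit skip when no judge endpoint is configured."""
    if not judge_endpoint:
        return "skipped (no endpoint)"
    # Calling a judge model is out of scope for this sketch.
    raise NotImplementedError("judge call not implemented in this sketch")


def summarize(verdicts: list[str]) -> dict:
    """Count pass, fail, and skipped separately; skips never become passes."""
    counts = Counter(verdicts)
    passed, failed = counts["pass"], counts["fail"]
    skipped = len(verdicts) - passed - failed
    scored = passed + failed
    return {
        "pass": passed,
        "fail": failed,
        "skipped": skipped,
        # Assumption: rate over scored cases only, None if nothing was scored.
        "pass_rate": passed / scored if scored else None,
    }


summary = summarize(["pass", "pass", "fail", judge_verdict(None)])
```

Here the skipped judge case appears in the summary as `skipped: 1` and does not change the pass rate of the three scored cases.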

numbers

metric               value    note
tests                112      core + Action
scorers              6        5 deterministic + judge
dogfooders           5        the other projects here
PR comments          1        sticky, edited in place
bench/gate-overhead  pending

No performance claims yet. What belongs here: wall-clock gate overhead per PR at typical golden-set sizes, and scorer throughput — measured, not asserted. Until those runs exist, the table stays empty.