sam@latino:~/projects/eval-gate$ cat README.md
eval-gate
Regression-eval CI gate — embedding-free scorers, drift detection with a k-repeat noise floor, and a sticky PR comment via a GitHub Action. Dogfooded by every other repo here.
lang: Python + TypeScript | status: active | tests: 112 | repo: github.com/slatino-dev/eval-gate
- [evals]
- [CI]
- [drift detection]
- [GitHub Actions]
- [regression testing]
Eval scores wobble between identical runs, so a CI gate that diffs point estimates will either page your team for noise or get muted within a month — eval-gate fails builds only when a regression clears the measured noise floor.
what it is
A regression gate for LLM behavior, shaped like the rest of your CI: golden sets in JSONL, deterministic scorers, a drift check with a statistical floor, and a GitHub Action that posts one sticky PR comment and sets the exit code. The core is Python; the Action is TypeScript. The other five projects on this site run it on their own pull requests.
golden sets
One case per line: the input, the expectation, and which scorer judges it. JSONL because line-oriented diffs review well and blame works per-case — a golden set you cannot code-review decays into folklore.
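To make the shape concrete, here is a minimal sketch of a golden-set file and a loader for it. The field names `input`, `expected`, and `scorer`, the `GoldenCase` type, and `parse_golden_lines` are assumptions made for illustration; the actual eval-gate schema may differ.

```python
import json
from dataclasses import dataclass
from typing import Any, Iterable, List


@dataclass(frozen=True)
class GoldenCase:
    input: Any
    expected: Any
    scorer: str
    line_no: int  # kept so a failure can point back at the reviewed line


def parse_golden_lines(lines: Iterable[str]) -> List[GoldenCase]:
    """Parse JSONL text: one JSON object per non-blank line."""
    cases = []
    for line_no, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue
        record = json.loads(raw)
        # Hypothetical required fields; the real schema is not shown in this README.
        missing = {"input", "expected", "scorer"} - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        cases.append(
            GoldenCase(record["input"], record["expected"], record["scorer"], line_no)
        )
    return cases


SAMPLE = '''\
{"input": "2 + 2", "expected": "4", "scorer": "exact"}
{"input": "Give pi to two decimals", "expected": 3.14, "scorer": "numeric_tolerance"}
'''

cases = parse_golden_lines(SAMPLE.splitlines())
```

Carrying the line number on each case means a failing case can be traced to the exact line a reviewer approved.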
scorers
| scorer | verdict basis |
|---|---|
| exact | string equality |
| regex | pattern match |
| json_subset | expected keys/values present in candidate JSON |
| numeric_tolerance | within a configured tolerance |
| text_similarity | token-level F1 |
| judge (optional) | model-graded; see degradation below |
Every scorer except the judge is embedding-free and fully deterministic: same inputs, same verdict, on any machine, at any time. That property is what makes a baseline from three weeks ago still meaningful today.
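Two of the deterministic scorers can be sketched in a few lines each. This is an illustration of the general techniques, not eval-gate's code: the function names, the lowercase whitespace tokenizer, and the choice to compare non-dict JSON values by plain equality are all assumptions.

```python
from collections import Counter
from typing import Any


def token_f1(expected: str, candidate: str) -> float:
    """Token-level F1 over lowercased, whitespace-split tokens (assumed tokenizer)."""
    exp_tokens = expected.lower().split()
    cand_tokens = candidate.lower().split()
    if not exp_tokens or not cand_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(exp_tokens == cand_tokens)
    # Multiset intersection: each token counts at most as often as it appears in both.
    overlap = sum((Counter(exp_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)


def json_subset(expected: Any, candidate: Any) -> bool:
    """True when every expected key/value is present in the candidate, recursively."""
    if isinstance(expected, dict):
        return isinstance(candidate, dict) and all(
            key in candidate and json_subset(value, candidate[key])
            for key, value in expected.items()
        )
    # Assumption: lists and scalars must match exactly.
    return expected == candidate
```

Neither function touches a model, a clock, or a random source, which is what lets the same inputs produce the same verdict on any machine.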
drift detection
Run a baseline k times and you have a distribution, not a number. eval-gate compares the candidate’s mean score against the baseline’s confidence interval:
- inside the interval — noise; the gate stays quiet
- below it, past a configured threshold — regression; the build fails and the comment names the threshold that tripped
The k-repeat cost is real (k× eval spend on the baseline) and it buys the one property a gate cannot live without: failures the team believes.
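The comparison above could be sketched as follows. The README does not say which interval eval-gate uses, so this assumes a two-sided 95% Student-t interval on the baseline mean and treats the threshold as an absolute score margin below the interval's lower bound; `baseline_interval` and `is_regression` are hypothetical names.

```python
import math
import statistics
from typing import Sequence, Tuple

# Two-sided 95% Student-t critical values, keyed by degrees of freedom (k - 1).
T_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
        6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def baseline_interval(scores: Sequence[float]) -> Tuple[float, float]:
    """Confidence interval for the baseline mean from k repeated runs."""
    k = len(scores)
    if k < 2:
        raise ValueError("need at least two baseline runs to estimate noise")
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / math.sqrt(k)  # standard error of the mean
    # Beyond the table, fall back to the normal value 1.96 (slightly narrow for modest k).
    t_crit = T_95.get(k - 1, 1.96)
    return mean - t_crit * sem, mean + t_crit * sem


def is_regression(
    baseline_scores: Sequence[float], candidate_mean: float, threshold: float = 0.0
) -> bool:
    """Fail only when the candidate falls below the interval by more than `threshold`."""
    lower, _upper = baseline_interval(baseline_scores)
    return candidate_mean < lower - threshold


baseline = [0.80, 0.82, 0.78, 0.81, 0.79]  # illustrative scores, not measured data
lower, upper = baseline_interval(baseline)
```

With these illustrative scores the interval is roughly 0.780 to 0.820, so a candidate mean of 0.79 is treated as noise while 0.70 fails even with a 0.02 threshold.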
the Action
The Action posts a sticky comment — one comment per PR, edited in place, not a fresh wall of text per push — with per-scorer deltas against baseline, and enforces the fail-the-build thresholds. Gate output you have to go hunting for is gate output nobody reads.
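The real Action is TypeScript; the sticky-comment logic is small enough to sketch in Python for illustration. A common way to edit in place is to embed a hidden HTML marker in the comment body and search for it on the next push. The marker string, function names, and table layout below are assumptions, and the actual GitHub API calls (listing, creating, and updating issue comments) are left out.

```python
from typing import Dict, List, Optional, Tuple

MARKER = "<!-- eval-gate:sticky -->"  # hypothetical marker, not taken from the project


def render_comment(baseline: Dict[str, float], candidate: Dict[str, float]) -> str:
    """Build the comment body: hidden marker plus a per-scorer delta table."""
    rows = ["| scorer | baseline | candidate | delta |", "|---|---|---|---|"]
    for scorer in sorted(baseline):
        if scorer not in candidate:
            continue  # nothing to diff for this scorer
        delta = candidate[scorer] - baseline[scorer]
        rows.append(
            f"| {scorer} | {baseline[scorer]:.3f} | {candidate[scorer]:.3f} | {delta:+.3f} |"
        )
    return "\n".join([MARKER, *rows])


def plan_sticky_update(existing_comments: List[dict]) -> Tuple[str, Optional[int]]:
    """Decide whether to edit the earlier comment or create the first one."""
    for comment in existing_comments:
        if MARKER in comment.get("body", ""):
            return ("update", comment["id"])
    return ("create", None)


body = render_comment({"exact": 0.9}, {"exact": 0.85})
first_push = plan_sticky_update([])
later_push = plan_sticky_update([{"id": 7, "body": "LGTM"}, {"id": 9, "body": body}])
```

Keeping the decision separate from the network call makes the "one comment per PR" behavior testable without a live repository.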
degradation, not silence
If no judge endpoint is configured, judge-scored cases report `skipped (no endpoint)` in the comment: visible, counted, and never folded into a pass. A gate that silently skips part of its golden set is a gate that lies about coverage, which is worse than no gate.
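One way to keep skipped cases out of the pass count is to give them their own status and tally it separately. This is a sketch under assumptions: `judge_status`, `summarize`, and the `grade` callback are hypothetical, and only the `skipped (no endpoint)` label comes from the text above.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Optional

SKIPPED = "skipped (no endpoint)"


def judge_status(judge_endpoint: Optional[str], grade: Callable[[], bool]) -> str:
    """Return a status for one judge-scored case; never call the judge without an endpoint."""
    if not judge_endpoint:
        return SKIPPED
    return "pass" if grade() else "fail"


def summarize(statuses: Iterable[str]) -> Dict[str, int]:
    """Count each status on its own line so skips stay visible in the report."""
    counts = Counter(statuses)
    return {
        "pass": counts["pass"],
        "fail": counts["fail"],
        SKIPPED: counts[SKIPPED],
        "total": sum(counts.values()),
    }


statuses = [
    "pass",                                               # deterministic scorer
    judge_status(None, lambda: True),                     # no endpoint configured
    "fail",                                               # deterministic scorer
    judge_status("https://judge.example", lambda: True),  # placeholder endpoint
]
summary = summarize(statuses)
```

Because the skipped count sits next to the total, a reader can see at a glance how much of the golden set was actually judged.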
numbers
- tests: 112 (core + Action)
- scorers: 6 (5 deterministic + judge)
- dogfooders: 5 (the other projects here)
- PR comments: 1 (sticky, edited in place)
No performance claims yet. What belongs here: wall-clock gate overhead per PR at typical golden-set sizes, and scorer throughput — measured, not asserted. Until those runs exist, the table stays empty.