millstone — Sam Latino

Embedding pipelines have become the default answer to code retrieval before anyone asks what BM25 with a code-aware tokenizer and a repo-map fails to find — millstone is that question, built carefully enough to be falsified.

The standard retrieval stack of the moment goes: chunk the repo, embed the chunks, stand up an ANN index, rerank, and accept that you now operate a model dependency, an index lifecycle, and a similarity score nobody can explain. Sometimes that stack earns its keep. My claim is narrower than “embeddings are bad”: for code, the cheap lexical baseline is much stronger than its reputation, and most teams adopt the expensive stack without ever measuring the gap. Code is not natural language. Identifiers are deliberate names, not paraphrases — the developer who wrote parse_task_header will search for roughly those words — and the structure embeddings are supposed to recover implicitly is sitting right there in the syntax tree, explicitly, for free.

So millstone takes the unfashionable side and arms it properly.

the apparatus

Three pieces, all clean-room:

Okapi BM25, implemented from the literature and verified against hand-computed values — not against another library’s output, a distinction that cost me a dead end (below).
A code-aware tokenizer. It splits camelCase and snake_case, and Unicode identifiers survive intact. Most of lexical retrieval’s bad reputation on code is actually tokenizer failure: if TaskStore never becomes task and store, BM25 never had a chance.
A tree-sitter repo-map. Symbol table plus cross-file references — the structural signal a flat index lacks, recovered from parse trees instead of approximated by vectors.
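To make the first two pieces concrete, here is a minimal sketch of identifier splitting feeding a BM25 scorer. It is not millstone's actual code: the names tokenize and Bm25, the parameter values k1 = 1.2 and b = 0.75, and the non-negative IDF form ln(1 + (N - n + 0.5) / (n + 0.5)) are all assumptions made for illustration. The splitter is also deliberately simple and leaves acronym runs such as HTTPServer as one term.

```rust
use std::collections::{HashMap, HashSet};

/// Split text into lowercase terms, breaking identifiers at non-alphanumeric
/// characters (including '_') and at camelCase boundaries.
/// Simplification: a run of capitals ("HTTPServer") stays one term.
fn tokenize(text: &str) -> Vec<String> {
    let mut terms = Vec::new();
    let mut current = String::new();
    let mut prev_is_upper = false;
    for ch in text.chars() {
        if ch.is_alphanumeric() {
            // camelCase boundary: an uppercase char after a non-uppercase one
            if ch.is_uppercase() && !prev_is_upper && !current.is_empty() {
                terms.push(std::mem::take(&mut current));
            }
            current.extend(ch.to_lowercase());
            prev_is_upper = ch.is_uppercase();
        } else if !current.is_empty() {
            // any separator ends the current term
            terms.push(std::mem::take(&mut current));
        }
    }
    if !current.is_empty() {
        terms.push(current);
    }
    terms
}

/// Tiny in-memory BM25 index over pre-tokenized documents (illustrative only).
struct Bm25 {
    docs: Vec<Vec<String>>,
    doc_freq: HashMap<String, usize>, // number of docs containing each term
    avg_len: f64,
    k1: f64,
    b: f64,
}

impl Bm25 {
    fn new(docs: Vec<Vec<String>>, k1: f64, b: f64) -> Self {
        let mut doc_freq: HashMap<String, usize> = HashMap::new();
        for doc in &docs {
            let unique: HashSet<&String> = doc.iter().collect();
            for term in unique {
                *doc_freq.entry(term.clone()).or_insert(0) += 1;
            }
        }
        let total: usize = docs.iter().map(|d| d.len()).sum();
        let avg_len = total as f64 / docs.len().max(1) as f64;
        Bm25 { docs, doc_freq, avg_len, k1, b }
    }

    /// Assumed IDF variant: ln(1 + (N - n + 0.5) / (n + 0.5)), never negative.
    fn idf(&self, term: &str) -> f64 {
        let n = self.docs.len() as f64;
        let df = self.doc_freq.get(term).copied().unwrap_or(0) as f64;
        (1.0 + (n - df + 0.5) / (df + 0.5)).ln()
    }

    /// Sum over query terms of idf * tf * (k1 + 1) / (tf + k1 * length_norm).
    fn score(&self, query: &[String], doc_id: usize) -> f64 {
        let doc = &self.docs[doc_id];
        let len_norm = 1.0 - self.b + self.b * doc.len() as f64 / self.avg_len;
        query
            .iter()
            .map(|q| {
                let tf = doc.iter().filter(|t| *t == q).count() as f64;
                self.idf(q) * tf * (self.k1 + 1.0) / (tf + self.k1 * len_norm)
            })
            .sum()
    }
}
```

A check of this sketch by hand: index the two documents TaskStore and parse_task_header, then query store. The term appears in one of two documents, so the IDF is ln 2. The first document has 2 terms against an average of 2.5, so its length normalization is 0.25 + 0.75 × 2 / 2.5 = 0.85, and the score is ln 2 × 2.2 / 2.02 ≈ 0.7549. The second document does not contain the term and scores 0. Without the split, TaskStore would be indexed as a single term and the query would match nothing.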

the discipline

A thesis with a home-team scoreboard is marketing, so the harness is built adversarially. Millstone, tantivy, and SQLite FTS5 run behind one retriever interface over the same corpora: SciFact, and a file-localization task built from SWE-bench-lite. Corpora are fetched by xtask with checksums and never committed. Scoring is nDCG@10, MRR, and Recall@k — the metric implementations are unit-tested against known values, and a Kendall-tau rank-correlation check against tantivy guards the whole rig’s validity.
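For readers who want the three metrics spelled out, the following is a small sketch of how they could be computed for a single query. The function names and the binary-relevance assumption (a document is either relevant or not) are mine, not taken from the millstone harness, and the sketch assumes a ranking contains no duplicate ids.

```rust
use std::collections::HashSet;

/// Reciprocal rank of the first relevant hit; 0.0 if none is retrieved.
/// MRR is the mean of this value over all queries.
fn reciprocal_rank(ranked: &[&str], relevant: &HashSet<&str>) -> f64 {
    ranked
        .iter()
        .position(|id| relevant.contains(id))
        .map_or(0.0, |i| 1.0 / (i as f64 + 1.0))
}

/// Fraction of the relevant documents that appear in the top k.
fn recall_at_k(ranked: &[&str], relevant: &HashSet<&str>, k: usize) -> f64 {
    if relevant.is_empty() {
        return 0.0;
    }
    let hits = ranked
        .iter()
        .take(k)
        .filter(|id| relevant.contains(*id))
        .count();
    hits as f64 / relevant.len() as f64
}

/// nDCG@k with binary relevance: each relevant hit at 1-based rank r adds
/// 1 / log2(r + 1); the total is divided by the DCG of an ideal ranking.
fn ndcg_at_k(ranked: &[&str], relevant: &HashSet<&str>, k: usize) -> f64 {
    let dcg: f64 = ranked
        .iter()
        .take(k)
        .enumerate()
        .filter(|(_, id)| relevant.contains(*id))
        .map(|(i, _)| 1.0 / (i as f64 + 2.0).log2())
        .sum();
    let ideal: f64 = (0..relevant.len().min(k))
        .map(|i| 1.0 / (i as f64 + 2.0).log2())
        .sum();
    if ideal == 0.0 {
        0.0
    } else {
        dcg / ideal
    }
}
```

A known-value check in the spirit of the metric tests: for the ranking a, b, c, d with b and d relevant, the reciprocal rank is 1/2, Recall@3 is 1/2, and nDCG@10 is (1/log2 3 + 1/log2 5) / (1 + 1/log2 3) ≈ 0.651.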

Figure (bench/harness.svg): one harness, three live retrievers, shared metrics; the vector baseline holds a documented empty seat until bench time.

The empty seat matters. The fastembed+HNSW vector baseline is a documented placeholder, deferred to bench time on purpose: wiring a half-tuned vector retriever now would produce exactly the kind of strawman comparison this project exists to complain about. When it runs, it runs properly or the result doesn’t ship.
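One way to picture a shared interface with an empty seat is sketched below. Everything in it is hypothetical: the Retriever trait, the Hit and RetrieveError types, the toy TermOverlap engine, and the run_all helper are illustrations and not millstone's real API. The point is only that a deferred baseline can sit in the lineup and report itself as skipped rather than being scored as a zero.

```rust
/// One ranked hit: document id plus the retriever's own score.
#[derive(Debug, Clone, PartialEq)]
struct Hit {
    doc_id: String,
    score: f64,
}

/// Why a retriever produced no ranking.
#[derive(Debug, PartialEq)]
enum RetrieveError {
    /// The seat exists, but the implementation is deliberately deferred.
    NotWired(&'static str),
}

/// Hypothetical shared interface: every engine answers the same call.
trait Retriever {
    fn name(&self) -> &str;
    fn search(&self, query: &str, k: usize) -> Result<Vec<Hit>, RetrieveError>;
}

/// Toy stand-in for a live lexical engine: scores each document by how many
/// whitespace-separated query words it contains.
struct TermOverlap {
    docs: Vec<(String, String)>, // (doc_id, text)
}

impl Retriever for TermOverlap {
    fn name(&self) -> &str {
        "term-overlap"
    }

    fn search(&self, query: &str, k: usize) -> Result<Vec<Hit>, RetrieveError> {
        let mut hits: Vec<Hit> = self
            .docs
            .iter()
            .map(|(id, text)| {
                let score = query
                    .split_whitespace()
                    .filter(|w| text.split_whitespace().any(|t| t == *w))
                    .count() as f64;
                Hit { doc_id: id.clone(), score }
            })
            .filter(|h| h.score > 0.0)
            .collect();
        // Highest score first; the stable sort keeps input order on ties.
        hits.sort_by(|a, b| b.score.total_cmp(&a.score));
        hits.truncate(k);
        Ok(hits)
    }
}

/// The empty seat: present in the lineup, but it refuses to produce numbers.
struct VectorBaseline;

impl Retriever for VectorBaseline {
    fn name(&self) -> &str {
        "vector-baseline"
    }

    fn search(&self, _query: &str, _k: usize) -> Result<Vec<Hit>, RetrieveError> {
        Err(RetrieveError::NotWired("deferred to bench time"))
    }
}

/// Run one query through every seat. A skipped seat yields None, so it can be
/// reported by name instead of being scored as zero.
fn run_all(
    lineup: &[Box<dyn Retriever>],
    query: &str,
    k: usize,
) -> Vec<(String, Option<Vec<Hit>>)> {
    lineup
        .iter()
        .map(|r| (r.name().to_string(), r.search(query, k).ok()))
        .collect()
}
```

Returning an error from the placeholder, rather than an empty ranking, keeps a missing baseline from being mistaken for a baseline that found nothing.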

results

bench/retrieval-quality: pending

millstone vs tantivy vs SQLite FTS5 (vector baseline to follow): nDCG@10, MRR, Recall@k over SciFact and SWE-bench-lite file-localization. The harness, the metric tests, and the corpus fetchers are done and green; numbers land here when real runs do. If the thesis is wrong, this table is where it will say so.

tests: 35 (incl. metric correctness)
retrievers wired: 3 (millstone · tantivy · FTS5)
corpora: 2 (SciFact · SWE-bench-lite)
metrics: 3 (nDCG@10 · MRR · Recall@k)

If the vector baseline wins at bench time, the thesis is falsified in public, and that outcome would be worth shipping too. The point was never that embeddings lose; it is that nobody should pay for them on reputation alone.