sam@latino:~/projects/millstone$ cat README.md
millstone
Clean-room BM25 + tree-sitter repo-map retrieval crate, with a bench harness against tantivy and SQLite FTS5 — the "you probably don't need embeddings (yet)" thesis.
lang: Rust · status: active · tests: 35 · repo: github.com/slatino-dev/millstone
- [BM25]
- [lexical retrieval]
- [tree-sitter]
- [repo-map]
- [benchmarking]
Embedding pipelines have become the default answer to code retrieval before anyone asks what BM25 with a code-aware tokenizer and a repo-map fails to find — millstone is that question, built carefully enough to be falsified.
The standard retrieval stack of the moment goes: chunk the repo, embed the
chunks, stand up an ANN index, rerank, and accept that you now operate a model
dependency, an index lifecycle, and a similarity score nobody can explain.
Sometimes that stack earns its keep. My claim is narrower than “embeddings are
bad”: for code, the cheap lexical baseline is much stronger than its
reputation, and most teams adopt the expensive stack without ever measuring
the gap. Code is not natural language. Identifiers are deliberate names, not
paraphrases — the developer who wrote parse_task_header will search for
roughly those words — and the structure embeddings are supposed to recover
implicitly is sitting right there in the syntax tree, explicitly, for free.
So millstone takes the unfashionable side and arms it properly.
the apparatus
Three pieces, all clean-room:
- Okapi BM25, implemented from the literature and verified against hand-computed values — not against another library’s output, a distinction that cost me a dead end (below).
- A code-aware tokenizer. camelCase splits, snake_case splits, unicode identifiers survive. Most of lexical retrieval’s bad reputation on code is actually tokenizer failure: if TaskStore never becomes task and store, BM25 never had a chance.
- A tree-sitter repo-map. Symbol table plus cross-file references: the structural signal a flat index lacks, recovered from parse trees instead of approximated by vectors.
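To make the first two pieces concrete, here is a minimal sketch, not millstone's actual code. The names `split_identifier` and `bm25_term_score` are hypothetical, the split rules are one reasonable reading of "camelCase splits, snake_case splits", and the smoothed IDF is an assumption (the classic Okapi form omits the `1 +`).

```rust
/// Split a source identifier into lowercase word tokens (hypothetical helper).
/// Boundaries: any non-alphanumeric char (covers `_`), a lowercase-or-digit to
/// uppercase transition (`taskStore`), and the end of an acronym run (`HTTPServer`).
pub fn split_identifier(ident: &str) -> Vec<String> {
    let chars: Vec<char> = ident.chars().collect();
    let mut tokens = Vec::new();
    let mut current = String::new();

    for (i, &c) in chars.iter().enumerate() {
        if !c.is_alphanumeric() {
            // A separator such as `_`, `-`, or `.` ends the current word.
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
            continue;
        }
        if c.is_uppercase() && !current.is_empty() {
            // `current` is non-empty, so the previous char exists and is alphanumeric.
            let prev = chars[i - 1];
            let next_is_lower = chars.get(i + 1).map_or(false, |n| n.is_lowercase());
            // Break on a camel hump (aB, 2B) and before the last capital of an acronym.
            if prev.is_lowercase() || prev.is_numeric() || (prev.is_uppercase() && next_is_lower) {
                tokens.push(std::mem::take(&mut current));
            }
        }
        // `to_lowercase` returns an iterator: some chars lowercase to several chars.
        current.extend(c.to_lowercase());
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}

/// BM25 contribution of one query term to one document's score.
/// ASSUMPTION: smoothed IDF ln(1 + (N - df + 0.5) / (df + 0.5)), which is never negative.
pub fn bm25_term_score(
    tf: f64,       // occurrences of the term in the document
    df: f64,       // number of documents containing the term
    n_docs: f64,   // corpus size N
    doc_len: f64,  // document length in tokens
    avg_len: f64,  // mean document length in tokens
    k1: f64,       // term-frequency saturation, commonly 1.2
    b: f64,        // length-normalization strength, commonly 0.75
) -> f64 {
    let idf = (1.0 + (n_docs - df + 0.5) / (df + 0.5)).ln();
    let norm = k1 * (1.0 - b + b * doc_len / avg_len);
    idf * tf * (k1 + 1.0) / (tf + norm)
}
```

The hand-computed check is cheap. With tf = 1 and a document of exactly average length, the length normalization cancels and the score reduces to the IDF, so N = 3 and df = 1 must give ln(8/3) ≈ 0.981 whatever k1 and b are.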
the discipline
A thesis with a home-team scoreboard is marketing, so the harness is built
adversarially. Millstone, tantivy, and SQLite FTS5 run behind one retriever
interface over the same corpora: SciFact, and a file-localization task built
from SWE-bench-lite. Corpora are fetched by xtask with checksums and never
committed. Scoring is nDCG@10, MRR, and Recall@k — the metric implementations
are unit-tested against known values, and a Kendall-tau rank-correlation check
against tantivy guards the whole rig’s validity.
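The three metrics are small enough to show in full. This is a sketch under stated assumptions, not the harness's code: the function names are hypothetical, relevance is binary, and each ranked list is assumed to contain no duplicate document IDs.

```rust
use std::collections::HashSet;

/// Reciprocal rank of the first relevant document, or 0.0 if none is retrieved.
/// MRR is the mean of this value over all queries.
pub fn reciprocal_rank(ranked: &[&str], relevant: &HashSet<&str>) -> f64 {
    ranked
        .iter()
        .position(|doc| relevant.contains(doc))
        .map_or(0.0, |i| 1.0 / (i as f64 + 1.0))
}

/// Fraction of the relevant set that appears in the top `k` results.
pub fn recall_at_k(ranked: &[&str], relevant: &HashSet<&str>, k: usize) -> f64 {
    if relevant.is_empty() {
        return 0.0;
    }
    let hits = ranked
        .iter()
        .take(k)
        .filter(|doc| relevant.contains(*doc))
        .count();
    hits as f64 / relevant.len() as f64
}

/// nDCG@k with binary relevance (ASSUMPTION: no graded judgments).
/// A hit at 1-based rank r contributes 1 / log2(r + 1); the total is divided by
/// the DCG of an ideal ranking that lists every relevant document first.
pub fn ndcg_at_k(ranked: &[&str], relevant: &HashSet<&str>, k: usize) -> f64 {
    let dcg: f64 = ranked
        .iter()
        .take(k)
        .enumerate()
        .filter(|(_, doc)| relevant.contains(*doc))
        .map(|(i, _)| 1.0 / (i as f64 + 2.0).log2()) // i is 0-based, so rank + 1 = i + 2
        .sum();
    let ideal: f64 = (0..relevant.len().min(k))
        .map(|i| 1.0 / (i as f64 + 2.0).log2())
        .sum();
    if ideal == 0.0 { 0.0 } else { dcg / ideal }
}
```

A known-value case of the kind the unit tests rely on: for the ranking a, b, c, d with relevant set {b, d}, the reciprocal rank is 1/2, Recall@2 is 1/2, and nDCG@10 is (1/log2 3 + 1/log2 5) / (1 + 1/log2 3) ≈ 0.651.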
The empty seat matters. The fastembed+HNSW vector baseline is a documented placeholder, deferred to bench time on purpose: wiring a half-tuned vector retriever now would produce exactly the kind of strawman comparison this project exists to complain about. When it runs, it runs properly or the result doesn’t ship.
results
millstone vs tantivy vs SQLite FTS5 (vector baseline to follow): nDCG@10, MRR, Recall@k over SciFact and SWE-bench-lite file-localization. The harness, the metric tests, and the corpus fetchers are done and green; numbers land here when real runs do. If the thesis is wrong, this table is where it will say so.
- tests: 35, incl. metric correctness
- retrievers wired: 3 (millstone · tantivy · FTS5)
- corpora: 2 (SciFact · SWE-bench-lite)
- metrics: 3 (nDCG@10 · MRR · Recall@k)
If the vector baseline wins at bench time, the thesis is falsified in public, and the results above will say so. That outcome would be worth shipping too: the point was never that embeddings lose, but that nobody should pay for them on reputation alone.