sam@latino:~$

sam@latino:~/writing$ ls -lt

writing

Notes from building agent systems, evals, and self-hosted LLM infrastructure. Every post is backed by a repo you can clone and a case study you can check it against.

3 posts

retrieval · bm25 · benchmarks

BM25 beat my vector database (sometimes)

A crossover framework for lexical versus vector retrieval on code — and the adversarial bench harness I built so my own argument can lose.

tool calling · evals · vllm · gateways

Every model fails tool calling differently

Tool calling is the load-bearing primitive of every agent stack, and open models break it in at least eleven distinguishable ways. Naming the failure modes changes how you build the layer above.

agent security · owasp · prompt injection · evals

Red-teaming my own agents with the OWASP Agentic Top 10

Turning "resists prompt injection" into a regression number: a deterministic harness, 146 probes across five OWASP agentic categories, and a hardening sweep that went 73% → 3% → 0%.