redcell
Defensive agent-robustness harness mapped to the OWASP Top 10 for Agentic Applications — deterministic oracles, a 146-case probe corpus, and a sandboxed three-level target.
status: active · 132 tests
sam@latino:~$ whoami
6 open-source projects · 524 tests · Rust · Python · TypeScript
Defensive agent-robustness harness mapped to the OWASP Top 10 for Agentic Applications — deterministic oracles, a 146-case probe corpus, and a sandboxed three-level target.
status: active · 132 tests
Early-adopter MCP 2026-07-28 release-candidate server in Rust — stateless core, the Tasks extension, and statelessness proven by round-robining one client across two server instances.
status: active · 67 tests
Single-binary OpenAI-compatible gateway where privacy routing is enforced in the type system — a private request cannot select an external backend, by construction.
status: active · 13 tests
Clean-room BM25 + tree-sitter repo-map retrieval crate, with a bench harness against tantivy and SQLite FTS5 — the "you probably don't need embeddings (yet)" thesis.
status: active · 35 tests
Tool-calling and structured-output conformance matrix for vLLM-served open models, with an 11-label failure taxonomy and a mock server that proves the scorer itself is correct.
status: active · 165 tests
Regression-eval CI gate — embedding-free scorers, drift detection with a k-repeat noise floor, and a sticky PR comment via a GitHub Action. Dogfooded by every other repo here.
status: active · 112 tests
A crossover framework for lexical versus vector retrieval on code — and the adversarial bench harness I built so my own argument can lose.
Tool calling is the load-bearing primitive of every agent stack, and open models break it in at least eleven distinguishable ways. Naming the failure modes changes how you build the layer above.
Turning "resists prompt injection" into a regression number: a deterministic harness, 146 probes across five OWASP agentic categories, and a hardening sweep that went 73% → 3% → 0%.