redcell — Sam Latino

An agent that passes a manual red-team review on Tuesday can leak its system prompt on Wednesday — redcell turns “resists prompt injection” into a number you can regress against.

redcell is a defensive test harness. It runs a fixed corpus of attack probes against an agent, scores every transcript with deterministic oracles, and reports attack success rate and severity-weighted risk per hardening level. Nothing in the scoring path depends on one model’s opinion of another model’s output.

scope

Corpus: 146 probe cases mapped to the OWASP Top 10 for Agentic Applications (2026), covering five categories: ASI01, ASI02, ASI03, ASI05, ASI06.
Target: an in-repo, sandboxed agent with three hardening levels — none, basic, hardened. The harness ships the thing it attacks, so every number below reproduces from a clone.
Rules of engagement: authorized testing only. redcell measures defenses you own. It is not an attack kit, and the corpus assumes a target you are permitted to break.

method

docs/architecture.svg

probe corpus → runner → sandboxed target → deterministic oracles → report; the optional judge path degrades to needs-review

Each probe is a scripted interaction with an injection payload and a known ground truth. The runner drives it against the target, captures the full transcript, and hands it to three oracle families:

Canary exfiltration. Secrets are seeded into the target’s context as canaries. The oracle flags a leak whether it surfaces raw, base64-encoded, hex-encoded, or as a recognizable partial fragment.
Forbidden-tool invocation. Any call to a tool the case’s policy forbids is a finding, regardless of how politely the model narrated it.
Regex / JSON-path predicates. Case-specific assertions over the transcript structure for everything the first two don’t cover.

An optional local-model judge can annotate transcripts the oracles cannot classify. When no endpoint is configured, judge-dependent cases degrade to needs-review — a visible state in the report, never a silent pass.

findings

The flagship run sweeps the same 146-probe corpus across all three hardening levels of the in-repo target. These numbers are harness-produced, not estimated:

runs/hardening-sweep 3 rows

same corpus, same oracles, three target configurations
hardening level	attack success rate	severity-weighted risk
none	73%	84.3
basic	3%	2.4
hardened	0%	0.0

harness-produced: redcell runner over the 146-case corpus against the in-repo sandboxed target; deterministic oracles only, no judge in the loop

Read it honestly. The target is the in-repo sandbox, not your production agent, and 0% against this corpus does not mean safe — it means this corpus is exhausted. The result that matters is the slope: identical probes, identical oracles, three configurations, and a measurable answer to “did the mitigation move anything.” Re-run the sweep after a prompt change and you know whether you regressed, in minutes, without convening a review.

scoring integrity

A harness that mis-scores is worse than no harness — it converts vulnerability into confidence. So the rig itself is under test: scripted fake agents with known ground truth, one deliberately vulnerable and one deliberately safe, run through the full pipeline, and the suite asserts the scores land exactly where they must. The 132 tests cover the oracles, the sandbox, and that rig-correctness loop.

probes: 146 OWASP agentic, 5 categories
hardening levels: 3 none · basic · hardened
oracle families: 3 all deterministic
tests: 132 incl. rig-correctness