sam@latino:~/projects/callcheck$ cat README.md
callcheck
Tool-calling and structured-output conformance matrix for vLLM-served open models, with an 11-label failure taxonomy and a mock server that proves the scorer itself is correct.
lang: Python · status: active · tests: 165 · repo: github.com/slatino-dev/callcheck
- [tool calling]
- [structured output]
- [vLLM]
- [conformance testing]
- [failure taxonomy]
“Supports tool calling” on a model card and “emits a parseable, schema-valid tool call on the wire” are different claims — callcheck measures the second one, per model, per task shape.
Lab notebook follows. Terse on purpose; the transcripts argue better than I do.
apparatus
- tasks — YAML, roughly 40–50 definitions. Single call, parallel calls, nested objects, enums, unicode arguments, int64 boundaries, $ref schemas. The edge cases are the corpus; the happy path is one row.
- client — OpenAI-compatible, aimed at vLLM-served open models.
- runner — k=3 per matrix cell. Resumable mid-matrix: finished cells are never re-run, half-finished ones are.
- checkers — three stages, strictly ordered: parse, then jsonschema validate, then semantic predicates. A failure stops at the first stage it cannot pass. The order is the diagnosis.
- report — per-model tables plus a failure-transcript gallery. The gallery is the point. A label without its transcript invites arguing; a label with one ends it.
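A minimal sketch of the ordered checker stages, to make "the order is the diagnosis" concrete. Everything named here is hypothetical (`check_arguments`, the `schema_errors` hook, the stand-in validator); the repo's real checkers are not shown in this README. The schema stage is injected as a callable so the sketch runs on the standard library alone.

```python
import json
from typing import Any, Callable, Iterable, Optional, Tuple

# A semantic predicate is a (name, callable) pair; the name is reported on failure.
Predicate = Tuple[str, Callable[[Any], bool]]


def check_arguments(
    raw: str,
    schema_errors: Callable[[Any], Iterable[str]],
    predicates: Iterable[Predicate] = (),
) -> Tuple[str, Optional[str]]:
    """Return (failed_stage, detail), or ("ok", None) if every stage passes."""
    # Stage 1: parse. Nothing downstream is meaningful if this fails.
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        return "parse", exc.msg

    # Stage 2: schema validation. With the jsonschema package this hook could be
    #   lambda a: [e.message for e in Draft202012Validator(schema).iter_errors(a)]
    errors = list(schema_errors(args))
    if errors:
        return "schema", errors[0]

    # Stage 3: semantic predicates, only reached by schema-valid arguments.
    for name, holds in predicates:
        if not holds(args):
            return "semantic", name

    return "ok", None


def requires_int_count(args: Any) -> list:
    """Stand-in validator (an assumption, not jsonschema): 'count' must be an int."""
    if not isinstance(args, dict) or "count" not in args:
        return ["'count' is a required property"]
    if isinstance(args["count"], bool) or not isinstance(args["count"], int):
        return ["'count' is not of type 'integer'"]
    return []


# Hypothetical semantic predicate for an illustrative task.
POSITIVE = [("count_is_positive", lambda a: a["count"] > 0)]
```

Because each stage returns early, the predicate in stage 3 can index `a["count"]` without guarding: it never sees arguments that failed to parse or validate.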
procedure
taxonomy
Every failure lands in exactly one of eleven labels. Mutually exclusive, or the matrix cells stop being comparable.
no_call             answered in prose, called nothing
wrong_tool          called a tool, the wrong one
malformed_json      arguments failed to parse at all
schema_violation    parsed, failed jsonschema
hallucinated_param  argument that isn't in the schema
missing_required    required argument absent
type_coercion       right field, wrong type ("42" for 42)
escaping_error      quoting/escaping mangled the payload
truncation          call cut off mid-emission
parallel_collapse   N requested calls emitted as one
spurious_call       called a tool when none was needed
controls
The scorer is itself under test. The in-repo mockserver replays known tool-call outputs — well-formed and deliberately malformed — and CI asserts the checkers assign exactly the expected labels. A conformance harness whose own scorer is unverified is just opinion with YAML. 165 tests, most of them this.
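A sketch of what one such control could look like: canned responses paired with the label the scorer must assign, and a check that nothing is mislabeled. This is illustrative only. The `label` function, the fixture shape, the `get_weather` tool, and the simplified schema are all assumptions, and it covers six of the eleven labels. The fixed precedence of the checks is one way to keep labels mutually exclusive; the real ordering may differ.

```python
import json

EXPECTED_TOOL = "get_weather"  # hypothetical tool for this sketch
# Simplified stand-in for a parameter schema (not real JSON Schema).
REQUIRED = ["city"]
PROPERTIES = {"city": str, "days": int}


def label(response: dict) -> str:
    """Assign exactly one label, or "pass", to a replayed response."""
    calls = response.get("tool_calls") or []
    if not calls:
        return "no_call"
    call = calls[0]
    if call["name"] != EXPECTED_TOOL:
        return "wrong_tool"
    try:
        args = json.loads(call["arguments"])
    except json.JSONDecodeError:
        return "malformed_json"
    if not isinstance(args, dict):
        return "schema_violation"
    # First matching check wins, so a response never carries two labels.
    if any(key not in args for key in REQUIRED):
        return "missing_required"
    if any(key not in PROPERTIES for key in args):
        return "hallucinated_param"
    if any(not isinstance(value, PROPERTIES[key]) for key, value in args.items()):
        return "type_coercion"
    return "pass"


def call(name: str, arguments: str) -> dict:
    """Build a canned response holding a single tool call."""
    return {"tool_calls": [{"name": name, "arguments": arguments}]}


# (replayed response, label the scorer must assign)
FIXTURES = [
    ({"content": "It is sunny.", "tool_calls": []}, "no_call"),
    (call("get_time", "{}"), "wrong_tool"),
    (call("get_weather", '{"city": Oslo}'), "malformed_json"),
    (call("get_weather", '{"days": 2}'), "missing_required"),
    (call("get_weather", '{"city": "Oslo", "units": "C"}'), "hallucinated_param"),
    (call("get_weather", '{"city": "Oslo", "days": "2"}'), "type_coercion"),
    (call("get_weather", '{"city": "Oslo", "days": 2}'), "pass"),
]


def mislabeled() -> list:
    """Return (expected, got) for every fixture the scorer gets wrong."""
    return [(want, label(resp)) for resp, want in FIXTURES if label(resp) != want]
```

A CI check in this style reduces to asserting that `mislabeled()` is empty; any non-empty result names the expected and actual label directly.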
- tests: 165, scorer-correctness heavy
- failure labels: 11, mutually exclusive
- task definitions: 40–50 YAML, edge-case heavy
- repeats: k=3, resumable per cell
status
Matrix cells are model × task family × k=3. Runner, checkers, taxonomy, and mockserver controls: done and green. Conformance numbers for real vLLM-served models land here once the runs execute on local GPUs. No cell gets filled in by hand.
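A sketch of the cell bookkeeping behind a resumable matrix, under stated assumptions: results live in an in-memory dict keyed by (model, task family), a cell counts as finished once it holds k results, and a half-finished cell is re-run from scratch rather than topped up. The function names, the `run_once` callback, and the re-run-from-scratch choice are mine; persistence to disk is left out.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Cell = Tuple[str, str]  # (model, task family)
K = 3                   # repeats per cell


def pending_cells(
    models: Iterable[str],
    families: Iterable[str],
    results: Dict[Cell, List[str]],
    k: int = K,
) -> Iterator[Cell]:
    """Yield cells that have fewer than k recorded results."""
    for cell in product(models, families):
        if len(results.get(cell, [])) < k:
            yield cell


def run_matrix(
    models: Iterable[str],
    families: Iterable[str],
    results: Dict[Cell, List[str]],
    run_once: Callable[[str, str, int], str],
    k: int = K,
) -> Dict[Cell, List[str]]:
    """Fill every pending cell with k fresh results; finished cells are untouched."""
    for model, family in list(pending_cells(models, families, results, k)):
        # Assumption: partial results are discarded and the whole cell is re-run.
        results[(model, family)] = [run_once(model, family, i) for i in range(k)]
    return results
```

Usage would look like `run_matrix(["model-a"], ["single_call", "parallel_calls"], saved_results, run_once)`, where `saved_results` is whatever the previous, interrupted run left behind and the model name is a placeholder.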