sam@latino:~$

sam@latino:~/projects/callcheck$ cat README.md

callcheck

Tool-calling and structured-output conformance matrix for vLLM-served open models, with an 11-label failure taxonomy and a mock server that proves the scorer itself is correct.

lang Python status active tests  165 repo  github.com/slatino-dev/callcheck

  • [tool calling]
  • [structured output]
  • [vLLM]
  • [conformance testing]
  • [failure taxonomy]

“Supports tool calling” on a model card and “emits a parseable, schema-valid tool call on the wire” are different claims — callcheck measures the second one, per model, per task shape.

Lab notebook follows. Terse on purpose; the transcripts argue better than I do.

apparatus

  • tasks — YAML, roughly 40–50 definitions. Single call, parallel calls, nested objects, enums, unicode arguments, int64 boundaries, $ref schemas. The edge cases are the corpus; the happy path is one row.
  • client — OpenAI-compatible, aimed at vLLM-served open models.
  • runner — k=3 per matrix cell. Resumable mid-matrix: finished cells are never re-run, half-finished ones are.
  • checkers — three stages, strictly ordered: parse, then jsonschema validate, then semantic predicates. A failure stops at the first stage it cannot pass. The order is the diagnosis.
  • report — per-model tables plus a failure-transcript gallery. The gallery is the point. A label without its transcript invites arguing; a label with one ends it.

procedure

docs/pipeline.svg
tasks → runner → model under test → three-stage checkers → labeled report; the mockserver run is the control that proves the scorer

taxonomy

Every failure lands in exactly one of eleven labels. Mutually exclusive, or the matrix cells stop being comparable.

taxonomy — 11 labels
no_call             answered in prose, called nothing
wrong_tool          called a tool, the wrong one
malformed_json      arguments failed to parse at all
schema_violation    parsed, failed jsonschema
hallucinated_param  argument that isn't in the schema
missing_required    required argument absent
type_coercion       right field, wrong type ("42" for 42)
escaping_error      quoting/escaping mangled the payload
truncation          call cut off mid-emission
parallel_collapse   N requested calls emitted as one
spurious_call       called a tool when none was needed
label names are the project's stable vocabulary; glosses paraphrase the checker logic

controls

The scorer is itself under test. The in-repo mockserver replays known tool-call outputs — well-formed and deliberately malformed — and CI asserts the checkers assign exactly the expected labels. A conformance harness whose own scorer is unverified is just opinion with YAML. 165 tests, most of them this.

tests
165 scorer correctness heavy
failure labels
11 mutually exclusive
task definitions
40–50 YAML, edge-case heavy
repeats
k=3 resumable per cell

status

results/matrix pending

Matrix cells are model × task family × k=3. Runner, checkers, taxonomy, and mockserver controls: done and green. Conformance numbers for real vLLM-served models land here once the runs execute on local GPUs. No cell gets filled in by hand.