callcheck — Sam Latino

“Supports tool calling” on a model card and “emits a parseable, schema-valid tool call on the wire” are different claims — callcheck measures the second one, per model, per task shape.

Lab notebook follows. Terse on purpose; the transcripts argue better than I do.

apparatus

tasks — YAML, roughly 40–50 definitions. Single call, parallel calls, nested objects, enums, unicode arguments, int64 boundaries, $ref schemas. The edge cases are the corpus; the happy path is one row.
client — OpenAI-compatible, aimed at vLLM-served open models.
runner — k=3 per matrix cell. Resumable mid-matrix: finished cells are never re-run, half-finished ones are.
checkers — three stages, strictly ordered: parse, then jsonschema validate, then semantic predicates. A failure stops at the first stage it cannot pass. The order is the diagnosis.
report — per-model tables plus a failure-transcript gallery. The gallery is the point. A label without its transcript invites arguing; a label with one ends it.

procedure

docs/pipeline.svg

tasks → runner → model under test → three-stage checkers → labeled report; the mockserver run is the control that proves the scorer

taxonomy

Every failure lands in exactly one of eleven labels. Mutually exclusive, or the matrix cells stop being comparable.

taxonomy — 11 labels

no_call             answered in prose, called nothing
wrong_tool          called a tool, the wrong one
malformed_json      arguments failed to parse at all
schema_violation    parsed, failed jsonschema
hallucinated_param  argument that isn't in the schema
missing_required    required argument absent
type_coercion       right field, wrong type ("42" for 42)
escaping_error      quoting/escaping mangled the payload
truncation          call cut off mid-emission
parallel_collapse   N requested calls emitted as one
spurious_call       called a tool when none was needed

label names are the project's stable vocabulary; glosses paraphrase the checker logic

controls

The scorer is itself under test. The in-repo mockserver replays known tool-call outputs — well-formed and deliberately malformed — and CI asserts the checkers assign exactly the expected labels. A conformance harness whose own scorer is unverified is just opinion with YAML. 165 tests, most of them this.

tests: 165 scorer correctness heavy
failure labels: 11 mutually exclusive
task definitions: 40–50 YAML, edge-case heavy
repeats: k=3 resumable per cell

status

results/matrix pending

Matrix cells are model × task family × k=3. Runner, checkers, taxonomy, and mockserver controls: done and green. Conformance numbers for real vLLM-served models land here once the runs execute on local GPUs. No cell gets filled in by hand.