zig_eval is a registry-driven evaluation library for testing LLM products,
agent runtimes, and OpenAI-compatible chat APIs from Zig.
It is not tied to a single application. A team can point zig_eval at any
product that exposes a chat-completions-style endpoint, define datasets as
JSONL, group evals by capability, and compare behavior across services or
models.
- Load service definitions from JSON.
- Load grouped eval definitions from a registry directory.
- Load eval cases from JSONL datasets.
- Evaluate exact-match, includes, and required JSON field checks.
- Evaluate model-graded checks through a configured LLM judge service.
- Validate single-turn OpenAI-compatible tool calls and root-level arguments.
- Run file-backed multimodal eval cases with built-in image and text-file rendering.
- Call an OpenAI-compatible
POST /v1/chat/completionsendpoint. - Support authenticated and unauthenticated product endpoints.
- Run grouped eval definitions across configured services and datasets.
- Aggregate results into plain-text and JSON report artifacts.
- Report Wilson 95% pass-rate confidence intervals and baseline comparison statistics.
- Run evals from the
zig_evalCLI with service, group, eval, run-count, and output-format filters. - Run evals with bounded parallelism, per-service throttling, and retry metadata in reports.
zig_eval loads services, eval definitions, and JSONL datasets from a registry,
then the runner sends each eval case to each selected service. Raw run results
can be aggregated into readable text reports or machine-readable JSON artifacts.
See docs/usage.md for a complete library usage example. See docs/reference.md for registry and report field definitions.
List the example registry:
zig build run -- list --registry examples/registryRun one eval against one configured service:
zig build run -- run --registry examples/registry --service local-product --eval smoke.reply_okUse JSON report output:
zig build run -- run --registry examples/registry --service local-product --format jsonRun with bounded parallel workers:
zig build run -- run --registry examples/registry --parallel 4 --max-inflight-per-service 2Run a model-graded eval with the configured judge service:
zig build run -- run --registry examples/registry --service local-product --eval quality.helpful_summary --judge-service judgeRun a single-turn tool-calling eval:
zig build run -- run --registry examples/registry --service local-product --eval tools.search_webRun a file-backed multimodal eval:
zig build run -- run --registry examples/registry --service local-product --eval multimodal.release_notesrun requires the selected service endpoint to be reachable.
The repository includes a starter registry under examples/registry.
registry
├── services.json
├── evals
│ └── <group>
│ └── <eval>.json
└── data
└── <group>
└── <eval>
└── <split>.jsonl
registry/services.json is an array of service definitions. Use api_key_env
for Bearer-token authentication, or omit it for internal products that do not
need auth.
[
{
"name": "product-staging",
"base_url": "https://product.example.com/v1/chat/completions",
"api_key_env": "PRODUCT_STAGING_API_KEY",
"default_model": "product-model",
"provider": "product",
"system_prompt": "Answer exactly according to the eval instructions.",
"timeout_ms": 30000
},
{
"name": "local-product",
"base_url": "http://127.0.0.1:9000/v1/chat/completions",
"default_model": "local-model",
"timeout_ms": 15000
}
]Each eval definition points to a dataset and one matcher.
{
"id": "smoke.reply_ok",
"group": "smoke",
"description": "Checks that the service can return a simple literal answer.",
"dataset_path": "data/smoke/reply_ok/test.jsonl",
"split": "test",
"matcher": {
"kind": "exact_match",
"case_sensitive": true,
"trim_whitespace": true
},
"default_run_count": 3,
"service_allowlist": ["product-staging", "local-product"]
}Datasets are JSONL files. Each line is one eval case.
{"id":"case-1","input":"Reply with exactly OK.","ideal":"OK"}
{"id":"case-2","input":"Reply with exactly READY.","ideal":"READY"}Use examples/registry as a copyable starting point for a product eval suite.
It includes:
- one unauthenticated local product service
- one authenticated staging product service
- one judge service for model-graded evals
- one smoke eval using
exact_match - one structured-output eval using
json_fields - one quality eval using
model_grade - one tool eval using
tool_call - two multimodal evals using text and image attachments
Use model_grade when an answer cannot be checked with a simple exact,
contains, or JSON-field rule. The product service generates the candidate
answer, then a separate judge service scores it against a rubric and returns
JSON with score, passed, and reason.
Model-graded evals are useful for quality checks such as helpfulness, correctness, completeness, summarization quality, or instruction following. They are more flexible than deterministic matchers, but they cost an extra model call and depend on the quality of the judge rubric.
Use tool_call to check whether a product chooses the expected OpenAI-style
tool and sends the expected root-level argument values. Eval definitions provide
the tool schema, and dataset cases provide expected tool calls.
V3 validates tool selection and arguments only. It does not execute tools, simulate tool results, or run a multi-turn agent loop.
Dataset cases may include file attachments. The default OpenAI-compatible client renders PNG, JPEG, and WebP images as image content blocks, and appends UTF-8 text-like files as labeled prompt context. PDFs, audio, video, archives, and arbitrary binaries are available to custom adapters but are not rendered by the default client.
Attachment paths are resolved inside the registry root and are limited to 5 MB per file.
Aggregate reports include Wilson 95% confidence intervals for pass rate at the eval, service, and model levels. Library users can also compare services or models against a baseline to get descriptive pass-rate deltas with confidence intervals.
Services may define retry behavior:
{
"retry": {
"max_attempts": 3,
"backoff_ms": 500,
"retry_on_status": [429, 500, 502, 503, 504]
}
}The current focus is a small, stable core:
- OpenAI-compatible chat execution.
- Product-neutral service configuration.
- Capability-based eval grouping.
- Deterministic matchers plus V2 model-graded judging.
- V3 single-turn tool-call validation.
- V4 file-backed multimodal evals and confidence interval reporting.
- Clear ownership of allocated data in public APIs.