scalabs/zig_eval

zig_eval

zig_eval is a registry-driven evaluation library for testing LLM products, agent runtimes, and OpenAI-compatible chat APIs from Zig.

It is not tied to a single application. A team can point zig_eval at any product that exposes a chat-completions-style endpoint, define datasets as JSONL, group evals by capability, and compare behavior across services or models.

Current Capabilities

  • Load service definitions from JSON.
  • Load grouped eval definitions from a registry directory.
  • Load eval cases from JSONL datasets.
  • Evaluate exact-match, includes, and required JSON field checks.
  • Evaluate model-graded checks through a configured LLM judge service.
  • Validate single-turn OpenAI-compatible tool calls and root-level arguments.
  • Run file-backed multimodal eval cases with built-in image and text-file rendering.
  • Call an OpenAI-compatible POST /v1/chat/completions endpoint.
  • Support authenticated and unauthenticated product endpoints.
  • Run grouped eval definitions across configured services and datasets.
  • Aggregate results into plain-text and JSON report artifacts.
  • Report Wilson 95% pass-rate confidence intervals and baseline comparison statistics.
  • Run evals from the zig_eval CLI with service, group, eval, run-count, and output-format filters.
  • Run evals with bounded parallelism, per-service throttling, and retry metadata in reports.

How It Works

zig_eval loads services, eval definitions, and JSONL datasets from a registry. The runner then sends each eval case to each selected service. Raw run results can be aggregated into readable text reports or machine-readable JSON artifacts.

See docs/usage.md for a complete library usage example. See docs/reference.md for registry and report field definitions.
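The fan-out described above can be pictured with a short sketch. It is written in Python purely as an illustration and is not the zig_eval API: the function name `plan_runs` and the tuple layout are hypothetical, and it assumes that `service_allowlist` filters services and that `default_run_count` repeats each case.

```python
def plan_runs(evals, services, cases_by_eval):
    """Expand evals x services x cases x repeats into a flat work list.

    Hypothetical helper: names and semantics are assumptions for illustration.
    """
    plan = []
    for ev in evals:
        allow = ev.get("service_allowlist")  # assumed: absent means "all services"
        for svc in services:
            if allow is not None and svc["name"] not in allow:
                continue
            for case in cases_by_eval[ev["id"]]:
                for run_index in range(ev.get("default_run_count", 1)):
                    plan.append((ev["id"], svc["name"], case["id"], run_index))
    return plan
```

Each tuple in the returned list corresponds to one request that a runner would send to a service.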

Quick Start

List the example registry:

zig build run -- list --registry examples/registry

Run one eval against one configured service:

zig build run -- run --registry examples/registry --service local-product --eval smoke.reply_ok

Use JSON report output:

zig build run -- run --registry examples/registry --service local-product --format json

Run with bounded parallel workers:

zig build run -- run --registry examples/registry --parallel 4 --max-inflight-per-service 2

Run a model-graded eval with the configured judge service:

zig build run -- run --registry examples/registry --service local-product --eval quality.helpful_summary --judge-service judge

Run a single-turn tool-calling eval:

zig build run -- run --registry examples/registry --service local-product --eval tools.search_web

Run a file-backed multimodal eval:

zig build run -- run --registry examples/registry --service local-product --eval multimodal.release_notes

The run command requires the selected service endpoint to be reachable.

Registry Layout

The repository includes a starter registry under examples/registry.

registry
├── services.json
├── evals
│   └── <group>
│       └── <eval>.json
└── data
    └── <group>
        └── <eval>
            └── <split>.jsonl

Service Configuration

registry/services.json is an array of service definitions. Use api_key_env for Bearer-token authentication, or omit it for internal products that do not need auth.

[
  {
    "name": "product-staging",
    "base_url": "https://product.example.com/v1/chat/completions",
    "api_key_env": "PRODUCT_STAGING_API_KEY",
    "default_model": "product-model",
    "provider": "product",
    "system_prompt": "Answer exactly according to the eval instructions.",
    "timeout_ms": 30000
  },
  {
    "name": "local-product",
    "base_url": "http://127.0.0.1:9000/v1/chat/completions",
    "default_model": "local-model",
    "timeout_ms": 15000
  }
]
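As an illustration of how api_key_env could translate into request headers, here is a minimal Python sketch. It is not zig_eval's implementation: `build_headers` is a hypothetical name, and raising an error when the environment variable is unset is an assumption.

```python
import os


def build_headers(service, env=os.environ):
    """Build HTTP headers for one service definition (hypothetical helper)."""
    headers = {"Content-Type": "application/json"}
    key_env = service.get("api_key_env")
    if key_env:
        token = env.get(key_env)
        if token is None:
            # Assumed behavior: fail loudly rather than send an unauthenticated request.
            raise KeyError(f"environment variable {key_env} is not set")
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

A service without api_key_env, such as local-product above, would get only the content-type header.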

Eval Definition

Each eval definition points to a dataset and one matcher.

{
  "id": "smoke.reply_ok",
  "group": "smoke",
  "description": "Checks that the service can return a simple literal answer.",
  "dataset_path": "data/smoke/reply_ok/test.jsonl",
  "split": "test",
  "matcher": {
    "kind": "exact_match",
    "case_sensitive": true,
    "trim_whitespace": true
  },
  "default_run_count": 3,
  "service_allowlist": ["product-staging", "local-product"]
}
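The two exact_match options can be read as normalization steps applied before comparison. The Python sketch below shows one plausible interpretation; the function is hypothetical, and the assumption that trimming and case folding apply to both the model output and the ideal answer is not confirmed by the source.

```python
def exact_match(output, ideal, case_sensitive=True, trim_whitespace=True):
    """Compare a model output with the ideal answer (illustrative sketch)."""
    if trim_whitespace:
        # Assumed: leading and trailing whitespace is ignored on both sides.
        output, ideal = output.strip(), ideal.strip()
    if not case_sensitive:
        output, ideal = output.casefold(), ideal.casefold()
    return output == ideal
```

With the settings shown in the definition above, a reply of " OK\n" would pass against an ideal of "OK", while "ok" would fail.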

Dataset Format

Datasets are JSONL files. Each line is one eval case.

{"id":"case-1","input":"Reply with exactly OK.","ideal":"OK"}
{"id":"case-2","input":"Reply with exactly READY.","ideal":"READY"}
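For readers new to JSONL, the following Python sketch parses a dataset one line at a time. It is illustrative only: `load_cases` is a hypothetical name, and both skipping blank lines and requiring an `id` field are assumptions.

```python
import json


def load_cases(text):
    """Parse JSONL text into a list of eval cases (illustrative sketch)."""
    cases = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumed: blank lines are ignored
        case = json.loads(line)
        if "id" not in case:
            raise ValueError(f"line {line_no}: missing 'id'")
        cases.append(case)
    return cases
```

Because every line is an independent JSON object, a malformed case can be reported by line number without discarding the rest of the file.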

Example Registry

Use examples/registry as a copyable starting point for a product eval suite. It includes:

  • one unauthenticated local product service
  • one authenticated staging product service
  • one judge service for model-graded evals
  • one smoke eval using exact_match
  • one structured-output eval using json_fields
  • one quality eval using model_grade
  • one tool eval using tool_call
  • two multimodal evals using text and image attachments

Model-Graded Evals

Use model_grade when an answer cannot be checked with a simple exact-match, includes, or JSON-field rule. The product service generates the candidate answer, then a separate judge service scores it against a rubric and returns JSON with score, passed, and reason.

Model-graded evals are useful for quality checks such as helpfulness, correctness, completeness, summarization quality, or instruction following. They are more flexible than deterministic matchers, but they cost an extra model call and depend on the quality of the judge rubric.
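Since the judge replies with JSON containing score, passed, and reason, a caller has to validate that reply before trusting it. The Python sketch below shows one way to do that; `parse_verdict` is a hypothetical helper, and the type check on `passed` is an assumption rather than documented behavior.

```python
import json


def parse_verdict(raw):
    """Validate a judge reply and return it as a dict (illustrative sketch)."""
    verdict = json.loads(raw)
    missing = [key for key in ("score", "passed", "reason") if key not in verdict]
    if missing:
        raise ValueError(f"judge reply is missing fields: {missing}")
    if not isinstance(verdict["passed"], bool):
        # Assumed: 'passed' must be a JSON boolean, not a string such as "yes".
        raise ValueError("'passed' must be a boolean")
    return verdict
```

Treating a malformed judge reply as an error, rather than as a failed case, keeps judge problems separate from product problems in the results.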

Tool-Calling Evals

Use tool_call to check whether a product chooses the expected OpenAI-style tool and sends the expected root-level argument values. Eval definitions provide the tool schema, and dataset cases provide expected tool calls.

V3 validates tool selection and arguments only. It does not execute tools, simulate tool results, or run a multi-turn agent loop.
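A check of this kind can be sketched in Python as follows. The tool-call shape (a `function` object whose `arguments` field is a JSON-encoded string) follows the OpenAI chat-completions format; the function name and the decision to ignore extra arguments are assumptions, not zig_eval's documented behavior.

```python
import json


def tool_call_matches(actual_call, expected_name, expected_args):
    """Check tool name and root-level argument values (illustrative sketch)."""
    fn = actual_call.get("function", {})
    if fn.get("name") != expected_name:
        return False
    try:
        args = json.loads(fn.get("arguments", "{}"))
    except json.JSONDecodeError:
        return False  # arguments were not valid JSON
    if not isinstance(args, dict):
        return False
    # Assumed: only expected keys are compared; extra arguments are ignored.
    return all(key in args and args[key] == value
               for key, value in expected_args.items())
```

Only top-level keys are compared here, which mirrors the "root-level" wording above; nested structures would be matched by plain equality.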

Multimodal File Evals

Dataset cases may include file attachments. The default OpenAI-compatible client renders PNG, JPEG, and WebP images as image content blocks, and appends UTF-8 text-like files as labeled prompt context. PDFs, audio, video, archives, and arbitrary binaries are available to custom adapters but are not rendered by the default client.

Attachment paths are resolved inside the registry root and are limited to 5 MB per file.
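The two constraints above (stay inside the registry root, stay under the size limit) can be illustrated with a short Python sketch. It is not the library's code: `resolve_attachment` is a hypothetical name, and interpreting 5 MB as 5 * 1024 * 1024 bytes is an assumption. `Path.is_relative_to` needs Python 3.9 or later.

```python
from pathlib import Path

MAX_ATTACHMENT_BYTES = 5 * 1024 * 1024  # assumed interpretation of "5 MB"


def resolve_attachment(registry_root, relative_path):
    """Resolve an attachment path and enforce both limits (illustrative sketch)."""
    root = Path(registry_root).resolve()
    candidate = (root / relative_path).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f"{relative_path} escapes the registry root")
    if candidate.stat().st_size > MAX_ATTACHMENT_BYTES:
        raise ValueError(f"{relative_path} exceeds the attachment size limit")
    return candidate
```

Resolving the path before the containment check means that `..` segments and symlinks cannot be used to reach files outside the registry.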

Advanced Statistics

Aggregate reports include Wilson 95% confidence intervals for pass rate at the eval, service, and model levels. Library users can also compare services or models against a baseline to get descriptive pass-rate deltas with confidence intervals.
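For reference, the Wilson score interval for a pass rate can be computed as in the Python sketch below. This is the standard textbook formula with z set to about 1.96 for 95% coverage; it is not copied from zig_eval, and returning the full range (0, 1) when there are no runs is an assumption.

```python
import math


def wilson_interval(passes, total, z=1.959964):
    """Return (low, high) bounds of the Wilson score interval."""
    if total == 0:
        return (0.0, 1.0)  # assumed: no data means no information
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, center - half), min(1.0, center + half))
```

Unlike the simple normal approximation, the Wilson interval stays within [0, 1] and behaves sensibly for small run counts or pass rates near 0% or 100%, which is common in eval suites.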

Retry Policy

Services may define retry behavior:

{
  "retry": {
    "max_attempts": 3,
    "backoff_ms": 500,
    "retry_on_status": [429, 500, 502, 503, 504]
  }
}
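One plausible reading of these fields is sketched below in Python. The semantics are assumptions: `max_attempts` is treated as the total number of attempts including the first, and `backoff_ms` as a fixed delay between attempts. `call_with_retry` and the `send` callback are hypothetical names.

```python
import time


def call_with_retry(send, retry, sleep=time.sleep):
    """Call send() until it succeeds or attempts run out (illustrative sketch).

    send() must return a (status_code, body) tuple.
    """
    max_attempts = retry.get("max_attempts", 1)
    retry_on = retry.get("retry_on_status", [])
    attempts = 0
    while True:
        attempts += 1
        status, body = send()
        if status not in retry_on or attempts >= max_attempts:
            return status, body, attempts
        # Assumed: fixed delay between attempts, no exponential growth.
        sleep(retry.get("backoff_ms", 0) / 1000)
```

Returning the attempt count alongside the response is what would let a report carry retry metadata for each run.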

Design Direction

The current focus is a small, stable core:

  • OpenAI-compatible chat execution.
  • Product-neutral service configuration.
  • Capability-based eval grouping.
  • Deterministic matchers plus V2 model-graded judging.
  • V3 single-turn tool-call validation.
  • V4 file-backed multimodal evals and confidence interval reporting.
  • Clear ownership of allocated data in public APIs.
