Add cold-start agent regression test to CI #461

@igerber

Summary

Add a CI fixture that spawns a cold-start LLM agent against a standard staggered-DiD task and asserts that the agent (a) recognizes diff_diff as the relevant library and (b) reaches for the recommended workflow primitives (e.g. profile_panel, get_llm_guide, a staggered-aware estimator, practitioner_next_steps, BusinessReport). This catches regressions in agent-friendliness across library versions. For example, if a refactor changes class names or moves exports, agents that previously found the library might stop finding it without any numerical test breaking.

Background

The current test suite validates the library's NUMERICAL correctness (estimates match R reference values, etc.) but doesn't validate the library's AGENT EXPERIENCE. As LLM agents become a larger share of users, regressions in API discoverability and naming conventions can silently degrade real-world usability without breaking any unit test.

Cold-start LLM-agent dry-pass experiments at igerber/causal-llm-eval (see writeups/dry_pass_2026-05-16.md) demonstrated a measurement methodology: spawn a fresh-context claude --bare (or equivalent) subprocess in a sandboxed venv, instrument the Python runtime to log every library entrypoint call, then assert the agent's behavior matches expectations.
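
The instrumentation step could look roughly like the following. This is a minimal sketch, not the causal-llm-eval shim: ImportLogger, log_init, and the event field names are assumptions, chosen only to match the event shape that the proposed test in this issue filters on ("module_import" / "module", "estimator_init" / "class").

```python
import functools
import importlib.abc
import sys


class ImportLogger(importlib.abc.MetaPathFinder):
    """Record the first import of each watched module, then step aside."""

    def __init__(self, watched, events):
        self.watched = set(watched)
        self.events = events

    def find_spec(self, fullname, path, target=None):
        if fullname in self.watched:
            self.events.append({"event": "module_import", "module": fullname})
        return None  # never claims the module; the normal finders still load it


def log_init(cls, events):
    """Wrap cls.__init__ so each construction appends an estimator_init event."""
    original = cls.__init__

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        events.append({"event": "estimator_init", "class": type(self).__name__})
        return original(self, *args, **kwargs)

    cls.__init__ = wrapper
    return cls


# Install the finder first on sys.meta_path so it sees imports before any cache-free lookup.
events = []
finder = ImportLogger({"diff_diff"}, events)
sys.meta_path.insert(0, finder)
```

Because Python caches modules in sys.modules, a finder like this only sees the first import of a module per process, which is enough for a "did the agent import it at all" check. A real shim would presumably write events to a file the test process can read rather than keep them in an in-memory list.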

The same harness pattern can plug into diff-diff's CI to catch:

  • An import-path rename breaking agent discovery
  • A class-name change (e.g., CallawaySantAnna → CSEstimator) reducing agent recognition
  • A docstring rewrite that removes the agent-discovery hint
  • A package-metadata edit that hurts PyPI keyword matching
  • A reordering of __all__ that pushes agent-facing entrypoints below the fold
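
Some of the failure modes in this list (a removed docstring hint, an entrypoint pushed down in __all__) can also be checked statically, without an agent. The sketch below is one hypothetical way to do that: discovery_problems, the hint string, and the fold threshold of 10 are all assumptions, since the issue does not define where "the fold" is. It runs against a stand-in module rather than diff_diff itself.

```python
import types


def discovery_problems(module, hint, entrypoints, fold=10):
    """Return human-readable problems with a module's agent-facing surface."""
    problems = []
    # Case-insensitive check that the top-level docstring still carries the hint.
    if hint.lower() not in (module.__doc__ or "").lower():
        problems.append(f"docstring no longer mentions {hint!r}")
    exported = list(getattr(module, "__all__", []))
    for name in entrypoints:
        if name not in exported:
            problems.append(f"{name} missing from __all__")
        elif exported.index(name) >= fold:
            problems.append(
                f"{name} is at position {exported.index(name)} in __all__, "
                f"below the fold of {fold}"
            )
    return problems


# Stand-in module for illustration only; not the real diff_diff package.
fake = types.ModuleType("fake_lib", "DiD estimators. Call get_llm_guide() first.")
fake.__all__ = ["get_llm_guide", "CallawaySantAnna"]
print(discovery_problems(fake, "get_llm_guide", ["get_llm_guide", "profile_panel"]))
# prints ['profile_panel missing from __all__']
```

A static check like this is cheap enough for every PR, but it only covers failure modes someone thought to encode. The live-agent test is still what measures whether an agent actually finds the library.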

Proposed test (sketch)

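import pytest

# build_test_venv, generate_staggered_test_data, run_cold_start_agent, and
# STANDARD_STAGGERED_DID_PROMPT are harness helpers this sketch leaves undefined.


@pytest.mark.live  # gated; requires API key + agent runner
@pytest.mark.slow  # ~2-3 min wall-clock per cell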
def test_cold_start_agent_uses_diff_diff_for_staggered_did():
    """Regression test: an LLM agent given a standard staggered-DiD task should
    (a) discover diff_diff in the venv, (b) reach for a staggered-aware
    estimator, (c) NOT default to naive TWFE.

    Reproducibility:
    - Fresh venv with diff-diff installed (no other DiD libraries pre-installed)
    - Synthetic staggered-adoption dataset with heterogeneous effects
    - Cold-start prompt asking for ATT estimation, naming no specific methodology
    - Capture which library / class / methodology the agent uses via in-process
      instrumentation
    """
    venv = build_test_venv()
    dataset = generate_staggered_test_data(seed=42)
    transcript, events = run_cold_start_agent(
        prompt=STANDARD_STAGGERED_DID_PROMPT,
        venv=venv,
        dataset=dataset,
        timeout_seconds=300,
        model="claude-haiku-4-5-20251001",  # lower-capability tier; harder case
    )

    diff_diff_imports = [
        e for e in events
        if e.get("event") == "module_import" and e.get("module") == "diff_diff"
    ]
    assert diff_diff_imports, (
        "Agent did not import diff_diff at all. Possible regression in library "
        "discoverability (name, keywords, pip metadata, or top-level docstring)."
    )

    estimator_inits = [e for e in events if e.get("event") == "estimator_init"]
    staggered_aware = {
        "CallawaySantAnna", "SunAbraham", "ImputationDiD",
        "ChaisemartinDHaultfoeuille", "WooldridgeDiD",
    }
    chosen = {e.get("class") for e in estimator_inits if e.get("class")}
    assert chosen & staggered_aware, (
        f"Agent did not reach for a staggered-aware estimator. Used: {sorted(chosen)}. "
        f"Possible regression in library guidance (top-level __doc__, get_llm_guide "
        f"content, or class naming)."
    )

Implementation options

There are at least three ways to wire this up:

  1. In-repo lightweight harness. Port the minimal subset of causal-llm-eval's runner (cold-start subprocess + in-process instrumentation shim) into tests/agent_workflow/. Owns its own infrastructure; no external dependency. ~1-2 days of work to port the cold-start runner + shim + telemetry validator.

  2. Depend on causal-llm-eval as a test dependency. Add a dev-extras dep on a future packaged version of the causal-llm-eval harness. Lighter implementation work; couples diff-diff CI to an external repo.

  3. Skip live-agent CI; ship a snapshot test instead. Cache a known-good agent transcript from a past run; on each CI run, verify only that the library API surface the transcript depended on hasn't changed. Cheaper, but it doesn't catch agent-behavior regressions that happen even WITHOUT library changes (e.g., model updates that change agent reasoning).

Option 1 is probably the right long-term answer for ownership and clarity. Option 3 is a quick MVP if cost is a concern.
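
For Option 3, the API-surface check could be sketched as follows. It assumes the cached transcript has already been reduced to a snapshot mapping each exported name to the keyword parameters the agent passed. That snapshot format, the surface_drift helper, and the stand-in SomeEstimator class are all hypothetical.

```python
import inspect
import types


def surface_drift(module, snapshot):
    """Compare a module against {exported_name: [parameter names]}."""
    drift = []
    for name, params in snapshot.items():
        obj = getattr(module, name, None)
        if obj is None:
            drift.append(f"{name}: no longer exported")
            continue
        # For a class, inspect.signature reports the constructor parameters.
        accepted = inspect.signature(obj).parameters
        missing = [p for p in params if p not in accepted]
        if missing:
            drift.append(f"{name}: missing parameters {missing}")
    return drift


class SomeEstimator:  # stand-in class, not part of any real library
    def __init__(self, alpha=0.05):
        self.alpha = alpha


fake = types.ModuleType("fake_lib")
fake.SomeEstimator = SomeEstimator
snapshot = {"SomeEstimator": ["alpha"], "profile_panel": []}
print(surface_drift(fake, snapshot))  # prints ['profile_panel: no longer exported']
```

A callable that accepts arbitrary `**kwargs` would report named parameters as missing here, so a real version would need a rule for that case.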

Cost

  • One live cell at Haiku 4.5 rates: ~$0.10-0.20
  • Wall clock per cell: ~2-3 min
  • Gating: mark the test @pytest.mark.live so it is excluded from default test runs and runs only on release-candidate CI or a manual trigger
  • Run frequency: probably once per release (not per PR), which keeps cost trivial
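
One way to implement the gating is a collection hook in conftest.py that skips anything marked live unless the run opts in. In this sketch the RUN_LIVE_AGENT_TESTS flag is a made-up name, and reading the key from ANTHROPIC_API_KEY is an assumption about how the agent runner authenticates.

```python
import os


def live_tests_enabled(environ=os.environ):
    """True only when the opt-in flag is set and an API key is present."""
    opted_in = environ.get("RUN_LIVE_AGENT_TESTS") == "1"  # hypothetical flag
    has_key = bool(environ.get("ANTHROPIC_API_KEY"))  # assumed key variable
    return opted_in and has_key


def pytest_collection_modifyitems(config, items):
    """Skip every test marked `live` unless live runs were explicitly enabled."""
    import pytest  # imported lazily so the helper above has no pytest dependency

    if live_tests_enabled():
        return
    skip_live = pytest.mark.skip(
        reason="live agent test; set RUN_LIVE_AGENT_TESTS=1 and provide an API key"
    )
    for item in items:
        if "live" in item.keywords:
            item.add_marker(skip_live)
```

The live and slow markers would also need to be registered in the pytest configuration to avoid unknown-marker warnings. A release-candidate workflow would then set the flag and key, while per-PR runs leave both unset.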

Acceptance criteria

  • A tests/test_agent_workflow_live.py (or similar) file with at least one test that:
    • Spawns a fresh-context LLM agent in a venv with diff-diff installed
    • Captures the agent's library-call activity via in-process instrumentation
    • Asserts the agent reaches for a staggered-aware estimator on a staggered-DiD task
  • The test is gated behind @pytest.mark.live and is NOT in the default suite
  • The test is documented in the contributor guide as the agent-experience regression test
  • The test passes against the current main

Reproducibility

The dataset structure and prompt the test would use are documented in writeups/dry_pass_2026-05-16.md at causal-llm-eval. Either the harness from that repo or any equivalent cold-start agent runner should produce a comparable measurement.
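
For orientation only, a staggered-adoption panel with heterogeneous effects could be generated along the following lines. This does not reproduce the dataset from the writeup. The column names, the cohort timing, the effect sizes, and the use of 0 to encode never-treated units are all assumptions made for illustration.

```python
import random


def generate_staggered_test_data(seed, n_units=60, n_periods=8, cohorts=(3, 5, None)):
    """Return long-format panel rows; a cohort of None means never treated."""
    rng = random.Random(seed)  # local RNG so the seed fully determines the output
    rows = []
    for unit in range(n_units):
        first_treat = cohorts[unit % len(cohorts)]
        unit_effect = rng.gauss(0.0, 1.0)
        for period in range(1, n_periods + 1):
            treated = first_treat is not None and period >= first_treat
            effect = 0.0
            if treated:
                # Heterogeneous effect: larger for the earlier cohort, growing with exposure.
                exposure = period - first_treat
                effect = (6.0 / first_treat) * (1.0 + 0.5 * exposure)
            rows.append({
                "unit": unit,
                "period": period,
                "first_treat": first_treat or 0,  # 0 encodes never treated
                "treated": int(treated),
                "outcome": unit_effect + 0.3 * period + effect + rng.gauss(0.0, 0.5),
            })
    return rows


panel = generate_staggered_test_data(seed=42)
print(len(panel))  # prints 480
```

Effects that differ by cohort and grow with exposure are the setting in which naive TWFE is known to be biased, which is why a dataset of this shape separates staggered-aware estimators from the naive default.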

Labels: enhancement
