Add cold-start agent regression test to CI #461

## Summary

Add a CI fixture that spawns a cold-start LLM agent against a standard staggered-DiD task and asserts that the agent (a) recognizes `diff_diff` as the relevant library and (b) reaches for the recommended workflow primitives (e.g. `profile_panel`, `get_llm_guide`, a staggered-aware estimator, `practitioner_next_steps`, `BusinessReport`). This catches regressions in agent-friendliness across library versions. For example, if a refactor renames classes or moves exports, agents that previously found the library might stop finding it without any numerical test breaking.

## Background

The current test suite validates the library's NUMERICAL correctness (estimates match R reference values, etc.) but doesn't validate the library's AGENT EXPERIENCE. As LLM agents become a larger share of users, regressions in API discoverability and naming conventions can silently degrade real-world usability without breaking any unit test.

Cold-start LLM-agent dry-pass experiments at [igerber/causal-llm-eval](https://github.com/igerber/causal-llm-eval) (see `writeups/dry_pass_2026-05-16.md`) demonstrated a measurement methodology: spawn a fresh-context `claude --bare` (or equivalent) subprocess in a sandboxed venv, instrument the Python runtime to log every library entrypoint call, then assert the agent's behavior matches expectations.
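As a rough illustration of the instrumentation step, the sketch below wraps a class's `__init__` so that each construction appends an event to a log. The helper name `instrument_class` and the in-memory `sink` list are hypothetical; the event keys (`event`, `class`) follow the shape used in the proposed test in this issue, and the actual causal-llm-eval shim may work differently.

```python
import functools


def instrument_class(cls, sink):
    """Wrap cls.__init__ so every construction appends an event to sink.

    Hypothetical helper: `sink` is any list-like object; a real shim would
    more likely write JSON lines to a file the test reads afterwards.
    """
    original_init = cls.__init__

    @functools.wraps(original_init)
    def logged_init(self, *args, **kwargs):
        # Record the concrete class name before delegating to the real __init__.
        sink.append({"event": "estimator_init", "class": type(self).__name__})
        return original_init(self, *args, **kwargs)

    cls.__init__ = logged_init
    return cls


if __name__ == "__main__":
    class FakeEstimator:  # stand-in for a library estimator class
        def __init__(self, alpha=0.05):
            self.alpha = alpha

    events = []
    instrument_class(FakeEstimator, events)
    FakeEstimator(alpha=0.1)
    print(events)
```

Logging at construction time keeps the shim independent of how the agent calls the estimator afterwards, at the cost of not seeing whether `fit` was ever reached.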

The same harness pattern can plug into diff-diff's CI to catch:

- An import-path rename breaking agent discovery
- A class-name change (e.g., `CallawaySantAnna` → `CSEstimator`) reducing agent recognition
- A docstring rewrite that removes the agent-discovery hint
- A package-metadata edit that hurts PyPI keyword matching
- A reordering of `__all__` that pushes agent-facing entrypoints below the fold

## Proposed test (sketch)

```python
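import pytest

# build_test_venv, generate_staggered_test_data, run_cold_start_agent, and
# STANDARD_STAGGERED_DID_PROMPT are harness helpers that still need to be written.


@pytest.mark.live  # gated; requires API key + agent runner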
@pytest.mark.slow  # ~2-3 min wall-clock per cell
def test_cold_start_agent_uses_diff_diff_for_staggered_did():
    """Regression test: an LLM agent given a standard staggered-DiD task should
    (a) discover diff_diff in the venv, (b) reach for a staggered-aware
    estimator, (c) NOT default to naive TWFE.

    Reproducibility:
    - Fresh venv with diff-diff installed (no other DiD libraries pre-installed)
    - Synthetic staggered-adoption dataset with heterogeneous effects
    - Cold-start prompt asking for ATT estimation, naming no specific methodology
    - Capture which library / class / methodology the agent uses via in-process
      instrumentation
    """
    venv = build_test_venv()
    dataset = generate_staggered_test_data(seed=42)
    transcript, events = run_cold_start_agent(
        prompt=STANDARD_STAGGERED_DID_PROMPT,
        venv=venv,
        dataset=dataset,
        timeout_seconds=300,
        model="claude-haiku-4-5-20251001",  # lower-capability tier; harder case
    )

    diff_diff_imports = [
        e for e in events
        if e.get("event") == "module_import" and e.get("module") == "diff_diff"
    ]
    assert diff_diff_imports, (
        "Agent did not import diff_diff at all. Possible regression in library "
        "discoverability (name, keywords, pip metadata, or top-level docstring)."
    )

    estimator_inits = [e for e in events if e.get("event") == "estimator_init"]
    staggered_aware = {
        "CallawaySantAnna", "SunAbraham", "ImputationDiD",
        "ChaisemartinDHaultfoeuille", "WooldridgeDiD",
    }
    chosen = {e.get("class") for e in estimator_inits if e.get("class")}
    assert chosen & staggered_aware, (
        f"Agent did not reach for a staggered-aware estimator. Used: {sorted(chosen)}. "
        "Possible regression in library guidance (top-level __doc__, get_llm_guide "
        "content, or class naming)."
    )
    # TODO: point (c) in the docstring (no naive TWFE default) is not asserted yet.
```
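The `generate_staggered_test_data` helper referenced in the sketch does not exist yet. One possible shape, using only the standard library, is shown below. The column names, cohort periods, and effect sizes are all assumptions made for illustration; the dataset structure actually used in the dry pass is described in the causal-llm-eval writeup.

```python
import random


def generate_staggered_test_data(seed=42, n_units=60, n_periods=10, cohorts=(4, 6, 8)):
    """Return a long-format panel (list of dict rows) with staggered adoption.

    Hypothetical layout: first_treat == 0 marks never-treated units, and the
    treatment effect varies by cohort and by time since adoption.
    """
    rng = random.Random(seed)  # local RNG so the output depends only on the seed
    rows = []
    for unit in range(n_units):
        # Each unit is assigned to one adoption cohort or to never-treated (0).
        first_treat = rng.choice((0,) + tuple(cohorts))
        unit_effect = rng.gauss(0.0, 1.0)
        for period in range(1, n_periods + 1):
            treated = int(first_treat != 0 and period >= first_treat)
            effect = 0.0
            if treated:
                exposure = period - first_treat + 1
                # Heterogeneity: earlier cohorts get larger effects that grow over time.
                effect = (1.0 + 0.5 * exposure) * (n_periods - first_treat) / 4
            outcome = unit_effect + 0.3 * period + effect + rng.gauss(0.0, 0.5)
            rows.append({
                "unit": unit,
                "period": period,
                "first_treat": first_treat,
                "treated": treated,
                "outcome": outcome,
            })
    return rows


if __name__ == "__main__":
    panel = generate_staggered_test_data(seed=42)
    print(len(panel), panel[0])
```

Effects that differ across cohorts and grow with exposure are the setting in which naive TWFE is known to be biased, so a dataset like this gives the estimator-choice assertion something to distinguish.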

## Implementation options

There are at least three ways to wire this up:

1. **In-repo lightweight harness.** Port the minimal subset of causal-llm-eval's runner (cold-start subprocess + in-process instrumentation shim) into `tests/agent_workflow/`. Owns its own infrastructure; no external dependency. ~1-2 days of work to port the cold-start runner + shim + telemetry validator.

2. **Depend on causal-llm-eval as a test dependency.** Add a dev-extras dep on a future packaged version of the causal-llm-eval harness. Lighter implementation work; couples diff-diff CI to an external repo.

3. **Skip live-agent CI; ship a snapshot test instead.** Cache a known-good agent transcript from a past run; on each CI run, only verify that the library API surface that transcript depended on hasn't changed. Cheaper, but it doesn't catch agent-behavior regressions that happen even WITHOUT library changes (e.g., model updates that change agent reasoning).

Option 1 is probably the right long-term answer for ownership and clarity. Option 3 is a quick MVP if cost is a concern.
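For Option 3, the API-surface check could be as small as the sketch below. The function name `missing_api_names` is hypothetical, and the list of required names would come from whatever the cached transcript actually used. The demo runs against the standard-library `json` module so it works without `diff_diff` installed.

```python
import importlib


def missing_api_names(module_name, required_names):
    """Return, sorted, the names that module_name no longer exposes at top level."""
    module = importlib.import_module(module_name)
    return sorted(name for name in required_names if not hasattr(module, name))


if __name__ == "__main__":
    # Stand-in demo; in the real test this might be
    # missing_api_names("diff_diff", {"CallawaySantAnna", "get_llm_guide"}).
    print(missing_api_names("json", {"dumps", "loads", "no_such_name"}))
```

This only detects removed or renamed top-level names. It says nothing about docstring changes or package metadata, which are among the regressions listed in the Background section.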

## Cost

- One live cell at Haiku 4.5 rates: ~$0.10-0.20
- Wall clock per cell: ~2-3 min
- Gating: mark the test `@pytest.mark.live` and skip or deselect that marker by default, so it runs only on release-candidate CI or a manual trigger
- Run frequency: probably once per release (not per PR), which keeps cost trivial
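One way to implement the gating is a `conftest.py` that registers the markers and skips `live` tests unless they are explicitly enabled. This is a minimal sketch: the environment variable names `DIFF_DIFF_RUN_LIVE` and `ANTHROPIC_API_KEY` are assumptions, and a command-line flag would work equally well.

```python
import os


def live_tests_enabled(environ=None):
    """Live tests run only when explicitly requested AND an API key is present.

    Both variable names are assumptions made for this sketch.
    """
    environ = os.environ if environ is None else environ
    return environ.get("DIFF_DIFF_RUN_LIVE") == "1" and bool(environ.get("ANTHROPIC_API_KEY"))


def pytest_configure(config):
    # Register the markers so pytest does not warn about unknown marks.
    config.addinivalue_line("markers", "live: spawns a real LLM agent; needs an API key")
    config.addinivalue_line("markers", "slow: takes minutes of wall-clock time")


def pytest_collection_modifyitems(config, items):
    import pytest  # imported lazily; only needed when pytest invokes this hook

    if live_tests_enabled():
        return
    skip_live = pytest.mark.skip(reason="live agent tests are disabled by default")
    for item in items:
        if "live" in item.keywords:
            item.add_marker(skip_live)
```

With this in place, a release-candidate job would set both variables and run `pytest -m live`, while the default suite reports the test as skipped rather than silently omitting it.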

## Acceptance criteria

- A `tests/test_agent_workflow_live.py` (or similar) file with at least one test that:
  - Spawns a fresh-context LLM agent in a venv with diff-diff installed
  - Captures the agent's library-call activity via in-process instrumentation
  - Asserts the agent reaches for a staggered-aware estimator on a staggered-DiD task
- The test is `@pytest.mark.live`-gated and NOT in the default suite
- The test is documented in the contributor guide as the agent-experience regression test
- The test passes against the current `main`

## Reproducibility

The dataset structure and prompt the test would use are documented in `writeups/dry_pass_2026-05-16.md` at causal-llm-eval. Either the harness from that repo or any equivalent cold-start agent runner should produce a comparable measurement, though individual agent runs are not deterministic.

## Related

- causal-llm-eval source: [`harness/runner.py`](https://github.com/igerber/causal-llm-eval/blob/main/harness/runner.py) (the cold-start runner pattern), [`harness/sitecustomize_template.py`](https://github.com/igerber/causal-llm-eval/blob/main/harness/sitecustomize_template.py) (the in-process instrumentation pattern)
- causal-llm-eval writeup: [`writeups/dry_pass_2026-05-16.md`](https://github.com/igerber/causal-llm-eval/blob/main/writeups/dry_pass_2026-05-16.md)
