
feat(python-sdk): contract test scaffold and conventionality contract test #39

Open
czi-fsisenda wants to merge 1 commit into fsisenda/sdk_python_basic_conventionality from fsisenda/sdk_python_contract_tests

Conversation

@czi-fsisenda

Summary

Jira:

Contract tests for evaluators in the Python SDK

  • Introduces capture.py, which captures LLM inputs, outputs, and info from eval notebooks
  • Adds capture to the conventionality notebook
  • Generates the contract artifact for conventionality from data captured from the notebook
  • Introduces make commands for building and validating contract artifacts
  • Adds a contract test for conventionality

Test Plan

  • Wrote automated tests
  • Manually tested my changes, and here are the details:


Copilot AI left a comment


Pull request overview

Adds contract-test infrastructure to the Python SDK and seeds it with an initial Conventionality evaluator contract artifact + test, ensuring evaluator behavior matches the reference notebook and that bundled artifacts stay synced with canonical settings.

Changes:

  • Introduces contracts.toml artifacts for the Conventionality evaluator (canonical under sdks/settings/ plus bundled copy under the Python package).
  • Adds a contract-test loader + harness and a Conventionality contract test that asserts prompt fidelity and result mapping.
  • Adds Makefile targets and a sync-guard test to keep bundled contract artifacts byte-identical to the canonical source.
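
The sync guard in the last bullet can be thought of as a plain byte comparison between the two files. The following is a minimal sketch of that idea, not the PR's actual test: the helper name `assert_in_sync`, the demo function, and the temporary file names are all hypothetical.

```python
import tempfile
from pathlib import Path


def assert_in_sync(canonical: Path, bundled: Path) -> None:
    """Fail if the bundled copy is missing or differs from the canonical file."""
    assert bundled.is_file(), f"bundled artifact missing: {bundled}"
    # Compare raw bytes so whitespace and line-ending drift is caught too.
    assert canonical.read_bytes() == bundled.read_bytes(), (
        f"{bundled} is out of sync with {canonical}"
    )


def demo_sync_guard() -> tuple[bool, bool]:
    """Return (identical_passes, drifted_fails) using throwaway files."""
    with tempfile.TemporaryDirectory() as tmp:
        canonical = Path(tmp) / "canonical.toml"  # hypothetical file names
        bundled = Path(tmp) / "bundled.toml"
        canonical.write_bytes(b'[cases.example]\ndescription = "demo"\n')
        bundled.write_bytes(canonical.read_bytes())
        assert_in_sync(canonical, bundled)  # identical copies pass
        identical_passes = True

        bundled.write_bytes(canonical.read_bytes() + b"\n")  # simulate drift
        try:
            assert_in_sync(canonical, bundled)
            drifted_fails = False
        except AssertionError:
            drifted_fails = True
    return identical_passes, drifted_fails
```

Comparing bytes rather than parsed TOML is the stricter choice: it also flags edits that leave the parsed data unchanged, which matches the "byte-identical" wording above.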

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| sdks/settings/conventionality/contracts.toml | Adds canonical Conventionality contract artifact captured from the notebook. |
| sdks/python/src/learning_commons_evaluators/settings/conventionality/contracts.toml | Adds bundled package copy of the Conventionality contract artifact for installed-package testing. |
| sdks/python/tests/settings/test_load_settings.py | Adds bundled-artifact presence check and a canonical-vs-bundled sync guard. |
| sdks/python/tests/contract_tests/loader.py | Adds TOML-backed contract case model + loader resolving via the package settings root. |
| sdks/python/tests/contract_tests/harness.py | Adds provider-mocking harness that captures prompt requests and asserts contract fidelity. |
| sdks/python/tests/contract_tests/conventionality.py | Adds Conventionality case loader and notebook→SDK expected-result mapper. |
| sdks/python/tests/contract_tests/test_conventionality.py | Adds the initial Conventionality contract test for the “turnip” case. |
| sdks/python/tests/contract_tests/__init__.py | Defines the contract-tests package and documents the contract-test approach. |
| sdks/python/Makefile | Adds build/check-build and contract-test/sync targets for artifact maintenance. |
| evals/conventionality_evaluator.ipynb | Updates the notebook to capture LLM calls and print a contracts.toml block. |
| evals/capture.py | Adds notebook utilities for capturing prompt/response snapshots and emitting TOML artifacts. |
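
To make the harness row above concrete, here is a rough sketch of how a provider-mocking contract check might work: a fake provider records each prompt and replays a canned response, and the test asserts both prompt fidelity and result mapping. Everything in it (`RecordingProvider`, `toy_evaluator`, `run_contract_case`, the case fields and their values) is invented for illustration and is not taken from harness.py.

```python
from dataclasses import dataclass, field


@dataclass
class RecordingProvider:
    """Hypothetical stand-in for an LLM provider: records prompts, replays a canned reply."""

    canned_response: str
    requests: list[str] = field(default_factory=list)

    def complete(self, prompt: str) -> str:
        self.requests.append(prompt)
        return self.canned_response


def toy_evaluator(text: str, grade: int, provider: RecordingProvider) -> dict[str, str]:
    """Hypothetical evaluator: builds one prompt and maps the reply to a result."""
    prompt = f"Grade {grade}: rate the conventionality of: {text}"
    return {"answer": provider.complete(prompt)}


def run_contract_case(case: dict) -> dict[str, str]:
    """Run one case against the mocked provider and check it against the contract."""
    provider = RecordingProvider(canned_response=case["llm_response"])
    result = toy_evaluator(provider=provider, **case["input"])
    # Prompt fidelity: the evaluator must send exactly the recorded prompt.
    assert provider.requests == [case["expected_prompt"]]
    # Result mapping: the canned reply must map to the recorded result.
    assert result == case["expected_result"]
    return result


# Made-up case data; a real case would come from contracts.toml.
example_case = {
    "input": {"text": "The turnip grew.", "grade": 3},
    "llm_response": "slightly complex",
    "expected_prompt": "Grade 3: rate the conventionality of: The turnip grew.",
    "expected_result": {"answer": "slightly complex"},
}
```

Because the provider is replaced, a check like this needs no network access or API key, which is what makes it cheap enough to run on every change.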


Comment thread evals/capture.py
description="…", # optional human-readable label
)

3. Print the TOML block and paste it into ``contract_tests.toml``:

Copilot AI Apr 30, 2026


The module docstring references pasting output into contract_tests.toml, but this repo stores artifacts in contracts.toml (e.g. sdks/settings/<evaluator>/contracts.toml). Updating the filename/path here will prevent confusion when authors regenerate contract data.

Suggested change
- 3. Print the TOML block and paste it into ``contract_tests.toml``:
+ 3. Print the TOML block and paste it into ``contracts.toml`` (for example,
+ ``sdks/settings/<evaluator>/contracts.toml``):

Copilot uses AI. Check for mistakes.
Comment thread evals/capture.py
Comment on lines +15 to +44
and output dict directly — no manual field extraction needed::

    case_input = {"text": my_text, "grade_level": 4}
    case_output = run_evaluator(**case_input)

    _cap = capture_case(
        name="my_case",
        input=case_input,
        llm_call_captures=["step_name"], # prefixes, in call order
        expected_result=case_output,
        description="…", # optional human-readable label
    )

3. Print the TOML block and paste it into ``contract_tests.toml``:

    print(build_contract_toml(_cap_one, _cap_two))

Resetting between runs
-----------------------
Call ``reset_captures()`` at the start of each evaluation to avoid stale data from a
previous run leaking into the next capture_case::

    reset_captures()
    output = run_evaluator(text, grade)
    _cap = capture_case(
        name="my_case",
        input={"text": text, "grade_level": grade},
        llm_call_captures=["main"],
        expected_result=output,
    )

Copilot AI Apr 30, 2026


The examples in the docstring use an input key grade_level, but the Conventionality evaluator input schema uses grade (and the contract TOML in this PR uses grade). Consider updating the examples to match the actual evaluator API so notebook authors don’t capture mismatched input shapes.
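
A mismatch like this can be caught mechanically at capture time. The following is a minimal sketch under stated assumptions: `check_input_keys` is a hypothetical helper, and the expected key set `{"text", "grade"}` is assumed from this comment and the docstring examples, not read from the evaluator's schema.

```python
def check_input_keys(case_input: dict, expected_keys: set[str]) -> list[str]:
    """Return human-readable problems; an empty list means the shape matches."""
    actual = set(case_input)
    problems = []
    for key in sorted(expected_keys - actual):
        problems.append(f"missing key: {key}")
    for key in sorted(actual - expected_keys):
        problems.append(f"unexpected key: {key}")
    return problems


# Assumed schema for illustration only.
EXPECTED_KEYS = {"text", "grade"}

mismatched = check_input_keys({"text": "hi", "grade_level": 4}, EXPECTED_KEYS)
matching = check_input_keys({"text": "hi", "grade": 4}, EXPECTED_KEYS)
```

Here `mismatched` reports both the missing `grade` key and the unexpected `grade_level` key, while `matching` is empty.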

Comment thread evals/capture.py
Comment on lines +204 to +216
def build_contract_toml(*cases: dict[str, Any]) -> str:
    """Build the contract TOML block for one or more test cases.

    Args:
        *cases: One or more dicts as returned by :func:`capture_case`.

    Returns:
        TOML string ready to paste into ``contract_tests.toml``.

    Example::

        print(build_contract_toml(_cap_grade3, _cap_grade7))
    """

Copilot AI Apr 30, 2026


build_contract_toml()’s docstring also says the output should be pasted into contract_tests.toml, but the contract artifacts in this repo are named contracts.toml. Aligning the docstring with the actual file name will reduce regeneration errors.

HOW TO ADD A NEW CASE
---------------------
1. Add a ``[cases.<name>]`` entry to
``settings/conventionality/contracts.toml``.

Copilot AI Apr 30, 2026


The “HOW TO ADD A NEW CASE” section points to settings/conventionality/contracts.toml, but the canonical source-of-truth file in this repo is under sdks/settings/conventionality/contracts.toml (and there is also a bundled copy under the Python package). Updating the path here will make the instructions unambiguous.

Suggested change
- ``settings/conventionality/contracts.toml``.
+ ``sdks/settings/conventionality/contracts.toml``.


Each evaluator's ``contracts.toml`` lives next to its ``settings.toml`` in
``sdks/settings/<evaluator>/`` (e.g.
``settings/conventionality/contracts.toml``). This module provides the data

Copilot AI Apr 30, 2026


Docstring path example is inconsistent: it says contracts live in sdks/settings/<evaluator>/ but the example immediately below omits the sdks/ prefix. Aligning the example with the canonical path will avoid confusion when adding new evaluator contracts.

Suggested change
- ``settings/conventionality/contracts.toml``). This module provides the data
+ ``sdks/settings/conventionality/contracts.toml``). This module provides the data

description="",
maturity=EvaluatorMaturity.beta,
),
evaluation_settings=None, # type: ignore[arg-type]

Copilot AI Apr 30, 2026


EvaluationMetadata.evaluation_settings is typed as Any, so passing None shouldn’t require # type: ignore[arg-type]. Dropping the ignore will keep mypy useful here and avoid accidentally masking a real type error in the future.

Suggested change
- evaluation_settings=None, # type: ignore[arg-type]
+ evaluation_settings=None,
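
The point can be illustrated in isolation. This is a minimal sketch that assumes, as the comment states, that the field is annotated `Any`; `MetadataSketch` is a hypothetical stand-in and models only that one field, not the real `EvaluationMetadata`.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class MetadataSketch:
    """Hypothetical stand-in for EvaluationMetadata; only the relevant field is modeled."""

    evaluation_settings: Any


# No ``# type: ignore`` is needed here: ``Any`` is compatible with every value,
# including None, so a type checker has nothing to report on this argument.
meta = MetadataSketch(evaluation_settings=None)
```

If the project runs mypy with `--warn-unused-ignores`, a leftover ignore on a line like this would itself be reported.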

Comment thread sdks/python/Makefile
unit-test:
	$(PYTEST) tests/ -v --ignore=tests/contract_tests

contract-test:
Contributor


P0 - Add a step in CI to run

temperature: float
llm_response: str

def is_populated(self) -> bool:
Contributor


P1 - Currently unused. Is this used in a downstream PR and in tests?

Comment thread evals/capture.py
@@ -0,0 +1,320 @@
"""Contract test capture utilities for evaluator notebooks.
Contributor


P0 - As discussed, move this into the Python SDK for now. Eventually, parts of this will be supplemented or replaced by our updated notebook.

"from langchain_core.prompts.chat import HumanMessagePromptTemplate\n",
"from langchain_google_genai import ChatGoogleGenerativeAI\n",
"from pydantic import BaseModel, Field\n",
"from textstat import textstat as ts"
Contributor


P0 - Are these changes necessary for the first release scope? Will these need to be applied to all the notebooks?

3 participants