feat(python-sdk): contract test scaffold and conventionality contract test#39
Conversation
There was a problem hiding this comment.
Pull request overview
Adds contract-test infrastructure to the Python SDK and seeds it with an initial Conventionality evaluator contract artifact + test, ensuring evaluator behavior matches the reference notebook and that bundled artifacts stay synced with canonical settings.
Changes:
- Introduces
contracts.tomlartifacts for the Conventionality evaluator (canonical undersdks/settings/plus bundled copy under the Python package). - Adds a contract-test loader + harness and a Conventionality contract test that asserts prompt fidelity and result mapping.
- Adds Makefile targets and a sync-guard test to keep bundled contract artifacts byte-identical to the canonical source.
Reviewed changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
sdks/settings/conventionality/contracts.toml |
Adds canonical Conventionality contract artifact captured from the notebook. |
sdks/python/src/learning_commons_evaluators/settings/conventionality/contracts.toml |
Adds bundled package copy of the Conventionality contract artifact for installed-package testing. |
sdks/python/tests/settings/test_load_settings.py |
Adds bundled-artifact presence check and a canonical-vs-bundled sync guard. |
sdks/python/tests/contract_tests/loader.py |
Adds TOML-backed contract case model + loader resolving via the package settings root. |
sdks/python/tests/contract_tests/harness.py |
Adds provider-mocking harness that captures prompt requests and asserts contract fidelity. |
sdks/python/tests/contract_tests/conventionality.py |
Adds Conventionality case loader and notebook→SDK expected-result mapper. |
sdks/python/tests/contract_tests/test_conventionality.py |
Adds the initial Conventionality contract test for the “turnip” case. |
sdks/python/tests/contract_tests/__init__.py |
Defines the contract-tests package and documents the contract-test approach. |
sdks/python/Makefile |
Adds build/check-build and contract-test/sync targets for artifact maintenance. |
evals/conventionality_evaluator.ipynb |
Updates the notebook to capture LLM calls and print a contracts.toml block. |
evals/capture.py |
Adds notebook utilities for capturing prompt/response snapshots and emitting TOML artifacts. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| description="…", # optional human-readable label | ||
| ) | ||
|
|
||
| 3. Print the TOML block and paste it into ``contract_tests.toml``: |
There was a problem hiding this comment.
The module docstring references pasting output into contract_tests.toml, but this repo stores artifacts in contracts.toml (e.g. sdks/settings/<evaluator>/contracts.toml). Updating the filename/path here will prevent confusion when authors regenerate contract data.
| 3. Print the TOML block and paste it into ``contract_tests.toml``: | |
| 3. Print the TOML block and paste it into ``contracts.toml`` (for example, | |
| ``sdks/settings/<evaluator>/contracts.toml``): |
| and output dict directly — no manual field extraction needed:: | ||
|
|
||
| case_input = {"text": my_text, "grade_level": 4} | ||
| case_output = run_evaluator(**case_input) | ||
|
|
||
| _cap = capture_case( | ||
| name="my_case", | ||
| input=case_input, | ||
| llm_call_captures=["step_name"], # prefixes, in call order | ||
| expected_result=case_output, | ||
| description="…", # optional human-readable label | ||
| ) | ||
|
|
||
| 3. Print the TOML block and paste it into ``contract_tests.toml``: | ||
|
|
||
| print(build_contract_toml(_cap_one, _cap_two)) | ||
|
|
||
| Resetting between runs | ||
| ----------------------- | ||
| Call ``reset_captures()`` at the start of each evaluation to avoid stale data from a | ||
| previous run leaking into the next capture_case:: | ||
|
|
||
| reset_captures() | ||
| output = run_evaluator(text, grade) | ||
| _cap = capture_case( | ||
| name="my_case", | ||
| input={"text": text, "grade_level": grade}, | ||
| llm_call_captures=["main"], | ||
| expected_result=output, | ||
| ) |
There was a problem hiding this comment.
The examples in the docstring use an input key grade_level, but the Conventionality evaluator input schema uses grade (and the contract TOML in this PR uses grade). Consider updating the examples to match the actual evaluator API so notebook authors don’t capture mismatched input shapes.
| def build_contract_toml(*cases: dict[str, Any]) -> str: | ||
| """Build the contract TOML block for one or more test cases. | ||
|
|
||
| Args: | ||
| *cases: One or more dicts as returned by :func:`capture_case`. | ||
|
|
||
| Returns: | ||
| TOML string ready to paste into ``contract_tests.toml``. | ||
|
|
||
| Example:: | ||
|
|
||
| print(build_contract_toml(_cap_grade3, _cap_grade7)) | ||
| """ |
There was a problem hiding this comment.
build_contract_toml()’s docstring also says the output should be pasted into contract_tests.toml, but the contract artifacts in this repo are named contracts.toml. Aligning the docstring with the actual file name will reduce regeneration errors.
| HOW TO ADD A NEW CASE | ||
| --------------------- | ||
| 1. Add a ``[cases.<name>]`` entry to | ||
| ``settings/conventionality/contracts.toml``. |
There was a problem hiding this comment.
The “HOW TO ADD A NEW CASE” section points to settings/conventionality/contracts.toml, but the canonical source-of-truth file in this repo is under sdks/settings/conventionality/contracts.toml (and there is also a bundled copy under the Python package). Updating the path here will make the instructions unambiguous.
| ``settings/conventionality/contracts.toml``. | |
| ``sdks/settings/conventionality/contracts.toml``. |
|
|
||
| Each evaluator's ``contracts.toml`` lives next to its ``settings.toml`` in | ||
| ``sdks/settings/<evaluator>/`` (e.g. | ||
| ``settings/conventionality/contracts.toml``). This module provides the data |
There was a problem hiding this comment.
Docstring path example is inconsistent: it says contracts live in sdks/settings/<evaluator>/ but the example immediately below omits the sdks/ prefix. Aligning the example with the canonical path will avoid confusion when adding new evaluator contracts.
| ``settings/conventionality/contracts.toml``). This module provides the data | |
| ``sdks/settings/conventionality/contracts.toml``). This module provides the data |
| description="", | ||
| maturity=EvaluatorMaturity.beta, | ||
| ), | ||
| evaluation_settings=None, # type: ignore[arg-type] |
There was a problem hiding this comment.
EvaluationMetadata.evaluation_settings is typed as Any, so passing None shouldn’t require # type: ignore[arg-type]. Dropping the ignore will keep mypy useful here and avoid accidentally masking a real type error in the future.
| evaluation_settings=None, # type: ignore[arg-type] | |
| evaluation_settings=None, |
| unit-test: | ||
| $(PYTEST) tests/ -v --ignore=tests/contract_tests | ||
|
|
||
| contract-test: |
There was a problem hiding this comment.
P0 - Add a step in CI to run
| temperature: float | ||
| llm_response: str | ||
|
|
||
| def is_populated(self) -> bool: |
There was a problem hiding this comment.
P1 - Currently unused. Is this used in a downstream PR and in tests?
| @@ -0,0 +1,320 @@ | |||
| """Contract test capture utilities for evaluator notebooks. | |||
There was a problem hiding this comment.
P0 - As discussed, move this into the sdk / python, for now. Eventually parts of this will be supplemented / replaced by our updated notebook
| "from langchain_core.prompts.chat import HumanMessagePromptTemplate\n", | ||
| "from langchain_google_genai import ChatGoogleGenerativeAI\n", | ||
| "from pydantic import BaseModel, Field\n", | ||
| "from textstat import textstat as ts" |
There was a problem hiding this comment.
P0 - Are these changes necessary for the first release scope? Will these need to be applied to all the notebooks?
Summary
Jira:
Contract tests for evaluators in the Python SDK
Test Plan