feat(python-sdk): contract test scaffold and conventionality contract test by czi-fsisenda · Pull Request #39 · learning-commons-org/evaluators

czi-fsisenda · 2026-04-30T14:26:53Z

Summary

Jira:

Contract tests for evaluators in the Python SDK

Introduces capture.py, which captures LLM inputs, outputs, and info from eval notebooks
Adds capture to the conventionality notebook
Generates the contract artifact for conventionality from data captured in the notebook
Introduces make commands for building and validating contract artifacts
Adds a contract test for conventionality

Test Plan

Wrote automated tests
Manually tested my changes, and here are the details:

Copilot

Pull request overview

Adds contract-test infrastructure to the Python SDK and seeds it with an initial Conventionality evaluator contract artifact + test, ensuring evaluator behavior matches the reference notebook and that bundled artifacts stay synced with canonical settings.

Changes:

Introduces contracts.toml artifacts for the Conventionality evaluator (canonical under sdks/settings/ plus bundled copy under the Python package).
Adds a contract-test loader + harness and a Conventionality contract test that asserts prompt fidelity and result mapping.
Adds Makefile targets and a sync-guard test to keep bundled contract artifacts byte-identical to the canonical source.
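The sync guard mentioned in the last item can be pictured as a byte-for-byte file comparison. The following is a minimal sketch of that idea, not the test added in this PR: the function names and the throwaway file names are assumptions, and the real guard would point at the canonical and bundled `contracts.toml` paths.

```python
import tempfile
from pathlib import Path


def assert_byte_identical(canonical: Path, bundled: Path) -> None:
    """Fail if the bundled copy differs from the canonical file in any byte."""
    if canonical.read_bytes() != bundled.read_bytes():
        raise AssertionError(f"{bundled} is out of sync with {canonical}")


def demo_sync_guard() -> tuple[bool, bool]:
    """Exercise the guard on throwaway files.

    Returns (in_sync_ok, drift_detected). File names are hypothetical.
    """
    with tempfile.TemporaryDirectory() as tmp:
        canonical = Path(tmp) / "canonical.toml"
        bundled = Path(tmp) / "bundled.toml"

        canonical.write_bytes(b"[cases.example]\nvalue = 1\n")
        bundled.write_bytes(canonical.read_bytes())
        assert_byte_identical(canonical, bundled)  # identical copies pass
        in_sync_ok = True

        # Simulate someone editing only the bundled copy.
        bundled.write_bytes(b"[cases.example]\nvalue = 2\n")
        try:
            assert_byte_identical(canonical, bundled)
            drift_detected = False
        except AssertionError:
            drift_detected = True
    return in_sync_ok, drift_detected


print(demo_sync_guard())
```

Comparing raw bytes rather than parsed TOML also catches whitespace and line-ending drift, which matches the "byte-identical" wording above.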

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `sdks/settings/conventionality/contracts.toml` | Adds canonical Conventionality contract artifact captured from the notebook. |
| `sdks/python/src/learning_commons_evaluators/settings/conventionality/contracts.toml` | Adds bundled package copy of the Conventionality contract artifact for installed-package testing. |
| `sdks/python/tests/settings/test_load_settings.py` | Adds bundled-artifact presence check and a canonical-vs-bundled sync guard. |
| `sdks/python/tests/contract_tests/loader.py` | Adds TOML-backed contract case model + loader resolving via the package settings root. |
| `sdks/python/tests/contract_tests/harness.py` | Adds provider-mocking harness that captures prompt requests and asserts contract fidelity. |
| `sdks/python/tests/contract_tests/conventionality.py` | Adds Conventionality case loader and notebook→SDK expected-result mapper. |
| `sdks/python/tests/contract_tests/test_conventionality.py` | Adds the initial Conventionality contract test for the “turnip” case. |
| `sdks/python/tests/contract_tests/__init__.py` | Defines the contract-tests package and documents the contract-test approach. |
| `sdks/python/Makefile` | Adds `build`/`check-build` and contract-test/sync targets for artifact maintenance. |
| `evals/conventionality_evaluator.ipynb` | Updates the notebook to capture LLM calls and print a `contracts.toml` block. |
| `evals/capture.py` | Adds notebook utilities for capturing prompt/response snapshots and emitting TOML artifacts. |
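To make the "provider-mocking harness" row concrete, here is a rough sketch of the general pattern: replace the LLM provider with a recorder that replays a captured response, then assert both the prompt sent and the mapped result. Everything in it is hypothetical (`RecordingProvider`, `toy_evaluator`, the case fields, and the sample data); it does not reproduce `harness.py` or the real evaluator.

```python
from dataclasses import dataclass, field


@dataclass
class RecordingProvider:
    """Hypothetical stand-in for an LLM provider: records prompts, replays a canned reply."""

    canned_response: str
    prompts: list[str] = field(default_factory=list)

    def complete(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return self.canned_response


def toy_evaluator(text: str, grade: int, provider: RecordingProvider) -> dict:
    """Hypothetical evaluator: builds a prompt, calls the provider, maps the reply."""
    prompt = f"Grade {grade}: rate the conventionality of: {text}"
    reply = provider.complete(prompt)
    return {"label": reply.strip().lower()}


def run_contract_case(case: dict) -> None:
    """Replay one captured case and check prompt fidelity and result mapping."""
    provider = RecordingProvider(canned_response=case["llm_response"])
    result = toy_evaluator(provider=provider, **case["input"])
    assert provider.prompts == [case["prompt"]], "prompt drifted"
    assert result == case["expected_result"], "result mapping drifted"


# Made-up case data, shaped like a captured notebook run.
CASE = {
    "input": {"text": "The cat sat.", "grade": 3},
    "prompt": "Grade 3: rate the conventionality of: The cat sat.",
    "llm_response": " Conventional\n",
    "expected_result": {"label": "conventional"},
}

run_contract_case(CASE)  # passes silently when nothing has drifted
```

Because the provider never reaches a network, a test built this way is deterministic and fails only when the prompt construction or the response mapping changes.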

Copilot · 2026-04-30T14:33:42Z

+           description="…",                 # optional human-readable label
+       )
+
+3. Print the TOML block and paste it into ``contract_tests.toml``:


The module docstring references pasting output into contract_tests.toml, but this repo stores artifacts in contracts.toml (e.g. sdks/settings/<evaluator>/contracts.toml). Updating the filename/path here will prevent confusion when authors regenerate contract data.

Suggested change

-3. Print the TOML block and paste it into ``contract_tests.toml``:
+3. Print the TOML block and paste it into ``contracts.toml`` (for example,
+``sdks/settings/<evaluator>/contracts.toml``):

Copilot · 2026-04-30T14:33:43Z

+   and output dict directly — no manual field extraction needed::
+
+       case_input  = {"text": my_text, "grade_level": 4}
+       case_output = run_evaluator(**case_input)
+
+       _cap = capture_case(
+           name="my_case",
+           input=case_input,
+           llm_call_captures=["step_name"],  # prefixes, in call order
+           expected_result=case_output,
+           description="…",                 # optional human-readable label
+       )
+
+3. Print the TOML block and paste it into ``contract_tests.toml``:
+
+       print(build_contract_toml(_cap_one, _cap_two))
+
+Resetting between runs
+-----------------------
+Call ``reset_captures()`` at the start of each evaluation to avoid stale data from a
+previous run leaking into the next capture_case::
+
+    reset_captures()
+    output = run_evaluator(text, grade)
+    _cap = capture_case(
+        name="my_case",
+        input={"text": text, "grade_level": grade},
+        llm_call_captures=["main"],
+        expected_result=output,
+    )


The examples in the docstring use an input key grade_level, but the Conventionality evaluator input schema uses grade (and the contract TOML in this PR uses grade). Consider updating the examples to match the actual evaluator API so notebook authors don’t capture mismatched input shapes.
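One way to catch this kind of mismatch early is a small key check run before a case is captured. The sketch below is only an illustration: `check_input_keys` is a hypothetical helper, and the expected key set (`text`, `grade`) is assumed from the comment above rather than read from the evaluator's schema.

```python
def check_input_keys(case_input: dict, expected_keys: set[str]) -> list[str]:
    """Return human-readable problems; an empty list means the shapes match."""
    actual = set(case_input)
    problems = []
    for key in sorted(expected_keys - actual):
        problems.append(f"missing key: {key}")
    for key in sorted(actual - expected_keys):
        problems.append(f"unexpected key: {key}")
    return problems


# Assumed from the review comment, not taken from the SDK's input schema.
EXPECTED_KEYS = {"text", "grade"}

print(check_input_keys({"text": "sample", "grade_level": 4}, EXPECTED_KEYS))
# → ['missing key: grade', 'unexpected key: grade_level']
print(check_input_keys({"text": "sample", "grade": 4}, EXPECTED_KEYS))
# → []
```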

Copilot · 2026-04-30T14:33:43Z

+def build_contract_toml(*cases: dict[str, Any]) -> str:
+    """Build the contract TOML block for one or more test cases.
+
+    Args:
+        *cases: One or more dicts as returned by :func:`capture_case`.
+
+    Returns:
+        TOML string ready to paste into ``contract_tests.toml``.
+
+    Example::
+
+        print(build_contract_toml(_cap_grade3, _cap_grade7))
+    """


build_contract_toml()’s docstring also says the output should be pasted into contract_tests.toml, but the contract artifacts in this repo are named contracts.toml. Aligning the docstring with the actual file name will reduce regeneration errors.

Copilot · 2026-04-30T14:33:43Z

+HOW TO ADD A NEW CASE
+---------------------
+1. Add a ``[cases.<name>]`` entry to
+   ``settings/conventionality/contracts.toml``.


The “HOW TO ADD A NEW CASE” section points to settings/conventionality/contracts.toml, but the canonical source-of-truth file in this repo is under sdks/settings/conventionality/contracts.toml (and there is also a bundled copy under the Python package). Updating the path here will make the instructions unambiguous.

Suggested change

-``settings/conventionality/contracts.toml``.
+``sdks/settings/conventionality/contracts.toml``.

Copilot · 2026-04-30T14:33:44Z

+
+Each evaluator's ``contracts.toml`` lives next to its ``settings.toml`` in
+``sdks/settings/<evaluator>/`` (e.g.
+``settings/conventionality/contracts.toml``).  This module provides the data


Docstring path example is inconsistent: it says contracts live in sdks/settings/<evaluator>/ but the example immediately below omits the sdks/ prefix. Aligning the example with the canonical path will avoid confusion when adding new evaluator contracts.

Suggested change

-``settings/conventionality/contracts.toml``). This module provides the data
+``sdks/settings/conventionality/contracts.toml``). This module provides the data

Copilot · 2026-04-30T14:33:44Z

+            description="",
+            maturity=EvaluatorMaturity.beta,
+        ),
+        evaluation_settings=None,  # type: ignore[arg-type]


EvaluationMetadata.evaluation_settings is typed as Any, so passing None shouldn’t require # type: ignore[arg-type]. Dropping the ignore will keep mypy useful here and avoid accidentally masking a real type error in the future.

Suggested change

-evaluation_settings=None, # type: ignore[arg-type]
+evaluation_settings=None,
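The point about `Any` can be shown with a trimmed-down stand-in. `MetadataSketch` below is hypothetical and is not the SDK's `EvaluationMetadata`; it only illustrates that a field annotated `Any` accepts `None` with no suppression comment. If the project enables mypy's `warn_unused_ignores` option, a leftover ignore like the one above would be reported as unused.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class MetadataSketch:
    """Hypothetical, trimmed-down stand-in for EvaluationMetadata."""

    description: str
    evaluation_settings: Any  # Any is compatible with every value, including None


# No "# type: ignore" is required: None is a valid argument for an Any field.
meta = MetadataSketch(description="", evaluation_settings=None)
print(meta.evaluation_settings is None)
```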

adnanrhussain · 2026-05-05T04:50:48Z

 unit-test:
 	$(PYTEST) tests/ -v --ignore=tests/contract_tests

+contract-test:


P0 - Add a step in CI to run this `contract-test` target.

adnanrhussain · 2026-05-05T04:51:32Z

+    temperature: float
+    llm_response: str
+
+    def is_populated(self) -> bool:


P1 - Currently unused. Is this used in a downstream PR and in tests?

adnanrhussain · 2026-05-05T04:54:15Z

@@ -0,0 +1,320 @@
+"""Contract test capture utilities for evaluator notebooks.


P0 - As discussed, move this into sdks/python for now. Eventually, parts of this will be supplemented or replaced by our updated notebook.

adnanrhussain · 2026-05-05T04:55:12Z

    "from langchain_core.prompts.chat import HumanMessagePromptTemplate\n",
    "from langchain_google_genai import ChatGoogleGenerativeAI\n",
    "from pydantic import BaseModel, Field\n",
-    "from textstat import textstat as ts"


P0 - Are these changes necessary for the first release scope? Will these need to be applied to all the notebooks?

feat: contract test scaffold and conventionality contract test

a55fec5

czi-fsisenda requested review from adnanrhussain, Copilot and georgemelvin April 30, 2026 14:26

Copilot started reviewing on behalf of czi-fsisenda April 30, 2026 14:27

Copilot AI reviewed Apr 30, 2026

adnanrhussain requested changes May 5, 2026

czi-fsisenda wants to merge 1 commit into fsisenda/sdk_python_basic_conventionality from fsisenda/sdk_python_contract_tests

3 participants
