feat(python-sdk): conventionality evaluator by czi-fsisenda · Pull Request #38 · learning-commons-org/evaluators

czi-fsisenda · 2026-04-30T14:16:11Z

Summary

Jira:

Implementation of the conventionality evaluator in the Python sdk

creates settings file for conventionality
generates Python settings from settings file
implements conventionality
unit tests

Test Plan

Wrote automated tests
Manually tested my changes, and here are the details:

Copilot

Pull request overview

Adds a new “Conventionality” evaluator to the Python SDK, backed by shared TOML settings and a generated, import-time settings module, plus extensive unit tests around evaluator inputs/settings loading and BaseEvaluator behavior.

Changes:

Added conventionality evaluator settings TOML and generated Python settings module.
Implemented ConventionalityEvaluator (+ input/output schemas) and exported it from the SDK public API.
Added/updated unit tests covering settings loading, evaluator schemas, evaluator behavior, and BaseEvaluator internals; added Makefile targets for settings generation checks.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 9 comments.

Show a summary per file

File	Description
sdks/settings/conventionality/settings.toml	New canonical settings (metadata, prompts, model config) for conventionality.
sdks/python/src/learning_commons_evaluators/settings/_generated_conventionality_settings.py	Generated settings module consumed at runtime by the evaluator.
sdks/python/src/learning_commons_evaluators/schemas/conventionality.py	Adds conventionality settings/output Pydantic models.
sdks/python/src/learning_commons_evaluators/schemas/init.py	Exposes conventionality schemas via `learning_commons_evaluators.schemas`.
sdks/python/src/learning_commons_evaluators/evaluators/conventionality.py	Implements `ConventionalityEvaluator` + typed `ConventionalityEvaluationInput`.
sdks/python/src/learning_commons_evaluators/evaluators/init.py	Exports conventionality evaluator/input from the evaluators package.
sdks/python/src/learning_commons_evaluators/init.py	Exposes conventionality evaluator/input/schemas from the root package API.
scripts/generate_settings.py	New generator for `_generated__settings.py` from `sdks/settings/*/settings.toml`.
sdks/python/Makefile	Adds `generate-settings` / `check-generated` targets and wires `check-generated` into `verify`.
sdks/python/tests/test_package_imports.py	Ensures conventionality evaluator is importable from the root package.
sdks/python/tests/settings/test_load_settings.py	New tests for TOML settings loader + helpers and `shared_settings_root()`.
sdks/python/tests/schemas/test_evaluator_schemas.py	New tests for `EvaluationInput` coercion/validation/metadata behaviors.
sdks/python/tests/evaluators/test_conventionality.py	New tests for conventionality evaluator wiring and output typing.
sdks/python/tests/evaluators/test_base.py	New tests for BaseEvaluator telemetry branching, step execution, chain execution, errors, token usage.
sdks/python/tests/conftest.py	Updates fixtures to use real `ConventionalityEvaluationSettings` instead of a stub.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-30T14:28:31Z

@@ -0,0 +1,78 @@
+[evaluator_metadata]
+id = "conventionality"
+version = 0.1


version is set as a TOML number (0.1), which will be parsed as a float and then string-coerced later. That risks losing formatting (e.g., 0.10 → 0.1) and makes it easy to accidentally create non-semver values; prefer quoting it as a string (e.g., version = "0.1").

Suggested change

version = 0.1

version = "0.1"

Copilot · 2026-04-30T14:28:32Z

+    {
+        "name": "vocabulary",
+        "settings_cls": VocabularyEvaluationSettings,
+        "output": _GENERATED_DIR / "_generated_vocabulary_settings.py",
+    },
+]
+
+


The evaluator registry includes a vocabulary entry pointing at sdks/settings/vocabulary/settings.toml and _generated_vocabulary_settings.py, but sdks/settings/vocabulary/ (and the corresponding schema/module) isn’t present. This will make --check/generation fail when it tries to load a non-existent TOML. Remove the registry entry or guard it behind existence checks until vocabulary settings land.

Suggested change

{

"name": "vocabulary",

"settings_cls": VocabularyEvaluationSettings,

"output": _GENERATED_DIR / "_generated_vocabulary_settings.py",

},

]

]

_vocabulary_settings_cls = globals().get("VocabularyEvaluationSettings")

_vocabulary_settings_toml = _SETTINGS_DIR / "vocabulary" / "settings.toml"

if _vocabulary_settings_cls is not None and _vocabulary_settings_toml.exists():

_EVALUATORS.append(

{

"name": "vocabulary",

"settings_cls": _vocabulary_settings_cls,

"output": _GENERATED_DIR / "_generated_vocabulary_settings.py",

}

)

Copilot · 2026-04-30T14:28:32Z

+# A realistic sample long enough to satisfy the min_text_length=100 constraint.
+_SAMPLE_TEXT = (
+    "Marco Polo was a Venetian merchant and explorer who traveled through Asia "
+    "in the late 13th century. He spent nearly two decades at the court of "
+    "Kublai Khan, the Mongol ruler of China, and described his experiences in "
+    "a book that introduced Europeans to the Far East."
+)


This comment refers to min_text_length=100, but sdks/settings/conventionality/settings.toml in this PR sets min_text_length = 10. Please update this comment to avoid misleading future maintainers about where the constraint comes from.

Copilot · 2026-04-30T14:28:32Z

+
+class TestEvaluateErrorHandling:
+    def test_raises_validation_error_for_invalid_input(self):
+        # "x" is below the 100-char minimum defined in the TOML settings.


This test’s inline comment says the minimum text length is 100, but the conventionality settings TOML in this PR sets min_text_length = 10. The assertion logic is fine, but please update the comment to avoid confusion when settings change.

Suggested change

# "x" is below the 100-char minimum defined in the TOML settings.

# "x" is below the configured minimum text length in the TOML settings.

Copilot · 2026-04-30T14:28:33Z

+    ) -> TextComplexityResult:
+        """Run conventionality evaluation. Returns TextComplexityResult with answer, explanation, metadata."""
+        ps_main = evaluation_settings.prompt_settings_step_main
+        assert ps_main is not None


assert ps_main is not None is used for required runtime configuration. Assertions can be stripped with python -O, and even when enabled they raise an AssertionError rather than a user-facing ConfigurationError. Replace this with an explicit check that raises ConfigurationError (or similar) when prompt_settings_step_main is missing.

Suggested change

assert ps_main is not None

if ps_main is None:

raise ValueError(

"Missing required evaluation setting "

"'prompt_settings_step_main' for conventionality evaluator."

)

Copilot · 2026-04-30T14:28:33Z

+from learning_commons_evaluators.schemas.conventionality import (  # noqa: E402
+    ConventionalityEvaluationSettings,
+)
+from learning_commons_evaluators.schemas.vocabulary import (  # noqa: E402
+    VocabularyEvaluationSettings,
+)


This generator imports VocabularyEvaluationSettings from learning_commons_evaluators.schemas.vocabulary, but that module doesn’t exist in the SDK currently. As written, running python scripts/generate_settings.py (and the new make check-generated) will fail with ImportError. Either add the missing vocabulary schema/settings, or remove these imports until vocabulary is implemented.

Copilot · 2026-04-30T14:28:33Z


-verify: lint format-check typecheck pip-check unit-test
+verify: check-generated lint format-check typecheck pip-check unit-test



verify now depends on check-generated, which runs scripts/generate_settings.py --check. Given the script currently references a non-existent vocabulary evaluator/settings, make verify will fail. Either fix the generator registry first (recommended) or temporarily omit check-generated from verify until all referenced evaluators exist.

Copilot · 2026-04-30T14:28:34Z

+# A realistic sample long enough to satisfy the min_text_length=100 constraint.
+_SAMPLE_TEXT = (
+    "Marco Polo was a Venetian merchant and explorer who traveled through Asia "
+    "in the late 13th century. He spent nearly two decades at the court of "
+    "Kublai Khan, the Mongol ruler of China, and described his experiences in "
+    "a book that introduced Europeans to the Far East."
+)


This docstring says the sample satisfies a min_text_length=100 constraint, but the conventionality settings in this PR set min_text_length = 10. Update the comment (and any similar ones) so test intent matches the current settings source-of-truth.

Copilot · 2026-04-30T14:28:34Z

+# A realistic sample long enough to satisfy the min_text_length=100 constraint.
+_SAMPLE_TEXT = (
+    "Marco Polo was a Venetian merchant and explorer who traveled through Asia "
+    "in the late 13th century. He spent nearly two decades at the court of "
+    "Kublai Khan, the Mongol ruler of China, and described his experiences in "
+    "a book that introduced Europeans to the Far East."
+)


This comment says the sample text is sized for min_text_length=100, but the conventionality settings added in this PR set the minimum to 10 characters. Please update the comment so the test documentation matches the configured constraint.

adnanrhussain · 2026-05-04T19:05:04Z

+    ) -> TextComplexityResult:
+        """Run conventionality evaluation. Returns TextComplexityResult with answer, explanation, metadata."""
+        ps_main = evaluation_settings.prompt_settings_step_main
+        assert ps_main is not None


adnanrhussain · 2026-05-04T19:06:43Z

+            chain_inputs=prompt_inputs,
+            parser_output_type=ConventionalityOutput,
+        )
+        assert isinstance(conventionality_output, ConventionalityOutput)


P0 - Similar to above, we should raise an explicit EvaluatorError here

adnanrhussain · 2026-05-04T19:38:47Z

+            ]
+        ).partial(format_instructions=parser.get_format_instructions())
+        conventionality_output = self.execute_prompt_chain_step(
+            step_name="main",


P0 -

Suggested change

step_name="main",

step_name="conventionality_evaluation",

adnanrhussain · 2026-05-04T19:41:35Z

+        'text': TextInputSpec(name='text', min_text_length=10, max_text_length=10000),
+        'grade': GradeInputSpec(
+            name='grade',
+            allowed_grades=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],


P0 - 3 -12 only

feat: conventionality evaluator

fa41738

czi-fsisenda changed the base branch from main to fsisenda/sdk_python_scaffold April 30, 2026 14:20

czi-fsisenda requested review from adnanrhussain, Copilot and georgemelvin April 30, 2026 14:21

Copilot started reviewing on behalf of czi-fsisenda April 30, 2026 14:22 View session

Copilot AI reviewed Apr 30, 2026

View reviewed changes

adnanrhussain requested changes May 4, 2026

View reviewed changes

	# "x" is below the 100-char minimum defined in the TOML settings.
	# "x" is below the configured minimum text length in the TOML settings.

-        assert ps_main is not None
+        if ps_main is None:
+            raise ValueError(
+                "Missing required evaluation setting "
+                "'prompt_settings_step_main' for conventionality evaluator."
+            )


		verify: lint format-check typecheck pip-check unit-test
		verify: check-generated lint format-check typecheck pip-check unit-test

Conversation

czi-fsisenda commented Apr 30, 2026

Summary

Test Plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

adnanrhussain May 4, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

adnanrhussain May 4, 2026

Choose a reason for hiding this comment

Uh oh!

adnanrhussain May 4, 2026

Choose a reason for hiding this comment

Uh oh!

adnanrhussain May 4, 2026

Choose a reason for hiding this comment

Uh oh!

adnanrhussain May 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants