feat(python-sdk): conventionality evaluator #38
czi-fsisenda wants to merge 1 commit into fsisenda/sdk_python_scaffold from
Conversation
Pull request overview
Adds a new “Conventionality” evaluator to the Python SDK, backed by shared TOML settings and a generated, import-time settings module, plus extensive unit tests around evaluator inputs/settings loading and BaseEvaluator behavior.
Changes:
- Added conventionality evaluator settings TOML and generated Python settings module.
- Implemented ConventionalityEvaluator (+ input/output schemas) and exported it from the SDK public API.
- Added/updated unit tests covering settings loading, evaluator schemas, evaluator behavior, and BaseEvaluator internals; added Makefile targets for settings generation checks.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 9 comments.
Summary per file:
| File | Description |
|---|---|
| sdks/settings/conventionality/settings.toml | New canonical settings (metadata, prompts, model config) for conventionality. |
| sdks/python/src/learning_commons_evaluators/settings/_generated_conventionality_settings.py | Generated settings module consumed at runtime by the evaluator. |
| sdks/python/src/learning_commons_evaluators/schemas/conventionality.py | Adds conventionality settings/output Pydantic models. |
| sdks/python/src/learning_commons_evaluators/schemas/__init__.py | Exposes conventionality schemas via learning_commons_evaluators.schemas. |
| sdks/python/src/learning_commons_evaluators/evaluators/conventionality.py | Implements ConventionalityEvaluator + typed ConventionalityEvaluationInput. |
| sdks/python/src/learning_commons_evaluators/evaluators/__init__.py | Exports conventionality evaluator/input from the evaluators package. |
| sdks/python/src/learning_commons_evaluators/__init__.py | Exposes conventionality evaluator/input/schemas from the root package API. |
| scripts/generate_settings.py | New generator for _generated_*_settings.py from sdks/settings/**/settings.toml. |
| sdks/python/Makefile | Adds generate-settings / check-generated targets and wires check-generated into verify. |
| sdks/python/tests/test_package_imports.py | Ensures conventionality evaluator is importable from the root package. |
| sdks/python/tests/settings/test_load_settings.py | New tests for TOML settings loader + helpers and shared_settings_root(). |
| sdks/python/tests/schemas/test_evaluator_schemas.py | New tests for EvaluationInput coercion/validation/metadata behaviors. |
| sdks/python/tests/evaluators/test_conventionality.py | New tests for conventionality evaluator wiring and output typing. |
| sdks/python/tests/evaluators/test_base.py | New tests for BaseEvaluator telemetry branching, step execution, chain execution, errors, token usage. |
| sdks/python/tests/conftest.py | Updates fixtures to use real ConventionalityEvaluationSettings instead of a stub. |
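For orientation, below is a minimal sketch of the generate-then-check flow implied by the settings TOML, the generated module, and the new check-generated target. The paths, helper names, and CLI handling are assumptions for illustration only, not the contents of scripts/generate_settings.py.

```python
"""Illustrative sketch only -- not the real scripts/generate_settings.py."""
import sys
import tomllib
from pathlib import Path

# Assumed locations, mirroring the paths listed in the file summary above.
SETTINGS_TOML = Path("sdks/settings/conventionality/settings.toml")
GENERATED_PY = Path(
    "sdks/python/src/learning_commons_evaluators/settings/"
    "_generated_conventionality_settings.py"
)


def render_module(toml_path: Path) -> str:
    """Embed the parsed TOML as a constant in an importable Python module."""
    data = tomllib.loads(toml_path.read_text(encoding="utf-8"))
    return (
        f"# Auto-generated from {toml_path}. Do not edit by hand.\n"
        f"CONVENTIONALITY_SETTINGS = {data!r}\n"
    )


def main() -> int:
    rendered = render_module(SETTINGS_TOML)
    if "--check" in sys.argv:
        # check-generated mode: fail when the committed module is stale.
        on_disk = GENERATED_PY.read_text(encoding="utf-8") if GENERATED_PY.exists() else ""
        return 0 if on_disk == rendered else 1
    GENERATED_PY.write_text(rendered, encoding="utf-8")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
```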
```
@@ -0,0 +1,78 @@
[evaluator_metadata]
id = "conventionality"
version = 0.1
```
version is set as a TOML number (0.1), which will be parsed as a float and then string-coerced later. That risks losing formatting (e.g., 0.10 → 0.1) and makes it easy to accidentally create non-semver values; prefer quoting it as a string (e.g., version = "0.1").
Suggested change:
```diff
-version = 0.1
+version = "0.1"
```
```python
    {
        "name": "vocabulary",
        "settings_cls": VocabularyEvaluationSettings,
        "output": _GENERATED_DIR / "_generated_vocabulary_settings.py",
    },
]
```
The evaluator registry includes a vocabulary entry pointing at sdks/settings/vocabulary/settings.toml and _generated_vocabulary_settings.py, but sdks/settings/vocabulary/ (and the corresponding schema/module) isn’t present. This will make --check/generation fail when it tries to load a non-existent TOML. Remove the registry entry or guard it behind existence checks until vocabulary settings land.
Suggested change:
```diff
-    {
-        "name": "vocabulary",
-        "settings_cls": VocabularyEvaluationSettings,
-        "output": _GENERATED_DIR / "_generated_vocabulary_settings.py",
-    },
-]
+]
+_vocabulary_settings_cls = globals().get("VocabularyEvaluationSettings")
+_vocabulary_settings_toml = _SETTINGS_DIR / "vocabulary" / "settings.toml"
+if _vocabulary_settings_cls is not None and _vocabulary_settings_toml.exists():
+    _EVALUATORS.append(
+        {
+            "name": "vocabulary",
+            "settings_cls": _vocabulary_settings_cls,
+            "output": _GENERATED_DIR / "_generated_vocabulary_settings.py",
+        }
+    )
```
```python
# A realistic sample long enough to satisfy the min_text_length=100 constraint.
_SAMPLE_TEXT = (
    "Marco Polo was a Venetian merchant and explorer who traveled through Asia "
    "in the late 13th century. He spent nearly two decades at the court of "
    "Kublai Khan, the Mongol ruler of China, and described his experiences in "
    "a book that introduced Europeans to the Far East."
)
```
This comment refers to min_text_length=100, but sdks/settings/conventionality/settings.toml in this PR sets min_text_length = 10. Please update this comment to avoid misleading future maintainers about where the constraint comes from.
```python
class TestEvaluateErrorHandling:
    def test_raises_validation_error_for_invalid_input(self):
        # "x" is below the 100-char minimum defined in the TOML settings.
```
This test’s inline comment says the minimum text length is 100, but the conventionality settings TOML in this PR sets min_text_length = 10. The assertion logic is fine, but please update the comment to avoid confusion when settings change.
| # "x" is below the 100-char minimum defined in the TOML settings. | |
| # "x" is below the configured minimum text length in the TOML settings. |
```python
    ) -> TextComplexityResult:
        """Run conventionality evaluation. Returns TextComplexityResult with answer, explanation, metadata."""
        ps_main = evaluation_settings.prompt_settings_step_main
        assert ps_main is not None
```
assert ps_main is not None is used for required runtime configuration. Assertions can be stripped with python -O, and even when enabled they raise an AssertionError rather than a user-facing ConfigurationError. Replace this with an explicit check that raises ConfigurationError (or similar) when prompt_settings_step_main is missing.
Suggested change:
```diff
-        assert ps_main is not None
+        if ps_main is None:
+            raise ValueError(
+                "Missing required evaluation setting "
+                "'prompt_settings_step_main' for conventionality evaluator."
+            )
```
```python
from learning_commons_evaluators.schemas.conventionality import (  # noqa: E402
    ConventionalityEvaluationSettings,
)
from learning_commons_evaluators.schemas.vocabulary import (  # noqa: E402
    VocabularyEvaluationSettings,
)
```
This generator imports VocabularyEvaluationSettings from learning_commons_evaluators.schemas.vocabulary, but that module doesn’t exist in the SDK currently. As written, running python scripts/generate_settings.py (and the new make check-generated) will fail with ImportError. Either add the missing vocabulary schema/settings, or remove these imports until vocabulary is implemented.
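If the entry is kept rather than removed, one possible shape for the guard is sketched below; it assumes the module path quoted in the generator and is not code from this PR:

```python
# Sketch: tolerate the not-yet-implemented vocabulary schema module so that
# generation and --check keep working for the evaluators that do exist.
try:
    from learning_commons_evaluators.schemas.vocabulary import (
        VocabularyEvaluationSettings,
    )
except ImportError:  # vocabulary settings have not landed yet
    VocabularyEvaluationSettings = None
```

The registry could then skip entries whose settings class is None, which lines up with the existence-check suggestion made earlier.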
```diff
-verify: lint format-check typecheck pip-check unit-test
+verify: check-generated lint format-check typecheck pip-check unit-test
```
verify now depends on check-generated, which runs scripts/generate_settings.py --check. Given the script currently references a non-existent vocabulary evaluator/settings, make verify will fail. Either fix the generator registry first (recommended) or temporarily omit check-generated from verify until all referenced evaluators exist.
```python
# A realistic sample long enough to satisfy the min_text_length=100 constraint.
_SAMPLE_TEXT = (
    "Marco Polo was a Venetian merchant and explorer who traveled through Asia "
    "in the late 13th century. He spent nearly two decades at the court of "
    "Kublai Khan, the Mongol ruler of China, and described his experiences in "
    "a book that introduced Europeans to the Far East."
)
```
This comment says the sample satisfies a min_text_length=100 constraint, but the conventionality settings in this PR set min_text_length = 10. Update the comment (and any similar ones) so test intent matches the current settings source-of-truth.
```python
# A realistic sample long enough to satisfy the min_text_length=100 constraint.
_SAMPLE_TEXT = (
    "Marco Polo was a Venetian merchant and explorer who traveled through Asia "
    "in the late 13th century. He spent nearly two decades at the court of "
    "Kublai Khan, the Mongol ruler of China, and described his experiences in "
    "a book that introduced Europeans to the Far East."
)
```
This comment says the sample text is sized for min_text_length=100, but the conventionality settings added in this PR set the minimum to 10 characters. Please update the comment so the test documentation matches the configured constraint.
```python
    ) -> TextComplexityResult:
        """Run conventionality evaluation. Returns TextComplexityResult with answer, explanation, metadata."""
        ps_main = evaluation_settings.prompt_settings_step_main
        assert ps_main is not None
```
```python
            chain_inputs=prompt_inputs,
            parser_output_type=ConventionalityOutput,
        )
        assert isinstance(conventionality_output, ConventionalityOutput)
```
P0 - Similar to above, we should raise an explicit EvaluatorError here
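A possible shape for that explicit check is sketched below; EvaluatorError is assumed here based on the reviewer's wording and may not match the SDK's actual exception type:

```python
# Sketch: replace the bare assert with an explicit, always-on check.
if not isinstance(conventionality_output, ConventionalityOutput):
    raise EvaluatorError(  # assumed SDK exception type per the review comment
        "Conventionality step returned "
        f"{type(conventionality_output).__name__}, expected ConventionalityOutput."
    )
```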
```python
            ]
        ).partial(format_instructions=parser.get_format_instructions())
        conventionality_output = self.execute_prompt_chain_step(
            step_name="main",
```
P0 -
Suggested change:
```diff
-            step_name="main",
+            step_name="conventionality_evaluation",
```
```python
'text': TextInputSpec(name='text', min_text_length=10, max_text_length=10000),
'grade': GradeInputSpec(
    name='grade',
    allowed_grades=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
```
Summary
Jira:
Implementation of the conventionality evaluator in the Python SDK.
Test Plan