Skip to content

feat(python-sdk): conventionality evaluator #38

Open
czi-fsisenda wants to merge 1 commit intofsisenda/sdk_python_scaffoldfrom
fsisenda/sdk_python_basic_conventionality
Open

feat(python-sdk): conventionality evaluator #38
czi-fsisenda wants to merge 1 commit intofsisenda/sdk_python_scaffoldfrom
fsisenda/sdk_python_basic_conventionality

Conversation

@czi-fsisenda
Copy link
Copy Markdown

Summary

Jira:

Implementation of the conventionality evaluator in the Python sdk

  • creates settings file for conventionality
  • generates Python settings from settings file
  • implements conventionality
  • unit tests

Test Plan

  • Wrote automated tests
  • Manually tested my changes, and here are the details:

@czi-fsisenda czi-fsisenda changed the base branch from main to fsisenda/sdk_python_scaffold April 30, 2026 14:20
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new “Conventionality” evaluator to the Python SDK, backed by shared TOML settings and a generated, import-time settings module, plus extensive unit tests around evaluator inputs/settings loading and BaseEvaluator behavior.

Changes:

  • Added conventionality evaluator settings TOML and generated Python settings module.
  • Implemented ConventionalityEvaluator (+ input/output schemas) and exported it from the SDK public API.
  • Added/updated unit tests covering settings loading, evaluator schemas, evaluator behavior, and BaseEvaluator internals; added Makefile targets for settings generation checks.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
sdks/settings/conventionality/settings.toml New canonical settings (metadata, prompts, model config) for conventionality.
sdks/python/src/learning_commons_evaluators/settings/_generated_conventionality_settings.py Generated settings module consumed at runtime by the evaluator.
sdks/python/src/learning_commons_evaluators/schemas/conventionality.py Adds conventionality settings/output Pydantic models.
sdks/python/src/learning_commons_evaluators/schemas/init.py Exposes conventionality schemas via learning_commons_evaluators.schemas.
sdks/python/src/learning_commons_evaluators/evaluators/conventionality.py Implements ConventionalityEvaluator + typed ConventionalityEvaluationInput.
sdks/python/src/learning_commons_evaluators/evaluators/init.py Exports conventionality evaluator/input from the evaluators package.
sdks/python/src/learning_commons_evaluators/init.py Exposes conventionality evaluator/input/schemas from the root package API.
scripts/generate_settings.py New generator for _generated_*_settings.py from sdks/settings/**/settings.toml.
sdks/python/Makefile Adds generate-settings / check-generated targets and wires check-generated into verify.
sdks/python/tests/test_package_imports.py Ensures conventionality evaluator is importable from the root package.
sdks/python/tests/settings/test_load_settings.py New tests for TOML settings loader + helpers and shared_settings_root().
sdks/python/tests/schemas/test_evaluator_schemas.py New tests for EvaluationInput coercion/validation/metadata behaviors.
sdks/python/tests/evaluators/test_conventionality.py New tests for conventionality evaluator wiring and output typing.
sdks/python/tests/evaluators/test_base.py New tests for BaseEvaluator telemetry branching, step execution, chain execution, errors, token usage.
sdks/python/tests/conftest.py Updates fixtures to use real ConventionalityEvaluationSettings instead of a stub.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@@ -0,0 +1,78 @@
[evaluator_metadata]
id = "conventionality"
version = 0.1
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

version is set as a TOML number (0.1), which will be parsed as a float and then string-coerced later. That risks losing formatting (e.g., 0.100.1) and makes it easy to accidentally create non-semver values; prefer quoting it as a string (e.g., version = "0.1").

Suggested change
version = 0.1
version = "0.1"

Copilot uses AI. Check for mistakes.
Comment on lines +373 to +380
{
"name": "vocabulary",
"settings_cls": VocabularyEvaluationSettings,
"output": _GENERATED_DIR / "_generated_vocabulary_settings.py",
},
]


Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The evaluator registry includes a vocabulary entry pointing at sdks/settings/vocabulary/settings.toml and _generated_vocabulary_settings.py, but sdks/settings/vocabulary/ (and the corresponding schema/module) isn’t present. This will make --check/generation fail when it tries to load a non-existent TOML. Remove the registry entry or guard it behind existence checks until vocabulary settings land.

Suggested change
{
"name": "vocabulary",
"settings_cls": VocabularyEvaluationSettings,
"output": _GENERATED_DIR / "_generated_vocabulary_settings.py",
},
]
]
_vocabulary_settings_cls = globals().get("VocabularyEvaluationSettings")
_vocabulary_settings_toml = _SETTINGS_DIR / "vocabulary" / "settings.toml"
if _vocabulary_settings_cls is not None and _vocabulary_settings_toml.exists():
_EVALUATORS.append(
{
"name": "vocabulary",
"settings_cls": _vocabulary_settings_cls,
"output": _GENERATED_DIR / "_generated_vocabulary_settings.py",
}
)

Copilot uses AI. Check for mistakes.
Comment on lines +16 to +22
# A realistic sample long enough to satisfy the min_text_length=100 constraint.
_SAMPLE_TEXT = (
"Marco Polo was a Venetian merchant and explorer who traveled through Asia "
"in the late 13th century. He spent nearly two decades at the court of "
"Kublai Khan, the Mongol ruler of China, and described his experiences in "
"a book that introduced Europeans to the Far East."
)
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment refers to min_text_length=100, but sdks/settings/conventionality/settings.toml in this PR sets min_text_length = 10. Please update this comment to avoid misleading future maintainers about where the constraint comes from.

Copilot uses AI. Check for mistakes.

class TestEvaluateErrorHandling:
def test_raises_validation_error_for_invalid_input(self):
# "x" is below the 100-char minimum defined in the TOML settings.
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test’s inline comment says the minimum text length is 100, but the conventionality settings TOML in this PR sets min_text_length = 10. The assertion logic is fine, but please update the comment to avoid confusion when settings change.

Suggested change
# "x" is below the 100-char minimum defined in the TOML settings.
# "x" is below the configured minimum text length in the TOML settings.

Copilot uses AI. Check for mistakes.
) -> TextComplexityResult:
"""Run conventionality evaluation. Returns TextComplexityResult with answer, explanation, metadata."""
ps_main = evaluation_settings.prompt_settings_step_main
assert ps_main is not None
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert ps_main is not None is used for required runtime configuration. Assertions can be stripped with python -O, and even when enabled they raise an AssertionError rather than a user-facing ConfigurationError. Replace this with an explicit check that raises ConfigurationError (or similar) when prompt_settings_step_main is missing.

Suggested change
assert ps_main is not None
if ps_main is None:
raise ValueError(
"Missing required evaluation setting "
"'prompt_settings_step_main' for conventionality evaluator."
)

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Comment on lines +88 to +93
from learning_commons_evaluators.schemas.conventionality import ( # noqa: E402
ConventionalityEvaluationSettings,
)
from learning_commons_evaluators.schemas.vocabulary import ( # noqa: E402
VocabularyEvaluationSettings,
)
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generator imports VocabularyEvaluationSettings from learning_commons_evaluators.schemas.vocabulary, but that module doesn’t exist in the SDK currently. As written, running python scripts/generate_settings.py (and the new make check-generated) will fail with ImportError. Either add the missing vocabulary schema/settings, or remove these imports until vocabulary is implemented.

Copilot uses AI. Check for mistakes.
Comment thread sdks/python/Makefile
Comment on lines 50 to 52

verify: lint format-check typecheck pip-check unit-test
verify: check-generated lint format-check typecheck pip-check unit-test

Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

verify now depends on check-generated, which runs scripts/generate_settings.py --check. Given the script currently references a non-existent vocabulary evaluator/settings, make verify will fail. Either fix the generator registry first (recommended) or temporarily omit check-generated from verify until all referenced evaluators exist.

Copilot uses AI. Check for mistakes.
Comment on lines +32 to +38
# A realistic sample long enough to satisfy the min_text_length=100 constraint.
_SAMPLE_TEXT = (
"Marco Polo was a Venetian merchant and explorer who traveled through Asia "
"in the late 13th century. He spent nearly two decades at the court of "
"Kublai Khan, the Mongol ruler of China, and described his experiences in "
"a book that introduced Europeans to the Far East."
)
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This docstring says the sample satisfies a min_text_length=100 constraint, but the conventionality settings in this PR set min_text_length = 10. Update the comment (and any similar ones) so test intent matches the current settings source-of-truth.

Copilot uses AI. Check for mistakes.
Comment on lines +63 to +69
# A realistic sample long enough to satisfy the min_text_length=100 constraint.
_SAMPLE_TEXT = (
"Marco Polo was a Venetian merchant and explorer who traveled through Asia "
"in the late 13th century. He spent nearly two decades at the court of "
"Kublai Khan, the Mongol ruler of China, and described his experiences in "
"a book that introduced Europeans to the Far East."
)
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment says the sample text is sized for min_text_length=100, but the conventionality settings added in this PR set the minimum to 10 characters. Please update the comment so the test documentation matches the configured constraint.

Copilot uses AI. Check for mistakes.
) -> TextComplexityResult:
"""Run conventionality evaluation. Returns TextComplexityResult with answer, explanation, metadata."""
ps_main = evaluation_settings.prompt_settings_step_main
assert ps_main is not None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

chain_inputs=prompt_inputs,
parser_output_type=ConventionalityOutput,
)
assert isinstance(conventionality_output, ConventionalityOutput)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 - Similar to above, we should raise an explicit EvaluatorError here

]
).partial(format_instructions=parser.get_format_instructions())
conventionality_output = self.execute_prompt_chain_step(
step_name="main",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 -

Suggested change
step_name="main",
step_name="conventionality_evaluation",

'text': TextInputSpec(name='text', min_text_length=10, max_text_length=10000),
'grade': GradeInputSpec(
name='grade',
allowed_grades=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 - 3 -12 only

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants