Implement multi-component reward schema and REPL rubric metadata by adithya-s-k · Pull Request #625 · huggingface/OpenEnv

adithya-s-k · 2026-04-29T18:33:26Z

Summary

This PR implements the first end-to-end slice of RFC 006 by adding a standard reward-component schema in core rubrics and integrating it into repl_env reward computation.

What changed

Added core reward component types and helpers:
- RewardComponentType
- RewardComponent
- aggregate_weighted_sum(...)
- serialize_reward_components(...)
Exported these APIs from openenv.core.rubrics
Updated REPLRubric to:
- compute structured reward components per step
- aggregate to scalar reward (existing behavior preserved)
- attach metadata fields:
  - reward_components
  - reward_total
  - reward_aggregation
Added core tests for reward component schema/helpers
Extended REPL env tests to validate reward metadata payloads

Compatibility

observation.reward remains the scalar optimization target
Component metadata is additive and backward-compatible

Validation

uv run ruff check on touched files passed
tests/core/test_rubrics/test_reward_components.py passed
tests/envs/test_repl_env.py was skipped at collection in this environment

Tier 1 – Fix required: float | None union syntax in components.py (line 38) requires Python 3.10+ at runtime; other model files use Optional[T], indicating the project targets Python < 3.10. This will raise TypeError on 3.8/3.9.
Tier 1 – Fix required: REPLRubric._attach_reward_metadata mutates observation.metadata as a side-effect of forward. Rubric forward is documented/contracted to return a float; observation construction is the environment's responsibility per PATTERNS.md. This breaks composability and surprises downstream Transform callers.
Tier 2 – Alignment: RFC 006 is referenced as the backing design (rfcs/ has only 000–005). Per PRINCIPLES.md, key decisions require a committed RFC before implementation proceeds; @darktex should confirm RFC [RFC] RFC 006: Multi-Component Reward Signals #624 is approved before merge.

Confidence Score: 3/5

Not safe to merge without fixing the Python 3.9 incompatibility and the observation-mutation side-effect in REPLRubric.forward.

Two P1 findings: a syntax incompatibility that breaks Python <3.10 at class-definition time, and an observation-mutation side-effect in the rubric that violates the documented responsibility boundary. Additionally, the backing RFC 006 is not yet committed to the repo. Multiple P1s pull the score below the 4/5 ceiling to 3/5.

src/openenv/core/rubrics/components.py (Python version compat) and envs/repl_env/rubrics.py (_attach_reward_metadata side-effect).

Important Files Changed

Filename	Overview
src/openenv/core/rubrics/components.py	New Pydantic schema for reward components — runtime-incompatible `float
envs/repl_env/rubrics.py	REPLRubric refactored to emit structured components; scalar behavior preserved, but `_attach_reward_metadata` mutates observation as a side-effect of `forward`, crossing responsibility boundaries.
src/openenv/core/rubrics/init.py	Exports new component types and helpers from `openenv.core.rubrics`; clean additive change with correct `__all__` entries.
tests/core/test_rubrics/test_reward_components.py	Unit tests for `RewardComponent` schema and helpers; covers weighted value defaults, explicit overrides, aggregation, and serialization.
tests/envs/test_repl_env.py	Extends existing REPL env tests to assert reward metadata payloads; these tests were skipped at collection per PR description, so no runtime confirmation they pass.

Sequence Diagram

sequenceDiagram
    participant Env as REPLEnvironment
    participant Rubric as REPLRubric
    participant EC as evaluate_components()
    participant AWS as aggregate_weighted_sum()
    participant ARM as _attach_reward_metadata()
    participant Obs as observation.metadata

    Env->>Rubric: forward(action, observation)
    Rubric->>EC: evaluate_components(action, observation)
    EC-->>Rubric: [RewardComponent(...)]
    Rubric->>AWS: aggregate_weighted_sum(components)
    AWS-->>Rubric: total: float
    Rubric->>ARM: _attach_reward_metadata(observation, components, total)
    ARM->>Obs: observation.metadata[reward_components] = ...
    ARM->>Obs: observation.metadata[reward_total] = total
    ARM->>Obs: observation.metadata[reward_aggregation] = weighted_sum_v1
    Rubric-->>Env: return total (float)

Prompt To Fix All With AI

This is a comment left during a code review.
Path: src/openenv/core/rubrics/components.py
Line: 38-40

Comment:
**Python <3.10 runtime incompatibility**

`float | None` (PEP 604 union syntax) requires Python 3.10+ for *runtime* evaluation. Without `from __future__ import annotations`, this annotation is evaluated eagerly at class definition time, raising `TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'` on Python 3.8/3.9. The rest of the codebase (e.g. `models.py`) consistently uses `Optional[T]`, signaling Python < 3.10 is supported. Either add `from __future__ import annotations` at the top of this file, or replace the union with `Optional[float]`.

```suggestion
    weighted_value: Optional[float] = Field(
        default=None,
        description="Optional explicit weighted value (defaults to value * weight)",
    )
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/openenv/core/rubrics/components.py
Line: 13-14

Comment:
**Inconsistent `typing` imports alongside new-style union syntax**

The file imports `Dict` and `List` from `typing` (Python 3.8 compat style), but simultaneously uses `float | None` PEP-604 syntax (Python 3.10+). These two typing styles target incompatible Python version floors. Align to one style — either migrate all annotations to the modern built-in generics with `from __future__ import annotations`, or keep everything `typing`-module style (`Dict`, `List`, `Optional[float]`).

```suggestion
from typing import Any, Dict, List, Optional
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: envs/repl_env/rubrics.py
Line: 217-230

Comment:
**ALIGNMENT FLAG: Rubric directly mutates observation**

`_attach_reward_metadata` writes `reward_components`, `reward_total`, and `reward_aggregation` directly onto `observation.metadata` as a side-effect of reward computation. PATTERNS.md shows observations being constructed and returned by the environment (`step`), not mutated downstream. A rubric whose `forward` now carries an observation-mutation side-effect creates a responsibility leak: any caller of `REPLRubric.forward` that did not expect its observation to be modified (e.g., a Transform in the pipeline, a test double, or a future rubric composition) will silently receive a mutated object.

- **Principle at stake**: Separation of reward computation from observation construction (PATTERNS.md "Reward Computation" section; INVARIANTS.md "Rewards in environment")
- **The concern**: Rubric `forward` is documented to return a `float`; attaching metadata side-effects to the observation couples reward diagnostics to observation mutation in a non-obvious way and breaks composability
- **Suggested reviewer**: `@darktex`

**Context Used:** .claude/docs/INVARIANTS.md ([source](https://app.greptile.com/review/custom-context?memory=dbd1ab5e-bd4d-4701-9de0-9817404155a9))

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: envs/repl_env/rubrics.py
Line: 224-226

Comment:
**Silent metadata clobber on empty-dict metadata**

`getattr(observation, "metadata", None) or {}` evaluates to a *new* empty dict when `observation.metadata` is `{}` (empty dicts are falsy). Any keys already present would be discarded. Prefer an explicit `None` check to make the intent clear.

```suggestion
        metadata = getattr(observation, "metadata", None)
        if metadata is None:
            metadata = {}
```

How can I resolve this? If you propose a fix, please make it concise.

_{Reviews (1): Last reviewed commit: "Add multi-component reward schema and RE..." | Re-trigger Greptile}

greptile-apps · 2026-04-29T18:42:03Z

+    weighted_value: float | None = Field(
+        default=None,
+        description="Optional explicit weighted value (defaults to value * weight)",


Python <3.10 runtime incompatibility

float | None (PEP 604 union syntax) requires Python 3.10+ for runtime evaluation. Without from __future__ import annotations, this annotation is evaluated eagerly at class definition time, raising TypeError: unsupported operand type(s) for |: 'type' and 'NoneType' on Python 3.8/3.9. The rest of the codebase (e.g. models.py) consistently uses Optional[T], signaling Python < 3.10 is supported. Either add from __future__ import annotations at the top of this file, or replace the union with Optional[float].

Suggested change

weighted_value: float | None = Field(

default=None,

description="Optional explicit weighted value (defaults to value * weight)",

weighted_value: Optional[float] = Field(

default=None,

description="Optional explicit weighted value (defaults to value * weight)",

)

Prompt To Fix With AI

This is a comment left during a code review. Path: src/openenv/core/rubrics/components.py Line: 38-40 Comment: **Python <3.10 runtime incompatibility** `float | None` (PEP 604 union syntax) requires Python 3.10+ for *runtime* evaluation. Without `from __future__ import annotations`, this annotation is evaluated eagerly at class definition time, raising `TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'` on Python 3.8/3.9. The rest of the codebase (e.g. `models.py`) consistently uses `Optional[T]`, signaling Python < 3.10 is supported. Either add `from __future__ import annotations` at the top of this file, or replace the union with `Optional[float]`. ```suggestion weighted_value: Optional[float] = Field( default=None, description="Optional explicit weighted value (defaults to value * weight)", ) ``` How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-04-29T18:42:04Z

+from enum import Enum
+from typing import Any, Dict, List


Inconsistent typing imports alongside new-style union syntax

The file imports Dict and List from typing (Python 3.8 compat style), but simultaneously uses float | None PEP-604 syntax (Python 3.10+). These two typing styles target incompatible Python version floors. Align to one style — either migrate all annotations to the modern built-in generics with from __future__ import annotations, or keep everything typing-module style (Dict, List, Optional[float]).

Suggested change

from enum import Enum

from typing import Any, Dict, List

from typing import Any, Dict, List, Optional

Prompt To Fix With AI

This is a comment left during a code review. Path: src/openenv/core/rubrics/components.py Line: 13-14 Comment: **Inconsistent `typing` imports alongside new-style union syntax** The file imports `Dict` and `List` from `typing` (Python 3.8 compat style), but simultaneously uses `float | None` PEP-604 syntax (Python 3.10+). These two typing styles target incompatible Python version floors. Align to one style — either migrate all annotations to the modern built-in generics with `from __future__ import annotations`, or keep everything `typing`-module style (`Dict`, `List`, `Optional[float]`). ```suggestion from typing import Any, Dict, List, Optional ``` How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-04-29T18:42:05Z

+    def _attach_reward_metadata(
+        self,
+        observation: Any,
+        components: list[RewardComponent],
+        total: float,
+    ) -> None:
+        """Attach component diagnostics to observation metadata if available."""
+        if not hasattr(observation, "metadata"):
+            return
+        metadata = getattr(observation, "metadata", None) or {}
+        metadata["reward_components"] = serialize_reward_components(components)
+        metadata["reward_total"] = total
+        metadata["reward_aggregation"] = "weighted_sum_v1"
+        observation.metadata = metadata


ALIGNMENT FLAG: Rubric directly mutates observation

_attach_reward_metadata writes reward_components, reward_total, and reward_aggregation directly onto observation.metadata as a side-effect of reward computation. PATTERNS.md shows observations being constructed and returned by the environment (step), not mutated downstream. A rubric whose forward now carries an observation-mutation side-effect creates a responsibility leak: any caller of REPLRubric.forward that did not expect its observation to be modified (e.g., a Transform in the pipeline, a test double, or a future rubric composition) will silently receive a mutated object.

Principle at stake: Separation of reward computation from observation construction (PATTERNS.md "Reward Computation" section; INVARIANTS.md "Rewards in environment")

The concern: Rubric forward is documented to return a float; attaching metadata side-effects to the observation couples reward diagnostics to observation mutation in a non-obvious way and breaks composability

Suggested reviewer: @darktex

Context Used: .claude/docs/INVARIANTS.md (source)

Prompt To Fix With AI

This is a comment left during a code review. Path: envs/repl_env/rubrics.py Line: 217-230 Comment: **ALIGNMENT FLAG: Rubric directly mutates observation** `_attach_reward_metadata` writes `reward_components`, `reward_total`, and `reward_aggregation` directly onto `observation.metadata` as a side-effect of reward computation. PATTERNS.md shows observations being constructed and returned by the environment (`step`), not mutated downstream. A rubric whose `forward` now carries an observation-mutation side-effect creates a responsibility leak: any caller of `REPLRubric.forward` that did not expect its observation to be modified (e.g., a Transform in the pipeline, a test double, or a future rubric composition) will silently receive a mutated object. - **Principle at stake**: Separation of reward computation from observation construction (PATTERNS.md "Reward Computation" section; INVARIANTS.md "Rewards in environment") - **The concern**: Rubric `forward` is documented to return a `float`; attaching metadata side-effects to the observation couples reward diagnostics to observation mutation in a non-obvious way and breaks composability - **Suggested reviewer**: `@darktex` **Context Used:** .claude/docs/INVARIANTS.md ([source](https://app.greptile.com/review/custom-context?memory=dbd1ab5e-bd4d-4701-9de0-9817404155a9)) How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-04-29T18:42:06Z

+        if not hasattr(observation, "metadata"):
+            return
+        metadata = getattr(observation, "metadata", None) or {}


Silent metadata clobber on empty-dict metadata

getattr(observation, "metadata", None) or {} evaluates to a new empty dict when observation.metadata is {} (empty dicts are falsy). Any keys already present would be discarded. Prefer an explicit None check to make the intent clear.

Suggested change

if not hasattr(observation, "metadata"):

return

metadata = getattr(observation, "metadata", None) or {}

metadata = getattr(observation, "metadata", None)

if metadata is None:

metadata = {}

Prompt To Fix With AI

This is a comment left during a code review. Path: envs/repl_env/rubrics.py Line: 224-226 Comment: **Silent metadata clobber on empty-dict metadata** `getattr(observation, "metadata", None) or {}` evaluates to a *new* empty dict when `observation.metadata` is `{}` (empty dicts are falsy). Any keys already present would be discarded. Prefer an explicit `None` check to make the intent clear. ```suggestion metadata = getattr(observation, "metadata", None) if metadata is None: metadata = {} ``` How can I resolve this? If you propose a fix, please make it concise.

Add multi-component reward schema and REPL integration

348181f

meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 29, 2026

greptile-apps Bot reviewed Apr 29, 2026

View reviewed changes

sergiopaniego mentioned this pull request May 6, 2026

Add Coding Environments (Jupyter, Terminus, and Coding Tools) #626

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement multi-component reward schema and REPL rubric metadata#625

Implement multi-component reward schema and REPL rubric metadata#625
adithya-s-k wants to merge 1 commit into
huggingface:mainfrom
adithya-s-k:feat/reaward-multi-component-rewards

adithya-s-k commented Apr 29, 2026

Uh oh!

greptile-apps Bot commented Apr 29, 2026

Uh oh!

greptile-apps Bot Apr 29, 2026

Uh oh!

greptile-apps Bot Apr 29, 2026

Uh oh!

greptile-apps Bot Apr 29, 2026

Uh oh!

greptile-apps Bot Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	from enum import Enum
	from typing import Any, Dict, List
	from typing import Any, Dict, List, Optional

Conversation

adithya-s-k commented Apr 29, 2026

Summary

What changed

Compatibility

Validation

Related

Uh oh!

greptile-apps Bot commented Apr 29, 2026

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant