Skip to content

Implement multi-component reward schema and REPL rubric metadata#625

Open
adithya-s-k wants to merge 1 commit into
huggingface:mainfrom
adithya-s-k:feat/reaward-multi-component-rewards
Open

Implement multi-component reward schema and REPL rubric metadata#625
adithya-s-k wants to merge 1 commit into
huggingface:mainfrom
adithya-s-k:feat/reaward-multi-component-rewards

Conversation

@adithya-s-k
Copy link
Copy Markdown
Contributor

Summary

This PR implements the first end-to-end slice of RFC 006 by adding a standard reward-component schema in core rubrics and integrating it into repl_env reward computation.

What changed

  • Added core reward component types and helpers:
    • RewardComponentType
    • RewardComponent
    • aggregate_weighted_sum(...)
    • serialize_reward_components(...)
  • Exported these APIs from openenv.core.rubrics
  • Updated REPLRubric to:
    • compute structured reward components per step
    • aggregate to scalar reward (existing behavior preserved)
    • attach metadata fields:
      • reward_components
      • reward_total
      • reward_aggregation
  • Added core tests for reward component schema/helpers
  • Extended REPL env tests to validate reward metadata payloads

Compatibility

  • observation.reward remains the scalar optimization target
  • Component metadata is additive and backward-compatible

Validation

  • uv run ruff check on touched files passed
  • tests/core/test_rubrics/test_reward_components.py passed
  • tests/envs/test_repl_env.py was skipped at collection in this environment

Related

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 29, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 29, 2026

Greptile Summary

This PR adds a RewardComponent schema and helpers to openenv.core.rubrics, then wires REPLRubric to produce structured component metadata alongside the existing scalar reward. The scalar optimization target is preserved and the new fields are additive to observation.metadata.

  • Tier 1 – Fix required: float | None union syntax in components.py (line 38) requires Python 3.10+ at runtime; other model files use Optional[T], indicating the project targets Python < 3.10. This will raise TypeError on 3.8/3.9.
  • Tier 1 – Fix required: REPLRubric._attach_reward_metadata mutates observation.metadata as a side-effect of forward. Rubric forward is documented/contracted to return a float; observation construction is the environment's responsibility per PATTERNS.md. This breaks composability and surprises downstream Transform callers.
  • Tier 2 – Alignment: RFC 006 is referenced as the backing design (rfcs/ has only 000–005). Per PRINCIPLES.md, key decisions require a committed RFC before implementation proceeds; @darktex should confirm RFC [RFC] RFC 006: Multi-Component Reward Signals #624 is approved before merge.

Confidence Score: 3/5

Not safe to merge without fixing the Python 3.9 incompatibility and the observation-mutation side-effect in REPLRubric.forward.

Two P1 findings: a syntax incompatibility that breaks Python <3.10 at class-definition time, and an observation-mutation side-effect in the rubric that violates the documented responsibility boundary. Additionally, the backing RFC 006 is not yet committed to the repo. Multiple P1s pull the score below the 4/5 ceiling to 3/5.

src/openenv/core/rubrics/components.py (Python version compat) and envs/repl_env/rubrics.py (_attach_reward_metadata side-effect).

Important Files Changed

Filename Overview
src/openenv/core/rubrics/components.py New Pydantic schema for reward components — runtime-incompatible `float
envs/repl_env/rubrics.py REPLRubric refactored to emit structured components; scalar behavior preserved, but _attach_reward_metadata mutates observation as a side-effect of forward, crossing responsibility boundaries.
src/openenv/core/rubrics/init.py Exports new component types and helpers from openenv.core.rubrics; clean additive change with correct __all__ entries.
tests/core/test_rubrics/test_reward_components.py Unit tests for RewardComponent schema and helpers; covers weighted value defaults, explicit overrides, aggregation, and serialization.
tests/envs/test_repl_env.py Extends existing REPL env tests to assert reward metadata payloads; these tests were skipped at collection per PR description, so no runtime confirmation they pass.

Sequence Diagram

sequenceDiagram
    participant Env as REPLEnvironment
    participant Rubric as REPLRubric
    participant EC as evaluate_components()
    participant AWS as aggregate_weighted_sum()
    participant ARM as _attach_reward_metadata()
    participant Obs as observation.metadata

    Env->>Rubric: forward(action, observation)
    Rubric->>EC: evaluate_components(action, observation)
    EC-->>Rubric: [RewardComponent(...)]
    Rubric->>AWS: aggregate_weighted_sum(components)
    AWS-->>Rubric: total: float
    Rubric->>ARM: _attach_reward_metadata(observation, components, total)
    ARM->>Obs: observation.metadata[reward_components] = ...
    ARM->>Obs: observation.metadata[reward_total] = total
    ARM->>Obs: observation.metadata[reward_aggregation] = weighted_sum_v1
    Rubric-->>Env: return total (float)
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/openenv/core/rubrics/components.py
Line: 38-40

Comment:
**Python <3.10 runtime incompatibility**

`float | None` (PEP 604 union syntax) requires Python 3.10+ for *runtime* evaluation. Without `from __future__ import annotations`, this annotation is evaluated eagerly at class definition time, raising `TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'` on Python 3.8/3.9. The rest of the codebase (e.g. `models.py`) consistently uses `Optional[T]`, signaling Python < 3.10 is supported. Either add `from __future__ import annotations` at the top of this file, or replace the union with `Optional[float]`.

```suggestion
    weighted_value: Optional[float] = Field(
        default=None,
        description="Optional explicit weighted value (defaults to value * weight)",
    )
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/openenv/core/rubrics/components.py
Line: 13-14

Comment:
**Inconsistent `typing` imports alongside new-style union syntax**

The file imports `Dict` and `List` from `typing` (Python 3.8 compat style), but simultaneously uses `float | None` PEP-604 syntax (Python 3.10+). These two typing styles target incompatible Python version floors. Align to one style — either migrate all annotations to the modern built-in generics with `from __future__ import annotations`, or keep everything `typing`-module style (`Dict`, `List`, `Optional[float]`).

```suggestion
from typing import Any, Dict, List, Optional
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: envs/repl_env/rubrics.py
Line: 217-230

Comment:
**ALIGNMENT FLAG: Rubric directly mutates observation**

`_attach_reward_metadata` writes `reward_components`, `reward_total`, and `reward_aggregation` directly onto `observation.metadata` as a side-effect of reward computation. PATTERNS.md shows observations being constructed and returned by the environment (`step`), not mutated downstream. A rubric whose `forward` now carries an observation-mutation side-effect creates a responsibility leak: any caller of `REPLRubric.forward` that did not expect its observation to be modified (e.g., a Transform in the pipeline, a test double, or a future rubric composition) will silently receive a mutated object.

- **Principle at stake**: Separation of reward computation from observation construction (PATTERNS.md "Reward Computation" section; INVARIANTS.md "Rewards in environment")
- **The concern**: Rubric `forward` is documented to return a `float`; attaching metadata side-effects to the observation couples reward diagnostics to observation mutation in a non-obvious way and breaks composability
- **Suggested reviewer**: `@darktex`

**Context Used:** .claude/docs/INVARIANTS.md ([source](https://app.greptile.com/review/custom-context?memory=dbd1ab5e-bd4d-4701-9de0-9817404155a9))

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: envs/repl_env/rubrics.py
Line: 224-226

Comment:
**Silent metadata clobber on empty-dict metadata**

`getattr(observation, "metadata", None) or {}` evaluates to a *new* empty dict when `observation.metadata` is `{}` (empty dicts are falsy). Any keys already present would be discarded. Prefer an explicit `None` check to make the intent clear.

```suggestion
        metadata = getattr(observation, "metadata", None)
        if metadata is None:
            metadata = {}
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "Add multi-component reward schema and RE..." | Re-trigger Greptile

Comment on lines +38 to +40
weighted_value: float | None = Field(
default=None,
description="Optional explicit weighted value (defaults to value * weight)",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Python <3.10 runtime incompatibility

float | None (PEP 604 union syntax) requires Python 3.10+ for runtime evaluation. Without from __future__ import annotations, this annotation is evaluated eagerly at class definition time, raising TypeError: unsupported operand type(s) for |: 'type' and 'NoneType' on Python 3.8/3.9. The rest of the codebase (e.g. models.py) consistently uses Optional[T], signaling Python < 3.10 is supported. Either add from __future__ import annotations at the top of this file, or replace the union with Optional[float].

Suggested change
weighted_value: float | None = Field(
default=None,
description="Optional explicit weighted value (defaults to value * weight)",
weighted_value: Optional[float] = Field(
default=None,
description="Optional explicit weighted value (defaults to value * weight)",
)
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/openenv/core/rubrics/components.py
Line: 38-40

Comment:
**Python <3.10 runtime incompatibility**

`float | None` (PEP 604 union syntax) requires Python 3.10+ for *runtime* evaluation. Without `from __future__ import annotations`, this annotation is evaluated eagerly at class definition time, raising `TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'` on Python 3.8/3.9. The rest of the codebase (e.g. `models.py`) consistently uses `Optional[T]`, signaling Python < 3.10 is supported. Either add `from __future__ import annotations` at the top of this file, or replace the union with `Optional[float]`.

```suggestion
    weighted_value: Optional[float] = Field(
        default=None,
        description="Optional explicit weighted value (defaults to value * weight)",
    )
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +13 to +14
from enum import Enum
from typing import Any, Dict, List
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Inconsistent typing imports alongside new-style union syntax

The file imports Dict and List from typing (Python 3.8 compat style), but simultaneously uses float | None PEP-604 syntax (Python 3.10+). These two typing styles target incompatible Python version floors. Align to one style — either migrate all annotations to the modern built-in generics with from __future__ import annotations, or keep everything typing-module style (Dict, List, Optional[float]).

Suggested change
from enum import Enum
from typing import Any, Dict, List
from typing import Any, Dict, List, Optional
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/openenv/core/rubrics/components.py
Line: 13-14

Comment:
**Inconsistent `typing` imports alongside new-style union syntax**

The file imports `Dict` and `List` from `typing` (Python 3.8 compat style), but simultaneously uses `float | None` PEP-604 syntax (Python 3.10+). These two typing styles target incompatible Python version floors. Align to one style — either migrate all annotations to the modern built-in generics with `from __future__ import annotations`, or keep everything `typing`-module style (`Dict`, `List`, `Optional[float]`).

```suggestion
from typing import Any, Dict, List, Optional
```

How can I resolve this? If you propose a fix, please make it concise.

Comment thread envs/repl_env/rubrics.py
Comment on lines +217 to +230
def _attach_reward_metadata(
self,
observation: Any,
components: list[RewardComponent],
total: float,
) -> None:
"""Attach component diagnostics to observation metadata if available."""
if not hasattr(observation, "metadata"):
return
metadata = getattr(observation, "metadata", None) or {}
metadata["reward_components"] = serialize_reward_components(components)
metadata["reward_total"] = total
metadata["reward_aggregation"] = "weighted_sum_v1"
observation.metadata = metadata
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 ALIGNMENT FLAG: Rubric directly mutates observation

_attach_reward_metadata writes reward_components, reward_total, and reward_aggregation directly onto observation.metadata as a side-effect of reward computation. PATTERNS.md shows observations being constructed and returned by the environment (step), not mutated downstream. A rubric whose forward now carries an observation-mutation side-effect creates a responsibility leak: any caller of REPLRubric.forward that did not expect its observation to be modified (e.g., a Transform in the pipeline, a test double, or a future rubric composition) will silently receive a mutated object.

  • Principle at stake: Separation of reward computation from observation construction (PATTERNS.md "Reward Computation" section; INVARIANTS.md "Rewards in environment")
  • The concern: Rubric forward is documented to return a float; attaching metadata side-effects to the observation couples reward diagnostics to observation mutation in a non-obvious way and breaks composability
  • Suggested reviewer: @darktex

Context Used: .claude/docs/INVARIANTS.md (source)

Prompt To Fix With AI
This is a comment left during a code review.
Path: envs/repl_env/rubrics.py
Line: 217-230

Comment:
**ALIGNMENT FLAG: Rubric directly mutates observation**

`_attach_reward_metadata` writes `reward_components`, `reward_total`, and `reward_aggregation` directly onto `observation.metadata` as a side-effect of reward computation. PATTERNS.md shows observations being constructed and returned by the environment (`step`), not mutated downstream. A rubric whose `forward` now carries an observation-mutation side-effect creates a responsibility leak: any caller of `REPLRubric.forward` that did not expect its observation to be modified (e.g., a Transform in the pipeline, a test double, or a future rubric composition) will silently receive a mutated object.

- **Principle at stake**: Separation of reward computation from observation construction (PATTERNS.md "Reward Computation" section; INVARIANTS.md "Rewards in environment")
- **The concern**: Rubric `forward` is documented to return a `float`; attaching metadata side-effects to the observation couples reward diagnostics to observation mutation in a non-obvious way and breaks composability
- **Suggested reviewer**: `@darktex`

**Context Used:** .claude/docs/INVARIANTS.md ([source](https://app.greptile.com/review/custom-context?memory=dbd1ab5e-bd4d-4701-9de0-9817404155a9))

How can I resolve this? If you propose a fix, please make it concise.

Comment thread envs/repl_env/rubrics.py
Comment on lines +224 to +226
if not hasattr(observation, "metadata"):
return
metadata = getattr(observation, "metadata", None) or {}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Silent metadata clobber on empty-dict metadata

getattr(observation, "metadata", None) or {} evaluates to a new empty dict when observation.metadata is {} (empty dicts are falsy). Any keys already present would be discarded. Prefer an explicit None check to make the intent clear.

Suggested change
if not hasattr(observation, "metadata"):
return
metadata = getattr(observation, "metadata", None) or {}
metadata = getattr(observation, "metadata", None)
if metadata is None:
metadata = {}
Prompt To Fix With AI
This is a comment left during a code review.
Path: envs/repl_env/rubrics.py
Line: 224-226

Comment:
**Silent metadata clobber on empty-dict metadata**

`getattr(observation, "metadata", None) or {}` evaluates to a *new* empty dict when `observation.metadata` is `{}` (empty dicts are falsy). Any keys already present would be discarded. Prefer an explicit `None` check to make the intent clear.

```suggestion
        metadata = getattr(observation, "metadata", None)
        if metadata is None:
            metadata = {}
```

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant