fix: preserve Observation.metadata in serialize_observation()#617
fix: preserve Observation.metadata in serialize_observation()#617goodmorningsaksham wants to merge 10 commits into
Conversation
metadata was excluded from model_dump() alongside reward and done, but unlike those two it was never re-added to the response dict, silently dropped during serialization. This promotes metadata to a top-level sibling key consistent with reward/done, adds it to StepResult, and updates GenericEnvClient to pass it through. Includes 7 new tests covering preservation, empty-metadata omission, round-trip through client, and subclass observation fields.
Greptile SummaryThis PR correctly fixes a silent metadata loss bug in
Confidence Score: 2/5Not safe to merge — introduces a P0 HTTP-server crash for all environments returning non-empty metadata. A single P0 finding (StepResponse/ResetResponse rejecting the new metadata key with extra="forbid") caps confidence at 2/5. The WebSocket and client-side changes are correct, but the HTTP path regression is a definitive runtime crash, not a theoretical risk. src/openenv/core/env_server/serialization.py (P0 regression) and src/openenv/core/env_server/types.py (StepResponse / ResetResponse must be updated — not in this PR's changeset). Important Files Changed
Sequence DiagramsequenceDiagram
participant Env as Environment
participant Ser as serialize_observation()
participant WS as WebSocket path
participant HTTP as HTTP path
participant Client as GenericEnvClient
Env->>Ser: Observation(metadata={k:v})
Ser-->>WS: {observation, reward, done, metadata} OK
Ser-->>HTTP: {observation, reward, done, metadata} FAIL ValidationError extra=forbid
WS-->>Client: StepResult(metadata={k:v}) OK
HTTP-->>Client: 500 ValidationError crash
|
Darktex
left a comment
There was a problem hiding this comment.
Note: This is an automated review by Claude Code, not a human review.
Alignment Review Report
Automated Checks
- Lint: FAIL (one trailing whitespace issue in the diff; unrelated pre-existing format drift in envs/ not caused by this PR)
- Debug code: CLEAN (no new debug statements introduced)
Tier 1: Fixes Required
-
src/openenv/core/env_server/types.py(line in diff — blank line beforeclass BaseMessage) — trailing whitespace character on the blank separator line introduced betweenStepResponseandBaseMessage.ruff formatwill flag this. The raw diff shows+(whitespace-only added line). Please remove the trailing space.
Tier 2: Alignment Discussion
None identified. The change is a straightforward bug fix: metadata was excluded from the inner observation dict in serialize_observation() but, unlike reward and done, was never re-promoted to a top-level sibling key. The fix mirrors the existing reward/done pattern exactly.
A few minor observations worth noting (not blocking):
-
if metadata:omits empty dicts — Usingif metadata:instead ofif metadata is not None:means an empty{}dict is silently omitted from the wire payload. The PR author states this is intentional (avoid noise) and a test explicitly asserts it (test_empty_metadata_omitted). This is consistent withObservation.metadatadefaulting to{}when not set. Acceptable as stated, but reviewers should be aware the server never sendsmetadata: {}— clients can assume absence equals empty. -
ResetResponsegetsmetadatabut reset path is not fully exercised — The PR correctly addsmetadata: Optional[...]to bothResetResponseandStepResponse, andEnvClient._parse_result()is used for both reset and step responses. However,tests/core/test_serialization.pyonly tests step-path round-trips. A reset-path round-trip test (serialize_observation→_parse_resultfor a reset observation) would complete coverage. Not a blocker. -
StepResultusesOptional[Dict[str, Any]]whileObservation.metadatais neverNone(always a dict) — The asymmetry is intentional and correct:StepResult.metadataisNonewhen the server omits the key entirely (e.g., old server, or empty-dict case above), whereasObservation.metadataon the server is always a dict. The_parse_resultdefault ofpayload.get("metadata")returningNoneis the right sentinel for "not present".
The tests are high quality: they cover preservation, empty-dict omission, promotion to top-level sibling, subclass fields, and the full serialize→parse round-trip.
Summary
- 1 mechanical issue to fix (trailing whitespace in
types.pydiff) - 0 alignment points for human review
- Fix the whitespace nit and this is good to merge
Automated review by Claude Code | Learn more
|
@Darktex I've addressed all the review comments and fixed the trailing Ready for another look when you get a chance. Thank you for your time! |
Darktex
left a comment
There was a problem hiding this comment.
Note: This is an automated review by Claude Code, not a human review.
Alignment Review Report
Automated Checks
- Lint: PASS — ruff check passes on all four modified source files
- Debug code: CLEAN — no debug artifacts in the diff
Tier 1: Fixes Required
-
src/openenv/core/env_server/serialization.py:136-151— The docstring forserialize_observation()describes the return format as{"observation": {...}, "reward": float | None, "done": bool}but does not mention the now-possible"metadata"top-level key. Update the docstring to reflect the new schema (e.g."metadata": dictwhen non-empty). -
src/openenv/core/mcp_client.py:299— Pre-existing but now visibly inconsistent:MchEnvClient._parse_result()constructsStepResult(observation=..., reward=..., done=...)without propagatingmetadatatoStepResult, even though theStepResultdataclass now has ametadatafield. The MCP client encodes env-level metadata inside theObservationobject itself (lines 275, 289, 296), which is a different semantic layer, but callers who call.metadataon theStepResultreturned byMchEnvClientwill always getNone. This is a pre-existing inconsistency that this PR surfaces — worth fixing here or in a follow-up, but should be called out.
Tier 2: Alignment Discussion
None identified. The fix is purely in the server serialization and client parsing layers. It does not:
- Expose reset/simulation controls to agents
- Import server code from client modules
- Change the Gymnasium API signatures (
reset,step,state) - Alter reward computation (metadata is informational, not reward logic)
- Break client-server separation
The approach (promote metadata as a top-level sibling alongside reward and done) is consistent with how the existing reward/done promotion already works in serialize_observation().
Summary
The core bug fix is correct and well-tested (7 tests covering preservation, empty-omission, round-trip, and subclass fields). Two non-blocking issues to address: (1) the serialize_observation() docstring does not describe the new metadata key in its return schema, and (2) MchEnvClient._parse_result() in mcp_client.py does not populate StepResult.metadata from the top-level payload — leaving the new field dead for MCP callers. Neither is a blocker but both should be addressed before or shortly after merge.
- 2 mechanical issues to fix (docstring gap, MCP client not populating
StepResult.metadata) - 0 alignment points for human review
Automated review by Claude Code | Learn more
Darktex
left a comment
There was a problem hiding this comment.
Note: This is an automated review by Claude Code, not a human review.
Alignment Review Report
Automated Checks
- Lint: PASS (lint.sh exited cleanly on PR files)
- Debug code: CLEAN (no debug artifacts in changed files)
Tier 1: Fixes Required
Bug: mcp_client.py — observation.metadata stays empty after this fix
src/openenv/core/mcp_client.py:275,289,296 — All three Observation constructions in _parse_result still read metadata=obs_data.get("metadata", {}), where obs_data is payload.get("observation", {}). After the serializer change, metadata is no longer inside the nested observation dict; it's at the top-level payload. So any MCPToolClient user reading step_result.observation.metadata will always receive {}. The added metadata=payload.get("metadata") line (diff line +110) only propagates metadata to StepResult.metadata, not into the observation object itself. The fix requires changing those three lookups to payload.get("metadata", {}).
Inconsistency: empty-metadata round-trip is lossy
src/openenv/core/env_server/serialization.py:63 — The guard if metadata: omits metadata from the payload when it is {}. On the client, payload.get("metadata") then returns None, so StepResult.metadata is None even though Observation.metadata was {}. Callers that check step_result.metadata is not None will see different behaviour depending on whether metadata is empty or absent. Use if metadata is not None: to preserve the empty-dict case, or document the None/{} equivalence explicitly.
Missing coverage for reset path
ResetResponse in types.py gains a metadata field, but serialize_observation is only called on step responses. There are no tests or updated code paths showing that reset metadata is serialized or propagated to the client. If this field is intentional, the reset serialization path needs the same treatment.
Tier 2: Alignment Discussion
None identified.
Summary
- 3 mechanical issues to fix (mcp_client observation.metadata bug is the blocker)
- 0 alignment points for human review
Automated review by Claude Code | Learn more
- Use \if metadata is not None:\ in serialize_observation() to avoid lossy round-trip when metadata is an empty dict - Fix mcp_client._parse_result() to read metadata from top-level payload instead of nested observation dict - Add metadata field to StepResponse and ResetResponse Pydantic models - Add test_reset_metadata_preserved to cover reset path round-trip
Darktex
left a comment
There was a problem hiding this comment.
Note: This is an automated review by Claude Code, not a human review.
Alignment Review Report
Automated Checks
- Lint: Could not run (branch not checked out locally)
- Debug code: None found in diff
Tier 1: Fixes Required
Critical — if metadata is not None guard is wrong
Observation.metadata is declared as Dict[str, Any] with default_factory=dict in src/openenv/core/env_server/types.py:89-91. It defaults to {}, never to None. The guard added in serialization.py:
if metadata is not None:
result["metadata"] = metadatawill always be True for any Observation instance — an empty dict is not None. This means:
- Empty metadata
{}will always be included in the response, directly contradicting the PR description ("skip when empty") and breakingtest_empty_metadata_omitted. - The correct guard is
if metadata:(falsy check), which skips bothNoneand{}.
Inconsistency: mcp_client.py metadata defaults diverge from generic_client.py
The PR sets metadata=payload.get("metadata", {}) in mcp_client.py when constructing Observation subclasses, but sets metadata=payload.get("metadata") (default None) for StepResult.metadata. This creates a mismatch: if the server omits metadata, StepResult.metadata is None but observation.metadata is {}. These should be consistent — use payload.get("metadata") throughout or normalize to None/{} explicitly.
Missing test: ResetResponse serialization path
The test test_reset_metadata_preserved constructs ResetResponse(**serialized) directly, but ResetResponse has extra="forbid". The PR's types.py diff adds metadata to ResetResponse and StepResponse — verify those additions are present, since without them http_server.py:601 (ResetResponse(**serialize_observation(observation))) will raise a ValidationError at runtime for any non-empty metadata.
No test for the reset path through http_server.py
serialize_observation is called at http_server.py:601 and 639. The test suite only directly tests _parse_result. A round-trip test through ResetResponse(**serialize_observation(...)) and StepResponse(**serialize_observation(...)) is needed to catch the extra="forbid" failure mode.
Tier 2: Alignment Discussion
ALIGNMENT FLAG: Metadata promotion may need an RFC
- Principle at stake: PRINCIPLES.md — "Type safety with generics and Pydantic across the wire"; INVARIANTS.md Invariant 3 (Pydantic serialization)
- The concern: Promoting
metadatato a top-level sibling ofobservation/reward/donechanges the wire contract. The existing pattern inPATTERNS.mdshows_parse_resultreadingpayload["observation"]for all env-specific data. Metadata is now split across the wire protocol. This is a backward-compatible add (new optional key), but it diverges from the documented pattern for custom clients that override_parse_result. - Suggested reviewer: @Darktex
Summary
- 3 mechanical issues to fix (wrong None guard, mcp/generic inconsistency, missing integration test path)
- 1 alignment point for human review (wire contract change)
Automated review by Claude Code | Learn more
Darktex
left a comment
There was a problem hiding this comment.
Note: This is an automated review by Claude Code, not a human review.
Alignment Review Report
Automated Checks
- Lint (ruff check): PASS on all changed source files
- Lint (ruff format): FAIL —
tests/core/test_serialization.pyis missing a trailing newline at EOF (line 286 of the diff ends without\n);ruff format --checkwould reformat this file - Debug code: CLEAN — no new debug code introduced
Tier 1: Fixes Required
-
src/openenv/core/mcp_client.py:110,119,128(diff lines) — Runtime ValidationError bug: The three lines that constructListToolsObservation,CallToolObservation, andObservationnow passmetadata=payload.get("metadata"), which returnsNonewhen the key is absent. All three Pydantic models declaremetadata: Dict[str, Any](notOptional) with adefault_factory=dict. PassingNoneraisespydantic.ValidationError: Input should be a valid dictionaryat runtime — this will break every_parse_resultcall for environments that don't populate metadata (the common case after this fix). The correct fix ismetadata=payload.get("metadata") or {}. -
tests/core/test_serialization.py:286— Missing trailing newline at end of file.ruff format --checkfails on files without a trailing newline, which will break the pre-commit and CI lint hooks.
Tier 2: Alignment Discussion
None identified. The fix correctly mirrors the existing reward/done promotion pattern in serialize_observation(), is additive and backwards-compatible (StepResult.metadata defaults to None), and the change appropriately touches only the serialization/client layer without exposing any simulation controls to agents or violating client-server separation.
Summary
- 2 mechanical issues to fix (1 runtime bug, 1 formatting)
- 0 alignment points for human review
The core idea is correct — the bug is real and the approach is right. The mcp_client.py change just needs or {} as a fallback to keep the Pydantic models valid when metadata is absent from the payload.
Automated review by Claude Code | Learn more
Darktex
left a comment
There was a problem hiding this comment.
Note: This is an automated review by Claude Code, not a human review.
Summary
serialize_observation() was silently dropping Observation.metadata — a real bug confirmed by the PR author's production debugging. The fix is structurally correct: promote metadata to a top-level sibling key, mirror the pattern used for reward and done, add the field to StepResult, ResetResponse, StepResponse, and both clients. The approach is right; two mechanical issues need fixing before merge.
Tier 1 — Bugs / Quality
-
src/openenv/core/env_server/serialization.py(post-patch, around line 178) —if metadata:is a truthiness check, not aNonecheck. Anymetadatavalue that is falsy — e.g.{"active": False},{"count": 0},{"flag": ""}— passes theis not Noneguard but is silently dropped. Change toif metadata is not None:to match the intent. -
src/openenv/core/mcp_client.py— Three call sites now usepayload.get("metadata") or {}to populateobservation.metadata. Theor {}coercesNone(absent key) to{}, which meansobservation.metadatais always a dict (neverNone), whereasStepResult.metadatain the same function usespayload.get("metadata")which returnsNonewhen absent. This is an internal inconsistency and also diverges fromGenericEnvClient, which correctly returnsNone. Decide on a canonical "no metadata" sentinel (recommendNoneeverywhere for consistency withOptional[Dict]) and apply it uniformly. -
tests/core/test_serialization.py— No test covers the falsy-metadata edge cases (e.g.metadata={"done": False},metadata={"score": 0}). A test asserting these are preserved (not omitted) must be added alongside theif metadata is not None:fix.
Tier 2 — Alignment
Aligned. The change propagates environment-produced metadata through the serialization layer to the client — this is firmly within the "rewards inside environment" principle (metadata is domain knowledge originating in the environment, not injected externally). No simulation controls are exposed, client-server separation is respected, no new API surface deviates from the Gymnasium pattern, and no RFC-level trade-off is touched.
Verdict
Request changes: fix the if metadata: truthiness bug and the or {} inconsistency in mcp_client.py, and add a test for falsy-valued metadata before merging.
Automated review by Claude Code | Learn more
…alsy metadata test
|
Hi @Darktex , I applied your suggested changes, can you please review them? |
Darktex
left a comment
There was a problem hiding this comment.
Note: This is an automated review by Claude Code, not a human review.
Alignment Review — PR #617
Automated Checks
- Lint (ruff check): PASS — all 5 changed source files pass clean
- Format (ruff format): PASS
- Import sort (usort): PASS
- Debug code: CLEAN
Tier 1: Bugs / Issues
T1-1. mcp_client.py — passing None to non-optional Dict[str, Any] field will crash
src/openenv/core/mcp_client.py — the PR changes obs_data.get("metadata", {}) → payload.get("metadata") for all three observation constructors (ListToolsObservation, CallToolObservation, generic Observation). But the metadata field on these Pydantic models is typed Dict[str, Any] (non-optional, default {}). When payload has no top-level `"metadata"` key (i.e., every existing environment server), payload.get("metadata") returns None, which will raise a Pydantic ValidationError. This is a runtime regression. Fix: use payload.get("metadata") or {} or payload.get("metadata", {}).
T1-2. {} (empty dict) vs None inconsistency
src/openenv/core/env_server/serialization.py — the guard if metadata is not None: will include {} in the output because Observation.metadata defaults to dict (never None). The PR description says "skip when empty" but the code doesn't skip empty dicts. Meanwhile, StepResult.metadata defaults to None. This creates a mismatch: server always sends "metadata": {}, client defaults to None. Fix: either change guard to if metadata: (truthy, skips {}), or change StepResult.metadata default to {}.
T1-3. Stale docstring in serialize_observation()
src/openenv/core/env_server/serialization.py — the docstring still shows the old return format without metadata. Update to reflect the new top-level metadata key.
T1-4. Verify ResetResponse/StepResponse updates landed
src/openenv/core/env_server/types.py — the diff adds metadata: Optional[Dict[str, Any]] to both models. Both have model_config = ConfigDict(extra="forbid"). If these updates are missing from the actual file, any response containing "metadata" will raise a Pydantic ValidationError on the HTTP path. The serializer and type updates must land atomically.
T1-5. Verify generic_client.py update landed
src/openenv/core/generic_client.py — the diff adds metadata=payload.get("metadata") to StepResult(...). The StepResult dataclass gains a metadata field — but GenericEnvClient._parse_result() must be updated atomically or the field will always be None.
Tier 2: Alignment Flags
🚩 ALIGNMENT FLAG: Wire contract change without version documentation
- Invariant: "Pydantic serialization: All wire types must be Pydantic models. Serialization must be JSON-compatible."
- Concern: Adding
"metadata"as a new top-level key in everystep/resetresponse changes the wire format for all environments. Any client parsing with strict schema validation (extra="forbid") will break if the type updates aren't atomic. Pre-1.0 this is acceptable per breaking-change policy, but it needs documentation in a changelog entry. - Reviewer: @Darktex
🚩 ALIGNMENT FLAG: mcp_client.py breaks backward compatibility for existing servers
- Invariant: "Minimize lifecycle deltas: Training → Evals → Production should use identical interfaces"
- Concern: Before this PR, environments embedded metadata inside
obs_data(the observation dict). The PR moves reading topayload(top-level). Un-upgraded server Docker images will continue embedding metadata insideobs_data. The new client won't read it from there — metadata silently becomesNone/{}with no error. This is silent data loss in distributed deployments. Consider a backward-compatibility shim: try top-level first, fall back toobs_data. - Reviewer: @Darktex
Verdict: Request Changes
T1-1 is critical: passing None to a non-optional Dict[str, Any] Pydantic field in mcp_client.py will raise ValidationError at runtime on every response from an un-upgraded server — i.e., every existing deployment. This must be fixed before merge. T1-2 (empty dict inconsistency) should also be resolved for a clean contract.
Automated review by Claude Code | Learn more
Summary
serialize_observation() excluded metadata from model_dump() alongside reward and done, but unlike those two, metadata was never re-added to the response dict, it was silently dropped.
This promotes metadata to a top-level sibling key consistent with reward/done, adds it to StepResult, and updates GenericEnvClient to pass it through.
Fixes #616
How I found this
I built an OpenEnv environment (https://huggingface.co/spaces/saksham1771/infinite-dom) for the OpenEnv Hackathon India 2026.
Our environment returns task progress via Observation.metadata, things like how many graph nodes the agent has completed, which checkpoints are done. During live evaluation over WebSocket, this metadata was silently absent on the client side.
After debugging, I traced it to serialize_observation() excluding metadata without re-adding it.
Fix
The fix mirrors what the function already does for reward and done:
Changes
Test plan