[RFC] add Environment dataset (taskset) RFC #727
Darktex left a comment
Note: This is an automated review by Claude Code, not a human review.
Alignment Review — PR #727
Tier 1: Spec-Level Issues
- `reset()` signature mismatch with cursor: The RFC proposes `env.reset()` binds row 0, but the Gymnasium API signature is `reset(seed?, episode_id?)`. The RFC must show how `episode_id` maps to dataset row selection: a caller passing `episode_id=42` would expect row 42, not row 0.
- Reward computation location unclear: The RFC introduces dataset rows with `openenv_reset`/`openenv_step`/`task` column conventions but doesn't specify whether reward computation (which must live inside the environment per RFC 002/004) reads from the dataset row. If the `task` column carries ground truth for reward logic, the RFC needs to state that the reward rubric lives server-side and the dataset row is input-only.
- "Exactly one of `space_id`, `image`, or `package`" not validated: The RFC declares this constraint but provides no schema validation approach. It should reference the existing `openenv.yaml` schema validation pathway or specify how AutoEnv enforces mutual exclusivity at parse time.
- URI scheme ambiguity: The `hf://datasets/{repo_id}/{environment_id}@{revision}` format doesn't clarify how `{environment_id}` resolves. Is it a filename or a key within the YAML `environments` list? This needs to be unambiguous when one repo holds multiple environments.
- Phase 5 ("docs") has no testable deliverables: Phases 1-4 are concrete. Phase 5 should be fleshed out or merged into Phase 4.
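To make the mutual-exclusivity point concrete, here is a minimal sketch of a parse-time check. It is not taken from the RFC or from AutoEnv: `EnvironmentSource` and `parse_environment_source` are hypothetical names, and the only thing assumed from the RFC is that an entry carries at most the three keys `space_id`, `image`, and `package`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EnvironmentSource:
    """Hypothetical container for the three mutually exclusive source fields."""

    space_id: Optional[str] = None
    image: Optional[str] = None
    package: Optional[str] = None


def parse_environment_source(entry: dict) -> EnvironmentSource:
    """Reject an entry unless exactly one of space_id, image, package is set."""
    keys = ("space_id", "image", "package")
    # Treat missing keys, None, and empty strings as "not set".
    present = [k for k in keys if entry.get(k) not in (None, "")]
    if len(present) != 1:
        raise ValueError(
            f"expected exactly one of {', '.join(keys)}; got {present or 'none'}"
        )
    return EnvironmentSource(**{present[0]: entry[present[0]]})
```

Failing at parse time, rather than when the environment is first launched, would keep an ambiguous declaration from ever reaching the resolution step.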
Tier 2: Alignment Flags
ALIGNMENT FLAG: Dataset cursor iteration may expose reset control to agents
- Invariant: "Agents cannot reset" (RFC 001)
- Concern: If an agent can influence which row `step()` reads next, or if the cursor wraps or is queryable, an agent could learn the dataset has a "restart" boundary. The RFC doesn't specify whether cursor position is hidden from agents.
ALIGNMENT FLAG: Verifiers framework declaration enables external reward computation
- Invariant: "Rewards inside environment" (RFC 002)
- Concern: Verifiers computes reward/verification outside the environment boundary. If a dataset-bound environment declares a Verifiers runtime, reward computation would violate the invariant. The RFC needs to either exclude reward from the Verifiers path or define how external scores are re-ingested.
ALIGNMENT FLAG: AutoEnv resolution may break client-server separation
- Invariant: Client-server separation
- Concern: If AutoEnv downloads and instantiates the `package` entry on the client side before a container boundary exists, it may import server-side code from the client context.
ALIGNMENT FLAG: Cursor model and "one env = one trajectory"
- Invariant: One env = one trajectory (RFC 004)
- Concern: The RFC should explicitly state that each `reset()` starts a new trajectory bound to a new row, and the cursor doesn't create mid-episode task switching.
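One possible reading of the cursor semantics, offered only as a sketch for discussion: the row is bound at `reset()` and stays fixed until the next `reset()`, so `step()` can never switch tasks mid-episode. `DatasetCursor` and its methods are hypothetical names. Treating `episode_id` as a row index and raising on exhaustion instead of wrapping are both assumptions, not behavior the RFC specifies.

```python
from typing import Optional, Sequence


class DatasetCursor:
    """Hypothetical server-side cursor; none of these names come from the RFC."""

    def __init__(self, rows: Sequence[dict]) -> None:
        if not rows:
            raise ValueError("dataset must contain at least one row")
        self._rows = rows
        self._next = 0  # row to bind when reset() gets no episode_id
        self._active: Optional[dict] = None

    def reset(self, episode_id: Optional[int] = None) -> dict:
        """Start a new trajectory bound to exactly one row."""
        # Assumption: episode_id, when given, is a row index.
        index = self._next if episode_id is None else episode_id
        # Assumption: no wrap-around; running past the last row is an error.
        if not 0 <= index < len(self._rows):
            raise IndexError(f"row {index} out of range")
        self._active = self._rows[index]
        self._next = index + 1
        return self._active

    def current_task(self) -> dict:
        """Row used by every step() of the active trajectory."""
        if self._active is None:
            raise RuntimeError("reset() must be called before step()")
        return self._active
```

Keeping the cursor on the server and exposing only the bound row to the agent would also address the first flag, since cursor position never appears in an observation.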
Verdict
5 spec issues + 4 alignment flags. The overall direction is sound but needs these clarifications before implementation begins.
This PR adds RFC 006 for Hugging Face RL environment datasets, documenting dataset-root `environment.yaml` declarations, AutoEnv handling for `hf://datasets` references, and dataset-bound environment behavior. It also updates the RFC index so the proposal is discoverable.

Validation: `git diff --check` and `git diff --cached --check`.