[RFC] add Environment dataset (taskset) RFC #727

Open
burtenshaw wants to merge 1 commit into huggingface:main from burtenshaw:codex/hf-rl-env-datasets

Conversation

@burtenshaw
Collaborator

This PR adds RFC 006 for Hugging Face RL environment datasets, documenting dataset-root environment.yaml declarations, AutoEnv handling for hf://datasets references, and dataset-bound environment behavior. It also updates the RFC index so the proposal is discoverable.

Validation: git diff --check and git diff --cached --check.

@meta-cla (bot) added the "CLA Signed" label on May 21, 2026
@burtenshaw changed the title from "[codex] add env dataset RFC" to "[RFC] add Environment dataset (taskset) RFC" on May 29, 2026
@burtenshaw marked this pull request as ready for review on May 29, 2026 10:17
Contributor

@Darktex Darktex left a comment

Note: This is an automated review by Claude Code, not a human review.


Alignment Review — PR #727

Tier 1: Spec-Level Issues

  • reset() signature mismatch with cursor: The RFC proposes that env.reset() binds row 0, but OpenEnv's Gymnasium-style signature is reset(seed?, episode_id?). The RFC must show how episode_id maps to dataset row selection: a caller passing episode_id=42 could reasonably expect row 42, not row 0.

  • Reward computation location unclear: The RFC introduces dataset rows with openenv_reset/openenv_step/task column conventions but doesn't specify whether reward computation (which must live inside the environment per RFC 002/004) reads from the dataset row. If the task column carries ground truth for reward logic, the RFC needs to state that the reward rubric lives server-side and the dataset row is input-only.

  • "Exactly one of space_id, image, or package" not validated: The RFC declares this constraint but provides no schema validation approach. It should reference the existing openenv.yaml schema validation pathway or specify how AutoEnv enforces mutual exclusivity at parse time.

  • URI scheme ambiguity: The hf://datasets/{repo_id}/{environment_id}@{revision} format doesn't clarify how {environment_id} resolves — is it a filename or a key within the YAML environments list? Needs to be unambiguous for multiple environments in one repo.

  • Phase 5 ("docs") has no testable deliverables: Phases 1-4 are concrete. Phase 5 should be fleshed out or merged into Phase 4.
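
To make the mutual-exclusivity and URI points above concrete, here is a minimal Python sketch of what parse-time checks could look like. It is not the RFC's or AutoEnv's implementation. The names validate_runtime, parse_env_uri, and EnvRef are hypothetical, and the sketch assumes that repo_id has the form namespace/name, that {environment_id} is a key in the YAML environments list rather than a filename, and that @{revision} is optional.

```python
import re
from dataclasses import dataclass
from typing import Optional

# The three runtime fields the RFC says are mutually exclusive.
RUNTIME_KEYS = ("space_id", "image", "package")


def validate_runtime(entry: dict) -> str:
    """Return the one runtime key set on an environment entry, else raise."""
    present = [key for key in RUNTIME_KEYS if entry.get(key)]
    if len(present) != 1:
        raise ValueError(
            f"expected exactly one of {RUNTIME_KEYS}, got {present or 'none'}"
        )
    return present[0]


@dataclass(frozen=True)
class EnvRef:
    """Hypothetical parsed form of a dataset environment reference."""
    repo_id: str
    environment_id: str
    revision: Optional[str]


# Assumption: repo_id is "namespace/name" and the revision suffix is optional.
_URI = re.compile(
    r"^hf://datasets/(?P<repo>[^/@]+/[^/@]+)/(?P<env>[^/@]+)(?:@(?P<rev>[^@]+))?$"
)


def parse_env_uri(uri: str) -> EnvRef:
    """Split hf://datasets/{repo_id}/{environment_id}@{revision} into parts."""
    match = _URI.match(uri)
    if match is None:
        raise ValueError(f"not a dataset environment URI: {uri!r}")
    return EnvRef(match["repo"], match["env"], match["rev"])
```

Under these assumptions, parse_env_uri("hf://datasets/org/tasks/coding@v1") yields repo_id "org/tasks", environment_id "coding", and revision "v1". Rejecting malformed entries when the YAML is parsed, rather than at launch, would keep the constraint from being purely documentary.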

Tier 2: Alignment Flags

ALIGNMENT FLAG: Dataset cursor iteration may expose reset control to agents

  • Invariant: "Agents cannot reset" (RFC 001)
  • Concern: If an agent can influence which row step() reads next, or if the cursor wraps/is queryable, an agent could learn the dataset has a "restart" boundary. The RFC doesn't specify whether cursor position is hidden from agents.

ALIGNMENT FLAG: Verifiers framework declaration enables external reward computation

  • Invariant: "Rewards inside environment" (RFC 002)
  • Concern: Verifiers computes reward/verification outside the environment boundary. If a dataset-bound environment declares a Verifiers runtime, reward computation would violate the invariant. The RFC needs to either exclude reward from the Verifiers path or define how external scores are re-ingested.

ALIGNMENT FLAG: AutoEnv resolution may break client-server separation

  • Invariant: Client-server separation
  • Concern: If AutoEnv downloads and instantiates the package entry on the client side before a container boundary exists, it may import server-side code from the client context.

ALIGNMENT FLAG: Cursor model and "one env = one trajectory"

  • Invariant: One env = one trajectory (RFC 004)
  • Concern: The RFC should explicitly state that each reset() starts a new trajectory bound to a new row, and the cursor doesn't create mid-episode task switching.
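
One way the RFC could answer the cursor-related flags is sketched below in Python. This is a toy illustration, not the proposed design: the class DatasetBoundEnv and the column names prompt and answer are hypothetical, and treating episode_id as an opaque trajectory label (rather than a row index) is only one of the options the RFC might pick. The sketch keeps the cursor server-side, binds one row per reset(), refuses to wrap silently, and computes reward inside the environment from a ground-truth column that never appears in observations.

```python
from typing import Optional, Sequence


class DatasetBoundEnv:
    """Toy environment in which each reset() binds exactly one dataset row."""

    def __init__(self, rows: Sequence[dict]):
        self._rows = rows
        self._cursor = 0      # server-side state; never placed in observations
        self._row: Optional[dict] = None
        self._done = True

    def reset(self, seed: Optional[int] = None,
              episode_id: Optional[str] = None) -> dict:
        # Assumption: episode_id only labels the trajectory; the cursor,
        # not the caller, decides which row is bound. seed is unused here.
        if self._cursor >= len(self._rows):
            raise IndexError("dataset exhausted")  # no silent wrap-around
        self._row = self._rows[self._cursor]
        self._cursor += 1
        self._done = False
        # The ground-truth "answer" column is deliberately withheld.
        return {"prompt": self._row["prompt"]}

    def step(self, action: str) -> dict:
        if self._done or self._row is None:
            raise RuntimeError("episode finished; call reset() first")
        # Reward is computed inside the environment from the bound row.
        reward = float(action == self._row["answer"])
        self._done = True  # single-step episode: the row cannot change mid-episode
        return {"reward": reward, "done": True}
```

Because step() never advances the cursor and raises once the episode is done, an agent that can only call step() has no way to select or switch rows in this sketch; only the orchestrator's reset() starts a new trajectory.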

Verdict

5 spec issues + 4 alignment flags. The overall direction is sound but needs these clarifications before implementation begins.


Automated review by Claude Code


Labels

CLA Signed This label is managed by the Meta Open Source bot.


2 participants