Skip to content

feat QED Math Environment with MCP tools and LLM-judge rubric#446

Open
rycerzes wants to merge 30 commits into
huggingface:mainfrom
rycerzes:feat/qed-math-env
Open

feat QED Math Environment with MCP tools and LLM-judge rubric#446
rycerzes wants to merge 30 commits into
huggingface:mainfrom
rycerzes:feat/qed-math-env

Conversation

@rycerzes
Copy link
Copy Markdown
Contributor

@rycerzes rycerzes commented Mar 17, 2026

Summary

Adds the QED Math Environment, a mathematical proof generation and evaluation env ported from QED-Nano. This environment deeply integrates both MCP tools (as the sole agent API) and LLM-judge rubrics (structured 0–7 grading with normalized rewards), making it a reference implementation for the upcoming MCP + Rubrics features.

What's included

Environment (qed_math_env)

  • QEDMathEnvironment - extends MCPEnvironment; manages problem lifecycle, dataset loading, and proof submission
  • MathProofRubric - extends openenv.core.rubrics.base.Rubric; LLM-judge grading via OpenAI-compatible endpoints with <score>N</score> parsing, retry logic, and optional score thresholding
  • QEDMathEnv client - extends MCPToolClient with typed ProblemObservation / ProofSubmissionObservation models
  • 3 MCP tools: get_problem, submit_proof, get_grading_guidelines

Key features

  • Flexible dataset loading: local JSONL/JSON, Hugging Face Hub, or built-in bootstrap problems
  • Answer-mode verification: math_verify-based \boxed{} checking (no LLM call needed)
  • Reward shaping: discount factor (γ^tokens), length penalty (buffer zone), optional score thresholding (collapses 1–5 → 1)
  • Reasoning stripping: configurable delimiters (e.g. </think>) removed before grading
  • Multi-step problems: configurable max attempts with per-attempt feedback
  • Verifier metrics: QED-Nano-compatible metrics (verifier/rollouts/success, verifier/failures/*, latency, token counts) in observation metadata
  • Evaluator prompts: template-based (prompts/evaluator_prompts/v2.md)

Testing

Test file (test_qed_math_environment.py) covers:

Area Tests
MCP tools ListToolsAction, CallToolAction for all 3 tools
Rubric grading Mocked MathProofRubric.grade(), score→reward normalization
Answer verification Correct/wrong/missing \boxed{} via math_verify
Reward shaping Discount factor, length penalty, score thresholding
Reasoning stripping </think> delimiter handling, fallback behavior
Dataset loading Local JSONL, problem ID selection, answer-mode \boxed{} wrapping
Multi-step problems Retry logic with attempt tracking
Original problem field RC stream compatibility
Verifier metrics Metrics on GradingResult and submission payload
Server integration Full HTTP/WebSocket stack (marked @pytest.mark.integration)
PYTHONPATH=src:envs uv run pytest tests/envs/test_qed_math_environment.py -v

TODO before review & merge

  • Commit test_qed_math_environment.py
  • Add inference example script (examples/qed_math_inference.py)
  • Consider adding GRPO training example (separate PR)

Env Inference example with logs

PYTHONPATH=src:envs uv run python examples/qed_math_inference.py

Output

QED Math server: http://localhost:8000
API:   https://api.groq.com/openai/v1
Model: glm-5
Tools: ['get_problem', 'submit_proof', 'get_grading_guidelines']
Episodes: 3
============================================================
Episode 1  |  problem_id=(random)  |  type=proof
============================================================
--- Step 1 ---
Tool: get_problem({})
Result: {"problem": "Prove that the sum of two even integers is even.", ...
--- Step 2 ---
Tool: get_grading_guidelines({})
Result: {"grading_guidelines": "Award full credit for a correct parity argument.", ...
--- Step 3 ---
Tool: submit_proof({"proof": "**Proof.**\n\nAn integer is *even* if and only if it can be expressed as $2k$...
Result: {"done": true, "reward": 1.0, ...
Outcome: CORRECT  score=7/7  reward=1.000

CC: @burtenshaw

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 17, 2026
Copy link
Copy Markdown
Contributor Author

@rycerzes rycerzes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR implements the env properly with rubrics and MCP, both tested.

Merge order:

  1. #456 must be merged into main first - it fixes bug #455 in the current main
  2. This branch can then be rebased onto main and merged (it has already cherry-picked the patch from #456

The GRPO post training code would be a different PR.
@burtenshaw

@rycerzes rycerzes marked this pull request as ready for review March 26, 2026 11:03
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 26, 2026

Greptile Summary

This PR adds the QED Math Environment — a mathematical proof generation and evaluation env that uses MCP tools as the sole agent API and an LLM-judge rubric (0–7 scale) for grading. It is a well-structured reference implementation that introduces MathVerifierService (process-pool based, async-safe), MathProofRubric, flexible dataset loading, and a comprehensive test suite.

Several concerns raised in previous review rounds have been addressed: step_count is now incremented in step_async, unseeded reset() now calls random.choice, get_problem_payload() gates reference_solution to answer-mode only, and the signal-based timeout issue is resolved by moving answer verification into worker processes. Two new issues remain:

  • reset() still leaks reference_solution for proof-mode problems (qed_math_environment.py:850): the guard applied to get_problem_payload() was not mirrored in the reset() return path. Since QEDMathEnv.reset() normalizes and returns this to the caller, proof-mode agents can see the grading key at episode start.
  • output_length_tokens is agent-controllable in the submit_proof MCP tool (mcp_server.py:46): the discount factor and length penalty are applied using a token count supplied by the agent. An agent that omits or sets this to zero bypasses all reward shaping, undermining the integrity of the shaped reward signal.

Additionally, the _verify_math static method on QEDMathEnvironment is dead code — the live grading path goes through MathVerifierService — and should be removed to avoid confusion with the signal-unsafe pattern it embodies.

Confidence Score: 4/5

Two P1 issues should be resolved before merging: reference_solution leak in reset() and agent-controllable reward shaping via output_length_tokens.

The environment is functionally solid and prior review concerns have largely been addressed. The two remaining P1 issues directly affect training validity: the reset() reference_solution leak undermines proof-mode isolation, and the agent-supplied output_length_tokens in the submit_proof tool breaks reward integrity by letting the agent opt out of shaping.

envs/qed_math_env/server/qed_math_environment.py (reset() reference_solution leak, dead _verify_math method) and envs/qed_math_env/server/mcp_server.py (output_length_tokens in submit_proof tool).

Important Files Changed

Filename Overview
envs/qed_math_env/server/qed_math_environment.py Core environment implementation — two P1 issues: reset() leaks reference_solution for proof-mode (fix applied to get_problem_payload was not mirrored here), and dead _verify_math static method shadowing the correct process-based verifier path.
envs/qed_math_env/server/mcp_server.py MCP tool registration — P1 alignment issue: submit_proof exposes output_length_tokens as an agent-controllable parameter, allowing agents to bypass or manipulate reward shaping.
envs/qed_math_env/server/rubric.py LLM judge rubric with 0-7 scoring, retry logic, and verifier metrics — well structured; broad except Exception in _call_llm was flagged in a prior thread and remains.
envs/qed_math_env/server/math_verify_service.py Process-pool based answer verifier — correctly isolates signal-based timeouts to worker processes, includes lazy start, retry, pool restart, and metrics; no issues found.
envs/qed_math_env/client.py Typed client wrapper over MCPToolClient — normalizes observations into typed models; no issues beyond inheriting the reference_solution leak from reset().
src/openenv/core/mcp_client.py Adds GenericMCPObservation with extra=allow to pass through env-specific fields when no typed observation class matches; small whitespace fix in docstring.
tests/envs/test_qed_math_environment.py Comprehensive test suite covering rubric grading, answer verification, reward shaping, dataset loading, and multi-step logic; autouse fixture patches OpenAI client cleanly.

Sequence Diagram

sequenceDiagram
    participant Agent
    participant QEDMathEnv as QEDMathEnv (client)
    participant Server as QEDMathEnvironment (server)
    participant MathVerifier as MathVerifierService
    participant LLMJudge as MathProofRubric (LLM)

    Agent->>QEDMathEnv: reset(problem_id?)
    QEDMathEnv->>Server: POST /reset
    Server-->>QEDMathEnv: ProblemObservation (problem, guidelines, [ref_solution if answer-mode])
    QEDMathEnv-->>Agent: ProblemObservation

    Agent->>QEDMathEnv: call_tool(get_problem)
    QEDMathEnv->>Server: CallToolAction get_problem()
    Server-->>QEDMathEnv: payload (problem, guidelines, [ref_solution if answer-mode])
    QEDMathEnv-->>Agent: ProblemObservation

    Agent->>QEDMathEnv: call_tool(submit_proof, proof=...)
    QEDMathEnv->>Server: CallToolAction submit_proof(proof)

    alt evaluation_mode == answer
        Server->>MathVerifier: verify_answer(prediction, gold)
        MathVerifier-->>Server: VerifyResponse (correct/wrong)
    else evaluation_mode == proof
        Server->>LLMJudge: grade(proof, problem, ref_solution, guidelines)
        LLMJudge-->>Server: GradingResult (score 0-7, reward 0-1)
    end

    Server->>Server: _apply_reward_shaping(reward, output_length_tokens)
    Server-->>QEDMathEnv: ProofSubmissionObservation (score, reward, done, feedback)
    QEDMathEnv-->>Agent: ProofSubmissionObservation
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: envs/qed_math_env/server/qed_math_environment.py
Line: 848-860

Comment:
**`reset()` leaks `reference_solution` for proof-mode problems**

`get_problem_payload()` (line 1063) was correctly fixed to gate `reference_solution` on `evaluation_mode == "answer"`. However, the `reset()` method still returns a `ProblemObservation` with `reference_solution` unconditionally — for every problem type including `proof` and `multi_step`.

In the `QEDMathEnv` client (`client.py`), the result of `reset()` is normalized into a `ProblemObservation` and returned to the caller. In the inference example (`examples/qed_math_inference.py`) the agent drives the episode from the reset observation, so `reference_solution` is visible at episode start even without calling `get_problem`.

Apply the same guard used in `get_problem_payload`:

```suggestion
        return ProblemObservation(
            problem=self._current_problem.get("problem", ""),
            reference_solution=(
                self._current_problem.get("reference_solution", "")
                if self._current_problem.get("evaluation_mode", "proof") == "answer"
                else ""
            ),
            grading_guidelines=parse_schema(
                self._current_problem.get("grading_guidelines", "") or ""
            ),
```

**ALIGNMENT FLAG**: Exposes the scoring key to the agent via the reset path for `proof`-type problems, undermining training validity.
- **Principle at stake**: Agent isolation invariant — agents must not access simulation controls or grading keys outside answer-mode.
- **The concern**: `get_problem_payload()` was guarded, but the `reset()` HTTP response is also agent-accessible and contains the same leak.
- **Suggested reviewer**: `@darktex`

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: envs/qed_math_env/server/mcp_server.py
Line: 46-55

Comment:
**`output_length_tokens` is agent-controllable, undermining reward shaping**

The `submit_proof` MCP tool exposes `output_length_tokens` as an agent-supplied parameter. This creates a reward integrity problem:

1. An adversarial agent can pass the default value of `0` to bypass the discount factor and length penalty entirely, receiving the unmodified base reward regardless of actual generation length.
2. Even well-intentioned agents will omit this parameter in most integrations, silently disabling shaping when a non-trivial `discount_factor` or `buffer_tokens` is configured.

The token count of the agent's generation should be determined server-side (e.g., passed by the training harness that dispatches the HTTP step action, or measured on the proof text itself) rather than supplied by the agent through the MCP tool interface.

**ALIGNMENT FLAG**: The reward-inside-environment principle is violated because the agent can choose its own reward-shaping inputs.
- **Principle at stake**: Reward integrity — the environment is the sole authority on reward values.
- **The concern**: Discount factor and length penalty become optional opt-ins from the agent's perspective.
- **Suggested reviewer**: `@darktex`

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: envs/qed_math_env/server/qed_math_environment.py
Line: 1070-1115

Comment:
**`_verify_math` static method is dead code**

`QEDMathEnvironment._verify_math` implements answer verification via `math_verify` with a signal-based timeout, but it is never called in any code path. Answer-mode grading routes through `_grade_answer_submission``_verifier_service.verify_answer()` (the process-pool based service), which is the correct async-safe implementation.

This method is also not imported or exercised in the test suite. Consider removing it to avoid the maintenance confusion of two parallel verification implementations — especially since a previous review comment already flagged that the signal-based timeout (`signal.SIGALRM`) is unsafe in async contexts.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (2): Last reviewed commit: "Merge branch 'main' into feat/qed-math-e..." | Re-trigger Greptile

Comment thread envs/qed_math_env/server/qed_math_environment.py
Comment thread envs/qed_math_env/server/qed_math_environment.py Outdated
Comment thread envs/qed_math_env/server/qed_math_environment.py
Comment thread envs/qed_math_env/server/qed_math_environment.py Outdated
Comment thread envs/qed_math_env/server/qed_math_environment.py
Comment thread envs/qed_math_env/server/qed_math_environment.py
Comment thread envs/qed_math_env/server/rubric.py
@burtenshaw
Copy link
Copy Markdown
Collaborator

@rycerzes nice work. #456 is now merged to main.

Could you please address the greplit review. Mainly:

  • step count on async path
  • reset randomness bug
  • reference solution leakage

rycerzes added 17 commits March 28, 2026 21:53
- add QEDMathAction and QEDMathObservation dataclasses in models.py
- implement QEDMathEnv client with reset, step, get_problem,
submit_proof
- main env class
- mcp server tools
- step & reset logic
- impl MathProofRubric
- LLM Grading Logic
- rubric config in env
- map problem data structure
- support multiple types of problems
- wss handling
- client methods
- `submit_proof` method accepts an optional `output_length_tokens` parameter for reward shaping.
- Introduced `remove_reasoning` function to strip reasoning traces from model output.
- Added `length_penalty` function to compute penalties for overlong sequences.
- Adjusted grading logic to apply discount factors and penalties based on token count.
- implement metrics aggregation
dockerfile
@rycerzes rycerzes force-pushed the feat/qed-math-env branch from d810587 to f54cc91 Compare March 28, 2026 16:23
…solution handling

- update tests for answer mode
- integrate with QED Math environment
- Fix MCP client parsing bug where reset/step observations were coerced
into base Observation, dropping env-specific fields
- tests
Copy link
Copy Markdown
Contributor Author

@rycerzes rycerzes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved answer-mode verification into a process-isolated worker pool (math_verify_service.py) since Math-Verify's signal-based timeouts are unsafe in threaded contexts. The service uses a bounded ProcessPoolExecutor with heartbeat monitoring, dead-worker restart, request-ID ownership to discard stale responses, and deterministic status-to-reward mapping so infra failures score 0 without crashing rollouts. Gold targets are pre-parsed and cached by problem_id. Proof-mode grading and the client API are unchanged.

Fixed

  • step count on async path
  • reset randomness bug
  • reference solution leakage

@burtenshaw

@burtenshaw
Copy link
Copy Markdown
Collaborator

Thanks @rycerzes . I'll set of the auto review then try it out tomorrow.

@greptile-apps

Comment thread envs/qed_math_env/server/qed_math_environment.py
Comment thread envs/qed_math_env/server/mcp_server.py Outdated
@rycerzes
Copy link
Copy Markdown
Contributor Author

fixed the leak and reward integrity

rycerzes added a commit to 3xcaffeine/qed-math-openenv that referenced this pull request Apr 2, 2026
rycerzes added a commit to 3xcaffeine/qed-math-openenv that referenced this pull request Apr 2, 2026
rycerzes pushed a commit to 3xcaffeine/qed-math-openenv that referenced this pull request Apr 2, 2026
@rycerzes
Copy link
Copy Markdown
Contributor Author

@burtenshaw if you could PTAL :)

rycerzes added a commit to rycerzes/OpenEnv that referenced this pull request Apr 22, 2026
Copy link
Copy Markdown
Contributor

@Darktex Darktex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This is an automated review by Claude Code, not a human review.


Tier 1 — Bugs / Lint

  • envs/qed_math_env/server/qed_math_environment.py + math_verify_service.py — duplicate UnparsableException / NoAnswerException / EmptyBoxedException classes with different base classes (VerificationError vs Exception). The copy in qed_math_environment.py is only reachable from the static, never-called _verify_math. Consolidate or remove the duplicates.
  • qed_math_environment.py::_verify_math is dead code — @staticmethod, never invoked outside tests. Either expose as a public utility (and document) or remove and refactor the tests to exercise _grade_answer_submission with a mocked verifier.
  • envs/qed_math_env/server/rubric.pyMathProofRubric.__init__ defaults grader_model="gemini-2.0-flash", but QEDMathConfig and openenv.yaml both default to gemini-3-pro. Direct instantiation of MathProofRubric silently uses a different model. Align the defaults or add an explicit comment.
  • envs/qed_math_env/prompts/evaluator_prompts/v2.md — template exposes {problem}, {marking_scheme}, {solution} but no {human_solution} slot, while MathProofRubric._build_prompt passes human_solution=reference_solution to .format(), which silently ignores it. Likely intentional for the IMO rubric — add a one-line comment so it doesn't read as a forgotten placeholder.
  • envs/qed_math_env/pyproject.toml — missing trailing comma after "trackio>=0.19.0". Valid TOML, but inconsistent with all surrounding entries.

Tier 2 — Alignment (questions for human reviewers)

  • Widening GenericMCPObservation in shared core: src/openenv/core/mcp_client.py switches to extra="allow". This weakens the wire-level contract for every MCP environment, not just QED Math. Should this instead be a per-env override of _parse_result in qed_math_env/client.py (the typed-client pattern in PATTERNS.md)? The _as_problem_observation helpers already in client.py suggest a typed override was considered.
  • output_length_tokens injection via step_async kwargs: read from kwargs by the training harness and consumed inside submit_proof_payload to drive discount-factor reward shaping. This is an out-of-band channel that bypasses the typed harness interface. If the WebSocket step path doesn't enforce session-level authorization, an agent-side caller could inject this value to manipulate reward shaping. Is there an existing threat-model doc, or does this warrant one?
  • RFC requirement: The PR introduces a non-trivial new grading architecture (dual-mode proof/answer paths, process-pool verifier, LLM judge with retries) and touches shared core (mcp_client.py). Per project policy, architectural changes warrant an RFC. Is qed_math_env exempt because it's an envs/ addition, or should a short RFC document the design before this lands?

Overall

High-quality, well-documented port. The mechanical Tier-1 items are small but concrete; the core-layer GenericMCPObservation widening and the output_length_tokens channel are the two items I'd want a human sign-off on before merge.


Automated review by Claude Code | Learn more

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants