fix: handle null base64_image in MCP screenshot responses by dzorlu · Pull Request #476 · huggingface/OpenEnv

dzorlu · 2026-03-29T20:16:45Z

Summary

Add null check for parsed["base64_image"] in _extract_tool_result
When Fleet MCP returns {"base64_image": null} (failed screenshot capture), return a text error message instead of creating an image_url block with url: None
Prevents downstream AttributeError in SkyRL's extract_images_from_conversation

Root cause

_extract_tool_result unconditionally wraps parsed["base64_image"] in an OpenAI image_url block. When the value is None, this creates {"url": None} which crashes .startswith() in the consumer.

Test plan

Verify VL training handles failed screenshots gracefully (text error instead of crash)
Confirm working screenshots still produce valid image_url blocks

🤖 Generated with Claude Code

- FleetTaskEnv wraps FleetEnvClient with task-oriented interface - Accepts task configs from export_training_tasks.py - Creates versioned environments on reset - Injects task prompt into observations - Executes verifier for reward computation on episode completion - Supports both sync and async step methods - Factory functions: make_fleet_task_env, from_json_file - Tests: 20 unit tests for init, specs, verifiers, factories 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The MCP images don't exist for all environment versions, causing FleetVersionNotFoundError when trying to create environments. Changing the default to None allows the Fleet SDK to use standard images which are available for all versions. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

FleetEnvClient.from_fleet() was not accepting data_key/data_version parameters, causing them to be passed through **kwargs to HTTPEnvClient which doesn't accept them. - Add data_key and data_version as explicit parameters - Pass them to fleet.make() - Update task_env.py to pass them separately 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Fleet SDK expects data_key in "key:version" format, not as separate parameters. Updated from_fleet() to combine them before calling fleet.make(). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

HTTPEnvClient.reset() doesn't support seed parameter yet. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Increases default timeout from 15s to 60s for Fleet API calls. This prevents timeouts during environment initialization. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Previously reset() did partial work and reset_async() added tool fetching. Now reset_async() does all the work (including fetching tools) and reset() is just a sync wrapper that calls it via run_until_complete(). This ensures both methods return identical results including tools. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

MCP's call_tool() returns a CallToolResult Pydantic object, not plain text. This was causing ugly repr strings to be passed to agents like: "meta=None content=[TextContent(type='text', text='...')] ..." Now properly extracts: - Text content from result.content[].text - Tries JSON parsing for structured results - Falls back to structuredContent if available - Handles isError cases 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Tests for: - FleetMCPClient._extract_tool_result(): - Single text content extraction - JSON parsing from text - Multiple text contents - Error result handling - Structured content fallback - Empty result handling - FleetTaskEnv reset: - reset_async() returns tools - reset() calls reset_async() (sync wrapper) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Move fleet.make() and list_tools() into FleetTaskEnv.__init__() - Tools are now fetched at env creation, not during reset - reset_async() calls _orch.reset() with error handling, returns cached tools - Use asyncio.run() for Python 3.13 compatibility - Update tests for new initialization pattern 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Log task_key and verifier code preview when verifier fails - Catch syntax errors separately with clear message - Show which functions were found if 'verify' is missing Helps debug issues like "Verifier code must define a 'verify' function" 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Replace custom _execute_verifier_local() with Fleet SDK's Task.verify_detailed() which properly sets up the verifier namespace with: - Environment type annotation - Helper functions (normalized_contains, etc.) - Proper function discovery (not just "verify" function) This fixes "name 'Environment' is not defined" errors during verifier execution. Changes: - _compute_reward: Create Fleet SDK Task and call verify_detailed() - Support both 'verifier_code' and 'verifier_func' field names - Add comprehensive logging for debugging - Remove broken _execute_verifier_local method Tests: - Update all verifier tests to mock Fleet SDK Task.verify_detailed() - Add tests for various edge cases (no verifier, no orch, exceptions) - Fix fixture to avoid asyncio.run() conflicts with pytest-asyncio 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…context

- Add retry with exponential backoff (3 attempts, 1s/2s/4s delays) - Log errors instead of silently swallowing exceptions - Log warning when some clients fail but others succeed - Log error after all retries exhausted This fixes silent failures when MCP connections are flaky, which caused 'no tools found' errors in SkyRL training.

call_tool now retries with exponential backoff (3 attempts, 1s/2s/4s) on connection errors, similar to list_tools. ValueError (tool not found) is not retried. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Adds exponential backoff retry (3 attempts, 2s base delay) around fleet.make() to handle transient Fleet API errors like health check failures that can occur during instance provisioning. Only retries on transient errors (health check, timeout, connection). Permanent errors are raised immediately. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add Toolathlon-style context management tools for long trajectories: - check_context: Check visible/total turn counts - manage_context: Drop old turns to free up context space - search_history: Search all history (including dropped) - search_tool_output: Search truncated tool output - view_tool_output: Paginate through truncated output The ContextManager class can be used by any training framework that maintains chat_history. It tracks full history and handles truncated tool outputs. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Computer-use tasks require MCP-enabled container images (e.g., famazon:mcp0.0.7) which have scrot installed for screenshots and the MCP server with 'computer' tool for mouse/keyboard control.

Previously, tools were only fetched for tool_use modality due to a restrictive condition. This caused computer_use tasks to fail with "no tools found in observation" because the computer tool (mouse, keyboard, screenshot) was never fetched. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

When task_modality is computer_use, filter tools to only include the 'computer' tool. This prevents the model from using API tools when it should be using mouse/keyboard control. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

AsyncFleet.make() is not truly non-blocking — diagnostics confirmed the event loop stays healthy while make() blocks internally. Switch to running sync Fleet.make() in a thread pool via asyncio.to_thread() to guarantee non-blocking behavior. Also removes the separate sync Fleet.instance() call since we already have a sync env handle from make(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… loop Three sync calls were blocking the event loop during concurrent trajectories: 1. self._orch.reset() — sync HTTP POST to manager API, now uses reset_async() which runs in a thread pool 2. fleet_task.verify_detailed() — sync verifier execution, now wrapped in asyncio.to_thread() 3. self._fleet_env.close() — sync instance termination, now has close_async() Also reverts AsyncFleet.make() (was correctly async all along — diagnostics confirmed the event loop stays healthy during make() calls). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Adds create_trace_job() and upload_trace() to fleet_env module. These allow SkyRL eval to send conversation traces (with screenshots) to the Fleet API for viewing in the Fleet UI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…r UI rendering Images were being passed as OpenAI format (type: "image_url") but Fleet's ingest API expects (type: "image", mime_type, data). The API then uploads base64 to S3 and the UI renders them full-size via OpinionatedImage. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The ingest API determines session status from score presence, not the status field. Without score, all sessions stay as in_progress. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

New HintGenerator module that produces concise hints from failed rollout data to rescue GRPO signal on hard tasks. Three modes: - Option B: LLM call with verifier code + tool errors + chat_history - Option C: LLM call with verifier code only (cacheable per-task) - Option D: LLM call synthesizing tool errors + verifier failure FleetTaskEnv changes: - Accumulate tool errors during step_async() - Capture verifier error details in _compute_reward() - New properties: verifier_code, tool_errors, verifier_error - New reset_for_hint_async(hint) for hinted rollouts (reuses provisioned instance, resets DB to seed, appends hint to prompt) - compute_hint_reward() utility: R = (1 - raw_score) * hint_score Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

This reverts commit fc0508f.

feat: Add Logfire error tracking to fleet env

When `partial_reward=True`, failed verifier runs compute a fractional score from the error/success accumulators instead of binary 0/1. Passing tasks are unaffected. Off by default. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: partial reward support

SkyRL can end trajectories early (context overflow, its own max_turns) without OpenEnv knowing. Previously the model got 0 reward even if the task was completed. Now close_async()/close() run the verifier when step_async() never computed a reward, storing the result in self.final_reward for SkyRL to read. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: Run verifier at close() for orphaned rollouts

Carlisle tasks (354 total, 8 in eval) require models to call submit_final_answer to submit results, but this tool is a harness-level synthetic injected by the orchestrator's SessionWorkflow, not an MCP tool. OpenEnv connects directly to MCP servers, so the tool was missing — causing 0% scores across all carlisle tasks in training. Changes: - Inject submit_final_answer into tool list when prompt references it - Intercept calls locally (not routed to MCP), store the answer - Pass final_answer to verifier via Fleet SDK's verify_detailed() - Run verifier in close()/close_async() for orphaned rollouts - Add unit tests for the synthetic tool Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: Add submit_final_answer synthetic tool for carlisle tasks

Runs k × m rollouts of generated tasks on Fleet environments. Given (prompt, verifier_code, env_key), creates FleetTaskEnv instances, runs agent loops with model inference, and returns structured results for reward computation (learnability variance + model separation). Used as the inner loop of the task-scaling RL pipeline in SkyRL. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Instead of calling Anthropic directly and running a local agent loop, the evaluator now: 1. Imports the generated task via fleet.import_task() 2. Creates a harness job via fleet.create_job() 3. Polls for completion 4. Extracts per-session verifier scores from job sessions Uses real Fleet model IDs (claude-sonnet-4.5, claude-opus-4.5) instead of the broken weak/strong mapping that required ANTHROPIC_API_KEY. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fleet harness POST /v1/jobs requires model IDs in 'provider/model' format (e.g., 'anthropic/claude-sonnet-4.5'), not just the model name. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The sync time.sleep() in _poll_job blocked the asyncio event loop, preventing trajectory timeouts from cancelling evaluations. Using asyncio.sleep() allows the event loop to properly handle cancellations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fleet returns session model IDs without provider prefix (e.g., "claude-sonnet-4.5") while we configure them with prefix ("anthropic/claude-sonnet-4.5"). Added _match_model_id() to normalize and match by bare model name, so scores land in the correct results_per_model bucket. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: Add TaskEvaluator for task generation inner loop

Add _tool_errors, _verifier_stdout, _verifier_error to FleetTaskEnv so SkyRL can build hints from failed rollout feedback without an LLM call. - Tool errors accumulated in step_async() on MCP errors and exceptions - Verifier stdout/error captured in _compute_reward() after verifier runs - Verifier exceptions also captured in _verifier_error (not just failures) - All feedback properties reset in reset_async() to prevent cross-episode leakage - Properties: verifier_stdout, verifier_error, tool_errors_list Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Expose verifier feedback properties for hint generation

Add describe_db() and query_db() methods (sync + async) to FleetEnvClient, delegating to the Fleet SDK's SQLiteResource. This enables querying provisioned environment databases (seed/current) from outside the container via HTTP, which is needed for task generation workflows where the model explores DB data before designing tasks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

describe_db_async/query_db_async were wrapping sync describe_db/query_db in asyncio.to_thread, but when _fleet_env is an AsyncEnv (from AsyncFleet.make), .db().describe() and .query() are async coroutines. Now detects whether the resource method is a coroutine and awaits directly instead of using to_thread. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When the Fleet MCP server returns {"base64_image": null} (failed screenshot), _extract_tool_result would create an image_url block with url=None, causing downstream AttributeError in SkyRL's extract_images_from_conversation. Now returns a text error message instead of propagating the null. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

greptile-apps · 2026-03-29T20:20:57Z

Greptile Summary

This PR introduces the fleet_env integration module for OpenEnv — FleetEnvClient, FleetMCPTools, FleetTaskEnv, and supporting utilities — and applies a targeted bug fix: _extract_tool_result now null-checks parsed["base64_image"] before constructing an OpenAI image_url block, preventing a downstream AttributeError in SkyRL's extract_images_from_conversation when Fleet MCP returns {"base64_image": null} for a failed screenshot.

Key findings:

The null-guard fix itself is correct (lines 149–154 of fleet_mcp_client.py). The new if data_url is not None branch returns a descriptive string instead of {"url": None}, cleanly breaking the crash path.
Three tests will fail at runtime (test_fleet_env_from_fleet_returns_orchestrator_and_tools, test_fleet_env_reset_uses_http_manager_base_url, test_fleet_env_step_rejects_tool_actions in test_fleet_env.py). All three call FleetEnvClient.from_fleet() without the three required positional arguments data_key, data_version, and image_type. Python raises TypeError before the test body executes. The TestFleetEnvClientDbQuery block in the same file shows the correct call pattern.
No regression test for the fixed code path. The TestFleetMCPClientExtractToolResult class covers eight scenarios but not the null base64_image case that motivated this PR.

Alignment notes:

Client/server separation is maintained: FleetEnvClient.step() actively rejects ListToolsAction/CallToolAction, enforcing the MCP-only channel for agent actions.
Rewards are computed entirely inside the environment (_compute_reward via the Fleet SDK verifier), consistent with the "rewards inside environment" principle.
No RFC-breaking changes detected; this is additive integration code.

Confidence Score: 4/5

Safe to merge after fixing the three broken test call-sites; the core logic is sound.

One P1 issue remains: three tests in test_fleet_env.py call from_fleet() without required arguments and will raise TypeError, meaning those tests never validate their intended behavior. The core bug fix in fleet_mcp_client.py is correct and all other code is well-structured. Score is 4 rather than 5 because the failing tests should be fixed before merge.

tests/envs/test_fleet_env.py lines 79–109 — three tests use an incomplete from_fleet() signature.

Important Files Changed

Filename	Overview
src/envs/fleet_env/fleet_mcp_client.py	Core bug fix: adds null guard for `base64_image` in `_extract_tool_result`, correctly returning a text error string instead of `{"url": None}` that would crash downstream `.startswith()` calls.
tests/envs/test_fleet_env.py	Three tests call `FleetEnvClient.from_fleet()` without required positional args `data_key`, `data_version`, `image_type` — these will fail with `TypeError`. Also missing a regression test for the null `base64_image` fix.
src/envs/fleet_env/mcp_tools.py	New file: `FleetMCPTools` agent-facing wrapper with retry logic for `list_tools` and `call_tool` over multiple MCP endpoints with deduplication.
src/envs/fleet_env/client.py	New file: `FleetEnvClient` HTTP orchestrator with sync/async provisioning, DB query helpers, and clean separation that rejects MCP tool actions via `step()`.
src/envs/fleet_env/task_env.py	New file: `FleetTaskEnv` Gymnasium-compatible task wrapper with async provisioning, modality-based tool filtering, verifier execution, and rollout telemetry.
tests/envs/test_fleet_task_env.py	New test file for `FleetTaskEnv`; tests use proper mocking and cover init, specs, and rollout lifecycle.

Sequence Diagram

sequenceDiagram
    participant Agent
    participant FleetMCPTools
    participant FleetMCPClient
    participant FleetMCP as Fleet MCP Server

    Agent->>FleetMCPTools: call_tool("computer", {action: "screenshot"})
    FleetMCPTools->>FleetMCPClient: call_tool("computer", args)
    FleetMCPClient->>FleetMCP: session.call_tool(name, arguments)
    FleetMCP-->>FleetMCPClient: CallToolResult(content=[TextContent(text='{"base64_image": null}')])
    FleetMCPClient->>FleetMCPClient: _extract_tool_result(result)
    Note over FleetMCPClient: json.loads → parsed["base64_image"] is None<br/>NEW: returns "Screenshot capture failed (null image)"<br/>OLD: returned [{"type":"image_url","image_url":{"url":None}}]
    FleetMCPClient-->>FleetMCPTools: "Screenshot capture failed (null image)"
    FleetMCPTools-->>Agent: str (safe — no crash downstream)

Prompt To Fix All With AI

This is a comment left during a code review.
Path: tests/envs/test_fleet_env.py
Line: 79-85

Comment:
**`from_fleet` called without required positional arguments**

`FleetEnvClient.from_fleet()` requires three additional positional parameters — `data_key: str`, `data_version: str`, and `image_type: str` — beyond `api_key` and `env_key`. None of these three carry default values. Calling the method while omitting them will raise `TypeError` before any assertion runs, failing the test.

The same problem exists on lines 88–98 and 101–109.

The fix is to pass placeholder values for all three required parameters. The `TestFleetEnvClientDbQuery` block (lines ~488–494) already shows the correct calling convention with `data_key`, `data_version`, and `image_type` filled in.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: tests/envs/test_fleet_env.py
Line: 343-391

Comment:
**No regression test for the null `base64_image` fix**

The PR's stated purpose is to prevent the crash when `{"base64_image": null}` is returned by Fleet MCP, but the test class only exercises the happy path (a valid data URL in `test_extract_base64_image_json_format`). There is no test asserting that `_extract_tool_result` returns a plain string error when `base64_image` is `null`.

A companion test alongside the existing JSON-format tests would guard against this exact regression. The test should construct a `TextContent` whose `.text` is `'{"base64_image": null}'`, call `_extract_tool_result`, and assert the result is a `str` (not a `list` or `dict` with `url: None`).

How can I resolve this? If you propose a fix, please make it concise.

_{Reviews (1): Last reviewed commit: "fix: handle null base64_image from Fleet..." | Re-trigger Greptile}

greptile-apps · 2026-03-29T20:21:00Z

+@pytest.mark.usefixtures("fake_requests_session", "fake_fleet_module")
+def test_fleet_env_from_fleet_returns_orchestrator_and_tools():
+    from envs.fleet_env import FleetEnvClient, FleetMCPTools
+
+    orch, tools = FleetEnvClient.from_fleet(api_key="k", env_key="e")
+    assert isinstance(orch, FleetEnvClient)
+    assert isinstance(tools, FleetMCPTools)


from_fleet called without required positional arguments

FleetEnvClient.from_fleet() requires three additional positional parameters — data_key: str, data_version: str, and image_type: str — beyond api_key and env_key. None of these three carry default values. Calling the method while omitting them will raise TypeError before any assertion runs, failing the test.

The same problem exists on lines 88–98 and 101–109.

The fix is to pass placeholder values for all three required parameters. The TestFleetEnvClientDbQuery block (lines ~488–494) already shows the correct calling convention with data_key, data_version, and image_type filled in.

Prompt To Fix With AI

This is a comment left during a code review. Path: tests/envs/test_fleet_env.py Line: 79-85 Comment: **`from_fleet` called without required positional arguments** `FleetEnvClient.from_fleet()` requires three additional positional parameters — `data_key: str`, `data_version: str`, and `image_type: str` — beyond `api_key` and `env_key`. None of these three carry default values. Calling the method while omitting them will raise `TypeError` before any assertion runs, failing the test. The same problem exists on lines 88–98 and 101–109. The fix is to pass placeholder values for all three required parameters. The `TestFleetEnvClientDbQuery` block (lines ~488–494) already shows the correct calling convention with `data_key`, `data_version`, and `image_type` filled in. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-03-29T20:21:01Z

+    def test_extract_base64_image_json_format(self):
+        """Should convert Fleet MCP's base64_image JSON format to OpenAI format."""
+        from envs.fleet_env.fleet_mcp_client import FleetMCPClient
+
+        client = FleetMCPClient(url="http://test", api_key="test")
+
+        # Fleet MCP returns screenshot as JSON text with base64_image key
+        class _TextContent:
+            type = "text"
+            text = '{"base64_image": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."}'
+
+        class _Result:
+            content = [_TextContent()]
+            isError = False
+            structuredContent = None
+
+        result = client._extract_tool_result(_Result())
+
+        # Should be converted to OpenAI-compatible format
+        assert isinstance(result, list)
+        assert len(result) == 1
+        assert result[0]["type"] == "image_url"
+        assert (
+            result[0]["image_url"]["url"] == "data:image/jpeg;base64,/9j/4AAQSkZJRg..."
+        )
+
+    def test_extract_base64_image_preserves_other_json(self):
+        """Should preserve normal JSON responses that don't have base64_image."""
+        from envs.fleet_env.fleet_mcp_client import FleetMCPClient
+
+        client = FleetMCPClient(url="http://test", api_key="test")
+
+        # Normal JSON response without base64_image
+        class _TextContent:
+            type = "text"
+            text = '{"status": "success", "data": [1, 2, 3]}'
+
+        class _Result:
+            content = [_TextContent()]
+            isError = False
+            structuredContent = None
+
+        result = client._extract_tool_result(_Result())
+
+        # Should return parsed dict as-is
+        assert isinstance(result, dict)
+        assert result["status"] == "success"
+        assert result["data"] == [1, 2, 3]
+


No regression test for the null base64_image fix

The PR's stated purpose is to prevent the crash when {"base64_image": null} is returned by Fleet MCP, but the test class only exercises the happy path (a valid data URL in test_extract_base64_image_json_format). There is no test asserting that _extract_tool_result returns a plain string error when base64_image is null.

A companion test alongside the existing JSON-format tests would guard against this exact regression. The test should construct a TextContent whose .text is '{"base64_image": null}', call _extract_tool_result, and assert the result is a str (not a list or dict with url: None).

Prompt To Fix With AI

This is a comment left during a code review. Path: tests/envs/test_fleet_env.py Line: 343-391 Comment: **No regression test for the null `base64_image` fix** The PR's stated purpose is to prevent the crash when `{"base64_image": null}` is returned by Fleet MCP, but the test class only exercises the happy path (a valid data URL in `test_extract_base64_image_json_format`). There is no test asserting that `_extract_tool_result` returns a plain string error when `base64_image` is `null`. A companion test alongside the existing JSON-format tests would guard against this exact regression. The test should construct a `TextContent` whose `.text` is `'{"base64_image": null}'`, call `_extract_tool_result`, and assert the result is a `str` (not a `list` or `dict` with `url: None`). How can I resolve this? If you propose a fix, please make it concise.

Deniz and others added 30 commits December 12, 2025 12:14

fleet integartion step 0

38edfc2

updated README

f67bc43

readme update

164853c

another iteraton

935826f

readme

7c09d5b

conb

7efae22

Add __init__.py to envs package for pip install compatibility

791a071

fix: Remove seed parameter from reset() call

7852847

HTTPEnvClient.reset() doesn't support seed parameter yet. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Fix: fetch tools lazily in reset_async to avoid asyncio.run in async …

ced5eca

…context

debug: add logging for call_tool to trace success/failure paths

a2f3531

fix: unwrap ExceptionGroup to show actual error cause

a08cb6d

debug: add logging for Fleet instance creation timing

9806eb8

Use image_type='mcp' for computer_use tasks

a1ac1a7

Computer-use tasks require MCP-enabled container images (e.g., famazon:mcp0.0.7) which have scrot installed for screenshots and the MCP server with 'computer' tool for mouse/keyboard control.

Deniz and others added 27 commits March 3, 2026 19:49

feat: add trace upload utilities for eval rollouts

f86fa49

Adds create_trace_job() and upload_trace() to fleet_env module. These allow SkyRL eval to send conversation traces (with screenshots) to the Fleet API for viewing in the Fleet UI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: Pass reward as score to ingest API so sessions complete

c99c1e5

The ingest API determines session status from score presence, not the status field. Without score, all sessions stay as in_progress. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Revert "Add hint-based reward for solver RL training (Options B, C, D)"

31fa602

This reverts commit fc0508f.

Merge pull request #6 from fleet-ai/deniz/fleet-logfire

b8ee588

feat: Add Logfire error tracking to fleet env

feat: add partial reward support behind flag

cc5bf37

When `partial_reward=True`, failed verifier runs compute a fractional score from the error/success accumulators instead of binary 0/1. Passing tasks are unaffected. Off by default. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge pull request #9 from fleet-ai/feat/partial-reward

7c5d64f

feat: partial reward support

Merge pull request #10 from fleet-ai/feat/close-verifier

baa192a

feat: Run verifier at close() for orphaned rollouts

merge: resolve conflicts with deniz/fleet_client (_reward_computed)

5111c78

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge pull request #11 from fleet-ai/deniz/submit-final-answer

cf91b04

feat: Add submit_final_answer synthetic tool for carlisle tasks

Fix model ID format: use provider/model prefix for Fleet harness

fafc3ea

Fleet harness POST /v1/jobs requires model IDs in 'provider/model' format (e.g., 'anthropic/claude-sonnet-4.5'), not just the model name. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: Remove unused json import, defensive copy DEFAULT_MODELS

0051e92

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge pull request #12 from fleet-ai/deniz/task-evaluator

8605142

feat: Add TaskEvaluator for task generation inner loop

Merge pull request #14 from fleet-ai/deniz/hint-feedback

438c7b3

Expose verifier feedback properties for hint generation

meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 29, 2026

greptile-apps Bot reviewed Mar 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: handle null base64_image in MCP screenshot responses#476

fix: handle null base64_image in MCP screenshot responses#476
dzorlu wants to merge 85 commits into
huggingface:mainfrom
fleet-ai:fix/null-screenshot-url

dzorlu commented Mar 29, 2026

Uh oh!

greptile-apps Bot commented Mar 29, 2026

Uh oh!

greptile-apps Bot Mar 29, 2026

Uh oh!

greptile-apps Bot Mar 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dzorlu commented Mar 29, 2026

Summary

Root cause

Test plan

Uh oh!

greptile-apps Bot commented Mar 29, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant