fix: handle null base64_image in MCP screenshot responses#476
Conversation
- FleetTaskEnv wraps FleetEnvClient with task-oriented interface - Accepts task configs from export_training_tasks.py - Creates versioned environments on reset - Injects task prompt into observations - Executes verifier for reward computation on episode completion - Supports both sync and async step methods - Factory functions: make_fleet_task_env, from_json_file - Tests: 20 unit tests for init, specs, verifiers, factories 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The MCP images don't exist for all environment versions, causing FleetVersionNotFoundError when trying to create environments. Changing the default to None allows the Fleet SDK to use standard images which are available for all versions. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
FleetEnvClient.from_fleet() was not accepting data_key/data_version parameters, causing them to be passed through **kwargs to HTTPEnvClient which doesn't accept them. - Add data_key and data_version as explicit parameters - Pass them to fleet.make() - Update task_env.py to pass them separately 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Fleet SDK expects data_key in "key:version" format, not as separate parameters. Updated from_fleet() to combine them before calling fleet.make(). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
HTTPEnvClient.reset() doesn't support seed parameter yet. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Increases default timeout from 15s to 60s for Fleet API calls. This prevents timeouts during environment initialization. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously reset() did partial work and reset_async() added tool fetching. Now reset_async() does all the work (including fetching tools) and reset() is just a sync wrapper that calls it via run_until_complete(). This ensures both methods return identical results including tools. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
MCP's call_tool() returns a CallToolResult Pydantic object, not plain text. This was causing ugly repr strings to be passed to agents like: "meta=None content=[TextContent(type='text', text='...')] ..." Now properly extracts: - Text content from result.content[].text - Tries JSON parsing for structured results - Falls back to structuredContent if available - Handles isError cases 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tests for: - FleetMCPClient._extract_tool_result(): - Single text content extraction - JSON parsing from text - Multiple text contents - Error result handling - Structured content fallback - Empty result handling - FleetTaskEnv reset: - reset_async() returns tools - reset() calls reset_async() (sync wrapper) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move fleet.make() and list_tools() into FleetTaskEnv.__init__() - Tools are now fetched at env creation, not during reset - reset_async() calls _orch.reset() with error handling, returns cached tools - Use asyncio.run() for Python 3.13 compatibility - Update tests for new initialization pattern 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Log task_key and verifier code preview when verifier fails - Catch syntax errors separately with clear message - Show which functions were found if 'verify' is missing Helps debug issues like "Verifier code must define a 'verify' function" 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace custom _execute_verifier_local() with Fleet SDK's Task.verify_detailed() which properly sets up the verifier namespace with: - Environment type annotation - Helper functions (normalized_contains, etc.) - Proper function discovery (not just "verify" function) This fixes "name 'Environment' is not defined" errors during verifier execution. Changes: - _compute_reward: Create Fleet SDK Task and call verify_detailed() - Support both 'verifier_code' and 'verifier_func' field names - Add comprehensive logging for debugging - Remove broken _execute_verifier_local method Tests: - Update all verifier tests to mock Fleet SDK Task.verify_detailed() - Add tests for various edge cases (no verifier, no orch, exceptions) - Fix fixture to avoid asyncio.run() conflicts with pytest-asyncio 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add retry with exponential backoff (3 attempts, 1s/2s/4s delays) - Log errors instead of silently swallowing exceptions - Log warning when some clients fail but others succeed - Log error after all retries exhausted This fixes silent failures when MCP connections are flaky, which caused 'no tools found' errors in SkyRL training.
call_tool now retries with exponential backoff (3 attempts, 1s/2s/4s) on connection errors, similar to list_tools. ValueError (tool not found) is not retried. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds exponential backoff retry (3 attempts, 2s base delay) around fleet.make() to handle transient Fleet API errors like health check failures that can occur during instance provisioning. Only retries on transient errors (health check, timeout, connection). Permanent errors are raised immediately. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add Toolathlon-style context management tools for long trajectories: - check_context: Check visible/total turn counts - manage_context: Drop old turns to free up context space - search_history: Search all history (including dropped) - search_tool_output: Search truncated tool output - view_tool_output: Paginate through truncated output The ContextManager class can be used by any training framework that maintains chat_history. It tracks full history and handles truncated tool outputs. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Computer-use tasks require MCP-enabled container images (e.g., famazon:mcp0.0.7) which have scrot installed for screenshots and the MCP server with 'computer' tool for mouse/keyboard control.
Previously, tools were only fetched for tool_use modality due to a restrictive condition. This caused computer_use tasks to fail with "no tools found in observation" because the computer tool (mouse, keyboard, screenshot) was never fetched. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When task_modality is computer_use, filter tools to only include the 'computer' tool. This prevents the model from using API tools when it should be using mouse/keyboard control. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
AsyncFleet.make() is not truly non-blocking — diagnostics confirmed the event loop stays healthy while make() blocks internally. Switch to running sync Fleet.make() in a thread pool via asyncio.to_thread() to guarantee non-blocking behavior. Also removes the separate sync Fleet.instance() call since we already have a sync env handle from make(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… loop Three sync calls were blocking the event loop during concurrent trajectories: 1. self._orch.reset() — sync HTTP POST to manager API, now uses reset_async() which runs in a thread pool 2. fleet_task.verify_detailed() — sync verifier execution, now wrapped in asyncio.to_thread() 3. self._fleet_env.close() — sync instance termination, now has close_async() Also reverts AsyncFleet.make() (was correctly async all along — diagnostics confirmed the event loop stays healthy during make() calls). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds create_trace_job() and upload_trace() to fleet_env module. These allow SkyRL eval to send conversation traces (with screenshots) to the Fleet API for viewing in the Fleet UI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…r UI rendering Images were being passed as OpenAI format (type: "image_url") but Fleet's ingest API expects (type: "image", mime_type, data). The API then uploads base64 to S3 and the UI renders them full-size via OpinionatedImage. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The ingest API determines session status from score presence, not the status field. Without score, all sessions stay as in_progress. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New HintGenerator module that produces concise hints from failed rollout data to rescue GRPO signal on hard tasks. Three modes: - Option B: LLM call with verifier code + tool errors + chat_history - Option C: LLM call with verifier code only (cacheable per-task) - Option D: LLM call synthesizing tool errors + verifier failure FleetTaskEnv changes: - Accumulate tool errors during step_async() - Capture verifier error details in _compute_reward() - New properties: verifier_code, tool_errors, verifier_error - New reset_for_hint_async(hint) for hinted rollouts (reuses provisioned instance, resets DB to seed, appends hint to prompt) - compute_hint_reward() utility: R = (1 - raw_score) * hint_score Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This reverts commit fc0508f.
feat: Add Logfire error tracking to fleet env
When `partial_reward=True`, failed verifier runs compute a fractional score from the error/success accumulators instead of binary 0/1. Passing tasks are unaffected. Off by default. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat: partial reward support
SkyRL can end trajectories early (context overflow, its own max_turns) without OpenEnv knowing. Previously the model got 0 reward even if the task was completed. Now close_async()/close() run the verifier when step_async() never computed a reward, storing the result in self.final_reward for SkyRL to read. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat: Run verifier at close() for orphaned rollouts
Carlisle tasks (354 total, 8 in eval) require models to call submit_final_answer to submit results, but this tool is a harness-level synthetic injected by the orchestrator's SessionWorkflow, not an MCP tool. OpenEnv connects directly to MCP servers, so the tool was missing — causing 0% scores across all carlisle tasks in training. Changes: - Inject submit_final_answer into tool list when prompt references it - Intercept calls locally (not routed to MCP), store the answer - Pass final_answer to verifier via Fleet SDK's verify_detailed() - Run verifier in close()/close_async() for orphaned rollouts - Add unit tests for the synthetic tool Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat: Add submit_final_answer synthetic tool for carlisle tasks
Runs k × m rollouts of generated tasks on Fleet environments. Given (prompt, verifier_code, env_key), creates FleetTaskEnv instances, runs agent loops with model inference, and returns structured results for reward computation (learnability variance + model separation). Used as the inner loop of the task-scaling RL pipeline in SkyRL. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of calling Anthropic directly and running a local agent loop, the evaluator now: 1. Imports the generated task via fleet.import_task() 2. Creates a harness job via fleet.create_job() 3. Polls for completion 4. Extracts per-session verifier scores from job sessions Uses real Fleet model IDs (claude-sonnet-4.5, claude-opus-4.5) instead of the broken weak/strong mapping that required ANTHROPIC_API_KEY. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fleet harness POST /v1/jobs requires model IDs in 'provider/model' format (e.g., 'anthropic/claude-sonnet-4.5'), not just the model name. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The sync time.sleep() in _poll_job blocked the asyncio event loop, preventing trajectory timeouts from cancelling evaluations. Using asyncio.sleep() allows the event loop to properly handle cancellations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fleet returns session model IDs without provider prefix (e.g., "claude-sonnet-4.5")
while we configure them with prefix ("anthropic/claude-sonnet-4.5"). Added
_match_model_id() to normalize and match by bare model name, so scores land
in the correct results_per_model bucket.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat: Add TaskEvaluator for task generation inner loop
Add _tool_errors, _verifier_stdout, _verifier_error to FleetTaskEnv so SkyRL can build hints from failed rollout feedback without an LLM call. - Tool errors accumulated in step_async() on MCP errors and exceptions - Verifier stdout/error captured in _compute_reward() after verifier runs - Verifier exceptions also captured in _verifier_error (not just failures) - All feedback properties reset in reset_async() to prevent cross-episode leakage - Properties: verifier_stdout, verifier_error, tool_errors_list Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Expose verifier feedback properties for hint generation
Add describe_db() and query_db() methods (sync + async) to FleetEnvClient, delegating to the Fleet SDK's SQLiteResource. This enables querying provisioned environment databases (seed/current) from outside the container via HTTP, which is needed for task generation workflows where the model explores DB data before designing tasks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
describe_db_async/query_db_async were wrapping sync describe_db/query_db in asyncio.to_thread, but when _fleet_env is an AsyncEnv (from AsyncFleet.make), .db().describe() and .query() are async coroutines. Now detects whether the resource method is a coroutine and awaits directly instead of using to_thread. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the Fleet MCP server returns {"base64_image": null} (failed screenshot),
_extract_tool_result would create an image_url block with url=None, causing
downstream AttributeError in SkyRL's extract_images_from_conversation.
Now returns a text error message instead of propagating the null.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Greptile SummaryThis PR introduces the Key findings:
Alignment notes:
Confidence Score: 4/5Safe to merge after fixing the three broken test call-sites; the core logic is sound. One P1 issue remains: three tests in
Important Files Changed
Sequence DiagramsequenceDiagram
participant Agent
participant FleetMCPTools
participant FleetMCPClient
participant FleetMCP as Fleet MCP Server
Agent->>FleetMCPTools: call_tool("computer", {action: "screenshot"})
FleetMCPTools->>FleetMCPClient: call_tool("computer", args)
FleetMCPClient->>FleetMCP: session.call_tool(name, arguments)
FleetMCP-->>FleetMCPClient: CallToolResult(content=[TextContent(text='{"base64_image": null}')])
FleetMCPClient->>FleetMCPClient: _extract_tool_result(result)
Note over FleetMCPClient: json.loads → parsed["base64_image"] is None<br/>NEW: returns "Screenshot capture failed (null image)"<br/>OLD: returned [{"type":"image_url","image_url":{"url":None}}]
FleetMCPClient-->>FleetMCPTools: "Screenshot capture failed (null image)"
FleetMCPTools-->>Agent: str (safe — no crash downstream)
Prompt To Fix All With AIThis is a comment left during a code review.
Path: tests/envs/test_fleet_env.py
Line: 79-85
Comment:
**`from_fleet` called without required positional arguments**
`FleetEnvClient.from_fleet()` requires three additional positional parameters — `data_key: str`, `data_version: str`, and `image_type: str` — beyond `api_key` and `env_key`. None of these three carry default values. Calling the method while omitting them will raise `TypeError` before any assertion runs, failing the test.
The same problem exists on lines 88–98 and 101–109.
The fix is to pass placeholder values for all three required parameters. The `TestFleetEnvClientDbQuery` block (lines ~488–494) already shows the correct calling convention with `data_key`, `data_version`, and `image_type` filled in.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: tests/envs/test_fleet_env.py
Line: 343-391
Comment:
**No regression test for the null `base64_image` fix**
The PR's stated purpose is to prevent the crash when `{"base64_image": null}` is returned by Fleet MCP, but the test class only exercises the happy path (a valid data URL in `test_extract_base64_image_json_format`). There is no test asserting that `_extract_tool_result` returns a plain string error when `base64_image` is `null`.
A companion test alongside the existing JSON-format tests would guard against this exact regression. The test should construct a `TextContent` whose `.text` is `'{"base64_image": null}'`, call `_extract_tool_result`, and assert the result is a `str` (not a `list` or `dict` with `url: None`).
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "fix: handle null base64_image from Fleet..." | Re-trigger Greptile |
| @pytest.mark.usefixtures("fake_requests_session", "fake_fleet_module") | ||
| def test_fleet_env_from_fleet_returns_orchestrator_and_tools(): | ||
| from envs.fleet_env import FleetEnvClient, FleetMCPTools | ||
|
|
||
| orch, tools = FleetEnvClient.from_fleet(api_key="k", env_key="e") | ||
| assert isinstance(orch, FleetEnvClient) | ||
| assert isinstance(tools, FleetMCPTools) |
There was a problem hiding this comment.
from_fleet called without required positional arguments
FleetEnvClient.from_fleet() requires three additional positional parameters — data_key: str, data_version: str, and image_type: str — beyond api_key and env_key. None of these three carry default values. Calling the method while omitting them will raise TypeError before any assertion runs, failing the test.
The same problem exists on lines 88–98 and 101–109.
The fix is to pass placeholder values for all three required parameters. The TestFleetEnvClientDbQuery block (lines ~488–494) already shows the correct calling convention with data_key, data_version, and image_type filled in.
Prompt To Fix With AI
This is a comment left during a code review.
Path: tests/envs/test_fleet_env.py
Line: 79-85
Comment:
**`from_fleet` called without required positional arguments**
`FleetEnvClient.from_fleet()` requires three additional positional parameters — `data_key: str`, `data_version: str`, and `image_type: str` — beyond `api_key` and `env_key`. None of these three carry default values. Calling the method while omitting them will raise `TypeError` before any assertion runs, failing the test.
The same problem exists on lines 88–98 and 101–109.
The fix is to pass placeholder values for all three required parameters. The `TestFleetEnvClientDbQuery` block (lines ~488–494) already shows the correct calling convention with `data_key`, `data_version`, and `image_type` filled in.
How can I resolve this? If you propose a fix, please make it concise.| def test_extract_base64_image_json_format(self): | ||
| """Should convert Fleet MCP's base64_image JSON format to OpenAI format.""" | ||
| from envs.fleet_env.fleet_mcp_client import FleetMCPClient | ||
|
|
||
| client = FleetMCPClient(url="http://test", api_key="test") | ||
|
|
||
| # Fleet MCP returns screenshot as JSON text with base64_image key | ||
| class _TextContent: | ||
| type = "text" | ||
| text = '{"base64_image": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."}' | ||
|
|
||
| class _Result: | ||
| content = [_TextContent()] | ||
| isError = False | ||
| structuredContent = None | ||
|
|
||
| result = client._extract_tool_result(_Result()) | ||
|
|
||
| # Should be converted to OpenAI-compatible format | ||
| assert isinstance(result, list) | ||
| assert len(result) == 1 | ||
| assert result[0]["type"] == "image_url" | ||
| assert ( | ||
| result[0]["image_url"]["url"] == "data:image/jpeg;base64,/9j/4AAQSkZJRg..." | ||
| ) | ||
|
|
||
| def test_extract_base64_image_preserves_other_json(self): | ||
| """Should preserve normal JSON responses that don't have base64_image.""" | ||
| from envs.fleet_env.fleet_mcp_client import FleetMCPClient | ||
|
|
||
| client = FleetMCPClient(url="http://test", api_key="test") | ||
|
|
||
| # Normal JSON response without base64_image | ||
| class _TextContent: | ||
| type = "text" | ||
| text = '{"status": "success", "data": [1, 2, 3]}' | ||
|
|
||
| class _Result: | ||
| content = [_TextContent()] | ||
| isError = False | ||
| structuredContent = None | ||
|
|
||
| result = client._extract_tool_result(_Result()) | ||
|
|
||
| # Should return parsed dict as-is | ||
| assert isinstance(result, dict) | ||
| assert result["status"] == "success" | ||
| assert result["data"] == [1, 2, 3] | ||
|
|
There was a problem hiding this comment.
No regression test for the null
base64_image fix
The PR's stated purpose is to prevent the crash when {"base64_image": null} is returned by Fleet MCP, but the test class only exercises the happy path (a valid data URL in test_extract_base64_image_json_format). There is no test asserting that _extract_tool_result returns a plain string error when base64_image is null.
A companion test alongside the existing JSON-format tests would guard against this exact regression. The test should construct a TextContent whose .text is '{"base64_image": null}', call _extract_tool_result, and assert the result is a str (not a list or dict with url: None).
Prompt To Fix With AI
This is a comment left during a code review.
Path: tests/envs/test_fleet_env.py
Line: 343-391
Comment:
**No regression test for the null `base64_image` fix**
The PR's stated purpose is to prevent the crash when `{"base64_image": null}` is returned by Fleet MCP, but the test class only exercises the happy path (a valid data URL in `test_extract_base64_image_json_format`). There is no test asserting that `_extract_tool_result` returns a plain string error when `base64_image` is `null`.
A companion test alongside the existing JSON-format tests would guard against this exact regression. The test should construct a `TextContent` whose `.text` is `'{"base64_image": null}'`, call `_extract_tool_result`, and assert the result is a `str` (not a `list` or `dict` with `url: None`).
How can I resolve this? If you propose a fix, please make it concise.
Summary
parsed["base64_image"]in_extract_tool_result{"base64_image": null}(failed screenshot capture), return a text error message instead of creating animage_urlblock withurl: NoneAttributeErrorin SkyRL'sextract_images_from_conversationRoot cause
_extract_tool_resultunconditionally wrapsparsed["base64_image"]in an OpenAI image_url block. When the value isNone, this creates{"url": None}which crashes.startswith()in the consumer.Test plan
🤖 Generated with Claude Code