Fix Redis stream leak: MAXLEN on xadd + sliding TTL on stream keys #339

Open · x wants to merge 2 commits into main from devon.peticolas/fix-redis-stream-leak

Conversation

@x (Contributor) commented on May 4, 2026

Summary

The RedisStreamRepository in this SDK has been leaking memory in production: every task:{id} stream grows unbounded for the lifetime of the task and is never deleted. On a long-lived cluster (BP) this manifested as 12k+ orphaned task streams consuming most of a 6GB Redis cache, requiring manual redis-cli DEL flushes.

Two complementary fixes, one commit each. Both mirror patterns the agentex server-side adapter already uses (agentex/src/adapters/streams/adapter_redis.py in scaleapi/scale-agentex).

Commit 1 — perf(streaming): add MAXLEN to Redis xadd

RedisStreamRepository.send_event was calling xadd with no MAXLEN, despite a comment claiming "Add to Redis stream with a reasonable max length". The cap was intended but never wired up.

Matches the server adapter's pattern from scale-agentex PR #111 (Jan 2). Default 10000 entries, overridable via REDIS_STREAM_MAXLEN env var.
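For illustration, a minimal sketch of the capped write, assuming redis-py's asyncio client; the 10000 default, the env var name, and the task:{id} key format come from this PR, while the standalone function shape is illustrative:

```python
import os

from redis.asyncio import Redis

# Default mirrors the server adapter; overridable via env var.
REDIS_STREAM_MAXLEN = int(os.environ.get("REDIS_STREAM_MAXLEN", "10000"))


async def send_event(redis: Redis, task_id: str, event: dict[str, str]):
    # approximate=True emits MAXLEN ~, letting Redis trim lazily at
    # macro-node boundaries instead of exactly on every XADD.
    return await redis.xadd(
        f"task:{task_id}",
        event,
        maxlen=REDIS_STREAM_MAXLEN,
        approximate=True,
    )
```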

Commit 2 — feat(streaming): add sliding TTL to Redis stream keys

Mirrors scale-agentex PR #215 (Stas, just merged) on the SDK side. Pipelines XADD with EXPIRE in one round-trip. Each delta write refreshes the TTL on the stream key, so:

  • An actively-streaming agent (or actively-reading SSE consumer keeping the key warm via server-side reads) keeps the key alive.
  • Once the agent stops writing, the key ages out and Redis auto-deletes it after REDIS_STREAM_TTL_SECONDS.
  • An orphaned stream (no writes, no consumer) self-deletes within the TTL window — no explicit cleanup_stream call needed.

Default TTL: 3600 seconds (1h), matching the server. Configurable via REDIS_STREAM_TTL_SECONDS env var. Setting it to 0 short-circuits to plain XADD (no TTL refresh).

Implementation notes (see the sketch after this list):

  • transaction=False on the pipeline: connection-level batching, no MULTI/EXEC overhead for a hot path.
  • raise_on_error=False: if EXPIRE fails after XADD succeeded, log and move on. We must not propagate the failure to the caller — the message has been published; a retry would duplicate it. The next successful XADD will reset the TTL anyway.
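A minimal sketch of the pipelined write with both flags, again assuming redis-py's asyncio client; the env var names, defaults, and the TTL=0 escape hatch come from this PR, while the function shape and logger are illustrative:

```python
import logging
import os

from redis.asyncio import Redis

logger = logging.getLogger(__name__)

REDIS_STREAM_MAXLEN = int(os.environ.get("REDIS_STREAM_MAXLEN", "10000"))
REDIS_STREAM_TTL_SECONDS = int(os.environ.get("REDIS_STREAM_TTL_SECONDS", "3600"))


async def send_event(redis: Redis, task_id: str, event: dict[str, str]) -> None:
    key = f"task:{task_id}"

    # TTL=0 escape hatch: plain capped XADD, no TTL refresh.
    if REDIS_STREAM_TTL_SECONDS <= 0:
        await redis.xadd(key, event, maxlen=REDIS_STREAM_MAXLEN, approximate=True)
        return

    # transaction=False: batch both commands into one round-trip without
    # MULTI/EXEC; atomicity isn't needed here, only fewer network hops.
    pipe = redis.pipeline(transaction=False)
    pipe.xadd(key, event, maxlen=REDIS_STREAM_MAXLEN, approximate=True)
    pipe.expire(key, REDIS_STREAM_TTL_SECONDS)  # sliding TTL, refreshed per write

    # raise_on_error=False: exceptions come back in the results list
    # instead of raising, so a failed EXPIRE after a successful XADD is
    # logged rather than propagated (a retry would duplicate the event).
    results = await pipe.execute(raise_on_error=False)
    for command, result in zip(("XADD", "EXPIRE"), results):
        if isinstance(result, Exception):
            logger.warning("%s on %s failed: %s", command, key, result)
```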

Why this approach instead of cleanup_stream?

An earlier draft of this PR added an explicit cleanup_stream call to all terminal task transitions in TasksService (complete, fail, cancel, terminate, timeout, delete). Stas pointed out two issues with that approach which the EXPIRE pattern sidesteps:

  1. Race with task_updated events. When the server processes a terminal task transition (POST /tasks/{id}/complete), it publishes a task_updated event to the same task:{id} stream so connected SSE consumers see the state change. An SDK-side cleanup_stream immediately after the API call returns would race that event — a frontend mid-XREAD could miss the COMPLETED transition.
  2. Multi-subscriber unfairness. The existing server-side cleanup_stream-on-disconnect path nukes the topic for all subscribers when one disconnects. Adding more cleanup_stream callsites compounds that latent bug.

EXPIRE avoids both: it only fires after inactivity, so any active reader keeps the key alive long enough to consume final events, and the topic-deletion-on-disconnect bug is rendered moot once the SDK no longer relies on explicit DEL.

Validation

End-to-end tested against BP's dev cluster on 2026-05-05 with this branch path-pinned into the agent images. Ran an eval (-k 2, 2 examples → 2 orchestrator workflows + 2 tardis subagent workflows = 4 task streams).

Redis state ~3 minutes after eval finished:

| Task | Type | XLEN | TTL |
| --- | --- | --- | --- |
| task:58baa63e-... | orchestrator | 105 | 3424s |
| task:251f2dac-... | orchestrator | 112 | 3436s |
| task:7d87bbd3-... | tardis subagent | 139 | 3402s |
| task:49bb2778-... | tardis subagent | 83 | 3388s |

All four streams have active TTLs of ~3400s (~57 min, which is 3600s - elapsed_since_last_write). They will self-delete from Redis ~57 min from the last delta write — no explicit cleanup call needed.

Side-by-side baseline (same Redis cluster, same time, but pods running the unmodified agentex-sdk==0.10.3 from PyPI):

| Task | XLEN | TTL |
| --- | --- | --- |
| task:e2963d73-... | 120 | NO_TTL |
| task:f7c9f250-... | 171 | NO_TTL |
| task:30526127-... | 152 | NO_TTL |
| task:c3dcb6ec-... | 94 | NO_TTL |

(... ~80 more, all NO_TTL ...)

The only difference between the two cohorts is the SDK version pinned in the agent images. The TTL is the difference this PR makes.
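For reference, the XLEN/TTL inspection in both tables above can be reproduced with a few redis-py calls; a sketch, where the task:* key pattern is from this PR and the helper name is hypothetical:

```python
from redis.asyncio import Redis


async def inspect_task_streams(redis: Redis) -> None:
    # Walk task streams and print length plus remaining TTL. A TTL of -1
    # means the key has no expiry -- the pre-fix (NO_TTL) behavior.
    async for key in redis.scan_iter(match="task:*", count=100):
        xlen = await redis.xlen(key)
        ttl = await redis.ttl(key)
        ttl_str = "NO_TTL" if ttl == -1 else f"{ttl}s"
        print(f"{key}  XLEN={xlen}  TTL={ttl_str}")
```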

This pattern also closes the orphan case our application layer can't fix: long-lived orchestrator workflows (which never call tasks.complete because they're designed to stay alive across multiple user messages) used to leak their stream forever. Now their stream auto-cleans within REDIS_STREAM_TTL_SECONDS of the last delta, regardless of workflow lifecycle.

Test plan

  • SDK lints cleanly (uv run ruff check src/agentex/lib/core/adapters/streams/adapter_redis.py)
  • Existing tests still pass (no behavior change to existing API surface; new kwargs are optional)
  • Live cluster validation: pipelined writes succeed; TTL set on stream keys; side-by-side proof against unmodified SDK on same cluster
  • Backwards compatibility: RedisStreamRepository() with no kwargs continues to work; setting REDIS_STREAM_TTL_SECONDS=0 disables the TTL behavior entirely

perf(streaming): add MAXLEN to Redis xadd

The SDK's RedisStreamRepository.send_event was calling xadd with no
MAXLEN, so every task:* stream grew unbounded for the lifetime of the
task. The accompanying comment ("Add to Redis stream with a reasonable
max length") suggested the cap was intended but never wired up.

This change matches the agentex server-side adapter, which has had
maxlen=REDIS_STREAM_MAXLEN, approximate=True since Jan 2 (PR #111 in
scaleapi/scale-agentex). Default is 10000 entries, overridable via the
REDIS_STREAM_MAXLEN env var, same as the server.

Note: this caps each stream's size during generation but does not
delete streams when their task completes -- that's a separate fix.
@smoreinis (Contributor) commented:
on the backend side, I just made a change to pipeline xadd with an expire call to ensure streams get cleaned up by redis if nothing else, maybe worth doing something similar here? ref: scaleapi/scale-agentex#215

feat(streaming): add sliding TTL to Redis stream keys

Mirror scaleapi/scale-agentex#215 (server-side adapter): pipeline XADD
with EXPIRE so each task:* stream key gets a sliding TTL. Orphaned
streams (no writes for the TTL window) self-delete in Redis without
needing an explicit cleanup_stream call from the caller.

This is the right shape of fix for the SDK's leak: an explicit DEL on
terminal task transitions (an earlier draft of this PR) introduced a
race where the server's task_updated event published to the same
topic could be deleted before a connected frontend SSE consumer read
it. EXPIRE sidesteps that — TTL only fires after inactivity, so an
actively-streaming agent or actively-reading consumer keeps the key
alive, and the key only ages out once everyone is done with it.

Defaults match the server: REDIS_STREAM_TTL_SECONDS=3600 (1h),
overridable via env var. Setting it to 0 short-circuits to plain XADD
(no TTL refresh), matching the server's escape hatch.

Implementation notes:
- transaction=False on the pipeline: connection-level batching, no
  MULTI/EXEC overhead for what's already a fast op.
- raise_on_error=False: an EXPIRE failure after a successful XADD
  must not surface to the caller. The message has been published;
  retrying would duplicate it. We log and move on. Next successful
  XADD will reset the TTL anyway.
@x force-pushed the devon.peticolas/fix-redis-stream-leak branch from a298f63 to 0a7cfa0 on May 5, 2026 at 18:31
@x changed the title from "Fix Redis stream leak: MAXLEN on xadd + cleanup_stream on terminal transitions" to "Fix Redis stream leak: MAXLEN on xadd + sliding TTL on stream keys" on May 5, 2026