Fix Redis stream leak: MAXLEN on xadd + sliding TTL on stream keys #339

Open · x wants to merge 2 commits into main from devon.peticolas/fix-redis-stream-leak

Conversation

@x (Contributor) commented on May 4, 2026

Summary

The RedisStreamRepository in this SDK has been leaking memory in production: every task:{id} stream grows unbounded for the lifetime of the task and is never deleted. On a long-lived cluster (BP) this manifested as 12k+ orphaned task streams consuming most of a 6GB Redis cache, requiring manual redis-cli DEL flushes.

Two complementary fixes, one commit each. Both mirror patterns the agentex server-side adapter already uses (agentex/src/adapters/streams/adapter_redis.py in scaleapi/scale-agentex).

Commit 1 — perf(streaming): add MAXLEN to Redis xadd

RedisStreamRepository.send_event was calling xadd with no MAXLEN, despite a comment claiming "Add to Redis stream with a reasonable max length". The cap was intended but never wired up.

Matches the server adapter's pattern from scale-agentex PR #111 (Jan 2). Default 10000 entries, overridable via REDIS_STREAM_MAXLEN env var.
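For illustration, a minimal sketch of the capped write, assuming redis-py's asyncio client; the 10000 default, the env var name, and the task:{id} key format come from this PR, while the standalone function shape is illustrative:

```python
import os

from redis.asyncio import Redis

# Default mirrors the server adapter; overridable via env var.
REDIS_STREAM_MAXLEN = int(os.environ.get("REDIS_STREAM_MAXLEN", "10000"))


async def send_event(redis: Redis, task_id: str, event: dict[str, str]):
    # approximate=True emits MAXLEN ~, letting Redis trim lazily at
    # macro-node boundaries instead of exactly on every XADD.
    return await redis.xadd(
        f"task:{task_id}",
        event,
        maxlen=REDIS_STREAM_MAXLEN,
        approximate=True,
    )
```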

Commit 2 — feat(streaming): add sliding TTL to Redis stream keys

Mirrors scale-agentex PR #215 (Stas, just merged) on the SDK side. Pipelines XADD with EXPIRE in one round-trip. Each delta write refreshes the TTL on the stream key, so:

  • An actively-streaming agent (or actively-reading SSE consumer keeping the key warm via server-side reads) keeps the key alive.
  • Once the agent stops writing, the key ages out and Redis auto-deletes it after REDIS_STREAM_TTL_SECONDS.
  • An orphaned stream (no writes, no consumer) self-deletes within the TTL window — no explicit cleanup_stream call needed.

Default TTL: 3600 seconds (1h), matching the server. Configurable via REDIS_STREAM_TTL_SECONDS env var. Setting it to 0 short-circuits to plain XADD (no TTL refresh).

Implementation notes (see the sketch after this list):

  • transaction=False on the pipeline: connection-level batching, no MULTI/EXEC overhead for a hot path.
  • raise_on_error=False: if EXPIRE fails after XADD succeeded, log and move on. We must not propagate the failure to the caller — the message has been published; a retry would duplicate it. The next successful XADD will reset the TTL anyway.
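A minimal sketch of the pipelined write with both flags, again assuming redis-py's asyncio client; the env var names, defaults, and the TTL=0 escape hatch come from this PR, while the function shape and logger are illustrative:

```python
import logging
import os

from redis.asyncio import Redis

logger = logging.getLogger(__name__)

REDIS_STREAM_MAXLEN = int(os.environ.get("REDIS_STREAM_MAXLEN", "10000"))
REDIS_STREAM_TTL_SECONDS = int(os.environ.get("REDIS_STREAM_TTL_SECONDS", "3600"))


async def send_event(redis: Redis, task_id: str, event: dict[str, str]) -> None:
    key = f"task:{task_id}"

    # TTL=0 escape hatch: plain capped XADD, no TTL refresh.
    if REDIS_STREAM_TTL_SECONDS <= 0:
        await redis.xadd(key, event, maxlen=REDIS_STREAM_MAXLEN, approximate=True)
        return

    # transaction=False: batch both commands into one round-trip without
    # MULTI/EXEC; atomicity isn't needed here, only fewer network hops.
    pipe = redis.pipeline(transaction=False)
    pipe.xadd(key, event, maxlen=REDIS_STREAM_MAXLEN, approximate=True)
    pipe.expire(key, REDIS_STREAM_TTL_SECONDS)  # sliding TTL, refreshed per write

    # raise_on_error=False: exceptions come back in the results list
    # instead of raising, so a failed EXPIRE after a successful XADD is
    # logged rather than propagated (a retry would duplicate the event).
    results = await pipe.execute(raise_on_error=False)
    for command, result in zip(("XADD", "EXPIRE"), results):
        if isinstance(result, Exception):
            logger.warning("%s on %s failed: %s", command, key, result)
```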

Why this approach instead of cleanup_stream?

An earlier draft of this PR added an explicit cleanup_stream call to all terminal task transitions in TasksService (complete, fail, cancel, terminate, timeout, delete). Stas pointed out two issues with that approach which the EXPIRE pattern sidesteps:

  1. Race with task_updated events. When the server processes a terminal task transition (POST /tasks/{id}/complete), it publishes a task_updated event to the same task:{id} stream so connected SSE consumers see the state change. An SDK-side cleanup_stream immediately after the API call returns would race that event — a frontend mid-XREAD could miss the COMPLETED transition.
  2. Multi-subscriber unfairness. The existing server-side cleanup_stream-on-disconnect path nukes the topic for all subscribers when one disconnects. Adding more cleanup_stream callsites compounds that latent bug.

EXPIRE avoids both: it only fires after inactivity, so any active reader keeps the key alive long enough to consume final events, and the topic-deletion-on-disconnect bug is rendered moot once the SDK no longer relies on explicit DEL.

Validation

End-to-end tested against BP's dev cluster on 2026-05-05 with this branch path-pinned into the agent images. Ran an eval (-k 2, 2 examples → 2 orchestrator workflows + 2 tardis subagent workflows = 4 task streams).

Redis state ~3 minutes after eval finished:

| Task | Type | XLEN | TTL |
| --- | --- | --- | --- |
| task:58baa63e-... | orchestrator | 105 | 3424s |
| task:251f2dac-... | orchestrator | 112 | 3436s |
| task:7d87bbd3-... | tardis subagent | 139 | 3402s |
| task:49bb2778-... | tardis subagent | 83 | 3388s |

All four streams have active TTLs of ~3400s (~57 min, which is 3600s - elapsed_since_last_write). They will self-delete from Redis ~57 min from the last delta write — no explicit cleanup call needed.

Side-by-side baseline (same Redis cluster, same time, but pods running the unmodified agentex-sdk==0.10.3 from PyPI):

| Task | XLEN | TTL |
| --- | --- | --- |
| task:e2963d73-... | 120 | NO_TTL |
| task:f7c9f250-... | 171 | NO_TTL |
| task:30526127-... | 152 | NO_TTL |
| task:c3dcb6ec-... | 94 | NO_TTL |

(... ~80 more, all NO_TTL ...)

The only difference between the two cohorts is the SDK version pinned in the agent images. The TTL is the difference this PR makes.
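For reference, the XLEN/TTL inspection in both tables above can be reproduced with a few redis-py calls; a sketch, where the task:* key pattern is from this PR and the helper name is hypothetical:

```python
from redis.asyncio import Redis


async def inspect_task_streams(redis: Redis) -> None:
    # Walk task streams and print length plus remaining TTL. A TTL of -1
    # means the key has no expiry -- the pre-fix (NO_TTL) behavior.
    async for key in redis.scan_iter(match="task:*", count=100):
        xlen = await redis.xlen(key)
        ttl = await redis.ttl(key)
        ttl_str = "NO_TTL" if ttl == -1 else f"{ttl}s"
        print(f"{key}  XLEN={xlen}  TTL={ttl_str}")
```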

This pattern also closes the orphan case our application layer can't fix: long-lived orchestrator workflows (which never call tasks.complete because they're designed to stay alive across multiple user messages) used to leak their stream forever. Now their stream auto-cleans within REDIS_STREAM_TTL_SECONDS of the last delta, regardless of workflow lifecycle.

Test plan

  • SDK lints cleanly (uv run ruff check src/agentex/lib/core/adapters/streams/adapter_redis.py)
  • Existing tests still pass (no behavior change to existing API surface; new kwargs are optional)
  • Live cluster validation: pipelined writes succeed; TTL set on stream keys; side-by-side proof against unmodified SDK on same cluster
  • Backwards compatibility: RedisStreamRepository() with no kwargs continues to work; setting REDIS_STREAM_TTL_SECONDS=0 disables the TTL behavior entirely

perf(streaming): add MAXLEN to Redis xadd

The SDK's RedisStreamRepository.send_event was calling xadd with no
MAXLEN, so every task:* stream grew unbounded for the lifetime of the
task. The accompanying comment ("Add to Redis stream with a reasonable
max length") suggested the cap was intended but never wired up.

This change matches the agentex server-side adapter, which has had
maxlen=REDIS_STREAM_MAXLEN, approximate=True since Jan 2 (PR #111 in
scaleapi/scale-agentex). Default is 10000 entries, overridable via the
REDIS_STREAM_MAXLEN env var, same as the server.

Note: this caps each stream's size during generation but does not
delete streams when their task completes -- that's a separate fix.
@smoreinis (Contributor) commented:
on the backend side, I just made a change to pipeline xadd with an expire call to ensure streams get cleaned up by redis if nothing else, maybe worth doing something similar here? ref: scaleapi/scale-agentex#215

feat(streaming): add sliding TTL to Redis stream keys

Mirror scaleapi/scale-agentex#215 (server-side adapter): pipeline XADD
with EXPIRE so each task:* stream key gets a sliding TTL. Orphaned
streams (no writes for the TTL window) self-delete in Redis without
needing an explicit cleanup_stream call from the caller.

This is the right shape of fix for the SDK's leak: an explicit DEL on
terminal task transitions (an earlier draft of this PR) introduced a
race where the server's task_updated event published to the same
topic could be deleted before a connected frontend SSE consumer read
it. EXPIRE sidesteps that — TTL only fires after inactivity, so an
actively-streaming agent or actively-reading consumer keeps the key
alive, and the key only ages out once everyone is done with it.

Defaults match the server: REDIS_STREAM_TTL_SECONDS=3600 (1h),
overridable via env var. Setting it to 0 short-circuits to plain XADD
(no TTL refresh), matching the server's escape hatch.

Implementation notes:
- transaction=False on the pipeline: connection-level batching, no
  MULTI/EXEC overhead for what's already a fast op.
- raise_on_error=False: an EXPIRE failure after a successful XADD
  must not surface to the caller. The message has been published;
  retrying would duplicate it. We log and move on. Next successful
  XADD will reset the TTL anyway.
@x force-pushed the devon.peticolas/fix-redis-stream-leak branch from a298f63 to 0a7cfa0 on May 5, 2026 at 18:31
@x changed the title from "Fix Redis stream leak: MAXLEN on xadd + cleanup_stream on terminal transitions" to "Fix Redis stream leak: MAXLEN on xadd + sliding TTL on stream keys" on May 5, 2026