feat: Redis cache + pre-warm for dashboard summary endpoints (Phase C of #20)#28

Open
Boanerges1996 wants to merge 5 commits into peermetrics:master from Boanerges1996:feat/summary-redis-cache-phase-c
Conversation

@Boanerges1996
Contributor

Summary

Stacked on top of #26 and #27 — PR diff will show all Phase 2-5 + Phase C commits until those upstream PRs merge, then rebase down to just the cache commit (`6b2a3a6`).

Phases 0-5 moved dashboard aggregation to SQL. Phase C caches the results: one Redis entry per (endpoint + filter params), 60s TTL, pre-warmed so first visitors don't pay cold-query cost.

  • `app/summary_cache.py` — `cached_json(endpoint, request, compute)` wraps any summary view. Reads Redis; on miss, runs the compute closure and writes back with 60s TTL. Uses existing Django-redis cache backend with `IGNORE_EXCEPTIONS=True`, so a Redis outage degrades to the pre-cache behavior, never breaks.
  • All 8 summary views refactored to pass their compute bodies through the helper. No change to JSON shape or query logic.
  • `manage.py prewarm_summaries` — iterates apps with recent traffic and warms every summary view for the 30-day window the dashboard requests by default. Intended as an ECS scheduled task at ~30s cadence.
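
As a rough sketch of the cache-aside shape in `app/summary_cache.py` (an in-memory dict stands in for the Redis backend here, and the key scheme is illustrative, not the repo's actual format):

```python
import hashlib
import json
import time

# In-memory stand-in for the Redis backend; keys map to (expires_at, payload).
_store = {}

TTL_SECONDS = 60  # matches the 60s TTL described above


def _cache_key(endpoint, params):
    """Hash (endpoint + sorted filter params) into a short, stable key."""
    raw = endpoint + "?" + json.dumps(params, sort_keys=True)
    return "summary:" + hashlib.sha1(raw.encode()).hexdigest()[:16]


def cached_json(endpoint, params, compute, ttl=TTL_SECONDS):
    """Cache-aside: return the cached payload, or compute and store it.

    Cache errors are swallowed on both the read and write path, so a
    Redis outage degrades to the uncached behavior instead of failing.
    """
    key = _cache_key(endpoint, params)
    try:
        hit = _store.get(key)
        if hit is not None and hit[0] > time.monotonic():
            return hit[1]
    except Exception:
        pass  # degrade to the uncached path on cache read errors
    payload = compute()
    try:
        _store[key] = (time.monotonic() + ttl, payload)
    except Exception:
        pass  # a failed write only costs the next caller a recompute
    return payload
```

In the real helper, Django's `cache.get`/`cache.set` with django-redis and `IGNORE_EXCEPTIONS=True` would play the role of the try/except blocks here.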

Local benchmark (7-day Production clone, ~18k conferences / 38k sessions / 38k connections)

| endpoint | cold | warm | speedup |
| --- | --- | --- | --- |
| conferences/summary | 391ms | 12ms | 33× |
| sessions/summary | 748ms | 11ms | 68× |
| connections/setup-time-summary | 373ms | 11ms | 34× |
| conferences/participant-count-summary | 216ms | 7ms | 31× |
| issues/gum-summary | 107ms | 6ms | 18× |
| connections/summary | 57ms | 6ms | 9.5× |
| issues/summary | 45ms | 86ms | noise (both <100ms) |
| conferences/duration-summary | 19ms | 8ms | 2.3× |

Total warm dashboard cost ≈ 150ms across all 8 endpoints (vs ~2s cold).

Test plan

  • Verify `/v1/conferences/summary` returns identical JSON before and after enabling cache
  • Flush Redis, hit all 8 endpoints, confirm 8 keys appear at `:1:summary:*`
  • Hit the same endpoints again — confirm p99 < 50ms
  • Set `SUMMARY_CACHE_TTL=5`, wait 6s, confirm re-query triggers a new compute
  • Kill Redis, confirm endpoints still return correct data (just slower)
  • Run `manage.py prewarm_summaries` — confirm 8 keys/app written, slow-query log line for any >500ms compute

Follow-ups (not in this PR)

  • Hook `prewarm_summaries` into the ECS scheduled task config (infra repo).
  • Consider surfacing `X-Cache: HIT|MISS` response header for ops visibility.
  • Pre-existing: local `docker-compose` sets `REDIS_HOST=redis://127.0.0.1:6379` which doesn't resolve inside the api container. Added a local override file; worth a one-liner fix in compose.
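
If the `X-Cache` follow-up lands, the helper could report hit/miss alongside the payload and let each view set the header; a minimal sketch (plain dict standing in for Redis, function name hypothetical):

```python
def cached_json_with_header(key, store, compute):
    """Cache lookup that also reports hit/miss for an X-Cache header (sketch)."""
    cached = store.get(key)
    if cached is not None:
        return cached, {"X-Cache": "HIT"}
    payload = compute()
    store[key] = payload
    return payload, {"X-Cache": "MISS"}
```

A Django view would copy the returned header dict onto the `JsonResponse` before returning it.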

🤖 Generated with Claude Code

Boanerges1996 and others added 3 commits April 20, 2026 21:47
Five new endpoints for the remaining dashboard charts that fetch raw data:

- GET /v1/conferences/duration-summary
    Returns conference counts bucketed by duration range (< 1m, 1-3m, etc.)
- GET /v1/conferences/participant-count-summary
    Returns distribution of conferences by participant count
- GET /v1/issues/summary
    Returns issue counts grouped by code with titles
- GET /v1/issues/gum-summary
    Returns getusermedia_error issue counts grouped by error name

Also adds three new filter params to /v1/conferences for click-to-detail
modals on these charts:
- duration_gte, duration_lt (for duration chart)
- issue_code (for most-common-issues chart)

All endpoints accept appId, created_at_gte, created_at_lte and handle
both Python native ISO format and JavaScript's toISOString Z suffix.

Phases 2 and 3 of peermetrics#20 — eliminates the need for the dashboard to
download all conferences (~38MB) and all issues (~73MB).
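
The dual ISO-format handling mentioned above could look like this (a sketch only; the actual view code may differ, and note that `datetime.fromisoformat` accepts a trailing `Z` natively only from Python 3.11 on):

```python
from datetime import datetime, timezone


def parse_client_timestamp(value):
    """Accept both Python-native ISO strings and JS toISOString() output.

    datetime.fromisoformat() rejects the trailing 'Z' before Python 3.11,
    so normalize it to an explicit UTC offset first.
    """
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    dt = datetime.fromisoformat(value)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assume UTC for naive inputs
    return dt
```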
…ermetrics#20)

Adds three new aggregation endpoints that let the dashboard stop
downloading full /connections and /sessions payloads to build charts
client-side:

- GET /v1/connections/summary — relay vs direct connection counts
  (replaces the Relayed-connections pie chart's client-side reduce)
- GET /v1/connections/setup-time-summary — connection setup-time
  buckets with per-bucket conference_ids for click-to-detail
- GET /v1/sessions/summary — browsers, OS, country, and city/geo
  aggregates (powers Browsers, OS, and Map charts in one roundtrip)

Also accepts `conference_ids=a,b,c` on /conferences so the setup-time
chart can page through matched conferences on click.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…se C of peermetrics#20)

With Phases 0-5 merged, every dashboard chart reads from a server-side
aggregation endpoint. The SQL is fast with indexes, but the same ~8
queries run on every page load, and the heavy ones (sessions.summary,
connections.setup_time_summary) still cost 400-800ms on a live tenant.

Adds a thin caching layer in front of each summary view:

- `app/summary_cache.py` — `cached_json(endpoint, request, compute)`
  hashes (endpoint + filter params) into a short key, reads Redis,
  falls through to `compute()` on miss, and writes back with a 60s TTL.
  Redis failures are tolerated (settings already has IGNORE_EXCEPTIONS).
- Each of the eight summary views moves its existing compute body into
  a local `compute()` closure and returns through the helper. No change
  to the JSON shape, query logic, or error handling.
- `manage.py prewarm_summaries` — scheduled command that iterates apps
  with recent traffic (default: any conference in the last 2 days) and
  runs every summary view with the 30d-window filters the dashboard
  sends by default. Intended to run every ~30s as an ECS scheduled
  task so first visitors never see a cold miss.

Measured locally against a 7-day Production clone (~18k conferences /
38k sessions / 38k connections):

  endpoint                               cold      warm
  conferences/summary                    391ms  →   12ms   (33x)
  sessions/summary                       748ms  →   11ms   (68x)
  connections/setup_time_summary         373ms  →   11ms   (34x)
  conferences/participant_count_summary  216ms  →    7ms   (31x)
  issues/gum_summary                     107ms  →    6ms   (18x)
  connections/summary                     57ms  →    6ms   (9.5x)
  issues/summary                          45ms  →   86ms   (noise; both <100ms)
  conferences/duration_summary            19ms  →    8ms   (2.3x)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@agonza1
Contributor

agonza1 commented Apr 24, 2026

Some additional feedback:

P1 — real bug: GET /v1/conferences?issue_code=... can return the same conference multiple times when several issues share that code, which breaks pagination and count for dashboard drilldowns (each page should be unique conferences, aligned with aggregated chart semantics).

P2 — policy / correctness: issue_code should not match soft-deleted issues. Per existing BaseModel / API patterns, issue_code should only consider active issues → add issues__is_active=True whenever issue_code is applied.

P2/P3 — hardening: GET /v1/issues/gum-summary walks Issue.data with .get(). If data is not a dict (null is fine; bad legacy JSON is not), the view can 500. Skip non-dict rows and keep aggregating the rest.

Suggested fix direction:

  • ConferencesView.filter (app/views/conference_view.py):
    When issue_code is present: filter with issues__code (already mapped) and issues__is_active=True, then .distinct() on the conference queryset (scoped to the issue_code path so other filters stay unchanged).

  • GetUserMediaSummaryView (app/views/issue_summary_view.py):
    In the loop over Issue.data, continue if data is missing or not isinstance(data, dict).
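
The queryset side is a one-line `.distinct()` change on the `issue_code` path; the `Issue.data` hardening can be sketched framework-free (the `error` key is an assumption for illustration, not necessarily the repo's actual schema):

```python
from collections import Counter


def gum_error_counts(issue_data_rows):
    """Aggregate getUserMedia error names, skipping rows whose `data`
    is missing or not a dict, so bad legacy JSON cannot 500 the view."""
    counts = Counter()
    for data in issue_data_rows:
        if not isinstance(data, dict):
            continue  # null or malformed legacy JSON: skip, keep aggregating
        name = data.get("error")
        if name:
            counts[name] += 1
    return dict(counts)
```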

Follow-ups: regression tests for the three bullets above; one-line README under private /conferences for issue_code + “one row per conference.” No migrations required (behavior-only).

@agonza1
Contributor

agonza1 commented Apr 24, 2026

Filtering conferences by issue code joins issues, so one conference can show up many times (e.g. a camera issue occurring 5 times would be counted 5 times and repeated in the dashboard), breaking pagination and counts. Deduplicate and only match active issues, as done elsewhere.

The GUM chart reads Issue.data as a dict; if a row isn't one, the handler can crash. Skipping those rows is better than failing the whole request.

agonza1 added 2 commits April 24, 2026 19:23
- Unit-test cache key rules, hit/miss, TTL override, and soft-fail on get/set errors.
- Smoke-test prewarm_summaries for zero apps and one recent app (8 views).

Made-with: Cursor
@agonza1 force-pushed the feat/summary-redis-cache-phase-c branch from a16d91f to e7e51f7 on April 24, 2026 23:41
@agonza1
Contributor

agonza1 commented Apr 24, 2026

Ready for one last run, @Boanerges1996. If you confirm it still works for you, we can merge the changes here along with the previous PRs we used as a base.
