📝 Walkthrough

This PR introduces comprehensive health check improvements across the API and worker services. It adds concurrent dependency health checks (Redis, Database, ClickHouse) with individual failure tolerance in the API healthcheck controller, and implements Kubernetes-compatible health endpoints (…).

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant API as API Healthcheck
    participant Redis as Redis
    participant DB as Database
    participant CH as ClickHouse
    Client->>API: GET /healthcheck
    par Concurrent Checks
        API->>Redis: PING
        API->>DB: SELECT 1
        API->>CH: SELECT 1
    and
        rect rgba(255, 0, 0, 0.5)
            Note over Redis: Executes async
        end
        rect rgba(255, 0, 0, 0.5)
            Note over DB: Executes async
        end
        rect rgba(255, 0, 0, 0.5)
            Note over CH: Executes async
        end
    end
    alt All Dependencies Succeed
        Redis-->>API: Connected
        DB-->>API: Responsive
        CH-->>API: Available
        API->>Client: 200 OK {ready: true, dependencies: {...}}
    else Any Dependency Fails
        Redis-->>API: Error
        DB-->>API: Error/Connected
        CH-->>API: Error/Available
        API->>Client: 503 Service Unavailable {ready: false, dependencies: {...}, errors: {...}}
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed

❌ Failed checks (1 warning)

✅ Passed checks (4 passed)
Actionable comments posted: 2
🧹 Nitpick comments (3)
apps/worker/src/index.ts (2)
29-29: Consider making the stale threshold configurable via env.
`EVENTS_HEARTBEAT_STALE_MS` is hard-coded to 60s. Since it's tightly coupled to `EVENT_BLOCKING_TIMEOUT_SEC` (which is already env-configurable), operators who tune the blocking timeout upward will silently get false-positive 503s. An env knob (or derivation from `EVENT_BLOCKING_TIMEOUT_SEC * N`) would avoid this.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/worker/src/index.ts` at line 29, EVENTS_HEARTBEAT_STALE_MS is hard-coded to 60_000 and can become inconsistent with the env-configured EVENT_BLOCKING_TIMEOUT_SEC; make the stale threshold configurable by reading an env var (e.g., EVENTS_HEARTBEAT_STALE_MS) or derive it from EVENT_BLOCKING_TIMEOUT_SEC (e.g., multiply by a safety factor) in the same module where EVENTS_HEARTBEAT_STALE_MS is defined so operators can tune it via environment and avoid false-positive 503s when EVENT_BLOCKING_TIMEOUT_SEC is increased.
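As a sketch of the env-knob approach: the variable names below match the review comment, but the parsing helper, the 20s fallback, and the 3x safety factor are illustrative assumptions, not the project's actual defaults.

```typescript
// Hypothetical sketch: make the stale threshold env-tunable instead of hard-coded.
// EVENT_BLOCKING_TIMEOUT_SEC's 20s fallback and the 3x factor are assumptions.
const EVENT_BLOCKING_TIMEOUT_SEC = Number.parseInt(
  process.env.EVENT_BLOCKING_TIMEOUT_SEC ?? '20',
  10,
);

function resolveStaleMs(): number {
  const fromEnv = Number.parseInt(process.env.EVENTS_HEARTBEAT_STALE_MS ?? '', 10);
  if (Number.isFinite(fromEnv) && fromEnv > 0) {
    return fromEnv; // explicit override wins
  }
  // Derive from the blocking timeout with a safety factor, so the two values
  // stay consistent when operators tune EVENT_BLOCKING_TIMEOUT_SEC upward.
  return EVENT_BLOCKING_TIMEOUT_SEC * 3 * 1000;
}

const EVENTS_HEARTBEAT_STALE_MS = resolveStaleMs();
```

With neither env var set, this resolves to 60 000 ms, matching the current hard-coded value.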
78-120: DRY: this /healthcheck handler is a near-verbatim duplicate of the API's `healthcheck` controller.

The same Redis/DB/CH probe logic, the same `dependencies`/`dependencyErrors`/`failed|workingDependencies` shape, and the same 200/503 decision live in `apps/api/src/controllers/healthcheck.controller.ts` (lines 12–58). Drift between the two is inevitable. Consider extracting a shared helper (e.g. `probeDependencies()` returning `{ status, body, log }`) into `@openpanel/common` or a small `@openpanel/health` package, and calling it from both endpoints.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@apps/worker/src/index.ts` around lines 78 - 120, The worker healthcheck duplicates the API controller's probe logic; extract the shared probe into a new helper (e.g. probeDependencies) in a common package (like `@openpanel/common` or `@openpanel/health`) that performs the Promise.all probes (using getRedisCache().ping(), db.$executeRaw`SELECT 1`, chQuery('SELECT 1')), computes dependencies and dependencyErrors, determines status (200/503), builds failedDependencies and workingDependencies, and returns a structured result (suggested shape: { status, body, log }) so both the worker's handler in apps/worker/src/index.ts and the API's healthcheck.controller.ts can call probeDependencies(), use the returned status to set response status/body (ready, dependencies, failedDependencies, workingDependencies), and log via logger.warn using the returned log payload.

packages/common/src/try-catch.ts (1)
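A minimal sketch of what such a shared helper could look like. The probe map, the `{ status, body }` result shape, and all names here are illustrative assumptions; the real helper would be fed `getRedisCache().ping()`, `db.$executeRaw` and `chQuery('SELECT 1')` as in the existing controllers.

```typescript
// Hypothetical shared probeDependencies() helper (names/shape are assumptions).
type Probe = () => Promise<unknown>;

interface ProbeResult {
  status: 200 | 503;
  body: {
    ready: boolean;
    dependencies: Record<string, boolean>;
    dependencyErrors: Record<string, string>;
  };
}

async function probeDependencies(probes: Record<string, Probe>): Promise<ProbeResult> {
  const dependencies: Record<string, boolean> = {};
  const dependencyErrors: Record<string, string> = {};

  // Run all probes concurrently; a single failure must not abort the others.
  await Promise.all(
    Object.entries(probes).map(async ([name, probe]) => {
      try {
        await probe();
        dependencies[name] = true;
      } catch (error) {
        dependencies[name] = false;
        dependencyErrors[name] =
          error instanceof Error ? error.message : String(error);
      }
    }),
  );

  const ready = Object.values(dependencies).every(Boolean);
  return {
    status: ready ? 200 : 503,
    body: { ready, dependencies, dependencyErrors },
  };
}
```

Both endpoints would then only differ in how they wire the result into their response objects, eliminating the drift risk.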
12-14: Unsafe cast of caught value to `E`.

`catch (error)` can receive non-`Error` values (strings, objects, `null`), but this blindly casts to `E` (default `Error`). Current callers use `error?.message` with optional chaining, so it degrades to `undefined` rather than crashing, but a safer default would be to normalize via `error instanceof Error ? error : new Error(String(error))` when `E` defaults to `Error`. Not blocking — flagging for future hardening.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/common/src/try-catch.ts` around lines 12 - 14, The catch block currently unsafely casts the caught value to E (catch (error) { return { ok: false, data: null, error: error as E }; }), which can be non-Error; instead normalize the caught value before returning: when the generic E is the default Error type, convert non-Error values to an Error (e.g., error instanceof Error ? error : new Error(String(error))) and then return that normalizedError (cast to E only as a last step); update the catch in try-catch.ts to build a safe normalizedError and return { ok: false, data: null, error: normalizedError as E } so callers can reliably access .message without runtime surprises.
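A hedged sketch of the suggested normalization. The `Result` shape and `tryCatch` signature below are reconstructed from the description above, not copied from the actual `try-catch.ts`.

```typescript
// Assumed result shape mirroring the { ok, data, error } tuple described above.
type Result<T, E = Error> =
  | { ok: true; data: T; error: null }
  | { ok: false; data: null; error: E };

async function tryCatch<T, E = Error>(promise: Promise<T>): Promise<Result<T, E>> {
  try {
    const data = await promise;
    return { ok: true, data, error: null };
  } catch (error) {
    // Strings, numbers, plain objects, null: anything can be thrown in JS.
    // Normalize first, cast to E only as the last step.
    const normalized = error instanceof Error ? error : new Error(String(error));
    return { ok: false, data: null, error: normalized as E };
  }
}
```

Callers can then rely on `.message` existing without optional chaining.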
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 19defc40-3643-456f-a719-f80a0e74c3f8
⛔ Files ignored due to path filters (1)
`pnpm-lock.yaml` is excluded by `!**/pnpm-lock.yaml`
📒 Files selected for processing (8)
- apps/api/src/controllers/healthcheck.controller.ts
- apps/worker/src/boot-workers.ts
- apps/worker/src/index.ts
- apps/worker/src/utils/graceful-shutdown.ts
- apps/worker/src/utils/worker-heartbeat.ts
- packages/common/index.ts
- packages/common/src/try-catch.ts
- pnpm-workspace.yaml
```ts
worker.on('completed', markEventsActivity);
worker.on('drained', markEventsActivity);
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Inspect groupmq's GroupWorker implementation for emitted events.
fd -t d '^groupmq$' node_modules 2>/dev/null | head -5
rg -nP "emit\s*\(\s*['\"](completed|drained)" --glob '**/groupmq/**' -C2 2>/dev/null || \
  rg -nP "emit\s*\(\s*['\"](completed|drained)" -C2
```

Repository: Openpanel-dev/openpanel
Length of output: 49
🌐 Web query:
groupmq 2.0.0-next.4 Worker events completed drained
💡 Result:
For “groupmq 2.0.0-next.4 Worker events completed drained” there isn’t enough evidence in the sources I found to answer what GroupMQ emits for “completed” and “drained” worker events in version 2.0.0-next.4 specifically. What I could verify: 1) GroupMQ is a Redis-backed per-group FIFO queue for Node/TypeScript, with a BullMQ-compatible API (site + README). It provides a Worker class and worker options (including concurrency), but the sources I found do not describe any “worker events” named “completed” or “drained”, nor do they document event payload semantics. 2) The only “completed/drained” event behavior I found is for BullMQ (different library). In BullMQ, there is a “drained” event and it has been discussed/changed regarding when it emits (e.g., only once when the worker is idle; and older versions emitting more often). This does not establish GroupMQ’s 2.0.0-next.4 behavior. 3) I located GroupMQ releases up to 2.0.0-next.1 (2026-03-12). I did not find a 2.0.0-next.4 release page/changelog entry in the searched results, so I couldn’t corroborate “2.0.0-next.4” content. Conclusion: I cannot confidently state what “completed” and “drained” worker events are in groupmq 2.0.0-next.4, or whether “drained” exists / how it fires. If you want, paste the exact code snippet where you attach the listeners (e.g., worker.on('completed'...) and worker.on('drained'...)) or share the groupmq 2.0.0-next.4 changelog / worker source file path, and I can re-run a targeted search against the exact implementation and produce a definitive, version-specific answer.
Citations:
- 1: https://github.com/Openpanel-dev/groupmq
- 2: https://openpanel-dev.github.io/groupmq/
- 3: https://github.com/Openpanel-dev/groupmq/blob/main/README.md
- 4: https://github.com/Openpanel-dev/groupmq/releases
- 5: *drained* event emitting on intermittent basis taskforcesh/bullmq#1121
- 6: High volume of drained events -- xtrim can't keep up taskforcesh/bullmq-pro-support#69
- 7: https://github.com/Openpanel-dev/groupmq/blob/main/CHANGELOG.md
🏁 Script executed:
```bash
#!/bin/bash
# 1. Find package files and check groupmq version
fd -type f -name "package*.json" | head -20 | xargs grep -l "groupmq" 2>/dev/null | head -3
# 2. Extract groupmq version from lock/package files
fd -type f -name "package*.json" | xargs cat 2>/dev/null | grep -A 2 '"groupmq"' | head -15
```

Repository: Openpanel-dev/openpanel
Length of output: 408
🏁 Script executed:
```bash
#!/bin/bash
# 3. Read the boot-workers.ts file mentioned in review to see context
cat -n apps/worker/src/boot-workers.ts | sed -n '140,170p'
```

Repository: Openpanel-dev/openpanel
Length of output: 1416
🏁 Script executed:
```bash
#!/bin/bash
# 4. Search for groupmq package in node_modules and find its main export
find node_modules -maxdepth 2 -type d -name groupmq 2>/dev/null | head -1
```

Repository: Openpanel-dev/openpanel
Length of output: 49
🏁 Script executed:
```bash
#!/bin/bash
# 5. Look for GroupWorker class definition in repository
rg -n "class GroupWorker" --type ts --type tsx --type js --type jsx
```

Repository: Openpanel-dev/openpanel
Length of output: 94
🏁 Script executed:
```bash
#!/bin/bash
# Check for groupmq in package locks and version
cat package-lock.json 2>/dev/null | jq '.packages | to_entries[] | select(.key | contains("groupmq")) | {key, version: .value.version}' 2>/dev/null || \
  cat pnpm-lock.yaml 2>/dev/null | grep -A 2 "groupmq" | head -10 || \
  grep "groupmq" yarn.lock 2>/dev/null | head -5
```

Repository: Openpanel-dev/openpanel
Length of output: 49
🏁 Script executed:
```bash
#!/bin/bash
# Find groupmq location (likely in node_modules)
find node_modules -maxdepth 2 -name "groupmq" -type d 2>/dev/null
```

Repository: Openpanel-dev/openpanel
Length of output: 49
🏁 Script executed:
```bash
#!/bin/bash
# Search for GroupWorker class definition - use correct rg syntax
rg -n "class GroupWorker" -t ts -t js
```

Repository: Openpanel-dev/openpanel
Length of output: 49
🏁 Script executed:
```bash
#!/bin/bash
# Search for groupmq mentions anywhere in the repo
rg "groupmq" -i | head -20
```

Repository: Openpanel-dev/openpanel
Length of output: 2333
🏁 Script executed:
```bash
#!/bin/bash
# Check package.json structure
ls -la package.json 2>/dev/null && head -30 package.json
```

Repository: Openpanel-dev/openpanel
Length of output: 1319
🏁 Script executed:
```bash
#!/bin/bash
# Find where GroupWorker is imported from
rg "GroupWorker" -B 2 -A 2 | head -40
```

Repository: Openpanel-dev/openpanel
Length of output: 1407
🏁 Script executed:
```bash
#!/bin/bash
# Check if groupmq is a workspace package (internal) or external dependency
ls -la packages/groupmq 2>/dev/null || echo "Not found in packages"
cat pnpm-workspace.yaml | grep -A 20 "packages:"
```

Repository: Openpanel-dev/openpanel
Length of output: 385
🏁 Script executed:
```bash
#!/bin/bash
# Check the actual groupmq source - look for it in node_modules or as workspace package
find . -path ./node_modules -prune -o -type f -name "*.ts" -print | xargs grep -l "class GroupWorker\|export.*GroupWorker" 2>/dev/null | head -5
```

Repository: Openpanel-dev/openpanel
Length of output: 49
🌐 Web query:
Openpanel-dev groupmq 2.0.0-next.4 Worker class events completed drained emit
💡 Result:
For Openpanel-dev/groupmq, the “drained” and “completed” worker/job lifecycle events you mention are not found in the official GroupMQ docs/pages returned by the research. The only worker events explicitly documented in the Openpanel-dev/groupmq README/docs are: ready, completed, failed, error, closed, and graceful-timeout. Additionally, GroupMQ’s documented job completion flow includes moving a successfully processed job into a :completed retention sorted set and unlocking the group lock, but this is internal state management—not an exposed “drained” event. So, based on the sources located for this query, there is no evidence of a “Worker class events … drained … completed … emit” sequence in groupmq 2.0.0-next.4. What is evidenced: - Worker emits documented events including completed (and other events), and drained is not listed. - Job completion flow marks the job as completed in the queue’s :completed sorted set and unlocks the group. If you can share the exact file/snippet from groupmq 2.0.0-next.4 you’re looking at (or the expected event names), I can re-check precisely whether drained exists under a different name, is part of QueueEvents (not Worker), or was removed/changed in the 2.0.0-next.x line.
Citations:
- 1: https://github.com/Openpanel-dev/groupmq/tree/4ff4c7571a75e7703fa3d28a059706d4eb671a7e
- 2: https://git.new/groupmq
- 3: https://github.com/Openpanel-dev/groupmq
🏁 Script executed:
```bash
#!/bin/bash
# Look at node_modules to find groupmq source
ls -la node_modules/groupmq 2>/dev/null | head -20
# If it exists, check main entry point
cat node_modules/groupmq/package.json 2>/dev/null | jq '.main, .exports'
```

Repository: Openpanel-dev/openpanel
Length of output: 49
🏁 Script executed:
```bash
#!/bin/bash
# Check the actual event listeners being used in boot-workers.ts
grep -n "worker.on\|markEventsActivity" apps/worker/src/boot-workers.ts | head -20
```

Repository: Openpanel-dev/openpanel
Length of output: 193
🏁 Script executed:
```bash
#!/bin/bash
# Try to find groupmq Worker implementation in node_modules
find node_modules/groupmq -name "*.js" -o -name "*.ts" 2>/dev/null | head -20
```

Repository: Openpanel-dev/openpanel
Length of output: 49
🌐 Web query:
site:github.com/Openpanel-dev/groupmq Worker class emit events implementation
💡 Result:
In libraries like GroupMQ (a BullMQ-inspired Redis job queue for Node.js/TypeScript), the Worker class emits events such as 'ready', 'completed', 'failed', 'error', 'closed', and 'graceful-timeout'. These are listened to using standard Node.js EventEmitter syntax: `worker.on('event', handler)`. Example implementation from GroupMQ documentation:

```ts
worker.on('ready', () => { console.log('Worker is ready'); });
worker.on('completed', (job) => { console.log('Job completed:', job.id); });
worker.on('failed', (job) => { console.log('Job failed:', job.id, job.failedReason); });
worker.on('error', (error) => { console.error('Worker error:', error); });
worker.on('closed', () => { console.log('Worker closed'); });
worker.on('graceful-timeout', (job) => { console.log('Job exceeded graceful timeout:', job.id); });

// Remove listeners
worker.off('completed', handler);
worker.removeAllListeners();
```

The Worker class extends Node.js EventEmitter, allowing it to emit these lifecycle and job status events for monitoring and handling in queue processing systems.
Citations:
- 1: https://github.com/Openpanel-dev/groupmq/tree/4ff4c7571a75e7703fa3d28a059706d4eb671a7e
- 2: https://github.com/Openpanel-dev/groupmq/blob/4ff4c7571a75e7703fa3d28a059706d4eb671a7e/README.md
- 3: https://github.com/Openpanel-dev/groupmq/blob/main/README.md
🏁 Script executed:
```bash
#!/bin/bash
# Check the actual GroupMQ source code for Worker.emit calls
# GroupMQ is at catalog version 2.0.0-next.4, search for its source on GitHub
# Let's look at the local boot-workers more carefully to understand the context
cat -n apps/worker/src/boot-workers.ts | sed -n '1,50p'
```

Repository: Openpanel-dev/openpanel
Length of output: 2182
🏁 Script executed:
```bash
#!/bin/bash
# Check what markEventsActivity does and how often it's called
rg -n "markEventsActivity" -B 5 -A 5 apps/worker/src/
```

Repository: Openpanel-dev/openpanel
Length of output: 2485
🌐 Web query:
site:github.com/Openpanel-dev/groupmq "drained" event Worker emit
💡 Result:
No evidence found of a "drained" event emitted by a Worker in any job queue library, including BullMQ, groupmq, Bee-Queue, or others. Searches across official documentation and code repositories (primarily groupmq README and BullMQ-inspired libraries) show Worker events like 'ready', 'completed', 'failed', 'error', 'closed', and 'graceful-timeout', but no 'drained' event. The query may refer to a custom implementation, a different library, or a misremembered event name. In Bee-Queue, 'drained' exists but as a queue event when no jobs remain, not emitted by Worker.
Citations:
- 1: https://github.com/Openpanel-dev/groupmq/blob/4ff4c7571a75e7703fa3d28a059706d4eb671a7e/README.md
- 2: Openpanel-dev/groupmq@66a73f4
- 3: https://github.com/Openpanel-dev/groupmq/tree/4ff4c7571a75e7703fa3d28a059706d4eb671a7e
🏁 Script executed:
```bash
#!/bin/bash
# Let's check the actual GroupMQ source from the commit referenced in web results
# Look for any emit calls in groupmq that mention "drained"
rg "emit.*drained" --type ts --type js 2>/dev/null || echo "No 'drained' emit found"
# Also check what events ARE emitted by the Worker class
rg "this\.emit|worker\.emit" apps/worker/src/boot-workers.ts
```

Repository: Openpanel-dev/openpanel
Length of output: 90
🌐 Web query:
site:github.com/Openpanel-dev/groupmq Worker.emit all events 2.0.0-next
💡 Result:
In the groupmq library (a Redis-backed per-group FIFO queue for Node.js/TypeScript), Worker instances extend TypedEventEmitter and emit the following events: - 'ready': Emitted when the worker is ready. - 'completed': Emitted when a job completes, with the Job object. - 'failed': Emitted when a job fails, with the Job object and failedReason. - 'error': Emitted on worker errors, with the Error. - 'closed': Emitted when the worker closes. - 'graceful-timeout': Emitted when a job exceeds graceful timeout, with the Job object. No version 2.0.0-next or Cloudflare Workers context found matching the query. Latest groupmq version is v1.1.0 (Nov 2025), with next tags like 1.1.1-next.1. Use worker.on(event, handler) to listen and worker.off or worker.removeAllListeners to remove.
Citations:
- 1: https://github.com/Openpanel-dev/groupmq/blob/4ff4c7571a75e7703fa3d28a059706d4eb671a7e/README.md
- 2: https://github.com/Openpanel-dev/groupmq/tree/4ff4c7571a75e7703fa3d28a059706d4eb671a7e
- 3: https://github.com/Openpanel-dev/groupmq
- 4: https://github.com/Openpanel-dev/groupmq/blob/main/README.md
Remove the worker.on('drained') listener — the event does not exist in groupmq 2.0.0-next.4.
The groupmq Worker class emits only: 'ready', 'completed', 'failed', 'error', 'closed', and 'graceful-timeout'. There is no 'drained' event. This means the heartbeat will only refresh when jobs complete; during idle periods (no jobs in queue), markEventsActivity() never fires, and the readiness probe will report 503 after 60 seconds.
Either implement a separate idle heartbeat mechanism (e.g., a periodic timer that calls markEventsActivity() at shorter intervals) or accept that the readiness probe will cycle to unhealthy during sustained idle periods.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/worker/src/boot-workers.ts` around lines 157 - 158, Remove the
non-existent drained listener (the worker.on('drained', markEventsActivity)
line) and instead implement a periodic idle heartbeat: set up a timer (e.g.,
setInterval) that calls markEventsActivity at a safe interval shorter than the
readiness timeout (e.g., 30s) so idle workers stay healthy; start this timer
when the Worker (groupmq Worker) is initialized and ensure you clear it on
terminal events such as 'closed', 'error' or 'graceful-timeout' to avoid leaks,
leaving the existing worker.on('completed', markEventsActivity) intact for job
completion updates.
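A sketch of the periodic idle heartbeat proposed above. `markEventsActivity` is the name used in the review; the 30s interval, the `unref()` call, and the cleanup wiring are illustrative assumptions.

```typescript
// Interval must stay comfortably below the readiness stale threshold (60s here).
const IDLE_HEARTBEAT_INTERVAL_MS = 30_000;

function startIdleHeartbeat(
  markEventsActivity: () => void,
  intervalMs: number = IDLE_HEARTBEAT_INTERVAL_MS,
): () => void {
  const timer = setInterval(markEventsActivity, intervalMs);
  timer.unref?.(); // don't keep the process alive just for the heartbeat
  // Returned stopper should be called from terminal events to avoid leaks.
  return () => clearInterval(timer);
}

// Wiring sketch (hypothetical):
//   const stopHeartbeat = startIdleHeartbeat(markEventsActivity);
//   worker.on('closed', stopHeartbeat);
//   worker.on('error', stopHeartbeat);
```

This keeps the readiness probe green during sustained idle periods while still letting `worker.on('completed', markEventsActivity)` refresh the heartbeat on real work.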
```ts
app.get('/healthz/ready', (req, res) => {
  if (isShuttingDown()) {
    res.status(503).json({ ready: false, reason: 'shutting down' });
    return;
  }

  const { enabled, lastActivityAt } = getEventsHeartbeat();
  if (enabled) {
    const idleMs = Date.now() - lastActivityAt;
    if (idleMs > EVENTS_HEARTBEAT_STALE_MS) {
      res.status(503).json({
        ready: false,
        reason: 'events consumer heartbeat stale',
        idleMs,
        thresholdMs: EVENTS_HEARTBEAT_STALE_MS,
      });
      return;
    }
  }

  res.status(200).json({ ready: true });
});
```
Startup race: /healthz/ready returns 200 before workers are booted.
`app.listen(PORT, …)` at line 154 runs before `bootWorkers()` at line 159, and `enableEventsHeartbeat()` is only called inside `bootWorkers`. In the window between the HTTP server starting and `bootWorkers()` completing, `enabled` is false, so this endpoint reports ready even though no events consumer is running yet. If K8s scrapes readiness in that window, it may route traffic / mark the pod ready prematurely.
Consider flipping an explicit "booted" flag after bootWorkers()/bootCron() resolve and returning 503 until then, or starting the HTTP server after boot.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/worker/src/index.ts` around lines 131 - 152, The readiness endpoint
/healthz/ready can return 200 before workers are booted because app.listen(PORT,
...) starts the HTTP server prior to bootWorkers()/bootCron() calling
enableEventsHeartbeat(); change the startup sequencing so readiness is false
until boot complete: either delay app.listen until after
bootWorkers()/bootCron() resolve or add an explicit booted flag (e.g.,
bootComplete = false -> true after bootWorkers()/bootCron() finish) and update
the /healthz/ready handler (which currently uses isShuttingDown() and
getEventsHeartbeat()/EVENTS_HEARTBEAT_STALE_MS) to return 503 when bootComplete
is false; ensure the flag is set only after enableEventsHeartbeat() is active so
the endpoint never reports ready before workers are running.
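A minimal sketch of the booted-flag variant of this fix. `bootWorkers`/`bootCron` are stubbed stand-ins for the real functions in `apps/worker/src/index.ts`, and the `readiness()` helper is an assumption standing in for the Express handler; only the sequencing matters here.

```typescript
// Stand-ins for the real boot functions (assumptions, bodies elided).
async function bootWorkers(): Promise<void> { /* start workers, enable heartbeat */ }
async function bootCron(): Promise<void> { /* register cron queues */ }

let bootComplete = false;

// Mirrors the /healthz/ready decision: 503 while shutting down OR before boot finishes.
function readiness(shuttingDown: boolean): { status: 200 | 503; ready: boolean; reason?: string } {
  if (shuttingDown) return { status: 503, ready: false, reason: 'shutting down' };
  if (!bootComplete) return { status: 503, ready: false, reason: 'booting' };
  return { status: 200, ready: true };
}

async function main(): Promise<void> {
  // app.listen(PORT) may already be accepting probes at this point; readiness()
  // still answers 503 because bootComplete only flips after boot resolves.
  await Promise.all([bootWorkers(), bootCron()]);
  bootComplete = true;
}
```

The alternative (delaying `app.listen` until after boot) avoids the flag entirely, at the cost of the liveness endpoint also being unavailable during boot.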
Summary by CodeRabbit

Release Notes

New Features

- Kubernetes-compatible health endpoints (`/healthz/live`, `/healthz/ready`) for better deployment integration

Bug Fixes

Chores