Skip to content

Storable hook cookie + leaf seen-tag fix; AGENTS timeout rule; jperl orphan watchdog#644

Open
fglock wants to merge 4 commits intomasterfrom
fix/storable-orphan-watchdog-20260430-170423
Open

Storable hook cookie + leaf seen-tag fix; AGENTS timeout rule; jperl orphan watchdog#644
fglock wants to merge 4 commits intomasterfrom
fix/storable-orphan-watchdog-20260430-170423

Conversation

@fglock
Copy link
Copy Markdown
Owner

@fglock fglock commented Apr 30, 2026

Summary

Three independent, ready-to-merge fixes carved out of #635 so they can land without waiting on the still-open DBIx::Class walker investigation.

1. Storable hook cookie + leaf seen-tag drift (commits 1, 2)

PR #621's native Storable encoder had two correctness bugs:

  • Hook cookie was UTF-8 encoded in StorableWriter.tryEmitHook(). The cookie is a binary blob, not text — re-encoding turned every 0x80–0xFF byte into a 2-byte sequence. Surface symptom: thaw failed: SX_HASH keylen 2164260864 out of range when a class with a STORABLE_freeze hook (e.g. DBIC's ResultSource) is frozen and thawed.
  • Leaf scalars not assigned a seen-tag on the write side, while the reader's Scalars.java calls recordSeen() for every leaf opcode (matching upstream's SEEN_NN). Tag IDs drifted past the first leaf, so any later SX_OBJECT(tag=N) resolved to the wrong object on thaw. Surface symptom in DBIC: cryptic hash passed to _order_by must have exactly one key (-desc or -asc)_attrs->{order_by}->[0] resolves to the outer _attrs hash because the tag pointed at the wrong slot.

After this PR:

  • DBIx::Class t/84serialize.t: 0/115 → 115/115 pass
  • DBIx::Class t/52leaks.t: was crashing on dclone → now runs to the existing tests-12-18 leak-tracer state

2. AGENTS.md timeout discipline (commit 3)

Adds a mandatory rule documenting today's incident: investigative agents must wrap ./jperl/./jcpan/./jprove calls in timeout N, NEVER bare ./jperl … or /usr/bin/time -p ./jperl …. Without it, hung JVMs survive the agent's exit (reparented to PID 1, no SIGHUP propagation, no JVM-side watchdog) and silently starve the box.

3. Subprocess orphan watchdog (commit 4)

Adds an opt-in JVM-side daemon thread that polls the initial parent PID every 2s and self-exits when the parent is gone. Enabled automatically by ./jcpan and ./jprove via JPERL_ORPHAN_EXIT=1. Direct ./jperl your_script.pl is unaffected (preserves real-Perl exec semantics — child survives parent).

Verified manually:

  • JPERL_ORPHAN_EXIT=1 ./jperl ... + kill -9 of parent: child JVM prints [jperl] orphaned: parent PID NNN is gone — exiting and dies in ~4 s.
  • exec "sh -c 'echo started; sleep 4; echo finished'" from any context: child runs to completion (no premature kill).

Test plan

  • make (unit tests) green
  • DBIC t/84serialize.t: 115/115
  • Direct ./jperl exec/system semantics preserved
  • Manual SIGTERM/SIGKILL parent → child cleanup verified
  • Conflict resolution with master: Configuration.java (Gradle-managed, taken from master) and AGENTS.md Incident Log (merged both rows from master + this branch)

The DBIx::Class harness t/52leaks.t schema-detached intermittent that surfaces under this PR (after the storable fixes unblock visibility) remains under investigation and tracked in #635.

Generated with Devin

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

fglock and others added 4 commits April 30, 2026 17:04
Two related byte-vs-UTF-8 mismatches in the native Storable writer that
corrupted any blob containing bytes 0x80–0xFF:

1. StorableWriter.tryEmitHook() was calling
   `cookieSv.toString().getBytes(UTF_8)` on the cookie returned by
   STORABLE_freeze. The cookie is a binary Storable stream, not text,
   and re-encoding turned every 0x80..0xFF byte into a 2-byte UTF-8
   sequence — breaking the embedded SX_HOOK length prefix and
   producing wild-looking errors like
   `SX_HASH keylen 2164260864 out of range` on the receiving side.

2. Storable.freezeImpl() created the result scalar with the default
   `new RuntimeScalar(String)` constructor (type STRING). The encoded
   stream is bytes 0..255 — it must be tagged BYTE_STRING so any
   downstream consumer that converts back to bytes (file I/O, hook
   re-pack, etc.) doesn't UTF-8-encode the high bytes.

Together these fix `jcpan -t DBIx::Class` t/84serialize.t failures
in the freeze/thaw_method, freeze/thaw_func and dclone code paths
(both go through nested STORABLE_freeze hooks for ResultSet,
ResultSource and ResultSourceHandle).

Generated with [Devin](https://devin.ai)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The native binary writer was missing recordWriteSeen() calls for leaf
scalars (SX_SCALAR / SX_BYTE / SX_INTEGER / SX_DOUBLE / SX_UTF8STR /
SX_LSCALAR / SX_LUTF8STR / SX_UNDEF / SX_SV_*), but the reader's
Scalars.java already calls recordSeen() for every one of them (matching
upstream's SEEN_NN macro in Storable.xs). The mismatch caused write-side
tag IDs to drift behind the reader after the first leaf scalar, so any
subsequent SX_OBJECT backref resolved to the wrong object on thaw.

Surfaced as `jcpan -t DBIx::Class` failures in t/84serialize.t — for
example, after `nfreeze → thaw` of a DBIC resultset the inner
`_attrs->{order_by}->[0]` would resolve to the outer `_attrs` hash,
producing the cryptic `hash passed to _order_by must have exactly one
key (-desc or -asc)` error from SQL::Abstract::Classic.

Fix: emit a fresh seen-tag at the top of writeScalar() so write/read
counters stay in lockstep.

After this fix:
  - DBIC t/84serialize.t: 0/115 → 115/115
  - DBIC t/52leaks.t: still 11/11

Combined with the prior commit (storable hook-cookie UTF-8 byte-string
fix) this restores `jcpan -t DBIx::Class` to a passing state.

Generated with [Devin](https://devin.ai)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…me -p`

Adds a new mandatory rule for investigative agents:
- ALL `jperl` / `jcpan` / `prove` invocations that could hang must be
  wrapped in `timeout N`, never `/usr/bin/time -p` (which only measures)
  and never bare `./jperl …`.
- Explains why: `./jperl` ends with `exec java …`, so when the agent's
  bash exits, hung JVMs get reparented to PID 1 and keep running at 100%
  CPU forever — there is no SIGHUP propagation and no JVM-side watchdog.
  A handful of these orphans silently starves the whole machine.
- Includes WRONG/RIGHT examples and the post-investigation cleanup-check
  command (`ps aux | awk '$3>20 {...}'` + `pkill -f "perlonjava-.*\.t"`).

Adds an Incident Log entry for today's PR-#635 work, where this exact
trap caused phantom `t/76joins.t` / `t/96_is_deteministic_value.t`
SIGKILLs in `./jcpan -t DBIx::Class` runs — the symptom looked like a
real DBIx::Class perf regression, but was actually CPU starvation from
~14 orphan JVMs left behind by an earlier investigative agent.

Generated with [Devin](https://devin.ai)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Adds a self-watchdog daemon thread in `org.perlonjava.app.cli.Main`
that runs ONLY when env var JPERL_ORPHAN_EXIT is set (auto-set by
`./jcpan` and `./jprove`). The thread caches the initial parent PID
at startup and polls `ProcessHandle.of(...)` every 2s; after two
consecutive misses (~4s) it calls `Runtime.halt(143)`.

This fixes the runaway-orphan trap that has burned us repeatedly
today: when the user `kill -9`s an aborted `./jcpan` or
`./jprove` run, every in-flight sub-jperl test JVM gets reparented
to PID 1 and keeps spinning at 100% CPU, since SIGKILL bypasses
all user-space cleanup. The accumulated orphans then starve any
subsequent `./jcpan -t Module` run, manifesting as random tests
SIGKILL'd at the 300s "no output" deadline. AGENTS.md documents
this as today's incident.

Direct `./jperl your_script.pl` does NOT set the env var, so
user programs are unaffected:

  - Real-Perl `exec LIST` semantics ("child survives parent")
    still apply: the watchdog only runs in jperl JVMs, never in
    sh / non-jperl children spawned via `open(... "$cmd|")` or
    `exec`. So a user who calls exec from a long-running script
    still gets the standard nohup-style behavior.
  - No shutdown hook is installed in the runtime — only the
    opt-in watchdog. The earlier draft of this fix (shutdown hook
    in RuntimeIO) was removed because it would have killed
    pipe-open children on parent System.exit, breaking exec
    semantics under the harness.

Verified manually:

  $ JPERL_ORPHAN_EXIT=1 ./jperl ... &     # parent
    parent gets `kill -9`
    sub-jperl prints `[jperl] orphaned: parent PID NNN is gone — exiting`
    and dies in ~4s.

  $ JPERL_ORPHAN_EXIT=1 ./jperl -e \
      'exec "sh", "-c", "echo started; sleep 4; echo finished"'
    Output:  started ... finished        # exec'd child runs to completion ✓

Generated with [Devin](https://devin.ai)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant