50 changes: 50 additions & 0 deletions AGENTS.md
@@ -141,13 +141,63 @@
╚══════════════════════════════════════════════════════════════════════════════╝
```

## ⚠️⚠️⚠️ ALWAYS WRAP `jperl`/`jcpan` IN `timeout` ⚠️⚠️⚠️

```
╔══════════════════════════════════════════════════════════════════════════════╗
║ ║
║ Investigative agents that launch PerlOnJava test runs MUST wrap every ║
║ `jperl`/`jcpan`/`prove` invocation with `timeout N` — NEVER just ║
║ `/usr/bin/time -p` (which only measures, never kills) and NEVER bare ║
║ `./jperl …` for anything that could hang. ║
║ ║
║ # WRONG — JVM survives forever if it hangs ║
║ /usr/bin/time -p ./jperl t/foo.t ║
║ ./jperl t/foo.t & ║
║ ║
║ # RIGHT — JVM is hard-killed after 60 s ║
║ timeout 60 ./jperl t/foo.t ║
║ timeout 60 ./jperl -Ilib -It/lib t/foo.t ║
║ ║
║ Why this matters: ║
║ ║
║ - `./jperl` ends with `exec java …`, so the bash wrapper is replaced ║
║ by the JVM. When the agent's own bash exits, those JVMs get ║
║ reparented to PID 1 and KEEP RUNNING at 100% CPU — there is no ║
║ SIGHUP propagation and no JVM-side self-watchdog. ║
║ - On a 48 GB Mac the JVM defaults to ~12 GB heap. A handful of orphan ║
║ JVMs at 100% CPU silently starves the whole machine, which then ║
║ makes the NEXT `jcpan -t Module` run miss the 300 s no-output deadline ║
║ in `TAP::Parser::Iterator::Process` — the symptom looks like "test ║
║ X hangs" when it's really just CPU starvation from orphans. ║
║ - `t/96_is_deteministic_value.t` and `t/76joins.t` SIGKILLs in PR #635 ║
║ CI runs were caused exactly by this: a previous agent left ~14 orphan ║
║ JVMs at 100% CPU each, load avg climbed to 50, and the harness gave ║
║ up on innocent tests after 5 minutes of no TAP output. ║
║ ║
║ If your run REALLY may exceed any sane wall clock (e.g. a full ║
║ `jcpan -t DBIx::Class` is ~40 min), still wrap it: `timeout 3600 ...`. ║
║ If you spawn parallel test workers, give each its own `timeout`. ║
║ ║
║ When you finish an investigation, sanity-check your cleanup: ║
║ ║
║ ps aux | awk '$3 > 20 {print $2, $3, $11, $12}' ║
║ ║
║ If any unexpected `java …perlonjava…` shows up, kill it: ║
║ ║
║ pkill -9 -f "perlonjava-.*\.jar.*\.t\b" ║
║ ║
╚══════════════════════════════════════════════════════════════════════════════╝
```

## Incident Log (do not delete — this is why the rules above exist)

| Date | What was lost | Root cause |
|------------|------------------------------------------------|---------------------------------------------------|
| 2026-04-28 | ~600 cpan-tester module results (4736 → 4139) | Agent ran `git checkout dev/cpan-reports/` on an unstaged refresh; concurrent `cpan_random_tester.pl` instances also race on `.dat` files (separate bug). |
| 2026-04-29 | cpan-reports refresh commit (briefly, on a feature branch — recovered from reflog) | Agent resolved a rebase conflict with `git checkout --ours` thinking it would keep the branch's version. During rebase, `--ours` means UPSTREAM, so the upstream files were taken, the replayed commit became empty, and rebase silently dropped it. Recovery: `git reset --hard <sha>` from `git reflog`, then re-rebase using `--theirs`. |
| 2026-04-30 | (no work lost — recovered) Working tree on `fix/class-trait-tests` was overwritten with master content | Agent ran `git checkout master -- .` to A/B test failures vs master without first snapshotting and without switching branches. Recovery only worked because the changes had already been committed to HEAD: `git restore .` (also a forbidden command on a dirty tree, but safe here because "dirty" was master content, not user work) brought the tree back from HEAD. Correct workflow would have been: stash via `git diff > /tmp/wip.patch`, or use `git worktree add` for the master comparison instead of mutating the current tree. |
| 2026-04-30 | A full afternoon chasing a phantom "DBIx::Class regression" in `t/76joins.t` / `t/96_is_deteministic_value.t` | Investigative agent launched the test repeatedly under `/usr/bin/time -p ./jperl …` (no `timeout` wrapper). Each hung JVM survived past the agent's lifetime, accumulated as ~14 orphans at 100% CPU each, and starved the active `jcpan` harness — which then SIGKILLed innocent tests after 300 s of no TAP output. Symptom looked exactly like a real perf regression. Fix: always `timeout N ./jperl …` for any potentially-hanging run. |

When you cause a new incident, append a row here in the same commit
that fixes it. Future agents need to see that these warnings are real.
9 changes: 9 additions & 0 deletions jcpan
@@ -56,6 +56,15 @@ fi
# Override: JPERL_TEST_TIMEOUT=0 (disable) or JPERL_TEST_TIMEOUT=600 (10 min)
export JPERL_TEST_TIMEOUT="${JPERL_TEST_TIMEOUT:-300}"

# Enable the orphan-exit watchdog in every jperl this run spawns. If the
# parent jcpan / test_harness process is killed (e.g. SIGKILL'd by the
# user, or terminated by a CI step), each child JVM polls its initial
# parent PID every 2s and self-exits when that parent disappears.
# Without this, killing the harness leaves dozens of in-flight test
# JVMs reparented to PID 1, all spinning at 100% CPU until manually
# pkill'd. See AGENTS.md "ALWAYS WRAP jperl/jcpan IN timeout" rule.
export JPERL_ORPHAN_EXIT=1

# Expose the jperl launcher AND the jcpan launcher itself so distroprefs
# (e.g. Moose.yml) can run upstream tests against the bundled shims with
# `prove --exec jperl`, and bootstrap missing helper modules with
4 changes: 4 additions & 0 deletions jcpan.bat
@@ -22,6 +22,10 @@ goto parse_args
:run
rem Set default per-test timeout (300s) to kill hanging tests
if not defined JPERL_TEST_TIMEOUT set "JPERL_TEST_TIMEOUT=300"
rem Enable orphan-exit watchdog in every jperl this run spawns: when
rem the parent jcpan dies, each child JVM self-exits within ~4s instead
rem of lingering as an orphan and burning 100% CPU forever.
set "JPERL_ORPHAN_EXIT=1"
rem Expose jperl and jcpan launchers, and prepend SCRIPT_DIR to PATH so
rem shell-spawned subprocesses (distroprefs commandlines, prove --exec,
rem etc.) can find jperl/jcpan without tokens that don't expand in
7 changes: 7 additions & 0 deletions jprove
@@ -15,4 +15,11 @@ else
exit 1
fi

# Enable the orphan-exit watchdog in every jperl this run spawns. If the
# parent jprove process is killed (e.g. SIGKILL'd by the user, or
# terminated by a CI step), each child JVM polls its initial parent PID
# every 2s and self-exits when that parent disappears. See the matching
# block in `./jcpan` and AGENTS.md for the full rationale.
export JPERL_ORPHAN_EXIT=1

exec "$SCRIPT_DIR/jperl" "$PROVE_SCRIPT" "$@"
4 changes: 4 additions & 0 deletions jprove.bat
@@ -6,5 +6,9 @@ rem Repository: github.com/fglock/PerlOnJava
rem Get the directory where this script is located
set SCRIPT_DIR=%~dp0

rem Enable orphan-exit watchdog in every jperl this run spawns — when
rem the parent jprove dies, each child JVM self-exits within ~4s.
set "JPERL_ORPHAN_EXIT=1"

rem Run jperl with the prove script
call "%SCRIPT_DIR%jperl.bat" "%SCRIPT_DIR%src\main\perl\bin\prove" %*
59 changes: 59 additions & 0 deletions src/main/java/org/perlonjava/app/cli/Main.java
@@ -18,6 +18,65 @@ public class Main {
    static {
        // Set default locale to US (uses dot as decimal separator)
        Locale.setDefault(Locale.US);

        // Optional orphan-exit watchdog. When the env var
        // JPERL_ORPHAN_EXIT is set (typically by `./jcpan` and
        // `./jprove`, which spawn many short-lived sub-jperls), this
        // JVM self-exits a few seconds after its initial parent
        // process disappears. Without this, a `kill -9` on the parent
        // jcpan/test_harness leaves all in-flight test JVMs reparented
        // to PID 1, where they happily keep running at 100% CPU
        // forever, burning the box and starving subsequent runs.
        //
        // SIGTERM-style parent death is already handled by the
        // shutdown hook in RuntimeIO; this watchdog covers the SIGKILL
        // case (no shutdown hooks fire on the kernel-side kill).
        //
        // Direct `./jperl your_script.pl` does NOT set the env var, so
        // user programs are never killed when their shell exits; they
        // get the standard nohup-style behavior they'd expect from any
        // long-running interpreter.
        if (System.getenv("JPERL_ORPHAN_EXIT") != null) {
            startOrphanWatchdog();
        }
    }

    private static void startOrphanWatchdog() {
        java.util.Optional<java.lang.ProcessHandle> parentOpt =
                java.lang.ProcessHandle.current().parent();
        if (parentOpt.isEmpty()) return; // no parent? nothing to watch.
        long initialParentPid = parentOpt.get().pid();
        // PID 1 = init/launchd. If we were directly spawned by it,
        // there's no point watching: we're already at the root.
        if (initialParentPid <= 1) return;

        Thread watchdog = new Thread(() -> {
            // Poll every 2s. Exit only after two consecutive misses
            // (~4s) to avoid race with rapid parent restarts.
            int missCount = 0;
            while (true) {
                try {
                    Thread.sleep(2000);
                } catch (InterruptedException ie) {
                    return;
                }
                java.util.Optional<java.lang.ProcessHandle> p =
                        java.lang.ProcessHandle.of(initialParentPid);
                boolean parentGone = p.isEmpty() || !p.get().isAlive();
                if (parentGone) {
                    if (++missCount >= 2) {
                        System.err.println("[jperl] orphaned: parent PID "
                                + initialParentPid
                                + " is gone — exiting");
                        Runtime.getRuntime().halt(143); // 128 + SIGTERM
                    }
                } else {
                    missCount = 0;
                }
            }
        }, "perlonjava-orphan-watchdog");
        watchdog.setDaemon(true);
        watchdog.start();
    }

    /**
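Reviewer note: the watchdog's decision logic can be exercised without killing a real parent process. The sketch below is not part of the PR. `OrphanWatchdogSketch`, `MissCounter`, `isGone` and `enabled` are hypothetical names that isolate the three checks the hunk above performs: the `!= null` env test, the `ProcessHandle` liveness test, and the two-consecutive-miss rule.

```java
import java.util.Optional;

/** Hypothetical helper, not part of PerlOnJava: isolates the watchdog's checks for testing. */
class OrphanWatchdogSketch {

    /** Counts consecutive "parent missing" polls; any successful poll resets the count. */
    static final class MissCounter {
        private final int threshold;
        private int misses;

        MissCounter(int threshold) {
            this.threshold = threshold;
        }

        /** Records one poll result; returns true once `threshold` consecutive misses were seen. */
        boolean record(boolean parentGone) {
            misses = parentGone ? misses + 1 : 0;
            return misses >= threshold;
        }
    }

    /** Same liveness test as the diff: gone means no handle, or a handle that is not alive. */
    static boolean isGone(long pid) {
        Optional<ProcessHandle> handle = ProcessHandle.of(pid);
        return handle.isEmpty() || !handle.get().isAlive();
    }

    /** Mirrors `System.getenv(...) != null`: any value, even "0", counts as enabled. */
    static boolean enabled(String envValue) {
        return envValue != null;
    }

    public static void main(String[] args) {
        MissCounter counter = new MissCounter(2);
        System.out.println(counter.record(true));   // first miss: keep running
        System.out.println(counter.record(false));  // parent seen again: count resets
        System.out.println(counter.record(true));   // first miss after the reset
        System.out.println(counter.record(true));   // second consecutive miss: exit
        System.out.println(isGone(ProcessHandle.current().pid())); // this JVM is alive
        System.out.println(enabled("0"));
    }
}
```

One consequence worth checking against the intended behavior: because the check is `!= null`, `JPERL_ORPHAN_EXIT=0` would still enable the watchdog, unlike `JPERL_TEST_TIMEOUT=0`, which is documented as "disable".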
4 changes: 2 additions & 2 deletions src/main/java/org/perlonjava/core/Configuration.java
@@ -33,7 +33,7 @@ public final class Configuration {
      * Automatically populated by Gradle/Maven during build.
      * DO NOT EDIT MANUALLY - this value is replaced at build time.
      */
-    public static final String gitCommitId = "0416ffb3b";
+    public static final String gitCommitId = "9a1145435";

     /**
      * Git commit date of the build (ISO format: YYYY-MM-DD).
@@ -48,7 +48,7 @@ public final class Configuration {
      * Parsed by App::perlbrew and other tools via: perl -V | grep "Compiled at"
      * DO NOT EDIT MANUALLY - this value is replaced at build time.
      */
-    public static final String buildTimestamp = "Apr 30 2026 16:17:32";
+    public static final String buildTimestamp = "Apr 30 2026 11:43:39";

     // Prevent instantiation
     private Configuration() {
@@ -96,6 +96,7 @@ private static RuntimeList freezeImpl(RuntimeArray args, boolean netorder) {
            // byte-string scalar so consumers see it as raw bytes (matches
            // the existing freeze() return shape).
            RuntimeScalar result = new RuntimeScalar(encoded);
            result.type = RuntimeScalarType.BYTE_STRING;
            return result.getList();
        } catch (Exception e) {
            return WarnDie.die(new RuntimeScalar("freeze failed: " + e.getMessage()), new RuntimeScalar("\n")).getList();
@@ -223,9 +223,23 @@ private boolean tryEmitHook(StorableContext c, RuntimeScalar refScalar, String c

         // First element is the frozen cookie; rest are sub-refs.
         RuntimeScalar cookieSv = items.get(0);
-        byte[] frozen = cookieSv == null
-                ? new byte[0]
-                : cookieSv.toString().getBytes(StandardCharsets.UTF_8);
+        // The cookie returned by STORABLE_freeze is a binary Storable
+        // blob (chars 0..255 stored as Java chars). Treat it as raw
+        // bytes: encoding it as UTF-8 mangles the high bytes (0x80..0xFF
+        // become 2-byte sequences) and corrupts the embedded stream.
+        byte[] frozen;
+        if (cookieSv == null) {
+            frozen = new byte[0];
+        } else if (cookieSv.type == RuntimeScalarType.BYTE_STRING) {
+            String s = cookieSv.toString();
+            frozen = new byte[s.length()];
+            for (int i = 0; i < frozen.length; i++) frozen[i] = (byte) s.charAt(i);
+        } else {
+            // Plain STRING — also a byte string in practice for hook cookies,
+            // since STORABLE_freeze returns the result of nfreeze(). Use
+            // ISO_8859_1 to preserve every char 0..255 as a single byte.
+            frozen = cookieSv.toString().getBytes(StandardCharsets.ISO_8859_1);
+        }
         int subCount = items.size() - 1;

         // Determine object kind from the bless target.
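Reviewer note: the corruption this hunk fixes is easy to reproduce in isolation. The sketch below is not PR code. `CookieBytesSketch` and `narrow` are hypothetical names, and the three-character string merely stands in for a binary cookie. It compares the removed UTF-8 encoding with the two replacement strategies.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical demo, not part of PerlOnJava: byte-preserving vs. UTF-8 encoding of a "byte string". */
class CookieBytesSketch {

    /** Char-by-char narrowing, as in the BYTE_STRING branch of the diff. */
    static byte[] narrow(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) s.charAt(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Three chars standing in for the raw bytes 0x04 0x80 0xFF.
        String cookie = "\u0004\u0080\u00ff";

        byte[] latin1 = cookie.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = cookie.getBytes(StandardCharsets.UTF_8);

        System.out.println(latin1.length);                         // 3: one byte per char
        System.out.println(utf8.length);                           // 5: 0x80 and 0xFF each become two bytes
        System.out.println(Arrays.equals(latin1, narrow(cookie))); // true for chars 0..255
    }
}
```

The two new branches agree for every char in 0..255. They differ only for chars above 255, where narrowing keeps the low 8 bits and `getBytes(ISO_8859_1)` substitutes `?`. Neither case should occur in a well-formed cookie.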
@@ -433,6 +447,15 @@ public void dispatch(StorableContext c, RuntimeScalar value) {
    /** Emit the body of a non-reference scalar. Mirrors
     *  {@code store_scalar} (Storable.xs L2393). */
    private void writeScalar(StorableContext c, RuntimeScalar v) {
        // Every fresh leaf scalar consumes a seen-tag on the read side
        // (Storable.xs `retrieve_*` for SX_SCALAR / SX_BYTE / SX_INTEGER /
        // SX_DOUBLE / SX_UTF8STR / SX_LSCALAR / SX_LUTF8STR / SX_UNDEF /
        // SX_SV_* all call SEEN_NN). The writer must allocate the
        // matching tag here so subsequent SX_OBJECT backrefs line up.
        // The key is unique per emission, since leaf scalars don't
        // participate in identity-shared backref deduplication.
        c.recordWriteSeen(new Object());

        // undef
        if (v.type == RuntimeScalarType.UNDEF || !v.getDefinedBoolean()) {
            c.writeByte(Opcodes.SX_UNDEF);
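Reviewer note: a toy model of the tag bookkeeping shows why the writer has to spend a tag on every leaf. This is a sketch under assumptions, not the real `StorableContext`. `SeenTagSketch` and its methods are hypothetical. It assumes the seen-table is keyed by object identity and hands out tags in emission order, which the diff implies but does not show.

```java
import java.util.IdentityHashMap;
import java.util.Map;

/** Hypothetical toy model, not the real StorableContext: tags are handed out in emission order. */
class SeenTagSketch {

    /** Identity-keyed seen-table (assumed shape): object -> tag. */
    private final Map<Object, Integer> seen = new IdentityHashMap<>();

    /** Allocates the next tag for `key`; stands in for recordWriteSeen in this sketch. */
    int record(Object key) {
        int tag = seen.size();
        seen.put(key, tag);
        return tag;
    }

    /** Tag the writer would use for a shared value emitted after `leaves` leaf scalars. */
    static int writerTag(int leaves, boolean countLeaves) {
        SeenTagSketch writer = new SeenTagSketch();
        for (int i = 0; i < leaves; i++) {
            if (countLeaves) {
                writer.record(new Object()); // fresh key: never matches a later lookup
            }
        }
        return writer.record("shared");
    }

    /** Tag the reader gave the same value: every retrieved item, leaf or not, took one tag. */
    static int readerTag(int leaves) {
        return leaves;
    }

    public static void main(String[] args) {
        System.out.println(writerTag(2, true) == readerTag(2));  // true: backrefs line up
        System.out.println(writerTag(2, false) == readerTag(2)); // false: writer is two tags behind
    }
}
```

Passing `new Object()` as the key spends a tag without making the leaf a deduplication candidate, which matches the comment in the hunk.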