Skip to content

gcs: fix flaky test on bridge teardown race#2744

Merged
shreyanshjain7174 merged 1 commit into
microsoft:mainfrom
shreyanshjain7174:fix/gcs-test-teardown-race
May 19, 2026
Merged

gcs: fix flaky test on bridge teardown race#2744
shreyanshjain7174 merged 1 commit into
microsoft:mainfrom
shreyanshjain7174:fix/gcs-test-teardown-race

Conversation

@shreyanshjain7174
Copy link
Copy Markdown
Contributor

@shreyanshjain7174 shreyanshjain7174 commented May 18, 2026

Summary

TestGcsCreateProcess occasionally panics in CI with:

panic: Fail in goroutine after TestGcsCreateProcess has completed

The fake guest goroutine (simpleGcs) outlives the test and calls t.Error from a closed test scope. Observed flake rate locally: ~2/30 iterations; e.g. https://github.com/microsoft/hcsshim/actions/runs/26017233067/job/76470258316.

Root cause

bridge.Close() closes the underlying io.Pipe synchronously without a shutdown handshake. Because io.Pipe is unbuffered, a write of one bridge message arrives at the reader in two chunks (16-byte header, then body). If Close() lands between those chunks, simpleGcs's ReadFull(body) returns io.EOF after a partial read, which readMessage converts to io.ErrUnexpectedEOF (bridge.go:277). simpleGcsLoop only whitelisted io.EOF / io.ErrClosedPipe, so the error propagated into t.Error after the test had already returned — Go's testing framework then panics.

Fix

Two layered changes in internal/gcs/guestconnection_test.go:

  1. simpleGcsLoop also treats io.ErrUnexpectedEOF as a benign teardown signal. The fake guest legitimately observes truncated frames when the bridge tears down abruptly; this is purely a test-side accommodation. Production behavior is unchanged — the real bridge still surfaces ErrUnexpectedEOF as a fatal protocol error via recvLoopkillErrBridgeClosed.

  2. connectGcs joins the simpleGcs goroutine via t.Cleanup. This ensures any future t.Error from background work (e.g. the inner io.Copy goroutine in the RPCExecuteProcess case has the same hazard) lands inside the test scope as a normal failure instead of panicking the runtime.

@shreyanshjain7174 shreyanshjain7174 requested a review from a team as a code owner May 18, 2026 09:46
TestGcsCreateProcess occasionally panicked with Fail in goroutine after Test... completed because the fake guest (simpleGcs) goroutine outlived the test and called t.Error from there.

Two layered fixes in guestconnection_test.go:

1. simpleGcsLoop now treats io.ErrUnexpectedEOF as a benign teardown signal alongside io.EOF and io.ErrClosedPipe. The bridge's brdg.Close() closes the underlying io.Pipe synchronously, so simpleGcs can legitimately observe a truncated frame mid-body when the test tears down. Production behavior is unchanged: a real bridge still surfaces ErrUnexpectedEOF as a fatal protocol error.

2. connectGcs now joins the simpleGcs goroutine via t.Cleanup so any future t.Error from background work lands inside the test scope as a normal failure instead of panicking the runtime.

Reproduced at ~2/30 iterations pre-fix; 100/100 clean post-fix.

Signed-off-by: Shreyansh Jain <shreyanshjain7174@gmail.com>
@shreyanshjain7174 shreyanshjain7174 force-pushed the fix/gcs-test-teardown-race branch from d2980b4 to ba7e2da Compare May 18, 2026 10:05
@shreyanshjain7174 shreyanshjain7174 merged commit 2ad2839 into microsoft:main May 19, 2026
60 of 61 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants