Skip to content

perf: make StringProtocol.range(of:) scan UTF-8 bytes#325

Merged
pepicrft merged 1 commit intomainfrom
fix/range-of-utf8-fast-path
Apr 16, 2026
Merged

perf: make StringProtocol.range(of:) scan UTF-8 bytes#325
pepicrft merged 1 commit intomainfrom
fix/range-of-utf8-fast-path

Conversation

@pepicrft
Copy link
Copy Markdown
Contributor

Summary

  • StandardLibraryExtensions.StringProtocol.range(of:) previously walked the receiver Character by Character, calling distance(from:to:) on every outer iteration — effectively O(n²) in the receiver length and dominated by grapheme-cluster comparison cost.
  • Tuist's XCActivityLog parser (processor/native/xcactivitylog_nif/Sources/XCActivityLogParser/XCActivityLogParser.swift:181-183) calls it three times per diagnostic ("warning: ", "error: ", " ^~"). On builds with thousands of diagnostics (Depop ~6800, Pinterest ~700, monday ~280) this was pinning every dirty CPU scheduler thread (ERL_NIF_DIRTY_JOB_CPU_BOUND in nif_bridge.c:101) in production, matching the perf/gdb hotspot reported: StandardLibraryExtensions.range(of:)String.distance(from:to:).
  • New implementation scans the UTF-8 byte view. UTF-8 is self-synchronizing so byte-level matches are Unicode-scalar-aligned, and String.Index values from the utf8 view remain valid on the receiver. String.UTF8View index ops are O(1), so the inner loop is a tight byte-comparison scan instead of Character compares plus repeated distance(...) calls.
  • Added a StandardLibraryExtensionsTests target (no test coverage existed for StandardLibraryExtensions before) with correctness tests covering: empty pattern, pattern-longer-than-receiver, match at start/middle/end, non-matching, backtrack ("aaab".range(of: "aab")), Substring receivers, ASCII pattern inside a Unicode haystack, and Unicode pattern.

Benchmark

Unit test test_benchmark_many_issue_messages — 6,000 realistic diagnostic messages (~700 chars each, warning + error + plain templates) × 3 range(of:) calls = 18,000 calls, release build, Apple Silicon:

Time Per call
Before 286.04 s ~16 ms
After 0.044 s ~2.4 µs

~6,500× faster. Full package test suite (98 tests) passes.

Extrapolating to the worst offender in the report — Depop ae9124af-... with ~6,805 issues — pre-fix that one build burned ~5.4 minutes of CPU inside range(of:) alone; post-fix it's ~50 ms.

Test plan

  • swift test -c release — 98/98 pass
  • New correctness tests in StringProtocolRangeOfTests — 11/11 pass (ASCII, Unicode, Substring)
  • Benchmark test captures and prints before/after timings
  • Verify on the Tuist processor box that CPU saturation on the listed active builds drops once this is deployed

🤖 Generated with Claude Code

The previous implementation walked the receiver Character by Character,
recomputing `distance(from:to:)` on every outer iteration and comparing
grapheme clusters one at a time. That made it O(n²) in the receiver
length and extremely slow for long strings — a hot callsite in Tuist's
XCActivityLog parser (three `range(of:)` calls per diagnostic) was
pinning dirty CPU scheduler threads in production, taking ~16 ms per
call on ~700 character messages.

Scan the UTF-8 byte view instead. UTF-8 is self-synchronizing, so
byte-level matches are Unicode-scalar-aligned and the returned
`String.Index` values remain valid on the receiver. All index
operations on `String.UTF8View` are O(1), and byte comparisons are
dramatically cheaper than Character comparisons.

Benchmark (6000 realistic diagnostic messages × 3 patterns = 18 000
calls, release build, Apple Silicon):

  before: 286.04 s (~16 ms per call)
  after:    0.044 s (~2.4 µs per call)

~6500× speed-up on the benchmark and full test suite (98 tests) still
passes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@pepicrft pepicrft force-pushed the fix/range-of-utf8-fast-path branch from 52ba687 to c801c75 Compare April 15, 2026 13:53
@pepicrft pepicrft self-assigned this Apr 15, 2026
@pepicrft pepicrft requested review from a team, cschmatzler and fortmarek and removed request for a team April 15, 2026 13:54
@pepicrft pepicrft marked this pull request as ready for review April 15, 2026 13:55
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c801c750d8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +48 to +50
let haystack = utf8
let needle = string.utf8
let nCount = needle.count
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve canonical-equivalent Unicode matches

Switching range(of:) to raw UTF-8 byte comparison changes semantics from character equality to byte equality, which regresses canonical-equivalent matching. For example, a haystack containing decomposed "Cafe\u{301}" used to match a precomposed needle "Café" (via Character equality) but now returns nil. This can cause real substring misses on normalized-vs-decomposed Unicode inputs even though the text is canonically the same.

Useful? React with 👍 / 👎.

@pepicrft pepicrft merged commit f50c9d2 into main Apr 16, 2026
10 checks passed
pepicrft added a commit that referenced this pull request Apr 16, 2026
Three public API-shape hazards removed because they win overload
resolution against Foundation or stdlib and silently replace caller
intent, the same pattern that caused the O(n^2) range(of:) regression
(PR #325):

- String.init(_ bytes: [UInt8]) and String.init(_ bytes:
  ArraySlice<UInt8>) in Binary.Serializable — any call site doing
  String(someByteArray) silently resolves to these lossy UTF-8
  decoders (String(decoding: bytes, as: UTF8.self)), swallowing
  invalid bytes as U+FFFD without the caller opting in.

- Sequence.count(where:) — duplicate of the Swift 6 stdlib built-in
  Sequence.count(where:); a competing declaration creates
  overload-resolution ambiguity on Swift 6 consumers with no
  functional benefit.

- Sequence.max(count:) and Sequence.min(count:) — same signature as
  swift-algorithms, causing ambiguous-overload errors for consumers
  that depend on both packages. The implementation is also
  O(N * K log K) due to a full sort inside the per-element loop
  instead of a partial-sort/heap, which is a footgun for any
  large-N caller.

The only internal use of String.init(_ bytes: [UInt8]) was in
SetCharacter+INCITS_4_1986.swift to build the CRLF Character, which
is rewritten to call String(decoding:as:) explicitly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants