feat: subchunk write order by ilan-gold · Pull Request #3826 · zarr-developers/zarr-python

ilan-gold · 2026-03-24T11:04:30Z

In order to encourage ecosystem compatibility + reserve runtime setting strings/enums (see zarrs/zarrs-python#160), subchunk write order is expanded from morton to include lexicographic, colexicographic, and unordered (which is randomized).

TODO:

Add unit tests and/or doctests in docstrings
Add docstrings and API docs for any new/modified user-facing classes and functions
New/modified features documented in docs/user-guide/*.md
Changes documented as a new file in changes/
GitHub Actions have all passed
Test coverage is 100% (Codecov passes)

d-v-b · 2026-03-24T11:57:39Z

    end = "end"


+class SubchunkWriteOrder(Enum):


advantage of an enum over Literal["morton", "unordered", "lexicographic", "colexicographic"]?

Just copied what was done for ShardingCodecIndexLocation!

I'm not a huge fan of enums in python (including ShardingCodecIndexingLocation), so unless you object I think it would be better to use a simple Literal + a final tuple of strings, like:

SubchunkWriteOrder = Literal["morton", "unordered", "lexicographic", "colexicographic"] SUBCHUNK_WRITE_ORDER: Final[tuple[str, str, str, str]] = ("morton", "unordered", "lexicographic", "colexicographic")

Done (hopefully)!

Co-authored-by: Davis Bennett <davis.v.bennett@gmail.com>

…o ig/shard_order

ilan-gold · 2026-03-27T16:36:34Z


        if self._is_complete_shard_write(indexer, chunks_per_shard):
-            shard_dict = dict.fromkeys(morton_order_iter(chunks_per_shard))
+            shard_dict = dict.fromkeys(np.ndindex(chunks_per_shard))


cc @mkitti

Here and below, I don't think there is any need to construct the dict in morton order, right? There should be no correctness or performance hit here?

@d-v-b This now ensures we only shuffle in the unordered case once so the test is nice and clean - write once + get order, create a new codec with the same seed + create the iterator from that codec, match orders

In Python, dicts are ordered and I think the optimal iteration order may need to be encoded in the dict the last time I examined the situation. I was just trying to preserve the situation before my edits.

027c469

So this wasn't about dictionary order, but instead in the vectorized case, the order to ShardReader.to_dict_vectorized had to match that of what ShardReader was internally generating, as it turned out morton order. So I'm glad I caught this because I think it means the data was being corrupted for the other orders (which weren't getting hypothesis-tested).

So I'm going to add something to the hyptothesis tests for this.

I had the same feeling initially that the dictionary order mattered, but it turns out the final call to _encode_shard_dict actually handles the ordering for us to the output buffer while writing to the intermediate shard_dict can be done in any order, as long as the final buffer is done in the correct order

codecov · 2026-05-20T14:13:31Z

Codecov Report

❌ Patch coverage is 96.42857% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.43%. Comparing base (5ca1690) to head (18d952e).

Files with missing lines	Patch %	Lines
src/zarr/codecs/sharding.py	93.33%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3826      +/-   ##
==========================================
+ Coverage   93.39%   93.43%   +0.04%     
==========================================
  Files          88       88              
  Lines       11839    11874      +35     
==========================================
+ Hits        11057    11095      +38     
+ Misses        782      779       -3

Files with missing lines	Coverage Δ
src/zarr/codecs/__init__.py	`100.00% <100.00%> (ø)`
src/zarr/testing/strategies.py	`94.66% <100.00%> (+1.77%)`	⬆️
src/zarr/codecs/sharding.py	`92.13% <93.33%> (+0.19%)`	⬆️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

d-v-b

this looks good, thanks!

we have some interesting follow-up work:

finding a nice way to expose the sharding codec settings in the signature of create_array
writing a chunk writing routine that can incrementally stream out a shard
formalizing the distinction between codec attributes that get serialized to JSON and runtime-only attributes

…#3826, partial-read opt zarr-developers#3004, _ShardIndex refactor zarr-developers#3975) Resolves conflicts in sharding.py (kept FusedCodecPipeline sync methods + main's _subchunk_order_iter / _load_partial_shard_maybe; fixed _ShardIndex construction to main's 2-arg signature), array.py (took main's cached regular_chunk_spec), test_codec_pipeline.py (kept the dual-pipeline suite + main's evolve test), .gitignore (union). 423 codec/sharding/parity + 807 codecs/indexing tests pass under both pipelines. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…al reads Three integration gaps surfaced when the Fused pipeline met main's new subchunk_write_order (zarr-developers#3826), partial-read coalescing (zarr-developers#3004), and _ShardIndex refactor. Under Fused these caused 25 sharding/parity failures (data was correct in the partial-read cases; the failures were write-order layout + IO-pattern divergence). Fixes: 1. Write order: _encode_shard_dict_sync laid out chunks in hardcoded morton order, ignoring subchunk_write_order. Now iterates _subchunk_order_iter(self.subchunk_write_order), matching the async _encode_shard_dict. Fixes lexicographic/colexicographic/unordered storage. 2. Coalesced sync partial reads: add Store.get_ranges_sync (a synchronous, coalescing counterpart of get_ranges, reusing coalesce_ranges) and ShardingCodec._load_partial_shard_maybe_sync; route _decode_partial_sync's partial branch through it. Sync stores now get zarr-developers#3004's byte-range coalescing without an event loop (fewer, merged reads). 3. Non-sync fallback: FusedCodecPipeline.read now routes non-sync stores (e.g. ZipStore) through the async partial-decode path when the AB codec supports it, instead of _async_read_fallback's whole-shard get(). Matches Batched's IO behavior; avoids over-reading whole shards on partial reads. Tests: the zarr-developers#3004 partial-read tests are made pipeline-aware (assert the active method family: get/get_ranges vs get_sync/get_ranges_sync, gated on store sync support). 573 sharding+parity+pipeline+indexing and 657 codec tests pass under BOTH pipelines (was 25 failing under Fused). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat: subchunk write order (#3826) * feat: subchunk write order * chore: export `SubchunkWriteOrder` * chore: docs * chore: relnote * rename * refactor: no enums * Update docs/user-guide/performance.md Co-authored-by: Davis Bennett <davis.v.bennett@gmail.com> * feat: deterministic but random order * fix: make vectorized fetching less reliant on matching order * chore: add hypothesis * refactor: dead code * refactor: more cleanup * don't shard unless there is something to shard * fix: dont mix chunk grid and sharding --------- Co-authored-by: Davis Bennett <davis.v.bennett@gmail.com> (cherry picked from commit 093a153) * refactor: unordered subchunk order means no-promise, not random Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: pin subchunk_write_order survival through pickle Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: remove rng from ShardingCodec; carry subchunk_write_order through pickle Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: describe unordered subchunk order as no-guarantee, drop rng Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: drive sharding strategy through serializer to exercise subchunk_write_order The hypothesis arrays() strategy passed both shards= and a ShardingCodec serializer, which nested the codecs and left subchunk_write_order governing only a 1-element inner grid. Drive sharding through the serializer alone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * harden: guard _subchunk_order_iter; document write-order is not persisted Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cleanup: use np.ndindex for immaterial intermediate order; drop stale FIXME Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * polish: guard scalar arrays in sharding strategy; align doc value ordering Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Revert strategies.py changes — unrelated to this PR These changes were patching a latent bug in `arrays()` where ShardingCodec-as-serializer was being double-stacked with `shards=...`, producing nested sharding. Splitting to a follow-up PR so this one stays focused on removing the `rng`/random-subchunk-order surface. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Ilan Gold <ilanbassgold@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ilan-gold added 6 commits March 24, 2026 11:53

feat: subchunk write order

5477d70

chore: export SubchunkWriteOrder

2e36679

chore: docs

c6498b2

chore: relnote

58e071c

Merge branch 'main' into ig/shard_order

417df78

rename

11b94c0

d-v-b reviewed Mar 24, 2026

View reviewed changes

ilan-gold added 2 commits March 24, 2026 15:32

refactor: no enums

b0c622d

Merge branch 'main' into ig/shard_order

22a5dda

ilan-gold commented Mar 24, 2026

View reviewed changes

Comment thread src/zarr/codecs/sharding.py

Merge branch 'main' into ig/shard_order

39634f0

ilan-gold requested a review from d-v-b March 24, 2026 17:51

Merge branch 'main' into ig/shard_order

f4498a6

ilan-gold mentioned this pull request Mar 26, 2026

chore: prepare for release zarrs/zarrs-python#162

Merged

d-v-b reviewed Mar 26, 2026

View reviewed changes

Comment thread src/zarr/codecs/sharding.py Outdated

d-v-b reviewed Mar 26, 2026

View reviewed changes

Comment thread docs/user-guide/performance.md Outdated

ilan-gold and others added 5 commits March 27, 2026 09:30

Update docs/user-guide/performance.md

be7ac83

Co-authored-by: Davis Bennett <davis.v.bennett@gmail.com>

Merge branch 'main' into ig/shard_order

a89249a

Merge branch 'main' into ig/shard_order

5ea1cf3

feat: deterministic but random order

7b663ff

Merge branch 'ig/shard_order' of github.com:ilan-gold/zarr-python int…

eae06dd

…o ig/shard_order

ilan-gold commented Mar 27, 2026

View reviewed changes

ilan-gold requested a review from d-v-b March 27, 2026 16:36

ilan-gold added 3 commits March 27, 2026 18:23

Merge branch 'main' into ig/shard_order

f36ea93

fix: make vectorized fetching less reliant on matching order

027c469

chore: add hypothesis

3f53182

ilan-gold mentioned this pull request Mar 27, 2026

Add store methods for writing into an existing buffer #3831

Open

ilan-gold added 3 commits March 27, 2026 22:24

Merge branch 'main' into ig/shard_order

8eb1792

refactor: dead code

1889212

refactor: more cleanup

09e2bc5

Merge branch 'main' into ig/shard_order

73352f8

ilan-gold mentioned this pull request May 20, 2026

perf: phased codecpipeline #3885

Open

ilan-gold added 2 commits May 20, 2026 15:21

Merge branch 'main' into ig/shard_order

3867812

don't shard unless there is something to shard

b3861c8

ilan-gold and others added 2 commits May 20, 2026 16:51

fix: dont mix chunk grid and sharding

a59b997

Merge branch 'main' into ig/shard_order

00823f3

d-v-b approved these changes May 22, 2026

View reviewed changes

Merge branch 'main' into ig/shard_order

18d952e

d-v-b merged commit 093a153 into zarr-developers:main May 22, 2026
30 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: subchunk write order#3826

feat: subchunk write order#3826
d-v-b merged 27 commits into
zarr-developers:mainfrom
ilan-gold:ig/shard_order

ilan-gold commented Mar 24, 2026 •

edited

Loading

Uh oh!

d-v-b Mar 24, 2026

Uh oh!

ilan-gold Mar 24, 2026

Uh oh!

d-v-b Mar 24, 2026

Uh oh!

ilan-gold Mar 24, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ilan-gold Mar 27, 2026 •

edited

Loading

Uh oh!

mkitti Mar 27, 2026

Uh oh!

ilan-gold Mar 27, 2026

Uh oh!

codecov Bot commented May 20, 2026 •

edited

Loading

Uh oh!

d-v-b left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

ilan-gold commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

d-v-b Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

ilan-gold Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

d-v-b Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

ilan-gold Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ilan-gold Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mkitti Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

ilan-gold Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

d-v-b left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ilan-gold commented Mar 24, 2026 •

edited

Loading

ilan-gold Mar 24, 2026 •

edited

Loading

ilan-gold Mar 27, 2026 •

edited

Loading

codecov Bot commented May 20, 2026 •

edited

Loading