feat: Phase 3a — merge metadata aggregation, message types, replaced_split_ids #6352

Merged: g-talbot merged 2 commits into main from gtt/parquet-merge-pipeline on May 1, 2026
@g-talbot g-talbot commented Apr 28, 2026

Summary

Phase 3 (pipeline integration), first PR. Building on Phase 1 (merge engine, #6335) and Phase 2 (merge policy, #6351).

  • merge_parquet_split_metadata() — aggregates input split metadata with MergeOutputFile physical metadata to produce complete ParquetSplitMetadata for merged output. Validates invariant fields (kind, index_uid, partition_id, sort_fields, window), unions metric_names and tags, finalizes tag cardinality after merge. 17 unit tests.
  • Message types: ParquetNewSplits, ParquetMergeTask, ParquetMergeScratch for the merge actor chain (planner → scheduler → downloader → executor).
  • replaced_split_ids — added to ParquetSplitBatch and propagated through ParquetUploader (was hardcoded Vec::new()). Enables the merge executor to specify which splits are being replaced during atomic publish-and-replace.
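
As a rough illustration of the aggregation described in the first bullet, here is a minimal sketch. It is not the code from this PR: `SplitMeta`, `OutputFile`, `MergeMetaError`, and `merge_split_meta` are hypothetical, simplified stand-ins for `ParquetSplitMetadata`, `MergeOutputFile`, and `merge_parquet_split_metadata()`, and the field types are assumptions. It checks only four of the invariant fields, unions `metric_names`, and takes the physical numbers from the output file. Tag handling and cardinality finalization are left out.

```rust
use std::collections::BTreeSet;

// Hypothetical, simplified stand-in for ParquetSplitMetadata.
#[derive(Debug, Clone, PartialEq)]
pub struct SplitMeta {
    pub index_uid: String,
    pub partition_id: u64,
    pub sort_fields: String,
    pub window: (i64, i64),
    pub metric_names: BTreeSet<String>,
    pub num_rows: u64,
    pub size_bytes: u64,
}

// Hypothetical stand-in for the physical metadata of a merged output file.
#[derive(Debug, Clone)]
pub struct OutputFile {
    pub num_rows: u64,
    pub size_bytes: u64,
}

#[derive(Debug, PartialEq)]
pub enum MergeMetaError {
    NoInputs,
    // Carries the name of the field that differed between inputs.
    InvariantMismatch(&'static str),
}

pub fn merge_split_meta(
    inputs: &[SplitMeta],
    output: &OutputFile,
) -> Result<SplitMeta, MergeMetaError> {
    let first = inputs.first().ok_or(MergeMetaError::NoInputs)?;

    // Invariant fields must be identical across every input split.
    for other in &inputs[1..] {
        if other.index_uid != first.index_uid {
            return Err(MergeMetaError::InvariantMismatch("index_uid"));
        }
        if other.partition_id != first.partition_id {
            return Err(MergeMetaError::InvariantMismatch("partition_id"));
        }
        if other.sort_fields != first.sort_fields {
            return Err(MergeMetaError::InvariantMismatch("sort_fields"));
        }
        if other.window != first.window {
            return Err(MergeMetaError::InvariantMismatch("window"));
        }
    }

    // Set-valued fields are unioned across inputs.
    let mut metric_names = BTreeSet::new();
    for input in inputs {
        metric_names.extend(input.metric_names.iter().cloned());
    }

    // Physical fields come from the merged output file, not from the inputs.
    Ok(SplitMeta {
        index_uid: first.index_uid.clone(),
        partition_id: first.partition_id,
        sort_fields: first.sort_fields.clone(),
        window: first.window,
        metric_names,
        num_rows: output.num_rows,
        size_bytes: output.size_bytes,
    })
}
```

Returning the name of the mismatched field in the error is a design choice of this sketch; it keeps a rejected merge easy to diagnose.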

Test plan

  • 17 unit tests for merge_parquet_split_metadata()
  • 4 existing ParquetUploader tests pass with new field
  • cargo clippy clean, cargo doc compiles, license headers OK
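
The `replaced_split_ids` propagation that the uploader tests cover can be pictured with the sketch below. The types are hypothetical, simplified stand-ins (`SplitBatch` for `ParquetSplitBatch`, `PublishRequest` for whatever the uploader hands to the publish step), and the field names other than `replaced_split_ids` are assumptions. The point is only that the batch's list is forwarded rather than replaced with an empty `Vec::new()`.

```rust
// Hypothetical, simplified stand-in for ParquetSplitBatch.
#[derive(Debug, Clone)]
pub struct SplitBatch {
    pub new_split_ids: Vec<String>,
    // Empty for plain ingest; set by a merge to name the splits it supersedes.
    pub replaced_split_ids: Vec<String>,
}

// Hypothetical request built by the uploader for the publish step.
#[derive(Debug, Clone, PartialEq)]
pub struct PublishRequest {
    pub staged_split_ids: Vec<String>,
    pub replaced_split_ids: Vec<String>,
}

// Forward the batch's replaced_split_ids instead of hardcoding an empty list,
// so new splits can be published and old ones retired in the same request.
pub fn build_publish_request(batch: &SplitBatch) -> PublishRequest {
    PublishRequest {
        staged_split_ids: batch.new_split_ids.clone(),
        replaced_split_ids: batch.replaced_split_ids.clone(),
    }
}
```

An ingest batch keeps working unchanged because its list is simply empty, which would be consistent with the existing uploader tests passing with the new field.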

🤖 Generated with Claude Code

@g-talbot g-talbot changed the title feat: Phase 3a — merge metadata aggregation, message types, replaced_split_ids feat: Phase 3 — Parquet merge pipeline integration (3a–3c) Apr 29, 2026
@g-talbot g-talbot changed the title feat: Phase 3 — Parquet merge pipeline integration (3a–3c) feat: Phase 3 — Parquet merge pipeline integration (3a–3e) Apr 29, 2026
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline branch from 3227b37 to ceba410 Compare April 29, 2026 12:59
@g-talbot g-talbot changed the title feat: Phase 3 — Parquet merge pipeline integration (3a–3e) feat: Phase 3a–3c — merge metadata, planner, downloader, executor Apr 29, 2026
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline branch from ceba410 to e96a920 Compare April 29, 2026 14:06
@g-talbot g-talbot changed the title feat: Phase 3a–3c — merge metadata, planner, downloader, executor feat: Phase 3a — merge metadata aggregation, message types, replaced_split_ids Apr 29, 2026
@g-talbot g-talbot changed the base branch from main to gtt/parquet-merge-policy April 29, 2026 14:14
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline branch from e96a920 to 9926093 Compare April 29, 2026 15:29
@g-talbot g-talbot changed the base branch from gtt/parquet-merge-policy to gtt/merge-output-split-metadata April 29, 2026 15:29
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline branch 3 times, most recently from acc5099 to 49176b0 Compare April 29, 2026 19:03
@g-talbot g-talbot force-pushed the gtt/merge-output-split-metadata branch from fc6f90a to 720560d Compare April 29, 2026 20:53
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline branch from 49176b0 to 17135dc Compare April 29, 2026 20:54
@g-talbot g-talbot force-pushed the gtt/merge-output-split-metadata branch from 720560d to 3e61af4 Compare April 30, 2026 02:29
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline branch from 17135dc to 5a5ee74 Compare April 30, 2026 02:30
@g-talbot g-talbot force-pushed the gtt/merge-output-split-metadata branch from 3e61af4 to f90a265 Compare April 30, 2026 13:42
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline branch from 5a5ee74 to bde30af Compare April 30, 2026 13:42
@g-talbot g-talbot force-pushed the gtt/merge-output-split-metadata branch from f90a265 to 5c7476e Compare April 30, 2026 13:49
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline branch from bde30af to 6e709a0 Compare April 30, 2026 13:49
@g-talbot g-talbot force-pushed the gtt/merge-output-split-metadata branch from 5c7476e to e87a598 Compare April 30, 2026 14:47
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline branch from 6e709a0 to 6190bb8 Compare April 30, 2026 14:47
@g-talbot g-talbot force-pushed the gtt/merge-output-split-metadata branch from e87a598 to 2474b24 Compare April 30, 2026 16:40
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline branch from 6190bb8 to 3017111 Compare April 30, 2026 16:41
@g-talbot g-talbot force-pushed the gtt/merge-output-split-metadata branch from 2474b24 to f374ad0 Compare April 30, 2026 16:52
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline branch from 3017111 to 1df59b4 Compare April 30, 2026 16:52
@g-talbot g-talbot force-pushed the gtt/merge-output-split-metadata branch from f374ad0 to c639cf6 Compare April 30, 2026 20:24
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline branch 2 times, most recently from 798a93a to e5afaca Compare April 30, 2026 20:40
@g-talbot
Contributor Author

@codex review

Base automatically changed from gtt/merge-output-split-metadata to main April 30, 2026 20:43

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e5afacaa10

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread quickwit/quickwit-parquet-engine/src/merge/metadata_aggregation.rs
Contributor

@mattmkim mattmkim left a comment

LGTM. It seems like we have an unnecessary diff in the *.json files?

Comment thread quickwit/quickwit-parquet-engine/src/merge/metadata_aggregation.rs
g-talbot and others added 2 commits May 1, 2026 12:07
…it_ids (Phase 3a)

Phase 3 pipeline integration, first PR:

- merge_parquet_split_metadata(): aggregates input split metadata with
  MergeOutputFile physical metadata to produce complete ParquetSplitMetadata
  for merged output. Validates invariant fields, unions metric_names and tags,
  finalizes tag cardinality after merge. 17 tests.

- ParquetNewSplits, ParquetMergeTask, ParquetMergeScratch message types for
  the merge actor chain (planner → scheduler → downloader → executor).

- Add replaced_split_ids to ParquetSplitBatch and propagate through
  ParquetUploader (was hardcoded Vec::new()). Enables merge executor to
  specify which splits are being replaced.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…it ID

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline branch from e5afaca to fe969e2 Compare May 1, 2026 16:08
@g-talbot g-talbot enabled auto-merge (squash) May 1, 2026 16:12
@g-talbot g-talbot merged commit 470aed0 into main May 1, 2026
9 checks passed
@g-talbot g-talbot deleted the gtt/parquet-merge-pipeline branch May 1, 2026 16:19
2 participants