
feat: Phase 3c — ParquetMergeExecutor + full downloader #6358

Merged
g-talbot merged 4 commits into gtt/parquet-merge-pipeline-3b from gtt/parquet-merge-pipeline-3c
May 1, 2026

Conversation

@g-talbot
Contributor

Summary

Stacked on #6357 (Phase 3b).

  • ParquetMergeSplitDownloader: replaces the stub with real download logic. It downloads each input split's Parquet file from object storage to a local temp directory via storage.copy_to_file(), then forwards ParquetMergeScratch to the executor.
  • ParquetMergeExecutor: runs merge_sorted_parquet_files via run_cpu_intensive, builds the output ParquetSplitMetadata via merge_parquet_split_metadata, renames output files to match the generated split IDs, and sends a ParquetSplitBatch with replaced_split_ids to the uploader.
  • checkpoint_delta_opt: ParquetSplitBatch.checkpoint_delta is renamed to checkpoint_delta_opt and changed to Option<IndexCheckpointDelta> to support merge operations (a merge only reorganizes existing data, so it has no checkpoint delta). The ingest path passes Some(delta); the merge path passes None.
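
A minimal sketch of the `checkpoint_delta_opt` change described above. The struct layouts here are stand-ins, not the real Quickwit types: the fields of `IndexCheckpointDelta`, the `new_split_ids` field, and the helpers `ingest_batch`, `merge_batch`, and `describe` are all hypothetical and exist only to show how the two paths differ.

```rust
/// Stand-in for the real `IndexCheckpointDelta`; these fields are hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub struct IndexCheckpointDelta {
    pub source_id: String,
    pub from_position: u64,
    pub to_position: u64,
}

/// Simplified stand-in for `ParquetSplitBatch`. Only `replaced_split_ids` and
/// `checkpoint_delta_opt` are named in the PR; `new_split_ids` is assumed.
#[derive(Debug)]
pub struct ParquetSplitBatch {
    pub new_split_ids: Vec<String>,
    pub replaced_split_ids: Vec<String>,
    pub checkpoint_delta_opt: Option<IndexCheckpointDelta>,
}

/// Ingest path: new data was consumed, so the checkpoint advances.
pub fn ingest_batch(split_id: &str, delta: IndexCheckpointDelta) -> ParquetSplitBatch {
    ParquetSplitBatch {
        new_split_ids: vec![split_id.to_string()],
        replaced_split_ids: Vec::new(),
        checkpoint_delta_opt: Some(delta),
    }
}

/// Merge path: existing data is reorganized, so there is no checkpoint delta.
pub fn merge_batch(merged_split_id: &str, input_split_ids: &[&str]) -> ParquetSplitBatch {
    ParquetSplitBatch {
        new_split_ids: vec![merged_split_id.to_string()],
        replaced_split_ids: input_split_ids.iter().map(|id| id.to_string()).collect(),
        checkpoint_delta_opt: None,
    }
}

/// A consumer has to handle both cases explicitly.
pub fn describe(batch: &ParquetSplitBatch) -> String {
    match &batch.checkpoint_delta_opt {
        Some(delta) => format!(
            "advance checkpoint {}..{}",
            delta.from_position, delta.to_position
        ),
        None => format!(
            "no checkpoint change, replaces {} splits",
            batch.replaced_split_ids.len()
        ),
    }
}
```

Making the field an `Option` forces every consumer to decide what a missing delta means, rather than having the merge path construct an empty delta.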

Test plan

  • 4 existing uploader tests pass (checkpoint_delta_opt change)
  • 4 existing packager tests pass
  • Compiles with and without metrics feature
  • cargo clippy clean

🤖 Generated with Claude Code

@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3b branch from 3b171a0 to 84c6dd3 Compare April 29, 2026 15:30
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3c branch from ceba410 to 5937440 Compare April 29, 2026 15:31
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3b branch from 84c6dd3 to a23011c Compare April 29, 2026 18:10
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3c branch from 5937440 to 0b1c9cc Compare April 29, 2026 18:10
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3b branch from a23011c to 86fd55a Compare April 29, 2026 18:16
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3c branch 2 times, most recently from 66b97e0 to 0f051bc Compare April 29, 2026 18:24
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3b branch from 86fd55a to ef391c9 Compare April 29, 2026 18:39
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3c branch from 0f051bc to 16b46d7 Compare April 29, 2026 18:40
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3b branch from ef391c9 to 90c5589 Compare April 29, 2026 18:51
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3c branch from 16b46d7 to 8e19b6b Compare April 29, 2026 18:51
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3b branch from 90c5589 to 49c6c19 Compare April 29, 2026 19:04
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3c branch from 8e19b6b to f32bd64 Compare April 29, 2026 19:05
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3b branch from 49c6c19 to 93a0a20 Compare April 29, 2026 20:54
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3c branch from f32bd64 to de17c0e Compare April 29, 2026 20:54
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3b branch from 93a0a20 to 3a91a31 Compare April 30, 2026 02:30
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3c branch from de17c0e to 1f6512e Compare April 30, 2026 02:30
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3b branch from 3a91a31 to 956a6e5 Compare April 30, 2026 13:42
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3c branch from 1f6512e to f2c4a8a Compare April 30, 2026 13:42
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3b branch from 956a6e5 to 4bde5b8 Compare April 30, 2026 13:49
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3c branch from f2c4a8a to 467f6fb Compare April 30, 2026 13:49
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3b branch from 4bde5b8 to 5556bb2 Compare April 30, 2026 14:47
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3c branch from 467f6fb to b6ca9bf Compare April 30, 2026 14:47
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3b branch from 5556bb2 to f8f1528 Compare April 30, 2026 16:41
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3c branch from b6ca9bf to be60cf6 Compare April 30, 2026 16:41
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3b branch from f8f1528 to 97295a6 Compare April 30, 2026 16:52
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3c branch from be60cf6 to bb6cd8b Compare April 30, 2026 16:52
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3b branch from 97295a6 to 40e2b25 Compare April 30, 2026 20:25
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3c branch from bb6cd8b to f34bd04 Compare April 30, 2026 20:25
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3b branch from 40e2b25 to dee01ee Compare April 30, 2026 20:41
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3c branch 2 times, most recently from 5c68071 to daacf89 Compare May 1, 2026 14:56
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3b branch from fc4ce2e to 7c25b1e Compare May 1, 2026 16:08
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3c branch from daacf89 to 2d82e7e Compare May 1, 2026 16:09
@g-talbot
Contributor Author

g-talbot commented May 1, 2026

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2d82e7e1db

Comment thread quickwit/quickwit-indexing/src/actors/metrics_pipeline/mod.rs
Comment thread quickwit/quickwit-indexing/src/actors/metrics_pipeline/parquet_merge_executor.rs Outdated
Contributor

@mattmkim mattmkim left a comment

LGTM to unblock, but a few comments

Comment thread quickwit/quickwit-indexing/src/actors/metrics_pipeline/parquet_merge_executor.rs Outdated
Comment thread quickwit/quickwit-indexing/src/actors/metrics_pipeline/parquet_merge_executor.rs Outdated
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3b branch from 7c25b1e to 1f6047f Compare May 1, 2026 17:52
g-talbot and others added 4 commits May 1, 2026 13:52
…ase 3c)

Phase 3 pipeline integration, third PR:

- ParquetMergeSplitDownloader: downloads each input split's Parquet file
  from object storage to a local temp directory, forwards ParquetMergeScratch
  to the executor. Replaces the stub from PR 3b.

- ParquetMergeExecutor: runs merge_sorted_parquet_files via run_cpu_intensive,
  builds output ParquetSplitMetadata via merge_parquet_split_metadata, renames
  output files to match generated split IDs, sends ParquetSplitBatch with
  replaced_split_ids to the uploader.

- ParquetSplitBatch.checkpoint_delta -> checkpoint_delta_opt: now Option to
  support merge operations (no checkpoint delta for data reorganization).
  Ingest path passes Some(delta), merge path passes None.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
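
For readers unfamiliar with merging sorted inputs, the general idea behind a function like merge_sorted_parquet_files can be sketched as a k-way merge. This is an illustration only: the PR does not show the real function's internals, and the sketch works on in-memory `Vec<i64>` runs rather than Parquet files. The name `merge_sorted_runs` is hypothetical.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merges already-sorted runs into one sorted vector using a min-heap.
/// Each heap entry is (value, run index, position within that run).
pub fn merge_sorted_runs(runs: &[Vec<i64>]) -> Vec<i64> {
    let mut heap = BinaryHeap::new();
    // Seed the heap with the first element of every non-empty run.
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&first) = run.first() {
            heap.push(Reverse((first, run_idx, 0usize)));
        }
    }
    let mut merged = Vec::with_capacity(runs.iter().map(Vec::len).sum());
    // Repeatedly take the smallest head and refill from the same run.
    while let Some(Reverse((value, run_idx, pos))) = heap.pop() {
        merged.push(value);
        if let Some(&next) = runs[run_idx].get(pos + 1) {
            heap.push(Reverse((next, run_idx, pos + 1)));
        }
    }
    merged
}
```

Work of this kind is CPU-bound, which is consistent with the PR running the merge via run_cpu_intensive instead of on the async executor.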
When all input splits are empty, still send a batch with
replaced_split_ids so the metastore marks them for deletion.
Without this, empty splits drained by the planner stay Published
forever and are never cleaned up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
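
The empty-input fix above can be sketched as follows. `MergeOutcome` and `finish_merge` are hypothetical names, and row counts are passed in directly as an assumption; the point is only that `replaced_split_ids` is populated even when no output split is produced.

```rust
/// Hypothetical, simplified result of a merge, as sent to the uploader.
#[derive(Debug, PartialEq)]
pub struct MergeOutcome {
    pub new_split_ids: Vec<String>,
    pub replaced_split_ids: Vec<String>,
}

/// Builds the outcome from (split ID, row count) pairs for the inputs.
/// The inputs are always reported as replaced, even if all are empty.
pub fn finish_merge(merged_split_id: &str, inputs: &[(&str, u64)]) -> MergeOutcome {
    let replaced_split_ids: Vec<String> =
        inputs.iter().map(|(id, _)| id.to_string()).collect();
    let total_rows: u64 = inputs.iter().map(|(_, rows)| rows).sum();
    // No rows means there is nothing to write, so no new split is created.
    let new_split_ids = if total_rows == 0 {
        Vec::new()
    } else {
        vec![merged_split_id.to_string()]
    };
    MergeOutcome { new_split_ids, replaced_split_ids }
}
```

Returning early with no batch at all in the all-empty case would leave the input splits Published, which is the leak this commit addresses.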
…ader

Matches the Tantivy merge split downloader pattern: protect_zone()
prevents the download I/O from triggering heartbeat timeouts, and the
kill switch check allows early abort during pipeline shutdown.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
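
A rough sketch of the pattern this commit describes, using std-only stand-ins. `KillSwitch`, `Progress`, `ProtectedZoneGuard`, and `download_all` below are simplified, hypothetical versions written for this example; they are not the quickwit-actors types, and the real downloader is async.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;

/// Stand-in kill switch: a shared flag set during pipeline shutdown.
#[derive(Clone, Default)]
pub struct KillSwitch(Arc<AtomicBool>);

impl KillSwitch {
    pub fn kill(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    pub fn is_dead(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// Stand-in progress tracker: while at least one guard is alive, a
/// supervisor would not treat the actor as stalled.
#[derive(Clone, Default)]
pub struct Progress {
    protected: Arc<AtomicUsize>,
}

pub struct ProtectedZoneGuard(Arc<AtomicUsize>);

impl Progress {
    pub fn protect_zone(&self) -> ProtectedZoneGuard {
        self.protected.fetch_add(1, Ordering::SeqCst);
        ProtectedZoneGuard(Arc::clone(&self.protected))
    }
    pub fn is_protected(&self) -> bool {
        self.protected.load(Ordering::SeqCst) > 0
    }
}

impl Drop for ProtectedZoneGuard {
    fn drop(&mut self) {
        self.0.fetch_sub(1, Ordering::SeqCst);
    }
}

/// Downloads every split inside a protected zone, checking the kill switch
/// before each download so shutdown can abort early.
pub fn download_all<F>(
    split_ids: &[&str],
    kill_switch: &KillSwitch,
    progress: &Progress,
    mut download_one: F,
) -> Result<usize, String>
where
    F: FnMut(&str) -> Result<(), String>,
{
    let _guard = progress.protect_zone(); // released when the function returns
    let mut downloaded = 0;
    for &split_id in split_ids {
        if kill_switch.is_dead() {
            return Err(format!("kill switch activated before downloading {split_id}"));
        }
        download_one(split_id)?;
        downloaded += 1;
    }
    Ok(downloaded)
}
```

Because the guard is dropped on every return path, including the early abort, the protected zone cannot outlive the download loop.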
Two fixes:

1. The merge permit was dropped when the executor finished, releasing
   the semaphore before upload completed. Now it flows through
   ParquetSplitBatch → ParquetSplitsUpdate → publisher, matching
   the Tantivy pipeline.

2. Use the merge operation's pre-assigned split ID instead of
   generating a second one in merge_parquet_split_metadata(). Keeps
   the ID consistent across scheduling, tracing, and publishing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
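
The first fix is an ownership question: the permit must travel with the data until publish. A minimal sketch, assuming a counter in place of the real semaphore and a `merge_permit_opt` field whose name is a guess; `MergePermit`, `upload`, and `publish` are hypothetical, and the structs are stripped down to the fields needed here.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Stand-in permit: holds one slot in a shared counter until dropped.
pub struct MergePermit {
    in_flight: Arc<AtomicUsize>,
}

impl MergePermit {
    pub fn acquire(in_flight: &Arc<AtomicUsize>) -> Self {
        in_flight.fetch_add(1, Ordering::SeqCst);
        MergePermit { in_flight: Arc::clone(in_flight) }
    }
}

impl Drop for MergePermit {
    fn drop(&mut self) {
        self.in_flight.fetch_sub(1, Ordering::SeqCst);
    }
}

/// Executor output (simplified); the permit rides along with the batch.
pub struct ParquetSplitBatch {
    pub split_id: String,
    pub merge_permit_opt: Option<MergePermit>,
}

/// Uploader output (simplified); the permit is moved, not released.
pub struct ParquetSplitsUpdate {
    pub split_id: String,
    pub merge_permit_opt: Option<MergePermit>,
}

pub fn upload(batch: ParquetSplitBatch) -> ParquetSplitsUpdate {
    ParquetSplitsUpdate {
        split_id: batch.split_id,
        merge_permit_opt: batch.merge_permit_opt,
    }
}

/// The publisher is the last owner, so the slot is freed only here.
pub fn publish(update: ParquetSplitsUpdate) -> String {
    let ParquetSplitsUpdate { split_id, merge_permit_opt } = update;
    // ... the metastore publish would happen before the permit is released ...
    drop(merge_permit_opt);
    split_id
}
```

If the executor dropped the permit itself, the counter would return to zero before `upload` ran, letting a new merge start while the previous one was still uploading.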
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3c branch from ac2ae00 to 706e1dd Compare May 1, 2026 17:52
@g-talbot g-talbot merged commit ee8b706 into gtt/parquet-merge-pipeline-3b May 1, 2026
2 checks passed
@g-talbot g-talbot deleted the gtt/parquet-merge-pipeline-3c branch May 1, 2026 17:56
2 participants