feat: compute per-output split metadata in merge engine by g-talbot · Pull Request #6359 · quickwit-oss/quickwit

g-talbot · 2026-04-29T15:25:36Z

Summary

Stacked on #6351 (Phase 2). Should be merged before #6352 (Phase 3a).

The merge engine now extracts metric_names, time_range, and low_cardinality_tags (service) from each output file's actual rows during the merge write pass.

Problem: MergeOutputFile previously only had physical metadata (num_rows, size_bytes, row_keys, zonemaps). The downstream metadata_aggregation function inferred logical metadata by unioning all input splits. This is incorrect when num_outputs > 1 — each output contains a different subset of the globally sorted rows and should have metadata reflecting only its own data.

Fix: Each MergeOutputFile now carries per-output logical metadata extracted from the sorted_batch before writing. Reuses extract_metric_names, extract_service_names, extract_time_range from split_writer (made pub(crate)).
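To make the problem concrete, here is a minimal sketch of computing logical metadata per output rather than unioning it across inputs. It is not the PR's code: the real engine operates on Arrow batches and reuses `extract_metric_names`, `extract_service_names`, and `extract_time_range`, whose signatures are not shown here. The `Row`, `OutputMetadata`, `metadata_for_rows`, and `per_output_metadata` names are hypothetical, and the even chunking of rows across outputs is an assumption made only for illustration.

```rust
use std::collections::BTreeSet;

/// Hypothetical simplified row; the real engine works on Arrow batches.
#[derive(Debug, Clone)]
pub struct Row {
    pub metric_name: String,
    pub service: String,
    pub timestamp_secs: u64,
}

/// Hypothetical per-output logical metadata, mirroring the fields named in the PR.
#[derive(Debug, PartialEq, Eq)]
pub struct OutputMetadata {
    pub metric_names: BTreeSet<String>,
    pub service_names: BTreeSet<String>,
    /// (min, max) of timestamp_secs over the rows; None for an empty output.
    pub time_range: Option<(u64, u64)>,
}

/// Derives metadata from exactly the rows that will be written to one output.
pub fn metadata_for_rows(rows: &[Row]) -> OutputMetadata {
    let metric_names = rows.iter().map(|r| r.metric_name.clone()).collect();
    let service_names = rows.iter().map(|r| r.service.clone()).collect();
    let min = rows.iter().map(|r| r.timestamp_secs).min();
    let max = rows.iter().map(|r| r.timestamp_secs).max();
    OutputMetadata {
        metric_names,
        service_names,
        time_range: min.zip(max),
    }
}

/// Splits globally sorted rows into contiguous chunks (assumption: roughly
/// equal row counts per output) and computes metadata for each chunk.
pub fn per_output_metadata(sorted_rows: &[Row], num_outputs: usize) -> Vec<OutputMetadata> {
    assert!(num_outputs > 0, "num_outputs must be positive");
    // Ceiling division; clamp to 1 so `chunks` never receives a zero size.
    let chunk_len = ((sorted_rows.len() + num_outputs - 1) / num_outputs).max(1);
    sorted_rows.chunks(chunk_len).map(metadata_for_rows).collect()
}
```

With four rows sorted by metric name (two `cpu`, two `mem`) and `num_outputs = 2`, the first output reports only `cpu` and the second only `mem`, whereas a union over the inputs would have assigned both names to both outputs.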

Test plan

New test test_merge_per_output_metadata_from_actual_rows — verifies 2-output merge has correct per-output metric_names and time_range
Updated test_merge_multiple_outputs with per-output metadata assertions
All 66 merge tests pass (including proptests)
cargo clippy clean

🤖 Generated with Claude Code

Adds `parquet_merge_policy` section to `IndexingSettings`, making the Parquet merge policy configurable per-index via YAML.

Parameters:
- merge_factor (default 10): min splits to trigger a merge
- max_merge_factor (default 12): max splits per merge
- max_merge_ops (default 4): bounds write amplification
- target_split_size_bytes (default 256 MiB): target output size
- maturation_period (default 48h): split maturity timeout
- max_finalize_merge_operations (default 3): cold-window shutdown limit

Mirrors the existing merge_policy config pattern for logs/traces. Updates index-config.md documentation with the new section.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
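The defaults listed in this commit message can be summarized as a plain Rust struct. This is only a sketch: the type name `ParquetMergePolicySketch`, the field types, and the `validate` check are assumptions for illustration, not the definition of `ParquetMergePolicyConfig` in the repository. Only the parameter names and default values come from the commit message.

```rust
use std::time::Duration;

/// Hypothetical mirror of the `parquet_merge_policy` parameters.
/// Field types are assumptions; names and defaults follow the commit message.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ParquetMergePolicySketch {
    /// Minimum number of splits that triggers a merge.
    pub merge_factor: usize,
    /// Maximum number of splits per merge.
    pub max_merge_factor: usize,
    /// Bounds write amplification.
    pub max_merge_ops: usize,
    /// Target output size in bytes.
    pub target_split_size_bytes: u64,
    /// Split maturity timeout.
    pub maturation_period: Duration,
    /// Cold-window shutdown limit.
    pub max_finalize_merge_operations: usize,
}

impl Default for ParquetMergePolicySketch {
    fn default() -> Self {
        Self {
            merge_factor: 10,
            max_merge_factor: 12,
            max_merge_ops: 4,
            target_split_size_bytes: 256 * 1024 * 1024, // 256 MiB
            maturation_period: Duration::from_secs(48 * 3600), // 48h
            max_finalize_merge_operations: 3,
        }
    }
}

impl ParquetMergePolicySketch {
    /// Assumed sanity check (not confirmed by the PR): the trigger threshold
    /// should not exceed the per-merge cap.
    pub fn validate(&self) -> Result<(), String> {
        if self.merge_factor > self.max_merge_factor {
            return Err(format!(
                "merge_factor ({}) must not exceed max_merge_factor ({})",
                self.merge_factor, self.max_merge_factor
            ));
        }
        Ok(())
    }
}
```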

…secs

Adds `parquet_indexing` section to `IndexingSettings` for per-index Parquet pipeline configuration:
- `sort_fields`: sort schema override (Husky-style pipe-delimited syntax with /V2 suffix). Controls row ordering, query pruning, compression locality, and compaction scope. When omitted, uses the product-type default.
- `window_duration_secs`: time window for split partitioning (default 900s / 15 min). Must divide 3600.

Updates docs/configuration/index-config.md with:
- "Parquet indexing settings" section explaining both parameters
- Full sort schema syntax reference (column types, direction overrides, & LSM cutoff marker)
- Examples showing minimal, custom, and advanced configurations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
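The "must divide 3600" constraint on `window_duration_secs` can be sketched as a small validation function. The function names below are hypothetical, and `window_start` assumes windows are aligned to the Unix epoch, which the commit message does not state.

```rust
/// Default window from the commit message: 900 s (15 min).
pub const DEFAULT_WINDOW_DURATION_SECS: u64 = 900;

/// Accepts only positive durations that evenly divide one hour (3600 s).
pub fn validate_window_duration_secs(secs: u64) -> Result<u64, String> {
    if secs == 0 {
        return Err("window_duration_secs must be positive".to_string());
    }
    if 3600 % secs != 0 {
        return Err(format!("window_duration_secs ({secs}) must divide 3600"));
    }
    Ok(secs)
}

/// Start of the window containing `timestamp_secs`.
/// Assumption: windows are aligned to the Unix epoch.
/// `window_duration_secs` must already be validated (non-zero).
pub fn window_start(timestamp_secs: u64, window_duration_secs: u64) -> u64 {
    timestamp_secs - timestamp_secs % window_duration_secs
}
```

Because every accepted duration divides 3600, window boundaries under this alignment always coincide with hour boundaries; a value such as 700 is rejected since 3600 is not a multiple of it.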

Adding ParquetMergePolicyConfig and ParquetIndexingConfig to IndexingSettings changes the Hash output, which changes the pipeline params fingerprints. Updated the hardcoded test constants. Added a comment explaining how to recompute them when IndexingSettings fields change. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
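Why new fields change a fingerprint can be shown with the standard library alone. In this sketch the struct and field names are hypothetical, and `DefaultHasher` stands in for whatever hasher the pipeline fingerprint actually uses: a derived `Hash` implementation feeds every field to the hasher, so adding a field changes the hashed byte stream even when the existing values are unchanged.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical settings struct before the change.
#[derive(Hash)]
pub struct SettingsBefore {
    pub existing_setting: u64,
}

/// Same struct after a new field is added.
#[derive(Hash)]
pub struct SettingsAfter {
    pub existing_setting: u64,
    pub new_setting: u64,
}

/// Fingerprint helper; `DefaultHasher::new()` uses fixed keys, so results
/// are repeatable within a given Rust release.
pub fn fingerprint<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}
```

Comparing `fingerprint(&SettingsBefore { existing_setting: 10 })` with `fingerprint(&SettingsAfter { existing_setting: 10, new_setting: 900 })` yields different values, which is why hardcoded expected fingerprints in tests must be recomputed whenever the hashed struct gains fields.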

The merge engine now extracts metric_names, time_range, and low_cardinality_tags from each output file's actual rows during the merge write pass.

Previously, MergeOutputFile only contained physical metadata (num_rows, size_bytes, row_keys, zonemaps). The downstream metadata_aggregation function inferred logical metadata by unioning all input splits, which is incorrect when num_outputs > 1, since each output contains only a subset of the globally sorted rows.

Now each MergeOutputFile carries:
- metric_names: distinct metrics in this output's rows
- time_range: min/max timestamp_secs from this output's rows
- low_cardinality_tags: service names from this output's rows

Reuses existing extract_metric_names, extract_service_names, and extract_time_range from split_writer (made pub(crate)).

Includes test that verifies per-output metadata is computed from actual rows when merging 2 inputs into 2 outputs with different metric names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

g-talbot force-pushed the gtt/merge-output-split-metadata branch from fc6f90a to 720560d Compare April 29, 2026 20:53

g-talbot changed the base branch from gtt/parquet-merge-policy to gtt/parquet-merge-policy-config April 29, 2026 20:53

g-talbot force-pushed the gtt/merge-output-split-metadata branch from 720560d to 3e61af4 Compare April 30, 2026 02:29

g-talbot and others added 2 commits April 30, 2026 09:40

g-talbot force-pushed the gtt/parquet-merge-policy-config branch from bd23d3b to a28e4a6 Compare April 30, 2026 13:42

g-talbot force-pushed the gtt/merge-output-split-metadata branch from 3e61af4 to f90a265 Compare April 30, 2026 13:42

g-talbot force-pushed the gtt/parquet-merge-policy-config branch from a28e4a6 to 4fae305 Compare April 30, 2026 13:49

g-talbot force-pushed the gtt/merge-output-split-metadata branch from f90a265 to 5c7476e Compare April 30, 2026 13:49

mattmkim reviewed Apr 30, 2026

Comment thread quickwit/quickwit-parquet-engine/src/merge/writer.rs

g-talbot force-pushed the gtt/parquet-merge-policy-config branch 2 times, most recently from ebab487 to 8656c44 Compare April 30, 2026 14:32

g-talbot force-pushed the gtt/merge-output-split-metadata branch 3 times, most recently from 2474b24 to f374ad0 Compare April 30, 2026 16:52

g-talbot requested a review from mattmkim April 30, 2026 16:55

mattmkim approved these changes Apr 30, 2026

g-talbot force-pushed the gtt/parquet-merge-policy-config branch 4 times, most recently from 74b1ce1 to fc03d3d Compare April 30, 2026 19:58

g-talbot force-pushed the gtt/parquet-merge-policy-config branch from fc03d3d to 58d07d0 Compare April 30, 2026 20:04

g-talbot and others added 2 commits April 30, 2026 16:24

fix: nightly rustfmt import ordering

c639cf6

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

g-talbot force-pushed the gtt/merge-output-split-metadata branch from f374ad0 to c639cf6 Compare April 30, 2026 20:24

Base automatically changed from gtt/parquet-merge-policy-config to main April 30, 2026 20:27

Merge branch 'main' into gtt/merge-output-split-metadata

fbd9628

g-talbot enabled auto-merge (squash) April 30, 2026 20:29

g-talbot merged commit 4cafb74 into main Apr 30, 2026
9 checks passed

g-talbot deleted the gtt/merge-output-split-metadata branch April 30, 2026 20:43

g-talbot merged 6 commits into main from gtt/merge-output-split-metadata
