feat: compute per-output split metadata in merge engine #6359

Merged
g-talbot merged 6 commits into main from gtt/merge-output-split-metadata
Apr 30, 2026
Conversation

@g-talbot
Contributor

Summary

Stacked on #6351 (Phase 2). Should be merged before #6352 (Phase 3a).

The merge engine now extracts metric_names, time_range, and low_cardinality_tags (service) from each output file's actual rows during the merge write pass.

Problem: MergeOutputFile previously carried only physical metadata (num_rows, size_bytes, row_keys, zonemaps). The downstream metadata_aggregation function inferred logical metadata by unioning all input splits. That is incorrect when num_outputs > 1: each output contains a different subset of the globally sorted rows, so its metadata should reflect only its own data.

Fix: Each MergeOutputFile now carries per-output logical metadata extracted from the sorted_batch before writing. Reuses extract_metric_names, extract_service_names, extract_time_range from split_writer (made pub(crate)).
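
To illustrate why per-output extraction matters, here is a minimal, standard-library-only sketch. It is not the engine's code: `Row`, `OutputMetadata`, `metadata_from_rows`, and `merge_with_per_output_metadata` are hypothetical names, rows are plain structs rather than Arrow batches, and the sort key (metric name, then timestamp) and the equal-size chunking are assumptions made for the example.

```rust
use std::collections::BTreeSet;

/// Hypothetical row shape; the real engine operates on sorted record batches.
#[derive(Clone, Debug)]
struct Row {
    metric_name: String,
    service: String,
    timestamp_secs: u64,
}

/// Hypothetical stand-in for the logical metadata attached to one output file.
#[derive(Debug, PartialEq)]
struct OutputMetadata {
    metric_names: BTreeSet<String>,
    services: BTreeSet<String>,
    /// Inclusive (min, max) timestamp; `None` when the output has no rows.
    time_range: Option<(u64, u64)>,
}

/// Derives metadata only from the rows that actually land in one output.
fn metadata_from_rows(rows: &[Row]) -> OutputMetadata {
    let mut metric_names = BTreeSet::new();
    let mut services = BTreeSet::new();
    let mut time_range: Option<(u64, u64)> = None;
    for row in rows {
        metric_names.insert(row.metric_name.clone());
        services.insert(row.service.clone());
        time_range = Some(match time_range {
            None => (row.timestamp_secs, row.timestamp_secs),
            Some((lo, hi)) => (lo.min(row.timestamp_secs), hi.max(row.timestamp_secs)),
        });
    }
    OutputMetadata { metric_names, services, time_range }
}

/// Merges all inputs, sorts them globally (assumed key: metric name, then
/// timestamp), cuts the result into at most `num_outputs` contiguous chunks,
/// and computes metadata per chunk instead of unioning the inputs.
fn merge_with_per_output_metadata(
    inputs: &[Vec<Row>],
    num_outputs: usize,
) -> Vec<(Vec<Row>, OutputMetadata)> {
    let mut all: Vec<Row> = inputs.iter().flatten().cloned().collect();
    if all.is_empty() || num_outputs == 0 {
        return Vec::new();
    }
    all.sort_by(|a, b| {
        a.metric_name
            .cmp(&b.metric_name)
            .then(a.timestamp_secs.cmp(&b.timestamp_secs))
    });
    // Ceiling division so that no more than `num_outputs` chunks are produced.
    let chunk_len = (all.len() + num_outputs - 1) / num_outputs;
    all.chunks(chunk_len)
        .map(|chunk| (chunk.to_vec(), metadata_from_rows(chunk)))
        .collect()
}
```

With two inputs that each contain both a `cpu` and a `mem` row, a 2-output merge under these assumptions yields one output holding only `cpu` rows and one holding only `mem` rows. Unioning the inputs would have reported both metric names and the full time range for each output.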

Test plan

  • New test test_merge_per_output_metadata_from_actual_rows — verifies that a 2-output merge has correct per-output metric_names and time_range
  • Updated test_merge_multiple_outputs with per-output metadata assertions
  • All 66 merge tests pass (including proptests)
  • cargo clippy clean

🤖 Generated with Claude Code

@g-talbot g-talbot force-pushed the gtt/merge-output-split-metadata branch from fc6f90a to 720560d Compare April 29, 2026 20:53
@g-talbot g-talbot changed the base branch from gtt/parquet-merge-policy to gtt/parquet-merge-policy-config April 29, 2026 20:53
@g-talbot g-talbot force-pushed the gtt/merge-output-split-metadata branch from 720560d to 3e61af4 Compare April 30, 2026 02:29
g-talbot and others added 2 commits April 30, 2026 09:40
Adds `parquet_merge_policy` section to `IndexingSettings`, making the
Parquet merge policy configurable per-index via YAML. Parameters:

- merge_factor (default 10): min splits to trigger a merge
- max_merge_factor (default 12): max splits per merge
- max_merge_ops (default 4): bounds write amplification
- target_split_size_bytes (default 256 MiB): target output size
- maturation_period (default 48h): split maturity timeout
- max_finalize_merge_operations (default 3): cold-window shutdown limit

Mirrors the existing merge_policy config pattern for logs/traces.
Updates index-config.md documentation with the new section.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
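
As a rough illustration of the defaults listed in this commit message, the sketch below mirrors them in a plain Rust struct. `MergePolicySketch` and `splits_to_merge` are hypothetical names, the field types are assumptions, and the selection rule only encodes the two documented meanings ("min splits to trigger a merge", "max splits per merge"); the actual policy likely weighs more than split count.

```rust
use std::time::Duration;

/// Hypothetical mirror of the documented parameters; field types are assumed.
#[derive(Clone, Debug, PartialEq)]
struct MergePolicySketch {
    merge_factor: usize,
    max_merge_factor: usize,
    max_merge_ops: usize,
    target_split_size_bytes: u64,
    maturation_period: Duration,
    max_finalize_merge_operations: usize,
}

impl Default for MergePolicySketch {
    fn default() -> Self {
        Self {
            merge_factor: 10,
            max_merge_factor: 12,
            max_merge_ops: 4,
            target_split_size_bytes: 256 * 1024 * 1024, // 256 MiB
            maturation_period: Duration::from_secs(48 * 3600), // 48h
            max_finalize_merge_operations: 3,
        }
    }
}

impl MergePolicySketch {
    /// Returns how many splits one merge would take, or `None` when there
    /// are fewer candidates than `merge_factor`. Simplified: ignores split
    /// size, maturity, and `max_merge_ops`.
    fn splits_to_merge(&self, candidates: usize) -> Option<usize> {
        if candidates < self.merge_factor {
            None
        } else {
            Some(candidates.min(self.max_merge_factor))
        }
    }
}
```

Under these assumptions, 9 candidate splits trigger nothing, 10 are merged together, and 15 are capped at 12 per merge.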
…secs

Adds `parquet_indexing` section to `IndexingSettings` for per-index
Parquet pipeline configuration:

- `sort_fields`: sort schema override (Husky-style pipe-delimited
  syntax with /V2 suffix). Controls row ordering, query pruning,
  compression locality, and compaction scope. When omitted, uses
  the product-type default.
- `window_duration_secs`: time window for split partitioning
  (default 900s / 15 min). Must divide 3600.

Updates docs/configuration/index-config.md with:
- "Parquet indexing settings" section explaining both parameters
- Full sort schema syntax reference (column types, direction
  overrides, & LSM cutoff marker)
- Examples showing minimal, custom, and advanced configurations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
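
The "must divide 3600" constraint on `window_duration_secs` can be sketched as below. Both function names are hypothetical, and the epoch-aligned window computation is an assumption about how such windows are typically derived, not something this commit message states.

```rust
/// Accepts only positive durations that divide one hour (3600 s) evenly.
fn validate_window_duration_secs(secs: u64) -> Result<(), String> {
    if secs == 0 {
        return Err("window_duration_secs must be positive".to_string());
    }
    if 3600 % secs != 0 {
        return Err(format!("window_duration_secs ({secs}) must divide 3600"));
    }
    Ok(())
}

/// Start of the window containing `timestamp_secs`, assuming windows are
/// aligned to the Unix epoch. Panics if `window_duration_secs` is zero, so
/// validate the duration first.
fn window_start(timestamp_secs: u64, window_duration_secs: u64) -> u64 {
    timestamp_secs - timestamp_secs % window_duration_secs
}
```

With the default of 900 seconds, a timestamp of 3700 falls in the window starting at 3600. A divisor of 3600 guarantees that windows tile every hour exactly, which is presumably the reason for the constraint.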
@g-talbot g-talbot force-pushed the gtt/parquet-merge-policy-config branch from bd23d3b to a28e4a6 Compare April 30, 2026 13:42
@g-talbot g-talbot force-pushed the gtt/merge-output-split-metadata branch from 3e61af4 to f90a265 Compare April 30, 2026 13:42
@g-talbot g-talbot force-pushed the gtt/parquet-merge-policy-config branch from a28e4a6 to 4fae305 Compare April 30, 2026 13:49
@g-talbot g-talbot force-pushed the gtt/merge-output-split-metadata branch from f90a265 to 5c7476e Compare April 30, 2026 13:49
Comment thread quickwit/quickwit-parquet-engine/src/merge/writer.rs
@g-talbot g-talbot force-pushed the gtt/parquet-merge-policy-config branch 2 times, most recently from ebab487 to 8656c44 Compare April 30, 2026 14:32
@g-talbot g-talbot force-pushed the gtt/merge-output-split-metadata branch 3 times, most recently from 2474b24 to f374ad0 Compare April 30, 2026 16:52
@g-talbot g-talbot requested a review from mattmkim April 30, 2026 16:55
@g-talbot g-talbot force-pushed the gtt/parquet-merge-policy-config branch 4 times, most recently from 74b1ce1 to fc03d3d Compare April 30, 2026 19:58
Adding ParquetMergePolicyConfig and ParquetIndexingConfig to
IndexingSettings changes the Hash output, which changes the pipeline
params fingerprints. Updated the hardcoded test constants.

Added a comment explaining how to recompute them when IndexingSettings
fields change.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
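
The effect described in this commit message, where adding fields to a hashed struct shifts every fingerprint derived from it, can be reproduced with a small sketch. `SettingsBefore`, `SettingsAfter`, and `fingerprint` are hypothetical, and `DefaultHasher` is used only for illustration; the hasher behind the real pipeline fingerprints may differ.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical settings struct before the new section was added.
#[derive(Hash)]
struct SettingsBefore {
    commit_timeout_secs: u64,
}

/// The same struct after gaining a field: `#[derive(Hash)]` feeds every
/// field to the hasher in declaration order, so the output changes.
#[derive(Hash)]
struct SettingsAfter {
    commit_timeout_secs: u64,
    window_duration_secs: u64,
}

/// Hashes any `Hash` value into a 64-bit fingerprint.
fn fingerprint<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}
```

Hashing the same value twice gives the same fingerprint, while the extended struct hashes differently even though the shared field is unchanged. That is why hardcoded test constants need recomputing whenever hashed fields are added.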
@g-talbot g-talbot force-pushed the gtt/parquet-merge-policy-config branch from fc03d3d to 58d07d0 Compare April 30, 2026 20:04
g-talbot and others added 2 commits April 30, 2026 16:24
The merge engine now extracts metric_names, time_range, and
low_cardinality_tags from each output file's actual rows during the
merge write pass.

Previously, MergeOutputFile only contained physical metadata (num_rows,
size_bytes, row_keys, zonemaps). The downstream metadata_aggregation
function inferred logical metadata by unioning all input splits — which
is incorrect when num_outputs > 1, since each output contains only a
subset of the globally sorted rows.

Now each MergeOutputFile carries:
- metric_names: distinct metrics in this output's rows
- time_range: min/max timestamp_secs from this output's rows
- low_cardinality_tags: service names from this output's rows

Reuses existing extract_metric_names, extract_service_names, and
extract_time_range from split_writer (made pub(crate)).

Includes test that verifies per-output metadata is computed from actual
rows when merging 2 inputs into 2 outputs with different metric names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@g-talbot g-talbot force-pushed the gtt/merge-output-split-metadata branch from f374ad0 to c639cf6 Compare April 30, 2026 20:24
Base automatically changed from gtt/parquet-merge-policy-config to main April 30, 2026 20:27
@g-talbot g-talbot enabled auto-merge (squash) April 30, 2026 20:29
@g-talbot g-talbot merged commit 4cafb74 into main Apr 30, 2026
9 checks passed
@g-talbot g-talbot deleted the gtt/merge-output-split-metadata branch April 30, 2026 20:43

2 participants