feat: compute per-output split metadata in merge engine#6359
Merged
Conversation
fc6f90a to
720560d
Compare
720560d to
3e61af4
Compare
Adds `parquet_merge_policy` section to `IndexingSettings`, making the Parquet merge policy configurable per-index via YAML. Parameters: - merge_factor (default 10): min splits to trigger a merge - max_merge_factor (default 12): max splits per merge - max_merge_ops (default 4): bounds write amplification - target_split_size_bytes (default 256 MiB): target output size - maturation_period (default 48h): split maturity timeout - max_finalize_merge_operations (default 3): cold-window shutdown limit Mirrors the existing merge_policy config pattern for logs/traces. Updates index-config.md documentation with the new section. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…secs Adds `parquet_indexing` section to `IndexingSettings` for per-index Parquet pipeline configuration: - `sort_fields`: sort schema override (Husky-style pipe-delimited syntax with /V2 suffix). Controls row ordering, query pruning, compression locality, and compaction scope. When omitted, uses the product-type default. - `window_duration_secs`: time window for split partitioning (default 900s / 15 min). Must divide 3600. Updates docs/configuration/index-config.md with: - "Parquet indexing settings" section explaining both parameters - Full sort schema syntax reference (column types, direction overrides, & LSM cutoff marker) - Examples showing minimal, custom, and advanced configurations Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bd23d3b to
a28e4a6
Compare
3e61af4 to
f90a265
Compare
a28e4a6 to
4fae305
Compare
f90a265 to
5c7476e
Compare
mattmkim
reviewed
Apr 30, 2026
ebab487 to
8656c44
Compare
2474b24 to
f374ad0
Compare
mattmkim
approved these changes
Apr 30, 2026
74b1ce1 to
fc03d3d
Compare
Adding ParquetMergePolicyConfig and ParquetIndexingConfig to IndexingSettings changes the Hash output, which changes the pipeline params fingerprints. Updated the hardcoded test constants. Added a comment explaining how to recompute them when IndexingSettings fields change. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fc03d3d to
58d07d0
Compare
The merge engine now extracts metric_names, time_range, and low_cardinality_tags from each output file's actual rows during the merge write pass. Previously, MergeOutputFile only contained physical metadata (num_rows, size_bytes, row_keys, zonemaps). The downstream metadata_aggregation function inferred logical metadata by unioning all input splits — which is incorrect when num_outputs > 1, since each output contains only a subset of the globally sorted rows. Now each MergeOutputFile carries: - metric_names: distinct metrics in this output's rows - time_range: min/max timestamp_secs from this output's rows - low_cardinality_tags: service names from this output's rows Reuses existing extract_metric_names, extract_service_names, and extract_time_range from split_writer (made pub(crate)). Includes test that verifies per-output metadata is computed from actual rows when merging 2 inputs into 2 outputs with different metric names. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
f374ad0 to
c639cf6
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Stacked on #6351 (Phase 2). Should be merged before #6352 (Phase 3a).
The merge engine now extracts
metric_names,time_range, andlow_cardinality_tags(service) from each output file's actual rows during the merge write pass.Problem:
MergeOutputFilepreviously only had physical metadata (num_rows, size_bytes, row_keys, zonemaps). The downstreammetadata_aggregationfunction inferred logical metadata by unioning all input splits. This is incorrect whennum_outputs > 1— each output contains a different subset of the globally sorted rows and should have metadata reflecting only its own data.Fix: Each
MergeOutputFilenow carries per-output logical metadata extracted from thesorted_batchbefore writing. Reusesextract_metric_names,extract_service_names,extract_time_rangefromsplit_writer(madepub(crate)).Test plan
test_merge_per_output_metadata_from_actual_rows— verifies 2-output merge has correct per-output metric_names and time_rangetest_merge_multiple_outputswith per-output metadata assertionscargo clippyclean🤖 Generated with Claude Code