Skip to content

Zone Map Pruning for Metrics#6363

Open
alexanderbianchi wants to merge 1 commit intomainfrom
bianchi/zonemap
Open

Zone Map Pruning for Metrics#6363
alexanderbianchi wants to merge 1 commit intomainfrom
bianchi/zonemap

Conversation

@alexanderbianchi
Copy link
Copy Markdown
Collaborator

@alexanderbianchi alexanderbianchi commented Apr 29, 2026

Summary

  • Adds conservative scan-time metadata pruning for metrics splits after metastore split discovery.
  • Uses exact split metadata first (metric_name, low-cardinality tags), then falls back to per-column zonemap_regexes when available.
  • Supports string equality, IN, and simple prefix LIKE patterns such as host LIKE 'ID-07%'.
  • Evaluates prefix LIKE with DFA prefix-language matching against the zonemap regex, so it checks whether any value matching the prefix could exist in the split.
  • Caches compiled exact-match regexes and prefix DFAs with mini-moka.
  • Keeps splits when metadata is missing, regexes are invalid, DFA compilation exceeds the per-regex limit, or an expression is outside the supported pruning subset.
  • Allows metadata pruning for any declared table column that has split metadata, not only the default metrics tag columns.

Notes

Zonemap metadata is a column_name -> superset_regex map generated by the parquet writer for string-valued sort-schema columns. This PR only prunes when the fetched split metadata has usable information for the queried column; DataFusion still applies the row-level filter afterward.

The Java/logs evaluator tests that apply to metrics string zonemaps were ported. Logs-specific integer min/max, hash rowset, ReaderQL function, and case-insensitive ReaderQL cases are out of scope for this metrics/DataFusion path.

Testing

  • cargo fmt -p quickwit-datafusion
  • cargo test -p quickwit-datafusion

@alexanderbianchi
Copy link
Copy Markdown
Collaborator Author

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. 🎉

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@alexanderbianchi alexanderbianchi force-pushed the bianchi/zonemap branch 2 times, most recently from f2b6096 to 93bcde4 Compare April 30, 2026 12:14
"timeseries_id",
"timestamp_secs",
];
const ZONEMAP_DFA_SIZE_LIMIT_BYTES: usize = 1_000_000;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this number?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a code comment here. The 1 MB limit caps DFA determinization/memory for pathological zonemap regexes; if compilation exceeds it, we fall back to conservative keep-split behavior.

Copy link
Copy Markdown
Contributor

@g-talbot g-talbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two concerns. The first is that the compiled DFA for the regex should be cached. Creating it can be a very expensive operation. Second, let's make sure that all of the tests that the Java code ran against this matching are translated here. Other than that, this LG and I'll LGTM once these are addressed.

}

fn zonemap_may_match_any_prefix(superset_regex: &str, prefixes: &[String]) -> bool {
let dfa_config = dense::Config::new()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be very important for performance to cache the config and DFA. It can be very expensive to build these.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed. The scan path now caches compiled regex::Regex values and compiled prefix DFAs by zonemap regex string, uses shared DFA/syntax config, and keeps the DFA cache bounded separately because each DFA can be much larger than a regex::Regex.

);
}

#[test]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we doing all the same tests that the Java code did?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed for the matching semantics that apply to metrics metadata. I added tests mirroring the Java evaluator string-zonemap cases: missing metadata, exact regex, superset regex, case-sensitive matching, multi-column AND, conservative OR, IN lists, invalid regex, newline/DOTALL behavior, and unsupported string forms staying conservative. The Java integer min/max, hash rowset, ReaderQL function, and case-insensitive ReaderQL equality cases are logs/Event Store-specific, so I called those out in the PR description instead of porting them as metrics tests.

@alexanderbianchi alexanderbianchi force-pushed the bianchi/zonemap branch 6 times, most recently from 79803a9 to 85cf63c Compare May 1, 2026 20:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants