Zone Map Pruning for Metrics#6363
Conversation
cdd27b7 to
910756a
Compare
|
@codex review |
|
Codex Review: Didn't find any major issues. 🎉 ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
f2b6096 to
93bcde4
Compare
| "timeseries_id", | ||
| "timestamp_secs", | ||
| ]; | ||
| const ZONEMAP_DFA_SIZE_LIMIT_BYTES: usize = 1_000_000; |
There was a problem hiding this comment.
Added a code comment here. The 1 MB limit caps DFA determinization/memory for pathological zonemap regexes; if compilation exceeds it, we fall back to conservative keep-split behavior.
g-talbot
left a comment
There was a problem hiding this comment.
Two concerns. The first is that the compiled DFA for the regex should be cached. Creating it can be a very expensive operation. Second, let's make sure that all of the tests that the Java code ran against this matching are translated here. Other than that, this LG and I'll LGTM once these are addressed.
| } | ||
|
|
||
| fn zonemap_may_match_any_prefix(superset_regex: &str, prefixes: &[String]) -> bool { | ||
| let dfa_config = dense::Config::new() |
There was a problem hiding this comment.
It will be very important for performance to cache the config and DFA. It can be very expensive to build these.
There was a problem hiding this comment.
Addressed. The scan path now caches compiled regex::Regex values and compiled prefix DFAs by zonemap regex string, uses shared DFA/syntax config, and keeps the DFA cache bounded separately because each DFA can be much larger than a regex::Regex.
| ); | ||
| } | ||
|
|
||
| #[test] |
There was a problem hiding this comment.
Are we doing all the same tests that the Java code did?
There was a problem hiding this comment.
Addressed for the matching semantics that apply to metrics metadata. I added tests mirroring the Java evaluator string-zonemap cases: missing metadata, exact regex, superset regex, case-sensitive matching, multi-column AND, conservative OR, IN lists, invalid regex, newline/DOTALL behavior, and unsupported string forms staying conservative. The Java integer min/max, hash rowset, ReaderQL function, and case-insensitive ReaderQL equality cases are logs/Event Store-specific, so I called those out in the PR description instead of porting them as metrics tests.
79803a9 to
85cf63c
Compare
85cf63c to
b28a172
Compare
Summary
metric_name, low-cardinality tags), then falls back to per-columnzonemap_regexeswhen available.IN, and simple prefixLIKEpatterns such ashost LIKE 'ID-07%'.LIKEwith DFA prefix-language matching against the zonemap regex, so it checks whether any value matching the prefix could exist in the split.mini-moka.Notes
Zonemap metadata is a
column_name -> superset_regexmap generated by the parquet writer for string-valued sort-schema columns. This PR only prunes when the fetched split metadata has usable information for the queried column; DataFusion still applies the row-level filter afterward.The Java/logs evaluator tests that apply to metrics string zonemaps were ported. Logs-specific integer min/max, hash rowset, ReaderQL function, and case-insensitive ReaderQL cases are out of scope for this metrics/DataFusion path.
Testing
cargo fmt -p quickwit-datafusioncargo test -p quickwit-datafusion