Skip to content

use has()/hasAny() on KV token columns for SQL/facet map filters#2295

Open
vinzee wants to merge 1 commit into
hyperdxio:mainfrom
vinzee:va/kv_optimization_fix
Open

use has()/hasAny() on KV token columns for SQL/facet map filters#2295
vinzee wants to merge 1 commit into
hyperdxio:mainfrom
vinzee:va/kv_optimization_fix

Conversation

@vinzee
Copy link
Copy Markdown
Contributor

@vinzee vinzee commented May 16, 2026

Summary

Before this change, only quoted Lucene terms could use KV-items direct read, and only when HyperDX successfully detected the materialized token column at startup. SQL and facet filters were always emitted as Map['key'] IN (...), which cannot use the text index.

This PR adds the same has()/hasAny() rewrite for SQL filters. Lucene behavior is unchanged and still depends on upstream kvItems detection (hyperdx#2248) plus ResourceAttributeTokens and its text(tokenizer=array) index on the cluster.

How to test on Vercel preview

N/A — non-UI change (@hyperdx/common-utils only).

References

  • Linear Issue: n/a
  • Related PRs: hyperdx#2248 (Lucene KV-items / ResourceAttributeTokens detection)

@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 16, 2026

🦋 Changeset detected

Latest commit: 26fc055

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@hyperdx/common-utils Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@vercel
Copy link
Copy Markdown

vercel Bot commented May 16, 2026

@vinzee is attempting to deploy a commit to the HyperDX Team on Vercel.

A member of the Team first needs to authorize it.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 16, 2026

PR Review

Solid PR with comprehensive test coverage for the regex-based rewrite. SQL injection risks are mitigated by SqlString.format with parameterized placeholders. Caching via Metadata.getKvItemsLookup consolidates the lookup logic nicely. A few items worth fixing:

  • ⚠️ Word-boundary gap in materialized-column regexes (packages/common-utils/src/core/renderChartConfig.ts:1103-1110) → The lookbehind (?<![.'])excludes., backtick, and 'but not word characters. A column liketag_facilitywould falsely match a materializedfacility rewrite (tag_facility IN ('x')has(...)). Tighten to (?<![\w.']) (and likewise for the eq regex) and add a regression test.
  • ⚠️ sqlForAliasMapParser truncates at first WHERE/ORDER BY (packages/common-utils/src/core/utils.ts:782-792) → If a SELECT contains a subquery (SELECT (SELECT x FROM t WHERE …) AS a FROM …), the function will slice mid-subquery and node-sql-parser will fail. Today chSqlToAliasMap would just swallow the warning, so behavior matches pre-PR for those queries, but worth a comment noting the limitation or guarding via paren depth.
  • ℹ️ Minor: MAP_EQ_RE (line 1204) uses (?<!!)= to skip !=, which works, but is easy to misread. A short comment would help future maintainers.
  • ℹ️ Minor: The new getKvItemsLookup call adds an awaited metadata round-trip per SQL where-clause render. Cached after first hit, but worth confirming there's no cold-path regression on first paint of a dashboard with many SQL filter conditions.

No blocking issues — the word-boundary fix is the only one I'd push for before merge.

@vinzee vinzee force-pushed the va/kv_optimization_fix branch 2 times, most recently from 7555402 to 501593a Compare May 16, 2026 15:53
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 16, 2026

Deep Review

🔴 P0/P1 -- must fix

  • packages/common-utils/src/core/renderChartConfig.ts:1102-1110 -- The materialized-column regex (?<![.\'])(?:(\w+).)?${escapedColumn}\s+IN\s+(lacks a word boundary, so a registered materialized column likefacilitymatches as a suffix of unrelated identifiers (e.g.,total_facility IN ('x')) and emits broken SQL such as total_has(`ResourceAttributeTokens`, ...); the same hazard applies to the =` variant.
    • Fix: Replace the leading deny-set lookbehind with one that also rejects word chars, e.g. (?<![\w.\'])`, and add a regression test for the suffix-collision case.
    • kieran-typescript, testing

🟡 P2 -- recommended

  • packages/common-utils/src/core/utils.ts:783-794 -- sqlForAliasMapParser uses a non-global regex against the whole SQL string, so the first \s+WHERE\s+ or \s+GROUP\s+BY\s+ is matched inside a subquery (SELECT a FROM (SELECT b FROM t WHERE c=1) sub) or inside a string literal, truncating the SQL to an unparseable prefix and silently returning an empty alias map.

    • Fix: Walk the SQL with awareness of single-quoted literals, --//* */ comments, and paren depth, only recognizing clause keywords at depth 0 outside literals/comments.
    • correctness
  • packages/common-utils/src/core/renderChartConfig.ts:1052-1129 -- MAP_IN_PREFIX_RE and MAP_EQ_RE have no (?<![.\'])lookbehind, so a backticked table alias (``t.ResourceAttributes['k'] IN ('v') ``) matches with match[1]undefined, the alias-skip path is bypassed, and the rewriter emits ``t`.has(...) `` which ClickHouse rejects.

    • Fix: Add the same lookbehind in front of the optional alias group on both regexes (and treat a backtick-quoted alias as match[1] truthy so the rewrite is skipped).
    • correctness
  • packages/common-utils/src/core/renderChartConfig.ts:1052-1129 -- The rewriter scans the raw condition with no awareness of single-quoted SQL literals or --//* */ comments, so any condition that contains a literal or comment with the substring Map['key'] IN (...) will be rewritten inside it.

    • Fix: Skip over single-quoted literals (reusing parseSqlSingleQuotedLiteralAt) and comment regions in the outer scan, like findClosingParenIndex already does for inner-paren matching.
    • correctness, kieran-typescript
  • packages/common-utils/src/core/renderChartConfig.ts:1183-1198 -- The new KV-items fetch logs a warning on failure while the adjacent getMaterializedColumnsLookupTable fetch silently swallows; both are best-effort optimizations and should share one policy so incident debugging is consistent.

    • Fix: Either log both console.warn (preferred — silent swallow hides real misconfigurations) or silently skip both, with a comment explaining why.
    • kieran-typescript
  • packages/common-utils/src/__tests__/renderChartConfig.test.ts -- Test coverage is missing for: materialized column whose name is a left-substring of another identifier (total_facility IN ('x') with facility registered), comments inside the condition, IN( without whitespace before (, lowercase in/In, two rewriteable map predicates in one condition, and case-mismatched map column names where the regex matches case-insensitively but the lookup is case-sensitive.

    • Fix: Add focused unit tests under describe('rewriteSqlWithKvItems') for each missed case, pinning current behavior.
    • testing, kieran-typescript
🔵 P3 nitpicks (8)
  • packages/common-utils/src/core/metadata.ts:16-21 -- type KvItemsInfo is imported but never referenced.

    • Fix: Drop KvItemsInfo from the import block.
  • packages/common-utils/src/core/renderChartConfig.ts:809-811 -- escapeRegExp duplicates escapeRegex already defined in drain/template-miner.ts:170 (the two have diverged by one character).

    • Fix: Export a single escapeRegExp from core/utils.ts and import from both call sites.
  • packages/common-utils/src/core/renderChartConfig.ts:858, 872, 878, 1075 -- noUncheckedIndexedAccess is not enabled in this package, so listBody[i]! and match[2]!/match[3]! are unnecessary, and the file mixes asserted and non-asserted indexed access for no semantic reason.

    • Fix: Drop the ! non-null assertions and consider destructuring (const [, alias, mapCol, key] = match;) to remove magic indices.
  • packages/common-utils/src/core/renderChartConfig.ts:946-996 -- rewriteSqlInClauses has zero JSDoc despite three callback outcomes (getMapKey → null, parseSqlInStringLiteralList → undefined, rewrite → undefined) that drive subtly different re.lastIndex and slice behavior.

    • Fix: Add a JSDoc block describing the callback contract and that inPrefixRe must end with (.
  • packages/common-utils/src/core/renderChartConfig.ts:932-938 -- isNotInMatch is unreachable: both regexes require \s+IN\s+\(, so MapCol['k'] NOT IN (...) already fails to match before the guard runs.

    • Fix: Delete the dead guard, or loosen the regex to \s+(?:NOT\s+)?IN and rely on the guard with a longer lookback that fits NOT .
  • packages/common-utils/src/core/renderChartConfig.ts:998-1014 -- buildMaterializedMapKeyLookup parses default_expression with a strict ^(\w+)\['([^']+)'\]$ that rejects backticked or whitespace-padded ClickHouse-formatted variants and silently drops those candidates.

    • Fix: Normalize the expression (strip backticks, collapse whitespace) before applying MAP_SUBSCRIPT_EXPR_RE, or relax the regex to allow optional backticks.
  • packages/common-utils/src/core/renderChartConfig.ts:807-1129 -- A 320-line regex-based SQL rewriter with nine private helpers sits inside a 2100-line chart-rendering file with a single call site; the helpers are generic SQL string utilities, not chart-rendering concerns.

    • Fix: Extract rewriteSqlWithKvItems and its helpers into core/rewriteKvItemsFilters.ts.
  • packages/common-utils/src/queryParser.ts:1115-1124 -- CustomSchemaSQLSerializerV2.buildKvItemsLookup is now a one-line delegate to metadata.getKvItemsLookup, yet the class still owns a kvItemsLookupPromise pre-fetch that duplicates the metadata cache.

    • Fix: Remove the redundant pre-fetch and read directly from metadata.getKvItemsLookup at the call site, or document why the eager warm-up still buys something.

Reviewers (6): correctness, performance, maintainability, testing, security, kieran-typescript.

Testing gaps:

  • No test asserts that getKvItemsLookup is served from cache after first call across N renders.
  • No test for materialized column names containing regex metacharacters routed through escapeRegExp to confirm no partial-match SQL.

SQL and facet filters such as ResourceAttributes['key'] IN ('value')
were passed through unchanged, so ClickHouse used map subscript/IN and
could not use text-index direct read on materialized key=value tokens.

Changes:
- Add rewriteSqlWithKvItems() for Map['key'] IN (=) patterns
- Add Metadata.getKvItemsLookup() (shared with Lucene kvItems path)
- Export KV_ITEMS_STRATEGIES and normalizeChExpression from queryParser

When a materialized KV column (e.g. ResourceAttributeTokens) has a
text(tokenizer=array) skip index, SQL filters are rewritten to
has()/hasAny() with concat(key, separator, value), matching the
existing Lucene quoted-term optimization.
@vinzee vinzee force-pushed the va/kv_optimization_fix branch from 501593a to 26fc055 Compare May 16, 2026 19:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant