Skip to content

ORCA: order-preserving LINT mapping for string statistics#1742

Open
yjhjstz wants to merge 2 commits into
apache:mainfrom
yjhjstz:orca/string-stats-ordering
Open

ORCA: order-preserving LINT mapping for string statistics#1742
yjhjstz wants to merge 2 commits into
apache:mainfrom
yjhjstz:orca/string-stats-ordering

Conversation

@yjhjstz
Copy link
Copy Markdown
Member

@yjhjstz yjhjstz commented May 13, 2026

Previously, string statistics were mapped to LINT values via hashtext, which does not preserve lexicographic order. As a result, ORCA's range/comparison estimates on string columns could be wildly inaccurate. This change introduces an order-preserving mapping so that a < b on strings implies LINT(a) < LINT(b), restoring meaningful selectivity estimates for inequality predicates on text columns.

Replace the hash-based LINT mapping used by ORCA's varchar/text/bpchar statistics with an order-preserving 7-byte locale sort-key prefix, so that single-column MCV and histogram estimates respect the column's collation order instead of collapsing to hash values.

Fixes #ISSUE_Number

What does this PR do?

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Additional Context

CI Skip Instructions


@yjhjstz yjhjstz force-pushed the orca/string-stats-ordering branch 3 times, most recently from 7feb5b7 to 56cb1fc Compare May 15, 2026 15:30
yjhjstz added 2 commits May 19, 2026 21:49
Replace the hash-based LINT mapping used by ORCA's varchar/text/bpchar
statistics with an order-preserving 7-byte locale sort-key prefix, so
that single-column MCV and histogram estimates respect the column's
collation order instead of collapsing to hash values.
PG's bpchareq treats trailing spaces as insignificant: a char(50) MCV
stored as 'Books' is space-padded to 50 bytes in pg_statistic, while a
query constant ``i_category='Books'`` arrives unpadded.  In
ExtractLintValueFromDatum the two produce different sort-key prefixes
and therefore different LINTs, so CDatumGenericGPDB::StatsAreEqual
(via IDatum::StatsAreEqual on the LINT mapping) misses every MCV
bucket -- ORCA falls back to MinRows = 1 and downstream plans go
wildly wrong (e.g. TPC-DS sf=5 Q33 estimates 1 row for i_category
='Books', actual 5331, then cascades into a 14M-row IndexNL).

Trim trailing ASCII spaces from the bpchar payload before feeding it
to the locale sort-key transform.  Only the LINT path is affected;
the round-tripped varlena bytes are preserved so that bpchar
constants in INSERT/SELECT plans still carry their declared padding.
@yjhjstz yjhjstz marked this pull request as ready for review May 19, 2026 14:05
@yjhjstz yjhjstz force-pushed the orca/string-stats-ordering branch from 56cb1fc to 10b447c Compare May 19, 2026 14:10
@leborchuk leborchuk self-requested a review May 21, 2026 07:50
Copy link
Copy Markdown
Contributor

@leborchuk leborchuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great changes! Everything looks great. But we should decide what we can do with cardinality estimation regressions. At least let's discuss it, even if we can't fix it.

-> Seq Scan on ao_tbl (cost=0.00..431.00 rows=7 width=4)
Filter: ((path_hash)::text = 'ABC'::text)
Optimizer: GPORCA
(4 rows)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like a regression. Bitmap Index Scan switched to Seq Scan. But it's a wrong decision.

Earlier we performed:

insert into ao_tbl select 'abc' from generate_series(1,20) i;

So ao_tbl contains only lowecased strings. And when we search ABC in uppercase it returns no rows. Using index to check that there actually no rows looks preferable. Ad old cost is lower then new cost.

estimated | actual
-----------+--------
36 | 50
2 | 50
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the second problem. The new estimation is worse then old algorithm.

Query Old estimate New estimate Actual
a < 1 AND b < '1' 36 2 50
a < 5 AND b < '1' AND c < 5 85 5 50

As I understand here we just have a correlation between predicates:

postgres=# SELECT count(*) FROM mcv_lists WHERE a < 1;
 count
-------
    50
(1 row)

postgres=# SELECT count(*) FROM mcv_lists WHERE b < '1';
 count
-------
   100
(1 row)

postgres=# SELECT count(*) FROM mcv_lists WHERE a < 1 AND b < '1';
 count
-------
    50
(1 row)

postgres=# SELECT count(*) FROM mcv_lists;
 count
-------
  5000
(1 row)

So the overall cardinality should be like 5000 * 50/5000 * 100/5000 == 1. The old approach used default cardinality values and it appears closer to the actual cardinality for skew.

I do not know what to do with it. It's not a flaw in the current approach, but it is inspired by changes. But anyway, maybe we can improve it somehow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants