ORCA: order-preserving LINT mapping for string statistics by yjhjstz · Pull Request #1742 · apache/cloudberry

yjhjstz · 2026-05-13T23:48:37Z

Previously, string statistics were mapped to LINT values via hashtext, which does not preserve lexicographic order. As a result, ORCA's range/comparison estimates on string columns could be wildly inaccurate. This change introduces an order-preserving mapping so that a < b on strings implies LINT(a) < LINT(b), restoring meaningful selectivity estimates for inequality predicates on text columns.

Replace the hash-based LINT mapping used by ORCA's varchar/text/bpchar statistics with an order-preserving 7-byte locale sort-key prefix, so that single-column MCV and histogram estimates respect the column's collation order instead of collapsing to hash values.

Fixes #ISSUE_Number

What does this PR do?

Type of Change

Bug fix (non-breaking change)
New feature (non-breaking change)
Breaking change (fix or feature with breaking changes)
Documentation update

Breaking Changes

Test Plan

Unit tests added/updated
Integration tests added/updated
Passed make installcheck
Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Followed contribution guide
Added/updated documentation
Reviewed code for security implications
Requested review from cloudberry committers

Additional Context

CI Skip Instructions

Replace the hash-based LINT mapping used by ORCA's varchar/text/bpchar statistics with an order-preserving 7-byte locale sort-key prefix, so that single-column MCV and histogram estimates respect the column's collation order instead of collapsing to hash values.

PG's bpchareq treats trailing spaces as insignificant: a char(50) MCV stored as 'Books' is space-padded to 50 bytes in pg_statistic, while a query constant ``i_category='Books'`` arrives unpadded. In ExtractLintValueFromDatum the two produce different sort-key prefixes and therefore different LINTs, so CDatumGenericGPDB::StatsAreEqual (via IDatum::StatsAreEqual on the LINT mapping) misses every MCV bucket -- ORCA falls back to MinRows = 1 and downstream plans go wildly wrong (e.g. TPC-DS sf=5 Q33 estimates 1 row for i_category ='Books', actual 5331, then cascades into a 14M-row IndexNL). Trim trailing ASCII spaces from the bpchar payload before feeding it to the locale sort-key transform. Only the LINT path is affected; the round-tripped varlena bytes are preserved so that bpchar constants in INSERT/SELECT plans still carry their declared padding.

leborchuk

Great changes! Everything looks great. But we should decide what we can do with cardinality estimation regressions. At least let's discuss it, even if we can't fix it.

leborchuk · 2026-05-21T10:59:45Z

+   ->  Seq Scan on ao_tbl  (cost=0.00..431.00 rows=7 width=4)
+         Filter: ((path_hash)::text = 'ABC'::text)
+ Optimizer: GPORCA
+(4 rows)


It looks like a regression. Bitmap Index Scan switched to Seq Scan. But it's a wrong decision.

Earlier we performed:

insert into ao_tbl select 'abc' from generate_series(1,20) i;

So ao_tbl contains only lowecased strings. And when we search ABC in uppercase it returns no rows. Using index to check that there actually no rows looks preferable. Ad old cost is lower then new cost.

leborchuk · 2026-05-21T11:23:18Z

 estimated | actual 
 -----------+--------
-        36 |     50
+         2 |     50


Here the second problem. The new estimation is worse then old algorithm.

Query Old estimate New estimate Actual

a < 1 AND b < '1' 36 2 50

a < 5 AND b < '1' AND c < 5 85 5 50

As I understand here we just have a correlation between predicates:

postgres=# SELECT count(*) FROM mcv_lists WHERE a < 1; count ------- 50 (1 row) postgres=# SELECT count(*) FROM mcv_lists WHERE b < '1'; count ------- 100 (1 row) postgres=# SELECT count(*) FROM mcv_lists WHERE a < 1 AND b < '1'; count ------- 50 (1 row) postgres=# SELECT count(*) FROM mcv_lists; count ------- 5000 (1 row)

So the overall cardinality should be like 5000 * 50/5000 * 100/5000 == 1. The old approach used default cardinality values and it appears closer to the actual cardinality for skew.

I do not know what to do with it. It's not a flaw in the current approach, but it is inspired by changes. But anyway, maybe we can improve it somehow.

yjhjstz force-pushed the orca/string-stats-ordering branch 3 times, most recently from 7feb5b7 to 56cb1fc Compare May 15, 2026 15:30

yjhjstz added 2 commits May 19, 2026 21:49

yjhjstz marked this pull request as ready for review May 19, 2026 14:05

yjhjstz force-pushed the orca/string-stats-ordering branch from 56cb1fc to 10b447c Compare May 19, 2026 14:10

leborchuk self-requested a review May 21, 2026 07:50

leborchuk reviewed May 21, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ORCA: order-preserving LINT mapping for string statistics#1742

ORCA: order-preserving LINT mapping for string statistics#1742
yjhjstz wants to merge 2 commits into
apache:mainfrom
yjhjstz:orca/string-stats-ordering

yjhjstz commented May 13, 2026

Uh oh!

leborchuk left a comment

Uh oh!

leborchuk May 21, 2026

Uh oh!

leborchuk May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

yjhjstz commented May 13, 2026

What does this PR do?

Type of Change

Breaking Changes

Test Plan

Impact

Checklist

Additional Context

CI Skip Instructions

Uh oh!

leborchuk left a comment

Choose a reason for hiding this comment

Uh oh!

leborchuk May 21, 2026

Choose a reason for hiding this comment

Uh oh!

leborchuk May 21, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants