Moved gp_relsizes_stats extension from Greenplum #1757

Open
Vlasdislav wants to merge 116 commits into apache:main from Vlasdislav:gp_relsizes_stats

Conversation

@Vlasdislav Vlasdislav commented May 19, 2026

What does this PR do?

This PR ports the gp_relsizes_stats extension from the standalone open-gpdb/gp_relsizes_stats repository into the Cloudberry monorepo under gpcontrib/.

The extension collects and stores statistics about table and file sizes on master and segment hosts. It supports automatic collection via a Background Worker and manual one-shot collection via relsizes_stats_schema.relsizes_collect_stats_once().

Changes made during porting:

  • Makefile — added dual-mode build support: USE_PGXS=1 for standalone builds (original behavior) and in-tree build via $(top_builddir)/src/Makefile.global + contrib-global.mk for building inside the Cloudberry source tree
  • gpcontrib/Makefile — added gp_relsizes_stats to recurse_targets for both release and debug (enable_debug_extensions=yes) build configurations
  • README.md — Updated
  • rebuild_in_docker.sh — updated default PG_CONFIG path from Greenplum to Cloudberry (/usr/local/cloudberry-db/bin/pg_config) and updated default container directory path accordingly

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

Not applicable.

Test Plan

  • Integration tests added/updated — regression tests in test/sql/gp_relsizes_stats.sql and test/sql/grants.sql are included from the upstream repository
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:
No impact on existing functionality. The background worker is disabled by default (gp_relsizes_stats.enabled = false).

User-facing changes:
New extension gp_relsizes_stats available in gpcontrib. Must be added to shared_preload_libraries to enable the background worker.

Dependencies:
None. Uses standard PostgreSQL/Cloudberry APIs only.

Checklist

Additional Context

Upstream repository: https://github.com/open-gpdb/gp_relsizes_stats

Smyatkin-Maxim and others added 30 commits March 18, 2024 12:06
…ototypes for load_data and write_data in table functions
- give ignored_db_names as array of datum (not as List)
- add names of ignored dbs to sql-query and remove ignore-check loop
- rename some variables
- remove unused variable
- remove recursive logic (base/dboid directory can not contain other directories)
- rename new_path -> file_path (because it can be just path to file)
- remove constant len buffers for query
- make buffer bigger for query when write into it
- remove constant len buffer for query (in get_file_sizes_for_databases)
- remove MAX_QUERY_SIZE constant
- dynamically allocate buffer for data_dir
- remove all warnings during building
- remove some commented line
- apply clang-format
AndrewOvvv and others added 21 commits August 28, 2025 16:34
Fix problem with PG_RE_THROW() cycling and rework comments in code
The function made pauses for gp_relsizes_stats.file_naptime (1ms) between lstat
calls. It took 1000s to process one million files in database directory.

Remove the pauses for relsizes_collect_stats_once. Save the pauses for the case
when gp_relsizes_stats_worker gets statistics on files.
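
As a rough illustration of the trade-off described above: with a 1 ms pause per file, one million files cost 1,000,000 × 1 ms = 1000 s of sleeping alone. The following is a minimal standalone sketch, not the extension's code. The function name `sum_file_sizes` and the `nap_ms` parameter are hypothetical; passing `nap_ms == 0` models the one-shot path without pauses, and a positive value models the throttled background-worker path.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

/*
 * Sum the sizes of regular files directly inside dir_path (no recursion).
 * nap_ms > 0 sleeps that many milliseconds after each lstat call;
 * nap_ms == 0 scans without pausing.
 * Returns -1 if the directory cannot be opened.
 * Hypothetical helper for illustration only.
 */
static long long sum_file_sizes(const char *dir_path, long nap_ms)
{
    assert(dir_path != NULL);

    DIR *dir = opendir(dir_path);
    if (dir == NULL)
        return -1;

    long long total = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        char file_path[4096];
        struct stat st;

        int n = snprintf(file_path, sizeof(file_path), "%s/%s",
                         dir_path, entry->d_name);
        if (n < 0 || (size_t)n >= sizeof(file_path))
            continue; /* path too long for this sketch: skip the entry */

        /* Count only regular files; "." and ".." are directories. */
        if (lstat(file_path, &st) == 0 && S_ISREG(st.st_mode))
            total += (long long)st.st_size;

        if (nap_ms > 0) {
            struct timespec nap = {
                .tv_sec = nap_ms / 1000,
                .tv_nsec = (nap_ms % 1000) * 1000000L,
            };
            nanosleep(&nap, NULL);
        }
    }
    closedir(dir);
    return total;
}
```

Calling `sum_file_sizes(path, 0)` corresponds to the unthrottled one-shot case, while `sum_file_sizes(path, 1)` adds the 1 ms pause per file.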

Increase timeout for database_relsizes_collector_worker from 30 seconds to 5
hours.

Additionally, fix the dboid argument type: it should be OID (unsigned), not
INTEGER (signed).
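
To see why the signed type is a problem: PostgreSQL's `Oid` is an unsigned 32-bit integer, so roughly half of its range has no representation in a signed 32-bit INTEGER. The sketch below is standalone and uses a local typedef instead of the PostgreSQL headers; the helper name `oid_fits_in_int4` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in for PostgreSQL's Oid, which is an unsigned 32-bit type. */
typedef uint32_t Oid;

static_assert(sizeof(Oid) == 4, "Oid is expected to be 32 bits wide");

/* True when the OID value can be stored in a signed 32-bit INTEGER. */
static bool oid_fits_in_int4(Oid oid)
{
    return oid <= (Oid)INT32_MAX;
}
```

A small OID such as 16384 fits either type, but any OID above 2147483647 cannot be passed through a signed 32-bit argument without changing its value.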
Fixed typo in column name in README.
…_stats_once

Speedup relsizes_collect_stats_once
In the presence of partitioned tables, the size of their files was taken into
account twice when calculating the schema size.

Add the own_file column to table_files to distinguish its own table files from
files of its partitions. Use this new column to add the size of each file only
once when schema size is calculated.

In addition, replace UNION with UNION ALL to simplify the query plan, as
duplicate rows are not possible.
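
The double counting and its fix can be sketched in a few lines. This is an in-memory model under stated assumptions, not the extension's SQL: the `FileRow` layout, the sample rows, and both function names are hypothetical, and only the `own_file` flag is taken from the commit message. Each partition file is listed once under the partition (`own_file` true) and once under the partitioned parent (`own_file` false).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical row of a table_files-like listing. */
typedef struct {
    uint32_t table_oid;  /* table the row is reported for */
    int64_t  size_bytes; /* size of the file */
    bool     own_file;   /* false when the file belongs to a partition */
} FileRow;

/* Parent 100 with partitions 101 and 102: every file appears twice. */
static const FileRow demo_rows[] = {
    {101, 8192, true},
    {102, 16384, true},
    {100, 8192, false},
    {100, 16384, false},
};

/* Sums every row, so partition files are counted twice. */
static int64_t naive_schema_size(const FileRow *rows, size_t n)
{
    int64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += rows[i].size_bytes;
    return total;
}

/* Sums only rows that describe a table's own files. */
static int64_t schema_size(const FileRow *rows, size_t n)
{
    int64_t total = 0;
    for (size_t i = 0; i < n; i++)
        if (rows[i].own_file)
            total += rows[i].size_bytes;
    return total;
}
```

On the sample rows the naive sum is exactly twice the filtered sum, which is the symptom the commit describes.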
Fix the calculation of a schema size when partitioned tables present
…worker

When relsizes_collect_stats_once was called and shared_preload_libraries was
empty, a segfault occurred, because shared memory was not initialized.

Now it's not necessary to fill shared_preload_libraries to run the tests.

The get_databases_oids function is simplified: it fetches OIDs only. Move
common code from relsizes_collect_stats_once and relsizes_collect_stats to
relsizes_collect_stats_once_internal. Remove shmem_startup_hook usage.

Database name in the database worker name is replaced with OID to make
the code simpler. Getting the name by OID makes the code more complex, because
we need to open a transaction.
Use worker argument instead of shared memory to pass database to the worker
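
One way to pass a database OID through a single pointer-sized worker argument is a union sized to match the argument type, which appears to be the idea behind the `DbWorkerArg` union and its `static_assert` in the reviewed diff. The sketch below is standalone and only an approximation: the `Datum` and `Oid` typedefs are local stand-ins for the PostgreSQL types, and the member name `dboid` and both helper functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins so the sketch compiles without PostgreSQL headers. */
typedef uintptr_t Datum;
typedef uint32_t  Oid;

/* The worker argument viewed either as a raw Datum or as typed fields. */
typedef union {
    Datum datum;
    struct {
        Oid dboid; /* hypothetical member name */
    } s;
} DbWorkerArg;

/* The typed view must not grow beyond what a Datum can carry. */
static_assert(sizeof(Datum) == sizeof(DbWorkerArg),
              "DbWorkerArg must be exactly the size of a Datum");

/* Launcher side: pack the OID into the value handed to the worker. */
static Datum pack_db_worker_arg(Oid dboid)
{
    DbWorkerArg arg = {.datum = 0};
    arg.s.dboid = dboid;
    return arg.datum;
}

/* Worker side: recover the OID from the received argument. */
static Oid unpack_db_worker_arg(Datum datum)
{
    DbWorkerArg arg = {.datum = datum};
    return arg.s.dboid;
}
```

Because the value travels with the worker's start-up argument, no shared memory segment has to exist before the worker is launched.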
Replace memset with initialization at declaration. This gives the compiler more
freedom to optimize the code.
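
For illustration, the two zeroing styles look like this side by side. The `WorkerState` struct and the function names are hypothetical and not taken from the extension; the sketch only shows that both forms yield the same field values on mainstream platforms.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical state struct used only for this comparison. */
typedef struct {
    int         retcode;
    const char *error;
    char        name[64];
} WorkerState;

/* Old style: declare first, then clear the bytes with memset. */
static WorkerState state_via_memset(void)
{
    WorkerState st;
    memset(&st, 0, sizeof(st));
    return st;
}

/* New style: zero-initialize every member at the declaration. */
static WorkerState state_via_initializer(void)
{
    WorkerState st = {0};
    return st;
}

/* Field-by-field comparison of the two results. */
static bool states_match(void)
{
    WorkerState a = state_via_memset();
    WorkerState b = state_via_initializer();

    return a.retcode == b.retcode &&
           a.error == b.error &&
           memcmp(a.name, b.name, sizeof(a.name)) == 0;
}
```

The initializer form states the intent in the declaration itself and leaves no window in which the variable is uninitialized.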
Replace GetConfigOptionByName with direct access to variable.
Workers run queries for which Orca either cannot build a plan or cannot build
a better one. Use the Postgres planner instead: it is faster, and there will
be no log records about "Feature not supported".
The extension should work without errors when gp_relsizes_stats.so is built
from this commit, but ALTER EXTENSION has not yet been executed.
When relsizes_stats_schema.get_stats_for_database has one argument, only that
argument is passed to the function.
Revert this commit after decision how to upgrade extensions is made.
TRUNCATE takes an ACCESS EXCLUSIVE lock

DELETE does not take an ACCESS EXCLUSIVE lock

so statistics can now be selected while collection is in progress
Add privileges for the user who creates gp_relsizes_stats to use all tables, views
and functions from the extension. This user can grant these privileges to others.
It is not necessary to grant EXECUTE on the functions, because users can
execute them by default when they have the USAGE privilege on the schema.
Non-super users can use the extension right after they create it
…23caa7e897a2bb6d4e9'

git-subtree-dir: gpcontrib/gp_relsizes_stats
git-subtree-mainline: 61633d2
git-subtree-split: fa47d5b

@github-actions github-actions Bot left a comment

Hi, @Vlasdislav welcome!🎊 Thanks for taking the effort to make our project better! 🙌 Keep making such awesome contributions!


Copilot AI left a comment

Pull request overview

This PR ports the gp_relsizes_stats extension into the Cloudberry monorepo under gpcontrib/, integrating it into the in-tree build and adding regression tests.

Changes:

  • Adds the new gp_relsizes_stats extension (C code, SQL install/upgrade scripts, control file, docs).
  • Integrates the extension into gpcontrib/Makefile recursion targets and adds a dual-mode (PGXS vs in-tree) Makefile.
  • Adds regression tests and expected outputs for functionality and privilege behavior.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 15 comments.

| File | Description |
| --- | --- |
| gpcontrib/Makefile | Adds gp_relsizes_stats to the gpcontrib build recurse targets. |
| gpcontrib/gp_relsizes_stats/Makefile | Builds/installs the extension in-tree or via PGXS; wires up regression tests. |
| gpcontrib/gp_relsizes_stats/src/gp_relsizes_stats.c | Implements background worker orchestration and filesystem stats collection. |
| gpcontrib/gp_relsizes_stats/sql/gp_relsizes_stats--1.3.sql | Creates schema/tables/views and declares C functions. |
| gpcontrib/gp_relsizes_stats/sql/gp_relsizes_stats--1.2--1.3.sql | Upgrade script (grants). |
| gpcontrib/gp_relsizes_stats/sql/gp_relsizes_stats--1.1--1.2.sql | Upgrade script (view changes). |
| gpcontrib/gp_relsizes_stats/sql/gp_relsizes_stats--1.0--1.1.sql | Upgrade script (function signature change). |
| gpcontrib/gp_relsizes_stats/gp_relsizes_stats.control | Registers extension metadata (version, module path, trusted flag). |
| gpcontrib/gp_relsizes_stats/test/sql/grants.sql | Adds privileges/regression coverage for creator + granting to another role. |
| gpcontrib/gp_relsizes_stats/test/sql/gp_relsizes_stats.sql | Adds functional regression coverage for stats collection and size reporting. |
| gpcontrib/gp_relsizes_stats/test/expected/grants.out | Expected output for grants.sql. |
| gpcontrib/gp_relsizes_stats/test/expected/gp_relsizes_stats.out | Expected output for gp_relsizes_stats.sql. |
| gpcontrib/gp_relsizes_stats/README.md | Extension documentation. |
| gpcontrib/gp_relsizes_stats/LICENCE | Bundled Apache 2.0 license text for the extension. |
| gpcontrib/gp_relsizes_stats/.gitignore | Ignores local build artifacts in the extension directory. |
| gpcontrib/gp_relsizes_stats/.clang-format | Local formatting configuration for the extension sources. |

} s;
} DbWorkerArg;

static_assert(sizeof(Datum) == sizeof(DbWorkerArg), "Invalid size of structure in DbWorkerArg");
Comment on lines +220 to +222
bool oid_nullable;

heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc, &oid_datum, &oid_nullable);
Comment on lines +192 to +205
if (create_transaction) {
    SetCurrentStatementStartTimestamp();
    StartTransactionCommand();
}

if (SPI_connect() < 0) {
    error = "get_databases_oids: SPI_connect failed";
    goto finish_transaction;
}
if (create_transaction) {
    PushActiveSnapshot(GetTransactionSnapshot());
    pgstat_report_activity(STATE_RUNNING, sql);
}

}
SPI_finish();
finish_transaction:
PopActiveSnapshot();
pgstat_report_activity(STATE_RUNNING, sql_get_stats);
retcode = SPI_execute_with_args(sql_get_stats, 1,
(Oid[]){INT4OID},
(Datum[]){ObjectIdGetDatum(MyDatabaseId)},
Comment on lines +10 to +24
SELECT '\! cp "' || setting || '/pg_hba.conf" "' || setting || '/pg_hba.conf.backup"' as cp_backup
FROM pg_settings
WHERE name = 'data_directory' \gset

:cp_backup

SELECT '\! echo "local all user1,user2 trust" >> ' || setting || '/pg_hba.conf' as add_users
FROM pg_settings
WHERE name = 'data_directory' \gset

:add_users

-- start_ignore
\! gpstop -u
-- end_ignore
Comment on lines +54 to +59
SELECT EXTRACT(EPOCH FROM LOCALTIMESTAMP(0)) t1 \gset

SELECT relsizes_stats_schema.relsizes_collect_stats_once();

SELECT (EXTRACT(EPOCH FROM LOCALTIMESTAMP(0)) - :t1) < 5;

Comment thread gpcontrib/gp_relsizes_stats/README.md Outdated
make && make install
```

### Confguration
Comment thread gpcontrib/gp_relsizes_stats/README.md Outdated
Comment on lines +13 to +20
### Installation
Install from source:
```
git clone git@github.com:open-gpdb/gp_relsizes_stats.git
cd gp_relsizes_stats
# Build it. Building would require GP installed nearby and sourcing greenplum_path.sh
source <path_to_gp>/greenplum_path.sh
make && make install
tuhaihe (Member) commented May 21, 2026

Hi @Vlasdislav thanks for your work. The following are my comments for your reference:

  1. Import the commit history into gpcontrib/gp_relsizes_stats from the start, for an easier rebase:
git filter-repo --force --to-subdirectory-filter gpcontrib/gp_relsizes_stats
  2. Rewrite some commits for a cleaner, leaner commit history.
  3. ASF license compliance. See more comments in the code.

Member

Please add the standard ASF license header. You can use this as a reference:

/*-------------------------------------------------------------------------
 *
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 *
 * gp_stats_collector.c
 *
 * IDENTIFICATION
 * gpcontrib/gp_stats_collector/src/gp_stats_collector.c
 *
 *-------------------------------------------------------------------------
 */

Author

Yes, I have already added it locally

Comment thread gpcontrib/gp_relsizes_stats/LICENCE Outdated
Member

I think we can remove this file.

Comment thread gpcontrib/gp_relsizes_stats/Makefile Outdated
DATA = $(wildcard sql/*--*.sql)
REGRESS = grants gp_relsizes_stats
REGRESS_OPTS = --inputdir=test/
PGFILEDESC = "gp_relsizes_stats - an extension to track table on-disc sizes in greenplum"
Member

+1

Member

Could you enable the tests in the .github/workflows CI workflows files for this new extension? Then we can see if it can build/run well in the test env.

@Vlasdislav Vlasdislav force-pushed the gp_relsizes_stats branch from 3514aa1 to 6ee3710 Compare May 21, 2026 16:03