Moved gp_relsizes_stats extention from Greenplum#1757
Conversation
…ototypes for load_data and write_data in table functions
- give ignored_db_names as array of datum (not as List) - add names of ignored dbs to sql-query and remove ignore-check loop
- rename some varables - remove unused variable
- remove recursive logic (base/dboid directory can not contain other directories - rename new_path -> file_path (because it can be just path to file)
- remove constant len buffers for query - make buffer bigger for query when write into it
- remove constant len buffer for query (in get_file_sizes_for_databases) - remove MAX_QUERY_SIZE constant
- dynamically allocate buffer for data_dir - remove all warnings during building
- remove some commented line - apply clang-format
Fix problem with PG_RE_THROW() cycling and rework comments in code
The function made pauses for gp_relsizes_stats.file_naptime (1ms) between lstat calls. It took 1000s to process one million files in database directory. Remove the pauses for relsizes_collect_stats_once. Save the pauses for the case when gp_relsizes_stats_worker gets statistics on files. Increase timeout for database_relsizes_collector_worker from 30 seconds to 5 hours. Additionally, fix the dboid argument type: it should be OID (unsigned), but not INTEGER (signed).
Fixed typo in column name in README.
…_stats_once Speedup relsizes_collect_stats_once
In the presence of partitioned tables, the size of their files was taken into account twice when calculating the schema size. Add the own_file column to table_files to distinguish its own table files from files of its partitions. Use this new column to add the size of each file only once when schema size is calculated. In addition, replace UNION with UNION ALL to simplify the query plan, as duplicate rows are not possible.
Fix the calculation of a schema size when partitioned tables present
…worker When relsizes_collect_stats_once was called and shared_preload_libraries was empty, then segfault happend, because shared memory was not initialized. Now it's not necessary to fill shared_preload_libraries to run the tests. The get_databases_oids function is simplified - it fetchs OIDs only. Move common code from relsizes_collect_stats_once and relsizes_collect_stats to relsizes_collect_stats_once_internal. Remove shmem_startup_hook usage. Database name in the database worker name is replaced with OID to make the code simpler . Getting name by OID makes the code more complex, because we need to open a transaction.
Use worker argument instead of shared memory to pass database to the worker
Replace memset with initialization at declaration. This gives compiler more freedom to optimize the code. Replace GetConfigOptionByName with direct access to variable.
Workers run queries which Orca cannot build plan for or which Orca cannot build a better plan for. So use Postgres planner, because it is faster and there will be no log records about "Feature not supported".
Disable Orca in workers
The extenstion should work without errors when gp_relsizes_stats.so is built from this commit, but ALTER EXTENSION has not yet been executed. When relsizes_stats_schema.get_stats_for_database has one argument, the one argument only is passed to the function. Revert this commit after decision how to upgrade extensions is made.
truncate takes access exclusive lock access delete from table doesn't take lock now we can select statistics if collecting at the moment
Add privileges for user who creates gp_relsizes_stats to use all tables, views and functions from the extension. This user can grant this privileges to others. It is not necessary to grant EXECUTE on the functions, because users can execute them by default when they have the USAGE privilege on the schema.
Non-super users can use the extension right after they create it
There was a problem hiding this comment.
Hi, @Vlasdislav welcome!🎊 Thanks for taking the effort to make our project better! 🙌 Keep making such awesome contributions!
There was a problem hiding this comment.
Pull request overview
This PR ports the gp_relsizes_stats extension into the Cloudberry monorepo under gpcontrib/, integrating it into the in-tree build and adding regression tests.
Changes:
- Adds the new
gp_relsizes_statsextension (C code, SQL install/upgrade scripts, control file, docs). - Integrates the extension into
gpcontrib/Makefilerecursion targets and adds a dual-mode (PGXS vs in-tree) Makefile. - Adds regression tests and expected outputs for functionality and privilege behavior.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 15 comments.
Show a summary per file
| File | Description |
|---|---|
| gpcontrib/Makefile | Adds gp_relsizes_stats to the gpcontrib build recurse targets. |
| gpcontrib/gp_relsizes_stats/Makefile | Builds/installs the extension in-tree or via PGXS; wires up regression tests. |
| gpcontrib/gp_relsizes_stats/src/gp_relsizes_stats.c | Implements background worker orchestration and filesystem stats collection. |
| gpcontrib/gp_relsizes_stats/sql/gp_relsizes_stats--1.3.sql | Creates schema/tables/views and declares C functions. |
| gpcontrib/gp_relsizes_stats/sql/gp_relsizes_stats--1.2--1.3.sql | Upgrade script (grants). |
| gpcontrib/gp_relsizes_stats/sql/gp_relsizes_stats--1.1--1.2.sql | Upgrade script (view changes). |
| gpcontrib/gp_relsizes_stats/sql/gp_relsizes_stats--1.0--1.1.sql | Upgrade script (function signature change). |
| gpcontrib/gp_relsizes_stats/gp_relsizes_stats.control | Registers extension metadata (version, module path, trusted flag). |
| gpcontrib/gp_relsizes_stats/test/sql/grants.sql | Adds privileges/regression coverage for creator + granting to another role. |
| gpcontrib/gp_relsizes_stats/test/sql/gp_relsizes_stats.sql | Adds functional regression coverage for stats collection and size reporting. |
| gpcontrib/gp_relsizes_stats/test/expected/grants.out | Expected output for grants.sql. |
| gpcontrib/gp_relsizes_stats/test/expected/gp_relsizes_stats.out | Expected output for gp_relsizes_stats.sql. |
| gpcontrib/gp_relsizes_stats/README.md | Extension documentation. |
| gpcontrib/gp_relsizes_stats/LICENCE | Bundled Apache 2.0 license text for the extension. |
| gpcontrib/gp_relsizes_stats/.gitignore | Ignores local build artifacts in the extension directory. |
| gpcontrib/gp_relsizes_stats/.clang-format | Local formatting configuration for the extension sources. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } s; | ||
| } DbWorkerArg; | ||
|
|
||
| static_assert(sizeof(Datum) == sizeof(DbWorkerArg), "Invalid size of structure in DbWorkerArg"); |
| bool oid_nullable; | ||
|
|
||
| heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc, &oid_datum, &oid_nullable); |
| if (create_transaction) { | ||
| SetCurrentStatementStartTimestamp(); | ||
| StartTransactionCommand(); | ||
| } | ||
|
|
||
| if (SPI_connect() < 0) { | ||
| error = "get_databases_oids: SPI_connect failed"; | ||
| goto finish_transaction; | ||
| } | ||
| if (create_transaction) { | ||
| PushActiveSnapshot(GetTransactionSnapshot()); | ||
| pgstat_report_activity(STATE_RUNNING, sql); | ||
| } | ||
|
|
| } | ||
| SPI_finish(); | ||
| finish_transaction: | ||
| PopActiveSnapshot(); |
| pgstat_report_activity(STATE_RUNNING, sql_get_stats); | ||
| retcode = SPI_execute_with_args(sql_get_stats, 1, | ||
| (Oid[]){INT4OID}, | ||
| (Datum[]){ObjectIdGetDatum(MyDatabaseId)}, |
| SELECT '\! cp "' || setting || '/pg_hba.conf" "' || setting || '/pg_hba.conf.backup"' as cp_backup | ||
| FROM pg_settings | ||
| WHERE name = 'data_directory' \gset | ||
|
|
||
| :cp_backup | ||
|
|
||
| SELECT '\! echo "local all user1,user2 trust" >> ' || setting || '/pg_hba.conf' as add_users | ||
| FROM pg_settings | ||
| WHERE name = 'data_directory' \gset | ||
|
|
||
| :add_users | ||
|
|
||
| -- start_ignore | ||
| \! gpstop -u | ||
| -- end_ignore |
| SELECT EXTRACT(EPOCH FROM LOCALTIMESTAMP(0)) t1 \gset | ||
|
|
||
| SELECT relsizes_stats_schema.relsizes_collect_stats_once(); | ||
|
|
||
| SELECT (EXTRACT(EPOCH FROM LOCALTIMESTAMP(0)) - :t1) < 5; | ||
|
|
| make && make install | ||
| ``` | ||
|
|
||
| ### Confguration |
| ### Installation | ||
| Install from source: | ||
| ``` | ||
| git clone git@github.com:open-gpdb/gp_relsizes_stats.git | ||
| cd gp_relsizes_stats | ||
| # Build it. Building would require GP installed nearby and sourcing greenplum_path.sh | ||
| source <path_to_gp>/greenplum_path.sh | ||
| make && make install |
| DATA = $(wildcard sql/*--*.sql) | ||
| REGRESS = grants gp_relsizes_stats | ||
| REGRESS_OPTS = --inputdir=test/ | ||
| PGFILEDESC = "gp_relsizes_stats - an extension to track table on-disc sizes in greenplum" |
|
Hi @Vlasdislav thanks for your work. The following are my comments for your reference:
|
There was a problem hiding this comment.
Please add the standard ASF license header. Can take this for a reference:
cloudberry/gpcontrib/gp_stats_collector/src/gp_stats_collector.c
Lines 1 to 26 in d4f437b
There was a problem hiding this comment.
Yes, I have already added it locally
There was a problem hiding this comment.
I think we can remove this file.
| DATA = $(wildcard sql/*--*.sql) | ||
| REGRESS = grants gp_relsizes_stats | ||
| REGRESS_OPTS = --inputdir=test/ | ||
| PGFILEDESC = "gp_relsizes_stats - an extension to track table on-disc sizes in greenplum" |
There was a problem hiding this comment.
Could you enable the tests in the .github/workflows CI workflows files for this new extension? Then we can see if it can build/run well in the test env.
3514aa1 to
6ee3710
Compare
What does this PR do?
This PR ports the
gp_relsizes_statsextension from the standalone open-gpdb/gp_relsizes_stats repository into the Cloudberry monorepo undergpcontrib/.The extension collects and stores statistics about table and file sizes on master and segment hosts. It supports automatic collection via a Background Worker and manual one-shot collection via
relsizes_stats_schema.relsizes_collect_stats_once().Changes made during porting:
Makefile— added dual-mode build support:USE_PGXS=1for standalone builds (original behavior) and in-tree build via$(top_builddir)/src/Makefile.global+contrib-global.mkfor building inside the Cloudberry source treegpcontrib/Makefile— addedgp_relsizes_statstorecurse_targetsfor both release and debug (enable_debug_extensions=yes) build configurationsREADME.md— Updatedrebuild_in_docker.sh— updated defaultPG_CONFIGpath from Greenplum to Cloudberry (/usr/local/cloudberry-db/bin/pg_config) and updated default container directory path accordinglyType of Change
Breaking Changes
Not applicable.
Test Plan
test/sql/gp_relsizes_stats.sqlandtest/sql/grants.sqlare included from the upstream repositorymake installcheckmake -C src/test installcheck-cbdb-parallelImpact
Performance:
No impact on existing functionality. The background worker is disabled by default (
gp_relsizes_stats.enabled = false).User-facing changes:
New extension
gp_relsizes_statsavailable ingpcontrib. Must be added toshared_preload_librariesto enable the background worker.Dependencies:
None. Uses standard PostgreSQL/Cloudberry APIs only.
Checklist
Additional Context
Upstream repository: https://github.com/open-gpdb/gp_relsizes_stats