Skip to content

Manually trigger repair data partition#17530

Open
zerolbsony wants to merge 5 commits intoapache:masterfrom
zerolbsony:manually-trigger-repair-data-partition
Open

Manually trigger repair data partition#17530
zerolbsony wants to merge 5 commits intoapache:masterfrom
zerolbsony:manually-trigger-repair-data-partition

Conversation

@zerolbsony
Copy link
Copy Markdown
Contributor

image

Copy link
Copy Markdown
Contributor

@CRZbulabula CRZbulabula left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Overview

This PR adds a new SQL command REPAIR DATA PARTITION TABLE that allows operators to manually trigger the DataPartitionTableIntegrityCheckProcedure on the ConfigNode. Previously this procedure was invoked via ProcedureManager and blocked synchronously (up to 24 hours). Now it's fire-and-forget: the SQL returns immediately after submitting the procedure.

The PR also removes the config property partition_table_recover_wait_all_dn_up_timeout_ms and its associated plumbing, suggesting the old automatic-wait-on-startup behavior is no longer needed.

The core plumbing is solid — grammar → parser → statement → visitor → task → executor → RPC → ConfigNode all wired up correctly. A few concerns below:


Issues & Suggestions

1. Bug: Double-set on SettableFuture in error path (Medium)

In ClusterConfigTaskExecutor.repairDataPartitionTable():

TSStatus tsStatus = new TSStatus();           // code defaults to 0
try (ConfigNodeClient client = ...) {
    tsStatus = client.dataPartitionTableIntegrityCheck();
} catch (ClientManagerException | TException e) {
    future.setException(e);                   // (1) sets exception
}
// falls through...
if (tsStatus.getCode() == TSStatusCode.SUCCESS_STATUS.getStatusCode()) {  // 0 != 200
    ...
} else {
    future.setException(new IoTDBException(tsStatus));  // (2) tries to set again — silently fails
}

When the RPC throws, future.setException(e) is called, then execution falls through to the else branch which tries to set a second exception. Guava's SettableFuture silently ignores the second call (returns false), so the first real exception is preserved — but the intent is unclear and fragile.

Suggestion: Add a return future; after future.setException(e) in the catch block to make the flow explicit.

2. Behavioral change: synchronous → fire-and-forget (Design — please confirm intentional)

Before (ProcedureManager):

executor.submitProcedure(procedure);
return waitingProcedureFinished(procedure, 86400000);  // blocks up to 24h

After (PartitionManager):

executor.submitProcedure(procedure);
return new TSStatus(TSStatusCode.SUCCESS_STATUS.getStatusCode());  // returns immediately

The client gets an instant SUCCESS regardless of whether the repair actually succeeds. This is reasonable for an admin trigger, but there's no way for the user to track whether the repair completed or failed. Consider guiding users to check procedure status (e.g., SHOW PROCEDURES) after triggering.

3. Missing table-model support (Bug)

The TableConfigTaskVisitor handles the sibling commands StartRepairData / StopRepairData, but the PR does not add a handler for RepairDataPartitionTable. Users in table-model SQL dialect won't be able to run this command.

This likely requires:

  • A relational AST node in org.apache.iotdb.db.queryengine.plan.relational.sql.ast
  • A handler in TableConfigTaskVisitor
  • Possibly relational grammar additions

4. Naming convention mismatch (Style)

The new statement class is RepairDataPartitionTable, but sibling classes use the ...Statement suffix:

  • StartRepairDataStatement
  • StopRepairDataStatement
  • MergeStatement, FlushStatement, etc.

Should be renamed to RepairDataPartitionTableStatement for consistency.

5. Unused field in RepairDataPartitionTableTask (Minor)

private final RepairDataPartitionTable repairDataPartitionTable;

This field is stored in the constructor but never read. The execute() method calls configTaskExecutor.repairDataPartitionTable() with no arguments. Either remove the field or pass the statement to the executor if parameters are planned.

6. Synchronized block — concurrent submission concern

In PartitionManager.dataPartitionTableIntegrityCheck(), synchronizing on this only serializes submission. If two users run REPAIR DATA PARTITION TABLE simultaneously, two procedures will be submitted. Please verify that DataPartitionTableIntegrityCheckProcedure is safe for concurrent instances, or add duplicate-submission prevention.

Copy link
Copy Markdown
Contributor

@CRZbulabula CRZbulabula left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL. And fix both 5. and 6. of the first comment.

Comment on lines +1688 to +1690
return checkGlobalAuth(
context.setAuditLogOperation(AuditLogOperation.CONTROL), PrivilegeType.SYSTEM, () -> "");
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You cannot employ checkGlobalAuth since the repair data partition action is not an "objectAuthentication" event.

Comment on lines -325 to -340
long partitionTableRecoverWaitAllDnUpTimeoutInMs =
Long.parseLong(
properties.getProperty(
"partition_table_recover_wait_all_dn_up_timeout_ms",
String.valueOf(conf.getPartitionTableRecoverWaitAllDnUpTimeoutInMs())));
if (partitionTableRecoverWaitAllDnUpTimeoutInMs <= 0) {
LOGGER.warn(
"partition_table_recover_wait_all_dn_up_timeout_ms should be greater than 0, "
+ "but current value is {}, ignore that and use the default value {}",
partitionTableRecoverWaitAllDnUpTimeoutInMs,
conf.getPartitionTableRecoverWaitAllDnUpTimeoutInMs());
partitionTableRecoverWaitAllDnUpTimeoutInMs =
conf.getPartitionTableRecoverWaitAllDnUpTimeoutInMs();
}
conf.setPartitionTableRecoverWaitAllDnUpTimeoutInMs(
partitionTableRecoverWaitAllDnUpTimeoutInMs);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function has no configurable parameters anymore?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants