
HDFS-17906. (3.4) Fix issue that DataNodes get stuck in infinite loop when meet InvalidBlockReportLeaseException. #8441

Open
cxzl25 wants to merge 1 commit into apache:branch-3.4 from cxzl25:HDFS-17906_34


@cxzl25 (Contributor) commented Apr 17, 2026

Backport #8416 to branch-3.4

Description of PR

#5460 (comment)

HDFS-16942 introduced InvalidBlockReportLeaseException, which the NameNode now throws back to the DataNode via RPC when a block report is rejected due to an invalid lease. On a DataNode that also includes HDFS-16942, the exception is caught and fullBlockReportLeaseId is reset to 0, allowing the DN to request a new lease on the next heartbeat and retry.
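The DataNode-side handling described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual BPServiceActor code: `RemoteFailure`, `INVALID_LEASE`, and `leaseIdAfterFailure` are hypothetical stand-ins, and the sketch assumes the RPC layer exposes the server-side exception class name as a string.

```java
public class LeaseResetSketch {
    /** Hypothetical stand-in for an RPC exception that carries the remote exception class name. */
    static class RemoteFailure extends Exception {
        final String remoteClassName;

        RemoteFailure(String remoteClassName) {
            super(remoteClassName);
            this.remoteClassName = remoteClassName;
        }
    }

    // Placeholder label for the remote exception class (assumption: matched by name).
    static final String INVALID_LEASE = "InvalidBlockReportLeaseException";

    /** Returns the lease ID the DataNode should hold after a failed block report. */
    static long leaseIdAfterFailure(long currentLeaseId, RemoteFailure failure) {
        if (INVALID_LEASE.equals(failure.remoteClassName)) {
            // Forget the rejected lease so the next heartbeat requests a new one.
            return 0L;
        }
        // In this sketch, any other remote error leaves the lease ID untouched.
        return currentLeaseId;
    }

    public static void main(String[] args) {
        long afterInvalidLease = leaseIdAfterFailure(42L, new RemoteFailure(INVALID_LEASE));
        long afterOtherError = leaseIdAfterFailure(42L, new RemoteFailure("java.io.IOException"));
        System.out.println(afterInvalidLease); // prints 0
        System.out.println(afterOtherError);   // prints 42
    }
}
```

The point of the reset is that a lease ID of 0 is what makes the DN ask for a fresh lease on the next heartbeat.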

However, during a rolling upgrade where the NameNode has been upgraded (with HDFS-16942) but DataNodes are still running an older version (without HDFS-16942), the old DataNode code does not have the InvalidBlockReportLeaseException handling branch in BPServiceActor.offerService(). This causes the DN to enter an infinite failure loop where it can never successfully send a full block report.

Root Cause

In BPServiceActor.offerService(), the logic works as follows:

  1. The DN requests a block report lease during heartbeat only when fullBlockReportLeaseId == 0:

         boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
                 scheduler.isBlockReportDue(startTime);

  2. After sending the block report, fullBlockReportLeaseId is reset to 0:

         if ((fullBlockReportLeaseId != 0) || forceFullBr) {
             cmds = blockReport(fullBlockReportLeaseId);
             fullBlockReportLeaseId = 0;  // not reached if blockReport() throws
         }

  3. When the upgraded NN throws InvalidBlockReportLeaseException, blockReport() propagates the exception. The fullBlockReportLeaseId = 0 line after the call is never executed.

  4. The exception is caught by the generic RemoteException catch block. The old DN code does not recognize InvalidBlockReportLeaseException, so fullBlockReportLeaseId remains set to the stale invalid value.

  5. On the next heartbeat iteration, because fullBlockReportLeaseId != 0, requestBlockReportLease is false, so the DN does not request a new lease. It then attempts blockReport() again with the same stale lease, which the NN rejects again. This repeats indefinitely.
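The steps above can be traced with a small simulation. This is a simplified sketch under stated assumptions, not Hadoop code: `InvalidLeaseRejection`, the lease ID constants, and `countRejections` are hypothetical, the scheduler check (`isBlockReportDue`) and `forceFullBr` are left out, and the NameNode is modeled as rejecting exactly one stale lease ID.

```java
public class StaleLeaseLoopSketch {
    /** Hypothetical stand-in for the NameNode rejecting a report that carries an invalid lease. */
    static class InvalidLeaseRejection extends RuntimeException {
    }

    static final long STALE_LEASE_ID = 7L; // arbitrary non-zero ID the NameNode no longer accepts
    static final long FRESH_LEASE_ID = 8L; // arbitrary ID handed out when a new lease is requested

    /** Models the NameNode side: only the stale lease is rejected. */
    static void blockReport(long leaseId) {
        if (leaseId == STALE_LEASE_ID) {
            throw new InvalidLeaseRejection();
        }
    }

    /**
     * Runs the given number of heartbeat rounds, starting with a stale lease,
     * and returns how many block reports were rejected.
     *
     * @param resetOnInvalidLease true models a DN that resets the lease ID on rejection,
     *                            false models an older DN without that branch
     */
    static int countRejections(boolean resetOnInvalidLease, int rounds) {
        long fullBlockReportLeaseId = STALE_LEASE_ID;
        int rejections = 0;
        for (int i = 0; i < rounds; i++) {
            // A new lease is requested only when no lease is currently held.
            boolean requestBlockReportLease = (fullBlockReportLeaseId == 0);
            if (requestBlockReportLease) {
                fullBlockReportLeaseId = FRESH_LEASE_ID;
            }
            try {
                if (fullBlockReportLeaseId != 0) {
                    blockReport(fullBlockReportLeaseId);
                    fullBlockReportLeaseId = 0; // not reached if blockReport() throws
                }
            } catch (InvalidLeaseRejection e) {
                rejections++;
                if (resetOnInvalidLease) {
                    fullBlockReportLeaseId = 0; // allows a new lease on the next round
                }
                // Without the reset, the stale ID survives into the next round.
            }
        }
        return rejections;
    }

    public static void main(String[] args) {
        System.out.println(countRejections(false, 5)); // prints 5: every round is rejected
        System.out.println(countRejections(true, 5));  // prints 1: only the first round is rejected
    }
}
```

In the sketch, the DN without the reset is rejected in every round it runs, which mirrors the indefinite retry described in the last step.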

How was this patch tested?

Added a test.

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

AI Tooling

If an AI tool was used:

…meet InvalidBlockReportLeaseException. (apache#8416). Contributed by dzcxzl

Signed-off-by: He Xiaoqiao <hexiaoqiao@apache.org>
# Conflicts:
#	hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockReportLease.java
@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|------|-----------|---------|---------|---------|
| +0 🆗 | reexec | 0m 0s | | Docker mode activated. |
| -1 ❌ | docker | 24m 43s | | Docker failed to build run-specific yetus/hadoop:tp-5388}. |

| Subsystem | Report/Notes |
|-----------|--------------|
| GITHUB PR | #8441 |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8441/1/console |
| versions | git=2.34.1 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

@pan3793 changed the title from "HDFS-17906. Fix issue that DataNodes get stuck in infinite loop when meet InvalidBlockReportLeaseException." to "HDFS-17906. (3.4) Fix issue that DataNodes get stuck in infinite loop when meet InvalidBlockReportLeaseException." on Apr 19, 2026