fix(pt_expt): let deepmd.pt import errors propagate in comm op check #5474

Open
wanghan-iapcm wants to merge 3 commits into deepmodeling:master from wanghan-iapcm:fix-pt-expt-comm-import-error

wanghan-iapcm (Collaborator) commented May 28, 2026

Summary

_check_underlying_ops_loaded() in deepmd/pt_expt/utils/comm.py wraps import deepmd.pt in a blanket except Exception: pass, then falls through to a generic RuntimeError telling the user to build libdeepmd_op_pt.so.

The problem: when the .so is built but loaded against a mismatched torch version, import deepmd.pt raises an ImportError with diagnostic detail (e.g. undefined symbol: ...) — exactly the message the user needs. The current code hides it and tells them to rebuild a library that's already built.

This PR removes the try/except and lets the import error propagate. The downstream RuntimeError still fires for the case where the import succeeds but the ops still aren't registered.
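The masking described above can be reproduced without deepmd or torch. What follows is a minimal sketch, not the code in comm.py: `load_native_ops` is a hypothetical stand-in for `import deepmd.pt`, `ops_registered` is a stand-in for the registration check, the error text is a placeholder, and `describe` is a helper added only for the demonstration.

```python
GENERIC_MESSAGE = "ops not registered; build libdeepmd_op_pt.so"  # paraphrased text


def load_native_ops():
    """Hypothetical stand-in for `import deepmd.pt` hitting an ABI mismatch."""
    raise ImportError("libdeepmd_op_pt.so: undefined symbol: <elided>")


def ops_registered():
    """Stand-in for the registration check; the ops never load in this sketch."""
    return False


def check_masked():
    """Old shape: the blanket except discards the import error."""
    try:
        load_native_ops()
    except Exception:
        pass
    if not ops_registered():
        raise RuntimeError(GENERIC_MESSAGE)


def check_propagating():
    """Shape proposed in this summary: no try/except around the import."""
    load_native_ops()
    if not ops_registered():
        raise RuntimeError(GENERIC_MESSAGE)


def describe(check):
    """Return (exception type name, message) raised by `check`, or None."""
    try:
        check()
    except Exception as exc:
        return type(exc).__name__, str(exc)
    return None


if __name__ == "__main__":
    print(describe(check_masked))       # generic RuntimeError only
    print(describe(check_propagating))  # ImportError carrying the linker detail
```

In the masked variant the caller only ever sees the generic message, which is the misleading "rebuild" advice this PR is about.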

Trade-off

External callers that previously caught RuntimeError from comm.py import will now see the raw ImportError for the .so-mismatch case. No in-tree caller does this. The diagnostic gain outweighs the contract change.
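An external caller that wants to stay robust across this contract change could catch both exception types. The sketch below is an assumption about how such a caller might look, not code from this repository: the two `import_comm_*` functions are hypothetical stand-ins for importing comm.py under the two failure modes, and `comm_available` is an illustrative helper.

```python
def import_comm_abi_mismatch():
    """Hypothetical stand-in: the .so exists but fails to load against torch."""
    raise ImportError("undefined symbol: <elided>")


def import_comm_ops_missing():
    """Hypothetical stand-in: the import succeeded but ops are not registered."""
    raise RuntimeError("ops not registered; build libdeepmd_op_pt.so")


def import_comm_ok():
    """Hypothetical stand-in for the happy path."""
    return None


def comm_available(importer):
    """Return (True, None) on success, else (False, diagnostic string)."""
    try:
        importer()
    except (ImportError, RuntimeError) as exc:
        # Keep the exception type in the message so the two cases stay distinguishable.
        return False, f"{type(exc).__name__}: {exc}"
    return True, None


if __name__ == "__main__":
    for importer in (import_comm_ok, import_comm_abi_mismatch, import_comm_ops_missing):
        print(importer.__name__, comm_available(importer))
```

Catching the tuple `(ImportError, RuntimeError)` keeps older call sites working while still surfacing the more specific diagnostic.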

Test plan

  • Existing pt_expt tests (every consumer imports comm.py) — happy path unchanged
  • CI green

Summary by CodeRabbit

  • Bug Fixes
    • Improved initialization error reporting: when native registrations fail to load, the underlying import/ABI error is now preserved and surfaced instead of being masked by a generic message, making root causes clearer for troubleshooting.

The blanket except in _check_underlying_ops_loaded swallowed ABI /
torch-version mismatches against libdeepmd_op_pt.so (e.g. 'undefined
symbol' ImportError), leaving callers with only the generic 'build
the .so' RuntimeError that misleads users into rebuilding an already-
built library.
dosubot (bot) added the "breaking change" (Breaking changes that should notify users.) and "bug" labels May 28, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0a1dc5440e

Comment thread deepmd/pt_expt/utils/comm.py Outdated
deepmd/pt/__init__.py loads cxx_op (registers deepmd_export ops) before
running load_entry_point('deepmd.pt'). A broken third-party entry point
makes the import raise after the ops were already registered, so the
previous unconditional propagation skipped the fake/autograd registrations
even when the underlying ops were present.

Catch the import error, re-check registration, and only re-raise when the
ops are still missing — preserving the diagnostic detail (e.g. ABI
'undefined symbol') for the genuine .so-load-failure path.
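The catch, re-check, re-raise flow from this commit message can be sketched with a simulated op registry. This is an illustration that follows the wording of the message, not the code in comm.py: `OP_NAME`, `make_importer`, `check_underlying_ops_loaded` as written here, and `outcome` are all hypothetical, and the importer argument stands in for `import deepmd.pt`.

```python
_REGISTERED = set()
OP_NAME = "example_op"  # hypothetical op name, for illustration only


def ops_registered():
    """Stand-in for checking that the underlying ops are registered."""
    return OP_NAME in _REGISTERED


def make_importer(register_ops, fail_with=None):
    """Build a fake import: optionally register ops first, then optionally fail."""
    def importer():
        if register_ops:
            _REGISTERED.add(OP_NAME)  # mimics cxx_op loading before entry points
        if fail_with is not None:
            raise fail_with           # mimics a failure later in the import
    return importer


def check_underlying_ops_loaded(importer):
    if ops_registered():
        return
    import_err = None
    try:
        importer()
    except Exception as exc:  # noqa: BLE001 - deliberate broad catch
        import_err = exc
    if ops_registered():
        return  # ops are present; a later import failure is tolerated
    if import_err is not None:
        raise import_err  # genuine load failure: keep the diagnostic detail
    raise RuntimeError("ops not registered; build libdeepmd_op_pt.so")


def outcome(importer):
    """Reset the registry, run the check, and report 'ok' or the error type."""
    _REGISTERED.clear()
    try:
        check_underlying_ops_loaded(importer)
    except Exception as exc:
        return type(exc).__name__
    return "ok"


if __name__ == "__main__":
    print(outcome(make_importer(True, ImportError("broken entry point"))))
    print(outcome(make_importer(False, ImportError("undefined symbol: <elided>"))))
    print(outcome(make_importer(False)))
```

The three calls cover the cases in the message: ops registered before a broken entry point raises, a genuine load failure, and an import that succeeds without registering anything.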

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

coderabbitai Bot commented May 28, 2026

No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between add14af and 3d1ff02.

📒 Files selected for processing (1)
  • deepmd/pt_expt/utils/comm.py

📝 Walkthrough

Refactors _check_underlying_ops_loaded() to centralize op-registration checks, attempt import deepmd.pt only when registrations are missing, capture any import error, and re-raise that import error if the ops remain unregistered.

Changes

Import Error Diagnostics

| Layer / File(s) | Summary |
| --- | --- |
| Ops registration and import chaining (deepmd/pt_expt/utils/comm.py) | Adds a local _ops_registered() helper, attempts import deepmd.pt only when registrations are missing, captures any import exception, re-checks registrations, and re-raises the captured deepmd.pt import error if registrations remain absent; otherwise falls back to the existing generic RuntimeError. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and clearly describes the main change: allowing deepmd.pt import errors to propagate in the comm operation check function, which is the core fix in this PR. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@wanghan-iapcm wanghan-iapcm requested a review from njzjz May 28, 2026 14:29

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deepmd/pt_expt/utils/comm.py`:
- Around line 64-67: The broad-except in the import block around the `import
deepmd.pt` statement triggers Ruff BLE001; suppress it by adding an explicit
noqa on the except clause (e.g., append "# noqa: BLE001" to the `except
Exception as exc:` line) so the linter ignores this deliberate broad catch while
keeping the existing explanatory comment about cxx_op registration intact.

📥 Commits

Reviewing files that changed from the base of the PR and between 0a1dc54 and add14af.

📒 Files selected for processing (1)
  • deepmd/pt_expt/utils/comm.py

Comment thread deepmd/pt_expt/utils/comm.py

codecov Bot commented May 28, 2026

Codecov Report

❌ Patch coverage is 55.55556% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.25%. Comparing base (2087416) to head (3d1ff02).
⚠️ Report is 1 commit behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| deepmd/pt_expt/utils/comm.py | 55.55% | 4 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5474      +/-   ##
==========================================
- Coverage   82.25%   82.25%   -0.01%     
==========================================
  Files         833      833              
  Lines       89094    89099       +5     
  Branches     4225     4225              
==========================================
+ Hits        73288    73290       +2     
- Misses      14515    14518       +3     
  Partials     1291     1291              

@njzjz-bot njzjz-bot left a comment

Thanks for tightening this path — the re-check after import deepmd.pt is the right shape for the entry-point failure case.

One thing looks inconsistent with the PR description and commit message: when import deepmd.pt fails and the ops are still missing, the current code still raises the generic RuntimeError, only chaining the original exception via raise ... from import_err. That preserves the details in a full traceback, but callers that surface str(exc) or exception type will still see the same generic “Build libdeepmd_op_pt.so” message, not the raw ABI/import diagnostic. The PR body also says external callers will now see the raw ImportError for the .so-mismatch case.

I’d suggest making that branch actually re-raise the import error, and reserve the generic RuntimeError for the “import succeeded but ops are still not registered” case, e.g.:

    if not _ops_registered():
        if import_err is not None:
            raise import_err
        raise RuntimeError(...)

That keeps the useful entry-point fallback while matching the stated diagnostic behavior.

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

The prior commit wrapped the captured import error in a generic
RuntimeError via 'raise ... from import_err'. Callers that look at the
exception type or str(exc) saw only the generic 'build the .so' message;
the diagnostic detail (e.g. 'undefined symbol' for a torch-version /
ABI mismatch against libdeepmd_op_pt.so) survived only in the chained
traceback.

Re-raise the original import error directly when ops are still missing;
reserve the generic RuntimeError for the case where 'import deepmd.pt'
succeeded but the ops still aren't registered.
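The difference between wrapping with `raise ... from import_err` and re-raising the captured error directly is visible in the exception type, `str(exc)`, and `__cause__`. The following is a minimal sketch under stated assumptions: `failing_import` is a hypothetical stand-in for `import deepmd.pt`, the messages are placeholders, and `summarize` is a helper added for the demonstration.

```python
GENERIC_MESSAGE = "ops not registered; build libdeepmd_op_pt.so"  # paraphrased text


def failing_import():
    """Hypothetical stand-in for `import deepmd.pt` failing on an ABI mismatch."""
    raise ImportError("undefined symbol: <elided>")


def check_chained():
    """Prior commit's shape: generic RuntimeError with the import error chained."""
    import_err = None
    try:
        failing_import()
    except Exception as exc:  # noqa: BLE001 - deliberate broad catch
        import_err = exc
    if import_err is not None:
        raise RuntimeError(GENERIC_MESSAGE) from import_err
    raise RuntimeError(GENERIC_MESSAGE)


def check_direct():
    """This commit's shape: re-raise the original import error itself."""
    import_err = None
    try:
        failing_import()
    except Exception as exc:  # noqa: BLE001 - deliberate broad catch
        import_err = exc
    if import_err is not None:
        raise import_err
    raise RuntimeError(GENERIC_MESSAGE)


def summarize(check):
    """Return (type name, message, cause type name or None) for the raised error."""
    try:
        check()
    except Exception as exc:
        cause = type(exc.__cause__).__name__ if exc.__cause__ is not None else None
        return type(exc).__name__, str(exc), cause
    return None


if __name__ == "__main__":
    print(summarize(check_chained))
    print(summarize(check_direct))
```

With chaining, the linker detail is reachable only through `__cause__` or a full traceback; with the direct re-raise, it is the exception the caller receives.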

@njzjz-bot njzjz-bot left a comment

Re-reviewed the latest update. The new branch now re-raises the captured import error directly when import deepmd.pt fails and the ops are still missing, while preserving the entry-point-failure fallback when ops were registered before a later import failure. That matches the intended diagnostic behavior.

CI is green aside from expected skipped jobs; I don't see further blockers.

— OpenClaw 2026.5.12 (f066dd2) (model: custom-chat-jinzhezeng-group/gpt-5.5)

@njzjz njzjz added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026

Labels

breaking change (Breaking changes that should notify users.), bug, Python

3 participants