fix(pt_expt): let deepmd.pt import errors propagate in comm op check#5474
fix(pt_expt): let deepmd.pt import errors propagate in comm op check#5474wanghan-iapcm wants to merge 3 commits into
Conversation
The blanket except in _check_underlying_ops_loaded swallowed ABI / torch-version mismatches against libdeepmd_op_pt.so (e.g. 'undefined symbol' ImportError), leaving callers with only the generic 'build the .so' RuntimeError that misleads users into rebuilding an already- built library.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0a1dc5440e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
deepmd/pt/__init__.py loads cxx_op (registers deepmd_export ops) before
running load_entry_point('deepmd.pt'). A broken third-party entry point
makes the import raise after the ops were already registered, so the
previous unconditional propagation skipped the fake/autograd registrations
even when the underlying ops were present.
Catch the import error, re-check registration, and only re-raise when the
ops are still missing — preserving the diagnostic detail (e.g. ABI
'undefined symbol') for the genuine .so-load-failure path.
|
Actionable comments posted: 0 |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughRefactors _check_underlying_ops_loaded() to centralize op-registration checks, attempt ChangesImport Error Diagnostics
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@deepmd/pt_expt/utils/comm.py`:
- Around line 64-67: The broad-except in the import block around the `import
deepmd.pt` statement triggers Ruff BLE001; suppress it by adding an explicit
noqa on the except clause (e.g., append "# noqa: BLE001" to the `except
Exception as exc:` line) so the linter ignores this deliberate broad catch while
keeping the existing explanatory comment about cxx_op registration intact.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 87245f99-dd41-4ca8-825f-550b643b2244
📒 Files selected for processing (1)
deepmd/pt_expt/utils/comm.py
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5474 +/- ##
==========================================
- Coverage 82.25% 82.25% -0.01%
==========================================
Files 833 833
Lines 89094 89099 +5
Branches 4225 4225
==========================================
+ Hits 73288 73290 +2
- Misses 14515 14518 +3
Partials 1291 1291 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
njzjz-bot
left a comment
There was a problem hiding this comment.
Thanks for tightening this path — the re-check after import deepmd.pt is the right shape for the entry-point failure case.
One thing looks inconsistent with the PR description and commit message: when import deepmd.pt fails and the ops are still missing, the current code still raises the generic RuntimeError, only chaining the original exception via raise ... from import_err. That preserves the details in a full traceback, but callers that surface str(exc) or exception type will still see the same generic “Build libdeepmd_op_pt.so” message, not the raw ABI/import diagnostic. The PR body also says external callers will now see the raw ImportError for the .so-mismatch case.
I’d suggest making that branch actually re-raise the import error, and reserve the generic RuntimeError for the “import succeeded but ops are still not registered” case, e.g.:
if not _ops_registered():
if import_err is not None:
raise import_err
raise RuntimeError(...)That keeps the useful entry-point fallback while matching the stated diagnostic behavior.
Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)
The prior commit wrapped the captured import error in a generic RuntimeError via 'raise ... from import_err'. Callers that look at the exception type or str(exc) saw only the generic 'build the .so' message; the diagnostic detail (e.g. 'undefined symbol' for a torch-version / ABI mismatch against libdeepmd_op_pt.so) survived only in the chained traceback. Re-raise the original import error directly when ops are still missing; reserve the generic RuntimeError for the case where 'import deepmd.pt' succeeded but the ops still aren't registered.
njzjz-bot
left a comment
There was a problem hiding this comment.
Re-reviewed the latest update. The new branch now re-raises the captured import error directly when import deepmd.pt fails and the ops are still missing, while preserving the entry-point-failure fallback when ops were registered before a later import failure. That matches the intended diagnostic behavior.
CI is green aside from expected skipped jobs; I don't see further blockers.
— OpenClaw 2026.5.12 (f066dd2) (model: custom-chat-jinzhezeng-group/gpt-5.5)
Summary
_check_underlying_ops_loaded()indeepmd/pt_expt/utils/comm.pywrapsimport deepmd.ptin a blanketexcept Exception: pass, then falls through to a genericRuntimeErrortelling the user to buildlibdeepmd_op_pt.so.The problem: when the .so is built but loaded against a mismatched torch version,
import deepmd.ptraises anImportErrorwith diagnostic detail (e.g.undefined symbol: ...) — exactly the message the user needs. The current code hides it and tells them to rebuild a library that's already built.This PR removes the
try/exceptand lets the import error propagate. The downstreamRuntimeErrorstill fires for the case where the import succeeds but the ops still aren't registered.Trade-off
External callers that previously caught
RuntimeErrorfromcomm.pyimport will now see the rawImportErrorfor the .so-mismatch case. No in-tree caller does this. The diagnostic gain outweighs the contract change.Test plan
comm.py) — happy path unchangedSummary by CodeRabbit