gimamdsmi: add smi_session RAII helper for on-demand amdsmi lifecycle #67
Open
spraveenio wants to merge 1 commit into
rsrikanth86 reviewed Jun 1, 2026
rsrikanth86 reviewed Jun 2, 2026 (4 reviews)
a9657c6 to 565fabb
rsrikanth86 reviewed Jun 3, 2026 (4 reviews)
565fabb to cdf7990
rsrikanth86 reviewed Jun 4, 2026 (3 reviews)
075ca40 to b19cd57
Moves the gim amdsmi backend from a persistent (init-once-at-startup,
never-shut-down) model to an optional per-request session model that
releases /dev/gim-smi0 between calls, freeing the device for other
processes (e.g. the GIM daemon).
Behavior is controlled by AGA_SMI_LAZY_INIT=1; when unset the existing
persistent behavior is preserved. The active mode is logged at startup.
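
A minimal sketch of how the mode switch could be read and reported, assuming the variable is treated as enabled only when it is exactly "1". The helper names (lazy_init_requested, lazy_init_from_env, smi_mode_name) are hypothetical and not taken from this change:

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: lazy mode is requested only when the value is exactly "1".
// An unset variable (nullptr) or any other value keeps persistent mode.
inline bool lazy_init_requested(const char *value) {
    return value != nullptr && std::strcmp(value, "1") == 0;
}

// Reads AGA_SMI_LAZY_INIT from the process environment.
inline bool lazy_init_from_env() {
    return lazy_init_requested(std::getenv("AGA_SMI_LAZY_INIT"));
}

// Text a startup log line could use to report the active mode (illustrative only).
inline const char *smi_mode_name(bool lazy) {
    return lazy ? "lazy (per-request session)" : "persistent";
}
```

Splitting the string check from the getenv call keeps the decision testable without touching the process environment.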
Concurrency
smi_session uses std::mutex + refcount. The lock is held ONLY around
refcount transitions and the amdsmi_init / amdsmi_shut_down calls
themselves. While sessions are open, amdsmi API calls run UNLOCKED so
concurrent gRPC handlers (up to AGA_MAX_GRPC_THREADS = 256) and the
watcher thread all run in parallel on the same refcount-gated init.
This is the same shared-init concurrency model used by persistent
mode today.
Invariants (under init_mutex_):
- refcount_() == 0 <=> amdsmi is NOT initialized
- refcount_() > 0 <=> amdsmi IS initialized; exactly ONE successful
  amdsmi_init has occurred without a matching amdsmi_shut_down
- amdsmi_init runs ONLY on the 0 -> 1 transition
- amdsmi_shut_down runs ONLY on the 1 -> 0 transition
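
The refcount-gated model described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the smi_session code from this change: fake_smi stands in for amdsmi_init / amdsmi_shut_down, and session_gate and scoped_session are hypothetical names.

```cpp
#include <cassert>
#include <mutex>

// Stand-in for the library (assumption): only counts init/shutdown calls.
struct fake_smi {
    int init_calls = 0;
    int shutdown_calls = 0;
    bool init() { ++init_calls; return true; }
    void shut_down() { ++shutdown_calls; }
};

// The mutex is held only around refcount transitions and the init/shutdown
// calls themselves; work done while a session is open runs unlocked.
class session_gate {
public:
    explicit session_gate(fake_smi &smi) : smi_(smi) {}

    // Initializes only on the 0 -> 1 transition. If init fails, the
    // refcount stays at 0, so "refcount == 0 <=> not initialized" holds.
    bool acquire() {
        std::lock_guard<std::mutex> lock(init_mutex_);
        if (refcount_ == 0 && !smi_.init()) return false;
        ++refcount_;
        return true;
    }

    // Shuts down only on the 1 -> 0 transition.
    void release() {
        std::lock_guard<std::mutex> lock(init_mutex_);
        assert(refcount_ > 0);
        if (--refcount_ == 0) smi_.shut_down();
    }

    int refcount() const {
        std::lock_guard<std::mutex> lock(init_mutex_);
        return refcount_;
    }

private:
    fake_smi &smi_;
    mutable std::mutex init_mutex_;
    int refcount_ = 0;
};

// RAII wrapper: acquires in the constructor, releases in the destructor
// only if the acquire succeeded.
class scoped_session {
public:
    explicit scoped_session(session_gate &gate)
        : gate_(gate), ok_(gate.acquire()) {}
    ~scoped_session() { if (ok_) gate_.release(); }
    scoped_session(const scoped_session &) = delete;
    scoped_session &operator=(const scoped_session &) = delete;
    bool ok() const { return ok_; }

private:
    session_gate &gate_;
    bool ok_;
};
```

With two overlapping scoped_session objects, the stand-in sees one init and, after both are destroyed, one shutdown.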
Handle lifecycle
amdsmi handles can change across init/shutdown cycles, so cached
handles are unsafe in lazy mode. Public smi_api entry points take a
new const aga_obj_key_t *gpu_key trailing parameter; the gim
AGA_SMI_SESSION_GUARD(gpu_key, handle) macro opens the session and
re-resolves a local gpu_handle via
amdsmi_get_processor_handle_from_uuid(gpu_key->str(), ...) inside the
per-request session. The watcher walks gpu_db() per tick and resolves
each entry's handle the same way - no cached UUID/handle state.
The amdsmi backend takes the gpu_key for signature parity and ignores
it (handles are stable there).
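
To illustrate why handles are re-resolved per request, here is a small sketch with a fake driver whose handles change on every init cycle. Everything in it (fake_driver, fake_handle, with_gpu_handle) is hypothetical; the actual change resolves handles through amdsmi_get_processor_handle_from_uuid inside the AGA_SMI_SESSION_GUARD macro.

```cpp
#include <map>
#include <string>

using fake_handle = int;

// Hypothetical driver: handles are valid only between init() and
// shut_down(), and their values change with each init cycle.
class fake_driver {
public:
    void add_gpu(const std::string &uuid) {
        const int next = static_cast<int>(index_.size());
        index_.emplace(uuid, next);
    }
    void init() { ++generation_; initialized_ = true; }
    void shut_down() { initialized_ = false; }

    // Mirrors a C-style lookup: returns false when the driver is not
    // initialized or the UUID is unknown.
    bool handle_from_uuid(const std::string &uuid, fake_handle *out) const {
        if (!initialized_) return false;
        const auto it = index_.find(uuid);
        if (it == index_.end()) return false;
        *out = generation_ * 100 + it->second;
        return true;
    }

private:
    std::map<std::string, int> index_;
    int generation_ = 0;
    bool initialized_ = false;
};

// Per-request pattern: open a session, resolve a local handle from the
// UUID, run the request, and close the session on every exit path.
template <typename Fn>
bool with_gpu_handle(fake_driver &drv, const std::string &uuid, Fn fn) {
    struct guard {
        fake_driver &d;
        explicit guard(fake_driver &drv_ref) : d(drv_ref) { d.init(); }
        ~guard() { d.shut_down(); }
    } session(drv);

    fake_handle handle = 0;
    if (!drv.handle_from_uuid(uuid, &handle)) return false;
    fn(handle);
    return true;
}
```

In this sketch two consecutive requests for the same UUID observe different handle values, which is the situation that makes a cached handle unsafe in lazy mode.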
smi_state::init() is restructured into a clean if/else: persistent mode
does init+discover only; lazy mode does init+discover+shutdown.
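
The shape of that if/else might look like the sketch below. It is an assumption-laden illustration: fake_smi_lib and state_sketch are hypothetical stand-ins, and discovery is reduced to returning a fixed list while the library is initialized.

```cpp
#include <string>
#include <vector>

// Hypothetical library stand-in that records init/shutdown calls.
struct fake_smi_lib {
    int init_calls = 0;
    int shutdown_calls = 0;
    bool initialized = false;

    void init() { ++init_calls; initialized = true; }
    void shut_down() { ++shutdown_calls; initialized = false; }

    // Discovery only works while the library is initialized.
    std::vector<std::string> discover() const {
        if (!initialized) return std::vector<std::string>();
        return std::vector<std::string>(1, "gpu-0");
    }
};

struct state_sketch {
    explicit state_sketch(bool lazy) : lazy_init(lazy) {}

    void init(fake_smi_lib &lib) {
        if (lazy_init) {
            // Lazy mode: init + discover + shutdown, so the device is
            // released again once startup discovery is done.
            lib.init();
            gpus = lib.discover();
            lib.shut_down();
        } else {
            // Persistent mode: init + discover only; the library stays
            // initialized for the life of the process.
            lib.init();
            gpus = lib.discover();
        }
    }

    bool lazy_init;
    std::vector<std::string> gpus;
};
```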
UUID->string conversion reuses the canonical aga_obj_key_t::str() helper
from api/include/base.hpp - no new helper added.
Test results (amd@10.30.60.190 device-metrics-exporter):
- AGA_SMI_LAZY_INIT=1: daemon starts cleanly, startup log shows lazy
  mode, gpuagent.sock is created, exporter connects, /dev/gim-smi0 fd
  is released between calls.
- Persistent mode (default): unchanged behavior, fd held for life.
- amdsmi backend untouched in behavior; new gpu_key param is unused
  there.
Co-Authored-By: Claude Sonnet 4 (1M context) <noreply@anthropic.com>
c149de9 to
f597c92
Summary
- Adds smi_session RAII class that wraps amdsmi_init / amdsmi_shut_down with a recursive_mutex + depth counter for safe nested use on the same thread
- Adds AGA_SMI_SESSION_GUARD() macro to all public smi_api.cc entry points; it opens a per-request session in lazy mode and is a no-op in persistent mode
- AGA_SMI_LAZY_INIT=1 env var selects on-demand init/shutdown (releases /dev/gim-smi0 between calls); unset keeps the existing persistent behavior
- gimamdsmi/smi_state.cc gets a proper teardown() matching the amdsmi backend: stops watcher before amdsmi_shut_down() in persistent mode, no-op in lazy mode
Test results
- AGA_SMI_LAZY_INIT=1, correct devices mounted
Test Summary
Environment: RHEL9 SR-IOV/GIM host, 1 GPU (hypervisor passthrough mode)
Image: device-metrics-exporter-sriov:latest built from Dockerfile.sriov.exporter-release
gpuagent binary: gpuagent_gim from feature/gimamdsmi-smi-session@075ca40 (smi_session RAII helper for on-demand amdsmi lifecycle)
- AGA_SMI_LAZY_INIT=1 env var
- /var/run/gpuagent.sock
- gpu-agent.log
- gpu_* / pcie_* with deployment_mode="hypervisor", gpu_partition_id labels
Files changed
Only gimamdsmi/ backend files and shared headers are touched; amdsmi, rocmsmi, main.cc, init.cc, and all common SMI plumbing are unmodified.
- gimamdsmi/smi_session.hpp / smi_session.cc: new RAII session class
- gimamdsmi/smi_api.cc: AGA_SMI_SESSION_GUARD() on all public entry points
- gimamdsmi/smi_state.cc: lazy init logic, startup mode log, teardown() implementation
- gimamdsmi/amd_smi_mock_impl.cc: init/shutdown counters for test assertions
- smi_state.hpp: lazy_init_ member + accessor, all existing fields preserved
- smi_api_mock_impl.hpp: counter APIs for test assertions
🤖 Generated with Claude Code