gimamdsmi: add smi_session RAII helper for on-demand amdsmi lifecycle by spraveenio · Pull Request #67 · ROCm/gpu-agent

spraveenio · 2026-05-28T13:19:18Z

Summary

Introduces smi_session RAII class that wraps amdsmi_init/amdsmi_shut_down with a recursive_mutex + depth counter for safe nested use on the same thread
Adds AGA_SMI_SESSION_GUARD() macro to all public smi_api.cc entry points — opens a per-request session in lazy mode, no-op in persistent mode
AGA_SMI_LAZY_INIT=1 env var selects on-demand init/shutdown (releases /dev/gim-smi0 between calls); unset keeps the existing persistent behavior
Startup log always prints which mode is active
gimamdsmi/smi_state.cc gets a proper teardown() matching the amdsmi backend — stops watcher before amdsmi_shut_down() in persistent mode, no-op in lazy mode
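As a rough illustration of the summary above, here is a minimal sketch of a guard that is a no-op in persistent mode and, in lazy mode, initializes on the outermost entry and shuts down on the outermost exit. It assumes the recursive_mutex + depth counter design named in the summary. The names `nested_session`, `DEMO_SESSION_GUARD`, `fake_smi_init`, `fake_smi_shutdown`, `lazy_mode_enabled`, `inner_call`, and `outer_call` are hypothetical stand-ins, not the PR's actual code, and the real amdsmi calls are replaced by counters so the sketch compiles on its own.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <mutex>

// Hypothetical stand-ins for amdsmi_init / amdsmi_shut_down; they only
// count calls so the sketch builds without the real library.
static int g_init_calls = 0;
static int g_shutdown_calls = 0;
inline void fake_smi_init() { ++g_init_calls; }
inline void fake_smi_shutdown() { ++g_shutdown_calls; }

// Assumption: only the exact value "1" in AGA_SMI_LAZY_INIT selects lazy mode.
inline bool lazy_mode_enabled() {
    const char *value = std::getenv("AGA_SMI_LAZY_INIT");
    return value != nullptr && std::strcmp(value, "1") == 0;
}

class nested_session {
public:
    explicit nested_session(bool lazy) : lazy_(lazy) {
        if (!lazy_) return;                   // persistent mode: no-op
        mutex().lock();                       // same thread may re-enter
        if (depth()++ == 0) fake_smi_init();  // outermost guard initializes
    }
    ~nested_session() {
        if (!lazy_) return;
        assert(depth() > 0);
        if (--depth() == 0) fake_smi_shutdown();  // outermost guard shuts down
        mutex().unlock();
    }
    nested_session(const nested_session &) = delete;
    nested_session &operator=(const nested_session &) = delete;

private:
    static std::recursive_mutex &mutex() {
        static std::recursive_mutex m;
        return m;
    }
    static int &depth() {  // only touched while mutex() is held
        static int d = 0;
        return d;
    }
    bool lazy_;
};

// Hypothetical analogue of a per-entry-point guard macro.
#define DEMO_SESSION_GUARD(lazy) nested_session demo_guard_((lazy))

// Two entry points where one calls the other, to exercise nesting.
inline int inner_call(bool lazy) {
    DEMO_SESSION_GUARD(lazy);
    return 1;
}
inline int outer_call(bool lazy) {
    DEMO_SESSION_GUARD(lazy);
    return inner_call(lazy) + 1;
}
```

In this sketch, `outer_call(true)` opens two nested guards on one thread but triggers a single init and a single shutdown, while `outer_call(false)` touches neither counter. Because the recursive mutex is held for the whole session, sessions on different threads are serialized in this variant.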

Test results

Scenario	Before	After
`AGA_SMI_LAZY_INIT=1`, correct devices mounted	gpuagent stuck on futex, sock never created, logs 0 bytes	Daemon starts, sock created, exporter connects
Persistent mode (default)	Unchanged	Unchanged

Test Summary

Environment: RHEL9 SR-IOV/GIM host, 1 GPU (hypervisor passthrough mode)
Image: device-metrics-exporter-sriov:latest built from Dockerfile.sriov.exporter-release
gpuagent binary: gpuagent_gim from feature/gimamdsmi-smi-session @ 075ca40 (smi_session RAII helper for on-demand amdsmi lifecycle)

Check	Result
Container startup	Clean — no assertion/abort, restarts=0
`AGA_SMI_LAZY_INIT=1` env var	Confirmed inside container
`/var/run/gpuagent.sock`	Present; IPC messages flowing in `gpu-agent.log`
Metric lines per scrape	298 (`gpu_` / `pcie_` with `deployment_mode="hypervisor"`, `gpu_partition_id` labels)
Sequential load (30 req, 1 req/s)	30/30 ok, 0 fail
Parallel load (40 req, 4 concurrent × 10)	40/40 ok, 0 fail

Files changed

Only gimamdsmi/ backend files and shared headers are touched — amdsmi, rocmsmi, main.cc, init.cc, and all common SMI plumbing are unmodified.

gimamdsmi/smi_session.hpp / smi_session.cc — new RAII session class
gimamdsmi/smi_api.cc — AGA_SMI_SESSION_GUARD() on all public entry points
gimamdsmi/smi_state.cc — lazy init logic, startup mode log, teardown() implementation
gimamdsmi/amd_smi_mock_impl.cc — init/shutdown counters for test assertions
smi_state.hpp — lazy_init_ member + accessor, all existing fields preserved
smi_api_mock_impl.hpp — counter APIs for test assertions

🤖 Generated with Claude Code

Moves the gim amdsmi backend from a persistent (init-once-at-startup, never-shut-down) model to an optional per-request session model that releases /dev/gim-smi0 between calls, freeing the device for other processes (e.g. the GIM daemon). Behavior is controlled by AGA_SMI_LAZY_INIT=1; when unset the existing persistent behavior is preserved. The active mode is logged at startup.

Concurrency

smi_session uses std::mutex + refcount. The lock is held ONLY around refcount transitions and the amdsmi_init / amdsmi_shut_down calls themselves. While sessions are open, amdsmi API calls run UNLOCKED so concurrent gRPC handlers (up to AGA_MAX_GRPC_THREADS = 256) and the watcher thread all run in parallel on the same refcount-gated init. This is the same shared-init concurrency model used by persistent mode today.

Invariants (under init_mutex_):

refcount_() == 0 <=> amdsmi is NOT initialized
refcount_() > 0 <=> amdsmi IS initialized; exactly ONE successful amdsmi_init has occurred without a matching amdsmi_shut_down
amdsmi_init runs ONLY on the 0 -> 1 transition
amdsmi_shut_down runs ONLY on the 1 -> 0 transition

Handle lifecycle

amdsmi handles can change across init/shutdown cycles, so cached handles are unsafe in lazy mode. Public smi_api entry points take a new const aga_obj_key_t *gpu_key trailing parameter; the gim AGA_SMI_SESSION_GUARD(gpu_key, handle) macro opens the session and re-resolves a local gpu_handle via amdsmi_get_processor_handle_from_uuid(gpu_key->str(), ...) inside the per-request session. The watcher walks gpu_db() per tick and resolves each entry's handle the same way - no cached UUID/handle state. The amdsmi backend takes the gpu_key for signature parity and ignores it (handles are stable there).

smi_state::init() is restructured into a clean if/else: persistent mode does init+discover only; lazy mode does init+discover+shutdown.

UUID->string conversion reuses the canonical aga_obj_key_t::str() helper from api/include/base.hpp - no new helper added.
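The refcount-gated lifecycle described in the commit message can be sketched roughly as follows. This is an illustration under stated assumptions rather than the PR's implementation: `refcounted_session`, `lifecycle_stats`, and `run_workers` are hypothetical names, the real amdsmi_init / amdsmi_shut_down calls are replaced by counters, and init is assumed to always succeed (the failure path is not modeled). The point it shows is that the mutex is taken only around refcount transitions, so work inside an open session runs without the lock.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Counters standing in for the real init/shutdown calls (hypothetical).
struct lifecycle_stats {
    int refcount = 0;
    int init_calls = 0;
    int shutdown_calls = 0;
};

class refcounted_session {
public:
    refcounted_session() {
        std::lock_guard<std::mutex> lock(init_mutex());
        if (stats().refcount == 0) {
            ++stats().init_calls;      // 0 -> 1: init would run here
        }
        ++stats().refcount;
    }                                  // lock released; session stays open
    ~refcounted_session() {
        std::lock_guard<std::mutex> lock(init_mutex());
        assert(stats().refcount > 0);
        if (--stats().refcount == 0) {
            ++stats().shutdown_calls;  // 1 -> 0: shutdown would run here
        }
    }
    refcounted_session(const refcounted_session &) = delete;
    refcounted_session &operator=(const refcounted_session &) = delete;

    // Copies the counters under the lock so callers see a consistent view.
    static lifecycle_stats snapshot() {
        std::lock_guard<std::mutex> lock(init_mutex());
        return stats();
    }

private:
    static std::mutex &init_mutex() {
        static std::mutex m;
        return m;
    }
    static lifecycle_stats &stats() {
        static lifecycle_stats s;
        return s;
    }
};

// Spawns `threads` workers that each open `iters` sessions and returns
// the number of completed units of work.
inline long run_workers(int threads, int iters) {
    std::atomic<long> work{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&work, iters] {
            for (int i = 0; i < iters; ++i) {
                refcounted_session session;  // lock held only in ctor/dtor
                work.fetch_add(1);           // "API call" runs unlocked
            }
        });
    }
    for (auto &worker : pool) worker.join();
    return work.load();
}
```

After `run_workers` returns, the refcount is back to zero and the init and shutdown counters match; how many init/shutdown pairs occur depends on how the worker sessions happen to overlap. Since the lock is not held while a session is open, nested sessions on one thread do not deadlock even though the mutex is not recursive.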
Test results (amd@10.30.60.190 device-metrics-exporter):

AGA_SMI_LAZY_INIT=1: daemon starts cleanly, startup log shows lazy mode, gpuagent.sock is created, exporter connects, /dev/gim-smi0 fd is released between calls.
Persistent mode (default): unchanged behavior, fd held for life.
amdsmi backend untouched in behavior; new gpu_key param is unused there.

Co-Authored-By: Claude Sonnet 4 (1M context) <noreply@anthropic.com>

spraveenio requested review from rsrikanth86 and sarat-k May 28, 2026 13:40

rsrikanth86 reviewed Jun 1, 2026

Comment thread sw/nic/gpuagent/api/smi/gimamdsmi/smi_api.cc Outdated

rsrikanth86 reviewed Jun 2, 2026

Comment thread sw/nic/gpuagent/api/smi/gimamdsmi/amd_smi_mock_impl.cc Outdated

rsrikanth86 reviewed Jun 2, 2026

Comment thread sw/nic/gpuagent/api/smi/gimamdsmi/smi_session.cc Outdated

rsrikanth86 reviewed Jun 2, 2026

Comment thread sw/nic/gpuagent/api/smi/gimamdsmi/smi_state.cc Outdated

rsrikanth86 reviewed Jun 2, 2026

Comment thread sw/nic/gpuagent/api/smi/gimamdsmi/smi_state.cc Outdated

spraveenio force-pushed the feature/gimamdsmi-smi-session branch from a9657c6 to 565fabb on June 2, 2026 22:28

rsrikanth86 reviewed Jun 3, 2026

Comment thread sw/nic/gpuagent/api/smi/amdsmi/smi_api.cc Outdated

rsrikanth86 reviewed Jun 3, 2026

Comment thread sw/nic/gpuagent/api/smi/gimamdsmi/smi_api.cc Outdated

rsrikanth86 reviewed Jun 3, 2026

Comment thread sw/nic/gpuagent/api/smi/gimamdsmi/smi_api.cc Outdated

rsrikanth86 reviewed Jun 3, 2026

Comment thread sw/nic/gpuagent/api/smi/gimamdsmi/smi_session.cc

spraveenio force-pushed the feature/gimamdsmi-smi-session branch from 565fabb to cdf7990 on June 3, 2026 12:49

rsrikanth86 reviewed Jun 4, 2026

Comment thread sw/nic/gpuagent/api/smi/gimamdsmi/smi_state.cc

rsrikanth86 reviewed Jun 4, 2026

Comment thread sw/nic/gpuagent/api/smi/gimamdsmi/smi_utils.hpp Outdated

rsrikanth86 reviewed Jun 4, 2026

Comment thread sw/nic/gpuagent/api/gpu_api.cc Outdated

spraveenio force-pushed the feature/gimamdsmi-smi-session branch 2 times, most recently from 075ca40 to b19cd57 on June 4, 2026 18:27

spraveenio force-pushed the feature/gimamdsmi-smi-session branch from c149de9 to f597c92 on June 4, 2026 20:55

spraveenio wants to merge 1 commit into ROCm:main from spraveenio:feature/gimamdsmi-smi-session
