Skip to content

gimamdsmi: add smi_session RAII helper for on-demand amdsmi lifecycle#67

Open
spraveenio wants to merge 1 commit into
ROCm:mainfrom
spraveenio:feature/gimamdsmi-smi-session
Open

gimamdsmi: add smi_session RAII helper for on-demand amdsmi lifecycle#67
spraveenio wants to merge 1 commit into
ROCm:mainfrom
spraveenio:feature/gimamdsmi-smi-session

Conversation

@spraveenio
Copy link
Copy Markdown
Contributor

@spraveenio spraveenio commented May 28, 2026

Summary

  • Introduces smi_session RAII class that wraps amdsmi_init/amdsmi_shut_down with a recursive_mutex + depth counter for safe nested use on the same thread
  • Adds AGA_SMI_SESSION_GUARD() macro to all public smi_api.cc entry points — opens a per-request session in lazy mode, no-op in persistent mode
  • AGA_SMI_LAZY_INIT=1 env var selects on-demand init/shutdown (releases /dev/gim-smi0 between calls); unset keeps the existing persistent behavior
  • Startup log always prints which mode is active
  • gimamdsmi/smi_state.cc gets a proper teardown() matching the amdsmi backend — stops watcher before amdsmi_shut_down() in persistent mode, no-op in lazy mode

Test results

Scenario Before After
AGA_SMI_LAZY_INIT=1, correct devices mounted gpuagent stuck on futex, sock never created, logs 0 bytes Daemon starts, sock created, exporter connects
Persistent mode (default) Unchanged Unchanged

Test Summary

Environment: RHEL9 SR-IOV/GIM host, 1 GPU (hypervisor passthrough mode)
Image: device-metrics-exporter-sriov:latest built from Dockerfile.sriov.exporter-release
gpuagent binary: gpuagent_gim from feature/gimamdsmi-smi-session @ 075ca40 (smi_session RAII helper for on-demand amdsmi lifecycle)


Check Result
Container startup Clean — no assertion/abort, restarts=0
AGA_SMI_LAZY_INIT=1 env var Confirmed inside container
/var/run/gpuagent.sock Present; IPC messages flowing in gpu-agent.log
Metric lines per scrape 298 (gpu_* / pcie_* with deployment_mode="hypervisor", gpu_partition_id labels)
Sequential load (30 req, 1 req/s) 30/30 ok, 0 fail
Parallel load (40 req, 4 concurrent × 10) 40/40 ok, 0 fail

Files changed

Only gimamdsmi/ backend files and shared headers are touched — amdsmi, rocmsmi, main.cc, init.cc, and all common SMI plumbing are unmodified.

  • gimamdsmi/smi_session.hpp / smi_session.cc — new RAII session class
  • gimamdsmi/smi_api.ccAGA_SMI_SESSION_GUARD() on all public entry points
  • gimamdsmi/smi_state.cc — lazy init logic, startup mode log, teardown() implementation
  • gimamdsmi/amd_smi_mock_impl.cc — init/shutdown counters for test assertions
  • smi_state.hpplazy_init_ member + accessor, all existing fields preserved
  • smi_api_mock_impl.hpp — counter APIs for test assertions

🤖 Generated with Claude Code

@spraveenio spraveenio requested review from rsrikanth86 and sarat-k May 28, 2026 13:40
Comment thread sw/nic/gpuagent/api/smi/gimamdsmi/smi_api.cc Outdated
Comment thread sw/nic/gpuagent/api/smi/gimamdsmi/amd_smi_mock_impl.cc Outdated
Comment thread sw/nic/gpuagent/api/smi/gimamdsmi/smi_session.cc Outdated
Comment thread sw/nic/gpuagent/api/smi/gimamdsmi/smi_state.cc Outdated
Comment thread sw/nic/gpuagent/api/smi/gimamdsmi/smi_state.cc Outdated
@spraveenio spraveenio force-pushed the feature/gimamdsmi-smi-session branch from a9657c6 to 565fabb Compare June 2, 2026 22:28
Comment thread sw/nic/gpuagent/api/smi/amdsmi/smi_api.cc Outdated
Comment thread sw/nic/gpuagent/api/smi/gimamdsmi/smi_api.cc Outdated
Comment thread sw/nic/gpuagent/api/smi/gimamdsmi/smi_api.cc Outdated
Comment thread sw/nic/gpuagent/api/smi/gimamdsmi/smi_session.cc
@spraveenio spraveenio force-pushed the feature/gimamdsmi-smi-session branch from 565fabb to cdf7990 Compare June 3, 2026 12:49
Comment thread sw/nic/gpuagent/api/smi/gimamdsmi/smi_state.cc
Comment thread sw/nic/gpuagent/api/smi/gimamdsmi/smi_utils.hpp Outdated
Comment thread sw/nic/gpuagent/api/gpu_api.cc Outdated
@spraveenio spraveenio force-pushed the feature/gimamdsmi-smi-session branch 2 times, most recently from 075ca40 to b19cd57 Compare June 4, 2026 18:27
Moves the gim amdsmi backend from a persistent (init-once-at-startup,
never-shut-down) model to an optional per-request session model that
releases /dev/gim-smi0 between calls, freeing the device for other
processes (e.g. the GIM daemon).

Behavior is controlled by AGA_SMI_LAZY_INIT=1; when unset the existing
persistent behavior is preserved. The active mode is logged at startup.

Concurrency

  smi_session uses std::mutex + refcount. The lock is held ONLY around
  refcount transitions and the amdsmi_init / amdsmi_shut_down calls
  themselves. While sessions are open, amdsmi API calls run UNLOCKED so
  concurrent gRPC handlers (up to AGA_MAX_GRPC_THREADS = 256) and the
  watcher thread all run in parallel on the same refcount-gated init.
  This is the same shared-init concurrency model used by persistent
  mode today.

  Invariants (under init_mutex_):
    refcount_() == 0  <=>  amdsmi is NOT initialized
    refcount_() >  0  <=>  amdsmi IS initialized; exactly ONE successful
                            amdsmi_init has occurred without a matching
                            amdsmi_shut_down
    amdsmi_init runs ONLY on the 0 -> 1 transition
    amdsmi_shut_down runs ONLY on the 1 -> 0 transition

Handle lifecycle

  amdsmi handles can change across init/shutdown cycles, so cached
  handles are unsafe in lazy mode. Public smi_api entry points take a
  new const aga_obj_key_t *gpu_key trailing parameter; the gim
  AGA_SMI_SESSION_GUARD(gpu_key, handle) macro opens the session and
  re-resolves a local gpu_handle via
  amdsmi_get_processor_handle_from_uuid(gpu_key->str(), ...) inside the
  per-request session. The watcher walks gpu_db() per tick and resolves
  each entry's handle the same way - no cached UUID/handle state.
  The amdsmi backend takes the gpu_key for signature parity and ignores
  it (handles are stable there).

smi_state::init() is restructured into a clean if/else: persistent mode
does init+discover only; lazy mode does init+discover+shutdown.

UUID->string conversion reuses the canonical aga_obj_key_t::str() helper
from api/include/base.hpp - no new helper added.

Test results (amd@10.30.60.190 device-metrics-exporter):
  AGA_SMI_LAZY_INIT=1: daemon starts cleanly, startup log shows lazy
  mode, gpuagent.sock is created, exporter connects, /dev/gim-smi0 fd
  is released between calls.
  Persistent mode (default): unchanged behavior, fd held for life.
  amdsmi backend untouched in behavior; new gpu_key param is unused
  there.

Co-Authored-By: Claude Sonnet 4 (1M context) <noreply@anthropic.com>
@spraveenio spraveenio force-pushed the feature/gimamdsmi-smi-session branch from c149de9 to f597c92 Compare June 4, 2026 20:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants