TensorRT inference benchmarking suite for battery State-of-Health (SoH) models on the NVIDIA Jetson Orin Nano 4GB. Takes trained PyTorch models exported from Battery-ML-Training, compiles them into optimised TensorRT engines (FP32, FP16, INT8), and evaluates them using a two-pass methodology that separates accuracy/latency measurement from steady-state power profiling.
- Overview
- Repository Structure
- Prerequisites
- Workflow
- Benchmark Methodology
- Output Files
- Configuration
- Power Modes (Jetson Orin Nano)
- Project Structure
This repository is the inference and evaluation stage of a two-repo pipeline:
Battery-ML-Training ──(ONNX exports)──► Edge-ML (this repo)
  PyTorch training                        TensorRT compilation
  ONNX export                             Benchmarking on Jetson
Models exported as .onnx from Battery-ML-Training are compiled here into .engine files in three precision formats:
| Precision | Description |
|---|---|
| FP32 | Full float32 baseline — no quantisation |
| FP16 | Half-precision float — ~2× speedup on Tensor Cores with minimal accuracy loss |
| INT8 | 8-bit integer — maximum throughput, requires calibration data |
INT8 calibration uses the training split of the dataset (70% of cells) to compute per-layer activation scale factors, ensuring the held-out test split is never used during calibration.
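To make the idea of a per-layer activation scale factor concrete, the sketch below derives a symmetric INT8 scale from the largest absolute activation seen in calibration data and round-trips one value through quantisation. This is a simplified max-abs scheme for illustration only: TensorRT's calibrators select the dynamic range with their own algorithms, and the function names and activation values here are hypothetical.

```python
def int8_scale(activations):
    """Symmetric scale that maps the largest |activation| to 127."""
    max_abs = max(abs(a) for a in activations)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantise(x, scale):
    """Round to the nearest INT8 step and clamp to [-127, 127]."""
    q = round(x / scale)
    return max(-127, min(127, q))


def dequantise(q, scale):
    """Map an INT8 code back to an approximate float value."""
    return q * scale


# Made-up activations standing in for one layer's calibration statistics.
calib_activations = [0.02, -0.75, 0.31, 1.27, -0.4]
scale = int8_scale(calib_activations)
q = quantise(0.31, scale)
approx = dequantise(q, scale)
```

Values outside the calibrated range saturate at ±127, which is why calibration data should resemble the inputs seen at inference time.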
- NVIDIA Jetson Orin Nano 4GB running JetPack 5.x
- Python 3 with a virtual environment at .venv/ (recommended)
- TensorRT and CUDA (bundled with JetPack)
- tegrastats available on the system path (bundled with JetPack)
- ONNX models placed under models/<folder>/
- Preprocessed test tensors under data/tensor_qv/
Install Python dependencies inside your venv before running:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt  # numpy, pyyaml, tensorrt bindings, etc.
Optional: Create a .env file in the repo root to set DISCORD_WEBHOOK_URL for benchmark completion notifications.
Run quantise.sh to compile all ONNX models in a given folder into FP16 and INT8 TensorRT engines. FP32 engines are compiled as a baseline by onnx2engine.py automatically.
./quantise.sh <folder>
# e.g.
./quantise.sh v4
This script runs two steps internally:
1. calibration_cache.py: Profiles the Q-V training data and writes per-model INT8 activation scale-factor caches to models/<folder>/.
2. onnx2engine.py: Compiles every .onnx model into optimised FP32, FP16, and INT8 .engine files.
The script can be sourced (source quantise.sh v4) or run directly. When sourced from a tmux session, it automatically re-launches in a subshell to protect the parent shell environment.
Use run_benchmark.sh to lock hardware state and execute the full benchmark suite:
./run_benchmark.sh <model_folder> [power_mode]
# e.g.
./run_benchmark.sh v4 # MAXN mode (default)
./run_benchmark.sh v4 1 # 7W_AI efficiency mode
The script:
- Activates .venv and loads .env if present
- Locks the Jetson hardware state with sudo nvpmodel and sudo jetson_clocks
- Sets a fixed fan PWM (default: 127/255) to eliminate thermal variability
- Runs benchmark.py with elevated privileges via sudo -E (preserving env vars)
- Restores automatic fan control on completion
A 10-second startup delay is built into benchmark.py; this window can be used to detach from an SSH session (e.g. via tmux) to remove SSH overhead before inference begins.
The benchmark uses a two-pass approach to eliminate the observer effect — power logging overhead must not inflate latency measurements, and latency-optimised inference must not suppress the GPU into a low-power state during power profiling.
The first pass measures accuracy and latency with no power-logging overhead. All test samples are passed through the engine, and per-sample predictions and wall-clock latencies are recorded.
- Warmup: 50 inference iterations are discarded before timing begins.
- Padding: For batch sizes > 1, dummy samples pad the dataset to a full batch; dummy predictions are discarded before accuracy calculation.
- Outputs: accuracy.csv and latency.csv
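The warmup and padding steps above can be sketched as follows. This is a minimal illustration, not the code in benchmark.py: make_batches, run_pass and the infer callable are hypothetical names, and plain Python lists stand in for the real tensors.

```python
def make_batches(samples, batch_size, pad_value=0.0):
    """Split samples into full batches, padding the last one with dummies.

    Returns the batches and the number of real (non-dummy) samples.
    """
    n_real = len(samples)
    n_pad = (-n_real) % batch_size  # 0 when the data already fills a batch
    padded = list(samples) + [pad_value] * n_pad
    batches = [padded[i:i + batch_size]
               for i in range(0, len(padded), batch_size)]
    return batches, n_real


def run_pass(batches, n_real, infer, warmup=50):
    """Discard warmup iterations, then collect predictions for real samples."""
    if batches:
        for _ in range(warmup):
            infer(batches[0])  # result intentionally thrown away
    preds = []
    for batch in batches:
        preds.extend(infer(batch))
    return preds[:n_real]  # drop predictions made for dummy padding


def double(batch):
    """Stand-in for the engine: returns one prediction per input."""
    return [x * 2 for x in batch]


batches, n_real = make_batches([1.0, 2.0, 3.0, 4.0, 5.0], batch_size=4)
preds = run_pass(batches, n_real, double, warmup=3)
```

Because padding is only ever appended at the end, slicing to the first n_real predictions is enough to remove every dummy result.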
Latency metrics recorded:
| Metric | Description |
|---|---|
| mean_latency_ms | Mean inference time per batch |
| p95_latency_ms | 95th-percentile latency (tail latency) |
| throughput_cells_per_sec | Cells processed per second |
| norm_latency_ms_per_cell | Mean latency normalised per cell (batch-size independent) |
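As a rough illustration of how these four metrics relate to one another, the sketch below derives them from a list of per-batch latencies. It assumes each sample in a batch corresponds to one cell and uses a nearest-rank percentile; the actual definitions in benchmark.py may differ, and latency_metrics is a hypothetical name.

```python
def latency_metrics(latencies_ms, batch_size):
    """Summarise per-batch wall-clock latencies (in milliseconds)."""
    n = len(latencies_ms)
    mean_ms = sum(latencies_ms) / n
    ordered = sorted(latencies_ms)
    # Nearest-rank 95th percentile: rank = ceil(0.95 * n), in integer maths.
    rank = (95 * n + 99) // 100
    p95_ms = ordered[rank - 1]
    return {
        "mean_latency_ms": mean_ms,
        "p95_latency_ms": p95_ms,
        # Samples per batch divided by seconds per batch.
        "throughput_cells_per_sec": batch_size / (mean_ms / 1000.0),
        # Dividing by batch size makes runs with different batch sizes comparable.
        "norm_latency_ms_per_cell": mean_ms / batch_size,
    }


# Made-up latencies of 1 ms to 20 ms for a batch size of 4.
metrics = latency_metrics([float(i) for i in range(1, 21)], batch_size=4)
```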
Accuracy metrics (denormalised to Ah against nominal capacity of 2.4 Ah):
| Metric | Description |
|---|---|
| rmse_ah | Root mean square error |
| mae_ah | Mean absolute error |
| max_abs_error_ah | Worst-case absolute error |
Both global and per-cell breakdowns are written.
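The accuracy metrics can be sketched as below. This assumes that denormalisation is a plain multiplication of the [0, 1] model output by the 2.4 Ah nominal capacity; accuracy_metrics and the sample values are hypothetical and only illustrate the arithmetic.

```python
import math

NOMINAL_CAPACITY_AH = 2.4  # nominal capacity stated in this README


def accuracy_metrics(pred_norm, target_norm, nominal_ah=NOMINAL_CAPACITY_AH):
    """Compute error metrics in Ah from normalised predictions and targets."""
    # Scaling the normalised error is equivalent to denormalising both sides.
    errors_ah = [(p - t) * nominal_ah for p, t in zip(pred_norm, target_norm)]
    n = len(errors_ah)
    return {
        "rmse_ah": math.sqrt(sum(e * e for e in errors_ah) / n),
        "mae_ah": sum(abs(e) for e in errors_ah) / n,
        "max_abs_error_ah": max(abs(e) for e in errors_ah),
    }


# Two made-up samples: errors of +0.05 and -0.10 in normalised units.
metrics = accuracy_metrics([0.90, 0.80], [0.85, 0.90])
```

A per-cell breakdown would apply the same function to the subset of samples belonging to each cell.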
The second pass profiles power: inference runs continuously for a fixed 10-second window while tegrastats records hardware telemetry. Predictions are discarded; only power.log is written.
The fixed time window is critical for fast models: a TCN with batch size 96 can complete a full dataset pass in ~7 ms, yielding far too few tegrastats samples (default 50 ms interval) if inference were stopped at dataset end. The 10-second window guarantees ~200 samples regardless of model speed.
tegrastats captures:
- VDD_IN: Total board input power
- VDD_SOC: SoC power rail
- VDD_CPU_CV: CPU + CV engine power rail
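The sketch below shows one way the rail readings might be pulled out of power.log, along with the sample-count arithmetic behind the fixed window. It assumes each rail is logged as NAME <current>mW/<average>mW; the sample line is illustrative rather than captured from a device, and the helper names are hypothetical.

```python
import re

# Assumed layout: rail name, instantaneous mW, then running-average mW.
RAIL_PATTERN = re.compile(r"(VDD_\w+) (\d+)mW/(\d+)mW")

# Illustrative line only; real tegrastats output carries more fields.
SAMPLE_LINE = ("RAM 1800/3425MB CPU [12%@1510] "
               "VDD_IN 4200mW/4100mW VDD_SOC 1400mW/1390mW "
               "VDD_CPU_CV 600mW/590mW")


def parse_power_line(line):
    """Return {rail: instantaneous_mW} for every VDD_* rail on the line."""
    return {name: int(cur) for name, cur, _avg in RAIL_PATTERN.findall(line)}


def mean_power_mw(lines, rail="VDD_IN"):
    """Average the instantaneous reading of one rail across log lines."""
    readings = [parse_power_line(line).get(rail) for line in lines]
    readings = [r for r in readings if r is not None]
    return sum(readings) / len(readings) if readings else None


def expected_samples(window_s, interval_ms):
    """Number of telemetry samples a fixed window yields."""
    return int(window_s * 1000 // interval_ms)


rails = parse_power_line(SAMPLE_LINE)
n_samples = expected_samples(10.0, 50)  # 10 s window at a 50 ms interval
```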
Results are written to results/<run_name>/<model>/bs<N>/<precision>/:
results/
└── v4_powm0/
    └── <model_name>/
        └── bs<batch_size>/
            └── <precision>/  # fp32 | fp16 | int8
                ├── accuracy.csv
                ├── latency.csv
                └── power.log
A final aggregated summary.csv is written to results/<run_name>/summary.csv, combining all engines into a single table for cross-model comparison.
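Because the directory layout encodes the run, model, batch size and precision, an aggregation step can recover that metadata from the path alone. The following is a hypothetical sketch of such a parser, not the logic in aggregate.py; parse_result_dir and the model name tcn are illustrative.

```python
from pathlib import PurePosixPath


def parse_result_dir(path):
    """Extract metadata from results/<run_name>/<model>/bs<N>/<precision>."""
    parts = PurePosixPath(path).parts
    idx = parts.index("results")  # raises ValueError if the root is missing
    fields = parts[idx + 1:idx + 5]
    if len(fields) != 4:
        raise ValueError(f"unexpected result path: {path!r}")
    run_name, model, batch_dir, precision = fields
    if not batch_dir.startswith("bs"):
        raise ValueError(f"expected bs<N> directory, got {batch_dir!r}")
    return {
        "run_name": run_name,
        "model": model,
        "batch_size": int(batch_dir[2:]),
        "precision": precision,
    }


meta = parse_result_dir("results/v4_powm0/tcn/bs96/fp16")
```

Each parsed dictionary could then be merged with the metrics read from that directory to form one row of the summary table.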
configs.yaml defines the train/val/test cell splits. These splits must match those used in Battery-ML-Training to ensure correct INT8 calibration and unbiased evaluation.
| Split | Proportion | Purpose |
|---|---|---|
| train | 70% (47 cells) | Model training + INT8 calibration |
| val | 15% (12 cells) | Hyperparameter tuning / early stopping |
| test | 15% (13 cells) | Held-out evaluation only; never used for calibration or training |
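Since calibration must never touch the test cells, a simple disjointness check on the loaded splits can catch configuration mistakes early. The sketch below assumes the splits have already been read into a dictionary of cell-ID lists; the actual structure of configs.yaml is not shown here, and both check_splits and the cell IDs are hypothetical.

```python
def check_splits(splits):
    """Raise ValueError if any cell ID appears in more than one split.

    Returns the total number of distinct cells when the splits are disjoint.
    """
    owner = {}
    for split_name, cells in splits.items():
        for cell in cells:
            if cell in owner and owner[cell] != split_name:
                raise ValueError(
                    f"cell {cell!r} is in both {owner[cell]!r} and {split_name!r}"
                )
            owner[cell] = split_name
    return len(owner)


# Made-up cell IDs standing in for the real configuration.
example_splits = {
    "train": ["cell_01", "cell_02", "cell_03"],
    "val": ["cell_04"],
    "test": ["cell_05"],
}
n_cells = check_splits(example_splits)
```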
The NOMINAL_CAPACITY constant (2.4 Ah) is used to denormalise model outputs from the [0, 1] range back into physical units for accuracy reporting.
| Mode | nvpmodel -m | TDP | Description |
|---|---|---|---|
| MAXN | 0 | ~10 W | All CPU/GPU cores unlocked, maximum performance |
| 7W_AI | 1 | ~7 W | CPU frequency capped, GPU prioritised |
| 7W_CPU | 2 | ~7 W | Both CPU and GPU frequency capped |
jetson_clocks is called after nvpmodel to lock clocks to their maximum within the selected power envelope, preventing dynamic frequency scaling from introducing variance in benchmark results.
Edge-ML/
├── benchmark.py  # Main benchmark entry point
├── run_benchmark.sh  # Hardware-locking benchmark runner
├── quantise.sh  # TensorRT compilation pipeline
├── configs.yaml  # Train/val/test cell split definitions
├── models/  # ONNX inputs + compiled .engine outputs
│   └── <folder>/
├── data/
│   └── tensor_qv/  # Preprocessed input tensors
├── results/  # Benchmark output (generated at runtime)
└── src/
    ├── benchmark/
    │   ├── aggregate.py  # Aggregates per-engine CSVs into summary.csv
    │   ├── dataloader.py  # Loads and batches test tensors
    │   ├── discover_files.py  # Discovers .engine files and parses metadata
    │   ├── inference.py  # TRTWrapper: loads engine, runs inference
    │   └── power_log.py  # TegrastatsLogger context manager
    ├── quantise/
    │   ├── calibration_cache.py  # Generates INT8 activation scale caches
    │   └── onnx2engine.py  # Compiles ONNX → TensorRT .engine
    └── utils/
        └── notify.py  # Optional Discord webhook notifications