Random-access memory characterization for NVIDIA GB10 using a GUPS-derived workload.
Target platform:
- NVIDIA GB10 (SM 12.1)
- DGX Spark and Spark-class systems
- Hardware-coherent unified memory
- Shared LPDDR5X memory subsystem
Previous GB10 characterization focused primarily on sequential, streaming, and fixed-buffer access patterns. Representative measurements include:
- SYS-scope vs GPU-scope atomic latency ratio: 1.00x
- CPU-write / GPU-read contention throughput loss: ~2.2%
- No measurable migration-related cold-start penalty observed
Those measurements characterize coherent memory behavior under structured access patterns. This project extends that work to randomized update workloads derived from HPCC RandomAccess (GUPS).
The objective is to characterize how GB10 behaves when CPU and GPU access patterns defeat caches, prefetchers, and TLB locality. This is a memory-access regime substantially different from the one covered by previous GB10 measurements.
| Variant | Memory model | Purpose |
|---|---|---|
| `cpu_gups.c` | Host memory (`malloc`) | CPU random-update baseline |
| `cuda_gups.cu` | Device memory (`cudaMalloc`) | GPU random-update baseline |
| `managed_gups.cu` | Unified memory (`cudaMallocManaged`) | Concurrent CPU/GPU random updates |
The managed-memory variant measures aggregate behavior under simultaneous CPU and GPU updates to a shared table. Observed performance reflects the combined effects of coherence traffic, memory-controller arbitration, cache ownership transitions, atomic serialization, and random-access bandwidth limitations.
This benchmark does not isolate any individual mechanism. It characterizes overall platform behavior under randomized concurrent access.
If the aggregate rate remains close to the sum of the isolated CPU and GPU baselines, the combined overhead of these mechanisms is small under randomized concurrent access.
If the aggregate rate is significantly lower, the suite can distinguish driver-reported throttle events from ordinary memory-bound contention using NVML throttle-state tracking. Additional targeted measurements would still be required to attribute the underlying cause precisely.
cpu_gups.c determines a default table size at startup using
MemAvailable from /proc/meminfo and the original HPCC RandomAccess
sizing rule: the largest power-of-two table not exceeding half of
available memory.
cuda_gups.cu and managed_gups.cu take table size as an explicit
argument. This keeps all three variants directly comparable and avoids
reliance on CUDA memory-availability queries for benchmark sizing.
# CPU variant — any Linux platform
gcc -O2 -Wall -o cpu_gups cpu_gups.c -lpthread -ldl
# CUDA variants — GB10 target
nvcc -O3 -arch=sm_121 -o cuda_gups cuda_gups.cu
nvcc -O3 -arch=sm_121 -Xcompiler -pthread -o managed_gups managed_gups.cu
# CUDA variants — Pascal toolchain-verification target
nvcc -O3 -arch=sm_61 -o cuda_gups cuda_gups.cu
nvcc -O3 -arch=sm_61 -Xcompiler -pthread -o managed_gups managed_gups.cu

CUDA 13.0 is the recommended GB10 build target.
Prior GB10 characterization observed incorrect PTX %clock64
behavior under CUDA 13.1 and 13.2 on SM 12.1.
Source: https://forums.developer.nvidia.com/t/gb10-hardware-baseline-first-direct-measurements-and-findings/367851/9
This suite uses CUDA event timing rather than direct %clock64
timing, but CUDA 13.0 is retained for consistency with the broader
GB10 diagnostic toolchain.
# CPU baseline — auto-sizes from live /proc/meminfo
./cpu_gups
# Force a specific size for direct comparison across all three:
./cpu_gups 29 4.0
./cuda_gups 29 4.0
./managed_gups 29 4.0 0.5 # 0.5 = 50/50 CPU/GPU update split
# Pass isolated rates from prior cpu_gups/cuda_gups runs at the same
# table size to get coherence_efficiency in the output and JSON log
# (managed_gups GUP/s divided by the sum of the two isolated rates —
# omitted entirely, not fabricated as zero, if not provided):
./managed_gups 29 4.0 0.5 0.0233 0.590

All three write a structured JSON log to results/, including the
full per-sample thermal/power/throttle-bitmask time series, not just
summary statistics — see results/<binary>_log2_<n>_<timestamp>.json
after any run.
No GB10 unit was available during development, so all three binaries were compiled and executed on a GTX 1080 (SM 6.1, driver 570.195.03) as a compiler/runtime validation target.
This verified:
- C/CUDA compilation
- CUDA kernel execution
- CPU threading
- JSON logging
- progress reporting
- NVML integration
- throttle-state decoding
- power-telemetry fallback behavior
The resulting GUP/s measurements are specific to a Pascal discrete-GPU memory architecture and are not reported as GB10 results. The purpose of this testing was to validate the implementation, not to characterize GB10 behavior.
| File | Compiled | Executed | Verified on GB10 |
|---|---|---|---|
| `cpu_gups.c` | Yes | Yes | Not yet |
| `cuda_gups.cu` | Yes | Yes | Not yet |
| `managed_gups.cu` | Yes | Yes | Not yet |
- https://github.com/parallelArchitect/cuda-unified-memory-analyzer
- https://github.com/parallelArchitect/sparkview
- https://github.com/parallelArchitect/spark-gpu-throttle-check
- https://github.com/parallelArchitect/nvidia-uma-fault-probe
This repository contains code derived from the OpenSHMEM version of HPCC RandomAccess (GUPS):
```
OpenSHMEM version:
Copyright (c) 2011 - 2015 University of Houston System and UT-Battelle, LLC.
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

o Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

o Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

o Neither the name of the University of Houston System, UT-Battelle, LLC. nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
```
This is the OpenSHMEM version of GUPS (Giga Updates Per Second). The update kernel was translated from the MPI version of the HPCC RandomAccess (GUPS) benchmarks, version 1.3.1. All original copyrights are retained. HPCC code was distributed under the BSD original license: http://icl.cs.utk.edu/hpcc/faq/index.html#263
The same notice is reproduced in full in the header of every derived source file in this repository (`cpu_gups.c`, `cuda_gups.cu`, `managed_gups.cu`), per the license's own redistribution terms.