All notable changes to gb10-gups are documented here.
Added
Initial three-file suite
- `cpu_gups.c`: CPU-only GUPS baseline, plain `malloc`, auto-sizes the table from live `/proc/meminfo` `MemAvailable`.
- `cuda_gups.cu`: GPU-only GUPS baseline, `cudaMalloc`, explicit table size argument (CUDA memory queries unreliable on GB10).
- `managed_gups.cu`: CPU+GPU concurrent GUPS, `cudaMallocManaged`, configurable CPU/GPU update split.
- LFSR random generator, `starts()` seed-jump, and XOR read-modify-write update logic ported from HPCC RandomAccess / OpenSHMEM GUPS (Copyright 2011-2015 University of Houston System and UT-Battelle, LLC, BSD-licensed). SHMEM multi-process partitioning machinery intentionally not reused: it does not apply to single-chip unified memory.
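For readers unfamiliar with the update pattern named in the last item, the following is a minimal sketch of an HPCC-RandomAccess-style LFSR step and XOR read-modify-write loop. It is not the code in this repository: the names `lfsr_next` and `gups_update` are hypothetical, the `starts()` seed-jump is omitted, and the table size is assumed to be a power of two.

```c
#include <stdint.h>
#include <stddef.h>

/* Feedback polynomial as used by HPCC RandomAccess (assumed unchanged here). */
#define POLY 0x0000000000000007ULL

/* Advance the 64-bit LFSR one step: shift left and fold in POLY whenever
 * the bit shifted out of the top was set. */
static uint64_t lfsr_next(uint64_t ran)
{
    return (ran << 1) ^ ((ran >> 63) ? POLY : 0ULL);
}

/* Apply n_updates XOR read-modify-write updates to a power-of-two table.
 * Returns the final generator state so a caller could continue the stream.
 * Hypothetical helper, not a function from the gb10-gups sources. */
static uint64_t gups_update(uint64_t *table, size_t table_size,
                            uint64_t n_updates, uint64_t seed)
{
    uint64_t ran = seed;
    for (uint64_t i = 0; i < n_updates; i++) {
        ran = lfsr_next(ran);
        table[ran & (table_size - 1)] ^= ran;  /* index from the low bits */
    }
    return ran;
}
```

Because XOR is its own inverse, replaying the same update stream from the same seed restores the table to its initial contents, which is a cheap way to sanity-check a sketch like this.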
License compliance fix
- Added the verbatim BSD 3-Clause notice (full text, not paraphrased) to all three files' headers, per that license's own redistribution terms. Original provenance description alone was not sufficient.
Runtime platform classification
- Ported `query_um_paradigm()` byte-for-byte from `cuda-unified-memory-analyzer`'s `um_analyzer_v8.cu`, verified against the original source via direct diff (zero logic differences). Both `.cu` files now self-report `FULL_EXPLICIT` / `FULL_HARDWARE_COHERENT` / `FULL_SOFTWARE_COHERENT` / `LIMITED` / `UNKNOWN` at runtime, correctly identifying whichever GPU architecture they are actually run on rather than assuming the build target.
Thermal and power sampling
- Added background-thread NVML sampling (`ThermalSampler`, `thermal_sample_thread`) to all three binaries, sampling every 0.5 s for the full run duration: catches real climb on long runs, correctly shows nothing meaningful on sub-second runs.
- Added `read_power_w()` implementing the spbm_hwmon-first, NVML-fallback, explicit-"unavailable"-third chain, ported from the real `sparkview/power.py`. Power source is recorded and reported so readers know which sensor produced a given number.
- Added average power tracking alongside peak (matching the `gb10-kernel-probe` forum baseline's avg+peak reporting pattern, not peak alone).
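The fallback order and the avg+peak bookkeeping described above can be sketched roughly as follows. This is an illustration under assumptions, not the ported `read_power_w()`: the reader callbacks, the `power_source` enum, and the two stub readers are all hypothetical stand-ins for the real hwmon and NVML reads.

```c
#include <stddef.h>

/* Hypothetical reader signature: returns 0 and fills *watts on success. */
typedef int (*power_reader_fn)(double *watts);

typedef enum { POWER_SRC_HWMON, POWER_SRC_NVML, POWER_SRC_UNAVAILABLE } power_source;

typedef struct {
    double       watts;   /* meaningful only when source != UNAVAILABLE */
    power_source source;  /* which sensor produced the value */
} power_reading;

/* Try hwmon first, then NVML; if both fail, say so explicitly rather than
 * reporting a misleading 0 W. */
static power_reading read_power_chain(power_reader_fn hwmon_reader,
                                      power_reader_fn nvml_reader)
{
    power_reading r = { 0.0, POWER_SRC_UNAVAILABLE };
    double w = 0.0;
    if (hwmon_reader && hwmon_reader(&w) == 0) {
        r.watts = w; r.source = POWER_SRC_HWMON;
    } else if (nvml_reader && nvml_reader(&w) == 0) {
        r.watts = w; r.source = POWER_SRC_NVML;
    }
    return r;
}

/* Running average and peak over the samples that were actually available. */
typedef struct { double sum_w; double peak_w; unsigned n; } power_stats;

static void power_stats_add(power_stats *s, power_reading r)
{
    if (r.source == POWER_SRC_UNAVAILABLE) return;  /* skip missing samples */
    s->sum_w += r.watts;
    if (r.watts > s->peak_w) s->peak_w = r.watts;
    s->n++;
}

static double power_stats_avg(const power_stats *s)
{
    return s->n ? s->sum_w / s->n : 0.0;
}

/* Stub readers for demonstration only; real ones would touch sysfs or NVML. */
static int stub_reader_fail(double *watts) { (void)watts; return -1; }
static int stub_reader_40w(double *watts)  { *watts = 40.0; return 0; }
static int stub_reader_60w(double *watts)  { *watts = 60.0; return 0; }
```

Passing the readers in as callbacks keeps the selection logic testable without a GPU; whether the real port is structured this way is not something this changelog states.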
NVML throttle-reason bitmask tracking
- Added `nvmlDeviceGetCurrentClocksThrottleReasons` reading and accumulation (`throttle_flags_seen`), with bitmask constants ported from `spark-gpu-throttle-check.py`.
- Fixed the misleading generic "possible throttle" message: the output now distinguishes a real driver-reported throttle flag from a clock drop with no flag set. On Pascal, confirmed the latter case in practice; see Fixed section below.
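A rough sketch of the flag-versus-clock-drop distinction follows. The bit values mirror the `nvmlClocksThrottleReason*` constants as they appear in `nvml.h` (worth checking against the header in use). Everything else is an assumption made for illustration: the macro names, the choice of which bits count as a "real" throttle, the 90% clock-drop threshold, and the `classify_clock_event` helper.

```c
#include <stdint.h>

/* Bit values as defined for nvmlClocksThrottleReason* in nvml.h. */
#define THROTTLE_GPU_IDLE        0x01ULL
#define THROTTLE_APP_CLOCKS      0x02ULL
#define THROTTLE_SW_POWER_CAP    0x04ULL
#define THROTTLE_HW_SLOWDOWN     0x08ULL
#define THROTTLE_SYNC_BOOST      0x10ULL
#define THROTTLE_SW_THERMAL      0x20ULL
#define THROTTLE_HW_THERMAL      0x40ULL
#define THROTTLE_HW_POWER_BRAKE  0x80ULL

/* Assumption: idle, application-clock and sync-boost bits are not treated
 * as throttling; only power and thermal limits are. */
#define THROTTLE_REAL_MASK (THROTTLE_SW_POWER_CAP | THROTTLE_HW_SLOWDOWN | \
                            THROTTLE_SW_THERMAL | THROTTLE_HW_THERMAL |    \
                            THROTTLE_HW_POWER_BRAKE)

typedef enum { CLOCK_STEADY, CLOCK_DROP_NO_FLAG, THROTTLE_FLAGGED } clock_verdict;

/* throttle_flags_seen is the OR of every per-sample bitmask for the run. */
static clock_verdict classify_clock_event(uint64_t throttle_flags_seen,
                                          uint32_t min_clock_mhz,
                                          uint32_t max_clock_mhz)
{
    if (throttle_flags_seen & THROTTLE_REAL_MASK)
        return THROTTLE_FLAGGED;            /* driver reported a real limit */
    /* Hypothetical threshold: minimum below 90% of maximum is a "drop". */
    if ((uint64_t)min_clock_mhz * 10 < (uint64_t)max_clock_mhz * 9)
        return CLOCK_DROP_NO_FLAG;          /* e.g. idle-state cycling */
    return CLOCK_STEADY;
}
```

With these assumptions, a run whose clock fell to 139 MHz with no flags accumulated would be reported as a clock drop without a throttle flag, not as a throttle event.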
Structured JSON results logging
- Added `write_results_json()` to all three binaries: writes `results/<binary>_log2_<n>_<timestamp>.json` per run, including the full per-sample thermal/power/throttle time series (not just summary stats), table size, GUP/s, and UM paradigm classification.
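As an illustration of the file naming pattern only, the path could be assembled along these lines. The helper name `results_path` is hypothetical, and the timestamp is passed in as an opaque string because this changelog does not specify its format.

```c
#include <stdio.h>

/* Build "results/<binary>_log2_<n>_<timestamp>.json" into buf.
 * Returns the path length, or -1 if it would not fit (nothing is truncated
 * silently). Hypothetical helper, not write_results_json() itself. */
static int results_path(char *buf, size_t buflen, const char *binary,
                        unsigned log2_n, const char *timestamp)
{
    int n = snprintf(buf, buflen, "results/%s_log2_%u_%s.json",
                     binary, log2_n, timestamp);
    return (n >= 0 && (size_t)n < buflen) ? n : -1;
}
```

For example, `results_path(buf, sizeof buf, "cpu_gups", 29, "20250101T000000")` would produce `results/cpu_gups_log2_29_20250101T000000.json`, where the timestamp shown is a made-up value.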
Progress display
- Added live progress reporting to `cpu_gups.c`'s update loop and `managed_gups.cu`'s CPU thread (the only loops instrumentable from the host side: a running GPU kernel cannot report progress mid-flight). `cuda_gups.cu` uses `cudaEventQuery` polling instead of a blocking sync, printing elapsed time while the async kernel runs. `managed_gups.cu`'s progress line additionally shows live temp/clock/power read from the shared `ThermalSampler`, so there is real machine state to look at during multi-minute runs, not just a percentage.
Fixed
- Forward-declaration ordering bug: `wall_seconds()` was used inside `thermal_sample_thread` before its definition later in `managed_gups.cu` (pre-existing in that file for the CPU thread). Caused a real compile failure (`identifier "wall_seconds" is undefined`), confirmed and fixed with a forward declaration.
- ETA calculation noise: dividing by a near-zero `pct` in the first few progress samples produced wildly inflated ETA estimates (observed: "ETA 29046s" / ~8 hours at 0.2% complete on a run that actually completed in ~250-300s). Fixed by suppressing the ETA display until at least 2% complete and 5 seconds elapsed.
- Premature `min_clock_mhz` display: the live progress line was reading `min_clock_mhz` before it had been set past its `0xFFFFFFFF` sentinel, printing a value that looked real but was not yet meaningful. Fixed by gating the display on an explicit "has this been set" check; shows "MHz(measuring)" until a real minimum exists.
- Stray trailing text on completion line: the "100.0% complete | done" message did not fully overwrite the longest possible prior progress line, leaving fragments like "peak)" visible after completion. Fixed by padding the completion message with sufficient trailing whitespace.
- Mislabeled "GB10 GUPS" banner on Pascal hardware: initial versions of `cuda_gups.cu` / `managed_gups.cu` printed a static "GB10 GUPS" banner regardless of actual hardware. Fixed by adding the `query_um_paradigm()` port (see Added) so the banner and classification reflect the real detected hardware.
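The three progress-line fixes above (ETA gating, sentinel check, padded completion line) can be sketched as below. This assumes the ETA is a plain linear extrapolation, `elapsed * (100 - pct) / pct`, which the changelog implies but does not spell out. At 0.2% that multiplies elapsed time by 499, which is why early samples were so noisy. All function names and the "min ... MHz" label are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

#define MIN_CLOCK_UNSET 0xFFFFFFFFu  /* sentinel: no minimum recorded yet */

/* Linear-extrapolation ETA in seconds, or a negative value while the
 * estimate is suppressed (under 2% complete or under 5 s elapsed). */
static double eta_seconds(double pct, double elapsed_s)
{
    if (pct < 2.0 || elapsed_s < 5.0)
        return -1.0;
    return elapsed_s * (100.0 - pct) / pct;
}

/* Format the minimum-clock field, showing a placeholder until a real
 * minimum has replaced the sentinel. Returns snprintf's result. */
static int format_min_clock(char *buf, size_t buflen, uint32_t min_clock_mhz)
{
    if (min_clock_mhz == MIN_CLOCK_UNSET)
        return snprintf(buf, buflen, "MHz(measuring)");
    return snprintf(buf, buflen, "min %u MHz", (unsigned)min_clock_mhz);
}

/* Format the completion message left-justified and space-padded to
 * line_width, so it overwrites any longer progress line printed earlier
 * on the same terminal row. */
static int format_completion(char *buf, size_t buflen, int line_width)
{
    return snprintf(buf, buflen, "\r%-*s", line_width,
                    "100.0% complete | done");
}
```

A caller would print the ETA only when `eta_seconds` returns a non-negative value, and would pick `line_width` at least as large as the longest progress line it can emit.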
All three binaries compiled (`gcc -Wall` zero warnings; `nvcc -arch=sm_61` zero errors) and ran to completion repeatedly on real hardware (NVIDIA GTX 1080, SM 6.1, driver 570.195.03):
- `cpu_gups`: multiple runs at 2^22, 2^27, 2^29 (auto-sized).
- `cuda_gups`: multiple runs at 2^24, 2^29.
- `managed_gups`: multiple runs at 2^24, 2^29 (50/50 CPU/GPU split).
Confirmed working as designed: throttle-flag discrimination
correctly distinguished Pascal's idle-clock-state cycling (observed:
clock dropping to 139 MHz during long managed_gups runs, zero NVML
throttle flags set throughout) from a real hardware throttle event,
cross-checked against an independent clean PASS from
spark-gpu-throttle-check.py run minutes apart on the same card.
No run has been performed on GB10 hardware. All numbers produced to date are from the Pascal control case described above.