HPCToolkit Release Notes
Enhancements:
- Hpcrun includes initial support for using the OpenMP OMPT interface for
profiling and tracing of OpenMP TARGET operations on AMD GPUs in code
generated by ROCm 5.1's clang-based AOMP compiler.
- Hpcrun supports profiling of kernels on AMD GPUs using publicly available
hardware counters with AMD's rocprofiler API.
- Hpcrun obtains binaries for code that executes on AMD GPUs using AMD's
Roctracer API instead of the ROCm Debug API.
- Hpcrun emits a better error message when an application unexpectedly closes
hpcrun's log file.
- Hpcrun now uses an embedded implementation of an MD5 hash function for
naming CPU and GPU binaries revealed in memory.
- Hpcstruct now supports caching of structure files from binaries it analyzes.
A cache greatly reduces the time to analyze binaries for executions, as the
cache will almost always contain up-to-date analysis results for commonly
used shared platform libraries, e.g. libc, libm, as well as libraries for MPI.
When a binary changes, results in the cache are updated as needed.
- Hpcstruct no longer pretty-prints its output by default. Omitting the leading
blanks that pretty-printing adds reduced the output size by over 15%, which is
significant when analyzing multi-gigabyte binaries.
- When applied to a measurements directory, hpcstruct analyzes only the CPU and
GPU binaries that were measured in the execution, using a mix of parallelism
and concurrency. Binaries that did not get any profile hits are not analyzed.
- Hpcstruct's parallel efficiency has been improved. Changes that contributed
to that improvement include enhancements to parallelism in Dyninst’s
finalization of binary analysis and parallel assembly of hpcstruct's output
file.
- Hpcstruct now supports analysis of CUDA binaries from CUDA 11.5+,
accommodating a change to NVIDIA's nvdisasm output format.
- When measuring hardware counter metrics for kernels on AMD GPUs, kernel
measurement with Roctracer is now disabled because it gives an incorrect
timestamp for the first kernel. The timestamp is far enough off that it
destroys the accuracy of kernel profiles and traces.
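One way to picture the structure-file cache and hash-based naming described
above is content-addressed caching: a binary's contents determine its key, so
a changed binary misses the cache and is analyzed again. The Python sketch
below illustrates that idea only. It is not hpcstruct's implementation, and
content_key, cached_analysis, and the analyze callback are hypothetical names
introduced for illustration.

```python
import hashlib
import os


def content_key(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file's contents (hypothetical cache key)."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so multi-gigabyte binaries are not loaded at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_analysis(path, cache_dir, analyze):
    """Reuse a stored result when the binary's contents are unchanged.

    `analyze` is a hypothetical callable standing in for the expensive
    analysis step; it takes a path and returns a string.
    Returns (result, hit), where hit tells whether the cache was used.
    """
    os.makedirs(cache_dir, exist_ok=True)
    entry = os.path.join(cache_dir, content_key(path) + ".txt")
    if os.path.exists(entry):
        with open(entry) as f:
            return f.read(), True
    result = analyze(path)
    with open(entry, "w") as f:
        f.write(result)
    return result, False
```

Because the key depends only on file contents, an unchanged shared library
such as libc hits the cache across runs, while a rebuilt binary gets a new key
and a fresh analysis.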
Bug fixes:
- Adjust tracing for ROCm GPU activities to correct alignment between CPU and
GPU timelines.
- Fix use of Dyninst by hpcstruct so that it sees inlining info in Intel GPU
binaries.
Infrastructure improvements:
- Code for hpcrun's use of LD_AUDIT has been streamlined.
- Fixed recording of program path names as part of metadata in hpcrun's
output files.
Dependency changes:
- Deletions
  - Mbedtls - superseded by internal MD5 hash implementation
  - ROCm Debug API - obtain GPU binaries using Roctracer API instead
  - Gotcha - unused and removed
- Additions
  - Rocprofiler API - included for a spack '+rocm' install to provide access
    to hardware counters on AMD GPUs
  - HSA - included for a spack '+rocm' install to support rocprofiler
Known Issues:
- Profile measurements and traces for AMD GPUs, which are new for ROCm 5.1,
should be viewed with some skepticism.
Also, elapsed times for copies seem too large for executions that we've
measured. For a 96-thread run of miniqmc, the aggregate time for copies
reported by AMD's OMPT implementation for its GPUs was almost 100x longer
than the real time of the execution. If timestamps are incorrect for
OpenMP events on AMD GPUs, this will affect the accuracy of both profile
and trace views.
Furthermore, trace items for OpenMP events on AMD GPUs are known to
overlap. For that reason, having hpcviewer render them on a single
trace line, which it does, is problematic. As a result, overlapping
trace items will cause incorrect statistics in trace view. In such
cases, the profile view will accurately represent the aggregate values
reported by OMPT for AMD GPUs.
- In some cases, attribution of exclusive metrics for BLOCKTIME and
CTXT SWTCH to call paths within the Linux kernel may be missing
even though inclusive costs for these metrics are attributed properly.
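To see why overlapping trace items on a single trace line distort statistics,
as noted above for OpenMP events on AMD GPUs, consider the Python sketch
below. It compares the summed duration of the items with the time the line is
actually covered. It is an illustration only and not hpcviewer's code;
summarize_trace_line and the (start, end) tuple layout are assumptions made
for this example.

```python
def summarize_trace_line(items):
    """Summarize (start, end) trace items that share one trace line.

    Returns (summed, covered, overlaps):
      summed   - sum of the individual item durations
      covered  - length of the union of the items on the timeline
      overlaps - number of items that begin before an earlier item has ended
    """
    summed = 0
    covered = 0
    overlaps = 0
    reach = None  # furthest end time seen so far
    for start, end in sorted(items):
        summed += end - start
        if reach is None or start >= reach:
            # No overlap: the whole item adds new coverage.
            covered += end - start
            reach = end
        else:
            overlaps += 1
            if end > reach:
                # Only the part past the current reach is new coverage.
                covered += end - reach
                reach = end
    return summed, covered, overlaps


summed, covered, overlaps = summarize_trace_line([(0, 10), (5, 15), (20, 25)])
print(summed, covered, overlaps)
```

When summed exceeds covered, a view that assumes one item at a time per trace
line cannot show every item, so statistics derived from it are skewed.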
HPCViewer Release Notes
Enhancements:
- Improved call site icons
- Double buffering of the x and y axes in the trace view
- Simplify metric number in derived metrics
- Set maximum database history to 20
- Set the default GPU trace exposure to true