With the update of ROCm to version 5.2.3, improved profiling tools are available for GPU applications, because this version of ROCm enables full access to hardware performance counters. As a result, rocProf can now collect performance counters for kernels running on AMD GPUs. rocProf works with HIP kernels as well as with OpenMP and OpenACC applications that offload to the GPU.
Omniperf
Omniperf is an open-source tool for interpreting profile data collected by rocProf. The examples below show how Omniperf can be used with a code that performs a scalar multiplication and vector addition (SAXPY).
matilda@nid001000:~> module load PrgEnv-cray/8.3.3 rocm/5.2.3 craype-accel-amd-gfx90a
matilda@nid001000:~> cat saxpy.cpp
#include "hip/hip_runtime.h"
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

// y[i] = a*x[i] + y[i] for every i < n; one thread per element.
__global__ void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  int N = 1<<30;
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  hipMalloc(&d_x, N*sizeof(float));
  hipMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  hipMemcpy(d_x, x, N*sizeof(float), hipMemcpyHostToDevice);
  hipMemcpy(d_y, y, N*sizeof(float), hipMemcpyHostToDevice);

  // Perform SAXPY on N = 2^30 elements
  hipLaunchKernelGGL(saxpy, dim3((N+255)/256), dim3(256), 0, 0, N, 2.0f, d_x, d_y);

  hipMemcpy(y, d_y, N*sizeof(float), hipMemcpyDeviceToHost);

  // Every element should equal 2.0*1.0 + 2.0 = 4.0
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmaxf(maxError, fabsf(y[i]-4.0f));
  printf("Max error: %f\n", maxError);

  hipFree(d_x);
  hipFree(d_y);
  free(x);
  free(y);
}
matilda@nid001000:~> hipcc --amdgpu-target=gfx90a -g -c -o saxpy.o saxpy.cpp
matilda@nid001000:~> hipcc --amdgpu-target=gfx90a saxpy.o -o saxpy
matilda@nid001000:~> srun omniperf profile -n saxpy -- ./saxpy
...
ROC Profiler: /opt/rocm/bin/rocprof
/opt/rocm/.info/version
-------------
Profile only
-------------
omniperf ver: 1.0.6
Path: /lus/joey/scratch/pawsey0001/matilda/SAXPY_ROCM5.2.3/workloads
Target: mi200
Command: ./saxpy
Kernel Selection: None
Dispatch Selection: None
IP Blocks: All
/opt/rocm/.info/version
/opt/rocm/.info/version
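The launch configuration in the example rounds the grid size up so that every element is covered, and the `if (i < n)` guard in the kernel masks any surplus threads. The host-only sketch below (plain C++, no GPU required) reproduces that arithmetic together with the memory footprint of one array. The helper names `grid_size` and `array_bytes` are illustrative and are not part of HIP.

```cpp
#include <cassert>
#include <cstdint>

// Number of thread blocks needed to cover n elements with block_size
// threads per block; same round-up as (N+255)/256 in the example.
inline std::uint64_t grid_size(std::uint64_t n, std::uint64_t block_size) {
    assert(block_size > 0);
    return (n + block_size - 1) / block_size;
}

// Bytes occupied by one array of n single-precision values.
inline std::uint64_t array_bytes(std::uint64_t n) {
    return n * sizeof(float);
}
```

For N = 1<<30 and 256 threads per block this gives 4,194,304 blocks and 4 GiB per array, so the example needs 8 GiB of device memory for `d_x` and `d_y`, and the same amount again on the host for `x` and `y`.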
Omniperf writes the profile data to the workloads folder. It is good practice to analyse the profile data on your local machine, which avoids the overhead of remote communication. To do so, install Omniperf on your local machine, preferably the same version as on the remote system. The installation steps are similar to those for the remote machine, which are given at the end of this page. Once Omniperf is installed locally, transfer the workloads folder to your computer and analyse it with the local Omniperf client:
ubuntu@above-basilisk:~$ scp -r matilda@joey.pawsey.org.au:/scratch/pawsey0001/matilda/SAXPY_ROCM5.2.3/workloads .
ubuntu@above-basilisk:~$ omniperf analyze -p workloads/saxpy/mi200/ --gui
--------
Analyze
--------
Dash is running on http://0.0.0.0:8050/
 * Serving Flask app 'omniperf_analyze.omniperf_analyze'
 * Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:8050
 * Running on http://192.168.64.8:8050
Press CTRL+C to quit
...
At this stage, open your preferred browser and navigate to one of the addresses listed in the output, for example http://127.0.0.1:8050/
The page is interactive and contains several panels, including Empirical Roofline Analysis.
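When reading the roofline panel it helps to know roughly where SAXPY should sit. Each element costs 2 floating-point operations (one multiply, one add) and moves 12 bytes (load `x[i]`, load `y[i]`, store `y[i]`), so its arithmetic intensity is about 0.17 FLOP/byte, which is typical of a memory-bandwidth-bound kernel. The sketch below is a back-of-the-envelope estimate under the assumption of no cache reuse. The function names are illustrative, and the peak and bandwidth arguments are placeholders to be filled in with the values Omniperf reports for your device.

```cpp
#include <algorithm>
#include <cassert>

// Arithmetic intensity of single-precision SAXPY in FLOP per byte,
// assuming 2 FLOPs and 3 * sizeof(float) bytes of traffic per element.
inline double saxpy_arithmetic_intensity() {
    const double flops_per_element = 2.0;
    const double bytes_per_element = 3.0 * sizeof(float);
    return flops_per_element / bytes_per_element;
}

// Roofline bound: attainable FLOP/s is limited either by the compute peak
// or by memory bandwidth multiplied by arithmetic intensity.
inline double roofline_bound(double peak_flops,
                             double bandwidth_bytes_per_s,
                             double intensity) {
    assert(peak_flops > 0.0 && bandwidth_bytes_per_s > 0.0 && intensity > 0.0);
    return std::min(peak_flops, bandwidth_bytes_per_s * intensity);
}
```

If the measured SAXPY point in the panel lies close to the sloped (bandwidth) part of the roofline, the kernel is already limited by memory traffic rather than by compute.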
The same workflow applies to the OpenMP offloading version of SAXPY:
matilda@nid001000:~> module load PrgEnv-cray/8.3.3 rocm/5.2.3 craype-accel-amd-gfx90a
matilda@nid001000:~> cat saxpy_openmp.f90
program main
  implicit none
  integer, parameter :: n = 2**24
  real :: x(n), y(n), a = 2.3
  integer :: i

  print *, "Initializing X and Y..."
  !$omp target teams
  !$omp distribute simd
  do i = 1, n
    x(i) = sqrt(real(i))
    y(i) = sqrt(1.0/real(i))
  end do
  !$omp end target teams

  print *, "Computing the SAXPY operation..."
  !$omp target teams
  !$omp distribute simd
  do i = 1, n
    y(i) = a*x(i) + y(i)
  end do
  !$omp end target teams
end program main
matilda@nid001000:~> ftn -O3 -homp -Wl,--no-relax -f PIC saxpy_openmp.f90 -o saxpy_openmp
matilda@nid001000:~> srun omniperf profile -n saxpy_openmp -- ./saxpy_openmp
...
ROC Profiler: /opt/rocm/bin/rocprof
/opt/rocm/.info/version
-------------
Profile only
-------------
omniperf ver: 1.0.6
Path: /lus/joey/scratch/pawsey0001/matilda/SAXPY_ROCM5.2.3/openmp/workloads
Target: mi200
Command: ./saxpy_openmp
Kernel Selection: None
Dispatch Selection: None
IP Blocks: All
/opt/rocm/.info/version
/opt/rocm/.info/version
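For readers working in C++ rather than Fortran, the offloaded SAXPY loop can be written along the following lines. This is a sketch and not code from the original example: the function name `saxpy_openmp` and the explicit `map` clauses are assumptions, and it uses the combined `target teams distribute parallel for` construct instead of the `distribute simd` form in the Fortran listing. When the compiler is invoked without OpenMP offloading enabled, the pragma is ignored and the loop runs on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y = a*x + y. The loop is offloaded when OpenMP target offloading is
// enabled at compile time; otherwise it executes as an ordinary host loop.
inline void saxpy_openmp(float a, const std::vector<float>& x,
                         std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const float* xp = x.data();
    float* yp = y.data();
    // x is only read on the device; y is read and written back.
    #pragma omp target teams distribute parallel for map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

A binary built from such a function can be profiled with `omniperf profile` in the same way as the Fortran executable.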
As with the HIP example, Omniperf writes the profile data to the workloads folder, which can then be transferred to your local machine and analysed with the local Omniperf client:
ubuntu@above-basilisk:~$ scp -r matilda@joey.pawsey.org.au:/scratch/pawsey0001/matilda/SAXPY_ROCM5.2.3/openmp/workloads .
ubuntu@above-basilisk:~$ omniperf analyze -p workloads/saxpy_openmp/mi200/ --gui
--------
Analyze
--------
Dash is running on http://0.0.0.0:8050/
 * Serving Flask app 'omniperf_analyze.omniperf_analyze'
 * Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:8050
 * Running on http://192.168.64.8:8050
Press CTRL+C to quit
...
Open one of the listed addresses, for example http://127.0.0.1:8050/, in your browser to view the interactive panels, including Empirical Roofline Analysis.
The OpenACC version of SAXPY follows the same pattern:
matilda@nid001000:~> module load PrgEnv-cray/8.3.3 rocm/5.2.3 craype-accel-amd-gfx90a
matilda@nid001000:~> cat saxpy_openacc.f90
program main
  implicit none
  integer, parameter :: n = 2**24
  real :: x(n), y(n), a = 2.3
  integer :: i

  print *, "Initializing X and Y..."
  !$ACC PARALLEL LOOP
  do i = 1, n
    x(i) = sqrt(real(i))
    y(i) = sqrt(1.0/real(i))
  end do
  !$ACC END PARALLEL LOOP

  print *, "Computing the SAXPY operation..."
  !$ACC PARALLEL LOOP
  do i = 1, n
    y(i) = a*x(i) + y(i)
  end do
  !$ACC END PARALLEL LOOP
end program main
matilda@nid001000:~> ftn -O3 -hacc -Wl,--no-relax -f PIC saxpy_openacc.f90 -o saxpy_openacc
matilda@nid001000:~> srun omniperf profile -n saxpy_openacc -- ./saxpy_openacc
...
ROC Profiler: /opt/rocm/bin/rocprof
/opt/rocm/.info/version
-------------
Profile only
-------------
omniperf ver: 1.0.6
Path: /lus/joey/scratch/pawsey0001/matilda/SAXPY_ROCM5.2.3/openacc/workloads
Target: mi200
Command: ./saxpy_openacc
Kernel Selection: None
Dispatch Selection: None
IP Blocks: All
/opt/rocm/.info/version
/opt/rocm/.info/version
As with the HIP example, Omniperf writes the profile data to the workloads folder, which can then be transferred to your local machine and analysed with the local Omniperf client:
ubuntu@above-basilisk:~$ scp -r matilda@joey.pawsey.org.au:/scratch/pawsey0001/matilda/SAXPY_ROCM5.2.3/openacc/workloads .
ubuntu@above-basilisk:~$ omniperf analyze -p workloads/saxpy_openacc/mi200/ --gui
--------
Analyze
--------
Dash is running on http://0.0.0.0:8050/
 * Serving Flask app 'omniperf_analyze.omniperf_analyze'
 * Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:8050
 * Running on http://192.168.64.8:8050
Press CTRL+C to quit
...
Open one of the listed addresses, for example http://127.0.0.1:8050/, in your browser to view the interactive panels, including Empirical Roofline Analysis.
Omnitrace
Omnitrace is an AMD research initiative aimed at gathering runtime performance data for software applications. It is compatible with programs coded in C, C++, Fortran, and Python, as well as with computational frameworks such as OpenCL and HIP. Please ensure that you load the necessary modules for Omnitrace.
Profiling is done in two steps. First, omnitrace-instrument instruments the application and writes a *.inst file. Second, omnitrace-run runs the generated *.inst file.
matilda@nid001000:~> module load omnitrace/1.10.2
matilda@nid001000:~> omnitrace-instrument -o saxpy.inst -- ./saxpy
...
matilda@nid001000:~> omnitrace-run -- ./saxpy.inst
...
This creates the omnitrace-saxpy.inst-output folder with a date-stamped subfolder in which the profile data are stored. At this stage you can download the *.proto file to a local computer and open it with ui.perfetto.dev. The Perfetto view shows the timing and duration of code execution on the host, as well as the timing of kernel executions on the device. You should also be able to examine all host-to-device and device-to-host data transfers.
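As a sanity check when inspecting the trace of the HIP example, it is useful to know in advance which copies to expect: `x` and `y` are copied host-to-device before the kernel, and `y` is copied back afterwards. The sketch below lists those copies and gives a lower bound on their duration. The type `Transfer`, both function names, and the bandwidth argument are illustrative assumptions, not part of Omnitrace or HIP, and the bandwidth must be supplied for your own system.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct Transfer {
    std::string direction;  // "HostToDevice" or "DeviceToHost"
    std::uint64_t bytes;    // size of the copy in bytes
};

// Copies issued by the HIP SAXPY example for n floats, in program order:
// x and y go to the device, then y comes back to the host.
inline std::vector<Transfer> expected_transfers(std::uint64_t n) {
    const std::uint64_t bytes = n * sizeof(float);
    return { {"HostToDevice", bytes},
             {"HostToDevice", bytes},
             {"DeviceToHost", bytes} };
}

// Lower bound on copy time in seconds for a sustained bandwidth in bytes/s.
inline double min_transfer_seconds(std::uint64_t bytes,
                                   double bandwidth_bytes_per_s) {
    assert(bandwidth_bytes_per_s > 0.0);
    return static_cast<double>(bytes) / bandwidth_bytes_per_s;
}
```

For N = 1<<30 each copy is 4 GiB, so a trace showing substantially fewer or smaller transfers would suggest that the run did not execute as intended.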
The OpenMP version is instrumented and run in the same two steps:
matilda@nid001000:~> module load omnitrace/1.10.2
matilda@nid001000:~> omnitrace-instrument -o saxpy_openmp.inst -- ./saxpy_openmp
...
matilda@nid001000:~> omnitrace-run -- ./saxpy_openmp.inst
...
This creates the omnitrace-saxpy_openmp.inst-output folder with a date-stamped subfolder in which the profile data are stored. As before, you can download the *.proto file to a local computer and open it with ui.perfetto.dev to inspect host execution, kernel executions on the device, and all host-to-device and device-to-host data transfers.
The OpenACC version is instrumented and run in the same two steps:
matilda@nid001000:~> module load omnitrace/1.10.2
matilda@nid001000:~> omnitrace-instrument -o saxpy_openacc.inst -- ./saxpy_openacc
...
matilda@nid001000:~> omnitrace-run -- ./saxpy_openacc.inst
...
This creates the omnitrace-saxpy_openacc.inst-output folder with a date-stamped subfolder in which the profile data are stored. As before, you can download the *.proto file to a local computer and open it with ui.perfetto.dev to inspect host execution, kernel executions on the device, and all host-to-device and device-to-host data transfers.
Installing Omniperf
matilda@nid001000:/software/pawsey0001/matilda> module load py-pip/23.1.2-py3.10.10
matilda@nid001000:/software/pawsey0001/matilda> wget https://github.com/AMDResearch/omniperf/releases/download/v1.0.6/omniperf-v1.0.6.tar.gz
matilda@nid001000:/software/pawsey0001/matilda> tar -xf omniperf-v1.0.6.tar.gz
matilda@nid001000:/software/pawsey0001/matilda> cd omniperf-1.0.6/
matilda@nid001000:/software/pawsey0001/matilda/omniperf-1.0.6> cmake --version
cmake version 3.24.3
CMake suite maintained and supported by Kitware (kitware.com/cmake).
matilda@nid001000:/software/pawsey0001/matilda/omniperf-1.0.6> export INSTALL_DIR=/software/pawsey0001/matilda/omniperf
matilda@nid001000:/software/pawsey0001/matilda/omniperf-1.0.6> python3 -m pip install -t ${INSTALL_DIR}/python-libs -r requirements.txt
...
matilda@nid001000:/software/pawsey0001/matilda/omniperf-1.0.6> mkdir build
matilda@nid001000:/software/pawsey0001/matilda/omniperf-1.0.6> cd build/
matilda@nid001000:/software/pawsey0001/matilda/omniperf-1.0.6/build> cmake -DCMAKE_INSTALL_PREFIX=${INSTALL_DIR}/1.0.6 \
> -DPYTHON_DEPS=${INSTALL_DIR}/python-libs \
> -DMOD_INSTALL_PATH=${INSTALL_DIR}/modulefiles ..
...
matilda@nid001000:/software/pawsey0001/matilda/omniperf-1.0.6/build> make install
...
matilda@nid001000:/software/pawsey0001/matilda/omniperf-1.0.6/build> module use /software/pawsey0001/matilda/omniperf/modulefiles
matilda@nid001000:/software/pawsey0001/matilda/omniperf-1.0.6/build> module load omniperf/1.0.6