With the update of ROCm to version 5.2.3, improved profiling tools are available for GPU applications, because this version of ROCm enables full access to hardware performance counters. As a result, rocProf can now collect performance counters for kernels running on AMD GPUs. rocProf works with HIP kernels as well as with OpenMP and OpenACC applications that offload to the GPU.

Omniperf

Omniperf is an open-source tool for interpreting profile data collected by rocProf. Let's look at how Omniperf can be used with an example code that performs a scalar multiplication followed by a vector addition (SAXPY). The first example is written in HIP.

matilda@nid001000:~> module load PrgEnv-cray/8.3.3 rocm/5.2.3 craype-accel-amd-gfx90a
matilda@nid001000:~> cat saxpy.cpp 
#include "hip/hip_runtime.h"
#include <stdio.h>
#include <stdlib.h>

__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  int N = 1<<30;
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  hipMalloc(&d_x, N*sizeof(float));
  hipMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  hipMemcpy(d_x, x, N*sizeof(float), hipMemcpyHostToDevice);
  hipMemcpy(d_y, y, N*sizeof(float), hipMemcpyHostToDevice);

  // Perform SAXPY on N = 2^30 (about one billion) elements
  hipLaunchKernelGGL(saxpy, dim3((N+255)/256), dim3(256), 0, 0, N, 2.0f, d_x, d_y);

  hipMemcpy(y, d_y, N*sizeof(float), hipMemcpyDeviceToHost);

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = max(maxError, abs(y[i]-4.0f));
  printf("Max error: %f\n", maxError);

  hipFree(d_x);
  hipFree(d_y);
  free(x);
  free(y);
}
matilda@nid001000:~> hipcc --amdgpu-target=gfx90a -g -c -o saxpy.o saxpy.cpp
matilda@nid001000:~> hipcc --amdgpu-target=gfx90a saxpy.o -o saxpy
matilda@nid001000:~> srun omniperf profile -n saxpy -- ./saxpy
...
ROC Profiler:  /opt/rocm/bin/rocprof

/opt/rocm/.info/version

-------------
Profile only
-------------

omniperf ver:  1.0.6
Path:  /lus/joey/scratch/pawsey0001/matilda/SAXPY_ROCM5.2.3/workloads
Target:  mi200
Command:  ./saxpy
Kernel Selection:  None
Dispatch Selection:  None
IP Blocks: All
/opt/rocm/.info/version
/opt/rocm/.info/version

Omniperf writes the profile data to the workloads folder. It is good practice to analyse the profile data on your local machine, which avoids the overhead of remote communication. To do so, install Omniperf on your local machine, preferably the same version as on the remote system. The installation steps are similar to those for the remote machine, which are given at the end of this page. Once Omniperf is installed locally, transfer the workloads folder to your computer and analyse it with the local Omniperf client:

ubuntu@above-basilisk:~$ scp -r matilda@joey.pawsey.org.au:/scratch/pawsey0001/matilda/SAXPY_ROCM5.2.3/workloads .
ubuntu@above-basilisk:~$ omniperf analyze -p workloads/saxpy/mi200/ --gui

--------
Analyze
--------

Dash is running on http://0.0.0.0:8050/

 * Serving Flask app 'omniperf_analyze.omniperf_analyze'
 * Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:8050
 * Running on http://192.168.64.8:8050
Press CTRL+C to quit
...

At this stage, open your preferred browser and navigate to http://127.0.0.1:8050/. The address 0.0.0.0 in the output only means that the server listens on all network interfaces.

The page is interactive and contains several panels, including Empirical Roofline Analysis.
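As a rough aid for reading the roofline panel, the sketch below estimates where a single-precision SAXPY kernel would be expected to sit. It assumes 2 floating-point operations per element (one multiply, one add) and 12 bytes of memory traffic per element (load x, load y, store y, 4 bytes each, with no cache reuse). The function names and the 1000 GB/s bandwidth are placeholders for illustration, not values measured on any device.

```shell
# Arithmetic intensity of single-precision SAXPY in FLOP per byte.
# Assumption: 2 FLOPs and 3 * 4 bytes of memory traffic per element.
saxpy_intensity() {
  awk 'BEGIN { flops = 2; bytes = 3 * 4; printf "%.4f\n", flops / bytes }'
}

# Bandwidth-bound performance ceiling in GFLOP/s.
# $1: memory bandwidth in GB/s (take it from the roofline panel or the
#     device specification; the value used below is a placeholder).
saxpy_ceiling_gflops() {
  awk -v bw="$1" 'BEGIN { printf "%.1f\n", bw * 2 / 12 }'
}

saxpy_intensity             # prints 0.1667
saxpy_ceiling_gflops 1000   # prints 166.7
```

With an intensity this low, the kernel would be expected to appear on the bandwidth-limited (sloped) part of the roofline rather than under the compute ceiling.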

 

The same workflow applies to a Fortran version of SAXPY that uses OpenMP GPU offloading.

matilda@nid001000:~> module load PrgEnv-cray/8.3.3 rocm/5.2.3 craype-accel-amd-gfx90a
matilda@nid001000:~> cat saxpy_openmp.f90 
program main

   implicit none

   integer, parameter :: n = 2**24
   real :: x(n), y(n), a = 2.3
   integer :: i

   print *, "Initializing X and Y..."

!$omp target teams
!$omp distribute simd
   do i = 1, n
      x(i) = sqrt(real(i))
      y(i) = sqrt(1.0/real(i))
   end do
!$omp end target teams

   print *, "Computing the SAXPY operation..."

!$omp target teams
!$omp distribute simd
   do i = 1, n
      y(i) = a*x(i) + y(i)
   end do
!$omp end target teams

end program main
matilda@nid001000:~> ftn -O3 -homp -Wl,--no-relax -f PIC saxpy_openmp.f90 -o saxpy_openmp 
matilda@nid001000:~> srun omniperf profile -n saxpy_openmp -- ./saxpy_openmp 
... 
ROC Profiler:  /opt/rocm/bin/rocprof

/opt/rocm/.info/version

-------------
Profile only
-------------

omniperf ver:  1.0.6
Path:  /lus/joey/scratch/pawsey0001/matilda/SAXPY_ROCM5.2.3/openmp/workloads
Target:  mi200
Command:  ./saxpy_openmp
Kernel Selection:  None
Dispatch Selection:  None
IP Blocks: All
/opt/rocm/.info/version
/opt/rocm/.info/version

As in the HIP example, Omniperf writes the profile data to the workloads folder. Transfer that folder to your local machine and analyse it there with a local Omniperf installation, preferably of the same version as on the remote system.

ubuntu@above-basilisk:~$ scp -r matilda@joey.pawsey.org.au:/scratch/pawsey0001/matilda/SAXPY_ROCM5.2.3/openmp/workloads .
ubuntu@above-basilisk:~$ omniperf analyze -p workloads/saxpy_openmp/mi200/ --gui

--------
Analyze
--------

Dash is running on http://0.0.0.0:8050/

 * Serving Flask app 'omniperf_analyze.omniperf_analyze'
 * Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:8050
 * Running on http://192.168.64.8:8050
Press CTRL+C to quit
...

As before, open http://127.0.0.1:8050/ in your preferred browser to explore the interactive panels, including Empirical Roofline Analysis.

 

The same workflow also applies to a Fortran version of SAXPY that uses OpenACC.

matilda@nid001000:~> module load PrgEnv-cray/8.3.3 rocm/5.2.3 craype-accel-amd-gfx90a
matilda@nid001000:~> cat saxpy_openacc.f90 
program main

   implicit none

   integer, parameter :: n = 2**24
   real :: x(n), y(n), a = 2.3
   integer :: i

   print *, "Initializing X and Y..."

!$ACC PARALLEL LOOP
   do i = 1, n
      x(i) = sqrt(real(i))
      y(i) = sqrt(1.0/real(i))
   end do
!$ACC END PARALLEL LOOP

   print *, "Computing the SAXPY operation..."

!$ACC PARALLEL LOOP
   do i = 1, n
      y(i) = a*x(i) + y(i)
   end do
!$ACC END PARALLEL LOOP

end program main
matilda@nid001000:~> ftn -O3 -hacc -Wl,--no-relax -f PIC saxpy_openacc.f90 -o saxpy_openacc 
matilda@nid001000:~> srun omniperf profile -n saxpy_openacc -- ./saxpy_openacc
...
ROC Profiler:  /opt/rocm/bin/rocprof

/opt/rocm/.info/version

-------------
Profile only
-------------

omniperf ver:  1.0.6
Path:  /lus/joey/scratch/pawsey0001/matilda/SAXPY_ROCM5.2.3/openacc/workloads
Target:  mi200
Command:  ./saxpy_openacc
Kernel Selection:  None
Dispatch Selection:  None
IP Blocks: All
/opt/rocm/.info/version
/opt/rocm/.info/version

As in the previous examples, Omniperf writes the profile data to the workloads folder. Transfer that folder to your local machine and analyse it there with a local Omniperf installation, preferably of the same version as on the remote system.

ubuntu@above-basilisk:~$ scp -r matilda@joey.pawsey.org.au:/scratch/pawsey0001/matilda/SAXPY_ROCM5.2.3/openacc/workloads .
ubuntu@above-basilisk:~$ omniperf analyze -p workloads/saxpy_openacc/mi200/ --gui

--------
Analyze
--------

Dash is running on http://0.0.0.0:8050/

 * Serving Flask app 'omniperf_analyze.omniperf_analyze'
 * Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:8050
 * Running on http://192.168.64.8:8050
Press CTRL+C to quit
...

As before, open http://127.0.0.1:8050/ in your preferred browser to explore the interactive panels, including Empirical Roofline Analysis.
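Judging from the three examples above, the folder passed to "omniperf analyze -p" follows the pattern workloads/<name>/<target>/, where <name> is the value given to "omniperf profile -n" and <target> is the target reported in the profile output (mi200 here). The helper below is a small sketch built on that observation; the function name workload_path is hypothetical, and the layout is inferred from this page rather than taken from the Omniperf documentation.

```shell
# Build the analysis folder for a profiled workload.
# $1: name passed to "omniperf profile -n"
# $2: target reported by the profile run (defaults to mi200, as on this page)
workload_path() {
  name="$1"
  target="${2:-mi200}"
  printf 'workloads/%s/%s/\n' "$name" "$target"
}

# Check which of the three example workloads are present in the current folder.
for name in saxpy saxpy_openmp saxpy_openacc; do
  path=$(workload_path "$name")
  if [ -d "$path" ]; then
    echo "found:   $path"
  else
    echo "missing: $path"
  fi
done
```

Running this check before starting the GUI helps catch a mismatch between the profile name and the path given to the analyze step.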

 

Omnitrace

Omnitrace is an AMD research initiative aimed at gathering runtime performance data for software applications. It is compatible with programs coded in C, C++, Fortran, and Python, as well as with computational frameworks such as OpenCL and HIP. Please ensure that you load the necessary modules for Omnitrace.


Profiling is done in two steps. First, omnitrace-instrument instruments the application and writes an instrumented binary (*.inst). Second, omnitrace-run executes the instrumented binary. For the HIP example:

matilda@nid001000:~> module load omnitrace/1.10.2
matilda@nid001000:~> omnitrace-instrument -o saxpy.inst -- ./saxpy 
...
matilda@nid001000:~> omnitrace-run -- ./saxpy.inst
...

This creates the omnitrace-saxpy.inst-output folder with a date-stamped subfolder in which the profile data are stored. At this stage you can download the *.proto file to a local computer and open it with ui.perfetto.dev. The Perfetto view shows the timing and duration of code execution on the host, as well as the timing of kernel execution on the device. You should also be able to examine all host-to-device and device-to-host data transfers.
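Because each run adds a new date-stamped subfolder, it can be convenient to look up the most recent trace file before downloading it. The sketch below does this with GNU find; the function name latest_proto is hypothetical, and it only assumes that the trace files end in .proto, as described above.

```shell
# Print the most recently modified *.proto file below a folder.
# $1: Omnitrace output folder, e.g. omnitrace-saxpy.inst-output
latest_proto() {
  find "$1" -type f -name '*.proto' -printf '%T@ %p\n' |
    sort -n |        # oldest first, ordered by modification time
    tail -n 1 |      # keep the newest entry
    cut -d' ' -f2-   # drop the timestamp, keep the path
}

# Only query the folder if it exists in the current directory.
if [ -d omnitrace-saxpy.inst-output ]; then
  latest_proto omnitrace-saxpy.inst-output
fi
```

The printed path can then be passed to scp on the local machine.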

 

The same two steps apply to the OpenMP example:

matilda@nid001000:~> module load omnitrace/1.10.2
matilda@nid001000:~> omnitrace-instrument -o saxpy_openmp.inst -- ./saxpy_openmp 
...
matilda@nid001000:~> omnitrace-run -- ./saxpy_openmp.inst
...

This creates the omnitrace-saxpy_openmp.inst-output folder with a date-stamped subfolder in which the profile data are stored. As with the HIP example, you can download the *.proto file to a local computer and open it with ui.perfetto.dev to inspect host execution, kernel execution on the device, and the host-to-device and device-to-host data transfers.

 

The same two steps apply to the OpenACC example:

matilda@nid001000:~> module load omnitrace/1.10.2
matilda@nid001000:~> omnitrace-instrument -o saxpy_openacc.inst -- ./saxpy_openacc
...
matilda@nid001000:~> omnitrace-run -- ./saxpy_openacc.inst
...

This creates the omnitrace-saxpy_openacc.inst-output folder with a date-stamped subfolder in which the profile data are stored. As with the HIP example, you can download the *.proto file to a local computer and open it with ui.perfetto.dev to inspect host execution, kernel execution on the device, and the host-to-device and device-to-host data transfers.
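The three Omnitrace runs above differ only in the name of the executable, so the command lines can be generated from that name. The sketch below prints the two commands for each example without running them; the function name omnitrace_commands is hypothetical, and the commands themselves are copied from the sessions above.

```shell
# Print the instrument and run commands for one executable.
# $1: name of the executable in the current folder, e.g. saxpy
omnitrace_commands() {
  printf 'omnitrace-instrument -o %s.inst -- ./%s\n' "$1" "$1"
  printf 'omnitrace-run -- ./%s.inst\n' "$1"
}

# Print the commands for the three examples on this page.
for app in saxpy saxpy_openmp saxpy_openacc; do
  omnitrace_commands "$app"
done
```

Printing the commands first makes it easy to review them; piping the output to sh on a node where the omnitrace module is loaded would then execute them in order.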

 

Installing omniperf

matilda@nid001000:/software/pawsey0001/matilda> module load py-pip/23.1.2-py3.10.10
matilda@nid001000:/software/pawsey0001/matilda> wget https://github.com/AMDResearch/omniperf/releases/download/v1.0.6/omniperf-v1.0.6.tar.gz
matilda@nid001000:/software/pawsey0001/matilda> tar -xf omniperf-v1.0.6.tar.gz
matilda@nid001000:/software/pawsey0001/matilda> cd omniperf-1.0.6/
matilda@nid001000:/software/pawsey0001/matilda/omniperf-1.0.6> cmake --version
cmake version 3.24.3

CMake suite maintained and supported by Kitware (kitware.com/cmake).
matilda@nid001000:/software/pawsey0001/matilda/omniperf-1.0.6> export INSTALL_DIR=/software/pawsey0001/matilda/omniperf
matilda@nid001000:/software/pawsey0001/matilda/omniperf-1.0.6> python3 -m pip install -t ${INSTALL_DIR}/python-libs -r requirements.txt
...
matilda@nid001000:/software/pawsey0001/matilda/omniperf-1.0.6> mkdir build
matilda@nid001000:/software/pawsey0001/matilda/omniperf-1.0.6> cd build/
matilda@nid001000:/software/pawsey0001/matilda/omniperf-1.0.6/build> cmake -DCMAKE_INSTALL_PREFIX=${INSTALL_DIR}/1.0.6 \
>         -DPYTHON_DEPS=${INSTALL_DIR}/python-libs \
>         -DMOD_INSTALL_PATH=${INSTALL_DIR}/modulefiles ..
...
matilda@nid001000:/software/pawsey0001/matilda/omniperf-1.0.6/build> make install
...
matilda@nid001000:/software/pawsey0001/matilda/omniperf-1.0.6/build> module use /software/pawsey0001/matilda/omniperf/modulefiles
matilda@nid001000:/software/pawsey0001/matilda/omniperf-1.0.6/build> module load omniperf/1.0.6 
