
Athena User Guide



This guide is currently under development during the early adopter phase; feedback can be provided via the Pawsey Service Desk.

Overview


Athena is an SGI Linux cluster primarily intended to allow Pawsey researchers to engage with cutting-edge technologies, and inform the eventual capital refresh of the current petascale system.

It consists of two emerging technology architectures:

  • An Intel Knights Landing (KNL) Xeon Phi cluster with a 100 Gbps Omni-Path interconnect
  • An NVIDIA Pascal GPU cluster with a 100 Gbps EDR InfiniBand interconnect

A summary of the node configuration is provided in the following table:

Athena KNL Nodes

Hostnames: a001-a080
SLURM Partition: knlq
Processor: 1x Intel Xeon Phi 7210 @ 1.30 GHz
  • 64 cores, each with 4 hardware threads
  • Cluster mode: Quadrant; Memory mode: Cache
  • 32 KB L1i/L1d per core, 1 MB L2 shared by 2 cores, no L3
  • 16 GB MCDRAM with up to 450 GB/s bandwidth
Memory: 96 GB DDR4 with up to 102 GB/s bandwidth
Interconnect: 100 Gbps Omni-Path

Athena Pascal Nodes

Hostnames: a081-a091
SLURM Partition: gpuq
Host Processors: 2x Intel Xeon Broadwell E5-2680 v4 @ 2.40 GHz
  • 14 cores per CPU
  • 32 KB L1i/L1d per core, 256 KB L2 per core, 32 MB L3
Memory: 128 GB DDR4 with up to 76.8 GB/s bandwidth
GPU Processors: 4x NVIDIA Tesla P100
  • 3584 CUDA cores per GPU
  • 16 GB global memory per GPU
  • NVLink between GPUs
Interconnect: 100 Gbps EDR InfiniBand

Access to nodes is managed by the SLURM queuing system via the partitions listed in the table above.

To facilitate data management for this work, Athena shares the /home, /group and /scratch file systems with other Pawsey Supercomputing Centre systems.

Back to Top

Configurations on KNL partition

KNL offers a variety of configurations through its memory types (DDR and MCDRAM), memory modes (Cache, Flat and Hybrid), and cluster modes (All-to-All, Quadrant, Hemisphere, SNC-2 and SNC-4). These modes correspond to different programming styles and provide enhanced performance when used with suitable applications. They are set at boot time; the current settings on the Athena knlq partition are Quadrant cluster mode and Cache memory mode.

The meanings of the three memory modes and their implications on programming are listed below.

Cache
  Characteristics:
  • MCDRAM is configured as a cache for the DDR memory.
  • MCDRAM does not appear as a separate NUMA node.
  Implications on programming:
  • Transparent to software; no code changes are required.
  • Cache-friendly code can benefit from the high bandwidth provided by MCDRAM.
  • Code with high cache miss rates may suffer, as memory accesses must go through MCDRAM to reach DDR.

Flat
  Characteristics:
  • MCDRAM is configured as memory just like DDR and is part of the same address space.
  • MCDRAM appears as a separate NUMA node from DDR.
  Implications on programming:
  • Code changes may be needed. Tools such as numactl, memkind and autohbw can be used to explicitly allocate data in MCDRAM (see the sketch after this table).
  • Data allocated in MCDRAM benefits from the high bandwidth regardless of hit rate, unlike in Cache mode.
  • Bandwidth-sensitive data should be allocated in MCDRAM and the rest in DDR.

Hybrid
  Characteristics:
  • A portion of MCDRAM is configured as cache and the remainder is configured as flat memory in the same address space as DDR.
  • The flat portion of MCDRAM appears as a separate NUMA node from DDR.
  Implications on programming:
  • A combination of Cache mode and Flat mode.
  • Suitable for applications that benefit from general caching and can also take advantage of high-bandwidth memory by storing critical or frequently accessed data in flat memory.
  • The amount of cache memory (25% or 50%) or flat memory (75% or 50%) is less than in pure Cache mode or pure Flat mode.
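
As a minimal sketch of explicit MCDRAM placement in Flat mode (this assumes a node booted in Flat mode, where MCDRAM typically appears as NUMA node 1 and DDR as node 0; ./application is an illustrative executable name), numactl can place an entire application's allocations into MCDRAM without code changes:

module load numactl
srun numactl --membind=1 ./application     # allocations must fit within the 16 GB MCDRAM
srun numactl --preferred=1 ./application   # prefer MCDRAM, fall back to DDR when it is full

Note that the knlq nodes currently run in Cache mode, so this applies only to nodes that have been specifically requested in Flat mode.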

The five cluster modes, which fall into three variants, are explained below.

All-to-All: memory transactions originate from any tile, may be routed to any tag directory, and then to any memory controller.

Quadrant/Hemisphere: transactions originate from any tile and go to a tag directory colocated in the same quadrant/hemisphere as the target memory.

SNC-4/SNC-2: the tile, tag directory, and target memory controller are all colocated within the same cluster (SNC-4: 4 clusters; SNC-2: 2 clusters).

The interactions of the different memory modes and cluster modes are summarized below.

All-to-All
  • Cache: 100% cache, 0% memory. No separate MCDRAM NUMA node; only the DDR node is visible.
  • Flat: 0% cache, 100% memory. 2 NUMA nodes in total: 1 for MCDRAM and 1 for DDR.
  • Hybrid: 25% or 50% cache, 75% or 50% memory.

Quadrant/Hemisphere
  • Cache: 100% cache, 0% memory. No separate MCDRAM NUMA node; only the DDR node is visible.
  • Flat: 0% cache, 100% memory. 2 NUMA nodes in total: 1 for MCDRAM and 1 for DDR.
  • Hybrid: 25% or 50% cache, 75% or 50% memory.

SNC-4/SNC-2
  • Cache: 100% cache, 0% memory. 4 NUMA nodes in total: one DDR node per cluster (SNC-2: 2 in total).
  • Flat: 0% cache, 100% memory. 8 NUMA nodes in total: one MCDRAM node and one DDR node per cluster (SNC-2: 4 in total).
  • Hybrid: 25% or 50% cache, 75% or 50% memory.

Users can query the memory configuration by running the 'numactl -H' command. For example, a006 is configured as Quadrant-Cache, and only 1 NUMA node, node 0, is shown in the returned result. Node 0 is the entire 96 GB of DDR; MCDRAM does not show as a NUMA node, since it is configured as a cache. Note that the numactl module should be used, as it is a more recent version than the packaged one.

cyang@athena:~> salloc -p knlq -n 1 
salloc: Granted job allocation 2791
cyang@athena:~> srun hostname
a006
cyang@athena:~> module load knl intel numactl
cyang@athena:~> srun numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
node 0 size: 96529 MB
node 0 free: 90206 MB
node distances:
node   0 
  0:  10 
cyang@athena:~> exit
exit
salloc: Relinquishing job allocation 2791

Back to Top

Connecting and Transferring Files


Logging in

Interactive access is via secure shell (ssh) to the address "athena.pawsey.org.au":

~> ssh [username]@athena.pawsey.org.au

The front-end node "athena" handles logins, editing, and other standard activities.

To enable X11 forwarding, use the -X option:

~> ssh -X [username]@athena.pawsey.org.au

Mac users may need to specify ssh -Y (a relaxed security mode) instead of ssh -X to get a reliable X connection. If you wish to log in from a Windows machine, you will need to install an appropriate ssh client, such as MobaXterm. A number of X11 client/server implementations are also available for Windows if GUI work is required. Please refer to the Remote GUI Programs with X over SSH and Remote GUI Programs with VNC over SSH pages.

Files may be transferred using secure copy (scp). See the cross-centre Data Management pages for further details on file systems and transferring files.
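
For example, a file on a local machine might be copied to your /scratch directory as follows; the /scratch/[project]/[username] path layout is illustrative only:

~> scp results.dat [username]@athena.pawsey.org.au:/scratch/[project]/[username]/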

Back to Top

Development Environment


The Module System and Compilers

Athena uses a framework referred to as the Module System to manage how users see different software packages and different versions of the same package. The loaded modules determine which compilers and libraries are used when compiling and linking code; it can also be important to load the correct modules at run time, even when using an application compiled by someone else. A number of compilers are available on Athena, from two main suites: Intel and GNU. For GPU code, the NVIDIA "nvcc" compiler is available. Another important module is mvapich, which provides a Message Passing Interface (MPI) implementation that supports both the Omni-Path and InfiniBand interconnects and is Application Binary Interface (ABI) compatible with the relevant MPICH version. All of these are described in greater detail below.

A number of commands are available to provide information about the module system. To use the contents of a module, the module must be loaded.

What modules are loaded: module list

The module list command provides a numbered list of the currently loaded modules, which are usually displayed in the form "name/version".

athena:~> module list
Currently Loaded Modulefiles:
  1) pawseytools/1.13   2) slurm/16.05.8

As Athena has several processor architectures, relevant modules such as architecture, compilers, and MPI are not loaded by default. More details are provided for the relevant architectures later in this section.

What modules are available: module avail

The module avail command will provide a full list of all the modules that are currently available (but not necessarily loaded). It may be seen that some modules with the same name will be available in a number of different versions, one of which will be labelled (default).

athena:~> module avail

Loading, Unloading, and Swapping Modules

To load a new module, use the module load command, e.g.,

athena:~> module load python

Note that here python is specified without a version number, which means the default version is loaded. If a specific available version is wanted, we must unload the existing one and load the required version explicitly:

athena:~> module unload python
athena:~> module load python/3.6.1

A short-cut to replace the separate unload / load steps is available: module swap, e.g.,

athena:~> module swap python python/3.6.1

which unloads python and loads python/3.6.1.

Additional module information

A list of dependencies and environment variables associated with a module can be obtained via module display:

athena:~> module display intel

Architecture, compiler, and MPI environments

The following table provides details of the relevant modules to use for different combinations of processor architecture and compiler suites.

Intel Xeon Phi (KNL)
  • Intel 17: module load knl intel mvapich
  • Intel 16: module load knl intel/16.0.4 mvapich
  • GCC 6.3.0: module load knl gcc mvapich
  • GCC 5.4.0: module load knl gcc/5.4.0 mvapich
  • GCC 4.8.5: module load knl gcc/4.8.5 mvapich

NVIDIA Tesla (Pascal)
  • Intel 17: module load broadwell intel cuda mvapich-gpu
  • Intel 16: module load broadwell intel/16.0.4 cuda mvapich-gpu
  • GCC 6.3.0: not currently supported
  • GCC 5.4.0: module load broadwell gcc/5.4.0 cuda mvapich-gpu
  • GCC 4.8.5: module load broadwell gcc/4.8.5 cuda mvapich-gpu

The mvapich/mvapich-gpu module can be omitted if MPI support is not needed.

The architecture modules (knl, broadwell, sandybridge) ensure other modules use the correct paths for libraries compiled against that architecture. They do not provide a cross-compilation environment.

Back to Top

KNL Compilation


Environment

Compilation for the Xeon Phi should occur on the compute nodes in the knlq partition, either interactively for simple programs or via a job script for larger software suites. Compiling elsewhere (for example on the login node) may result in compilation errors or reduced performance.

To compile interactively, first submit a job request with salloc:

athena:~> salloc --partition knlq --time 1:00:00 --nodes 1

Once the job has started, load the relevant modules to specify the compiler as detailed in the previous section, for example:

athena:~> module load knl intel mvapich

Compilation commands can then be executed on the allocated compute node with srun, for example:

athena:~> srun icc -O2 -xMIC-AVX512 code.c 

To compile via a job script, first prepare a script that requests a node, loads the relevant environment, and executes the compilation commands. For example, a script file named compile.slurm might contain:

#!/bin/bash --login
#SBATCH --nodes=1
#SBATCH --partition=knlq
#SBATCH --time=00:10:00
#SBATCH --account=[your-account]
#SBATCH --export=NONE

module load knl intel mvapich

icc -O2 -xMIC-AVX512 code.c

The script can then be submitted to the queue with:

athena:~> sbatch compile.slurm

See the SLURM section below for more details on working with the scheduler.

Compiling an OpenMP application

Compiler options for OpenMP applications are given below:

gcc module
  • C: gcc -O2 -fopenmp -march=knl code.c
  • C++: g++ -O2 -fopenmp -march=knl code.cpp
  • Fortran: gfortran -O2 -fopenmp -march=knl code.f90

intel module
  • C: icc -O2 -qopenmp -xMIC-AVX512 code.c
  • C++: icpc -O2 -qopenmp -xMIC-AVX512 code.cpp
  • Fortran: ifort -O2 -qopenmp -xMIC-AVX512 code.f90

Compiling MPI applications

To compile an application using MPI (with either the GNU or Intel compilers), access to the MPI library is provided via the MVAPICH module mvapich. If it has not already been loaded as described in the module section above, this module must be loaded after the relevant architecture and compiler modules:

module load mvapich

The appropriate compiler wrapper should then be invoked; this will call the appropriate underlying compiler (GNU or Intel) with additional flags to locate the MPI header files and libraries. The MPI compiler wrappers are:

  • mpicc
  • mpicxx
  • mpif90
  • mpif08

for, respectively, C, C++, and Fortran (1990 and 2008 standards).
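
For example, a C MPI application might be compiled in the Intel KNL environment (after module load knl intel mvapich) with the mpicc wrapper; code_mpi.c is an illustrative source file name:

mpicc -O2 -xMIC-AVX512 code_mpi.c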

To compile a hybrid MPI/OpenMP application, use the compiler wrappers along with the appropriate OpenMP compiler switch as set out in the table above. For example, a mixed OpenMP/MPI hybrid application in C may be compiled in the default GNU environment via

mpicc -fopenmp -O2 code_hybrid.c


Using Maali on KNL nodes

To compile a package using Pawsey's build system Maali, you should set the following environment variable:

export MAALI_CORES=16

By default Maali will try to use all available cores during compilation; on a KNL node this means all 256 logical CPUs, which results in resource contention.


Back to Top

NVIDIA Pascal Compilation


Environment

Compilation for the NVIDIA Pascal should occur on the compute nodes in the gpuq partition, either interactively for simple programs or via a job script for larger software suites.

To compile interactively, first submit a job request with salloc:

athena:~> salloc --partition gpuq --time 1:00:00 --nodes 1 --gres=gpu:1

Once the job has started, load the relevant modules to specify the compiler as detailed in the modules section, for example:

athena:~> module load broadwell gcc/5.4.0 cuda mvapich-gpu

Compilation commands can then be executed on the allocated compute node with srun, for example:

athena:~> srun nvcc -O2 -arch=sm_60 code_host.c code_cuda.cu 

To compile via a job script, first prepare a script that requests a node, loads the relevant environment, and executes the compilation commands. For example, a script file named compile.slurm might contain:

#!/bin/bash --login
#SBATCH --nodes=1
#SBATCH --partition=gpuq
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00
#SBATCH --account=[your-account]
#SBATCH --export=NONE

module load broadwell gcc/5.4.0 cuda mvapich-gpu

nvcc -O2 -arch=sm_60 code_host.c code_cuda.cu

The script can then be submitted to the queue with:

athena:~> sbatch compile.slurm

See the SLURM section below for more details on working with the scheduler.

Compiling a CUDA application

The easiest strategy for CUDA applications using the CUDA run time API is to use nvcc for all relevant source files. The NVIDIA nvcc compiler automatically delegates some phases of the compilation to a host compiler, which on Athena comes from the GNU suite, so it is appropriate to load a "gcc" compiler module. If it has not already been loaded as described in the module section above, the cuda module must be loaded in addition to the host compiler:

module load cuda

Compilation can then take place with, e.g.,

nvcc -O2 -arch=sm_60 code_host.c code_cuda.cu

to provide an executable a.out.

Alternatively, if separate compilation and link stages are required, compilation only (with the "-c" option) can take place via, e.g.,

g++ -c code_host.cpp
nvcc -c code_cuda.cu

The link stage should then be performed with the host compiler, including the CUDA run time library via "-lcudart":

g++ code_cuda.o code_host.o -lcudart

Compiling an OpenACC application

Similarly, an OpenACC application can be compiled either interactively via salloc or in a batch job:

athena:~> salloc --partition gpuq --time 1:00:00 --nodes 1 --gres=gpu:1
athena:~> module load pgi
athena:~> srun pgcc -acc code_openacc.c 

or

#!/bin/bash --login
#SBATCH --nodes=1
#SBATCH --partition=gpuq
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00
#SBATCH --account=[your-account]
#SBATCH --export=NONE

module load pgi
pgcc -acc code_openacc.c

The PGI compilers for C++ and Fortran are pgc++ and pgfortran.

Back to Top

Submitting jobs to SLURM


On Athena, the SLURM resource manager is in use. SLURM controls the allocation of resources (compute nodes) to users' jobs and monitors those jobs while they run.

Requesting KNL nodes

The Knights Landing Xeon Phi nodes are available via the knlq partition in SLURM, and one or more nodes can be requested, e.g.:

#SBATCH --partition=knlq
#SBATCH --nodes=2

The Xeon Phi nodes are by default configured in Cache memory mode and Quadrant cluster mode; contact the Pawsey Service Desk to request nodes with a different configuration.

Requesting Pascal nodes

The NVIDIA Pascal nodes are available via the gpuq partition in SLURM, and one or more nodes can be requested, e.g.:

#SBATCH --partition=gpuq
#SBATCH --nodes=2

Additionally, the number of GPUs per node (up to four) must be specified using the "gres" keyword. If this is omitted, codes running in the gpuq partition will not be able to access the GPUs.

#SBATCH --gres=gpu:1

The type of GPU can also be optionally specified via the "constraint" keyword. Currently on Athena this is only "p100".

#SBATCH --constraint=p100

Thread placement

There are a number of environment variables that control the thread placement on both the Xeon Phi and Broadwell host processors.

Both the Intel and GCC compilers support the standard OpenMP variables:

  • OMP_PLACES specifies thread placement, either by an abstract name with an optional number of places in brackets, or by an explicit list of places.

    threads(n): each place corresponds to a single hardware thread, limited to the optional value n if included in brackets
    cores(n): each place corresponds to a single core, limited to the optional value n if included in brackets
    sockets(n): each place corresponds to a single socket, limited to the optional value n if included in brackets
  • OMP_PROC_BIND specifies how the threads are ordered across the available places.

    false: the threads are not bound to places and may be redistributed over time
    spread: the threads are bound and distributed sparsely across the available places
    close: the threads are bound and placed contiguously
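
As an illustrative sketch only (the settings are assumptions, not recommendations), an OpenMP run could bind its threads to cores and place them contiguously with:

export OMP_PLACES=cores
export OMP_PROC_BIND=close
export OMP_NUM_THREADS=16
srun -n 1 -c $OMP_NUM_THREADS ./main_omp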

The Intel compilers also support Intel-specific versions of these variables:

  • KMP_HW_SUBSET specifies the allocation of hardware resources, assigning the number of cores and number of threads per core. A comma separated list is used, with a "c" suffix for cores, "t" suffix for threads, and "o" suffix for offset.

    64c,4t: uses 64 cores, each with 4 threads
  • KMP_AFFINITY specifies how threads are placed within the allocated resources.

    scatter: threads are distributed across consecutive cores
    compact: threads are placed contiguously
    balanced: threads are distributed across the cores, but placed contiguously within a core
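
As an illustrative sketch only (the values are assumptions rather than recommendations), an Intel-compiled OpenMP executable could be run on a full KNL node using all 64 cores with 4 threads per core and balanced placement:

export KMP_HW_SUBSET=64c,4t
export KMP_AFFINITY=balanced
export OMP_NUM_THREADS=256
srun -n 1 -c $OMP_NUM_THREADS ./main_omp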

MPI job launcher: srun

Any applications requiring message passing must be launched with the parallel job launcher srun provided by SLURM. This takes a number of arguments specifying various options, for example:

module load mvapich
srun -N <number of nodes> -n <number of MPI tasks> ./application

Here, the "-N" option specifies the number of nodes required for the job and the "-n" option specifies the total number of MPI tasks to run across those nodes.
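
For instance, a 2-node KNL job running one MPI task per core (128 tasks in total) might be launched with:

srun -N 2 -n 128 ./main_mpi

where ./main_mpi is an illustrative executable name, as used in the example scripts below.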

Athena: SLURM script for packing serial tasks on KNL

Separate serial jobs with no dependencies may be "packed" via the standard Unix mechanism of running each task in the background. A "wait" statement is required to prevent the job exiting after the first task has completed. The time taken by the job will be dictated by the longest serial task. The following example uses two separate inputs, but tasks can be packed up to the number of cores in the node.

#!/bin/bash --login
#SBATCH --partition=knlq
#SBATCH --nodes=1
#SBATCH --time=00:10:00
#SBATCH --account=[your-account]
#SBATCH --export=NONE

./main_serial input_one &
./main_serial input_two &

wait

If exact placement of independent tasks on different cores is required, this can be done with numactl, as shown in the sketch below. Note that the numactl module should be used, as it is a more recent version than the packaged one.
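
A minimal sketch of such placement, reusing the two illustrative inputs from the script above and pinning each task to its own core:

module load numactl

numactl --physcpubind=0 ./main_serial input_one &
numactl --physcpubind=1 ./main_serial input_two &

wait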

Non-packed serial work flows should not be run on Athena.

Athena: SLURM script for OpenMP jobs on KNL

The following two examples each request 1 node and run one 16-thread OpenMP instance. There are some differences depending on which compiler has been used to produce the relevant executable.

Intel-compiled executables. Note that you should load the architecture and compiler modules, i.e. the knl and intel modules.

#!/bin/bash --login
#SBATCH --partition=knlq
#SBATCH --nodes=1
#SBATCH --time=00:10:00
#SBATCH --account=[your-account]
#SBATCH --export=NONE

module load knl intel

export OMP_NUM_THREADS=16
srun -n 1 -c $OMP_NUM_THREADS ./main_omp

GNU-compiled executables. Note that you should load the architecture and compiler modules, i.e. the knl and gcc modules.

#!/bin/bash --login
#SBATCH --partition=knlq
#SBATCH --nodes=1
#SBATCH --time=00:10:00
#SBATCH --account=[your-account]
#SBATCH --export=NONE

module load knl gcc 

export OMP_NUM_THREADS=16
srun -n 1 -c $OMP_NUM_THREADS ./main_omp

Athena: SLURM script for MPI jobs on KNL

The following script requests 2 nodes and runs 32 MPI tasks across them, 16 per node. The intel module can be replaced with gcc, and specific module versions can be included if needed.

#!/bin/bash --login
#SBATCH --partition=knlq
#SBATCH --nodes=2
#SBATCH --time=00:10:00
#SBATCH --account=[your-account]
#SBATCH --export=NONE
module load knl intel mvapich

srun -n 32 -N 2 ./main_mpi

Athena: SLURM script for hybrid OpenMP/MPI jobs on KNL

The following script requests 2 nodes and runs 1 MPI process per node, each with 64 OpenMP threads, using an Intel-compiled executable.

#!/bin/bash --login
#SBATCH --partition=knlq
#SBATCH --nodes=2
#SBATCH --time=00:10:00
#SBATCH --account=[your-account]
#SBATCH --export=NONE

module load knl intel mvapich

export OMP_NUM_THREADS=64
srun -n 2 -N 2 ./main_hybrid

Athena: SLURM script for CUDA jobs on Pascal

The following script requests 1 node with 4 P100 GPUs.

#!/bin/bash --login
#SBATCH --partition=gpuq
#SBATCH --nodes=1
#SBATCH --gres=gpu:4
#SBATCH --constraint=p100
#SBATCH --time=00:10:00
#SBATCH --account=[your-account]
#SBATCH --export=NONE

module load broadwell gcc/5.4.0 cuda
./main_cuda

Athena: SLURM script for OpenACC jobs on Pascal

The following script requests 1 node with 1 P100 GPU.

#!/bin/bash --login
#SBATCH --partition=gpuq
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --constraint=p100
#SBATCH --time=00:10:00
#SBATCH --account=[your-account]
#SBATCH --export=NONE

module load pgi 
./main_openacc

Athena: Job Arrays

Job arrays are disabled on Athena, as they are in general not compatible with the purpose of the Athena cluster.

External References


CUDA

MVAPICH

SLURM

Back to Top

