Description:
Pawsey and Engineering staff have finalised the building system changes required to implement the Pawsey refresh technologies.
This work has been timed to coincide with the November scheduled maintenance window, and some systems maintenance will be undertaken as systems come back on-line to reduce the need for another outage.
The services listed above will be off-line for periods during the nominated window, and this work may impact other services that rely on them.
We appreciate your support. If you have any questions, please e-mail help@pawsey.org.au.
Changes:
Updates:
- Zeus
  - InfiniBand switch failure
    - Slurm "highmemq" partition is offline due to the InfiniBand switch failure (the partition will be restored subject to the replacement part being available)
    - hpc-data2, hpc-data4 & hpc-data6 are offline due to the InfiniBand switch failure (hpc-data[1,3,5] will continue to process copyq jobs)
- Topaz
  - InfiniBand switch failure
    - "nvlinkq" and "nvlinkq-dev" are offline due to the InfiniBand switch failure (the partitions will be restored subject to the replacement part being available)
- Garrawarla
  - InfiniBand switch failure
    - hpc-data8 is offline due to the InfiniBand switch failure (hpc-data7 will continue to process copyq jobs)
- Remotevis.pawsey.org.au
  - SSL certificates have been updated
- Services restored, without nvlinkq on Topaz until the replacement part arrives.
Post Updates:
- 7th December 2021, EDR InfiniBand part arrived and the failed unit was replaced (the part had to be manufactured)
- 8th December 2021
  - Topaz nvlinkq and nvlinkq-dev restored (one node has a corrupted ROM on its GPU)
  - Zeus highmemq restored
  - Pending: Zeus copyq restoration to full capacity
  - Pending: Garrawarla copyq restoration to full capacity