Points of Contact:
- Mark Gray
- Ugo Varetto
- A power loss event has taken out several Pawsey systems. Staff are investigating the extent.
- 16:13. Power appears to no longer be drawn by a power distribution board feeding some supercompute cell equipment, including filesystems, the InfiniBand fabric and some compute systems. Pawsey staff are working with CSIRO Building and Infrastructure Services (CBIS) to identify and rectify the fault.
- 16:45. CBIS has identified that an upstream breaker feeding the power distribution board has tripped. It has been reset and Pawsey staff are working on bringing equipment back into service.
- 18:30. Most equipment has been brought up, but the astrofs is still being stubborn. It is required for most cluster services to return to service, so it is holding us up.
- 20:00. The astro filesystem is still having issues. Work on it will continue tonight, but testing and bringing back Magnus, Galaxy, Topaz and Zeus will take place in the morning.
- 10:00 3/7/2020. Hardware replacement on the astro filesystem is underway. This will replace problematic UPS batteries in the disk arrays so that the controllers and disks can be brought back online.
- 11:30. Magnus, Galaxy, Topaz and Zeus are being brought back into service without access to most of /astro so that jobs can start again.
- 14:00. Galaxy, Topaz and Zeus are back in service without full access to /astro. MWA researchers have had their jobs put on hold to stop the /astro issues affecting their jobs.
- 14:10. Magnus is back in service without full access to /astro.
- 15:00. Our vendor has resolved the issue with the /astro filesystem arrays and we are working on bringing them all back online.
- 15:20. The /astro filesystem has been restored.
- 15:25. Zeus and Topaz are fully operational, with access to the /astro filesystem where applicable.
- 16:55. magnus-2 has re-entered service; magnus-1 had already been made available.