Description:
An extended outage is required to allow both an update to the NEO software on /scratch and /software and the final Slingshot integration tasks. This outage is planned to last 7 days and, if the planned works are completed successfully, it will be the final outage required to integrate Setonix phases 1 and 2.
Updates:
- Tuesday 7th
- 14:30 Our on-site vendor engineers, plus the man flown in from Sydney to assist with the work, report that work is progressing in parallel on the storage (NEO) and networking (Slingshot) tasks.
- 15:45 The vendor engineers currently on-site have informed us that the NEO software updates have been applied. Tomorrow they will commence the required firmware upgrades on the control plane servers.
- Thursday 9th
- 09:30 The vendor engineers currently on-site have informed us that they have now flashed 50% of the storage component firmware, and that of the 141 broken network links within the Slingshot fabric, they are now looking at only four. Work progresses.
- Friday 10th
- 10:30 Our on-site vendor engineers (the man from Sydney has returned home) have informed us that they have now flashed 80% of the storage component firmware, and that all of the broken network links within the Slingshot fabric have now been fixed.
- Monday 13th
- 08:30 HPE have handed over Phase 1 to Pawsey (8:30 AM AWST). We are working as quickly as possible to verify the hardware and ensure the configuration changes haven't affected the Pawsey running environment. Please note that, because HPE added Phase 2 nodes to the SLURM configuration, the jobs which were sitting in the queue prior to maintenance have been wiped. We were unaware this was a risk, and HPE have been unable to restore the state.
- 12:07 ReFrame tests have passed.
- 12:51 Services have blessed the return to service and the maintenance reservation has been removed.