
Status:

 COMPLETE 

Points of Contact:

 help@pawsey.org.au 

Start Date/Time (AWST):

08:00

Responsible:

 Mark O'Shea, SC Lead

Estimated End Date/Time (AWST):

12:00

Accountable:

 Mark Gray, Head of Platforms 

End Date/Time (AWST):

12:00

Informed:

 Pawsey users 

Summary:

Pawsey Scheduled Maintenance

Systems/Services Affected:

Setonix, Garrawarla, Zeus

Description:

Changes:

  • Vendor work on scratch filesystem (Setonix)
  • Pawsey filesystem checks on certain OSTs of the MWA Astro filesystem
  • Updating of Slurm module on Setonix to set output formats.
  • Updating default stacksize setting on Setonix compute nodes.
  • Apply minimal quota to /group filesystem
  • Apply AWST to Setonix Slurm Controller POD

Updates:

Tuesday 6th

  • 12:10 Changes to compute nodes and UANs have been completed and tested.
    We are waiting for the vendor to complete their work on the Lustre filesystem before bringing them back into service.
  • 16:30 Our on-site Cray engineers have informed us that they encountered issues returning
    the Setonix /scratch filesystem to service, but that it is heading towards a state where we can mount it for use.
  • 17:20 The remedial work for the /astro filesystem hasn't cured all of the issues we were seeing on it.
  • 19:00 Vendor engineers are still trying to bring up the Setonix /scratch filesystem. We will bring Setonix back to service in the morning, provided /scratch is available.
  • 19:30 Zeus and Garrawarla have been released for general use.

Wednesday 7th

  • 10:15 Our on-site Cray engineers have informed us that they believe they have fixed the Setonix /scratch filesystem.
    We will now be testing it out ahead of returning Setonix to service.
  • 14:00 Problems remain with the Setonix /scratch filesystem and our on-site Cray engineers continue to investigate.
  • 16:00 The Lustre problem has now been escalated to a vendor specialist engineer in the UK: we await them coming "on-line".


Thursday 8th

  • 09:00 Overnight (AWST), vendor engineers identified the problem with the Lustre filesystem providing /scratch.
    We will now be testing it out ahead of returning Setonix to service.
  • 11:30 All of the testing passed and nodes have been returned to service

Further information:

This page documents the work of the SCOps team on this maintenance day. Other work may be visible via the page for the whole day.