Status: COMPLETED
Start Date/Time (AWST): 06:00
Estimated End Date/Time (AWST): 20:00
End Date/Time (AWST): 09:30

Summary: Pawsey Yearly Maintenance
Systems/Services Affected: All Supercomputing and ancillary services


Points of Contact:

  • help@pawsey.org.au

Responsible:

  • Mark O'Shea

Accountable:

  • Ugo Varreto

Informed:

  • Pawsey users


Changes:

  • The major change is the upgrade of the Operating System on the production Crays (Magnus and Galaxy),
    so that Pawsey continues to receive security updates from Cray. Cray ended support for our current OS
    some time ago, but we waited until the yearly maintenance to do the upgrade so as to minimise the
    downtime of the production Crays.
    The impact on users is detailed at CLE OS Upgrade Effects: PAWSEY_OS (from CLE 6.0.UP05 to CLE 6.0.UP07).
  • Reduction of /home quota on the supercomputers from 10GB to 1GB.  See Quota Limit on your home area.
  • Slurm upgraded to 19.05.5
  • The Intel compiler suite and Cray Development Toolkit will offer 2019 versions in addition to the 2017 versions.
    (The 2017 versions will remain the defaults.)
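
For users wanting a rough idea of where they stand against the reduced /home quota, the following is a minimal sketch that sums file sizes under a directory and compares the total with a 1GB limit. It assumes that "1GB" means 1 GiB and that quota usage roughly tracks apparent file size; the filesystem's own accounting may differ, so the figures reported by the site's quota tools take precedence. The names `QUOTA_BYTES`, `directory_usage` and `quota_report` are illustrative only.

```python
import os

# Assumption: "1GB" is taken as 1 GiB; the filesystem may count differently
QUOTA_BYTES = 1 * 1024**3


def directory_usage(path):
    """Return the total apparent size in bytes of the files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                # lstat so a symlink counts as itself, not as its target
                total += os.lstat(full).st_size
            except OSError:
                # The file vanished or is unreadable; skip it
                continue
    return total


def quota_report(used_bytes, quota_bytes=QUOTA_BYTES):
    """Return (fraction_used, over_quota) for a usage figure in bytes."""
    return used_bytes / quota_bytes, used_bytes > quota_bytes
```

Calling `quota_report(directory_usage(os.path.expanduser("~")))` on a login node gives the fraction of the limit in use and whether the home area is over it.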

Updates:

  • 2019-12-20 12:00 The Estimated End Time of this maintenance still matches the one in the
    announcement email. However, some of the CBIS (CSIRO's Business and Infrastructure Service) work will
    no longer go ahead, so Pawsey staff will be able to start their work earlier than originally planned.
    We therefore expect to complete the maintenance ahead of the original End Time.
  • 2020-01-03 06:30 Both production Crays were taken out of service
  • 2020-01-03 08:00 Work commenced on the UP07 upgrade to the production Crays
  • 2020-01-03 18:08 Both production Crays exhibited errors after their firmware upgrades
    Work continues to rectify this issue
  • 2020-01-03 19:07 Galaxy has been completely flashed; Magnus is still fighting
  • 2020-01-03 19:55 Fewer Magnus components failed to flash on the second attempt
    Join us tomorrow for more upgrade fun!
  • 2020-01-04 09:30 Magnus has now been completely flashed.
    Work continues towards returning the production Crays into service
  • 2020-01-04 14:20 The PrgEnv images have been built. CLE OS images are currently building.
  • 2020-01-05 11:30 The CLE system within Galaxy has booted
    Work continues towards booting the CLE system within Magnus and the eLogin systems
  • 2020-01-05 13:00 The CLE system within Magnus has booted
    Work continues towards booting the eLogin systems and completing acceptance testing
  • 2020-01-05 16:30 The eLogin systems have been seen to boot
    Work will continue tomorrow (Mon 6th), mainly around completing acceptance testing, along
    with some hardware fixes that require our on-site Cray engineer
  • 2020-01-06 12:00 Our on-site Cray engineer carried out the hardware swaps.
  • 2020-01-06 18:00 Testing of the redeployed systems is ongoing.
  • 2020-01-07 16:00 Galaxy is being returned to service. Any issues: email help@pawsey.org.au
    Magnus should be back soon
  • 2020-01-07 16:40 Zeus and Topaz have been returned to service
    • Slurm 19.05.5
    • NVIDIA-based GPU nodes are running driver 440.33.01, which supports up to CUDA 10.2
  • 2020-01-07 19:10 Magnus is being returned to service. However, only magnus-1
    is currently available as a job submission resource whilst we use magnus-2 to
    investigate a mount timing issue.
    Any other issues: email help@pawsey.org.au
  • 2020-01-08 14:40 The Shifter module, whilst still present, no longer functions.
    Pawsey's recommendation for containerised workflows is Singularity: see Basics of Singularity.
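
The 2020-01-07 16:40 update ties driver 440.33.01 to CUDA 10.2 support. As a minimal sketch of how that kind of check can be made, the snippet below compares dotted driver versions numerically. It assumes a minimum Linux driver of 440.33 for CUDA 10.2, which is our reading of NVIDIA's release notes and should be confirmed there; `parse_version` and `driver_supports` are illustrative names, not part of any NVIDIA tool.

```python
# Assumption: minimum Linux driver for CUDA 10.2, per NVIDIA's release notes
MIN_DRIVER_FOR_CUDA_10_2 = "440.33"


def parse_version(text):
    """Split a dotted version string such as '440.33.01' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def driver_supports(driver, minimum=MIN_DRIVER_FOR_CUDA_10_2):
    """Return True if driver is at least minimum, comparing each component as a number."""
    return parse_version(driver) >= parse_version(minimum)
```

Comparing tuples of integers rather than raw strings avoids mis-ordering versions whose components differ in digit count.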




Post-Maintenance Summary: