
Status

COMPLETED 

Start Date/Time (AWST)

 08:00

Estimated End Date/Time (AWST)

 20:00

End Date/Time (AWST)

19:52

Summary

Pawsey Scheduled Maintenance

Systems/Services Affected

Magnus, Galaxy, Zeus, Topaz, Garrawarla


Points of Contact

  • help@pawsey.org.au

Responsible:

  • Mark Gray

Accountable:

  • Mark O'Shea

Informed:

  • pawsey_users

Changes:

  • Update the ME4 array firmware on askapbuffer
  • Remove NHC from the Zeus login node
  • Update /etc/sudoers to directly notify helpdesk
  • Configure Slurm topology on Galaxy and Magnus
  • Create acceptance partitions on Galaxy, Garrawarla, Magnus, Topaz and Zeus
  • Fix the MTU on the IB bond interfaces on the askapbuffer servers
  • Lower debugq node count from 8 to 4 on Zeus
  • Add postfix wrapper script to set extra mail headers for Slurm on Magnus and Zeus
  • Relocate opa-lnet[3-4] from KNL racks to Zeus rack
  • Add ldap.conf to all Galaxy and Magnus compute nodes
  • Apply PS44 Cumulative Patchset Update to the Production Crays
  • Update Slurm to 20.02.5
  • Enable (basic) firewall on glacier (influx2)
  • Remove old Galaxy GPU nodes from service.
  • Tune compute node memory defaults on Garrawarla
  • Increase the default Eio and UnkillableStep timeout values to 5 minutes
  • Enable the replacement central logging system

Updates

  • 08:58 Production Crays are down: maintenance work commences

  • 12:10 Our on-site Cray engineer reports some issues with powering up the CLE portion of Magnus
    New CLE images, with the updated Slurm, have been created
  • 16:20 The CLE portions of Magnus and Galaxy are being booted
    Magnus required 6 rectifiers and a compute blade to be swapped out
    Galaxy, clearly upset at the withdrawal of its GPU nodes, refused to accept a replacement compute blade
  • 17:45 Magnus is undergoing acceptance testing; Galaxy is nearly ready for acceptance testing
  • 19:20 Magnus has passed acceptance testing. Queued jobs are now running. Front-end access will be restored shortly
  • 19:30 Magnus has been returned to service. Galaxy has passed acceptance testing. Queued jobs are now running. Front-end access will be restored shortly
  • 19:35 Galaxy has been returned to service.

Post-Maintenance Summary: