Maintenance of all Pawsey systems happens on a regular basis. Maintenance, whether requiring an outage or not, allows Pawsey to take preventative action towards mitigating any hazards or risks that might affect the ongoing functionality of those systems. Maintenance usually includes software/hardware updates, routine performance checks and faulty component replacements.
Maintenance is currently scheduled for the first Tuesday of each month. Users will normally be notified by email the week before.
Incidents are, by their nature, unscheduled. Service outages can also arise from incidents beyond the Pawsey systems themselves (e.g. power, cooling).
When incidents do occur, we ask users for their patience and understanding. Our system administrators will be working to bring the systems back up while preserving the jobs in the queues, and may not be able to keep Incident pages as up to date as they try to keep Maintenance pages.
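For planning jobs around the monthly window, the "first Tuesday of each month" rule can be computed directly. The following is a minimal sketch using only the Python standard library; the function name `first_tuesday` is a hypothetical helper, not a Pawsey tool, and the actual maintenance date is whatever the notification email states.

```python
import datetime


def first_tuesday(year: int, month: int) -> datetime.date:
    """Return the date of the first Tuesday in the given month."""
    first_day = datetime.date(year, month, 1)
    # date.weekday() numbers Monday as 0, so Tuesday is 1.
    days_until_tuesday = (1 - first_day.weekday()) % 7
    return first_day + datetime.timedelta(days=days_until_tuesday)


if __name__ == "__main__":
    today = datetime.date.today()
    print(first_tuesday(today.year, today.month))
```

The date is calendar-based only; it does not account for maintenance that is rescheduled or cancelled.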
For a list of scheduled/recent Maintenance or Incidents pages, please see below. Older pages can be accessed from within the tree view.
Pages are typically prefixed with either 'M-' (Maintenance) or 'I-' (Incident).
Pages are typically suffixed to indicate the systems affected: '-All', '-SC' (Supercomputing), '-Data', '-Nimbus' (Cloud), or '-Vis' (Visualisation).
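The naming convention above can be read mechanically. The sketch below assumes that the prefix and suffix are joined to the rest of the title with hyphens exactly as listed; the function name `classify_page` and the example titles are hypothetical, and the part of the title between prefix and suffix is not specified here. Because pages are only "typically" named this way, anything unrecognised is returned as `None`.

```python
# Prefixes and suffixes taken from the convention described above.
PAGE_KINDS = {"M": "Maintenance", "I": "Incident"}
SYSTEMS = {
    "All": "All",
    "SC": "Supercomputing",
    "Data": "Data",
    "Nimbus": "Nimbus (Cloud)",
    "Vis": "Visualisation",
}


def classify_page(title):
    """Return (kind, system) for a page title; None where not recognised."""
    kind = None
    for prefix, name in PAGE_KINDS.items():
        if title.startswith(prefix + "-"):
            kind = name
    system = None
    for suffix, name in SYSTEMS.items():
        if title.endswith("-" + suffix):
            system = name
    return kind, system


if __name__ == "__main__":
    # Hypothetical titles, for illustration only.
    for title in ("M-2020-01-07-SC", "I-example-Nimbus", "Notes"):
        print(title, classify_page(title))
```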
|Log||Status:||Start Date/Time (AWST):||End Date/Time (AWST):||Systems/Services Affected:||Summary:|
|Magnus, Galaxy, Zeus, Topaz||Pawsey Scheduled Maintenance|
|HSM||Pawsey Scheduled Maintenance|
|Ticketing System (Jira)||Pawsey Scheduled Maintenance|
|Zeus||Zeus: Omnipath Director class switch failure - Affects all omnipath nodes which includes workq, knlq and debugq|