Status:

  RESOLVED  

Points of Contact:

 help@pawsey.org.au 

Start Date/Time (AWST):

 

Responsible:

Estimated End Date/Time (AWST):

Accountable:

End Date/Time (AWST):

 09:00

Informed:

 setonix_users 

Summary:

Setonix - Reduced Compute Capacity

Systems/Services Affected:

Setonix

Description:

Setonix Hardware Failures

  • 25/11/2022 Hardware failures identified
    • A failure was identified on one of the worker nodes in the Kubernetes backend
    • This affected the service pods residing on that worker node, because their services could not fail over to another node
    • The projected filesystem service (DVS) was affected, so any node attached to this worker node would eventually stall or fail when using DVS through it

Changes:

  • 25/11/2022 No onsite spare was found; parts replacement was requested by the vendor engineer
  • 28/11/2022
    • Parts were replaced in one of the worker nodes
    • Worker node being recommissioned (ncn-w001)
  • 29/11/2022
    • Worker node has been recommissioned
    • Pods redistributed
      • The DVS worker node map has changed in relation to the compute nodes
      • Compute nodes therefore have to be rebooted
    • A reservation was created at 11:20 am to stop usage of nodes
      • Nodes with existing jobs will be set to drain and then be rebooted
      • Nodes previously offline will be rebooted and gradually released for use
      • Drained nodes will be rebooted
    • At 2pm there was a setback: nodes are experiencing a secondary DVS issue on reboot, which onsite engineers are trying to remedy
  • 30/11/2022
    • Vendor engineer has confirmed there is an issue with one of the four DVS services
      • Issue with the original DVS service pod servers
      • ncn-w002 has an issue with the high-speed network interface (hsn0)
    • Workaround is in place
      • Compute nodes have been rebooted with the new DVS map of servers (ncn-w002 not being used)
        • Jobs on the long queue partition have been removed (jobs might not have finished as they were attached to ncn-w002)
      • Compute nodes released for general use ~2pm
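
The reason compute nodes had to be rebooted after the DVS map changed can be illustrated with a minimal sketch. This is not the tooling used on Setonix; it is a hypothetical model that assumes each compute node holds a list of the worker nodes serving DVS to it. The compute node names (nid001000, nid001001), the worker names ncn-w003 and ncn-w004, and both helper functions are assumptions made for illustration only.

```python
# Hypothetical model of a DVS server map: compute node -> worker nodes serving DVS.
# Node names and the map layout are illustrative assumptions, not Setonix data.

def exclude_server(dvs_map, bad_server):
    """Return a new map with bad_server removed from every compute node's list."""
    return {
        node: [server for server in servers if server != bad_server]
        for node, servers in dvs_map.items()
    }


def nodes_needing_reboot(old_map, new_map):
    """Return compute nodes whose set of DVS servers differs between the two maps."""
    all_nodes = old_map.keys() | new_map.keys()
    return sorted(
        node
        for node in all_nodes
        if set(old_map.get(node, ())) != set(new_map.get(node, ()))
    )


old_map = {
    "nid001000": ["ncn-w001", "ncn-w002"],
    "nid001001": ["ncn-w003", "ncn-w004"],
}

# Workaround modelled here: stop using ncn-w002 as a DVS server.
new_map = exclude_server(old_map, "ncn-w002")

print(nodes_needing_reboot(old_map, new_map))
```

In this model only nodes whose server list actually changed are flagged; the report itself describes compute nodes being rebooted more broadly, so the sketch should be read as an explanation of the dependency rather than a description of the procedure followed.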

Updates: