
Status

ONGOING

Start Date/Time (AWST)

 13:30

Estimated End Date/Time (AWST)
End Date/Time (AWST)
Summary: Zeus: Omnipath Director class switch failure. Affects all Omnipath nodes, including workq, knlq and debugq
Systems/Services Affected: Zeus


Points of Contact:

  • help@pawsey.org.au


Responsible:

  • Ashley Chew


Accountable:

  • Mark O'Shea


Informed:

  • zeus_users@pawsey.org.au


Updates:

  • The Omnipath Director class switch is causing issues in the Omnipath network that connects the nodes in the Zeus workq (CPU workloads) and knlq (KNL workloads). Affected nodes cannot communicate with each other or see the filesystems
  • Leaf Switch 105 has failed: all 24 ports have become unavailable (no port lights)
    • Leaf Switch 105 contains the Subnet Fabric Manager
    • Leaf Switch 105 contains the Lustre Lnet routers
  • Due to the leaf failure, the Omnipath HBA cards are unable to function as there is no fabric
  • Workaround: re-wired the Fabric Subnet Manager and Lustre LNet routers to a functional leaf switch
    • Due to insufficient functional Omnipath HBA ports, the number of Zeus compute nodes available will be reduced
    • We will be working with the vendor to get a replacement leaf switch installed in the Director class switch
    • Compute nodes without access to the fabric will be powered off
  • 05/05/2020 15:30:  We've identified the failed component and the Omni-Path fabric is stable. The Omni-Path connected Lustre clients are in recovery mode and may take some time to complete.
  • 05/05/2020 24:00: Lustre Clients still in recovery mode
  • 06/05/2020 01:00: Partial restoration. All Omnipath nodes not connected to Leaf Switch 105 have been restored for general use. Insufficient spare ports due to the failure mean the remaining nodes cannot be physically restored until a leaf replacement is arranged
    • Partition "longq" nodes → Unavailable as they sit on the failed leaf switch
    • Partition "debugq" nodes → Unavailable as they sit on the failed leaf switch
    • Partition "workq" nodes → Available (Reduced Capacity)
    • Partition "knlq" nodes → Available (Reduced Capacity)
    • Partition "gpuq|gpuq-dev" nodes → Available (Non omni, Full Capacity)
    • Partition "highmemq" nodes → Available (Non omni, Full Capacity)
  • 06/05/2020
    • Alerted the vendor that we will require 2 x leaf switches to replace Leaf Switch 105A|B
  • 07/05/2020
    • Vendor has lodged a part request with the manufacturer
  • 08/05/2020
    • Further logs were collected from the Omnipath Director switch to ascertain whether anything else is required
  • 11/05/2020
    • The vendor's manufacturer has confirmed the RMA for the part
  • 12/05/2020
    • Around 5:10pm there were no lights on the leaf switch to which the Omnipath infrastructure nodes had been relocated
      • We were forced to restart the Omnipath Director class switch, as Omnipath ceases to function without these nodes on that leaf switch
    • Reshuffled ports (reducing the nodes available in knlq to restore other nodes)
    • Partition "longq" nodes → Available (Restored, reduced Capacity)
    • Partition "debugq" nodes → Available (Restored, reduced Capacity)
    • Partition "workq" nodes → Available (Reduced Capacity)
    • Partition "knlq" nodes → Available (Further Reduced Capacity)
    • Partition "gpuq|gpuq-dev" nodes → Available (Non omni, Full Capacity)
    • Partition "highmemq" nodes → Available (Non omni, Full Capacity)
  • 14/05/2020 
    • Vendor confirmed they are waiting for an ETA for replacement parts from the manufacturer (a delay is possible due to the specialised nature of the parts and the COVID-19 situation)
    • The majority of the nodes have been restored, with the exception of the KNL nodes within knlq (other nodes have been given priority for functional Omnipath ports)
    • There will be no further updates until we get an ETA for the parts
  • 18/05/2020
    • Replacement module for Leaf Switch 105 A|B confirmed; it will be shipping from overseas
    • We are waiting for the shipment / tracking notice so we can prepare for the replacement, as some downtime is required
  • 21/05/2020
    • Shipment of parts confirmed
    • En route from US (Louisville, KY) → Malaysia (Penang) → Australia (Perth)
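
For users who want to see how much of each partition is currently usable, the following is a minimal sketch rather than an official tool. It assumes the Zeus partitions are queried through Slurm (the scheduler is not named in this notice) and that the input text comes from `sinfo -h -o "%P %D %T"`, which prints partition, node count and node state. The function name `summarise_partitions` and the sample figures are hypothetical and are not actual Zeus node counts.

```python
from collections import defaultdict

# Node states treated as "usable" here; this grouping is an assumption
# made for illustration, not a Pawsey definition.
USABLE_STATES = {"idle", "mixed", "allocated", "completing"}


def summarise_partitions(sinfo_text):
    """Count nodes per partition, split into usable and unusable."""
    summary = defaultdict(lambda: {"usable": 0, "unusable": 0})
    for line in sinfo_text.strip().splitlines():
        partition, node_count, state = line.split()
        # sinfo marks the default partition with a trailing '*'
        partition = partition.rstrip("*")
        # sinfo may append flag characters to a state, e.g. 'down*'
        state = state.rstrip("*~#%$@^!-")
        key = "usable" if state in USABLE_STATES else "unusable"
        summary[partition][key] += int(node_count)
    return dict(summary)


# Hypothetical sample output, NOT real Zeus figures.
sample = """\
workq* 60 allocated
workq* 20 down*
knlq 10 idle
knlq 70 down*
debugq 8 idle
"""

for name, counts in summarise_partitions(sample).items():
    print(name, counts)
```

On a login node the same function could be fed the text returned by running the `sinfo` command above, for example through `subprocess.run(..., capture_output=True, text=True)`.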

Post-Incident Summary:

  • Physical Failure of internal leaf switch 105 in the Director Class switch
  • Leaf Switch 105 contained all the core essential infrastructure for Omnipath core network and inter network functions
  • Reworked physical networking to restore backend Omnipath related network infrastructure
  • Restored compute nodes that still had a physically functional Omnipath link
  • Nodes with a functional Omnipath link underwent client-side checks against the network filesystem
  • Nodes with a functional Omnipath link were released back to service
  • Liaised with the vendor to arrange a replacement for Leaf Switch 105
  • Port reshuffling (reduced the number of nodes in knlq in favour of other nodes, as they are used more)