
Status:

 RESOLVED

Points of Contact:

help@pawsey.org.au

Start Date/Time (AWST):

14:58

Responsible:

Gregory Orange

Estimated End Date/Time (AWST):


Accountable:

Mark Gray (Head of Platforms)

End Date/Time (AWST):

15:35

Informed:

nimbus_users@pawsey.org.au

Summary:

Nimbus dashboard and authentication endpoint unavailable

Systems/Services Affected:

Nimbus
Status Page entry: https://status.pawsey.org.au/incidents/t70bnrcglqts

Updates:

2023-09-03

  • 14:58 Issue discovered - endpoint timeout, dashboard unresponsive
  • 15:30 Issue identified - keepalived IPs are absent.
  • 15:35 Restarting keepalived containers has reinstated the IP, and APIs are working properly.
  • 16:59 Underlying upgrades, restarts and the keepalived response identified as contributing factors. This incident is resolved, but we will work on improving resilience in light of it.

Post-Incident Summary:

  • Monitoring did not catch the outage because of the transitional state of our migration, which is due to be completed in the coming weeks.
  • We need to monitor keepalived or its current IP even though it is somewhat hidden from external view.
  • We need to experiment with ways to test and ensure resilience of keepalived.
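
As a starting point for the keepalived monitoring item above, here is a minimal sketch of two checks a monitoring probe could run: whether the virtual IP is currently assigned on a host, and whether the API endpoint accepts TCP connections. It assumes a Linux host where `ip -o addr show` is available. The function names, and any VIP, hostname, or port passed to them, are hypothetical and do not reflect the actual Nimbus configuration.

```python
import ipaddress
import socket
import subprocess


def vip_present(ip_addr_output: str, vip: str) -> bool:
    """Return True if `vip` is listed as an address in `ip -o addr show` output.

    Addresses are parsed rather than substring-matched, so 192.0.2.10
    does not falsely match 192.0.2.100.
    """
    target = ipaddress.ip_address(vip)
    for line in ip_addr_output.splitlines():
        fields = line.split()
        # In one-line mode the address follows the family keyword,
        # e.g. "inet 192.0.2.100/32" or "inet6 2001:db8::1/64".
        for family, value in zip(fields, fields[1:]):
            if family not in ("inet", "inet6"):
                continue
            try:
                address = ipaddress.ip_interface(value).ip
            except ValueError:
                continue
            if address == target:
                return True
    return False


def read_host_addresses() -> str:
    """Collect the host's addresses (assumes the iproute2 `ip` tool is installed)."""
    result = subprocess.run(
        ["ip", "-o", "addr", "show"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def endpoint_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A probe on each candidate host could call `vip_present(read_host_addresses(), vip)` and alert when no host reports the address, while an external probe could call `endpoint_reachable(host, port)` against the public endpoint. Running both would distinguish "the VIP is missing" from "the VIP is present but the service behind it is down".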