Updates:
2023-09-03
- 14:58 Issue discovered - endpoint timeout, dashboard unresponsive
- 15:30 Issue identified - keepalived IPs are absent.
- 15:35 Restarting the keepalived containers reinstated the IP, and APIs are working properly.
- 16:59 Underlying upgrades, restarts and keepalived response identified as contributing factors. This incident is resolved, but we will work on improving resilience in light of it.
Post-Incident Summary:
- Monitoring did not catch the outage because of the transitional state of our migration, which is due to be completed in the coming weeks.
- We need to monitor keepalived or its current IP even though it is somewhat hidden from external view.
- We need to experiment with ways to test and ensure resilience of keepalived.
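
One possible way to monitor the keepalived IP is sketched below. It is only an illustration under several assumptions: the check runs on a Linux host (or inside the network namespace of the keepalived container) where the iproute2 `ip` command is available, and the address 192.0.2.10 is a placeholder, not our real virtual IP. The function names are hypothetical. The script reports whether the virtual IP is currently assigned to any interface, which is the condition that was missing at 15:30.

```python
import ipaddress
import subprocess


def assigned_addresses(ip_output: str) -> set:
    """Extract IPv4 addresses from the text printed by `ip -o -4 addr show`."""
    found = set()
    for line in ip_output.splitlines():
        tokens = line.split()
        if "inet" not in tokens:
            continue
        idx = tokens.index("inet")
        if idx + 1 < len(tokens):
            # The token after "inet" looks like 192.0.2.10/32; keep only the address.
            found.add(ipaddress.ip_interface(tokens[idx + 1]).ip)
    return found


def vip_present(ip_output: str, vip: str) -> bool:
    """Return True if the virtual IP appears among the assigned addresses."""
    return ipaddress.ip_address(vip) in assigned_addresses(ip_output)


def check_host(vip: str = "192.0.2.10") -> int:
    """Run `ip` locally and return a Nagios-style exit code (0 = OK, 2 = critical).

    The default address is a placeholder for illustration only.
    """
    result = subprocess.run(
        ["ip", "-o", "-4", "addr", "show"],
        capture_output=True,
        text=True,
        check=True,
    )
    if vip_present(result.stdout, vip):
        print(f"OK: {vip} is assigned on this host")
        return 0
    print(f"CRITICAL: {vip} is not assigned on this host")
    return 2
```

In an active/backup setup the IP is expected on only one node at a time, so an alert should fire only when no node reports it. Calling `check_host()` on each keepalived node and combining the results in the monitoring system would be one way to do that.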