
Status

RESOLVED

Start Date/Time (AWST)

 08:45

Estimated End Date/Time (AWST)
End Date/Time (AWST)

11:30

Summary: Nimbus authentication services offline, cluster unreliable
Systems/Services Affected: Nimbus

Points of Contact

Responsible

Accountable

Informed

Updates

  • Progress at CI-376
  • 09:00 Identified Keystone and RabbitMQ errors
  • 09:20 Reviewed Jenkins reports and identified file descriptor limit errors
  • 09:30 Increased file descriptor limits, which allowed some message traffic through, but it remained unreliable
  • 11:00 Identified test RabbitMQ servers in the production cluster, then the root cause (see PIR)
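
The 09:20 and 09:30 entries concern file descriptor limits. As an illustration only, the following Python sketch shows one way to compare a process's open descriptors against its soft limit. It is not the check used during this incident: the function names are hypothetical, and the measurement assumes a Linux host where /proc/self/fd is available.

```python
import os
import resource


def fd_usage(open_fds: int, soft_limit: int) -> float:
    """Return the fraction of the soft file descriptor limit in use."""
    if soft_limit <= 0:
        raise ValueError("soft_limit must be positive")
    return open_fds / soft_limit


def current_process_fd_usage() -> float:
    """Measure descriptor usage for this process (hypothetical helper).

    Assumes Linux: counts the entries in /proc/self/fd.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        # No soft limit is set, so usage cannot approach it.
        return 0.0
    open_fds = len(os.listdir("/proc/self/fd"))
    return fd_usage(open_fds, soft)
```

A monitoring job could call current_process_fd_usage() periodically and raise an alert above a chosen threshold, for example 0.8. That threshold is an assumption, not a value from this incident.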

Post-Incident Summary

  • Removed the test RabbitMQ servers from the production cluster
  • Returned file descriptor limits to their previous values
  • Tested and watched for stability
  • PIR for I-2020-10-07-Nimbus