Updates:
- After some investigation, one of the storage volumes (OST 29) was found to be marked as read-only:
hpc-data6:~ # lfs df -h |grep R$
askapfs1-OST001d_UUID 57.7T 49.0T 5.8T 89% /askapbuffer[OST:29] R
This is also confirmed by the kernel log on the storage node hosting OST 29, where rc -30 is EROFS (read-only file system):
[Thu Oct 6 18:05:21 2022] LustreError: 21668:0:(tgt_lastrcvd.c:1016:tgt_client_new()) askapfs1-OST001d: Failed to write client lcd at idx 14, rc -30
Post-Incident Summary:
- The issue started to manifest at "Thu Oct 6 00:00:55 2022". After multiple failed writes to the target, the system marked it as read-only
- An attempt to address OST 29 will be made at roughly 10:00am
- Examination of OST 29 suggests that the journal on the storage pool needs to be flushed (the storage pool was marked read-only because it could not commit its journal transactions)
- Askapbuffer OSS storage nodes 1 & 2 were placed in maintenance mode so that no failover would happen while a single storage pool was being addressed
- On askapbuffer OSS node 2, OST 29 was removed and an fsck was performed to commit/flush the journal for the storage pool
- The journal for OST 29 should now be flushed/committed
- OST 29 was inserted back into the system as there are no longer any errors
- Maintenance mode was removed from Askapbuffer Nodes 1 & 2
- Clients reconnecting
- Around 10:30am: read/write operations should now resume
hpc-data6:~ # lfs df -h |grep R$
hpc-data6:~ #
i.e. no storage pools are read-only
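
The read-only check used in this report matches any line of lfs df output that ends in R. A slightly stricter variant is sketched below. It assumes the column layout shown in the output above, with the state flag as the last field and the mount target (for example /askapbuffer[OST:29]) as the field before it. The function name readonly_osts is hypothetical and is not part of Lustre.

```shell
# Hypothetical helper: list OSTs that `lfs df` flags as read-only.
# Reads `lfs df` output on stdin and prints "<UUID> <mount[OST:n]>" for
# each line whose last field is "R" and whose previous field is an OST target.
readonly_osts() {
    awk '$NF == "R" && $(NF-1) ~ /\[OST:[0-9]+\]/ { print $1, $(NF-1) }'
}

# Usage on a Lustre client:
#   lfs df -h | readonly_osts
```

Empty output corresponds to the final state in this report, where no storage pool is read-only.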