Status:

RESOLVED

Points of Contact:

 help@pawsey.org.au 

Start Date/Time (AWST):

 

Responsible:

 Ashley Chew 

Estimated End Date/Time (AWST):

Accountable:

 Ashley Chew 

End Date/Time (AWST):

~10:45am

Informed:

 galaxy_users@pawsey.org.au 

Summary:

Lustre filesystem "askapbuffer" - OST001D (OST 29) read-only

Systems/Services Affected:

Any cluster system with /askapbuffer mounted, i.e. Galaxy and Zeus (copy nodes)

Updates:

  • After some investigation, one of the storage volumes (OST 29) was found to be marked read-only:
    • hpc-data6:~ # lfs df -h |grep R$
      askapfs1-OST001d_UUID       57.7T       49.0T        5.8T  89% /askapbuffer[OST:29] R
    • This can also be confirmed on the storage node hosting OST 29, where writes fail with rc -30 (EROFS, read-only file system):

      [Thu Oct  6 18:05:21 2022] LustreError: 21668:0:(tgt_lastrcvd.c:1016:tgt_client_new()) askapfs1-OST001d: Failed to write client lcd at idx 14, rc -30
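
The two outputs above refer to the same target in different notations: the OST name carries a hexadecimal index (OST001d), while `lfs df` prints the decimal index ([OST:29]). A minimal sketch of the conversion follows; the helper name `ost_index` is hypothetical and not part of any Lustre tooling.

```shell
# Hypothetical helper: convert the hexadecimal index embedded in an OST
# name (e.g. askapfs1-OST001d_UUID) to the decimal index that `lfs df`
# shows as [OST:29].
ost_index() {
    name="${1%_UUID}"       # drop an optional _UUID suffix
    hex="${name##*OST}"     # keep only the hex digits after "OST"
    printf '%d\n' "0x${hex}"
}

ost_index askapfs1-OST001d_UUID   # prints 29
```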

Post-Incident Summary:

  • The issue started to manifest at "Thu Oct  6 00:00:55 2022". After multiple failed writes to the target, the system marked it read-only
  • An attempt will be made at roughly 10:00am to address OST 29
  • From examination of OST 29, it seems the journal on the storage pool needs to be flushed (the storage pool was marked read-only because it could not commit the journal transactions)
  • Askapbuffer OSS storage nodes 1 & 2 were placed in maintenance mode so that no failover would happen, as we want to address a single storage pool
  • On askapbuffer OSS node 2, OST 29 was removed from the system and an fsck was performed to commit/flush the journal for the storage pool
    • The journal for OST 29 should now be flushed/committed
  • OST 29 was inserted back into the system, as there were no longer any errors
  • Maintenance mode was removed from askapbuffer OSS nodes 1 & 2
  • Clients are reconnecting
  • As of around 10:30am, read/write operations should have resumed:
    • hpc-data6:~ # lfs df -h |grep R$
      hpc-data6:~ # 
      
      i.e. no storage pools remain read-only
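
The `lfs df -h |grep R$` check used in this report can be wrapped for scripted follow-up checks. The sketch below assumes, as in the output shown earlier, that a read-only OST is flagged by a trailing `R` field; the function name `readonly_osts` is hypothetical, and it reads `lfs df` output from stdin so it can be tried without a Lustre client.

```shell
# Hypothetical helper: print the name of every OST whose last field in
# `lfs df` output is the read-only flag "R". Prints nothing when no
# target is read-only.
readonly_osts() {
    awk '$NF == "R" { sub(/_UUID$/, "", $1); print $1 }'
}

# Example using the line captured during this incident:
printf '%s\n' \
  'askapfs1-OST001d_UUID       57.7T       49.0T        5.8T  89% /askapbuffer[OST:29] R' \
  | readonly_osts
```

On a client with the filesystem mounted, the same function could be fed from `lfs df -h | readonly_osts`; empty output corresponds to the healthy state shown above.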