
Status: COMPLETED
Planned Start Date/Time (WST): 10:15
Planned End Date/Time (WST):
End Date/Time (WST):
Summary: /group filesystem degraded
Systems/Services Affected: Magnus, Galaxy, Zeus

 

Points of Contact

  • help@pawsey.org.au


Responsible:

  • Mark O'Shea


Accountable:

  • David Schibeci


Informed:



Changes:

  • On Sunday at around 10 AM, one of the OSTs on /group (OST0033) began showing problems. This is causing errors when accessing any files stored on that OST, as well as problems with quota allocation.

    OSS log
    Feb 17 10:18:38 pgfs-oss10.pawsey.org.au pengine[3389]:   notice:  * Start      pgfs-OST0033     ( pgfs-oss10.pawsey.org.au )
    Feb 17 10:19:17 pgfs-oss10.pawsey.org.au kernel: LustreError: 137-5: pgfs-OST0033_UUID: not available for connect from 864@gni1 (no target). If you are running an HA pair check that the target is mounted on the other server.
    Feb 17 10:20:23 pgfs-oss10.pawsey.org.au kernel: LustreError: 137-5: pgfs-OST0033_UUID: not available for connect from 10.10.100.218@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.
    Feb 17 10:20:58 pgfs-oss10.pawsey.org.au crmd[3390]:   notice: Initiating start operation pgfs-OST0033_start_0 locally on pgfs-oss10.pawsey.org.au
    Feb 17 10:20:58 pgfs-oss10.pawsey.org.au Lustre(pgfs-OST0033)[165794]: INFO: Starting to mount /dev/mapper/OST0033
    Feb 17 10:21:43 pgfs-oss10.pawsey.org.au kernel: Lustre: pgfs-OST0033: Not available for connect from 10@gni (not set up)
    Feb 17 10:21:43 pgfs-oss10.pawsey.org.au kernel: Lustre: pgfs-OST0033: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
    Feb 17 10:21:43 pgfs-oss10.pawsey.org.au kernel: LustreError: 165989:0:(qsd_entry.c:211:qsd_refresh_usage()) $$$ failed to read disk usage, rc:-3 qsd:pgfs-OST0033 qtype:grp id:0 enforced:0 granted:0 pending:0 waiting:0 req:0 usage:0 qunit:0 qtune:0 edquot:0
    Feb 17 10:21:43 pgfs-oss10.pawsey.org.au kernel: Lustre: pgfs-OST0033: in recovery but waiting for the first client to connect
    Feb 17 10:21:43 pgfs-oss10.pawsey.org.au kernel: Lustre: pgfs-OST0033: Will be in recovery for at least 2:30, or until 2308 clients reconnect
    Feb 17 10:21:43 pgfs-oss10.pawsey.org.au kernel: Lustre: 165989:0:(qsd_reint.c:503:qsd_reint_main()) pgfs-OST0033: reintegration for [0x200000005:0x1089:0x0] failed with -3
    Feb 17 10:21:44 pgfs-oss10.pawsey.org.au Lustre(pgfs-OST0033)[165998]: INFO: /dev/mapper/OST0033 mounted successfully
    Feb 17 10:21:44 pgfs-oss10.pawsey.org.au crmd[3390]:   notice: Result of start operation for pgfs-OST0033 on pgfs-oss10.pawsey.org.au: 0 (ok)
    Feb 17 10:21:44 pgfs-oss10.pawsey.org.au crmd[3390]:   notice: Initiating monitor operation pgfs-OST0033_monitor_20000 locally on pgfs-oss10.pawsey.org.au
    Feb 17 10:21:50 pgfs-oss10.pawsey.org.au kernel: Lustre: pgfs-OST0033: Connection restored to 5b415fcd-14d0-72e2-e026-f98972e4de39 (at 44@gni1)
    Feb 17 10:23:43 pgfs-oss10.pawsey.org.au kernel: LustreError: 166585:0:(qsd_entry.c:211:qsd_refresh_usage()) $$$ failed to read disk usage, rc:-3 qsd:pgfs-OST0033 qtype:grp id:0 enforced:0 granted:0 pending:0 waiting:0 req:0 usage:0 qunit:0 qtune:0 edquot:0
    Feb 17 10:23:43 pgfs-oss10.pawsey.org.au kernel: Lustre: 166585:0:(qsd_reint.c:488:qsd_reint_main()) pgfs-OST0033: reint global for [0x200000006:0x1020000:0x0] failed. -3
    Feb 17 10:23:54 pgfs-oss10.pawsey.org.au kernel: Lustre: pgfs-OST0033: deleting orphan objects from 0x0:72046716 to 0x0:72046881
    Feb 17 10:23:54 pgfs-oss10.pawsey.org.au kernel: Lustre: pgfs-OST0033: Recovery over after 2:11, of 2308 clients 2308 recovered and 0 were evicted.
    Feb 17 10:23:54 pgfs-oss10.pawsey.org.au kernel: LustreError: 4840:0:(qsd_entry.c:211:qsd_refresh_usage()) $$$ failed to read disk usage, rc:-3 qsd:pgfs-OST0033 qtype:usr id:24159 enforced:0 granted:0 pending:0 waiting:0 req:0 usage:0 qunit:0 qtune:0 edquot:0
    Feb 17 10:23:54 pgfs-oss10.pawsey.org.au kernel: LustreError: 4626:0:(ofd_dev.c:1887:ofd_destroy_hdl()) pgfs-OST0033: error destroying object [0x100330000:0x41644d4:0x0]: -3
    Feb 17 10:23:54 pgfs-oss10.pawsey.org.au kernel: LustreError: 4735:0:(qsd_handler.c:1172:qsd_op_adjust()) pgfs-OST0033: fail to locate lqe for id:24165, type:0
    Feb 17 10:23:55 pgfs-oss10.pawsey.org.au kernel: LustreError: 4452:0:(ofd_dev.c:1887:ofd_destroy_hdl()) pgfs-OST0033: error destroying object [0x100330000:0x4165e90:0x0]: -3
    Feb 17 10:23:56 pgfs-oss10.pawsey.org.au kernel: LustreError: 307:0:(qsd_handler.c:1172:qsd_op_adjust()) pgfs-OST0033: fail to locate lqe for id:24165, type:0


  • Staff are arranging a suitable window to check the filesystem and see if we can restore functionality.
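
For anyone triaging a similar excerpt, the OSS log above can be summarised with a short script. This is only a sketch, not a tool used during this incident: the function name `summarise` and the `SAMPLE` list are hypothetical, and the regular expressions assume the message layout seen in the lines quoted above. It counts LustreError messages by the source function named in the message and extracts the recovery result.

```python
import re
from collections import Counter

# Assumed layout: Lustre tags some kernel messages with "(file.c:line:function())".
SOURCE_TAG = re.compile(r"\((\w+\.c):\d+:(\w+)\(\)\)")
# Assumed layout of the line printed when an OST finishes recovery.
RECOVERY = re.compile(
    r"Recovery over after (\d+:\d+), of (\d+) clients "
    r"(\d+) recovered and (\d+) were evicted"
)

def summarise(lines):
    """Return (Counter of LustreError counts per source function, recovery dict or None)."""
    errors = Counter()
    recovery = None
    for line in lines:
        if "LustreError:" in line:
            tag = SOURCE_TAG.search(line)
            # Messages without a source tag (e.g. 137-5) are grouped together.
            errors[tag.group(2) if tag else "untagged"] += 1
        match = RECOVERY.search(line)
        if match:
            recovery = {
                "duration": match.group(1),
                "clients": int(match.group(2)),
                "recovered": int(match.group(3)),
                "evicted": int(match.group(4)),
            }
    return errors, recovery

# A few lines copied from the OSS log above, used as sample input.
SAMPLE = [
    "Feb 17 10:19:17 pgfs-oss10.pawsey.org.au kernel: LustreError: 137-5: pgfs-OST0033_UUID: not available for connect from 864@gni1 (no target). If you are running an HA pair check that the target is mounted on the other server.",
    "Feb 17 10:23:54 pgfs-oss10.pawsey.org.au kernel: Lustre: pgfs-OST0033: Recovery over after 2:11, of 2308 clients 2308 recovered and 0 were evicted.",
    "Feb 17 10:23:54 pgfs-oss10.pawsey.org.au kernel: LustreError: 4626:0:(ofd_dev.c:1887:ofd_destroy_hdl()) pgfs-OST0033: error destroying object [0x100330000:0x41644d4:0x0]: -3",
    "Feb 17 10:23:54 pgfs-oss10.pawsey.org.au kernel: LustreError: 4735:0:(qsd_handler.c:1172:qsd_op_adjust()) pgfs-OST0033: fail to locate lqe for id:24165, type:0",
]

if __name__ == "__main__":
    errors, recovery = summarise(SAMPLE)
    for function, count in errors.most_common():
        print(f"{function}: {count}")
    print(recovery)
```

Grouping by source function makes it easier to see whether the errors cluster in the quota code (`qsd_*`) or in object handling (`ofd_*`), which matches the two symptoms described above.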


Post-Maintenance Summary:

  • Filesystem checks were successful and the OST was brought back online.