Points of Contact:
- Mark O'Shea <email@example.com>
- David Schibeci <firstname.lastname@example.org>
- Dear colleagues,
Those of you using our compute facilities (zeus, galaxy and magnus) may have noticed some interruptions in access to files on the /group filesystem over the past few weeks. This has been due to the servers running the filesystem (of which there are 16) locking up and failing over to their high-availability partners (or, in some cases, not locking up enough and failing to fail over). My team's investigation has concluded that the likely cause is the version of Lustre this filesystem is running, combined with the large number of nodes we have accessing it; the same issue has been seen at other centres. We had planned to upgrade Lustre on the next maintenance day, but as we are seeing the issue more and more frequently, we have decided to start a rolling upgrade of the filesystem today.
The filesystem itself will remain up and running during the upgrade, served at any given time by the high-availability partner of the server being upgraded. You may notice occasional interruptions of 2-5 minutes during the failover and recovery process that Lustre uses to ensure data integrity when we finish upgrading a server and rebalance the object storage targets. All I/O operations block during this time and continue afterwards with no data loss (just a delay), so running jobs should not be affected.
We’re sorry for the inconvenience caused by this. We believe that it’s better overall to attempt to fix the issue now rather than wait until maintenance and continue to suffer interruptions anyway.
If you have any questions about the upgrade or even about Lustre in general (I for one like talking about it) please contact us at email@example.com
- 13:00 The first 4 servers have been upgraded successfully and further servers are being upgraded.
- 09:00 Dear Researchers,
As Mark announced, we started a rolling upgrade of the /group Lustre filesystem yesterday:
On 20 Feb 2019, at 10:49 am, Mark O'Shea <firstname.lastname@example.org> wrote:
We were going to upgrade the version of Lustre on the next maintenance day but as we're seeing the issue more and more we've decided to do a rolling upgrade of the filesystem starting today.
The final four servers were upgraded earlier this morning, and the service has been restored to its normal state. Staff will continue to monitor the performance of /group over the next few weeks. However, we would like to remind all users of the various HPC filesystems we have at Pawsey and their primary purposes. These are documented at https://support.pawsey.org.au/documentation/display/US/File+Systems%2C+File+Transfers+and+File+Management, but in summary:
* /home - storage of relatively small numbers of important system files such as your Linux profile, shell configuration etc.
* /group - storage of executables, input datasets, important output data, and so on, for the lifetime of the project.
* /scratch - temporary storage related to production runs in progress.
/group and /home should _not_ be used as the primary I/O space for running HPC jobs; use /scratch instead.
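As a sketch of this workflow, a batch job might stage its inputs onto /scratch, do all its I/O there, and copy only the important results back to /group when it finishes. All paths, the program name, and the $PAWSEY_PROJECT variable below are illustrative assumptions, not an official template:

```shell
#!/bin/bash -l
#SBATCH --job-name=example-run   # hypothetical job name
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

# Stage input data from /group onto /scratch (paths are illustrative).
WORKDIR=/scratch/$PAWSEY_PROJECT/$USER/example-run
mkdir -p "$WORKDIR"
cp /group/$PAWSEY_PROJECT/$USER/input.dat "$WORKDIR/"

# Run the job with all I/O happening on /scratch.
cd "$WORKDIR"
./my_program input.dat > output.dat

# Copy only the important results back to /group for long-term storage,
# and leave the temporary working files on /scratch.
mkdir -p /group/$PAWSEY_PROJECT/$USER/results
cp output.dat /group/$PAWSEY_PROJECT/$USER/results/
```

This keeps the heavy I/O of production runs on /scratch, where it belongs, while /group holds only the inputs and results worth keeping for the life of the project.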
If you have any further questions, please don't hesitate to get in touch via email@example.com or our support portal at https://pawsey.org.au/support/