Pawsey User Forum (Sydney), University of Sydney, 27th November 2019
I am interested to know how many GPUs will be available in the new supercomputer?
As part of the preparation for the procurement, Pawsey has conducted an extensive review of GPU usage that was modelled on a similar process carried out at NERSC. Concerns were raised at previous user forums that researchers with GPU workflows that were not currently running on Pawsey systems due to lack of GPUs may be under-represented, and in response Pawsey has conducted a compute survey of its user base to inform the refresh process. It is expected that the new system will contain significant GPU-enabled computational resources, but the exact specification will be determined by the outcome of the tender process.
I would like to provide feedback that the support for containerisation at Pawsey is wonderful, it has been a game-changer for our workflows. I hope this will continue into the future?
Pawsey has been an early adopter of containerisation in supercomputing, and currently has Shifter and Singularity available on its systems. We have found that this technology has not only improved the portability of complex software stacks for some of our users, but also improved performance by coalescing file access patterns, particularly for parallel Python and OpenFOAM workflows. The adoption of containerisation for appropriate workflows has been championed and supported by several of our staff. We expect our support for containerisation to continue and remain up to date as the technology continues to develop.
How do you support containers currently?
Currently, we help researchers learn to build and run containers on our facilities. There is documentation available and we periodically conduct containerisation training webinars for researchers to attend online.
I would be very interested to see benchmarking and optimisation efforts around containerisation of workflows.
Some initial investigations of containerisation have shown positive effects on the performance of some workflows. This has primarily been a result of the container reducing the metadata load for reading and writing files by reducing the number of file transactions.
Which vendor provides Pawsey's computational infrastructure?
Pawsey's computational infrastructure is funded by the federal government and is purchased through the appropriate procurement processes. As a result, Pawsey's infrastructure includes components from a variety of vendors that have been successful in those processes.
I am investigating machine learning workloads, and I am particularly interested in GPUs with tensor cores available for use. Are there any at Pawsey?
Pawsey has a number of NVIDIA V100 GPUs with tensor cores. The Nimbus research cloud has 12 GPU-enabled nodes each with a single V100 GPU. The most recent GPU system, Topaz, has 22 nodes each with two V100 GPUs and is currently in pre-production testing with early adopters. We expect Topaz to be in production and available to all Pawsey users early next year.
A number of my colleagues have found the system reliability to be an issue for their work. Has anything been done to improve the reliability of Pawsey's systems?
The two leading causes of unplanned outages are failures in the building's cooling systems and the shared filesystems. These incidents are investigated and measures are put in place where possible to reduce or eliminate the chance of the issue recurring. For example, a number of firmware updates applied to the shared filesystems during scheduled maintenance sessions have resulted in a reduction in the number of unplanned incidents. Where reliability does impact workflows running on the system, researchers are encouraged to contact the Pawsey helpdesk.
What training opportunities are there for researchers using Pawsey facilities?
We conduct introductory 2-day user training sessions throughout the year in Adelaide, Brisbane, Melbourne, Perth and Sydney. These sessions are a great opportunity to meet Pawsey staff in person and discuss workflows in detail. We also run online training sessions on specific topics of interest, such as containerisation and parallel programming. Signing up to the Pawsey Friends mailing list, reading the Pawsey newsletter, and checking the events page on the Pawsey website are good ways to find out about upcoming training sessions.
Pawsey User Forum (Melbourne), University of Melbourne, 11th September 2019
I have completed the recent user compute survey, and I think it was good to have this opportunity to provide input into the capital refresh process.
Thank you for taking the time to participate in the survey. It was suggested at the previous user forum, and we expect the feedback we collect will provide valuable insight to further inform the capital refresh of the new Pawsey supercomputer.
Are there any updates that can be shared regarding the capital refresh?
A vendor briefing event was held alongside the recent HPC-AI AC conference, and the information presented on the day is publicly available.
Are you able to provide more information regarding the number of GPUs that will be in the new system?
We cannot provide specific details regarding the procurement, to ensure probity in the process. We have undertaken an extensive review of the scientific applications used at scale on Pawsey infrastructure to assess the suitability of GPUs, modelled after similar activities at other international centres. Additionally, we are conducting the user compute survey mentioned above to capture user GPU requirements, including for workflows run elsewhere and in the coming years.
Could other groups do more to ensure their research codes are able to take advantage of GPUs?
There are a large number of project groups that utilise Pawsey facilities, spread across a wide range of disciplines and using a diverse range of scientific applications. In some cases, these codes have built-in GPU support that may simply require recompilation. At the other extreme, there are applications with large code bases developed over many years that are too time-intensive to port to GPUs in the immediate future, or that use algorithms that are not suited to the architecture.
Internationally, there are exascale funding initiatives specifically to support application development and code modernisation, could more be done in Australia?
This topic has been raised at a number of user forums. Initiatives such as these would be beneficial, particularly for domains that have more recently started to work at scale. While this would be a significant investment of time in the short term, the benefits in terms of performance, scalability, and even simplification of software management would be even greater in the longer term.
Has there been any collaboration between NCI and Pawsey regarding their respective capital refresh processes?
Pawsey's Chief Technology Officer and the Technical Manager for the capital refresh both have participated as observers in NCI's process. Both Pawsey and NCI are working hard to ensure our next generation systems align to support the computational needs of Australian researchers.
We have been very happy with the services provided by Pawsey, our main limitation is simply needing more compute time.
The need for more resources was also common feedback from our last annual user survey. The new supercomputers at both Pawsey and NCI will provide additional capacity to support Australian computational research at scale.
It would be helpful if the NCMAS call was more closely aligned with the ARC process. It is difficult to write proposals for each scheme that are dependent on the success of both.
Applicants with an active ARC track record and well justified computational requirements should be competitive in the NCMAS scheme. This is interesting feedback for us to look into in the longer term.
The latest version of OpenFOAM on Magnus is 17.12, it would be great to have a more recent version such as 19.06 available as a module.
Thanks for the feedback, we'll take a look.
Is GPUDirect RDMA not currently working on Zeus?
We are in the process of finalising a modest increase to our GPU capacity in preparation for the upcoming capital refresh. This will be more suitable for GPUDirect RDMA workloads.
I found Athena really useful as a GPU development environment for production jobs that I run on larger systems at other international centres. In particular, this was because of the fast turnaround of jobs. On the next system, would it be possible to have larger development queues to support this kind of work?
Our application team staff have had similar discussions around the support of GPU development work. It is great to know that this is something our users would also find useful, and we'll certainly look to address this in the configuration of the next Pawsey supercomputer.
Pawsey User Forum (Adelaide), Adelaide University, 18th July 2019
I am concerned that the capital refresh will need a significant GPU component to be competitive internationally with other high performance computing centres.
Pawsey has a large user base with a diverse range of scientific codes, which vary in their capacity to make use of GPUs for computation. Identifying the appropriate level of infrastructure to support GPU-accelerated computing has been raised by the capital refresh governance. We are investigating the current and potential future capability of user workflows to make use of GPU resources, informed by similar studies carried out at other international high performance computing centres.
Our group currently runs the majority of our GPU workflows at other centres and uses Pawsey facilities for our CPU workflows. I am concerned that without surveying project leaders directly the results of an investigation will not accurately represent the actual needs of the user base.
This is helpful feedback. We will look into preparing a survey to capture the expected needs of our users over the next four years to better inform the procurement process.
Is the plan to continue the allocation based on merit, as opposed to fee for service, for the new capital refresh infrastructure?
Access mechanisms and processes for Pawsey infrastructure are determined by the Pawsey Allocation Committee. We are not aware of any plans to change the current competitive merit approach.
We often use our allocation during the first month of the quarter, is Magnus underutilised?
Compute time on Magnus is allocated based on 48 weeks of availability per year (92.3%), allowing for 4 weeks of outages. The actual utilisation of Magnus in 2018 was 92.2%.
Groups that always have jobs available in the queue tend to work through their allocation faster as they gain priority through job aging. Once a group exceeds their allocation, further jobs are run at a lower priority as described below.
What is Pawsey's position regarding groups exceeding their allocation?
Each quarter, there are a small number of groups that do not fully utilise their allocation and it is not feasible to reallocate this time through formal processes. Instead we allow groups allocated time through competitive merit to exceed their allocation, rather than let the system run idle and not produce scientific outcomes. Project utilisation is provided to the merit allocation committees for consideration in the allocations for the subsequent year.
To ensure fairness, the scheduler priorities are set such that a job from a project that has exceeded its allocation will not start running unless there are no valid jobs from projects with allocation remaining.
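The ordering rule described above can be sketched as a sort key. This is only an illustrative model, not Pawsey's actual scheduler configuration: the `Job` fields, the project names, and the use of queue wait time as a stand-in for job aging are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # All fields are hypothetical and exist only for this illustration.
    project: str
    hours_waiting: float    # stand-in for priority gained through job aging
    over_allocation: bool   # True if the project has exceeded its allocation


def start_order(jobs):
    """Order jobs so that over-allocation jobs come after all others.

    Within each group, jobs that have waited longer come first.
    """
    return sorted(jobs, key=lambda j: (j.over_allocation, -j.hours_waiting))


queue = [
    Job("project_a", hours_waiting=30.0, over_allocation=True),
    Job("project_b", hours_waiting=2.0, over_allocation=False),
    Job("project_c", hours_waiting=12.0, over_allocation=False),
]

for job in start_order(queue):
    print(job.project, job.over_allocation)
```

In this model the job from `project_a` is placed last even though it has waited the longest, because its project has exceeded its allocation.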
Does Pawsey oversubscribe its supercomputing systems?
No. The amount of resource given to each share of Pawsey supercomputers assumes a 48-week uptime, to allow for maintenance and unscheduled outages. This total is distributed to the various allocation shares (NCMAS, Energy & Resources, Partners, Directors) based on their assigned percentages.
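As a rough illustration of the arithmetic, the sketch below computes the allocatable fraction of a year from 48 weeks of uptime and splits a pool of core hours across shares. The system size and the share percentages are placeholders chosen for the example, not Pawsey's actual figures; only the 48-week assumption and the share names come from the answer above.

```python
WEEKS_PER_YEAR = 52
AVAILABLE_WEEKS = 48        # from the text: 4 weeks allowed for outages
HOURS_PER_WEEK = 7 * 24


def allocatable_core_hours(total_cores):
    """Core hours available to allocate in a year, assuming 48 weeks of uptime."""
    return total_cores * AVAILABLE_WEEKS * HOURS_PER_WEEK


def split_by_share(core_hours, share_percentages):
    """Distribute core hours across shares according to their percentages."""
    if abs(sum(share_percentages.values()) - 100.0) > 1e-9:
        raise ValueError("share percentages must sum to 100")
    return {name: core_hours * pct / 100.0 for name, pct in share_percentages.items()}


# Hypothetical system size and share percentages, for illustration only.
example_cores = 10_000
example_shares = {
    "NCMAS": 40.0,
    "Energy & Resources": 20.0,
    "Partners": 30.0,
    "Directors": 10.0,
}

availability = AVAILABLE_WEEKS / WEEKS_PER_YEAR
pool = allocatable_core_hours(example_cores)
print(f"Allocatable fraction of the year: {availability:.1%}")
for share, hours in split_by_share(pool, example_shares).items():
    print(f"{share}: {hours:,.0f} core hours")
```

Because the pool is sized from 48 weeks rather than 52, the shares add up to no more than the time the system is expected to be available.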
What options are there for prioritising workflows?
In exceptional circumstances, such as one-off external deadlines, Pawsey staff can boost the priority of a job for projects that have sufficient allocation remaining for the job in question. See Extraordinary Resource Requests for more details.
We are currently investigating providing this capability to users such that they can self-select a higher priority for a small portion of their allocation, to provide more flexibility in managing workflows.
Will there be any charge factors or bonuses for different priority?
For clarity in merit applications and reporting, we prefer that job priorities do not affect the apparent usage of a project.
We are having issues with shorter wall times at other centres, resulting in inefficient workloads due to unnecessary checkpointing. Please don't reduce the wall time limit.
We do not have any plans to reduce the wall time limit from 24 hours for production jobs. We also have the longq partition on Zeus available for 28-core jobs running for up to 4 days. Larger long-running jobs are considered on a case-by-case basis; contact us via the User Support Portal with your request.
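The limits quoted above can be captured in a small helper when planning job submissions. This is a sketch only: the function name and the returned labels are made up for the example, and treating longq as accepting jobs of up to 28 cores is an assumption. The 24-hour and 4-day limits come from the answer above, and the actual partition limits should be checked on the system.

```python
from datetime import timedelta

PRODUCTION_LIMIT = timedelta(hours=24)  # wall time limit for production jobs
LONGQ_LIMIT = timedelta(days=4)         # longq partition on Zeus
LONGQ_MAX_CORES = 28                    # assumption: longq takes jobs of up to 28 cores


def suggest_queue(wall_time, cores):
    """Return a hypothetical label for where a job of this shape could run."""
    if wall_time <= PRODUCTION_LIMIT:
        return "production"
    if wall_time <= LONGQ_LIMIT and cores <= LONGQ_MAX_CORES:
        return "longq"
    # Larger long-running jobs are handled case by case via user support.
    return "contact support"


print(suggest_queue(timedelta(hours=12), cores=480))
print(suggest_queue(timedelta(hours=72), cores=28))
print(suggest_queue(timedelta(hours=72), cores=280))
```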
How can I enforce usage of allocations within my project group?
Most project groups rely on good communication regarding usage to manage their allocations, with the assistance of tools such as pawseyAccountBalance. Allocations can be split into sub-projects with different teams if needed. Note that allocation committees require single applications for the overall project to reduce the workload on both applicants and reviewers.
I need to host my dataset and currently the best way to do so is using volume storage on Nimbus?
This is an appropriate use case on our cloud infrastructure for datasets of up to around 200 TB. For larger data collections that need to be accessible online for others to work with and collaborate, contact our staff via the User Support Portal.
I've had to move a non-trivial amount of data between several centres, it would be helpful if it could be hosted in the longer term at Pawsey.
As research data sets grow larger, data transfers can become a challenging aspect of computational workflows. Pawsey staff have expertise in making the most of the available connections within Australia and internationally. Up to 10 TB of long-term storage is available to projects with merit allocations on Pawsey supercomputing facilities, and applications can be submitted to support larger data collections.
I'm looking to move my MPI workflow to Pawsey, what is the best path?
Now is the ideal time to apply for a director share allocation to access Pawsey supercomputing resources. This will allow the installation and benchmarking of software, which is critical to include in the upcoming merit allocation call in September for supercomputing time in the subsequent 2020 calendar year. A strong merit application will clearly identify the computational work based on these benchmarks needed to accomplish the scientific outcomes of the project, with the established workflow on the systems supporting the feasibility of the application.
Pawsey User Forum (Perth), Curtin University, 28th May 2019
Can more detail be provided regarding the specifications of the new systems?
As the capital refresh is funded by the Australian federal government, we must maintain a fair and open process for our procurements. To ensure this is the case, the only public release of the detailed technical requirements is in the tender documents made available to the vendors.
It is also difficult to predict the exact specification of the system, as it is very dependent on the release and availability of architectures, price fluctuations on components with supply and demand, and ultimately what vendors are able to provide that represents the best value for our user community. We won't know the exact specification until towards the end of the procurement process, which is typically when we announce the new systems.
I feel like I haven't had an opportunity to provide input on the new systems, is there an appropriate avenue to do so?
The user requirements for the capital refresh have been informed by discussions at the previous user forums, information provided via merit applications, observed system usage, and the technology evaluation carried out with the advanced technology cluster (Athena).
The initial procurements have had specific scope, and sub-groups of the User Reference Group have been created where necessary with appropriate representation for endorsement.
We are always happy to receive your input and the best avenue is through open discussion at the user forums. If you are not able to make it in person, feedback can be provided via the registration form or by raising a ticket in the Pawsey user support portal.
It's also important to note that the most useful feedback is the functional needs of your computational research, rather than specific technical specifications, as the latter is more likely to change over time.
Previous user feedback has been provided to the User Reference Group in full detail, and ongoing feedback will continue to be provided.
My research involves global collaboration with large data sets. Is Pawsey considering improvements to data movement?
The ability to shift data to the appropriate resource for processing is rapidly becoming one of the most critical aspects of computational research.
For this reason, the High Speed Network Backbone is a significant component of the procurement program. The current generation of our various computational facilities is not as well connected as we would prefer, so we are working hard to ensure that it will be easy to access data across the various new systems. This will be critical, particularly as modern workflows are maturing to require different types of computational resources for different stages.
Pawsey's connections to the rest of Australia and the world are also improving as new fibre optic links connecting Perth to Singapore and Sydney come online.
Two of the biggest issues we encounter are needing longer wall times and more memory per node.
We have received this feedback consistently at several user forums and directly, particularly from our genomics researchers.
For this reason we have implemented the longq on Zeus with wall times of up to 4 days, and have recently procured six 1 TB nodes to form the highmemq partition on Zeus.
The configuration of Zeus is better suited to my research than Magnus.
This is not surprising, as Magnus has remained largely unchanged since the initial Pawsey project procurement. Magnus is better suited for traditional capability-scale software written in C/Fortran with MPI, and since it was procured there has been a large growth in demand for Python-based and data-intensive workflows.
In the meantime, Zeus has grown from its original role as a visualisation and data staging cluster to meet the needs of the user community over time. It now supports capacity workflows that require large numbers of cores processing with little communication, and has absorbed Athena to provide GPU computing capability as well. As mentioned above, it also supports longer running jobs and workflows with high memory requirements.
We have been using Zeus to prototype the configuration for the next Pawsey supercomputer, which we expect will support all of the above as well as MPI-enabled capability workflows currently supported by Magnus.
Are there any plans to include fast I/O nodes, or on-node storage?
Our radio astronomy researchers have been very interested in fast I/O, given the data intensive nature of their workflows. It may potentially be applicable to emerging data intensive research in the bioscience domains as well.
We have also seen Python workflows and computational fluid dynamics codes that require a large amount of I/O to numerous small files, which have not been well suited to our existing filesystem configuration.
We are aware of these issues, which affect a portion of the workflows on our systems, and will be looking to ensure they are addressed in the refresh.
There are many ways to support these workflows effectively, and for the reasons already discussed it is too early to know whether it will take the form of on-node fast I/O.
What is Pawsey's role in porting user workflows to make use of GPU computing?
In addition to its role in operating high performance computing facilities to support Australian computational research, Pawsey also provides support to enable researchers to make best use of its facilities. Energy usage is an important consideration for HPC facilities, and it is important to transition to more energy-efficient architectures, such as GPUs, where possible. However, given the large number of codes used by the Australian computational research community, we target our efforts where they will have the biggest impact. These are typically codes developed directly by our users that are unlikely to be ported by the larger international computational research community. It's also important to recognise that not all algorithms are well suited for porting to the GPU architecture.
There are a number of activities at Pawsey that support the adoption of GPU computing resources, targeted at different levels of user expertise. Pawsey recently hosted the GPU Hackathon for the second time in Australia, in which teams from the Pawsey user community are partnered with international GPU experts for a week of focused development effort to GPU-accelerate software used by the teams. Our uptake project program, in which users are partnered with Pawsey staff to improve software and workflows that use Pawsey facilities, often includes GPU computing projects. One of our recent GPU projects improved the software performance by a factor of 200 over the original serial code. We also host GPU computing training courses periodically, presented in some cases by NVIDIA or our own staff.
We notice our students with software development skills are better prepared for their subsequent careers, and training provided by Pawsey is critically important.
Our training program is focused across several areas of expertise. At an introductory level, it provides the key skills to enable researchers to access and use our facilities. To ensure these systems are used in a way to best support the computational research of our users, we also provide more advanced training for advanced workflows and software deployment. To encourage an ecosystem of software that can scale to large computational challenges, we also provide a range of courses covering parallel programming, optimisation, and various profilers and debuggers that support this kind of work.
The need for computational research is increasing across most areas of science. More should be done to ensure the next generation of researchers are well equipped for the future.
There is a need for computational researchers with a background both in computer science and their scientific research domain. This is particularly true in domains that have relatively recently transitioned to large-scale computational research from mostly field and laboratory-based work.
Pawsey works with lecturers and domain experts at its partner institutions to provide access to supercomputing resources as part of the coursework. In our experience, this works well when the lecturer is an experienced Pawsey user, and Pawsey has provided access to the supercomputing resources and an introductory guest lecture at the start of the course to introduce students to the HPC environment.
Could there be better communication around maintenance and outages?
We are continuing to improve our communication strategy for maintenance and outages. We are aware this is an important issue, with ongoing effort and consideration at various levels within Pawsey for improvement, including the Pawsey Board.
Over time, we have engaged several strategies for improving maintenance communication based on feedback from the user community. This has included a maintenance and incidents page on the user support website, and adopting a predictable schedule of the first Tuesday of each month for planned maintenance. Following feedback from a previous user forum, we have improved the level of detail provided on the maintenance and outage pages. Pawsey staff are striving to perform more of our planned maintenance without the need for outages.
We are also actively developing a project to provide more responsive details about the state of our systems.
I am interested in finding out more about the Pawsey Uptake Projects for 2019. In particular, it's not clear if there are specific calls or if it's an ongoing process.
Historically, we have held calls once or twice a year. Some of these calls have been targeted at specific activities, such as the Petascale Pioneers call to focus on scaling up users on Magnus or the Code Sprint to evaluate GPU and Xeon Phi processor technologies. Other calls have had a broader scope of improving the full range of user workflows and software performance on Pawsey facilities. At the same time, we have also had projects that have emerged outside of these calls, often through service desk tickets that have turned into more involved assistance.
We have been observing the process carefully for the most recent calls, and identified some common issues that reduced the impact of a small portion of projects:
- The proposed project work was not appropriate or did not address the actual performance issues of the code
- A lack of engagement from users once a project was started
- Absence of unit tests or test cases to verify correctness
- Unwillingness of software maintainers to accept code improvements
To address these issues we are updating our process, which will include both the open calls and more informal projects. In particular, it will feature a more in-depth review that should provide both the applicants and Pawsey staff with a more detailed understanding of the workflow, and any appropriate improvements that could be made.
Our call will be slightly later than last year and will likely support fewer projects, as many of our staff are actively involved with processes for the capital refresh.
The call for 2019 closes on the 27th of July, see our website for details on how to apply.
Pawsey User Forum (Brisbane), University of Queensland, 19th March 2019
I've been using Pawsey facilities for some time, and it has been a really good experience. The staff were really helpful when setting up the job scripts and my code scales really well on Magnus up to thousands of cores. The developers were surprised by the performance and are interested in finding out more about the performance of the code on the system.
It's always great to hear that codes are performing well on our systems. Where our users are running codes at scale, we have in the past provided developers access to our systems with modest allocations for benchmarking, profiling, and improving software build systems to add support for our facilities. This can typically be supported either by the developer being added to a project and making use of an existing allocation, or applying for a director share project specifically to support this kind of activity. Don't hesitate to get in touch via our user support portal.
My computational research is Computational Fluid Dynamics, we are always looking to increase the resolution of our simulations. As a result, we need larger allocations to support our work.
We are very aware of the increasing computational needs of our user base. While the upcoming refresh of our facilities should provide more capability for all our users, it is critical to submit well-prepared merit applications that justify the requested time by clearly detailing how it is needed to realise the impact of the research. We provide a guide to submitting a competitive merit application in our user support documentation, and are always happy to provide advice if you get in touch well ahead of the application closing date.
Steering simulations by repetitively submitting jobs can be frustrating and inefficient. My field is moving toward interactively steering computations that are running at scale.
We provide remote visualisation facilities that can be used in some cases to enable this kind of workflow. It is very dependent on how the particular piece of software supports computational steering. If you need assistance investigating whether this is something that can be run on Pawsey facilities for your software, contact us via our user support portal.
We typically need a modest number of nodes with GPUs and a reasonable amount of RAM for our meshing workflows.
Our GPU facilities on Zeus may be a suitable place to run, and this is useful feedback for the future.
When should we be making use of institutional clusters and facilities instead of Pawsey?
Institutional resources should be used when they are sufficient for your computational research. In our discussions with institutional facilities, we find they often have one or two projects with workflows that consume a lot of the computational resources. By moving these larger workflows to the national HPC facilities, NCI and Pawsey, the institutional facilities have more availability to support the majority of their users.
I applied for time through competitive merit processes, but did not realise there was a final click needed to actually submit my application. What should I do?
Unfortunately, the time is often already allocated by the review committees to the applications that were submitted on time. Where possible, we check for applications that appear complete shortly after the closing date and contact applicants to check, but ultimately the applicant is responsible for submitting the application. If this has occurred, contact us as soon as possible to see what, if anything, can be done.
I seem to need a lot of different usernames and passwords to access various parts of Pawsey facilities and services, is there anything that can be done to make this more convenient?
Pawsey compute and storage resources all use the same Pawsey username and password to log in. The helpdesk system can create a temporary user account to allow non-Pawsey users to submit inquiries. The application forms for Pawsey resources allow non-Pawsey users to apply so they currently have a separate authentication system. We are currently investigating leveraging AAF to provide an easier access method.
We have a lot of local data that we would like to move to Pawsey, what is the best way to go about doing it and how fast will it be?
For datasets approaching terabytes and larger, feel free to get in touch via the user support portal. Refer to our research storage documentation for details of how to apply for storage capacity on Pawsey facilities. The speed of individual data transfers is very dependent on the path the data takes and the software used to transfer it.