A job script intended to be executed on Magnus works fine when submitted from Magnus itself. However, when the same script is submitted from Zeus using the multi-cluster submission clause, the number of nodes is not set correctly and the script fails or misbehaves.
We have found that, in multi-cluster operation, some variables that are set by default on one cluster do not receive the same values when the job is submitted remotely from another cluster. Usually, the offending parameters are not set explicitly in the script, and the user relies on their default values. Unfortunately, those defaults may change when the submission starts from a different cluster. In this case, when the job is submitted from Zeus, the variable SLURM_HINT is not set properly, which leads to the wrong number of tasks being executed per node.
Our proposed solution in this case is to set the number of tasks explicitly within the job script header.
This is the recommended practice for every job script, even if it is intended to always be submitted from Magnus itself.
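As a minimal sketch of such a header, the fragment below requests nodes and tasks per node explicitly through Slurm's `--nodes` and `--ntasks-per-node` options. The node count, the value of 24 tasks per node, and the wall time are illustrative assumptions, not values taken from the failing script; adjust them to the job at hand.

```shell
#!/bin/bash -l
# Illustrative values only: node count, tasks per node and wall time
# are assumptions and must be adapted to the actual job.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
#SBATCH --time=00:10:00

# ... rest of the job script (module loads, srun command) unchanged ...
```

With the task layout stated in the header, the job no longer depends on defaults that may differ between the submitting clusters.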
Another solution would be to explicitly set the offending variable to the value it would assume on Magnus by default.
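A sketch of this second approach, placed in the job script before the srun command, could look as follows. The value `nomultithread` is only a placeholder assumption (it is one of the values accepted by Slurm's `--hint` option); the value actually used on Magnus should be confirmed first by printing SLURM_HINT in a job submitted from Magnus itself.

```shell
# Assumption: "nomultithread" stands in for the Magnus default.
# Replace it with the value observed in a job submitted from Magnus.
export SLURM_HINT=nomultithread
```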
When debugging this kind of problem, it is always useful to check the values of the Slurm variables at the different stages of the workflow in order to identify which parameters are causing the problem.
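One possible way to do this, shown here as a sketch, is to dump every Slurm-related environment variable to a file from within the job script. The file name pattern is a hypothetical choice made for this example.

```shell
# Save all Slurm-related environment variables, sorted by name.
# Outside a job SLURM_JOB_ID is unset, so the file is tagged "nojob".
snapshot="slurm_env_${SLURM_JOB_ID:-nojob}.txt"
env | grep '^SLURM' | sort > "$snapshot"
echo "Slurm environment written to $snapshot"
```

Running the same script once submitted from Magnus and once submitted from Zeus, and then comparing the two files with diff, shows which variables differ between the two submission paths.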
Alternatively, echo the value of specific variables, such as SLURM_HINT.
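For instance, the lines below print a few variables and mark them clearly when they are not set at all. Apart from SLURM_HINT, the choice of variables is an assumption about what is most relevant to this problem.

```shell
# Print selected Slurm variables; "<unset>" marks variables that are not defined.
echo "SLURM_HINT=${SLURM_HINT:-<unset>}"
echo "SLURM_NTASKS_PER_NODE=${SLURM_NTASKS_PER_NODE:-<unset>}"
echo "SLURM_JOB_NUM_NODES=${SLURM_JOB_NUM_NODES:-<unset>}"
```

Note that the `:-` expansion also reports `<unset>` for a variable that is defined but empty.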