According to squeue the cluster seems empty, but my job doesn't start
In contrast to many other HPC clusters that use Slurm, on Noctua the command squeue only shows your own jobs. You can use scluster to see the number of currently used and free nodes.
I have tried to log into Noctua a few times with a wrong password or wrong username. Now I can't log in even with the correct username and password.
We use fail2ban on Noctua. After a few failed login attempts, your IP address is banned from connecting to Noctua. Please notify our support to have the ban on your IP address removed.
How can I see when my jobs are estimated to start?
Please use spredict. The last column shows the predicted start time of your job. This is only an estimate based on the jobs currently submitted to the system, and it depends on the runtimes of other jobs. For example, if you have used up your monthly quota and a different user who has not used up their monthly quota submits jobs after you, your jobs will have a lower priority, so the other user's jobs will start before yours. Hence, the start time cannot be predicted exactly.
I experience some limits when submitting jobs on Noctua. Can they be relaxed?
We (PC2) don't set job submission limits on Noctua by default. By default, you can submit as many jobs as the database and Slurm can handle. However, your project administrator can set individual limits within your project, so they are the person to contact to relax or remove the limits. You can find your limits and your project coordinator/admin with the command-line tool "pc2status" on Noctua.
My parallel program is running much slower than expected. What could be the reason?
There can be a plethora of reasons. We list only a few of the unforeseen ones here:
- We have found that using --hint=nomultithread on Noctua can lead to unexpected pinning of multiple MPI ranks to one CPU core and drastically slow down your program. The reason is that the nodes on Noctua have Hyper-Threading disabled, which leads to unexpected behaviour of this hint.
- You might be using an MPI implementation that is not compiled for the Omni-Path interconnect on Noctua.
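One way to look for the pinning problem described in the first point is to let every rank print the cores it is allowed to run on and then check whether the same core list shows up for more than one rank. The following is a minimal sketch, not an official PC2 tool: the function name find_shared_cores is hypothetical, it assumes a Linux node where /proc/self/status is readable, and it only detects ranks with identical core lists (partially overlapping ranges are not detected).

```shell
# Read lines of the form "<rank> <core list>" from stdin and report every
# core list that is assigned to more than one rank.
find_shared_cores() {
    awk '
        {
            # count[] holds how many ranks reported this core list,
            # ranks[] collects the rank numbers as a comma-separated string.
            if (count[$2]++) ranks[$2] = ranks[$2] "," $1
            else             ranks[$2] = $1
        }
        END {
            for (c in count)
                if (count[c] > 1)
                    print "cores " c " shared by ranks " ranks[c]
        }'
}

# Possible use inside a job script (SLURM_PROCID is set by srun per task):
#   srun bash -c 'echo "$SLURM_PROCID $(grep Cpus_allowed_list /proc/self/status | cut -f2)"' \
#       | find_shared_cores
```

If the function prints nothing, no two ranks reported the same core list. Any output with the same srun options as your real job is a hint that the options, for example --hint=nomultithread, lead to oversubscribed cores.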
module avail doesn't show all the modules
Please use module --ignore_cache avail to circumvent the cache or use rm -rf ~/.lmod.d/.cache to reset the cache.
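The cache reset can be wrapped in a small helper so that it reports what it did. This is only a sketch: the function name reset_lmod_cache is hypothetical, and the default path is the one given above.

```shell
# Remove the user-level module cache directory and report what was done.
# The directory argument is optional and defaults to ~/.lmod.d/.cache.
reset_lmod_cache() {
    local cache_dir="${1:-$HOME/.lmod.d/.cache}"
    if [ -d "$cache_dir" ]; then
        rm -rf -- "$cache_dir"
        echo "removed $cache_dir"
    else
        echo "no cache at $cache_dir"
    fi
}

# Typical use on a login node:
#   reset_lmod_cache
#   module --ignore_cache avail
```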
Applications compiled with Intel Parallel Studio XE 2019 may terminate with an error message
This can happen after several minutes of program execution. Typical messages are:
"srun: error: <node name>: tasks <task_id>: Segmentation fault"
"Fatal error in PMPI_Iprobe: Invalid communicator, error stack .."
"KILLED BY SIGNAL: 11 (Segmentation fault)"
"BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES"
Use the latest Intel PS XE 2019.3. Compile and run the application in the same Intel PS XE environment!
module add intel/19.0.3_compiler cray-impi/4_19.0.3
If you still have problems, use Intel PS XE 2018.
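To check whether a failed job matches this problem, its output file can be searched for the messages listed above. The following is a minimal sketch: the function name has_intel_crash_signature is hypothetical, and the file name in the usage comment assumes Slurm's default output name slurm-<jobid>.out.

```shell
# Return 0 if the given log file contains one of the typical messages,
# non-zero otherwise. -F treats the patterns as fixed strings.
has_intel_crash_signature() {
    grep -q -F \
        -e 'Segmentation fault' \
        -e 'Fatal error in PMPI_Iprobe: Invalid communicator' \
        -e 'BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES' \
        -- "$1"
}

# Possible use after a job has failed:
#   if has_intel_crash_signature slurm-<jobid>.out; then
#       echo "known Intel PS XE 2019 symptom, try the module versions above"
#   fi
```

A match only shows that the symptom is present. A segmentation fault can of course also come from a bug in the application itself.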