OCuLUS-FAQ

FAQs

My job was aborted with the message: "... usage .. exceeded limit"

If you request resources in shared mode, CCS monitors your job and aborts it if the job uses more than the requested resources.

This is to ensure that all jobs running on a node will be able to use their requested resources.

Chapter 5.6 of the User-Manual explains this limit enforcement in more detail.

If you cannot estimate the needed resources, it is a good starting point to allocate nodes exclusively and to activate

  • Email notification (ccsalloc -m ...) or
  • writing of a job trace file (ccsalloc --trace ...)

After the job has ended, you will get a report of the used resources.

ccsinfo <reqID> prints the current resource usage while the job is running.

ccstracejob <reqID> prints this information after the job has ended. See also the FAQ below.

E.g., to allocate 4 nodes exclusively you may use

 * ccsalloc -n 4 
 * ccsalloc --res=rset=4:ncpus=1,place=:excl

The latter gives you more options to specify the characteristics of a node.

The Brief Instructions contain many examples of how to use ccsalloc --res.

How to get information about current resource usage

  • you may use the command pc2status, which also gives you this information
  • ccsinfo -u gives you something like this
 Allocated Resources of Group: hpc-prf-hell
 Resource             Limit  Allocated (% of Limit)
 ====================================================================
 ncpus                 2048        228 ( 11.13%)
 mem                    N/A    885.13g (    N/A)
 vmem                   N/A      1.07t (    N/A)
 gpus                    10          1 ( 10.00%)
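To watch for a group limit being approached, the report can be filtered with a small awk helper. This is only a sketch: usage_over is a hypothetical name, and it assumes that ccsinfo -u always prints the column layout shown in the sample above (name, limit, allocated amount). Rows without a numeric limit, such as mem and vmem here, are skipped.

```shell
# Sketch: list resources whose allocation is at or above a threshold (percent).
# Assumes the column layout of the sample 'ccsinfo -u' output shown above.
usage_over() {
    # $1: threshold in percent; input: 'ccsinfo -u' output on stdin
    awk -v t="$1" '
        # keep only rows with a numeric limit and a numeric allocation
        $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9.]+$/ {
            p = 100 * $3 / $2
            if (p >= t) printf "%s %.2f\n", $1, p
        }'
}
```

It would be used as: ccsinfo -u | usage_over 80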

How to convert job scripts from another WLM (SLURM, PBS, Torque, ...)

Read WLM-Rosetta and Oculus Brief Instructions

How to run job-chains (one job after another)

Job chains are useful if one job depends on the output of another job.

Example: We have 3 jobs: job1.sh, job2.sh, and job3.sh

Job3 depends on job2 and job2 depends on job1.

So, we have to start them one after another.

Because CCS does not support job-chains directly, we provide the script $PC2SW/examples/submitCCSJobchain.sh.

Calling it without a parameter prints a help text.

How to search for module files

We provide the command mdlsearch to search for module files.

$ mdlsearch 
purpose: search for module files
usage: mdlsearch ITEM

ITEM is not case sensitive.

examples:
 mdlsearch fftw      prints all modules which names contain fftw
 mdlsearch "^fftw"   prints all modules which names start with fftw

ccsinfo reqID does not find the data

OpenCCS holds the data of completed jobs for 30 minutes in its memory. After that time the data is removed and ccsinfo prints this message:

 [kel@fe2]$ ccsinfo 4229599
 ERROR:Inquiry denied for request (4229599):Unknown request 4229599

The command ccstracejob prints the log and accounting data of such jobs. Refer to the man page of ccstracejob for detailed information.
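The two commands can be combined in a small wrapper that tries ccsinfo first and falls back to ccstracejob. This is a sketch only: job_info is a hypothetical name, and it assumes that ccsinfo returns a non-zero exit status for an unknown request, which this page does not state.

```shell
# Sketch: show job data from ccsinfo, falling back to ccstracejob.
# Assumption: ccsinfo exits with a non-zero status for an unknown request.
job_info() {
    reqid=$1
    if ccsinfo "$reqid" 2>/dev/null; then
        return 0
    fi
    # data is no longer held in memory: use the log and accounting data
    ccstracejob "$reqid"
}
```

Usage: job_info 4229599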

No host found which provides the requested resources:Perhaps a collision with a node freepool

A node freepool limits the access to a node. The constraints are part of the node's properties.

One can inspect these properties by calling ccsinfo -n --state=%H%p%m.

More details are in Appendix K of the OpenCCS User-Manual.

OpenCCS did not write a job trace file

It may happen that OpenCCS is temporarily not able to write the trace file to the specified directory. In such cases OpenCCS writes the file to $CCS/tmp/OCULUS/TRACES.

Which node types are available?

 Type        Nodes  CPU-Type                 Cores  Memory  Accelerator
 ===============================================================================
 normal      552    two Intel Xeon E5-2670   16     64GB    -
 washington  20     two Intel Xeon E5-2670   16     256GB   -
 tesla       1      two Intel Xeon E5-2670   16     64GB    1 NVIDIA K20 (Kepler)
 tesla       7      two Intel Xeon E5-2670   16     64GB    2 NVIDIA K20 (Kepler)
 gtx1080     14     two Intel Xeon E5-2670   16     64GB    1 NVIDIA GeForce GTX-1080 Ti
 gtx1080     2      two Intel Xeon E5-2670   16     64GB    2 NVIDIA GeForce GTX-1080 Ti
 rtx2080     15     two Intel Xeon E5-2670   16     64GB    1 NVIDIA GeForce RTX-2080 Ti
 rtx2080     2      two Intel Xeon E5-2670   16     64GB    2 NVIDIA GeForce RTX-2080 Ti
 smp         4      four Intel Xeon E5-4670  32     1TB     -

'ccsinfo -n' shows 'only local jobs = true'. What does it mean?

This node only accepts jobs that run completely on that node. Jobs using more than one node are not mapped to this node.

Cannot get 16 cores on a gpu node

Nodes hosting GPU cards keep one core per GPU free for jobs requesting GPUs. Hence, jobs that do not request a GPU card get at most 15 cores on a node with one GPU card and at most 14 cores on a node with two.
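The arithmetic behind these numbers, as a minimal sketch (max_cores_without_gpu is a hypothetical helper name; the values 16, 1, and 2 come from the node type table above):

```shell
# Sketch: cores available to a job that requests no GPU on a GPU node.
# Arguments: cores per node, number of GPU cards in the node.
max_cores_without_gpu() {
    echo $(( $1 - $2 ))
}

max_cores_without_gpu 16 1   # node with one GPU card
max_cores_without_gpu 16 2   # node with two GPU cards
```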

HowTos

Install Python packages in your home directory

To accelerate the file-I/O and avoid quota problems with $HOME, you may establish softlinks from $HOME/.local and $HOME/.cache to your $PC2PFS group directory.

Example:
#save the original directories
mv $HOME/.local $HOME/.local-sic
mv $HOME/.cache $HOME/.cache-sic

#create the new directories in $PC2PFS
cd $PC2PFS/MYGROUP/MYDIR
mkdir -p homelocal homecache

#create the softlinks
cd $HOME
ln -s $PC2PFS/MYGROUP/MYDIR/homelocal .local
ln -s $PC2PFS/MYGROUP/MYDIR/homecache .cache

Choose the python release you want to use by loading the related module (e.g. module load lang/Python/3.7.4-GCCcore-8.3.0).

You may search for Python modules by using mdlsearch python. Then install for example numpy.

     $ pip install --user numpy

This will install the numpy package in your home directory: $HOME/.local/. Once that is done, you will need to make sure:

       $HOME/.local/bin is in your ‘PATH’ variable and
       $HOME/.local/lib/pythonx.y/site-packages/ is in your PYTHONPATH.

Make sure to replace the x.y part with the actual version of Python you are using. For instance:

  $ export PATH=$HOME/.local/bin:$PATH
  $ export PYTHONPATH=$HOME/.local/lib/python3.7/site-packages/:$PYTHONPATH
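To avoid hard-coding the x.y part, the version can be queried from the interpreter itself. The following is a sketch under assumptions: user_site is a hypothetical helper name, and python3 stands for whatever interpreter the loaded module provides.

```shell
# Sketch: build the per-user site-packages path for a Python version (x.y).
# user_site is a hypothetical helper name, not part of any module.
user_site() {
    echo "$HOME/.local/lib/python$1/site-packages"
}

# Ask the interpreter for its x.y version; python3 is an assumption,
# use whatever interpreter your loaded module provides.
if command -v python3 >/dev/null 2>&1; then
    pyver=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
    export PATH="$HOME/.local/bin:$PATH"
    # append the old PYTHONPATH only if it is non-empty
    export PYTHONPATH="$(user_site "$pyver")${PYTHONPATH:+:$PYTHONPATH}"
fi
```

The ${PYTHONPATH:+...} expansion avoids a trailing colon when PYTHONPATH was empty before.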

How to (un)select specific node types?

Normally, you do not need to care about this question, because you just request cores, memory, accelerators, or licences, and CCS takes care of the mapping. However, for benchmarking purposes it may be useful to (un)select specific node types. For this purpose, we provide resources of type Boolean. ccsinfo -a shows all available resources:

Name            Type, Amount               Default    Purpose
                Flags Used/Online/Max
=============================================================
ncpus           U,C   4775/8784/9264       1          Number of cores
nodes           U,C   447/545/619          1          Number of nodes
mem             S,C   23.77t/40.63t/42.91t 3.93g      Physical memory
vmem            S,C   25.18t/52.04t/55.09t 4.93g      Virtual memory
cput            T,    -                    N/A        CPU time
walltime        T,J   -                    N/A        Walltime
hostname        A,    -                    N/A        Hostname
arch            A,    -                    N/A        Host architecture
mpiprocs        U,    -                    N/A        Number of MPI processes per chunk
ompthreads      U,    -                    N/A        Number of threads per chunk
amd             B,    -                    N/A        node with AMD CPU
gpunode         B,    -                    N/A        GPU node
gpus            U,C   13/48/59             0          GPU
gtx1080         B,    -                    N/A        NVIDIA GTX1080Ti GPU
ibswitch        V,    -                    N/A        Infiniband-switch number
mdce            U,CJ  0/256/256            false      Matlab Distributed Computing Environment licenses
norm            B,    -                    N/A        62GiByte compute node
rack            U,    -                    N/A        Rack number
rtx2080         B,    -                    N/A        NVIDIA RTX2080Ti GPU
smp             B,    -                    N/A        SMP node
tesla           B,    -                    N/A        NVIDIA Tesla K20xm GPU
vtune           B,    -                    N/A        node equipped with VTune HW-performance counter
wash            B,    -                    N/A        Washington node

Examples:

  • run a job only on the 62GiByte compute nodes
    • --res=rset=2:ncpus=5:norm=t
  • run a job only on the washington nodes
    • --res=rset=2:ncpus=5:wash=t
  • Requesting a GPU of any type
    • --res=rset=ncpus=8:mem=40g:gpus=1
  • Requesting a GPU of any type but no Tesla
    • --res=rset=ncpus=8:mem=40g:gpus=1:tesla=f
  • Excluding GPU nodes:
    • --res=rset=2:ncpus=5:gpunode=f
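When such requests are generated in scripts, the option string can be composed from its parts. This is a minimal sketch: build_res is a hypothetical helper that only joins its arguments with ':' and does not check whether the resource names exist.

```shell
# Sketch: compose a --res option from a chunk count and resource settings.
# build_res is a hypothetical helper; it only joins its arguments with ':'.
build_res() {
    chunks=$1
    shift
    spec="rset=$chunks"
    for r in "$@"; do
        spec="$spec:$r"
    done
    echo "--res=$spec"
}

build_res 2 ncpus=5 norm=t
```

The last line prints --res=rset=2:ncpus=5:norm=t, which can then be passed to ccsalloc.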

How to allocate / avoid AMD CPUs

Since AMD CPUs are compatible with Intel CPUs, CCS does not distinguish between Intel and AMD CPUs. However, if you explicitly want to use AMD CPUs, use:

 --res=rset=2:amd=t:ncpus=5

If you explicitly do NOT want to use AMD CPUs, use:

 --res=rset=2:amd=f:ncpus=5

How to allocate GPUs

For GPUs, we provide the consumable resource gpus. This prevents more than one job from being scheduled to a card at the same time. The Boolean resources tesla, gtx1080, and rtx2080 may be used to (un)select specific GPU types.

Hence, to request 2 chunks each with 8 cpus and one Tesla card use:

 --res=rset=2:ncpus=8:gpus=1:tesla=true

For offload jobs CCS sets the environment variable:

 CUDA_VISIBLE_DEVICES=0

For jobs mapped to a GPU node but not requesting a GPU, CCS sets the environment variable:

 CUDA_VISIBLE_DEVICES=1024

which is an invalid value.

For jobs mapped to a non-GPU node, CCS does not set the environment variable CUDA_VISIBLE_DEVICES.
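A job script can use these three cases to check early whether it really got a GPU. The following is a sketch; gpu_status is a hypothetical helper name, and the messages are illustrative.

```shell
# Sketch: classify the GPU situation of a job from CUDA_VISIBLE_DEVICES.
gpu_status() {
    if [ -z "${CUDA_VISIBLE_DEVICES+x}" ]; then
        echo "non-GPU node"                 # variable not set by CCS
    elif [ "$CUDA_VISIBLE_DEVICES" = "1024" ]; then
        echo "GPU node, no GPU requested"   # invalid device id
    else
        echo "GPU assigned: $CUDA_VISIBLE_DEVICES"
    fi
}
```

Calling gpu_status at the start of a job script makes a missing gpus request visible before the application fails to find a device.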

Typically sufficient vmem has to be allocated.

  • Tesla: --res=rset=1:ncpus=1:tesla=t:gpus=1:mem=8g:vmem=85g
  • GTX1080: --res=rset=1:ncpus=1:gtx1080=t:gpus=1:mem=4g
  • RTX2080: --res=rset=1:ncpus=1:rtx2080=t:gpus=1:mem=4g

How to keep Java from using too many CPU cores

Even if the user sets the number of threads within a Java program, Java uses additional threads for garbage collection (GC). Thus, a job running a Java program can exceed its allowed CPU usage on Oculus if the number of compute threads is identical to the value of ncpus in the job resource specification. The number of threads for garbage collection can be controlled with the command line arguments

 -XX:ParallelGCThreads

and

 -XX:ConcGCThreads

For example,

 java -XX:ParallelGCThreads=2 -XX:ConcGCThreads=1 HelloWorld

uses 2 threads for the parallel GC and one thread for the concurrent garbage collector.

If you want to be on the safe side, the number of (ParallelGCThreads)+(ConcGCThreads)+(number of compute threads) should equal the number of requested cpu cores (ncpus).

It is also possible to use the argument

 -XX:+UseSerialGC 

to use a serial garbage collector that uses exactly one thread. With UseSerialGC the safe choice would be (number of compute threads)+1=ncpus.
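The rule above can be turned into a small helper that derives the GC arguments from ncpus and the number of compute threads. This is a sketch: java_gc_flags is a hypothetical name, and the way the spare cores are split (one concurrent GC thread, the rest for the parallel GC) is an assumption, not a recommendation from this page.

```shell
# Sketch: choose GC options so that GC threads + compute threads = ncpus.
# Arguments: ncpus, number of compute threads.
java_gc_flags() {
    ncpus=$1
    compute=$2
    spare=$(( ncpus - compute ))
    if [ "$spare" -lt 1 ]; then
        echo "error: no core left for garbage collection" >&2
        return 1
    elif [ "$spare" -eq 1 ]; then
        # exactly one spare core: serial GC uses exactly one thread
        echo "-XX:+UseSerialGC"
    else
        # assumed split: one concurrent GC thread, the rest parallel
        echo "-XX:ParallelGCThreads=$(( spare - 1 )) -XX:ConcGCThreads=1"
    fi
}
```

For example, java $(java_gc_flags 8 5) HelloWorld runs with 2 parallel and 1 concurrent GC thread for a job with ncpus=8 and 5 compute threads.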