HPC Vega Introductory Workshop for SMASH Fellows
Agenda
- First login
- Basic Linux commands
- Data management
- Software stack and environment
- Basic Slurm commands
- Submit the first job
- Job management and fetching the results
- Open OnDemand
- Slurm reports
First login on HPC Vega
Once you have an account, log into the cluster using two-factor authentication (2FA) and secure shell (SSH).
HPC Vega has 8 login nodes, which are used for:
- Interactive work.
- Submitting jobs via Slurm or NorduGrid ARC.
- Preparation of the environment.
- Compilation of code.
- Transferring data from the cluster to your workstation (scp, sftp, dCache, XRootD, ...); see the scp example at the end of this section.
- ...
You can choose one of the following login nodes:
- Log in on one of the eight login nodes.
ssh <username>@login.vega.izum.si
- Log in on one of the four login CPU nodes.
ssh <username>@logincpu.vega.izum.si
- Log in on one of the four login GPU nodes.
ssh <username>@logingpu.vega.izum.si
Logout
$ exit
Verbose mode: ssh -vvv
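For transferring data from the cluster (as mentioned above), a minimal scp example run from your workstation, with an illustrative file name:
scp <username>@login.vega.izum.si:/ceph/hpc/home/<username>/results.tar.gz .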
Basic Linux commands
Working with the bash shell (a short example session follows the list):
- ls - list directory content
- pwd - print working directory
- cp - copy file(s)
- mv - move file(s)
- rm - remove file(s)
- mkdir - create directory
- rmdir - remove directory
- touch - create blank file
- cat - print the file content
- grep - search for a given string within file(s)
- watch - monitor changes in command output
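A short example session combining several of these commands (the file and directory names are illustrative):
pwd
mkdir smash-test
touch smash-test/notes.txt
echo "HPC Vega" > smash-test/notes.txt
cat smash-test/notes.txt
grep Vega smash-test/notes.txt
rm smash-test/notes.txt
rmdir smash-test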
Data management
Backup
- There is no backup on Vega.
- /ceph/hpc/home/$USER is storage for the project duration, with a quota of 100 GB.
- All other data is treated as temporary (unless otherwise agreed with support@sling.si).
Directories
- Project directories on Ceph (fs), with a quota per project: /ceph/hpc/data/
- Scratch directories on Ceph (fs), with a quota of 20 GB: /ceph/hpc/scratch/user/
- Project directories on Lustre (fs), with a quota per project: /exa5/data/
- Scratch directories on Lustre (fs), with a quota of 20 GB: /exa5/scratch/user/
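To check how much space your data occupies, for example in your home directory, a simple check (which can take a while on large directory trees) is:
du -sh /ceph/hpc/home/$USER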
Software stack and environment
Modules
- Software is available as loadable Modules.
- Modules allow combining different versions of the available software.
- Users can request additional modules via support@sling.si.
Currently available modules on the cluster:
module avail
Is PyTorch available?
module avail PyTorch
Module information:
module show Python
module show PyTorch/2.0.1-foss-2022a
Load specific module:
module load Python/3.11.3-GCCcore-12.3.0
Currently loaded modules:
module list
Check Python version:
python3 -V
Remove currently loaded modules:
module purge
Check Python version (system):
python3 -V
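A quick way to see what loading a module changes (a minimal sketch; after loading, which should point to the software stack rather than to the system /usr/bin):
module purge
module load Python/3.11.3-GCCcore-12.3.0
which python3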
Containers
Singularity is a platform designed to create and run containers within a supercomputer environment. Containers have access to a common operating system, file system, and software installed on nodes.
Advantages:
- High level of portability.
- Repeatability of containers (definition files).
- Isolation.
- Security (namespaces, cryptographic signatures).
- Singularity Image Format (SIF).
- ...
Preparing your own container is possible in several ways:
- Take a pre-prepared container and adjust it to your needs (see the pull example below).
- Create your own container with the help of a definition file.
Path to images:
ls -al /ceph/hpc/software/containers/singularity/images/
Path to definition files:
ls -al /ceph/hpc/software/containers/singularity/def/
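As an alternative to building from scratch, a pre-built image can be pulled from a registry directly into a SIF file (the image name is illustrative); this creates ubuntu_latest.sif in the current directory:
singularity pull docker://ubuntu:latest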
The read-only container (compressed squashfs) is in SIF (Singularity Image Format); it can be made writable with the --writable switch or converted to a sandbox directory with the --sandbox switch. Fakeroot is functionality that allows non-privileged users to obtain the appropriate "root" rights within containers via the --fakeroot switch. Host filesystem (FS) permissions are mapped appropriately within the container.
Build your own container from Docker Hub:
singularity build --sandbox --fix-perms smash-container/ docker://ubuntu:latest
Open a shell within the container:
singularity shell --writable --fakeroot smash-container/
Execute a command within the container
Exec without the --fakeroot switch:
singularity exec smash-container/ whoami
Exec with the --fakeroot switch:
singularity exec --fakeroot smash-container/ whoami
Build a container from a definition file
Example definition file example.def:
Bootstrap: docker
From: ubuntu:latest
%environment
export LC_ALL=C
%runscript
echo "This is what happens when you run the container.."
%post
apt-get -y update
# apt-get -y install <package-name>
apt-get clean
%labels
Maintainer SMASH
Version 1.0
%help
Description of help section.
Build container from definition file:
singularity build --fix-perms --fakeroot smash-container.sif example.def
Check OS version and release within container:
singularity exec smash-container.sif cat /etc/os-release
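To execute the %runscript section from the definition file above, run the container; it should print the echo line defined in %runscript:
singularity run smash-container.sif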
Slurm - Simple Linux Utility for Resource Management
- Slurm is a free and open-source job scheduler for Linux (and Unix-like systems) that manages computing resources and schedules tasks on them.
- Scheduler – finds appropriate resources to run computational tasks; it allocates nodes, CPUs, memory, and other computing resources.
- Partition – a set of compute nodes with a common feature.
- Node – a physical server that can handle computing tasks and run batch jobs.
- Job – the base unit of computing in Slurm.
- Task – a single process; a multi-process program consists of several tasks (see the example after this list).
- Priority – the order of pending jobs in the queue; jobs with a higher score run before those with a lower score. The priority calculation is based primarily on the historical usage of cluster resources by an account.
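As an illustration of the relationship between nodes and tasks (a minimal sketch on the dev partition, consistent with the examples later in this document), the following runs hostname as 4 tasks spread over 2 nodes, so each node's name is printed twice:
srun --nodes 2 --ntasks 4 --ntasks-per-node 2 --mem 100M -t 00:02:00 -p dev hostname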
Slurm basic commands
- srun - run a command on allocated compute nodes.
- salloc - create an allocation on compute nodes.
- sbatch - submit a job script.
- scancel - cancel a submitted job.
- sinfo - show nodes and partitions.
- squeue - show jobs and their information.
- scontrol - show detailed cluster information (users have view-only access).
- sstat - show the status of running jobs.
- sprio - view the components that affect job priority.
- sreport - display information from the accounting database on jobs, users, and clusters.
- sacct - display accounting data for all jobs.
For help, use man <command> or add the --help switch to display the help section for each command.
Let's try it
Log in:
ssh <username>@logingpu.vega.izum.si
Print the working directory:
pwd
Check processor information:
lscpu
Check node and partition summary information:
sinfo -s
Check the queue for the cpu partition. Valid options are: cpu, gpu, dev, or largemem:
squeue -p cpu
Check partitions:
scontrol show partition
Check reservations:
scontrol show res
Check default account:
sacctmgr show user $USER format=user,defaultaccount%30
Submit the first job on HPC Vega
Check queue per user with extra switches:
squeue --user $USER --long
Monitor the output:
watch -n 5 squeue --user $USER --long
srun
Submit a job via srun:
srun --job-name SMASH --nodes 1 --ntasks 2 --ntasks-per-node 2 --mem 100M -o %j.out -e %j.err -t 00:05:00 -p dev bash -c "hostname; sleep 30"
SBATCH
Create and open an empty file for the sbatch script.
touch smash.sh; vi smash.sh
Copy the following content into smash.sh:
#!/bin/bash
#SBATCH --job-name=smash
###SBATCH --account=<your-account> # Specify for multiple projects!!
#SBATCH --partition=dev # Partition to run the job (dev,cpu,gpu,largemem)
#SBATCH --nodes=2 # Number of nodes
#SBATCH --ntasks=2 # Number of tasks per job (equal to MPI processes)
#SBATCH --ntasks-per-node=1 # Number of tasks per node.
#SBATCH --mem=100M # Required amount of memory M,GB,..
#SBATCH --output=%j.out # Standard output file.
#SBATCH --error=%j.err # Standard error file.
#SBATCH --time=00:10:00 # Requested walltime limit for the job.
###SBATCH --exclude=cn0001 # Exclude specified nodes from job allocation (or -x cn0001)
###SBATCH --nodelist=gn01 # Request specified nodes for job allocation (or -w gn01)
###SBATCH --mail-user=<email> # Email address for notifications.
# Job information
echo "WORKING DIRECTORY: $PWD"
echo "JOB ID : $SLURM_JOB_ID"
echo "JOB NAME : $SLURM_JOB_NAME"
echo "NUMBER OF NODES : $SLURM_NNODES"
echo "NODELIST : $SLURM_NODELIST"
echo "NUMBER OF TASKSK : $SLURM_NTASKS"
echo "REQUESTED MEMORY : $SLURM_MEM_PER_NODE"
srun hostname
sleep 30
Submit the job via sbatch:
sbatch smash.sh
Job management and fetching the results
Check the job information:
scontrol show jobid <jobid>
Check the job results:
cat <jobid>.out
Check if there are any errors within the job:
cat <jobid>.err
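To confirm whether the job completed successfully, its state and exit code can also be queried with sacct (standard Slurm accounting fields):
sacct -j <jobid> --format=JobID,JobName,State,ExitCode,Elapsed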
salloc
Request resources via salloc on the dev or gpu partition:
salloc -n 1 -N 1 --gres=gpu:1 -p dev -t 00:30:00
Connect to the allocated node (login000[5-8], gn[01-60]):
ssh <hostname>
NVIDIA System Management Interface program
nvidia-smi
Is CUDA available within the container?
export SINGULARITYENV_CUDA_VISIBLE_DEVICES=0
singularity exec --nv /ceph/hpc/software/containers/singularity/images/pytorch-23.12-py3.sif python3 -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.device_count())"
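Alternatively, commands can be launched on the allocated node directly from the salloc session with srun, without ssh-ing to it, for example:
srun nvidia-smi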
Open OnDemand
Open OnDemand is a user-friendly web portal that allows users to initiate, customize, and monitor their jobs, as well as transfer files to and from the cluster directly through a web browser.
Available on: https://ondemand.vega.izum.si/
Slurm reports
There are multiple ways to monitor resource consumption. It is crucial for users to monitor their resource usage closely, as we will soon implement limits on total usage for all users on Vega. We are already tracking resource utilization closely to ensure fair and efficient use of the cluster.
Job information for running or pending jobs:
scontrol show job <job id>
Job information for past jobs:
sacct -j <job id> --format=JobID,JobName,Partition,AllocTRES%35,Elapsed
Resource usage report:
/ceph/hpc/bin/accountingreport.sh
sreport job SizesByAccount -T billing User=<user> -t Hours Start=2021-07-01 End=`date -d tomorrow +%Y-%m-%d` account=<slurm account> grouping=9999999 Partition=cpu format=Account%25
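To list your own jobs since a given date together with their allocated resources (a minimal sketch; adjust the start date as needed):
sacct -u $USER --starttime=2024-01-01 --format=JobID,JobName,Partition,AllocTRES%35,Elapsed,State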