HPC Vega Introductory Workshop for SMASH Fellows
Agenda
- First login
- Basic Linux commands
- Data management
- Software stack and environment
- Basic Slurm commands
- Submit the first job
- Job management and fetching the results
- Open OnDemand
- Slurm reports
First login on HPC Vega
Once you have an account, log into the cluster using two-factor authentication (2FA) and secure shell (SSH).
HPC Vega has 8 login nodes, which are used for:
- Interactive work.
- Submitting jobs via Slurm or NorduGrid ARC.
- Preparation of the environment.
- Compilation of code.
- Transferring data from the cluster to your workstation (scp, sftp, dCache, XRootD, ...); see the scp example at the end of this section.
- ...
You can choose one of the following login nodes:
- Log in on one of the eight login nodes.
ssh <username>@login.vega.izum.si
- Log in on one of the four login CPU nodes.
ssh <username>@logincpu.vega.izum.si
- Log in on one of the four login GPU nodes.
ssh <username>@logingpu.vega.izum.si
Logout
$ exit
Verbose mode: ssh -vvv
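For transferring data from the cluster (as mentioned above), a minimal scp example run from your workstation, with an illustrative file name:
scp <username>@login.vega.izum.si:/ceph/hpc/home/<username>/results.tar.gz .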
Basic Linux commands
Working with the bash shell (a short example session follows the list):
- ls - list directory content
- pwd - print working directory
- cp - copy file(s)
- mv - move file(s)
- rm - remove file(s)
- mkdir - create directory
- rmdir - remove directory
- touch - create blank file
- cat - print the file content
- grep - search for a given string within file(s)
- watch - monitor changes in command output
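A short example session combining several of these commands (the file and directory names are illustrative):
pwd
mkdir smash-test
touch smash-test/notes.txt
echo "HPC Vega" > smash-test/notes.txt
cat smash-test/notes.txt
grep Vega smash-test/notes.txt
rm smash-test/notes.txt
rmdir smash-test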
Data management
Backup
- There is no backup on Vega.
- /ceph/hpc/home/$USER is storage for the project duration, with a quota of 100 GB.
- All other data is treated as temporary (unless otherwise agreed with support@sling.si).
Directories
- Project directories on Ceph (fs), with a quota per project: /ceph/hpc/data/
- Scratch directories on Ceph (fs), with a quota of 20 GB: /ceph/hpc/scratch/user/
- Project directories on Lustre (fs), with a quota per project: /exa5/data/
- Scratch directories on Lustre (fs), with a quota of 20 GB: /exa5/scratch/user/
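To check how much space your data occupies, for example in your home directory, a simple check (which can take a while on large directory trees) is:
du -sh /ceph/hpc/home/$USER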
Software stack and environment
Modules
- Software is available as loadable Modules.
- Modules allow combining different versions of the available software.
- Users can request additional modules via support@sling.si.
Currently available modules on the cluster:
module avail
Is PyTorch available?
module avail PyTorch
Module information:
module show Python
module show PyTorch/2.0.1-foss-2022a
Load specific module:
module load Python/3.11.3-GCCcore-12.3.0
Currently loaded modules:
module list
Check Python version:
python3 -V
Remove currently loaded modules:
module purge
Check Python version (system):
python3 -V
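A quick way to see what loading a module changes (a minimal sketch; after loading, which should point to the software stack rather than to the system /usr/bin):
module purge
module load Python/3.11.3-GCCcore-12.3.0
which python3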
Containers
Singularity is a platform designed to create and run containers within a supercomputer environment. Containers have access to a common operating system, file system, and software installed on nodes.
Advantages:
- High level of portability.
- Repeatability of containers (definition files).
- Isolation.
- Security (namespaces, cryptographic signatures).
- Singularity Image Format (SIF).
- ...
Preparing your own container is possible in several ways:
- Take a pre-prepared container and adjust it to your needs (see the pull example below).
- Create your own container with the help of a definition file.
Path to images:
ls -al /ceph/hpc/software/containers/singularity/images/
Path to definition files:
ls -al /ceph/hpc/software/containers/singularity/def/
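As an alternative to building from scratch, a pre-built image can be pulled from a registry directly into a SIF file (the image name is illustrative); this creates ubuntu_latest.sif in the current directory:
singularity pull docker://ubuntu:latest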
The read-only container (compressed squashfs) is in SIF (Singularity Image Format); it can be made writable with the --writable switch or converted to a sandbox directory with the --sandbox switch. Fakeroot is functionality that allows non-privileged users to obtain the appropriate "root" rights within containers via the --fakeroot switch. Host filesystem (FS) permissions are mapped appropriately within the container.
Build your own container from Docker Hub:
singularity build --sandbox --fix-perms smash-container/ docker://ubuntu:latest
Open a shell within the container:
singularity shell --writable --fakeroot smash-container/
Execute a command within the container
Exec without the --fakeroot switch:
singularity exec smash-container/ whoami
Exec with the --fakeroot switch:
singularity exec --fakeroot smash-container/ whoami
Build a container from a definition file
Example definition file example.def:
Bootstrap: docker
From: ubuntu:latest
%environment
export LC_ALL=C
%runscript
echo "This is what happens when you run the container.."
%post
apt-get -y update
# apt-get -y install <package-name>
apt-get clean
%labels
Maintainer SMASH
Version 1.0
%help
Description of help section.
Build container from definition file:
singularity build --fix-perms --fakeroot smash-container.sif example.def
Check OS version and release within container:
singularity exec smash-container.sif cat /etc/os-release
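To execute the %runscript section from the definition file above, run the container; it should print the echo line defined in %runscript:
singularity run smash-container.sif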
Slurm - Simple Linux Utility for Resource Management
- Slurm is a free and open-source job scheduler for Linux (and Unix-like systems) that manages computing resources and schedules tasks on them.
- Scheduler – finds appropriate resources to run computational tasks; it allocates nodes, CPUs, memory, and other computing resources.
- Partition – a set of compute nodes with a common feature.
- Node – a physical server that can handle computing tasks and run batch jobs.
- Job – the base unit of computing in Slurm.
- Task – a single process; a multi-process program consists of several tasks (see the example after this list).
- Priority – the order of pending jobs in the queue; jobs with a higher score run before those with a lower score. The priority calculation is based primarily on the historical usage of cluster resources by an account.
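As an illustration of the relationship between nodes and tasks (a minimal sketch on the dev partition, consistent with the examples later in this document), the following runs hostname as 4 tasks spread over 2 nodes, so each node's name is printed twice:
srun --nodes 2 --ntasks 4 --ntasks-per-node 2 --mem 100M -t 00:02:00 -p dev hostname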
Slurm basic commands
- srun - run a command on allocated compute nodes.
- salloc - create an allocation on compute nodes.
- sbatch - submit a job script.
- scancel - cancel a submitted job.
- sinfo - show nodes and partitions.
- squeue - show jobs and their information.
- scontrol - show detailed cluster information (users have view-only access).
- sstat - show the status of running jobs.
- sprio - view the components that affect job priority.
- sreport - display information from the accounting database on jobs, users, and clusters.
- sacct - display accounting data for all jobs.
For help, use man <command> or add the --help switch to display the help section for each command.
Let's try it
Log in:
ssh <username>@logingpu.vega.izum.si
Print the working directory:
pwd
Check processor information:
lscpu
Check node and partition summary information:
sinfo -s
Check the queue for the cpu partition. Valid options are: cpu, gpu, dev, or largemem:
squeue -p cpu
Check partitions:
scontrol show partition
Check reservations:
scontrol show res
Check default account:
sacctmgr show user $USER format=user,defaultaccount%30
Submit the first job on HPC Vega
Check queue per user with extra switches:
squeue --user $USER --long
Monitor the output:
watch -n 5 squeue --user $USER --long
srun
Submit a job via srun:
srun --job-name SMASH --nodes 1 --ntasks 2 --ntasks-per-node 2 --mem 100M -o %j.out -e %j.err -t 00:05:00 -p dev bash -c "hostname; sleep 30"
SBATCH
Create and open an empty file for the sbatch script.
touch smash.sh; vi smash.sh
Copy the following content into smash.sh:
#!/bin/bash
#SBATCH --job-name=smash
###SBATCH --account=<your-account> # Specify for multiple projects!!
#SBATCH --partition=dev # Partition to run the job (dev,cpu,gpu,largemem)
#SBATCH --nodes=2 # Number of nodes
#SBATCH --ntasks=2 # Number of tasks per job (equal to MPI processes)
#SBATCH --ntasks-per-node=1 # Number of tasks per node.
#SBATCH --mem=100M # Required amount of memory M,GB,..
#SBATCH --output=%j.out # Standard output file.
#SBATCH --error=%j.err # Standard error file.
#SBATCH --time=00:10:00 # Requested walltime limit for the job.
###SBATCH --exclude=cn0001 # Exclude specified nodes from job allocation (or -x cn0001)
###SBATCH --nodelist=gn01 # Request specified nodes for job allocation (or -w gn01)
###SBATCH --mail-user=<email> # Email address for notifications.
# Job information
echo "WORKING DIRECTORY: $PWD"
echo "JOB ID : $SLURM_JOB_ID"
echo "JOB NAME : $SLURM_JOB_NAME"
echo "NUMBER OF NODES : $SLURM_NNODES"
echo "NODELIST : $SLURM_NODELIST"
echo "NUMBER OF TASKSK : $SLURM_NTASKS"
echo "REQUESTED MEMORY : $SLURM_MEM_PER_NODE"
srun hostname
sleep 30
Submit the job via sbatch:
sbatch smash.sh
Job management and fetching the results
Check the job information:
scontrol show jobid <jobid>
Check the job results:
cat <jobid>.out
Check if there are any errors within the job:
cat <jobid>.err
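To confirm whether the job completed successfully, its state and exit code can also be queried with sacct (standard Slurm accounting fields):
sacct -j <jobid> --format=JobID,JobName,State,ExitCode,Elapsed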
salloc
Request resources via salloc on the dev or gpu partition:
salloc -n 1 -N 1 --gres=gpu:1 -p dev -t 00:30:00
Connect to the allocated node (login000[5-8], gn[01-60]):
ssh <hostname>
NVIDIA System Management Interface program
nvidia-smi
Is CUDA available within the container?
export SINGULARITYENV_CUDA_VISIBLE_DEVICES=0
singularity exec --nv /ceph/hpc/software/containers/singularity/images/pytorch-23.12-py3.sif python3 -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.device_count())"
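Alternatively, commands can be launched on the allocated node directly from the salloc session with srun, without ssh-ing to it, for example:
srun nvidia-smi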
Open OnDemand
Open OnDemand is a user-friendly web portal that allows users to initiate, customize, and monitor their jobs, as well as transfer files to and from the cluster directly through a web browser.
Available on: https://ondemand.vega.izum.si/
Slurm reports
There are multiple ways to monitor resource consumption. It is crucial for users to monitor their resource usage closely, as we will soon implement limits on total usage for all users on Vega. We are already tracking resource utilization closely to ensure fair and efficient use of the cluster.
Job information for running or pending jobs:
scontrol show job <job id>
Job information for past jobs:
sacct -j <job id> --format=JobID,JobName,Partition,AllocTRES%35,Elapsed
Resource usage report:
/ceph/hpc/bin/accountingreport.sh
sreport job SizesByAccount -T billing User=<user> -t Hours Start=2021-07-01 End=`date -d tomorrow +%Y-%m-%d` account=<slurm account> grouping=9999999 Partition=cpu format=Account%25
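To list your own jobs since a given date together with their allocated resources (a minimal sketch; adjust the start date as needed):
sacct -u $USER --starttime=2024-01-01 --format=JobID,JobName,Partition,AllocTRES%35,Elapsed,State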