DMTCP

DMTCP (Distributed MultiThreaded Checkpointing) is a checkpointing tool: it saves the state of a running application and restores it later, without requiring changes to the source code or the use of special libraries. This is particularly useful for long-running computations that must be stopped and resumed later.

DMTCP can be used when an application needs to run longer than the time limit of 48 hours. The state of the application is saved in a checkpoint before the limit is reached, and a later job restarts the application with DMTCP from the most recently saved state.

It supports applications that use various MPI implementations or OpenMP, as well as many programming languages such as MATLAB, Python, Perl and others. With TightVNC, DMTCP can also checkpoint graphical applications running on an X server.

--interval flag

The --interval flag specifies the time in seconds between automatic checkpoints. This is particularly useful for long-running processes whose state should be saved periodically, so that only the work done since the last checkpoint is lost in case of an error or interruption.

Syntax

dmtcp_launch --interval <time_interval> ./my_app

Parameters

  • <time_interval>: The time interval between automatic checkpoints, in seconds. For example, --interval 3600 creates a checkpoint every hour.
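
Because the interval is given in seconds, longer periods are easier to read when they are computed rather than typed by hand. The following is a minimal sketch: the helper name to_seconds is hypothetical and not part of DMTCP, and the final line only prints the resulting command instead of running it.

```shell
# Hypothetical helper (not part of DMTCP): convert hours and
# optional minutes into the number of seconds expected by --interval.
to_seconds() {
    hours=$1
    minutes=${2:-0}
    echo $(( hours * 3600 + minutes * 60 ))
}

# Checkpoint every 2 hours and 30 minutes.
interval=$(to_seconds 2 30)

# Print the command that would be run; remove the echo to run it.
echo "dmtcp_launch --interval $interval ./my_app"
```

Here 2 hours and 30 minutes gives an interval of 9000 seconds.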

Usage

On HPC Vega, DMTCP is loaded with the command:

ml dmtcp

Sbatch script

Example of using DMTCP in a sbatch script:

#! /bin/sh

#SBATCH --partition=<partition_name>     # change to the appropriate partition name or remove
#SBATCH --time=<reservation_time>        # enter the appropriate reservation time

#SBATCH --nodes=<number_of_nodes>        # number of nodes
#SBATCH --ntasks-per-node=<tasks_per_node> # number of tasks per node
#SBATCH --mem=<memory_allocation>        # memory resources
#SBATCH --job-name="<job_name>"          # change to the name of your job
#SBATCH --output=<output_file>.out       # output to this file
#SBATCH --error=<error_file>.err         # error output to this file

ml dmtcp

# Time interval is 4 seconds
dmtcp_launch --new-coordinator --interval 4 ./my_app

Starting from a saved state

dmtcp_restart checkpoint_name*.dmtcp
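
When the same job script is resubmitted after the time limit, it can decide for itself whether to start the application from scratch or resume it. The sketch below assumes that the checkpoint files (*.dmtcp) are written to the directory from which the job was started. The helper name has_checkpoint and the 3600-second interval are illustrative choices, not part of DMTCP or of the Vega setup.

```shell
# Hypothetical helper: succeed if at least one checkpoint file
# (*.dmtcp) exists in the given directory (default: current one).
has_checkpoint() {
    # ls returns a non-zero status when the pattern matches nothing.
    ls "${1:-.}"/*.dmtcp >/dev/null 2>&1
}

if has_checkpoint .; then
    mode=restart
else
    mode=launch
fi
echo "mode: $mode"

# In a real job script, replace the echo with the matching command:
#   restart: dmtcp_restart ./*.dmtcp
#   launch:  dmtcp_launch --new-coordinator --interval 3600 ./my_app
```

Checking for existing checkpoint files keeps a single script usable for both the first submission and every later one.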