Checkpointing and Requeueing Jobs

Have a really long job that you want to run? Here’s how you do it:

  1. Submit the job to the queue
  2. Run for almost the full max wall time
  3. Send a termination signal to your code using timeout
  4. Your code saves a checkpoint
  5. Requeue the job with scontrol
  6. Repeat steps 2-5 until your job finishes

#!/bin/bash
#SBATCH --open-mode=append # Needed to append, instead of overwriting the log
#SBATCH --time 2-0:0:0 # Run for 2 days (or --time 8:0:0 for GPUs)
#... other slurm flags

# Possible checkpoint file name to look for
# ie. `awesome_script --resume "$CHECKPOINT" ...other args...`
CHECKPOINT="${SLURM_JOB_ID}.ckpt"

# Run your code, timing out 30 minutes before the wall time limit
timeout 47.5h awesome_script ...

# Requeue if your code timed out
if [[ $? == 124 ]]; then
    scontrol requeue $SLURM_JOB_ID
fi
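
If your program has to be told to resume explicitly, one way to wire up the checkpoint in the script above is to pass it only when a previous run left the file behind. This is just a sketch; awesome_script and its --resume flag are the placeholders from the comment above.

# Resume from the checkpoint if an earlier run of this job saved one
if [[ -f "$CHECKPOINT" ]]; then
    timeout 47.5h awesome_script --resume "$CHECKPOINT" ...
else
    timeout 47.5h awesome_script ...
fi

Either branch leaves timeout's exit code in $?, so the requeue check still works if this replaces the single timeout line.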

Catching Signals

Instead of using timeout, you can have SLURM send your script a signal (blog, man) just before the job ends. Here, I’m sending USR1 (User Signal 1) 300 seconds before the wall time limit; see the SLURM documentation for more.

#!/bin/bash
#SBATCH --signal=USR1@300
#... other slurm flags

# 124 below is timeout's exit code; swap it for whatever exit code your
# program returns when it stops early after catching the signal
awesome_script
if [[ $? == 124 ]]; then
    scontrol requeue $SLURM_JOB_ID
fi
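
If your program can’t catch signals itself (see the table below), a fallback is to trap USR1 in the batch script and requeue from there. Here’s a rough sketch of that pattern; awesome_script is still a stand-in for your code, and sbatch’s --signal needs the B: prefix for the signal to reach the batch shell rather than the job steps.

#!/bin/bash
#SBATCH --signal=B:USR1@300 # B: delivers the signal to the batch shell itself
#... other slurm flags

# When USR1 arrives, forward it to the program so it can checkpoint,
# wait for it to finish writing, then requeue the job
requeue_on_usr1() {
    kill -USR1 "$PID"
    wait "$PID"
    scontrol requeue $SLURM_JOB_ID
    exit 0
}
trap requeue_on_usr1 USR1

# Run in the background so the shell is free to receive the signal
awesome_script &
PID=$!
wait "$PID"

If your program doesn’t handle USR1 at all, the default action is to terminate it, so this buys you a clean requeue but not a checkpoint.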

Typically, a non-zero exit code in Linux means “something went wrong”. Because we don’t want to keep requeueing a failed job indefinitely, we need to be able to distinguish between “something went wrong” and “I need more time”.

Here we’re checking whether the exit code is 124 (timeout uses 124 to indicate the command timed out), but any non-zero exit code could work. Check your code’s docs to see what’s normal, what’s an error, and whether you should send it a different signal.
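
As a sketch of that distinction using the timeout version from the first script, you can requeue only on 124 and let any other failure end the cycle:

timeout 47.5h awesome_script ...
EXIT_CODE=$?

if [[ $EXIT_CODE -eq 124 ]]; then
    # Out of time, not broken: run again from the checkpoint
    scontrol requeue $SLURM_JOB_ID
elif [[ $EXIT_CODE -ne 0 ]]; then
    # A real failure: pass the error on instead of requeueing forever
    exit $EXIT_CODE
fi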


Tool-Specific Support

Program             Restart   Signal Catching/Handling
PyTorch Lightning   yes       yes
Quantum Espresso    yes       experimental
LAMMPS              yes       no, see Python
GPAW                yes       no, see Python
Python              ?         yes
Julia               ?         no
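
If you’re unsure how your program reacts to USR1, you can check locally before spending wall time on it; your_program below is just a placeholder.

# Start the program, give it a moment, then poke it with USR1
your_program &
PID=$!
sleep 60
kill -USR1 "$PID" # does it checkpoint, crash, or ignore the signal?
wait "$PID"
echo "exit code: $?"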