Checkpointing and Requeuing Jobs
Have a really long job that you want to run? Here’s how you do it:
1. Submit the job to the queue
2. Run for almost the full max wall time
3. Send a kill signal to your code using `timeout`
4. Your code saves a checkpoint
5. Requeue the job with `scontrol`
6. Repeat 2-5 until your job finishes
#!/bin/bash
#SBATCH --open-mode=append # Needed to append, instead of overwriting the log
#SBATCH --time 2-0:0:0 # Run for 2 days (or --time 8:0:0 for GPUs)
#... other slurm flags
# Possible checkpoint file name to look for
# i.e. `awesome_script --resume "$CHECKPOINT" ...other args...`
CHECKPOINT="${SLURM_JOB_ID}.ckpt"
# Run your code, timing out 30 minutes before the wall time
timeout 47.5h awesome_script ...
# Requeue if your code timed out
if [[ $? == 124 ]]; then
    scontrol requeue "$SLURM_JOB_ID"
fi
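What does the checkpointing side look like? By default `timeout` sends SIGTERM, so your code only needs to catch that signal, write a checkpoint, and exit; the requeue decision above is driven entirely by `timeout`'s own exit code (124). Below is a minimal, hypothetical sketch of such a worker: the `--resume` flag and `do_one_step` are stand-ins for whatever your code actually provides.

#!/bin/bash
# Hypothetical resumable worker, called as `awesome_script --resume "$CHECKPOINT"`
CHECKPOINT="$2"
step=0
# Resume from the last saved step if a checkpoint already exists
[[ -f "$CHECKPOINT" ]] && step=$(cat "$CHECKPOINT")

# On SIGTERM (what `timeout` sends by default), save progress and exit cleanly
save_and_exit() {
    echo "$step" > "$CHECKPOINT"
    exit 0
}
trap save_and_exit TERM

while (( step < 1000000 )); do
    do_one_step "$step"   # stand-in for one resumable unit of work
    (( step++ ))
done

Most of the programs in the table at the end already have their own restart files, so in practice you often only need the relevant restart flag rather than writing this yourself.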
Catching Signals
Instead of using `timeout`, you can have Slurm send your script a signal (blog, man) just before the job ends. Here, I’m sending USR1 (User Signal 1) 300 seconds before the wall time; see the SLURM documentation for more.
#!/bin/bash
#SBATCH --signal=USR1@300
#... other slurm flags
awesome_script
# Requeue if the code signalled it needs more time (see the note on exit codes below)
if [[ $? == 124 ]]; then
    scontrol requeue "$SLURM_JOB_ID"
fi
Typically, a non-zero exit code in Linux means “something went wrong”. Because we don’t want to requeue a failed job indefinitely, we need to be able to distinguish between “something went wrong” and “I need more time”.
Here we’re checking if the exit code is 124 (`timeout` uses 124 to indicate the command timed out), but any non-zero exit code could work. Check your code’s docs to see which exit codes are normal, which indicate an error, and how to catch or raise a different signal.
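If your code can’t handle signals itself (see the table below), the batch script can catch the signal and forward it. Here is a minimal sketch, assuming `awesome_script` writes a checkpoint when it receives USR1; the `B:` prefix (from the sbatch man page) delivers the signal to the batch shell itself rather than only to srun-launched steps, so the trap below can fire.

#!/bin/bash
#SBATCH --signal=B:USR1@300 # B: sends USR1 to the batch shell so the trap fires
#... other slurm flags

requeue_handler() {
    kill -USR1 "$child_pid"   # forward the signal so the code saves its checkpoint
    wait "$child_pid"         # give it time to finish writing
    scontrol requeue "$SLURM_JOB_ID"
}
trap requeue_handler USR1

# Run the code in the background: bash only runs traps while it is waiting,
# not while a foreground command is running
awesome_script &
child_pid=$!
wait "$child_pid"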
Tool-Specific Support
| Program | Restart | Signal Catching/Handling |
|---|---|---|
| PyTorch Lightning | yes | yes |
| Quantum Espresso | yes | experimental |
| LAMMPS | yes | no, see Python |
| GPAW | yes | no, see Python |
| Python | ? | yes |
| Julia | ? | no |