Cluster Architecture
What’s New
- Bigger Nodes: Higher Core counts, More Memory, TBs of NVMe scratch
- Faster GPUs: Between 2.5x and 36x faster
- More Storage: Up to 100TB of Archival Storage
- No pre-built modules: you will need to use Spack
- 4 Tier Storage System: Node, Scratch, Turbo, and DataDen
- Short queues, limited wall times
Artemis by the Numbers
Node | # | CPU | GPU | RAM | Disk | $ |
---|---|---|---|---|---|---|
H100 | 3 | AMD 9654 | 4x H100 SXM | 768 GB | 1.9TB | 117,950 |
A100 | 2 | AMD 7513 | 4x A100 SXM | 512 GB | 1.6TB (?) | 58,597 |
Largemem | 3 | AMD 9654 | | 768 GB | 1.9TB | 13,989 |
CPU | 25 | AMD 9654 | | 368 GB | 1.9TB | 12,998 |
CPU | Cores | Threads | Base | Boost | L3 Cache |
---|---|---|---|---|---|
AMD Epyc 9654 CPU | 96 | 192 | 2.6GHz | 3.55GHz (All Core) | 384MB |
AMD Epyc 7513 CPU | 32 | 64 | 2.6GHz | 3.65GHz (Max) | 128MB |
Nodes are partitioned by threads, not cores. Picking 1 or a multiple of 2 is advisable; see sbatch’s `--distribution` flag.
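As a sketch of what this means for a job script, the fragment below requests both hardware threads of one core. It assumes Slurm counts each hardware thread as one CPU on these nodes, as the note above implies. The time limit and the `block:block` distribution are illustrative choices, not site recommendations.

```bash
#!/bin/bash
#SBATCH --partition=venkvis-cpu
#SBATCH --ntasks=1
# 2 hardware threads = 1 physical core (assumed from the note above)
#SBATCH --cpus-per-task=2
# Illustrative: allocate each task's CPUs consecutively from the same socket
#SBATCH --distribution=block:block
#SBATCH --time=00:10:00

# Report how many logical CPUs this job can actually use
echo "Logical CPUs available: $(nproc)"
```

Requesting an odd number of threads greater than 1 would leave half of a core that no other job can use efficiently, which is presumably why 1 or a multiple of 2 is advised.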
GPU | VRAM | GPU Mem Bandwidth | FP64 | FP64 TC | TF32 TC | BF16 TC |
---|---|---|---|---|---|---|
A100 SXM | 80GB | 2,039GB/s | 9.7 | 19.5 | 156 | 312 |
H100 SXM | 80GB | 3.34TB/s | 34 | 67 | 989 | 1989 |
FLOPs are listed in teraFLOPs (\(10^{12}\) floating point operations per second). Tensor Cores (TC) are specialized for general matrix multiplications (GEMM).
Partitions
Partition | Nodes | Max Wall Time | Priority | Max Jobs | Max Nodes |
---|---|---|---|---|---|
venkvis-cpu | CPU | 48hrs | | | |
venkvis-largemem | Large Mem | 48hrs | | | |
venkvis-a100 | A100 | 8hrs | | | |
venkvis-h100 | H100 | 8hrs | | | |
debug | all | 30 minutes | 100 | 1 | 4 |
- Usage is proportionate to the cost of the nodes you use
- Usage (currently) does not reset; thus, your fair share priority will not recover
- The Max Jobs limit is enforced using MaxSubmitJobsPerUser
A Note on Fairshare
Nodes are priced proportionate to their cost and the fraction that you use.
- 1 H100 and 1 - 48 CPUs is charged at 1/4th the H100 cost, or 29,487.5 per hour
- 1 A100 and 1 - 16 CPU is charged at 1/4th the A100 cost, or 14,649.25 per hour
- 1 CPU and 0 - 1.9GB is charged at 1/192 the CPU cost, or 67.7 per hour
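The three rates above follow from one rule: node cost times the fraction of the node you occupy. A minimal sketch of that arithmetic, using the node costs from the table earlier (the function name `charge_per_hour` is hypothetical and not part of Slurm):

```bash
#!/bin/bash
# Hypothetical helper: hourly charge = node cost * (units used / units per node).
# "Units" are GPUs on the GPU nodes and hardware threads on the CPU nodes.
charge_per_hour() {
    local node_cost=$1 units_used=$2 units_per_node=$3
    awk -v c="$node_cost" -v u="$units_used" -v n="$units_per_node" \
        'BEGIN { printf "%.2f\n", c * u / n }'
}

charge_per_hour 117950 1 4     # 1 of 4 H100s       -> 29487.50
charge_per_hour 58597 1 4      # 1 of 4 A100s       -> 14649.25
charge_per_hour 12998 1 192    # 1 of 192 threads   -> 67.70
```

The same helper shows why oversizing a request is costly: `charge_per_hour 117950 4 4` charges the full H100 node even if the job only keeps one GPU busy.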
Using the most appropriate resource for the job is the best way to spend less time in the queue
- Fair Share is based on your usage and is the primary driver of priority.
- Efficiently using the nodes will minimize your usage.
- Submitting to the wrong partition will kill your fair share
- Fair Share does not (currently) reset, but in the long term it will have a half-life of ~2 weeks
Debugging
The Debug partition is explicitly designed to get you on a node fast for debugging or development. It’s priced at the average cost per CPU/GPU/Memory unit.
- Use `--gres` to target particular GPUs (e.g. `--gres=gpu:h100:1` to get 1 H100 GPU, or `--gres=gpu:1` to get any GPU)
  - Being flexible lets Slurm schedule you sooner
- You can end up on any node that meets your requirements
- CPU-only jobs will get routed to GPU nodes if all the CPU nodes are taken
- You still only pay the debug rate and you only get what you asked for
- You can only have one debug job running or in the queue at a time
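A sketch of a debug job script that follows the points above. The partition name is taken from the Partitions table and assumed to be exactly `debug`; the CPU count is an illustrative choice.

```bash
#!/bin/bash
# Assumption: the partition is named "debug", as listed in the Partitions table
#SBATCH --partition=debug
# Any available GPU; use gpu:h100:1 instead to insist on an H100
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=2
# Stay within the 30 minute debug limit
#SBATCH --time=00:30:00

# Slurm normally exports CUDA_VISIBLE_DEVICES for jobs that request GPUs
echo "Running on $(hostname); GPUs visible: ${CUDA_VISIBLE_DEVICES:-none}"
```

Leaving the GPU type unspecified gives the scheduler more nodes to choose from, at the cost of not knowing in advance which GPU the job lands on.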
Priority on Lighthouse
Priority is how jobs are sorted in the queue; jobs with a higher priority run first.
Job_priority =
(10000) * min(Time In Queue / 28 Days, 1) +
(10000) * (fair-share_factor) +
(1000000) * (0 or 100 if venkvis-debug) +
... # Other stuff (Assoc Factor)
The `fair-share_factor` is (roughly) \(U_{total} / (N U_{you})\), where \(N\) is the size of the group, \(U_{total}\) is the total usage of the group, and \(U_{you}\) is your usage.
- You can get a report with `sshare -lU`, with `LevelFS` being your `fair-share_factor`
- It’s greater than 1 for under-served users
- Between 0 and 1 for over-served users
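To get a feel for the numbers, the sketch below evaluates the priority formula and the rough fair-share estimate as written above. It leaves out the "other stuff" terms, and the function names `fair_share_factor` and `job_priority` are hypothetical; Slurm's real calculation (Fair Tree) is more involved.

```bash
#!/bin/bash
# Rough fair-share estimate: U_total / (N * U_you)
fair_share_factor() {
    local total_usage=$1 group_size=$2 your_usage=$3
    awk -v t="$total_usage" -v n="$group_size" -v u="$your_usage" \
        'BEGIN { printf "%.2f\n", t / (n * u) }'
}

# Priority from the formula above, omitting the "other stuff" terms.
# Arguments: days in queue, fair-share factor, 1 if debug partition else 0
job_priority() {
    local age_days=$1 fs_factor=$2 is_debug=$3
    awk -v age="$age_days" -v fs="$fs_factor" -v dbg="$is_debug" 'BEGIN {
        a = age / 28; if (a > 1) a = 1        # age term saturates at 28 days
        printf "%d\n", 10000 * a + 10000 * fs + 1000000 * (dbg ? 100 : 0)
    }'
}

fair_share_factor 1000 10 50   # used half your even share -> 2.00
job_priority 14 0.5 0          # -> 10000
job_priority 14 0.5 1          # debug job -> 100010000
```

With these weights, 28 days of waiting is worth the same as one unit of fair-share factor, and the debug term outweighs both.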
See also: Slurm’s Multifactor Priority Plugin; The Fair Tree Fairshare Algorithm
Why not a longer Max Wall Time?
tl;dr: To keep the queue short. Use checkpointing for longer runs.
Let’s assume an M/M/1 queue:
- Jobs arrive at a rate of \(\lambda\) per time unit (Poisson process)
- Run times take on average \(1/\mu\) time units and are exponentially distributed
- First-come, first-served queue (so no priority, fair share, or partitions)
On average, the time from submission to job completion is \(\frac{1}{\mu - \lambda}\). The utilization is \(\rho = \lambda/\mu\); if \(\rho \ge 1\), the queue grows without bound. Otherwise, its expected length (counting the running job) is \(\frac{\rho}{1-\rho}\), with a variance of \(\frac{\rho}{(1-\rho)^2}\).
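A small sketch that evaluates these M/M/1 formulas for given arrival and service rates (the function name `mm1_stats` is hypothetical). Note that \(\frac{1}{\mu - \lambda} = \frac{1/\mu}{1 - \rho}\), so at a fixed utilization the time to completion scales directly with the mean run time \(1/\mu\).

```bash
#!/bin/bash
# Evaluate the M/M/1 formulas for arrival rate lambda and service rate mu.
# Prints: rho, mean time from submission to completion, mean and variance of length.
mm1_stats() {
    local lambda=$1 mu=$2
    awk -v l="$lambda" -v m="$mu" 'BEGIN {
        rho = l / m
        if (rho >= 1) { print "unstable: rho >= 1"; exit 1 }
        printf "rho=%.2f time=%.2f length=%.2f variance=%.2f\n",
               rho, 1 / (m - l), rho / (1 - rho), rho / ((1 - rho) ^ 2)
    }'
}

mm1_stats 0.5 1    # -> rho=0.50 time=2.00 length=1.00 variance=2.00
mm1_stats 0.9 1    # same service rate, higher load: everything grows sharply
```

Here rates are in jobs per time unit, so a mean run time of 8 hours corresponds to `mu = 0.125` per hour.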
Expected Queue Lengths
Plot of the queue length (red), its variance (blue), and the \(+3\sigma\) band (green)
- The expected queue length (red) rapidly increases as \(\rho \to 1\)
- The variance in queue length (blue) increases even faster
The expected wait for a one-off job is ~1/2 the max wall time divided by the number of nodes
- ~0:40 for an H100 GPU
- ~1 hr for an A100 GPU
- ~2 hrs for an entire CPU node
Storage
Name | Path | Base Size | Fair Share | $/TB/month | Notes |
---|---|---|---|---|---|
Node Local | /tmp | 1.9TB | | | Fast, On-Node NVMe storage |
Turbo, replicated | /nfs/turbo/coe-venkvis/ | 10TB | 500GB | 13.02 | Fast, Automated regular backups |
scratch | /scratch/venkvis_root/venkvis/ | 10TB | 500GB | | Fast, Auto-purged 60 days after last use |
DataDen | Access via Globus | 100TB | 5TB | 1.67 | Tape Storage; Files should be between 10 - 200 GB, accessible only via Globus |
/home | /home/<user> | 80GB | | | Fast, mounted on Turbo |
- Node Local: Scratch files, temporary checkpoints
- Turbo/ Home: Software, Environments, Code
- scratch: Large Datasets actively being used, multi-node checkpoints.
- DataDen: Large Datasets not actively being used
Please manage your storage responsibly and clean up after yourself. In particular, node-local storage is not automatically cleaned up; it’s on you to remove your files.
Cleanup with trap
Create a job-specific TMPDIR and use `trap` to clean it up on exit
#!/bin/bash
# SBATCH directives
# Create job-specific tmp directory
# (assign first, then export, so a mktemp failure is not masked)
TMPDIR=$(mktemp --directory --tmpdir) || exit 1
export TMPDIR
# Ensure Cleanup on exit
trap 'rm -rf -- "$TMPDIR"' EXIT
# ... do job stuff
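Building on the script above, here is a sketch of the full pattern: work in a private node-local directory, copy only the results you want to keep to longer-lived storage, and let `trap` remove the rest. The function name `run_with_scratch`, the file `output.dat`, and the results directory are hypothetical; on the cluster the results directory would typically sit under scratch or Turbo. It assumes GNU `mktemp`, as the script above does.

```bash
#!/bin/bash
# Sketch: run work in a job-specific temp directory and keep only the results.
# Prints the temp directory path so the caller can confirm it was removed.
run_with_scratch() {
    local results_dir=$1
    (
        # Subshell: the EXIT trap fires as soon as this block finishes
        job_tmp=$(mktemp --directory --tmpdir) || exit 1
        trap 'rm -rf -- "$job_tmp"' EXIT

        # ... do job stuff (placeholder: write one result file)
        echo "result" > "$job_tmp/output.dat"

        # Copy what should survive the job before cleanup runs
        mkdir -p "$results_dir"
        cp "$job_tmp/output.dat" "$results_dir/"
        echo "$job_tmp"
    )
}

# Example: keep results under a directory of your choosing
# run_with_scratch "$HOME/results"
```

Copying results out before the trap runs matters because everything left in the temp directory is deleted when the block exits, including on failure.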