Cluster Architecture

What’s New

  • Bigger nodes: higher core counts, more memory, TBs of NVMe scratch
  • Faster GPUs: 2.5x to 36x faster
  • More storage: up to 100 TB of archival storage
  • No pre-built modules; you will need to use Spack (see the short sketch after this list)
  • 4-tier storage system: node-local, scratch, Turbo, and DataDen
  • Short queues, limited wall times
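
A minimal Spack workflow looks like the following (a sketch only: the package name is an arbitrary example, and the setup-env.sh path depends on where Spack is installed on the cluster):

# Make the spack command available (adjust the path to the cluster's Spack install)
source <spack-prefix>/share/spack/setup-env.sh

# Build a package and add it to your environment
spack install fftw
spack load fftw

# List what is currently loaded
spack find --loaded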

Artemis by the Numbers

| Node     | #  | CPU      | GPU         | RAM    | Disk       | $       |
|----------|----|----------|-------------|--------|------------|---------|
| H100     | 3  | AMD 9654 | 4x H100 SXM | 768 GB | 1.9 TB     | 117,950 |
| A100     | 2  | AMD 7513 | 4x A100 SXM | 512 GB | 1.6 TB (?) | 58,597  |
| Largemem | 3  | AMD 9654 |             | 768 GB | 1.9 TB     | 13,989  |
| CPU      | 25 | AMD 9654 |             | 368 GB | 1.9 TB     | 12,998  |
| CPU           | Cores | Threads | Base    | Boost               | L3 Cache |
|---------------|-------|---------|---------|---------------------|----------|
| AMD Epyc 9654 | 96    | 192     | 2.6 GHz | 3.55 GHz (all-core) | 384 MB   |
| AMD Epyc 7513 | 32    | 64      | 2.6 GHz | 3.65 GHz (max)      | 128 MB   |

Nodes are partitioned by threads, not cores. Request 1 thread or a multiple of 2 so that tasks map cleanly onto physical cores; see sbatch's --distribution flag.
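
For example (a sketch; the executable is a placeholder), the following asks for 8 tasks with 2 hardware threads each, i.e. one physical core per task:

# 8 tasks x 2 threads = 8 physical cores; --distribution controls task placement (see sbatch(1))
srun --ntasks=8 --cpus-per-task=2 --distribution=block:block ./my_program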

| GPU      | VRAM  | Mem Bandwidth | FP64 | FP64 TC | TF32 TC | BF16 TC |
|----------|-------|---------------|------|---------|---------|---------|
| A100 SXM | 80 GB | 2,039 GB/s    | 9.7  | 19.5    | 156     | 312     |
| H100 SXM | 80 GB | 3.34 TB/s     | 34   | 67      | 989     | 1,979   |

FLOPS are listed in teraFLOPS (\(10^{12}\) floating-point operations per second). Tensor Cores (TC) are specialized for general matrix multiplication (GEMM); note that the H100 TF32/BF16 Tensor Core figures assume structured sparsity.
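
As a rough, idealized illustration (peak Tensor Core throughput only, ignoring memory traffic and launch overhead): multiplying two \(8192 \times 8192\) matrices costs about \(2 \cdot 8192^3 \approx 1.1 \times 10^{12}\) FLOPs, roughly 7 ms at the A100's 156 TFLOPS TF32 rate versus roughly 1.1 ms at the H100's 989 TFLOPS.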


Partitions

| Partition        | Nodes    | Max Wall Time | Priority | Max Jobs | Max Nodes |
|------------------|----------|---------------|----------|----------|-----------|
| venkvis-cpu      | CPU      | 48 hrs        |          |          |           |
| venkvis-largemem | Largemem | 48 hrs        |          |          |           |
| venkvis-a100     | A100     | 8 hrs         |          |          |           |
| venkvis-h100     | H100     | 8 hrs         |          |          |           |
| debug            | all      | 30 minutes    | 100      | 1        | 4         |
  • Usage is proportional to the cost of the nodes you use
  • Usage (currently) does not reset, so your fair-share priority will not recover
  • The Max Jobs limit is enforced using MaxSubmitJobsPerUser
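
A minimal submission header targeting one of these partitions might look like this (a sketch: the resource sizes and job name are placeholders, and your account/allocation flags may differ):

#!/bin/bash
#SBATCH --partition=venkvis-a100   # one of the partitions above
#SBATCH --time=08:00:00            # must fit within the partition's max wall time
#SBATCH --nodes=1
#SBATCH --gres=gpu:1               # one GPU
#SBATCH --cpus-per-task=16
#SBATCH --job-name=example         # placeholder

srun ./my_program                  # placeholder executable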

A Note on Fairshare

Usage is charged in proportion to a node's cost and the fraction of the node that you use.

  • 1 H100 and 1-48 CPUs are charged at 1/4 of the H100 node rate, or 29,487.5 per hour
  • 1 A100 and 1-16 CPUs are charged at 1/4 of the A100 node rate, or 14,649.25 per hour
  • 1 CPU and up to 1.9 GB of memory are charged at 1/192 of the CPU node rate, or 67.7 per hour

Using the most appropriate resource for the job is the best way to spend less time in the queue.

  • Fair share is based on your usage and is the primary driver of priority.
  • Using the nodes efficiently minimizes your usage.
  • Submitting to the wrong partition will kill your fair share.
  • Fairshare does not (currently) reset, but in the long term it will have a half-life of ~2 weeks.
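
For example (illustrative numbers): requesting a whole H100 node (all 4 GPUs) but using only one of them for 10 hours accrues 117,950 × 10 = 1,179,500 in usage, four times the 29,487.5 × 10 = 294,875 of a right-sized single-GPU request, and the difference comes straight out of your fair-share priority.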

Debugging

The Debug partition is explicitly designed to get you on a node fast for debugging or development. It's priced at the average cost per CPU/GPU/memory unit.

  • Use --gres to target a particular GPU type (e.g. --gres=gpu:h100:1 for one H100, or --gres=gpu:1 for any GPU); see the example after this list
    • Being flexible lets Slurm schedule you sooner
  • You can end up on any node that meets your requirements
    • CPU-only jobs will be routed to GPU nodes if all the CPU nodes are taken
    • You still only pay the debug rate, and you only get what you asked for
  • You can only have one debug job running or queued at a time
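
For example, an interactive debug session might be requested like this (a sketch: resource sizes are placeholders and the partition name follows the table above):

# 30-minute interactive session with one H100
srun --partition=debug --time=00:30:00 --gres=gpu:h100:1 --cpus-per-task=4 --pty bash

# or take any available GPU to get scheduled sooner
srun --partition=debug --time=00:30:00 --gres=gpu:1 --cpus-per-task=4 --pty bash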

Priority on Lighthouse

Priority is how jobs are sorted in the queue; jobs with higher priority run first.

Job_priority =
      (10000)   * min(Time In Queue / 28 days, 1)
    + (10000)   * (fair-share_factor)
    + (1000000) * (0, or 100 if venkvis-debug)
    + ...       # other factors (e.g. the association factor)
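
To see how these factors combine for your own pending jobs, Slurm's sprio utility prints the weighted components (standard Slurm commands, not cluster-specific):

sprio -u $USER   # weighted priority components for your pending jobs
sprio -w         # the configured factor weights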

The fair-share_factor is (roughly) \(U_{total} / (N \, U_{you})\), where \(N\) is the size of the group, \(U_{total}\) is the group's total usage, and \(U_{you}\) is your usage.

  • You can get a report with sshare -lU with LevelFS being your fair-share_factor
  • It’s greater than 1 for under-served users
  • Between 0 and 1 for over-served users
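
For example (illustrative numbers): in a group of \(N = 4\) with total usage \(U_{total} = 1000\), a member with \(U_{you} = 100\) has LevelFS \(= 1000 / (4 \times 100) = 2.5\) and is under-served, while a member with \(U_{you} = 500\) has LevelFS \(= 0.5\) and is over-served.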

See Slurm's Multifactor Priority Plugin and the Fair Tree Fairshare Algorithm documentation for details.


Why not a longer Max Wall Time?

tl;dr: To keep the queue short. Use checkpointing for longer runs.

Let's assume an M/M/1 queue:

  • Jobs arrive at rate \(\lambda\) (a Poisson process), i.e. on average one every \(1/\lambda\) time units
  • Run times take on average \(1/\mu\) time units and are exponentially distributed
  • First-come, first-served queue (so no priority, fair share, or partitions)

On average, the time from submission to job completion is \(\frac{1}{\mu - \lambda}\). The utilization is \(\rho = \lambda/\mu\); if \(\rho \ge 1\) the queue grows without bound. Otherwise, the expected number of jobs in the system is \(\frac{\rho}{1-\rho}\), with a variance of \(\frac{\rho}{(1-\rho)^2}\).
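
For example (illustrative numbers): if jobs arrive at \(\lambda = 9\) per day and complete at \(\mu = 10\) per day, then \(\rho = 0.9\), the mean time from submission to completion is \(1/(\mu - \lambda) = 1\) day, the expected number of jobs in the system is \(0.9/0.1 = 9\), and its variance is \(0.9/0.01 = 90\). Pushing \(\rho\) to 0.99 stretches the mean wait to 10 days, which is why shorter wall times (a larger \(\mu\)) keep the queue manageable.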


Expected Queue Lengths

Plot of the expected queue length (red), its variance (blue), and the \(+3\sigma\) band (green)

  • The expected queue length (red) rapidly increases as \(\rho \to 1\)
  • The variance in queue length (blue) increases even faster

The expected wait for a one-off job is roughly half the max wall time divided by the number of nodes

  • ~0:40 for an H100 GPU
  • ~1 hr for an A100 GPU
  • ~2 hrs for an entire CPU node

Storage

| Name               | Path                           | Base Size | Fair Share | $/TB/month | Notes                                                      |
|--------------------|--------------------------------|-----------|------------|------------|------------------------------------------------------------|
| Node-local         | /tmp                           | 1.9 TB    |            |            | Fast, on-node NVMe storage                                 |
| Turbo (replicated) | /nfs/turbo/coe-venkvis/        | 10 TB     | 500 GB     | 13.02      | Fast; automated regular backups                            |
| Scratch            | /scratch/venkvis_root/venkvis/ | 10 TB     | 500 GB     |            | Fast; auto-purged 60 days after last use                   |
| DataDen            | (access via Globus)            | 100 TB    | 5 TB       | 1.67       | Tape storage; files should be 10-200 GB; Globus access only |
| /home              | /home/<user>                   | 80 GB     |            |            | Fast; mounted on Turbo                                     |
  • Node Local: Scratch files, temporary checkpoints
  • Turbo / Home: Software, Environments, Code
  • scratch: Large Datasets actively being used, multi-node checkpoints.
  • DataDen: Large Datasets not actively being used
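
A common pattern with these tiers is to stage inputs from scratch onto the node-local NVMe at the start of a job, work against the local copy, and copy results back at the end. A minimal sketch, assuming a per-user directory under scratch (file names are placeholders) and the TMPDIR setup from the next section:

# Stage input from scratch onto fast node-local storage
cp /scratch/venkvis_root/venkvis/$USER/input.dat "$TMPDIR"/

# ... run against the local copy ...

# Copy results back to scratch before the job ends
cp "$TMPDIR"/results.dat /scratch/venkvis_root/venkvis/$USER/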

Please manage your storage responsibly and clean up after yourself. In particular, node-local storage is not cleaned up automatically; removing your files there is on you.

Cleanup with trap

Use mktemp to create a job-specific TMPDIR and trap to clean it up on exit

#!/bin/bash
# SBATCH directives

# Create a job-specific temporary directory
# (mktemp --tmpdir places it under the current $TMPDIR, or /tmp if unset)
export TMPDIR=$(mktemp --directory --tmpdir)

# Ensure cleanup when the script exits, even after an error
trap 'rm -rf -- "$TMPDIR"' EXIT

# ... job commands go here; point temporary files at "$TMPDIR"