Middlebury

High Performance Computing (HPC)/Training

Overview of the Ada Cluster

How is a cluster different from my laptop/desktop?

Architecture

Logging in

ssh username@ada
  • "username" is your Middlebury username. If your username on the computer you're logging in from is also your Midd username (e.g. if you're using a college-owned computer), then you can just use the command "ssh ada".
  • You will be prompted for your Middlebury password. After you enter it, you will have a Linux command prompt on the head node "ada".
  • You are now in your home directory on ada. From here you can access the filesystem in your home directory, using standard linux commands. For example, we can make a directory:
mkdir test_job
  • While it's not necessary, for convenience you can consider setting up public key authentication from your laptop or desktop; this will allow you to login securely without entering your password.

Editing and otherwise working with files

There are a number of approaches to working with your files on ada. Some examples include:

  • Connect via SSH as described above and edit any files in the terminal
  • Use an SSH extension within your editor to edit files on the remote machine as though they were local. An example is Visual Studio Code with the Remote - SSH extension. In this model all files are stored on ada but the editor appears to run locally.
  • Use a third-party tool, such as MobaXterm for Windows or SSHFS for OSX/Linux, that makes ada appear to be a network drive
  • Develop locally and copy your files from your personal computer using rsync or a similar command

Submitting jobs via the Slurm scheduler

To run jobs on the cluster, you must submit them via script to the slurm scheduler. A summary of slurm commands and options can be found here.

Basic slurm script

  • We have the basic slurm script shown below in the text file "slurm_serial.sh":
#!/usr/bin/env bash
# slurm template for serial jobs
# Set SLURM options
#SBATCH --job-name=serial_test # Job name
#SBATCH --output=serial_test-%j.out # Standard output and error log
#SBATCH --mail-user=username@middlebury.edu # Where to send mail
#SBATCH --mail-type=NONE # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mem=100mb # Job memory request
#SBATCH --partition=standard # Partition (queue)
#SBATCH --time=00:05:00 # Time limit hrs:min:sec

# print SLURM environment variables
echo "Job ID: ${SLURM_JOB_ID}"
echo "Node: ${SLURMD_NODENAME}"
echo "Starting: "`date +"%D %T"`
# Your calculations here
printf "\nHello world from ${SLURMD_NODENAME}!\n\n"
# End of job info
echo "Ending: "`date +"%D %T"`
  • A list of environment variables that can be configured in the slurm submit script is here.

Submitting jobs

  • Jobs are submitted to the slurm scheduler via the sbatch command:
sbatch slurm_serial.sh
  • A list of options for sbatch can be found here.
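
On success, sbatch prints a line of the form "Submitted batch job <ID>". The sketch below shows one way to pull the job ID out of that line and predict the output file name set by "--output=serial_test-%j.out". The job ID 12345 is made up for illustration.

```shell
#!/usr/bin/env bash
# Example line in the format sbatch prints on success (the job ID is made up).
# On the cluster you could instead use: submit_output="$(sbatch slurm_serial.sh)"
submit_output="Submitted batch job 12345"

# Strip everything up to and including the last space, leaving only the ID
job_id="${submit_output##* }"

echo "Job ID: ${job_id}"
echo "Output will be written to: serial_test-${job_id}.out"
```

Capturing the ID this way is convenient when a script needs to find the job's log file or refer to the job later.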

 

Monitoring jobs

  • You can monitor the status of jobs in the queue via the squeue command:
squeue
  • You can review which nodes are assigned to which queues and which nodes are idle via the sinfo command:
sinfo
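
On a busy cluster the full queue can be long. A minimal sketch for listing only your own jobs, using the "-u" option of squeue, is shown here; the guard is only there so the snippet fails gracefully when run on a machine without Slurm.

```shell
#!/usr/bin/env bash
# List only the current user's jobs; guarded so it degrades gracefully off the cluster.
if command -v squeue >/dev/null 2>&1; then
  squeue -u "$(id -un)"
  have_slurm=yes
else
  echo "squeue not found: run this on ada"
  have_slurm=no
fi
```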

Parallel Jobs

Array jobs

If a serial job can easily be broken into several (or many) independent pieces, then it's most efficient to submit an array job, which is a set of closely related serial jobs that will all run independently.

  • To submit an array job, use the slurm option "--array". For example "--array=0-4" will run 5 independent tasks, labeled 0-4 by the environment variable SLURM_ARRAY_TASK_ID.
  • To allow each array task to perform a different calculation, you can use SLURM_ARRAY_TASK_ID as an input parameter to your calculation.
  • Each array task will appear as an independent job in the queue and run independently.
  • An entire array job can be canceled at once or each task can be canceled individually.

Here is a simple example of a slurm array job script:

#!/usr/bin/env bash
# slurm template for array jobs
# Set SLURM options
#SBATCH --job-name=array_test # Job name
#SBATCH --output=array_test-%A-%a.out # Standard output and error log
#SBATCH --mail-user=username@middlebury.edu # Where to send mail
#SBATCH --mail-type=NONE # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mem=100mb # Job memory request
#SBATCH --partition=standard # Partition (queue) 
#SBATCH --time=00:05:00 # Time limit hrs:min:sec 
#SBATCH --array=0-4 # Array range

# print SLURM environment variables
echo "Job ID: ${SLURM_JOB_ID}"
echo "Array ID: ${SLURM_ARRAY_TASK_ID}"
echo "Node: ${SLURMD_NODENAME}"
echo "Starting: "`date +"%D %T"` 
# Your calculations here
printf "\nHello world from array task ${SLURM_ARRAY_TASK_ID}!\n\n" 
# End of job info
echo "Ending: "`date +"%D %T"` 

An example of how a serial job can be broken into an array job is on the HPC Github repository (see below).
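
As a minimal sketch of using SLURM_ARRAY_TASK_ID as an input parameter, the snippet below looks up a different value for each task in a bash array. The parameter values and the program name in the final comment are hypothetical placeholders; the fallback to task 0 is only so the snippet can be tried outside of Slurm.

```shell
#!/usr/bin/env bash
# Sketch: map each array task ID to its own input parameter.
# The parameter values are made-up placeholders.
params=(0.1 0.2 0.5 1.0 2.0)          # one entry per task in --array=0-4

task_id="${SLURM_ARRAY_TASK_ID:-0}"   # defaults to task 0 outside Slurm
param="${params[$task_id]}"

echo "Task ${task_id} runs with parameter ${param}"
# A real job would now call something like: ./my_program "$param"  (hypothetical)
```

Keeping the list of parameters inside the script means a single file describes the whole array job, and the array range only needs to match the number of entries.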

Shared memory or multi-threaded jobs

If your code can take advantage of multiple CPU cores via multi-threading, you can request multiple CPU cores on a single node for your job in the slurm script via the "--cpus-per-task" option. For example specifying:

#SBATCH --cpus-per-task=8    # Number of CPU cores for this job

in the slurm script would request 8 CPU cores for the job. The standard CPU compute nodes have 36 cores per node, so you can request up to 36 cores per job. All cores will be on the same node and share memory, as if the calculation were running on a single standalone workstation.

Note that your code must be able to take advantage of the additional CPU cores that slurm allocates--if you request multiple cores for a purely serial code (i.e. that can only use 1 CPU core) the additional CPU cores will remain idle.
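
Many threaded programs need to be told how many threads to start. The sketch below assumes a program that uses OpenMP and reads the OMP_NUM_THREADS environment variable, which may not apply to your code. It passes along the core count that Slurm exposes in SLURM_CPUS_PER_TASK when "--cpus-per-task" is set. The executable name in the final comment is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: pass the core count Slurm allocated on to a threaded program.
# Assumes the program uses OpenMP and reads OMP_NUM_THREADS.
# Falls back to 1 thread when SLURM_CPUS_PER_TASK is unset (e.g. off the cluster).
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

echo "Running with ${OMP_NUM_THREADS} thread(s)"
# ./my_threaded_program   # hypothetical executable
```

Deriving the thread count from the Slurm variable keeps the request and the actual usage in step when you change "--cpus-per-task".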

An example of shared memory parallelization is available in the GitHub repository in the "multithread-example" directory.

Multi-node (MPI) jobs

The cluster is currently not configured to allow for multi-node (e.g. MPI) jobs.

 

GPU jobs

There is a single GPU compute node (with 4 GPUs) which is accessible via the gpu-standard, gpu-short, and gpu-long queues. All GPU jobs must be submitted to one of these queues via the --partition option. You should also specify the number of GPUs your job will use via the --gres option (short for "Generic Resources"). For example, to use one GPU:

#SBATCH --partition=gpu-standard                    # Partition (queue)
#SBATCH --gres=gpu:1                                # Number of GPUs

When you set the --gres option, Slurm configures your job to use specific GPU(s), which lets multiple jobs and users run concurrently on the GPU node (each configured to use different GPUs). Your program will see only the GPU(s) assigned to it.
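
To check what a job was given, you can print the GPU list from inside the job script. The sketch below assumes that Slurm exports the assigned devices in the CUDA_VISIBLE_DEVICES environment variable as a comma-separated list, which is common for NVIDIA GPUs but is an assumption about how ada is configured. The helper name count_gpus is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: count GPUs from a comma-separated device list such as "0,1".
count_gpus() {
  local list="$1"
  if [ -z "$list" ]; then
    echo 0
    return
  fi
  local ids
  IFS=',' read -r -a ids <<< "$list"   # split the list on commas
  echo "${#ids[@]}"
}

# Assumption: Slurm sets CUDA_VISIBLE_DEVICES for jobs that request --gres=gpu:N
echo "GPUs visible to this job: $(count_gpus "${CUDA_VISIBLE_DEVICES:-}")"
```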

Note that the GPU node has fewer CPU cores (16) than the other nodes, so make sure to set your --cpus-per-task option to 16 or fewer for this node.

Large Memory jobs

Standard CPU compute nodes have a total of 96 GB of RAM, so you can request up to 96 GB for jobs submitted to the standard, short or long queues. In your slurm submit script you should specify the amount of memory needed via the --mem option. For example, include the line:

#SBATCH --mem=2gb

to request 2 GB for a job. If your job requires more than 96 GB of RAM, you will need to use the high memory node, which has 768 GB of RAM. To access the high memory node you need to submit to the himem-standard, himem-short, or himem-long queue. For example, including the options:

#SBATCH --partition=himem-standard              # Partition (queue) 
#SBATCH --mem=128gb                             # Job memory request

would request 128GB of RAM using the himem-standard queue.

Storage

  • Each user has a home directory located at /home/$USER, where $USER is your Middlebury username; it is also accessible via the $HOME environment variable. Each user has a quota of 50 GB in their home directory.
  • Additionally, each user has a storage directory located at /storage/$USER, which is also accessible via the $STORAGE environment variable. The quota on each user's storage directory is 400 GB.
  • The home directory has a fairly small quota as it is only intended for storage of scripts, code, executables, and small parameter files, NOT for data storage.
  • Data files should be stored in the storage directory.

Local scratch storage

Home and storage directories are located on separate nodes (the head node and storage nodes) and only mounted remotely to each compute node via ethernet. For jobs that need to frequently read/write significant amounts of data to disk, it may be advantageous to read/write to the local scratch space on each compute node which will be much faster to access.

Local scratch directories for each user are available at /local/$USER, which is also accessible via the $SCRATCH environment variable.
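
A common pattern is to copy input data from storage to scratch at the start of a job, do the heavy reading and writing there, and copy the results back at the end. The sketch below illustrates that pattern under stated assumptions: the file names and the per-job directory name are hypothetical, the "calculation" is a placeholder that just uppercases a file, and temporary directories are used as stand-ins when $STORAGE and $SCRATCH are not set, so the snippet can be tried off the cluster.

```shell
#!/usr/bin/env bash
# Sketch: stage data through local scratch.
# Falls back to temporary directories when $STORAGE / $SCRATCH are unset.
storage_dir="${STORAGE:-$(mktemp -d)}"
scratch_dir="${SCRATCH:-$(mktemp -d)}"

# Hypothetical per-job working directory inside scratch
work_dir="${scratch_dir}/job_${SLURM_JOB_ID:-manual}"
mkdir -p "$work_dir"

# Hypothetical input file; in a real job this would already exist in storage
input_file="${storage_dir}/scratch_demo_input.dat"
printf 'alpha\nbeta\ngamma\n' > "$input_file"

# 1. Copy the input to local scratch
cp "$input_file" "$work_dir/"

# 2. Do the I/O-heavy work against scratch (placeholder: uppercase the file)
tr '[:lower:]' '[:upper:]' < "$work_dir/scratch_demo_input.dat" \
  > "$work_dir/scratch_demo_output.dat"

# 3. Copy the results back to storage and clean up scratch
cp "$work_dir/scratch_demo_output.dat" "$storage_dir/"
rm -rf "$work_dir"

echo "Results saved to ${storage_dir}/scratch_demo_output.dat"
```

Because scratch is local to each compute node, results left there are not visible from the head node, so the copy-back step should come before the job ends.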

Checkpointing

Checkpointing your jobs running on ada is recommended. Checkpointing stores the internal state of your calculation periodically so the job can be restarted from that state, e.g. if the node goes down or the wall clock limit is reached. Ideally, checkpointing is done internally in your application (it is built into many open source and commercial packages); if your application doesn't support checkpointing internally you can use an external checkpointing tool such as dmtcp. Here we'll illustrate an example of using external checkpointing via dmtcp, found in the directory "ckpt-example" on the GitHub repository.

  • We'll illustrate checkpointing using a simple counter. First compile the executable "count" from the source code "counter.c" via:
gcc counter.c -o count

Now you should see the executable file "count". Take a look at the slurm script slurm-ckpt-start.sh. The key line is:

timeout 15 dmtcp_launch --no-coordinator -p 0 -i 10 ./count 
  • "timeout" is a standard Linux utility that automatically stops the command that follows after a set time; the "15" is the number of seconds before the process is killed. You can also use units such as hours and days, e.g. "timeout 47h". Timeout is not necessary for checkpointing, but it lets you stop your job before the wall clock limit is reached and slurm kills your job.
  • "dmtcp_launch" is the command to start running your executable (in this case count) through the dmtcp checkpointing tool. We suggest you always use the "--no-coordinator -p 0" options to avoid interference with other jobs.
  • The "-i" option sets how often dmtcp will store the state of your process to a checkpoint file. "-i 10" checkpoints the process every 10 seconds, much more frequently than you would ever want to do in practice (this is just so the example goes quickly). More reasonable for an actual job would be "-i 3600" to checkpoint once an hour.
  • In practice, the checkpointing syntax for "your_executable" might be something like:
timeout 47h dmtcp_launch --no-coordinator -p 0 -i 3600 your_executable
  • Now submit the slurm script "slurm-ckpt-start.sh":
sbatch slurm-ckpt-start.sh
  • Once that job has completed, you should see a checkpointing file of the form "ckpt_count_*.dmtcp". Your job can be restarted using the "dmtcp_restart" command, as is found in "slurm-ckpt-restart.sh":
sbatch slurm-ckpt-restart.sh
  • You can restart and continue the job any number of times via the same restart script. For example, try submitting the restart script a second time:
sbatch slurm-ckpt-restart.sh
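
When a job is wrapped in timeout as described above, it can be useful for the script to know whether the calculation finished or was stopped. GNU timeout exits with status 124 when it had to stop the command. The sketch below shows one way to check for that, using "sleep 5" as a stand-in for a long calculation; it is an illustration and is not taken from the repository scripts.

```shell
#!/usr/bin/env bash
# Sketch: tell a timeout apart from a normal exit.
# "sleep 5" is a stand-in for a long-running calculation.
status=0
timeout 1 sleep 5 || status=$?

if [ "$status" -eq 124 ]; then
  # GNU timeout returns 124 when it had to stop the command
  outcome="timed out"
elif [ "$status" -eq 0 ]; then
  outcome="finished normally"
else
  outcome="failed with exit status ${status}"
fi

echo "Calculation ${outcome}"
```

A job script could use this result to decide whether the restart script needs to be submitted again.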

Sample jobs

Breaking a serial job into an array job

An example of using array jobs is in the directory "array_job_example" on the HPC GitHub repository.

  • The python script factor_list.py will find the prime factors of a list of integers, e.g. the 12-digit numbers in the file "sample_list_12.dat":
python factor_list.py sample_list_12.dat
  • To factor all 20 16-digit numbers in "sample_list_12.dat" as a single serial job (which will take several minutes), submit the slurm script "serial_factor.sh":
sbatch serial_factor.sh
  • The factors will be stored in "serial_factors_out.dat"
  • The slurm script "array_factor.sh" breaks the calculation up into a 10 task array job:
sbatch array_factor.sh
  • Each array task stores the results in the file "array_factors_out-${SLURM_ARRAY_TASK_ID}.dat" where the task array ID runs from 0-9.
  • After all the array tasks are complete, the data can be combined into a single file, e.g. array_factors_out.dat:
cat array_factors_out-?.dat > array_factors_out.dat
  • You can check that both methods give you the same result via diff:
diff serial_factors_out.dat array_factors_out.dat
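
One way to divide a list among array tasks is to give each task a contiguous block of lines, so that concatenating the per-task outputs in task order reproduces the serial ordering. The sketch below illustrates that idea with a made-up 20-line list and 10 tasks. The file contents and the chunking scheme are assumptions for illustration and are not necessarily what array_factor.sh does.

```shell
#!/usr/bin/env bash
# Sketch: give each array task a contiguous chunk of an input list.
num_tasks=10
task_id="${SLURM_ARRAY_TASK_ID:-0}"   # defaults to task 0 outside Slurm

# Hypothetical input list with 20 entries, one per line
list_file="$(mktemp)"
seq 101 120 > "$list_file"

total=$(wc -l < "$list_file")
per_task=$(( (total + num_tasks - 1) / num_tasks ))   # ceiling division
first=$(( task_id * per_task + 1 ))
last=$(( first + per_task - 1 ))

# Extract only the lines that belong to this task
chunk="$(sed -n "${first},${last}p" "$list_file")"

echo "Task ${task_id} handles lines ${first}-${last}:"
echo "$chunk"
rm -f "$list_file"
```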

Serial Stata job

The primary difference between using Stata on the cluster and using Stata on your computer is learning how to run Stata in batch mode, that is, non-interactively. To use Stata on the cluster, you will need a shell script (*.sh) that inserts your Stata process into the Slurm queue and runs your Stata do file from the command line. You need basic Unix command skills, basic Slurm syntax and a Stata do file.

You can visit Middlebury's HPC repository on GitHub to see executable examples of both a serial Stata job and a parallel Stata job in the "Stata-examples" directory. A serial Stata job is the simplest, using a single processor on a single node to execute your calculations. Most Stata users who need the cluster for their calculations will want to use the parallel computing capabilities. Both the serial and parallel computing examples use "stata_auto.do" as the sample do file, so be sure to download it as well. Copy the shell script and do file to your home directory on Ada. The command to run the serial shell script is:

 sbatch stata_serial.sh

Parallel Stata job

Because we are using Stata MP (multiprocessor), the program already has built-in multiprocessor capabilities. Our license allows us to use up to 16 processors. Stata will automatically use as many processors as it can "see", which is where the specifications in Slurm (the queuing software) are important. There is a single difference between the serial job syntax and the parallel job syntax for Stata, and that is to change "#SBATCH --cpus-per-task=1" to "#SBATCH --cpus-per-task=16" in the shell script, which tells Stata there are 16 computing processors available (see the above section on "Shared memory or multi-threaded jobs").

Copy the example script and do file to your home directory on Ada and type the following command:

 sbatch stata_parallel.sh

 

Modules

Ada uses Environment modules to manage specialized software. Modules are short scripts that automatically configure your environment (i.e. set the PATH and other environment variables). You can view the available modules via the command:

module avail

Modules can be loaded via "module load", e.g.

module load python/anaconda2

If your job needs a module which is not loaded by default, you must load the appropriate module in your slurm submit script.

The python/anaconda2 module is a file named anaconda2 in the python directory. You can create your own module files by creating a directory to contain them, e.g. modulefiles, with a subdirectory for each program and a module file for each version. The use command will add your module files to the module search path, e.g.

module use $HOME/modulefiles

You will need to execute the use command every time you log in. Doing so at every login is tedious, so you can instead add the use command to your .bash_profile file.
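
As a rough sketch of the directory layout described above, the snippet below creates a personal module file for a hypothetical program "mytool", version 1.0. The program name, version, and install path are placeholders. The module file uses the Tcl modulefile format, which starts with a "#%Module" line and here only prepends a directory to PATH.

```shell
#!/usr/bin/env bash
# Sketch: create a personal module file for a hypothetical program "mytool" 1.0.
# The program name, version, and install path are placeholders.
modules_root="${HOME}/modulefiles"
install_dir="${HOME}/software/mytool/1.0"

# One subdirectory per program, one module file per version
mkdir -p "${modules_root}/mytool"

cat > "${modules_root}/mytool/1.0" <<EOF
#%Module1.0
## Adds the hypothetical mytool 1.0 install to PATH
prepend-path PATH ${install_dir}/bin
EOF

echo "Wrote ${modules_root}/mytool/1.0"
```

After running "module use $HOME/modulefiles", the new module should appear in "module avail" and could be loaded with "module load mytool/1.0".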

Git repository

Sample slurm scripts and example jobs are available in the GitHub repository:

https://github.com/middlebury/HPC

You can clone a copy of this repository to your home directory (or elsewhere) via the command:

git clone https://github.com/middlebury/HPC.git

Best practices

  • Do NOT run calculations on the head node! All calculations need to be submitted to the scheduler via slurm.
  • Data files should be stored in the $STORAGE directory, not $HOME.
  • When possible, array jobs should be used when calculations can be split into independent pieces.
  • Checkpoint your jobs either internally, or externally via dmtcp.
  • Only request the memory you'll actually use (plus a small buffer to allow for error).
  • Use the $SCRATCH directory for frequent read/writes during the calculation.