Middlebury

Difference between revisions of "High Performance Computing (HPC)/Training"

Revision as of 10:38, 4 September 2019

Overview of the Ada Cluster

How is a cluster different from my laptop/desktop?

Architecture

Logging in

ssh username@ada
  • "username" is your Middlebury username. If your username on the computer you're logging in from is also your Midd username (e.g. if you're using a college-owned computer), then you can simply use the command "ssh ada".
  • You will be prompted for your Middlebury password. After you enter your password, you will have a Linux command prompt on the head node "ada".
  • You are now in your home directory on ada. From here you can access the filesystem in your home directory, using standard linux commands. For example, we can make a directory:
mkdir test_job
  • While it's not necessary, for convenience you can consider setting up public key authentication from your laptop or desktop; this will allow you to login securely without entering your password.

Submitting jobs via the Slurm scheduler

Basic slurm script

  • We have the basic slurm script shown below in the text file "slurm_serial.sh":
#!/usr/bin/env bash
# slurm template for serial jobs

# Set SLURM options
#SBATCH --job-name=serial_test                  # Job name
#SBATCH --output=serial_test-%j.out             # Standard output and error log
#SBATCH --mail-user=username@middlebury.edu     # Where to send mail	
#SBATCH --mail-type=NONE                        # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mem=100mb                             # Job memory request
#SBATCH --partition=standard                    # Partition (queue) 
#SBATCH --time=00:05:00                         # Time limit hrs:min:sec

# print SLURM environment variables
echo "Job ID: ${SLURM_JOB_ID}"
echo "Node: ${SLURMD_NODENAME}"
echo "Starting: "`date +"%D %T"`

# Your calculations here
printf "\nHello world from ${SLURMD_NODENAME}!\n\n"

# End of job info
echo "Ending:   "`date +"%D %T"`

Submitting jobs

  • Jobs are submitted to the slurm scheduler via the "sbatch" command:
sbatch slurm_serial.sh

Monitoring jobs

  • You can monitor the status of jobs in the queue via the "squeue" command:
squeue
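
To follow one specific job rather than the whole queue, it can help to capture the job ID that "sbatch" prints on submission ("Submitted batch job <id>") and pass it to "squeue -j". The sketch below only shows the parsing step; "extract_job_id" is a hypothetical helper name, and the sample line stands in for real sbatch output so that the snippet also runs on a machine without Slurm.

```shell
#!/usr/bin/env bash
# Sketch: pull the job ID out of the line sbatch prints after a submission.
# extract_job_id is a hypothetical helper, not part of Slurm.
extract_job_id() {
    # sbatch normally prints: "Submitted batch job <id>"
    echo "$1" | awk '/^Submitted batch job/ {print $4}'
}

# Demonstrate the parsing on a sample line (no scheduler needed)
job_id=$(extract_job_id "Submitted batch job 12345")
echo "Parsed job ID: ${job_id}"

# On ada, the same helper could be used as:
#   job_id=$(extract_job_id "$(sbatch slurm_serial.sh)")
#   squeue -j "${job_id}"
```

Alternatively, "sbatch --parsable" prints the job ID without the surrounding text, which avoids the parsing step.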

Parallel Jobs

Array jobs

If a serial job can easily be broken into several (or many) independent pieces, then it's most efficient to submit an array job, which is a set of closely related serial jobs that will all run independently.

  • To submit an array job, use the slurm option "--array". For example "--array=0-4" will run 5 independent tasks, labeled 0-4 by the environment variable SLURM_ARRAY_TASK_ID.
  • To allow each array task to perform a different calculation, you can use SLURM_ARRAY_TASK_ID as an input parameter to your calculation.
  • Each array task will appear as an independent job in the queue and run independently.
  • An entire array job can be canceled at once or each task can be canceled individually.

Here is a simple example of a slurm array job script:

#!/usr/bin/env bash
# slurm template for array jobs

# Set SLURM options
#SBATCH --job-name=array_test                   # Job name
#SBATCH --output=array_test-%A-%a.out           # Standard output and error log
#SBATCH --mail-user=username@middlebury.edu     # Where to send mail    
#SBATCH --mail-type=NONE                        # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mem=100mb                             # Job memory request
#SBATCH --partition=standard                    # Partition (queue) 
#SBATCH --time=00:05:00                         # Time limit hrs:min:sec
#SBATCH --array=0-4                             # Array range

# print SLURM environment variables
echo "Job ID: ${SLURM_JOB_ID}"
echo "Array ID: ${SLURM_ARRAY_TASK_ID}"
echo "Node: ${SLURMD_NODENAME}"
echo "Starting: "`date +"%D %T"`

# Your calculations here
printf "\nHello world from array task ${SLURM_ARRAY_TASK_ID}!\n\n"

# End of job info
echo "Ending:   "`date +"%D %T"`

An example of how a serial job can be broken into an array job is on the HPC Github repository (see below).
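
One common way to use SLURM_ARRAY_TASK_ID as an input parameter is to let each task read a different line of a parameter file. The following is a minimal sketch of that idea, not the approach taken in the repository example; "pick_param" and the temporary parameter file are hypothetical names made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: map an array task ID to one line of a parameter file.
# pick_param and the demo parameter file are hypothetical.
pick_param() {
    local file="$1" task_id="$2"
    # Task IDs start at 0 with --array=0-4, while sed counts lines from 1
    sed -n "$((task_id + 1))p" "$file"
}

# Build a small demo parameter file with one value per line
param_file=$(mktemp)
printf '%s\n' 0.1 0.2 0.5 1.0 2.0 > "$param_file"

# Outside of Slurm the variable is unset, so fall back to task 0
task_id="${SLURM_ARRAY_TASK_ID:-0}"
value=$(pick_param "$param_file" "$task_id")
echo "Task ${task_id} uses parameter ${value}"
```

Inside an array job script, the value selected this way would then be passed to the actual calculation in place of the final echo.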

Shared memory or multi-threaded jobs

Multi-node (MPI) jobs

GPU jobs

Large Memory jobs

Storage

  • Each user has a home directory located at /home/$USER, where $USER is your Middlebury username; it is also accessible via the $HOME environment variable. Each user has a quota of 50 GB in their home directory.
  • Additionally, each user has a storage directory located at /storage/$USER, which is also accessible via the $STORAGE environment variable. The quota on each user's storage directory is 400 GB.
  • The home directory has a fairly small quota as it is only intended for storage of scripts, code, executables, and small parameter files, NOT for data storage.
  • Data files should be stored in the storage directory.
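
To get a rough idea of how close a directory is to its quota, its disk usage can be compared with the quota figures above (50 GB for home, 400 GB for storage). The sketch below uses only "du"; "usage_report" is a hypothetical helper, and the number "du" reports may differ somewhat from how the filesystem itself accounts for quota, so treat the result as an estimate.

```shell
#!/usr/bin/env bash
# Sketch: compare a directory's disk usage against a quota given in GB.
# usage_report is a hypothetical helper, not a tool provided on ada.
usage_report() {
    local dir="$1" quota_gb="$2"
    local used_kb quota_kb pct
    # du -sk prints the total size in kilobytes followed by the path
    used_kb=$(du -sk "$dir" 2>/dev/null | awk '{print $1}')
    quota_kb=$((quota_gb * 1024 * 1024))
    # Integer arithmetic, so the percentage is rounded down
    pct=$((100 * used_kb / quota_kb))
    echo "${dir}: ${used_kb} KB used, ${pct}% of ${quota_gb} GB quota"
}

# Demo on a small temporary directory; on ada you might instead run
#   usage_report "$HOME" 50
#   usage_report "$STORAGE" 400
demo_dir=$(mktemp -d)
usage_report "$demo_dir" 50
rm -rf "$demo_dir"
```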

Local scratch storage

Checkpointing

Checkpointing your jobs running on ada is recommended. Checkpointing periodically stores the internal state of your calculation so the job can be restarted from that state, e.g. if the node goes down or the wall clock limit is reached. Ideally, checkpointing is done internally in your application (it is built into many open source and commercial packages); if your application doesn't support checkpointing internally, you can use an external checkpointing tool such as dmtcp. Here we'll illustrate external checkpointing via dmtcp using the example found in the directory "ckpt-example" on the GitHub repository.

  • We'll illustrate checkpointing using a simple counter. First compile the executable "count" from the source code "counter.c" via:
gcc counter.c -o count
  • Now submit the slurm script "slurm_ckpt_start.sh"
sbatch slurm_ckpt_start.sh
  • Once that job has completed, you should see a checkpointing file of the form "ckpt_count_*.dmtcp". Your job can be restarted using the "dmtcp_restart" command, as found in "slurm_ckpt_restart.sh":
sbatch slurm_ckpt_restart.sh

(Note: you will get a warning message for this sample job on the initial restart--this will not cause a problem)

  • You can restart and continue the job any number of times via the restart script. E.g. try submitting the restart script a 2nd time.
sbatch slurm_ckpt_restart.sh
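
Since the restart only makes sense once a checkpoint file exists, a small guard before resubmitting can save a wasted job. The sketch below looks for a file matching "ckpt_count_*.dmtcp" and reports what it finds; "find_checkpoint" is a hypothetical helper and is not part of dmtcp or of the repository scripts. The sbatch line is left commented out so the snippet can be tried anywhere.

```shell
#!/usr/bin/env bash
# Sketch: only submit the restart script when a checkpoint file exists.
# find_checkpoint is a hypothetical helper, not part of dmtcp or the repository.
find_checkpoint() {
    local dir="${1:-.}" f
    for f in "$dir"/ckpt_count_*.dmtcp; do
        # If nothing matches, the loop sees the unexpanded pattern; -e filters it out
        if [ -e "$f" ]; then
            echo "$f"
            return 0
        fi
    done
    return 1
}

if ckpt=$(find_checkpoint .); then
    echo "Found checkpoint: ${ckpt}"
    # sbatch slurm_ckpt_restart.sh   # uncomment on ada to submit the restart
else
    echo "No checkpoint file found; run slurm_ckpt_start.sh first"
fi
```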

Sample jobs

Breaking a serial job into an array job

An example of using array jobs is in the directory "array_job_example" on the HPC Github repository.

  • The python script factor_list.py will find the prime factors of a list of integers, e.g. the 12-digit numbers in the file "sample_list_12.dat":
python factor_list.py sample_list_12.dat
  • To factor all 20 16-digit numbers in "sample_list_12.dat" as a single serial job (which will take several minutes), submit the slurm script "serial_factor.sh":
sbatch serial_factor.sh
  • The factors will be stored in "serial_factors_out.dat"
  • The slurm script "array_factor.sh" breaks the calculation up into a 10 task array job:
sbatch array_factor.sh
  • Each array task stores the results in the file "array_factors_out-${SLURM_ARRAY_TASK_ID}.dat" where the array task ID runs from 0-9.
  • After all the array tasks are complete, the data can be combined into a single file, e.g. array_factors_out.dat:
cat array_factors_out-?.dat > array_factors_out.dat
  • You can check that both methods give you the same result via diff:
diff serial_factors_out.dat array_factors_out.dat
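
The "?" in the cat command above matches exactly one character, so it covers task IDs 0-9 but would miss two-digit IDs in a larger array. For that case, a loop over the task range keeps the files in numeric order. The following is a sketch under that assumption; "combine_outputs" is a hypothetical helper, and the demo files are created in a temporary directory rather than taken from the repository.

```shell
#!/usr/bin/env bash
# Sketch: concatenate per-task output files in numeric task order.
# combine_outputs is a hypothetical helper; prefix and task range are arguments.
combine_outputs() {
    local prefix="$1" first="$2" last="$3" i
    i="$first"
    while [ "$i" -le "$last" ]; do
        if [ -f "${prefix}-${i}.dat" ]; then
            cat "${prefix}-${i}.dat"
        else
            # Report missing tasks on stderr so the combined file stays clean
            echo "warning: missing ${prefix}-${i}.dat" >&2
        fi
        i=$((i + 1))
    done
}

# Demo with 12 fake task outputs (IDs 0-11) in a temporary directory
demo=$(mktemp -d)
n=0
while [ "$n" -le 11 ]; do
    echo "result from task ${n}" > "${demo}/array_factors_out-${n}.dat"
    n=$((n + 1))
done
combine_outputs "${demo}/array_factors_out" 0 11 > "${demo}/array_factors_out.dat"
tail -n 2 "${demo}/array_factors_out.dat"
```

For the 10-task job described above, the equivalent call would be "combine_outputs array_factors_out 0 9 > array_factors_out.dat".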

Serial Stata job

Parallel Stata job

Git repository

Best practices
