Cluster matlab

From Biomedical Optics Lab

Using Matlab on the Cluster

We are exploring how to use Matlab with the new HPC cluster, and we will use this page to keep track of our current approaches.

Running Matlab Code As a Job Using Slurm

To use the cluster properly, we must submit jobs through the Slurm scheduler. To run an existing Matlab script, we simply name it in the Slurm batch file. Below is an example that runs the Matlab script parforcluster.m. The batch file could be saved as ada-submit and reused for any Matlab script by updating the filename inside the Matlab run command matlab -nodisplay -nosplash -nodesktop -r "run('./parforcluster.m');exit;". Note that until we have Matlab Parallel Server, we can only use one node at a time.

#!/usr/bin/env bash
# 
# submit batch file; based on CS 416 S20 ada-submit 

# Set SLURM options (you should not need to change these)
#SBATCH --job-name=testing                      # Job name
#SBATCH --output=./test-results/testing-%j.out  # Name for output log file (%j is job ID)
#SBATCH --nodes=1                               # Requesting 1 node and 1 task per node should
#SBATCH --ntasks-per-node=1                     # ensure exclusive access to the node
#SBATCH --cpus-per-task=36                      # Request all 36 cores on the node
#SBATCH --partition=short                       # Partition (queue) 
#SBATCH --time=00:15:00                         # Time limit hrs:min:sec

# DON'T MODIFY ANYTHING ABOVE THIS LINE

# Print SLURM environment variables
echo "# Job Info ----------------------------"
echo "Job ID: ${SLURM_JOB_ID}"
echo "Node: ${SLURMD_NODENAME}"
echo "Starting: "`date +"%D %T"`

echo -e "\n# Run Results -------------------------"
matlab -nodisplay -nosplash -nodesktop -r "run('./parforcluster.m');exit;"  # Run the Matlab script

# For reference, dump info about the processor
echo -e "\n# CPU Info ----------------------------"
lscpu
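
Assuming the batch file above is saved as ada-submit, it is submitted to the scheduler from the head node with sbatch. Slurm prints the job ID on submission, and the job's output goes to the log file named by the --output option:

```shell
# Submit the batch file to the Slurm scheduler (run from the head node)
$ sbatch ada-submit
```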

Running Matlab Interactively, Using Slurm to Assign the Node

For parallel computing, Matlab first sets up a "parpool" of workers to connect to the available CPUs. Unfortunately, when we execute code as a Slurm job, our connection to the node closes as soon as the job finishes, and the parpool session ends with it. As a result, every Slurm job pays the parpool initialization cost of roughly 15 seconds.

Instead, we would like to keep our Matlab session open and work interactively on the assigned node. To have Slurm assign a node for interactive use, we can write:

$ srun --partition=long --pty --nodes=1 --ntasks-per-node=36 -t 00:30:00 --wait=0 --export=ALL /bin/bash

This will move you from the head node to a node which can be used for computing. That is, your terminal prompt will go from [username@ada ~] to [username@node007 ~], for example. At the prompt, you can then start Matlab.

$ matlab

Note that not all nodes have Matlab installed. To leave Matlab, type exit. To leave the assigned node, type exit again.
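
Because an assigned node may lack Matlab, a quick way to check from the node's prompt is command -v, which reports whether an executable is on the PATH. This is a sketch; it assumes that, where Matlab is installed, it is on the default PATH:

```shell
#!/usr/bin/env bash
# Report whether the matlab executable is on this node's PATH
if command -v matlab >/dev/null 2>&1; then
  MATLAB_STATUS="installed"
else
  MATLAB_STATUS="missing"
fi
echo "matlab is $MATLAB_STATUS on $(hostname)"
```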

If you would like to use a graphical user interface (GUI), you must have X11 forwarding set up on your machine. On a Windows computer, you can use MobaXTerm to connect to Ada and interact with Matlab as if it were on your computer. We will update this page when we figure out how to do so on a Mac or within Visual Studio Code.

Running Matlab Interactively, Using SSH and Double Tunneling

Because not all nodes have Matlab installed, the Slurm scheduler might assign you to a node on which you cannot use Matlab. To choose a node ourselves, we can connect directly using SSH. With a terminal open on Ada, the command is of the form:

$ ssh <name-of-node> -L <port-num>:localhost:<port-num>

The port-num should be the same on both sides of localhost. Avoid small port numbers: ports below 1024 are privileged, while anything in the registered range 1024-49151 will work fine. Before settling on a port, check that it is not already in use by another application.
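
One way to check whether anything is already listening on a candidate port is bash's built-in /dev/tcp pseudo-device (a bash-only feature; port 5001 here is just an example value):

```shell
#!/usr/bin/env bash
PORT=5001
# The connection attempt succeeds only if some process is listening on the port
if (exec 3<>/dev/tcp/localhost/$PORT) 2>/dev/null; then
  PORT_STATUS="in-use"
else
  PORT_STATUS="free"
fi
echo "port $PORT is $PORT_STATUS"
```

If the port is in use, simply pick another one in the 1024-49151 range.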

For example, we could connect to node004 with the following command:

$ ssh node004 -L 5001:localhost:5001

In fact, we could access that node directly without accessing Ada first by using double tunneling:

$ ssh -t -t ada.middlebury.edu -L 5001:localhost:5001 ssh node004 -L 5001:localhost:5001

Matlab Code for Parallel Computing on the Cluster

The instructions in the previous section will allow you to run Matlab on the cluster, but Matlab will not use the full power of the cluster without proper instructions within your Matlab script.

Matlab makes parallel computing extremely simple because it handles the distribution of calculations to parallel workers for you. If you replace a standard for loop in your code with the parfor command, Matlab will automatically try to run the loop iterations on different CPUs. (Note that you must write your parfor loop so that the results coming back from different workers can be assembled in the proper order.) Unfortunately, the default number of workers that Matlab initializes is 12, even though each node has 36 available workers.

To access the full number of available workers on the node, you can obtain the properties of the cluster using parcluster, and you can initialize the parallel pool of workers with parpool using the cluster profile. An extremely simple Matlab script using a parfor loop is below:

%% first run, to initialize parpool
mycluster=parcluster('local')
nworkers=mycluster.NumWorkers
poolobj=parpool(mycluster,nworkers)

%% parpool is now initialized, run code like normal
%% inside the Matlab code, you'll probably have a parfor loop

parfor(index1=1:100,nworkers)  % by default, Matlab uses the max. number of workers,
                               % but it can be changed
    index1
end

%% after final run, close parpool to release workers
delete(poolobj)

The first few lines in the above code initialize the pool of workers. They only need to be run once at the start of the session; in subsequent runs they can be commented out. Otherwise, the parallel pool of workers will be reinitialized on every run, adding up to 15 seconds of startup time each time.

The above code will print out which iteration of the loop it is on. Observe that the output does not list the index1 values in order from 1 to 100, because the workers complete their tasks at different times. If the order matters, then each result needs to be saved to the appropriate row of an output array.

At the end of the session, the last line of the above code should be used to release the pool of workers. Keep it commented out until the last run has completed; otherwise, the parallel pool of workers will have to be reinitialized on the next run.
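
When the order does matter, the standard pattern is a "sliced" output variable: preallocate an array and have each iteration write only to its own index, so Matlab reassembles the results in index order no matter when each worker finishes. A minimal sketch (variable names are illustrative):

```matlab
nloops = 100;
results = zeros(1, nloops);        % preallocate the output array
parfor index1 = 1:nloops
    results(index1) = index1^2;    % each iteration writes only its own slice
end
% results is now ordered by index1, regardless of worker completion order
```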

Using More than One Node

It appears that Matlab Parallel Server is required in order for Matlab to be able to use the workers on more than one node. We are currently investigating acquiring this license to take full advantage of the power of the Ada cluster.

Things to Add to the Wiki

  • Include instructions on connecting to the cluster, using Anthony's "Navigating ADA" Google doc.
    • Include instructions for MobaXTerm, Visual Studio Code, and regular ssh from a terminal.
  • Assigning Slurm parameters as variables accessible within Matlab for setting up the cluster/parpool.

From https://docs.rc.fas.harvard.edu/kb/parallel-matlab-pct-dcs/

  • Add how to "Manage Cluster Profiles" from the "Parallel" Menu within the Matlab GUI. Presumably this is how we can update the 'local' profile.
  • Include instructions for working with files on the cluster. For example, running a Matlab script which references functions in other files. Also, we need details about the current Slurm script, examples of what it outputs, and where it stores its outputs. It would also be nice to have more details on how to personalize the Slurm script.
  • Incorporate the "best practices" list from the Middlebury HPC Wiki:

https://mediawiki.middlebury.edu/LIS/High_Performance_Computing_(HPC)/Training

The list is copied here:

    • Do NOT run calculations on the head node! All calculations need to be submitted to the scheduler via slurm.
    • Data files should be stored in the $STORAGE directory, not $HOME.
    • When possible, array jobs should be used when calculations can be split into independent pieces.
    • Checkpoint your jobs either internally, or externally via dmtcp.
    • Only request the memory you'll actually use (with a buffer for room for error).
    • Use the $SCRATCH directory for frequent read/writes during the calculation.
  • Confirm that the cluster does not accept multi-node jobs, as mentioned in the Middlebury HPC Wiki.
  • Interfacing with Git.

From the Middlebury HPC Wiki: "You can clone a copy of this repository to your home directory (or elsewhere) via the command:"

git clone https://github.com/middlebury/HPC.git