Latest revision as of 14:46, 25 June 2020
Using Matlab on the Cluster
We are exploring how to use Matlab with the new HPC cluster, and we will use this page to keep track of our current approaches.
Running Matlab Code As a Job Using Slurm
To properly use the cluster, we must submit jobs using the Slurm scheduler. To run an existing Matlab script, we just include its name in the Slurm instructions. Here is an example for running the Matlab code parforcluster.m. The script below could be named ada-submit and reused for any Matlab script by updating the filename inside the Matlab run command matlab -nodisplay -nosplash -nodesktop -r "run('./parforcluster.m');exit;". Note that until we have Matlab Parallel Server, we can only use one node at a time.
#!/usr/bin/env bash
#
# submit batch file; based on CS 416 S20 ada-submit

# Set SLURM options (you should not need to change these)
#SBATCH --job-name=testing                      # Job name
#SBATCH --output=./test-results/testing-%j.out  # Name for output log file (%j is job ID)
#SBATCH --nodes=1                               # Requesting 1 node and 1 task per node should
#SBATCH --ntasks-per-node=1                     # ensure exclusive access to the node
#SBATCH --cpus-per-task=36                      # Request all 36 cores on the node
#SBATCH --partition=short                       # Partition (queue)
#SBATCH --time=00:15:00                         # Time limit hrs:min:sec

# DON'T MODIFY ANYTHING ABOVE THIS LINE

# Print SLURM environment variables
echo "# Job Info ----------------------------"
echo "Job ID: ${SLURM_JOB_ID}"
echo "Node: ${SLURMD_NODENAME}"
echo "Starting: "`date +"%D %T"`

echo -e "\n# Run Results -------------------------"
matlab -nodisplay -nosplash -nodesktop -r "run('./parforcluster.m');exit;"  # Run the Matlab script

# For reference, dump info about the processor
echo -e "\n# CPU Info ----------------------------"
lscpu
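Assuming the script above is saved as ada-submit in a directory that also contains parforcluster.m and a test-results/ subdirectory (Slurm writes the log there but will not create the directory for you), submission looks roughly like the sketch below. The parse_job_id helper is our own addition; it just pulls the job ID out of sbatch's standard "Submitted batch job N" confirmation message.

```shell
# Extract the numeric job ID from sbatch's confirmation message,
# which has the form "Submitted batch job <jobid>".
parse_job_id() {
    awk '{print $4}'   # the job ID is the fourth word
}

# Typical usage (requires a machine with Slurm installed):
#   jobid=$(sbatch ada-submit | parse_job_id)
#   squeue --job "$jobid"                       # check queue status
#   cat "./test-results/testing-${jobid}.out"   # view the log once finished
```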
Running Matlab Interactively, Using Slurm to Assign the Node
For parallel computing, Matlab first sets up a "parpool" of workers to connect to the available CPUs. Unfortunately, if we execute the code through Slurm as a batch job, our connection to the node closes when the job finishes, and the parpool session ends with it. That is, there will be an initialization time of ~15 seconds each time we send a Slurm job.
Instead, we would like to keep our Matlab session open and work interactively on the assigned node. To have Slurm assign a node for interactive use, we can write:
$ srun --partition=long --pty --nodes=1 --ntasks-per-node=36 -t 00:30:00 --wait=0 --export=ALL /bin/bash
This will move you from the head node to a node that can be used for computing; your terminal prompt will change from [username@ada ~] to, for example, [username@node007 ~]. At the prompt, you can then start Matlab.
$ matlab
Note that not all nodes have Matlab installed. To leave Matlab, type exit; to leave the assigned node, type exit again.
If you would like to use a graphical user interface (GUI), you must have X11 forwarding set up on your machine. On a Windows computer, you can use MobaXTerm to connect to Ada and interact with Matlab as if it were on your computer. We will update this page when we figure out how to do so on a Mac or within Visual Studio Code.
Running Matlab Interactively, Using SSH and Double Tunneling
Because not all nodes have Matlab installed, the Slurm scheduler might assign you to a node on which you cannot use Matlab. To choose a node ourselves, we can connect directly using SSH. With a terminal open on Ada, the command is of the form:
$ ssh <name-of-node> -L <port-num>:localhost:<port-num>
The port-num should be the same on both sides of localhost. Avoid small port numbers: anything in the registered range 1024-49151 will work, but first check that the port is not already in use by another application.
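As a quick sanity check before picking a port, the short shell helper below (our own sketch, not part of the cluster setup) verifies that a candidate number falls in the registered range 1024-49151; on a Linux node you could then confirm the port is free with a tool such as ss.

```shell
# Return success (0) if the argument is a port number in the
# registered range 1024-49151, failure (1) otherwise.
port_in_registered_range() {
    port="$1"
    case "$port" in
        (''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$port" -ge 1024 ] && [ "$port" -le 49151 ]
}

# Typical usage:
#   port_in_registered_range 5001 && echo "5001 looks fine"
#   ss -ltn | grep ':5001 '   # check whether something is already listening on 5001
```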
For example, we could connect to node004 with the following command:
$ ssh node004 -L 5001:localhost:5001
In fact, we could access that node directly without accessing Ada first by using double tunneling:
$ ssh -t -t ada.middlebury.edu -L 5001:localhost:5001 ssh node004 -L 5001:localhost:5001
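As an aside, recent versions of OpenSSH (7.3 and later) can express the same double tunnel more readably with a jump-host configuration. The fragment below is our own suggestion, not something set up on the cluster; with it in ~/.ssh/config, the single command ssh node004-tunnel replaces the two chained ssh invocations above.

```
# ~/.ssh/config fragment (our suggestion; requires OpenSSH 7.3+)
Host node004-tunnel
    HostName node004
    ProxyJump ada.middlebury.edu
    LocalForward 5001 localhost:5001
```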
Matlab Code for Parallel Computing on the Cluster
The instructions in the previous section will allow you to run Matlab on the cluster, but Matlab will not use the full power of the cluster without proper instructions within your Matlab script.
Matlab makes parallel computing extremely simple because it handles the distribution of calculations to parallel workers for you. This means that if you replace a standard for loop in your code with the parfor command, Matlab will automatically try to run the loop iterations on different CPUs. (Note that you must write your parfor loop accordingly, so that the results coming back from different workers can be assembled in the proper order.) Unfortunately, the default number of workers that Matlab will initialize is 12, even though each node has 36 available workers.
To access the full number of available workers on the node, you can obtain the properties of the cluster using parcluster, and you can initialize the parallel pool of workers with parpool using the cluster profile. An extremely simple Matlab script using a parfor loop is below:
%% first run, to initialize parpool
mycluster = parcluster('local')
nworkers = mycluster.NumWorkers
poolobj = parpool(mycluster, nworkers)

%% parpool is now initialized, run code like normal
%% inside the Matlab code, you'll probably have a parfor loop

parfor (index1 = 1:100, nworkers)  % by default, Matlab uses the max. number of workers,
                                   % but it can be changed
    index1
end

%% after final run, close parpool to release workers
delete(poolobj)
The first few lines in the above code initialize the pool of workers. This only needs to be run at the start of the session. Subsequent runs can have this commented out. Otherwise, the parallel pool of workers will initialize each run, requiring up to 15 seconds of extra initialization time.
The above code will print out which iteration of the loop it is on. Observe that the output does not list the index1 values in order from 1 to 100, because the workers complete their tasks at different times. If the order matters, then each result needs to be saved to the appropriate element of an array.
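As a sketch of that pattern (our own illustration, not taken from the cluster documentation), indexing a preallocated array by the loop variable lets Matlab reassemble the results in order regardless of which worker finishes first:

```
% Preallocate, then let each iteration write to its own slot;
% "sliced" output variables like this are safe inside parfor.
results = zeros(1, 100);
parfor index1 = 1:100
    results(index1) = index1^2;   % example computation: square each index
end
% results now holds 1, 4, 9, ..., 10000 in order
```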
At the end of the session, the last line of the above code should be used to release the pool of workers. This should be commented out until the last run has completed. Otherwise, the parallel pool of workers will initialize on the next run.
Using More than One Node
It appears that Matlab Parallel Server is required in order for Matlab to be able to use the workers on more than one node. We are currently investigating acquiring this license to take full advantage of the power of the Ada cluster.
Things to Add to the Wiki
- Include instructions on connecting to the cluster, using Anthony's "Navigating ADA" Google doc.
- Include instructions for MobaXTerm, Visual Studio Code, and regular ssh from a terminal.
- Assigning Slurm parameters as variables accessible within Matlab for setting up the cluster/parpool.
From https://docs.rc.fas.harvard.edu/kb/parallel-matlab-pct-dcs/
- Add how to "Manage Cluster Profiles" from the "Parallel" Menu within the Matlab GUI. Presumably this is how we can update the 'local' profile.
- Include instructions for working with files on the cluster. For example, running a Matlab script which references functions in other files. Also, we need details about the current Slurm script, examples of what it outputs, and where it stores its outputs. It would also be nice to have more details on how to personalize the Slurm script.
- Incorporate the "best practices" list from the Middlebury HPC Wiki:
https://mediawiki.middlebury.edu/LIS/High_Performance_Computing_(HPC)/Training
The list is copied here:
- Do NOT run calculations on the head node! All calculations need to be submitted to the scheduler via slurm.
- Data files should be stored in the $STORAGE directory, not $HOME.
- When possible, array jobs should be used when calculations can be split into independent pieces.
- Checkpoint your jobs either internally, or externally via dmtcp.
- Only request the memory you'll actually use (with a buffer for room for error).
- Use the $SCRATCH directory for frequent read/writes during the calculation.
- Confirm that the cluster does not accept multi-node jobs, as mentioned in the Middlebury HPC Wiki.
- Interfacing with Git.
From the Middlebury HPC Wiki: "You can clone a copy of this repository to your home directory (or elsewhere) via the command:"
git clone https://github.com/middlebury/HPC.git