Middlebury

High Performance Computing (HPC)

Revision as of 08:58, 2 August 2021 by David Guertin

 

Overview

High Performance Computing (HPC) is the aggregation of computing power and memory to perform complex calculations in parallel, increasing the speed and efficiency of computer simulations and data analysis. In 2018, a collaboration of faculty in the social and natural sciences and ITS staff secured a $150,000 grant from the National Science Foundation to build Middlebury's first HPC cluster. Dubbed "Ada" in honor of Ada Lovelace, the famed 19th-century mathematician, the cluster is intended to support the research of faculty who rely on access to expanded computing resources. We continue to add to our collaboration as resources become available.

This wiki describes the cluster structure and how to use it. The cluster is a shared resource, so we use queuing software (called Slurm) to manage job processing and to ensure fair access. Below are basic instructions for logging in to the cluster, accessing the queue and writing scripts to work efficiently and within best practices for a shared computing resource.
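As a rough illustration of what a job script for the Slurm queue looks like, the sketch below assembles one in Python. The function name `build_sbatch_script`, the default resource values, and the example command are hypothetical and not taken from Ada's configuration; only the `#SBATCH` directives themselves are standard Slurm options. Appropriate limits and any partition names for Ada should come from the cluster administrators.

```python
def build_sbatch_script(job_name, command, ntasks=1, mem_gb=4, time_limit="01:00:00"):
    """Return the text of a Slurm batch script as a string.

    All default values here are illustrative assumptions, not Ada's settings.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",   # name shown in the queue
        f"#SBATCH --ntasks={ntasks}",       # number of tasks requested
        f"#SBATCH --mem={mem_gb}G",         # memory per node
        f"#SBATCH --time={time_limit}",     # wall-clock limit (HH:MM:SS)
        "#SBATCH --output=%x-%j.out",       # log file: <job name>-<job id>.out
        "",
        command,                            # the actual work to run
    ]
    return "\n".join(lines) + "\n"


# Hypothetical example: a short single-task job running an analysis script.
script = build_sbatch_script("demo", "python analysis.py")
print(script)
```

Saving the printed text to a file such as `demo.sh` and running `sbatch demo.sh` on the login node would place the job in the queue. Requesting only the tasks, memory, and time a job actually needs helps Slurm schedule work fairly on a shared cluster.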

Cluster users must include an acknowledgement of NSF funding in any published research, as quoted below:

"This material is based upon work supported by the National Science Foundation under Grant No. 1827373."

Please email the principal investigator, Professor Amy Yuen, with publication information for grant reporting purposes.

 

Access

A managing group of faculty and staff has developed (policies) for various types of users. All users must agree to these policies and submit this (form) before obtaining access. The working group periodically offers training sessions for students and faculty interested in learning how to access the cluster and work with the queuing software. Users may indicate interest in these training sessions using this (form).

 

Hardware

The HPC cluster consists of 17 computer nodes with a total of 556 processors. It includes 14 nodes with 96 GB of RAM each and one additional node with 768 GB of RAM. In addition, the HPC cluster has a dedicated graphics processing unit (GPU) with 96 GB of RAM, along with a storage node with 60 TB of hard drive storage.

 


Software

 

Guidelines

Expectations and Support for Users

All HPC users are expected to accept the standard Middlebury Code of Conduct relating to information and technology as well as a general set of best practices specific to the cluster. These will be posted on the HPC wiki page. Additionally, faculty who have little or no experience using a shared computing cluster are strongly urged to participate in the periodic training sessions offered by ITS staff and HPC-affiliated faculty.

 

Cluster Use Principles

The use of the Ada cluster is governed by all the policies that apply to Middlebury’s Information Technology (http://www.middlebury.edu/about/handbook/policies-for-all/appropriate-use/info-tech) and the following principles:

  1. The Ada cluster supports the research and educational missions of Middlebury College. Users agree to only run computational jobs related to those missions. For example, cryptocurrency mining for financial gain or commercial use of the cluster is not appropriate.
  2. The Ada cluster is a shared resource. Running computations that consume large portions of the cluster for extended periods (including consuming large portions of the available disk space) could prevent others from using this community resource. Exercise care in how you use the Ada cluster to be respectful of other community members’ interest in using the system.
  3. You are entirely responsible for any data you place on the cluster. You agree that your data management practices are in accordance with Middlebury’s policies and any applicable regulations or agreements, e.g. HIPAA, data use agreements, etc.
  4. The Ada cluster is intended for data analysis, not data storage. Data is not backed up. Data that is no longer needed should be promptly deleted to ensure there is sufficient disk space for everyone.
  5. You agree to respect the privacy of other users, e.g. by not exploring directories owned by other users even if those directories are accessible to you.
  6. You are expected to report any security incidents or abuse to ITS immediately. Examples of security incidents include but are not limited to: unauthorized access or use, compromised accounts (including “shared” login credentials), and misuse of data.

Users whose behavior runs counter to these principles may be asked by cluster administrators to leave the cluster.
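Principles 2 and 4 above ask users to keep disk usage in check. One way to see how much space a directory tree occupies is sketched below in Python. The function name `directory_size_bytes` is hypothetical, and the sketch assumes nothing about Ada's file system layout or quotas. It walks every file under the given directory, so it is best suited to modest trees rather than very large ones.

```python
import os


def directory_size_bytes(root):
    """Sum the sizes of regular files under root.

    Symbolic links are skipped, and files that cannot be read are ignored.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                # File vanished or is unreadable; leave it out of the total.
                pass
    return total
```

For example, `directory_size_bytes(os.path.expanduser("~")) / 1e9` gives an approximate size of your home directory in gigabytes, which can help identify data that is no longer needed and can be deleted.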

 

Training

 

Questions
