AWS ParallelCluster is an open-source cluster management tool that simplifies deploying and administering HPC clusters on AWS. By integrating it with WEKA, organizations can create a high-performance data platform that can reduce epoch time from months to days without requiring additional infrastructure investment.
With the infrastructure performance and efficiency gains made possible by WEKA for AWS ParallelCluster, organizations can accelerate their pace of innovation, maximize utilization of their GPU-accelerated infrastructure, and control costs.
The integration of WEKA with AWS ParallelCluster using Slurm comprises two primary components: a compute cluster and a standalone WEKA cluster. Refer to the numbered elements in the accompanying illustration for details:
1. WEKA cluster deployment: The WEKA cluster backends are deployed within the same Virtual Private Cloud (VPC) and subnet as the AWS ParallelCluster cluster. These backends use the i3en instance family, which is optimized for high-performance storage and compute workloads.
2. WEKA client integration: The WEKA client software is installed across the AWS ParallelCluster components, including the controller node, login nodes, and worker nodes. This software provides access to the WEKA cluster by presenting a mount point within the file system, enabling efficient data sharing and processing.
3. Data management and tiering: WEKA uses an Amazon S3 bucket for data tiering, so that data is automatically placed in the appropriate storage tier based on access patterns and cost-efficiency considerations. WEKA also uses S3 to store snapshots, which adds a layer of data resilience and supports disaster recovery.
Step 1: AWS ParallelCluster CLI installation
If you already have the AWS ParallelCluster CLI installed, you can skip to the next step. Otherwise, follow the procedure Installing the AWS ParallelCluster command line interface (CLI) in the AWS documentation.
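A quick way to decide whether this step can be skipped is to look for the pcluster executable on your PATH. The sketch below assumes AWS ParallelCluster version 3, where pcluster version prints the installed CLI version. The PCLUSTER_STATUS variable is only an illustrative name.

```shell
# Check whether the AWS ParallelCluster CLI is already available.
if command -v pcluster >/dev/null 2>&1; then
    PCLUSTER_STATUS="installed"
    pcluster version    # prints the installed CLI version
else
    PCLUSTER_STATUS="missing"
    echo "pcluster not found: follow the AWS installation procedure before continuing."
fi
echo "ParallelCluster CLI: ${PCLUSTER_STATUS}"
```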
Step 2: Create S3 bucket to store AWS ParallelCluster integration scripts
Create an S3 bucket in the same region where AWS ParallelCluster will be deployed.
Verify bucket creation
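The two actions above might look like the following with the AWS CLI. This is a sketch, not the exact commands from the original guide: the bucket name and region are hypothetical, and the run helper is an added safety switch that only prints each command unless DRY_RUN=0 is set.

```shell
# Hypothetical values: replace with your own bucket name and region.
BUCKET_NAME="my-pcluster-weka-scripts"
AWS_REGION="us-east-1"

# Safety switch for this sketch: commands are only printed unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Create the bucket in the region where the cluster will be deployed.
run aws s3 mb "s3://${BUCKET_NAME}" --region "${AWS_REGION}"

# Verify creation: head-bucket exits non-zero if the bucket is missing
# or not accessible with the current credentials.
run aws s3api head-bucket --bucket "${BUCKET_NAME}"
```

Run it once as is to review the printed commands, then again with DRY_RUN=0 to execute them.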
Step 3: Clone WEKA Cloud-Solutions repository and copy integration scripts to S3
Clone this repository and cd to the aws/parallelcluster/ directory:
Upload integration scripts to S3
Create IAM policy for WEKA clients using provided template
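Step 3 as a whole could be sketched as follows. Only the aws/parallelcluster/ directory comes from this guide; the repository URL, the scripts directory, the policy name, and the policy file name are hypothetical placeholders to replace with the real values from the repository. As a precaution, the run helper only prints each command unless DRY_RUN=0 is set.

```shell
# Hypothetical placeholders: none of these values are taken from the repository.
REPO_URL="https://example.com/cloud-solutions.git"
BUCKET_NAME="my-pcluster-weka-scripts"
SCRIPTS_DIR="scripts"
POLICY_NAME="weka-client-policy"
POLICY_FILE="weka-client-policy.json"

# Safety switch for this sketch: commands are only printed unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Clone the repository and change to the ParallelCluster integration directory.
run git clone "${REPO_URL}" cloud-solutions
run cd cloud-solutions/aws/parallelcluster/

# Upload the integration scripts to the bucket created in Step 2.
run aws s3 cp "${SCRIPTS_DIR}" "s3://${BUCKET_NAME}/${SCRIPTS_DIR}/" --recursive

# Create the IAM policy for WEKA clients and print only its ARN,
# which is needed when editing the cluster template.
run aws iam create-policy \
    --policy-name "${POLICY_NAME}" \
    --policy-document "file://${POLICY_FILE}" \
    --query Policy.Arn --output text
```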
Step 4: Modify AWS ParallelCluster template
This repository includes an example cluster template file. Follow the steps below to modify the template for your AWS environment. If you already have a cluster template, merge it with the provided example.
Create a copy of the example template file.
Update Region.
Update Networking settings.
Update SSH KeyName.
Update S3 bucket name.
Update ALB DNS Name.
Update WEKA filesystem name (optional).
Update WEKA mount point (optional). (Escape each forward slash in the mount path with a backslash.)
Update IAM Policy ARN. (Escape the forward slash in the AWS ARN with a backslash.)
Update SlurmQueues.
While the example template shows two queues to demonstrate a common customer setup, you can configure as few as one queue. If you do implement multiple queues, update each queue's configuration.
Review and update the following parameters as necessary:
Name
ComputeResources > Name
InstanceType
MinCount
MaxCount
CustomSlurmSettings > RealMemory
This value is specific to the instance type. Refer to the table below for the correct value.
CustomSlurmSettings > CpuSpecList
This value is specific to the instance type. Refer to the table below for the correct value.
OnNodeConfigured > Sequence > Script > Args > --cores
This value is specific to the instance type. Refer to the table below for the correct value.
If the --cores argument is defined, the WEKA mount is created using a DPDK mount for optimal performance. If the argument is not defined, the WEKA mount defaults to UDP mode. DPDK is preferred for all instances to achieve higher storage performance. However, certain instances, such as HeadNodes, can use a UDP mount if necessary.
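The choice between DPDK and UDP described above comes down to whether --cores is present in the script arguments. The function below is a hypothetical sketch of that decision, not the actual integration script; it assumes the argument is passed in the form --cores=N, which may differ from the real script.

```shell
# Hypothetical sketch of the mount-mode decision; not the actual script.
select_mount_mode() {
    cores=""
    for arg in "$@"; do
        case "$arg" in
            --cores=*) cores="${arg#--cores=}" ;;   # keep the value after "="
        esac
    done
    if [ -n "$cores" ]; then
        echo "dpdk cores=${cores}"    # --cores defined: DPDK mount
    else
        echo "udp"                    # --cores not defined: UDP mode
    fi
}

select_mount_mode --cores=2    # prints "dpdk cores=2"
select_mount_mode              # prints "udp"
```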
Instance type     CpuSpecList   RealMemory
hpc7a.96xlarge    95,191        742110
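The single-value updates in Step 4 (region, SSH key name, mount point, IAM policy ARN) amount to text substitutions in your copy of the template. The sketch below shows one way to do this with sed and why forward slashes need escaping. The __TOKEN__ placeholders, key names, and values are hypothetical; check the real example template for its actual markers.

```shell
# Stand-in for the example template. The __TOKEN__ placeholders and key
# names are hypothetical, not taken from the real template file.
EXAMPLE_TEMPLATE="$(mktemp)"
MY_TEMPLATE="$(mktemp)"
cat > "${EXAMPLE_TEMPLATE}" <<'EOF'
Region: __REGION__
KeyName: __KEY_NAME__
MountPoint: __MOUNT_POINT__
PolicyArn: __POLICY_ARN__
EOF

# Hypothetical values for illustration.
REGION="us-east-1"
KEY_NAME="my-ssh-key"
MOUNT_POINT="/mnt/weka"
POLICY_ARN="arn:aws:iam::123456789012:policy/weka-client-policy"

# Prefix every forward slash with a backslash so that sed does not read it
# as the end of the s/pattern/replacement/ expression.
escape_slashes() {
    printf '%s' "$1" | sed 's/\//\\\//g'
}

# Write the substituted copy to a new file, leaving the example untouched.
sed -e "s/__REGION__/${REGION}/" \
    -e "s/__KEY_NAME__/${KEY_NAME}/" \
    -e "s/__MOUNT_POINT__/$(escape_slashes "${MOUNT_POINT}")/" \
    -e "s/__POLICY_ARN__/$(escape_slashes "${POLICY_ARN}")/" \
    "${EXAMPLE_TEMPLATE}" > "${MY_TEMPLATE}"

cat "${MY_TEMPLATE}"
```

Writing to a separate output file instead of editing in place keeps the example template intact and avoids differences in how sed implementations handle in-place editing.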
Ensure that the AWS ParallelCluster nodes can connect with the WEKA backends on port 14000 for both TCP and UDP. Review your security group settings to confirm that the WEKA clients can communicate with the WEKA backends effectively.
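One way to review the relevant rules is to list the inbound permissions of the security group attached to the WEKA backends and confirm that port 14000 is allowed for both TCP and UDP from the cluster nodes. This is a read-only sketch; the security group ID is hypothetical, and the run helper only prints the command unless DRY_RUN=0 is set.

```shell
# Hypothetical ID of the security group attached to the WEKA backends.
BACKEND_SG="sg-0123456789abcdef0"

# Safety switch for this sketch: commands are only printed unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# List the inbound rules; look for entries that cover port 14000
# for both the tcp and udp protocols.
run aws ec2 describe-security-groups \
    --group-ids "${BACKEND_SG}" \
    --query "SecurityGroups[].IpPermissions[]" \
    --output json
```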
Run create-cluster
To assist with debugging, disable rollback-on-failure if errors occur.
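With the AWS ParallelCluster version 3 CLI, cluster creation with rollback disabled might look like the sketch below. The cluster name, template file name, and region are hypothetical, and the run helper only prints each command unless DRY_RUN=0 is set.

```shell
# Hypothetical values: replace with your own.
CLUSTER_NAME="weka-pcluster"
CLUSTER_CONFIG="my-cluster.yaml"
AWS_REGION="us-east-1"

# Safety switch for this sketch: commands are only printed unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Create the cluster. With rollback disabled, failed resources are kept
# so their logs can be inspected.
run pcluster create-cluster \
    --cluster-name "${CLUSTER_NAME}" \
    --cluster-configuration "${CLUSTER_CONFIG}" \
    --region "${AWS_REGION}" \
    --rollback-on-failure false

# Check the creation status.
run pcluster describe-cluster --cluster-name "${CLUSTER_NAME}" --region "${AWS_REGION}"
```

Resources left behind by a failed creation continue to incur cost, so delete the cluster once debugging is complete.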
Additional instance types: If your instance type is not listed in the table, contact WEKA for assistance.