AWS ParallelCluster and WEKA Integration
AWS ParallelCluster is an open-source tool for managing clusters, simplifying the deployment and administration of HPC clusters on AWS. By integrating with WEKA, organizations can create a high-performance data platform that significantly reduces epoch time from months to days without requiring additional infrastructure investment.
With the infrastructure performance and efficiency gains made possible by WEKA for AWS ParallelCluster, organizations can accelerate their pace of innovation, maximize utilization of GPU-accelerated infrastructure, and control costs.
The integration of WEKA with AWS ParallelCluster using Slurm consists of two main components: a compute cluster and a standalone WEKA cluster. Refer to the numbered elements in the accompanying illustration for details:
WEKA cluster deployment
The WEKA cluster backends are deployed within the same Virtual Private Cloud (VPC) and subnet as the AWS ParallelCluster cluster. These backends use the i3en instance family, which is optimized for high-performance storage and compute workloads.
WEKA client integration
The WEKA client software is installed on the AWS ParallelCluster components, including the head node, login nodes, and compute nodes. The client presents the WEKA cluster as a mount point in the file system, enabling efficient data sharing and processing.
Data management and tiering
To optimize data handling, WEKA uses an Amazon S3 bucket for data tiering, automatically placing data on the appropriate storage tier based on access patterns and cost-efficiency considerations. WEKA also stores snapshots in S3, providing an additional layer of data resilience and enabling robust disaster recovery.
Step 1: AWS ParallelCluster CLI installation
If you already have the AWS ParallelCluster CLI installed, you can skip to the next step. Otherwise, follow the procedure Installing the AWS ParallelCluster command line interface (CLI) in the AWS documentation.
Step 2: Create S3 bucket to store AWS ParallelCluster integration scripts
Create an S3 bucket in the same region where AWS ParallelCluster will be deployed.
Verify bucket creation
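For example, using the AWS CLI (the bucket name and region below are placeholders; substitute your own values):

```shell
REGION=eu-west-1                  # placeholder: your deployment region
BUCKET=my-pcluster-weka-scripts   # placeholder: your bucket name

# Create the bucket in the target region
aws s3 mb "s3://$BUCKET" --region "$REGION"

# Verify the bucket exists and is accessible
aws s3 ls "s3://$BUCKET"
```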
Step 3: Clone WEKA Cloud-Solutions repository and copy integrations scripts to S3
Clone this repository and cd to the aws/parallelcluster/ directory:
Upload integration scripts to S3
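These two steps might look like the following sketch with git and the AWS CLI; the repository URL, destination prefix, and bucket name are placeholders (use the actual WEKA Cloud-Solutions repository URL and the bucket created in Step 2):

```shell
BUCKET=my-pcluster-weka-scripts   # placeholder: bucket from Step 2

# Clone the repository and change to the ParallelCluster integration directory
git clone https://github.com/weka/cloud-solutions.git   # placeholder URL
cd cloud-solutions/aws/parallelcluster/

# Upload the integration scripts to the S3 bucket
aws s3 cp . "s3://$BUCKET/parallelcluster/" --recursive
```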
Create IAM policy for WEKA clients using provided template
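A sketch of the policy creation with the AWS CLI, assuming the template has been saved locally; the file name weka-client-policy.json and the policy name are placeholders:

```shell
# Create the IAM policy for WEKA clients from the provided template
aws iam create-policy \
  --policy-name weka-client-policy \
  --policy-document file://weka-client-policy.json
```

Note the policy ARN in the command output; it is needed when updating the IAM Policy ARN in the cluster template (Step 4).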
Step 4: Modify AWS ParallelCluster template
This repository includes an example cluster template file. Follow the steps below to modify the template for your AWS environment. If you already have a cluster template, combine it with the example one provided.
Create a copy of the example template file.
Update Region.
Update Networking settings.
Update SSH KeyName.
Update S3 bucket name.
Update ALB DNS Name.
Update WEKA filesystem name (optional).
Update WEKA mount point (optional). (Escape the forward slashes in the mount point path with backslashes.)
Update IAM Policy ARN. (Escape the forward slash in the AWS ARN with a backslash.)
Update SlurmQueues.
While the example template shows two queues to demonstrate a common customer setup, you can configure as few as one queue. If you do implement multiple queues, update each queue's configuration.
Review and update the following parameters as necessary:
Name
ComputeResources > Name
InstanceType
MinCount
MaxCount
CustomSlurmSettings > RealMemory
This value is specific to the instance type. Refer to the table below for the correct value.
CustomSlurmSettings > CpuSpecList
This value is specific to the instance type. Refer to the table below for the correct value.
OnNodeConfigured > Sequence > Script > Args > --cores
This value is specific to the instance type. Refer to the table below for the correct value.
If the --cores argument is defined, the WEKA mount is created as a DPDK mount for optimal performance. If the argument is not defined, the WEKA mount defaults to UDP mode. DPDK is preferred on all instances because it delivers higher storage performance; however, certain instances, such as HeadNodes, can use a UDP mount if necessary.
| Instance type | RealMemory | CpuSpecList |
| --- | --- | --- |
| hpc7a.96xlarge | 742110 | 95,191 |
Ensure that the AWS ParallelCluster nodes can connect with the WEKA backends on port 14000 for both TCP and UDP. Review your security group settings to confirm that the WEKA clients can communicate with the WEKA backends effectively.
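The required rules can be added with the AWS CLI. A sketch, assuming the clients and backends share a single security group whose ID (a placeholder here) is referenced as both the group and the source:

```shell
SG_ID=sg-0123456789abcdef0   # placeholder: security group used by WEKA backends and clients

# Allow WEKA traffic on port 14000 over TCP and UDP between members of the group
aws ec2 authorize-security-group-ingress --group-id "$SG_ID" \
  --protocol tcp --port 14000 --source-group "$SG_ID"
aws ec2 authorize-security-group-ingress --group-id "$SG_ID" \
  --protocol udp --port 14000 --source-group "$SG_ID"
```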
Run create-cluster. To assist with debugging, disable rollback-on-failure if errors occur.
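For example, with placeholder cluster and configuration file names:

```shell
# Create the cluster; --rollback-on-failure false keeps failed
# CloudFormation resources around so they can be inspected
pcluster create-cluster \
  --cluster-name weka-pcluster \
  --cluster-configuration cluster-config.yaml \
  --rollback-on-failure false
```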
Additional instance types: If your instance type is not listed in the table, contact the Customer Success Team for assistance.
You can deploy a WEKA cluster using Terraform by choosing one of the following methods:
Refer to the comprehensive documentation provided in WEKA installation on AWS using Terraform. This option is ideal if you are new to the process and require detailed guidance, including explanations of the concepts involved.
Follow the concise step-by-step instructions provided below. This option assumes you are already familiar with the subject and are looking for a quick reference.
Step 1: Set up the Terraform working directory
Create a directory to use as your Terraform working directory.
Inside this directory, create a file named main.tf and paste your configuration, starting with the AWS provider block:

```hcl
provider "aws" {
  region = "eu-west-1" # replace with your target region
}
```
Step 2: Update the Terraform configuration file
Update the following variables in the main.tf
file with the required values:
AWS region
AWS region to deploy WEKA
Example: eu-west-1
weka_version
Example: 4.4.2
get_weka_io_token
Example: H5BPF1ssQrstCVz@get.weka.io
key_pair_name
Name of an existing EC2 key pair for SSH access.
prefix
A prefix used for naming cluster resources.
The prefix and cluster_name are concatenated with a hyphen (-) to form the names of resources created by the WEKA Terraform module.
Example: weka
cluster_name
Suffix for cluster resource names.
Example: cluster
cluster_size
Number of instances for the WEKA cluster backends.
Minimum: 6.
Example: 6
instance_type
Instance type for WEKA cluster backends.
Options: i3en.2xlarge, i3en.3xlarge, i3en.6xlarge, i3en.12xlarge, or i3en.24xlarge.
sg_ids
A list of security group IDs for the cluster. These IDs are typically generated by the CloudFormation stack that created your cluster's networking resources.
Example: ["sg-1d2esy4uf63ps5", "sg-5s2fsgyhug3tps9"]
subnet_ids
A list of private subnet IDs where the WEKA cluster will be deployed.
These subnets must exist within the same VPC as the cluster and be configured for private communication.
Example: ["subnet-0a1b2c3d4e5f6g7h8", "subnet-1a2b3c4d5e6f7g8h9"]
vpc_id
The ID of the Virtual Private Cloud (VPC) where the WEKA cluster will be deployed.
The VPC must accommodate subnets, security groups, and other related resources.
Example: vpc-123abc456def
alb_additional_subnet_id
Private subnet ID in a different availability zone for load balancing.
Example: subnet-9a8b7c6d5e4f3g2h1
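Putting the variables above together, a minimal main.tf might look like the sketch below. The values are the examples from this page and must be replaced with your own; the module source weka/weka/aws is an assumption, so verify it and the variable names against the WEKA Terraform module documentation:

```hcl
provider "aws" {
  region = "eu-west-1" # AWS region to deploy WEKA
}

variable "get_weka_io_token" {
  type      = string
  sensitive = true # token retrieved from get.weka.io
}

module "weka_deployment" {
  source = "weka/weka/aws" # assumed registry source; verify against WEKA docs

  prefix        = "weka"         # resource names become "<prefix>-<cluster_name>..."
  cluster_name  = "cluster"
  cluster_size  = 6              # minimum: 6
  instance_type = "i3en.6xlarge" # i3en family backends only

  weka_version      = "4.4.2"
  get_weka_io_token = var.get_weka_io_token

  key_pair_name = "my-key-pair" # existing EC2 key pair for SSH access

  vpc_id                   = "vpc-123abc456def"
  subnet_ids               = ["subnet-0a1b2c3d4e5f6g7h8"]
  sg_ids                   = ["sg-1d2esy4uf63ps5"]
  alb_additional_subnet_id = "subnet-9a8b7c6d5e4f3g2h1" # different AZ, for the load balancer
}
```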
Step 3: Deploy the WEKA cluster
Initialize Terraform:
Plan the deployment:
Apply the configuration to deploy the WEKA cluster:
Confirm the deployment when prompted.
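The three steps above map to the standard Terraform workflow, run from the working directory that contains main.tf:

```shell
terraform init    # download the AWS provider and the WEKA module
terraform plan    # preview the resources to be created
terraform apply   # deploy the cluster; type 'yes' when prompted
```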
Step 4: Verify deployment
Ensure the WEKA cluster is fully deployed before proceeding. It may take several minutes for the configuration to complete after Terraform finishes executing.
weka_version: WEKA software version to deploy. Must be version 4.2.15 or later, available at get.weka.io.
get_weka_io_token: Token retrieved from get.weka.io.