AWS ParallelCluster and WEKA Integration

Overview

AWS ParallelCluster is an open-source tool for managing clusters, simplifying the deployment and administration of HPC clusters on AWS. By integrating with WEKA, organizations can create a high-performance data platform that significantly reduces epoch time from months to days without requiring additional infrastructure investment.

With the infrastructure performance and efficiency gains made possible by WEKA for AWS ParallelCluster, organizations can accelerate their own pace of innovation, maximize their utilization of GPU-accelerated infrastructure, and control costs.

Slurm-based architecture with AWS ParallelCluster

The integration of WEKA with AWS ParallelCluster using Slurm comprises two primary components: a compute cluster and a standalone WEKA cluster. Refer to the numbered elements in the accompanying illustration for details:

1. WEKA cluster deployment: The WEKA cluster backends are deployed within the same Virtual Private Cloud (VPC) and subnet as the AWS ParallelCluster cluster. These backends use the i3en instance family, which is optimized for high-performance storage and compute workloads.
2. WEKA client integration: The WEKA client software is installed across the AWS ParallelCluster components, including the controller node, login nodes, and worker nodes. This software facilitates seamless access to the WEKA cluster by presenting a mount point within the file system, enabling efficient data sharing and processing.
3. Data management and tiering: To optimize data handling, WEKA employs an Amazon S3 bucket for data tiering. This system ensures that data is automatically allocated to the appropriate storage tier based on access patterns and cost-efficiency considerations. Furthermore, WEKA leverages S3 for storing snapshots, providing an additional layer of data resilience and enabling robust disaster recovery.

Deployment workflow for AWS ParallelCluster cluster

Deploy WEKA Cluster using Terraform

You can deploy a WEKA cluster using Terraform by choosing one of the following methods:

Refer to the comprehensive documentation provided in WEKA installation on AWS using Terraform. This option is ideal if you are new to the process and require detailed guidance, including explanations of the concepts involved.
Follow the concise step-by-step instructions provided below. This option assumes you are already familiar with the subject and are looking for a quick reference.

Step 1: Set up the Terraform working directory

Create a directory to use as your Terraform working directory.
Inside this directory, create a file named main.tf and paste the following configuration:

provider "aws" {
    region = "<AWS region>"
}

module "deploy_weka" {
  source                        = "weka/weka/aws"
  weka_version                  = "<WEKA SW Version>"
  get_weka_io_token             = "<get weka token>"
  key_pair_name                 = "<key pair>"
  prefix                        = "<cluster prefix>"
  cluster_name                  = "<cluster name>"
  cluster_size                  = 6
  instance_type                 = "i3en.6xlarge"
  sg_ids                        = ["sg-xxxxxxxxxxxxxxxxx"]
  subnet_ids                    = ["subnet-xxxxxxxxxxxxxxxxx"]
  vpc_id                        = "vpc-xxxxxxxxxxxxxxxxx"
  alb_additional_subnet_id      = "subnet-yyyyyyyyyyyyyyyyyy"
  use_placement_group           = false
  assign_public_ip              = false
  set_dedicated_fe_container    = false
  secretmanager_create_vpc_endpoint = true
  tiering_enable_obs_integration = true
}

output "deploy_weka_output" {
  value = module.deploy_weka
}

Step 2: Update the Terraform configuration file

Update the following variables in the main.tf file with the required values:

| Variable | Description | Example/Options |
|---|---|---|
| AWS region | AWS region to deploy WEKA. | Example: eu-west-1 |
| weka_version | WEKA software version to deploy. Must be version 4.2.15 or later. Available at https://get.weka.io/. | Example: 4.4.2 |
| get_weka_io_token | Token retrieved from https://get.weka.io/. | |
| key_pair_name | Name of an existing EC2 key pair for SSH access. | |
| prefix | A prefix used for naming cluster resources. The prefix and cluster_name are concatenated with a hyphen (-) to form the names of resources created by the WEKA Terraform module. | Example: weka |
| cluster_name | Suffix for cluster resource names. | Example: cluster |
| cluster_size | Number of instances for the WEKA cluster backends. Minimum: 6. | Example: 6 |
| instance_type | Instance type for WEKA cluster backends. | Options: i3en.2xlarge, i3en.3xlarge, i3en.6xlarge, i3en.12xlarge, or i3en.24xlarge. |
| sg_ids | A list of security group IDs for the cluster. These IDs are typically generated based on the CloudFormation stack name. By default, the naming convention follows the format: sagemaker-hyperpod-SecurityGroup-xxxxxxxxxxxxx | Example: ["sg-1d2esy4uf63ps5", "sg-5s2fsgyhug3tps9"] |
| subnet_ids | A list of private subnet IDs where the AWS ParallelCluster cluster will be deployed. These subnets must exist within the same VPC as the cluster and be configured for private communication. | Example: ["subnet-0a1b2c3d4e5f6g7h8", "subnet-1a2b3c4d5e6f7g8h9"] |
| vpc_id | The ID of the Virtual Private Cloud (VPC) where the AWS ParallelCluster cluster will be deployed. The VPC must accommodate subnets, security groups, and other related resources. | Example: vpc-123abc456def |
| alb_additional_subnet_id | Private subnet ID in a different availability zone for load balancing. | Example: subnet-9a8b7c6d5e4f3g2h1 |
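
Some of the constraints above can be checked before running Terraform. The following is a minimal sketch, not part of the WEKA Terraform module: the function names validate_weka_vars and resource_name are hypothetical, and the checks cover only the cluster_size minimum, the listed instance_type options, and the hyphenated prefix and cluster_name described in the table.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers (illustrative names, not provided by WEKA).

# Check cluster_size (minimum 6) and instance_type (one of the listed i3en sizes).
validate_weka_vars() {
  local cluster_size=$1 instance_type=$2
  # A non-integer value makes the test fail, so it is rejected as well.
  if ! [ "$cluster_size" -ge 6 ] 2>/dev/null; then
    echo "cluster_size must be an integer >= 6 (got: $cluster_size)" >&2
    return 1
  fi
  case "$instance_type" in
    i3en.2xlarge|i3en.3xlarge|i3en.6xlarge|i3en.12xlarge|i3en.24xlarge) ;;
    *) echo "unsupported instance_type: $instance_type" >&2; return 1 ;;
  esac
}

# Join prefix and cluster_name with a hyphen, as the module does when naming resources.
resource_name() {
  printf '%s-%s\n' "$1" "$2"
}
```

For example, `validate_weka_vars 6 i3en.6xlarge && resource_name weka cluster` prints the base name used for the example values in the table.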

Step 3: Deploy the WEKA cluster

Initialize Terraform:
```
terraform init
```
Plan the deployment:
```
terraform plan
```
Apply the configuration to deploy the WEKA cluster:
```
terraform apply
```
Confirm the deployment when prompted.
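
As an alternative to the interactive flow, the three commands can be chained so that the apply step uses exactly the plan that was reviewed. This is a sketch under assumptions: the function name deploy_weka and the plan file name weka.tfplan are hypothetical. Note that applying a saved plan file does not prompt for confirmation.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: stop at the first failing step.
deploy_weka() {
  terraform init &&
  # Save the plan so that apply executes exactly what was planned.
  terraform plan -out=weka.tfplan &&
  # Applying a saved plan file runs without an interactive prompt.
  terraform apply weka.tfplan
}

# Usage (run from the Terraform working directory):
#   deploy_weka
```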

Step 4: Verify deployment

Ensure the WEKA cluster is fully deployed before proceeding. It may take several minutes for the configuration to complete after Terraform finishes executing.

Prepare for AWS ParallelCluster deployment

Step 1: AWS ParallelCluster CLI installation

If the AWS ParallelCluster CLI is already installed, skip to the next step. Otherwise, follow the procedure Installing the AWS ParallelCluster command line interface (CLI) in the AWS documentation.

Step 2: Create S3 bucket to store AWS ParallelCluster integration scripts

Create an S3 bucket in the same region where AWS ParallelCluster will be deployed.

aws s3 mb s3://<bucket name> --region <region>

Verify bucket creation

aws s3 ls | grep <bucket name>

Step 3: Clone WEKA Cloud-Solutions repository and copy integration scripts to S3

Clone the repository and change to the aws/parallelcluster/ directory:

git clone https://github.com/weka/cloud-solutions.git
cd cloud-solutions/aws/parallelcluster

Upload integration scripts to S3

aws s3 cp ./scripts/weka-install.py s3://<bucket name>/scripts/weka-install.py 
aws s3 cp ./scripts/virtualenv-setup.sh s3://<bucket name>/scripts/virtualenv-setup.sh

Create IAM policy for WEKA clients using provided template

aws iam create-policy --policy-name weka-client-pcluster --policy-document file://./iam/example-pcluster-policy.json
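
The ARN of this policy is needed later when editing the cluster template. One way to look it up is sketched below; the function name policy_arn is hypothetical, and the sketch assumes the policy was created in the current account under the name used above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the ARN of a customer-managed IAM policy by name.
policy_arn() {
  aws iam list-policies --scope Local \
    --query "Policies[?PolicyName=='$1'].Arn" --output text
}

# Usage:
#   POLICY_ARN=$(policy_arn weka-client-pcluster)
```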

Step 4: Modify AWS ParallelCluster template

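The repository includes an example cluster template file. Follow the steps below to modify the template for your AWS environment. If you already have a cluster template, combine it with the example one provided. The sed commands below use the BSD/macOS form of in-place editing (sed -i ''); with GNU sed on Linux, use sed -i without the empty-string argument.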

Create a copy of the example template file.

cp example-pcluster-template.yaml pcluster.yaml

Update Region.

sed -i '' -e 's/Region: us-east-2/Region: <region>/' pcluster.yaml

Update Networking settings.

sed -i '' -e 's/subnet-123456789abcdefg/<your subnet>/g' pcluster.yaml
sed -i '' -e 's/sg-123456789abcdefg/<your security group>/g' pcluster.yaml

Update SSH KeyName.

sed -i '' -e 's/support_key/<your SSH KeyPair Name>/g' pcluster.yaml

Update S3 bucket name.

sed -i '' -e 's/MY-S3-BUCKET/<s3 bucket name>/g' pcluster.yaml

Update ALB DNS Name.

sed -i '' -e 's/internal-weka-lb-12345689.us-east-2.elb.amazonaws.com/<WEKA ALB DNS NAME>/g' pcluster.yaml

Update WEKA filesystem name (optional).

sed -i '' -e 's/--filesystem-name=default/--filesystem-name=<filesystem name>/g' pcluster.yaml

Update WEKA mount point (optional). (Escape each forward slash in the mount point path with a backslash.)

sed -i '' -e 's/--mount-point=\/mnt\/weka/--mount-point=<mount point>/g' pcluster.yaml

Update IAM Policy ARN. (Escape the forward slash in the AWS ARN with a backslash.)

sed -i '' -e 's/arn:aws:iam::123456789:policy\/weka-pcluster-client-policy/<IAM policy ARN>/g' pcluster.yaml
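
Escaping forward slashes by hand is error-prone for values such as mount points and ARNs. The sketch below shows one way to automate it. The function names are hypothetical, and the code assumes that values contain no backslashes or newlines. It writes through a temporary file instead of sed -i, so the same code runs with both GNU and BSD sed.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for literal search-and-replace in the cluster template.

# Escape the characters ] / $ * . ^ [ so the text is matched literally by sed.
escape_pattern() {
  printf '%s' "$1" | sed -e 's#[]/$*.^[]#\\&#g'
}

# Escape / and & so the text is inserted literally as a sed replacement.
escape_replacement() {
  printf '%s' "$1" | sed -e 's#[/&]#\\&#g'
}

# replace_in_file FILE OLD NEW: replace every literal OLD with NEW in FILE.
replace_in_file() {
  local file=$1 pat val tmp
  pat=$(escape_pattern "$2")
  val=$(escape_replacement "$3")
  tmp=$(mktemp) || return 1
  # Write to a temporary file, then copy back to keep the file's permissions.
  sed -e "s/${pat}/${val}/g" "$file" > "$tmp" && cat "$tmp" > "$file"
  local rc=$?
  rm -f "$tmp"
  return $rc
}

# Usage:
#   replace_in_file pcluster.yaml '--mount-point=/mnt/weka' '--mount-point=/data/weka'
```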

Update SlurmQueues.

While the example template shows two queues to demonstrate a common customer setup, you can configure as few as one queue. If you do implement multiple queues, update each queue's configuration.

Review and update the following parameters as necessary:

Name
ComputeResources > Name
InstanceType
MinCount
MaxCount
CustomSlurmSettings > RealMemory
- This value is specific to the instance type. Refer to the table below for the correct value.
CustomSlurmSettings > CpuSpecList
- This value is specific to the instance type. Refer to the table below for the correct value.
OnNodeConfigured > Sequence > Script > Args > --cores
- This value is specific to the instance type. Refer to the table below for the correct value.

If the --cores argument is defined, the WEKA mount is created using a DPDK mount for optimal performance. If the argument is not defined, the WEKA mount defaults to UDP mode. DPDK is preferred for all instances to achieve higher storage performance. However, certain instances, such as HeadNodes, can use a UDP mount if necessary.

Additional instance types: If your instance type is not listed in the table below, contact the Customer Success Team for assistance.

| Instance Type | CpuSpecList | RealMemory |
|---|---|---|
| hpc7a.96xlarge | 95,191 | 742110 |

Verify security group configurations

Ensure that the AWS ParallelCluster nodes can connect with the WEKA backends on port 14000 for both TCP and UDP. Review your security group settings to confirm that the WEKA clients can communicate with the WEKA backends effectively.
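
A quick TCP reachability test can be run from a node in the same subnet and security group as the planned cluster. The following is a sketch under assumptions: the function name check_tcp is hypothetical, and it relies on bash's /dev/tcp feature and the timeout utility from GNU coreutils. It checks TCP only; UDP reachability on port 14000 still has to be confirmed from the security group rules.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a TCP connection to HOST:PORT opens within 5 seconds.
check_tcp() {
  local host=$1 port=${2:-14000}
  # $0 and $1 inside the child shell are the host and port passed after the script.
  timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Usage (replace the placeholder with a WEKA backend IP or the ALB DNS name):
#   check_tcp <WEKA backend address> 14000 && echo "TCP 14000 reachable"
```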

Deploy AWS ParallelCluster cluster

Run create-cluster

pcluster create-cluster -c pcluster.yaml --cluster-name your-cluster-name --rollback-on-failure FALSE

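Setting --rollback-on-failure to FALSE, as in the command above, keeps the failed resources in place instead of rolling them back, which assists with debugging if errors occur.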
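
Cluster creation runs asynchronously, so it can help to poll the status until it settles. The sketch below assumes the AWS ParallelCluster version 3 CLI, where describe-cluster accepts a --query expression and reports a clusterStatus field. The function names cluster_status and wait_for_cluster are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for polling cluster creation.

# Print the cluster status without the surrounding JSON quotes.
cluster_status() {
  pcluster describe-cluster --cluster-name "$1" --query clusterStatus | tr -d '"'
}

# wait_for_cluster NAME [INTERVAL_SECONDS]: block until creation completes or fails.
wait_for_cluster() {
  local name=$1 interval=${2:-30} status
  while :; do
    status=$(cluster_status "$name")
    case "$status" in
      CREATE_IN_PROGRESS) sleep "$interval" ;;
      CREATE_COMPLETE) echo "$status"; return 0 ;;
      # Any other status (including an empty one after a CLI error) is a failure.
      *) echo "cluster status: ${status:-unknown}" >&2; return 1 ;;
    esac
  done
}

# Usage:
#   wait_for_cluster your-cluster-name
```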

