Add WEKA to an existing Amazon SageMaker HyperPod cluster
Deployment workflow for an existing Amazon SageMaker HyperPod cluster
You can deploy a WEKA cluster using Terraform by choosing one of the following methods:
Refer to the comprehensive documentation provided in WEKA installation on AWS using Terraform. This option is ideal if you are new to the process and require detailed guidance, including explanations of the concepts involved.
Follow the concise step-by-step instructions provided below. This option assumes you are already familiar with the subject and are looking for a quick reference.
Step 1: Set up the Terraform working directory
Create a directory to use as your Terraform working directory.
Inside this directory, create a file named main.tf and paste the following configuration:
provider "aws" {
region = "<AWS region>"
}
module "deploy_weka" {
source = "weka/weka/aws"
weka_version = "<WEKA SW Version>"
get_weka_io_token = "<get weka token>"
key_pair_name = "<key pair>"
prefix = "<cluster prefix>"
cluster_name = "<cluster name>"
cluster_size = 6
instance_type = "i3en.6xlarge"
sg_ids = ["sg-xxxxxxxxxxxxxxxxx"]
subnet_ids = ["subnet-xxxxxxxxxxxxxxxxx"]
vpc_id = "vpc-xxxxxxxxxxxxxxxxx"
alb_additional_subnet_id = "subnet-yyyyyyyyyyyyyyyyyy"
use_placement_group = false
assign_public_ip = false
set_dedicated_fe_container = false
secretmanager_create_vpc_endpoint = true
tiering_enable_obs_integration = true
}
output "deploy_weka_output" {
value = module.deploy_weka
}
Step 2: Update the Terraform configuration file
Update the following variables in the main.tf file with the required values. An AWS CLI sketch for looking up the security group, subnet, and VPC IDs follows the list.
AWS region
AWS region to deploy WEKA
Example: eu-west-1
weka_version
WEKA software version to deploy. Must be version 4.2.15 or later. Available at https://get.weka.io/.
Example: 4.4.2
key_pair_name
Name of an existing EC2 key pair for SSH access.
prefix
A prefix used for naming cluster resources.
The prefix and cluster_name are concatenated with a hyphen (-) to form the names of resources created by the WEKA Terraform module.
Example: weka
cluster_name
Suffix for cluster resource names.
Example: cluster
cluster_size
Number of instances for the WEKA cluster backends. Minimum: 6.
Example: 6
instance_type
Instance type for WEKA cluster backends. Options: i3en.2xlarge, i3en.3xlarge, i3en.6xlarge, i3en.12xlarge, or i3en.24xlarge.
sg_ids
A list of security group IDs for the cluster. These IDs are typically generated based on the CloudFormation stack name.
By default, the naming convention follows the format: sagemaker-hyperpod-SecurityGroup-xxxxxxxxxxxxx
Example: ["sg-1d2esy4uf63ps5", "sg-5s2fsgyhug3tps9"]
subnet_ids
A list of private subnet IDs where the SageMaker HyperPod cluster will be deployed.
These subnets must exist within the same VPC as the cluster and be configured for private communication.
Example: ["subnet-0a1b2c3d4e5f6g7h8", "subnet-1a2b3c4d5e6f7g8h9"]
vpc_id
The ID of the Virtual Private Cloud (VPC) where the SageMaker HyperPod cluster will be deployed.
The VPC must accommodate subnets, security groups, and other related resources.
Example: vpc-123abc456def
alb_additional_subnet_id
Private subnet ID in a different Availability Zone, used for load balancing.
Example: subnet-9a8b7c6d5e4f3g2h1
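If you need to look up the security group, subnet, and VPC IDs created by your HyperPod CloudFormation stack, the following AWS CLI sketch shows one way to do it. The stack name and VPC ID below are illustrative placeholders, not values from this guide.
# List the security group(s) created by the HyperPod CloudFormation stack (stack name is an example).
aws cloudformation describe-stack-resources --stack-name sagemaker-hyperpod \
  --query "StackResources[?ResourceType=='AWS::EC2::SecurityGroup'].PhysicalResourceId" --output text

# List the subnets in the cluster VPC with their Availability Zones (VPC ID is an example).
aws ec2 describe-subnets --filters Name=vpc-id,Values=vpc-123abc456def \
  --query "Subnets[].{Id:SubnetId,AZ:AvailabilityZone,CIDR:CidrBlock}" --output table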
Step 3: Deploy the WEKA cluster
Initialize Terraform:
terraform init
Plan the deployment:
terraform plan
Apply the configuration to deploy the WEKA cluster:
terraform apply
Confirm the deployment when prompted.
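If you are running the deployment from automation and do not want the interactive prompt, Terraform's -auto-approve flag can be used instead; this is optional.
# Optional: apply without the interactive confirmation prompt (for automated runs).
terraform apply -auto-approve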
Step 4: Verify deployment
Ensure the WEKA cluster is fully deployed before proceeding. It may take several minutes for the configuration to complete after Terraform finishes executing.
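One way to confirm the deployment, assuming you can reach a backend instance over SSH (the key file, user name, and IP address below are illustrative), is to inspect the Terraform outputs and check the WEKA cluster status:
# Show the Terraform outputs defined in main.tf (the Application Load Balancer DNS name
# needed later can be obtained here or from the AWS Console).
terraform output deploy_weka_output

# SSH to one of the WEKA backend instances and check the cluster status.
ssh -i <key pair>.pem ec2-user@<backend instance IP> "sudo weka status"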
Deploy WEKA clients in Amazon SageMaker HyperPod
Step 1: Download integration scripts from GitHub
Clone the GitHub repository:
git clone https://github.com/weka/cloud-solutions.git
Enter the sagemaker-hyperpod directory:
cd cloud-solutions/aws/sagemaker-hyperpod/
Step 2: Verify region configuration
Verify AWS CLI region configuration:
aws configure list
Verify that the region listed is the desired region for the SageMaker HyperPod cluster. If it is not correct, set the AWS_REGION environment variable to the correct region.
export AWS_REGION=<desired region>
Step 3: Verify VPC configuration
Ensure the optional parameter "Availability zone ID to deploy the backup private subnet" is configured with a valid entry. If the CloudFormation template has already been deployed, update the existing stack using the existing template.
Edit the sagemaker-hyperpod-SecurityGroup rule created by the CloudFormation template. Add the following inbound rules to allow access from your management workstation's CIDR range:
TCP port 22 (SSH)
TCP port 14000 (WEKA UI)
This ensures that your management workstation can connect securely to the cluster.
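As a sketch, the same inbound rules can be added with the AWS CLI; the security group ID and CIDR range below are placeholders that you must replace with your own values.
# Allow SSH (TCP 22) and the WEKA UI (TCP 14000) from the management workstation CIDR.
SG_ID=sg-0123456789abcdef0   # placeholder: your sagemaker-hyperpod security group ID
MGMT_CIDR=203.0.113.0/24     # placeholder: your management workstation CIDR range

aws ec2 authorize-security-group-ingress --group-id "$SG_ID" --protocol tcp --port 22 --cidr "$MGMT_CIDR"
aws ec2 authorize-security-group-ingress --group-id "$SG_ID" --protocol tcp --port 14000 --cidr "$MGMT_CIDR"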
Step 4: Configure environment variables
Run set_env_vars.sh:
./set_env_vars.sh <stack_name> && source env_vars
<stack_name>: Name of the existing CloudFormation stack.
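For example, assuming the CloudFormation stack is named sagemaker-hyperpod-stack (a placeholder name):
# Generate the env_vars file from the stack and load it into the current shell.
./set_env_vars.sh sagemaker-hyperpod-stack && source env_vars
# Optionally review what was written to env_vars.
cat env_vars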
Step 5: Deploy WEKA clients to existing cluster
Run deploy_weka_into_existing_cluster.sh, replacing <ALB_NAME> with either a WEKA backend IP address or the Application Load Balancer DNS name, and <WEKA_FS_NAME> with the name of the WEKA filesystem you want to mount.
./deploy_weka_into_existing_cluster.sh <ALB_NAME> <WEKA_FS_NAME>
ALB_NAME: Obtain this from the AWS Console or the Terraform output. It is a DNS name.
WEKA_FS_NAME: Obtain this from the WEKA UI. The default filesystem name is default.
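For example, assuming the Application Load Balancer DNS name was copied from the Terraform output or the AWS Console and the default filesystem is used (both values are illustrative):
# Deploy WEKA clients on the HyperPod nodes, pointing them at the WEKA cluster's ALB.
./deploy_weka_into_existing_cluster.sh weka-cluster-alb-1234567890.eu-west-1.elb.amazonaws.com default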
Step 6: Verify WEKA clients are mounted
Log in to one of the cluster nodes using SSH or SSM.
Verify the mount using df:
df -h
The WEKA filesystem is mounted at /mnt/weka on all SageMaker HyperPod nodes.
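As a quick sketch, you can check the mount on the node you are logged in to and, if the HyperPod cluster is orchestrated with Slurm, across several nodes at once (the node count below is illustrative):
# Check the mount on the current node.
df -h /mnt/weka

# If Slurm is available, run the same check on multiple nodes in one pass.
srun -N 4 df -h /mnt/weka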