Integrate SageMaker HyperPod with WEKA using Slurm
Explore the architecture and deployment workflow for integrating WEKA with SageMaker HyperPod using Slurm.
The integration of WEKA with SageMaker HyperPod using Slurm comprises two primary components: a compute cluster and a standalone WEKA cluster. Refer to the numbered elements in the accompanying illustration for details:
1. **WEKA cluster deployment.** The WEKA cluster backends are deployed within the same Virtual Private Cloud (VPC) and subnet as the SageMaker HyperPod cluster. These backends use the i3en instance family, which is optimized for high-performance storage and compute workloads.
2. **WEKA client integration.** The WEKA client software is installed across the SageMaker HyperPod components, including the controller node, login nodes, and worker nodes. This software facilitates seamless access to the WEKA cluster by presenting a mount point within the file system, enabling efficient data sharing and processing.
3. **Data management and tiering.** To optimize data handling, WEKA employs an Amazon S3 bucket for data tiering. This system ensures that data is automatically allocated to the appropriate storage tier based on access patterns and cost-efficiency considerations. Furthermore, WEKA leverages S3 for storing snapshots, providing an additional layer of data resilience and enabling robust disaster recovery.
Deploy the CloudFormation template (or an equivalent) to create the prerequisites for the SageMaker HyperPod cluster.
The CloudFormation template can be found at: Amazon SageMaker HyperPod > 0. Prerequisites > 2. Own Account.
Ensure the optional parameter "Availability zone ID to deploy the backup private subnet" is configured with a valid entry.
Retrieve the token required for the WEKA package installation by accessing the WEKA download command at https://get.weka.io/.
Step 1: Set up the Terraform working directory
Create a directory to use as your Terraform working directory.
Inside this directory, create a file named main.tf and paste the following configuration:
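A minimal sketch of such a main.tf, assuming WEKA's published Terraform module (the `weka/weka/aws` source and version pin are assumptions; all variable names come from the reference table at the end of this section, and every value is a placeholder):

```hcl
provider "aws" {
  region = "us-east-1" # assumption: set to the region hosting your HyperPod VPC
}

module "weka_deployment" {
  source  = "weka/weka/aws" # assumption: WEKA's Terraform module on the registry
  version = "~> 4.0"        # assumption: pin to the module version you validated

  prefix        = "weka"
  cluster_name  = "cluster"
  cluster_size  = 6              # minimum supported backend count
  instance_type = "i3en.2xlarge" # any supported i3en size (see the variable table)

  weka_version      = "4.4.2"
  get_weka_io_token = "H5BPF1ssQrstCVz@get.weka.io" # token from https://get.weka.io/
  key_pair_name     = "my-ec2-key-pair"

  vpc_id                   = "vpc-123abc456def"
  subnet_ids               = ["subnet-0a1b2c3d4e5f6g7h8"]
  sg_ids                   = ["sg-1d2esy4uf63ps5", "sg-5s2fsgyhug3tps9"]
  alb_additional_subnet_id = "subnet-9a8b7c6d5e4f3g2h1"
}
```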
Step 2: Update the Terraform configuration file
Update the following variables in the main.tf file with the required values; the table at the end of this section describes each one:
Step 3: Deploy the WEKA cluster
Initialize Terraform, plan the deployment, then apply the configuration to deploy the WEKA cluster, confirming when prompted. The corresponding commands are shown below.
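These are the stock Terraform commands; nothing here is WEKA-specific:

```bash
terraform init   # download the AWS provider and the WEKA module
terraform plan   # review the resources Terraform will create
terraform apply  # deploy the WEKA cluster; type "yes" when prompted
```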
Step 4: Verify deployment
Ensure the WEKA cluster is fully deployed before proceeding. It may take several minutes for the configuration to complete after Terraform finishes executing.
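One way to check readiness, assuming SSH access to a backend instance with the key pair supplied above (the address and user are placeholders; `weka status` is the WEKA CLI's cluster health command):

```bash
# Placeholders: substitute a backend's private IP and your own key pair.
ssh -i my-ec2-key-pair.pem ec2-user@10.0.0.10
weka status   # wait until the cluster reports fully up before proceeding
```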
Related topic
WEKA installation on AWS using Terraform
**Clone the WEKA cloud solutions repository.** Download the repository from GitHub:
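A sketch of the clone step; the repository URL is an assumption, so verify it against WEKA's GitHub organization:

```bash
# Assumed URL -- confirm the canonical location on WEKA's GitHub
git clone https://github.com/weka/weka-cloud-solutions.git
```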
**Navigate to the SageMaker HyperPod directory.** Change to the relevant directory:
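Assuming the clone above, the path below is an illustrative guess at the repository layout:

```bash
cd weka-cloud-solutions/aws/sagemaker-hyperpod  # assumed path -- check the repo tree
```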
**Set environment variables.** Run the script that derives environment variables from the CloudFormation stack:
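A hedged sketch; the script name is an assumption modeled on the AWS HyperPod workshop flow:

```bash
# Reads the CloudFormation stack outputs and writes them out as
# environment variables (script name is an assumption).
./create_config.sh
```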
**Create cluster script.** Copy the example script to create_cluster.sh:
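For example, assuming the repository ships an example script alongside it (the source filename is an assumption):

```bash
cp create_cluster.sh.example create_cluster.sh  # assumed example filename
chmod +x create_cluster.sh
```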
**Modify create_cluster.sh.** Edit create_cluster.sh to customize the cluster configuration, updating environment variables as needed:
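An illustrative, hypothetical fragment of the kind of variables involved; the real names live in create_cluster.sh:

```bash
# Hypothetical values -- check create_cluster.sh for the actual variables.
export CLUSTER_NAME=ml-cluster    # the workshop's typical default name
export INSTANCE_TYPE=p5.48xlarge  # worker node instance type
export INSTANCE_COUNT=2           # number of worker nodes
```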
**Source environment variables.** Load the environment variables:
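Assuming the configuration script above wrote its output to a file (the filename is an assumption):

```bash
source env_vars  # assumed output file of the earlier configuration step
```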
**Update set_weka.sh configuration.** Navigate to the LifecycleScripts directory and adjust settings in set_weka.sh as needed:
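Assuming the directory sits at the level of the current working directory:

```bash
cd LifecycleScripts   # adjust the relative path to your checkout if needed
```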
**Verify network cards and cores if using p5.48xlarge.** For other instance types with multiple EFA cards, modify the script accordingly:
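An illustrative, hypothetical fragment of the kind of setting involved; the actual variable names are in set_weka.sh:

```bash
# Hypothetical variable names -- check set_weka.sh for the real ones.
NUM_NETWORK_CARDS=32   # p5.48xlarge exposes 32 EFA network cards
NUM_CORES=8            # cores dedicated to the WEKA client (illustrative value)
```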
**Update filesystem and mount point.** Modify FILESYSTEM_NAME and MOUNT_POINT if a different filesystem or mount point is required:
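For example (the filesystem name `default` is an assumption; `/mnt/weka` matches the mount point named at the end of this procedure):

```bash
FILESYSTEM_NAME=default   # assumption: a common default WEKA filesystem name
MOUNT_POINT=/mnt/weka     # mount point referenced later in this guide
```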
**Create the cluster.** Run the cluster creation script (a sketch follows the notes below):
- `CLUSTER_NAME`: Typically set to `ml-cluster` in workshop documentation.
- `ALB_NAME`: Obtain this from the AWS Console or the Terraform output. It will be a DNS name.
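A hedged invocation sketch; exactly how the script consumes these values is an assumption:

```bash
# Hypothetical usage -- check create_cluster.sh for how it reads these values.
export CLUSTER_NAME=ml-cluster
export ALB_NAME=<alb-dns-name>   # from the Terraform output or the AWS Console
./create_cluster.sh
```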
**Monitor cluster creation.** Track the cluster creation process:
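One way to track it with the AWS CLI, assuming the `ml-cluster` name used above:

```bash
# Poll the HyperPod cluster status until it reaches InService
watch -n 30 "aws sagemaker describe-cluster --cluster-name ml-cluster \
  --query 'ClusterStatus' --output text"
```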
**Continue setup.** Proceed with the setup by following Section 1, Step D of the AWS SageMaker HyperPod workshop.
The WEKA file system will be mounted at /mnt/weka on all SageMaker HyperPod nodes.
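To confirm the mount on any node (standard Linux tooling, nothing WEKA-specific):

```bash
df -hT /mnt/weka        # should report a wekafs filesystem
mount | grep /mnt/weka  # shows the active mount options
```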
| Variable | Description | Example/Options |
|---|---|---|
| `weka_version` | WEKA software version to deploy. Must be version `4.2.15` or later. Available at https://get.weka.io/. | Example: `4.4.2` |
| `get_weka_io_token` | Token retrieved from https://get.weka.io/. | Example: `H5BPF1ssQrstCVz@get.weka.io` |
| `key_pair_name` | Name of an existing EC2 key pair for SSH access. | |
| `prefix` | A prefix used for naming cluster resources. The `prefix` and `cluster_name` are concatenated with a hyphen (`-`) to form the names of resources created by the WEKA Terraform module. | Example: `weka` |
| `cluster_name` | Suffix for cluster resource names. | Example: `cluster` |
| `cluster_size` | Number of instances for the WEKA cluster backends. Minimum: `6`. | Example: `6` |
| `instance_type` | Instance type for WEKA cluster backends. | Options: `i3en.2xlarge`, `i3en.3xlarge`, `i3en.6xlarge`, `i3en.12xlarge`, or `i3en.24xlarge` |
| `sg_ids` | A list of security group IDs for the cluster, typically generated based on the CloudFormation stack name. By default, the naming convention is `sagemaker-hyperpod-SecurityGroup-xxxxxxxxxxxxx`. | Example: `["sg-1d2esy4uf63ps5", "sg-5s2fsgyhug3tps9"]` |
| `subnet_ids` | A list of private subnet IDs where the SageMaker HyperPod cluster will be deployed. These subnets must exist within the same VPC as the cluster and be configured for private communication. | Example: `["subnet-0a1b2c3d4e5f6g7h8", "subnet-1a2b3c4d5e6f7g8h9"]` |
| `vpc_id` | The ID of the Virtual Private Cloud (VPC) where the SageMaker HyperPod cluster will be deployed. The VPC must accommodate subnets, security groups, and other related resources. | Example: `vpc-123abc456def` |
| `alb_additional_subnet_id` | Private subnet ID in a different availability zone, used for load balancing. | Example: `subnet-9a8b7c6d5e4f3g2h1` |