
AWS ParallelCluster and WEKA Integration


Last updated 2 months ago

Overview

AWS ParallelCluster is an open-source tool for managing clusters, simplifying the deployment and administration of HPC clusters on AWS. By integrating with WEKA, organizations can create a high-performance data platform that significantly reduces epoch time from months to days without requiring additional infrastructure investment.

With the infrastructure performance and efficiency gains made possible by WEKA for AWS ParallelCluster, organizations can accelerate their pace of innovation, maximize utilization of their GPU-accelerated infrastructure, and control costs.

Slurm based architecture with AWS ParallelCluster

The integration of WEKA with AWS ParallelCluster using Slurm consists of two main components: a compute cluster and a standalone WEKA cluster. Refer to the numbered elements in the accompanying illustration for details:

  1. WEKA cluster deployment: The WEKA cluster backends are deployed within the same Virtual Private Cloud (VPC) and subnet as the AWS ParallelCluster cluster. These backends use the i3en instance family, which is optimized for high-performance storage and compute workloads.

  2. WEKA client integration: The WEKA client software is installed across the AWS ParallelCluster components, including the controller node, login nodes, and worker nodes. This software facilitates seamless access to the WEKA cluster by presenting a mount point within the file system, enabling efficient data sharing and processing.

  3. Data management and tiering: To optimize data handling, WEKA employs an Amazon S3 bucket for data tiering. This system ensures that data is automatically allocated to the appropriate storage tier based on access patterns and cost-efficiency considerations. Furthermore, WEKA leverages S3 for storing snapshots, providing an additional layer of data resilience and enabling robust disaster recovery.

Deployment workflow for AWS ParallelCluster cluster

Deploy WEKA Cluster using Terraform

Prepare for AWS ParallelCluster deployment

Step 1: AWS ParallelCluster CLI installation

Step 2: Create S3 bucket to store AWS ParallelCluster integration scripts

  1. Create an S3 bucket in the same region where AWS ParallelCluster will be deployed.

aws s3 mb s3://<bucket name> --region <region>

  2. Verify bucket creation.

aws s3 ls | grep <bucket name>
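
The create-and-verify steps above can be combined into one idempotent helper. The following is a minimal sketch, not part of the WEKA repository: the function name ensure_bucket and the example bucket name are hypothetical. It uses aws s3api head-bucket for the existence check, since aws s3 ls | grep also matches buckets whose names merely contain the search string.

```shell
#!/bin/sh
# Sketch: create the scripts bucket only if it does not already exist.
# ensure_bucket is a hypothetical helper name, not part of the WEKA tooling.
ensure_bucket() {
  bucket="$1"
  region="$2"
  # head-bucket exits 0 only when the bucket exists and is accessible
  # with the current credentials.
  if aws s3api head-bucket --bucket "$bucket" >/dev/null 2>&1; then
    echo "exists"
  else
    aws s3 mb "s3://$bucket" --region "$region" >/dev/null && echo "created"
  fi
}

# Usage (hypothetical bucket name):
#   ensure_bucket my-pcluster-scripts us-east-2
```

If head-bucket fails because the bucket exists but belongs to another account, the subsequent aws s3 mb call also fails and the function prints nothing, so check the exit status before continuing.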

Step 3: Clone WEKA Cloud-Solutions repository and copy integrations scripts to S3

  1. Clone the repository and change to the aws/parallelcluster/ directory:

git clone https://github.com/weka/cloud-solutions.git
cd cloud-solutions/aws/parallelcluster

  2. Upload the integration scripts to S3.

aws s3 cp ./scripts/weka-install.py s3://<bucket name>/scripts/weka-install.py
aws s3 cp ./scripts/virtualenv-setup.sh s3://<bucket name>/scripts/virtualenv-setup.sh

  3. Create an IAM policy for WEKA clients using the provided template.

aws iam create-policy --policy-name weka-client-pcluster --policy-document file://./iam/example-pcluster-policy.json
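
The ARN of the new policy is needed later when the cluster template is edited. As a hedged sketch, the create-policy call can return the ARN directly through the AWS CLI's --query and --output options. The function name create_client_policy is hypothetical; the policy name and file path are the ones used in the step above.

```shell
#!/bin/sh
# Sketch: create the WEKA client policy and print only its ARN.
# create_client_policy is a hypothetical helper name.
create_client_policy() {
  policy_name="$1"
  policy_file="$2"
  # --query/--output reduce the JSON response to the bare ARN string.
  aws iam create-policy \
    --policy-name "$policy_name" \
    --policy-document "file://$policy_file" \
    --query 'Policy.Arn' --output text
}

# Usage:
#   POLICY_ARN=$(create_client_policy weka-client-pcluster ./iam/example-pcluster-policy.json)
#   echo "$POLICY_ARN"
```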

Step 4: Modify AWS ParallelCluster template

This repository includes an example cluster template file. Follow the steps below to modify the template for your AWS environment. If you already have a cluster template, combine it with the example one provided.

  1. Create a copy of the example template file.

cp example-pcluster-template.yaml pcluster.yaml

The sed commands in the following steps use the BSD/macOS form sed -i ''. With GNU sed on Linux, use sed -i without the empty argument.

  2. Update Region.

sed -i '' -e 's/Region: us-east-2/Region: <region>/' pcluster.yaml

  3. Update Networking settings.

sed -i '' -e 's/subnet-123456789abcdefg/<your subnet>/g' pcluster.yaml
sed -i '' -e 's/sg-123456789abcdefg/<your security group>/g' pcluster.yaml

  4. Update SSH KeyName.

sed -i '' -e 's/support_key/<your SSH KeyPair Name>/g' pcluster.yaml

  5. Update S3 bucket name.

sed -i '' -e 's/MY-S3-BUCKET/<s3 bucket name>/g' pcluster.yaml

  6. Update ALB DNS Name.

sed -i '' -e 's/internal-weka-lb-12345689.us-east-2.elb.amazonaws.com/<WEKA ALB DNS NAME>/g' pcluster.yaml

  7. Update WEKA filesystem name (optional).

sed -i '' -e 's/--filesystem-name=default/--filesystem-name=<filesystem name>/g' pcluster.yaml

  8. Update WEKA mount point (optional). (Escape each forward slash in the mount point path with a backslash.)

sed -i '' -e 's/--mount-point=\/mnt\/weka/--mount-point=<mount point>/g' pcluster.yaml

  9. Update IAM Policy ARN. (Escape the forward slash in the AWS ARN with a backslash.)

sed -i '' -e 's/arn:aws:iam::123456789:policy\/weka-pcluster-client-policy/<IAM policy ARN>/g' pcluster.yaml

  10. Update SlurmQueues.

While the example template shows two queues to demonstrate a common customer setup, you can configure as few as one queue. If you do implement multiple queues, update each queue's configuration.

Review and update the following parameters as necessary:

  • Name

  • ComputeResources > Name

  • InstanceType

  • MinCount

  • MaxCount

  • CustomSlurmSettings > RealMemory

    • This value is specific to the instance type. Refer to the table below for the correct value.

  • CustomSlurmSettings > CpuSpecList

    • This value is specific to the instance type. Refer to the table below for the correct value.

  • OnNodeConfigured > Sequence > Script > Args > --cores

    • This value is specific to the instance type. Refer to the table below for the correct value.

If the --cores argument is defined, the WEKA mount is created using a DPDK mount for optimal performance. If the argument is not defined, the WEKA mount defaults to UDP mode. DPDK is preferred for all instances to achieve higher storage performance. However, certain instances, such as HeadNodes, can use a UDP mount if necessary.

| Instance Type  | CpuSpecList | RealMemory |
|----------------|-------------|------------|
| hpc7a.96xlarge | 95,191      | 742110     |
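
The template substitutions in Step 4 require manual backslash escaping and an in-place sed syntax that differs between macOS and Linux. The following sketch shows one way to avoid both concerns. The helper names escape_replacement, escape_pattern, and replace_literal are hypothetical and are not part of the WEKA repository. The helpers escape values automatically and write through a temporary file instead of relying on sed -i.

```shell
#!/bin/sh
# Escape "/", "&", and "\" so a value is safe as a sed replacement
# when "/" is the delimiter.
escape_replacement() {
  printf '%s' "$1" | sed -e 's|[/&\\]|\\&|g'
}

# Escape basic-regex metacharacters and "/" so a value matches literally.
escape_pattern() {
  printf '%s' "$1" | sed -e 's|[[\\/.*^$]|\\&|g'
}

# Replace every literal occurrence of $1 with $2 in file $3.
# Output goes to a temporary file first, so "sed -i" is not needed.
replace_literal() {
  pat=$(escape_pattern "$1")
  rep=$(escape_replacement "$2")
  tmp=$(mktemp) || return 1
  sed -e "s/$pat/$rep/g" "$3" > "$tmp" && cat "$tmp" > "$3"
  status=$?
  rm -f "$tmp"
  return $status
}

# Usage (values are illustrative):
#   replace_literal '--mount-point=/mnt/weka' '--mount-point=/mnt/data' pcluster.yaml
```

Because the search string is escaped as well, the dots in a value such as the ALB DNS name match only literal dots.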

Verify security group configurations

Ensure that the AWS ParallelCluster nodes can connect with the WEKA backends on port 14000 for both TCP and UDP. Review your security group settings to confirm that the WEKA clients can communicate with the WEKA backends effectively.
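
As a rough connectivity check from a ParallelCluster node, the TCP side of this requirement can be probed as sketched below. This is an illustration under assumptions: the function name check_tcp_port is hypothetical, and the probe relies on bash's /dev/tcp pseudo-device and the coreutils timeout command being available on the node. It does not test UDP, which still has to be confirmed from the security group rules.

```shell
#!/bin/sh
# Return 0 if a TCP connection to HOST:PORT succeeds within 3 seconds.
# The port defaults to 14000, the WEKA backend port named above.
check_tcp_port() {
  host="$1"
  port="${2:-14000}"
  # bash opens the connection through its /dev/tcp pseudo-device;
  # $0 and $1 inside the inline script are the host and port.
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Usage (replace with the IP address of a WEKA backend):
#   check_tcp_port <backend ip> && echo "TCP 14000 reachable"
```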

Deploy AWS ParallelCluster cluster

  1. Run create-cluster.

pcluster create-cluster -c pcluster.yaml --cluster-name your-cluster-name --rollback-on-failure FALSE

  2. To assist with debugging, keep rollback-on-failure disabled, as in the command above, so that the failed stack is not rolled back automatically if errors occur.
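
Cluster creation runs asynchronously, so it can help to poll for completion. The sketch below assumes the pcluster describe-cluster command of AWS ParallelCluster 3, whose JSON output is assumed to contain a clusterStatus field with values such as CREATE_IN_PROGRESS, CREATE_COMPLETE, and CREATE_FAILED. The function names cluster_status and wait_for_cluster are hypothetical.

```shell
#!/bin/sh
# Print the clusterStatus field from the describe-cluster JSON output.
cluster_status() {
  pcluster describe-cluster --cluster-name "$1" |
    sed -n 's/.*"clusterStatus": *"\([^"]*\)".*/\1/p' | head -n 1
}

# Poll until the status is no longer CREATE_IN_PROGRESS, print the final
# status, and return 0 only for CREATE_COMPLETE.
wait_for_cluster() {
  name="$1"
  interval="${2:-30}"   # seconds between polls
  while :; do
    status=$(cluster_status "$name")
    [ "$status" = "CREATE_IN_PROGRESS" ] || break
    sleep "$interval"
  done
  echo "$status"
  [ "$status" = "CREATE_COMPLETE" ]
}

# Usage:
#   wait_for_cluster your-cluster-name || echo "cluster creation did not complete"
```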

Note for Step 1 (AWS ParallelCluster CLI installation): If you already have the AWS ParallelCluster CLI installed, you can skip to the next step. Otherwise, follow the procedure in the AWS documentation: Installing the AWS ParallelCluster command line interface (CLI).

Additional instance types: If your instance type is not listed in the table above, contact the Customer Success Team for assistance.

Deploy WEKA Cluster using Terraform

You can deploy a WEKA cluster using Terraform by choosing one of the following methods:

  • Refer to the comprehensive documentation provided in WEKA installation on AWS using Terraform. This option is ideal if you are new to the process and require detailed guidance, including explanations of the concepts involved.

  • Follow the concise step-by-step instructions provided below. This option assumes you are already familiar with the subject and are looking for a quick reference.

Step 1: Set up the Terraform working directory

  1. Create a directory to use as your Terraform working directory.

  2. Inside this directory, create a file named main.tf and paste the following configuration:

provider "aws" {
  region = "<AWS region>"
}

module "deploy_weka" {
  source                            = "weka/weka/aws"
  weka_version                      = "<WEKA SW Version>"
  get_weka_io_token                 = "<get weka token>"
  key_pair_name                     = "<key pair>"
  prefix                            = "<cluster prefix>"
  cluster_name                      = "<cluster name>"
  cluster_size                      = 6
  instance_type                     = "i3en.6xlarge"
  sg_ids                            = ["sg-xxxxxxxxxxxxxxxxx"]
  subnet_ids                        = ["subnet-xxxxxxxxxxxxxxxxx"]
  vpc_id                            = "vpc-xxxxxxxxxxxxxxxxx"
  alb_additional_subnet_id          = "subnet-yyyyyyyyyyyyyyyyyy"
  use_placement_group               = false
  assign_public_ip                  = false
  set_dedicated_fe_container        = false
  secretmanager_create_vpc_endpoint = true
  tiering_enable_obs_integration    = true
}

output "deploy_weka_output" {
  value = module.deploy_weka
}

Step 2: Update the Terraform configuration file

Update the following variables in the main.tf file with the required values:

| Variable | Description | Example/Options |
|----------|-------------|-----------------|
| AWS region | AWS region to deploy WEKA. | Example: eu-west-1 |
| weka_version | WEKA software version to deploy. Must be version 4.2.15 or later. Available at https://get.weka.io/. | Example: 4.4.2 |
| get_weka_io_token | Token retrieved from https://get.weka.io/. | Example: H5BPF1ssQrstCVz@get.weka.io |
| key_pair_name | Name of an existing EC2 key pair for SSH access. | |
| prefix | A prefix used for naming cluster resources. The prefix and cluster_name are concatenated with a hyphen (-) to form the names of resources created by the WEKA Terraform module. | Example: weka |
| cluster_name | Suffix for cluster resource names. | Example: cluster |
| cluster_size | Number of instances for the WEKA cluster backends. Minimum: 6. | Example: 6 |
| instance_type | Instance type for WEKA cluster backends. | Options: i3en.2xlarge, i3en.3xlarge, i3en.6xlarge, i3en.12xlarge, or i3en.24xlarge |
| sg_ids | A list of security group IDs for the cluster. These IDs are typically generated based on the CloudFormation stack name. By default, the naming convention follows the format: sagemaker-hyperpod-SecurityGroup-xxxxxxxxxxxxx | Example: ["sg-1d2esy4uf63ps5", "sg-5s2fsgyhug3tps9"] |
| subnet_ids | A list of private subnet IDs where the AWS ParallelCluster cluster will be deployed. These subnets must exist within the same VPC as the cluster and be configured for private communication. | Example: ["subnet-0a1b2c3d4e5f6g7h8", "subnet-1a2b3c4d5e6f7g8h9"] |
| vpc_id | The ID of the Virtual Private Cloud (VPC) where the AWS ParallelCluster cluster will be deployed. The VPC must accommodate subnets, security groups, and other related resources. | Example: vpc-123abc456def |
| alb_additional_subnet_id | Private subnet ID in a different availability zone for load balancing. | Example: subnet-9a8b7c6d5e4f3g2h1 |
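
Before deploying, it can be useful to confirm that no angle-bracket placeholder from the sample configuration is left in main.tf. The following is a minimal sketch; the function name check_placeholders is hypothetical. It only detects values of the form <...>, so sample IDs such as sg-xxxxxxxxxxxxxxxxx still need a manual review.

```shell
#!/bin/sh
# List lines that still contain an unfilled "<...>" placeholder.
# Returns 0 when the file is clean and 1 when placeholders remain.
check_placeholders() {
  if grep -n '<[^>]*>' "$1"; then
    echo "Unfilled placeholders remain in $1" >&2
    return 1
  fi
  return 0
}

# Usage:
#   check_placeholders main.tf && echo "main.tf has no <...> placeholders"
```

The same check applies to pcluster.yaml after the template substitutions.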

Step 3: Deploy the WEKA cluster

  1. Initialize Terraform:

    terraform init
  2. Plan the deployment:

    terraform plan
  3. Apply the configuration to deploy the WEKA cluster:

    terraform apply

    Confirm the deployment when prompted.
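
For unattended runs, the three commands above can be chained so that a failure stops the sequence. This is a sketch under assumptions: the function name deploy_weka and the plan file name weka.tfplan are hypothetical, and the -chdir option requires a Terraform version that supports it. Applying a saved plan file executes exactly the reviewed plan and does not prompt for confirmation, which differs from the interactive flow described above.

```shell
#!/bin/sh
# Run init, validate, plan, and apply in the working directory given as $1.
# Each step runs only if the previous one succeeded.
deploy_weka() {
  dir="$1"
  terraform -chdir="$dir" init -input=false &&
  terraform -chdir="$dir" validate &&
  terraform -chdir="$dir" plan -out=weka.tfplan &&
  terraform -chdir="$dir" apply weka.tfplan
}

# Usage (directory containing main.tf):
#   deploy_weka ./weka-terraform
```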

Step 4: Verify deployment

Ensure the WEKA cluster is fully deployed before proceeding. It may take several minutes for the configuration to complete after Terraform finishes executing.

Note: The weka_version value must be 4.2.15 or later. Both the WEKA software versions and the get_weka_io_token value are available at https://get.weka.io/.