# AWS ParallelCluster and WEKA Integration

## Overview

AWS ParallelCluster is an open-source tool for managing clusters, simplifying the deployment and administration of HPC clusters on AWS. By integrating with WEKA, organizations can create a high-performance data platform that significantly reduces epoch time from months to days without requiring additional infrastructure investment.

With the infrastructure performance and efficiency gains made possible by WEKA for AWS ParallelCluster, organizations can accelerate their pace of innovation, maximize utilization of GPU-accelerated infrastructure, and control costs.

## Slurm-based architecture with AWS ParallelCluster

The integration of WEKA with AWS ParallelCluster using Slurm comprises two primary components: a compute cluster and a standalone WEKA cluster. Refer to the numbered elements in the accompanying illustration for details:

1. **WEKA cluster deployment**\
   The WEKA cluster backends are deployed within the same Virtual Private Cloud (VPC) and subnet as the AWS ParallelCluster cluster. These backends use the i3en instance family, which is optimized for high-performance storage and compute workloads.
2. **WEKA client integration**\
   The WEKA client software is installed across the AWS ParallelCluster components, including the controller node, login nodes, and worker nodes. This software facilitates seamless access to the WEKA cluster by presenting a mount point within the file system, enabling efficient data sharing and processing.
3. **Data management and tiering**\
   To optimize data handling, WEKA employs an Amazon S3 bucket for data tiering. This system ensures that data is automatically allocated to the appropriate storage tier based on access patterns and cost-efficiency considerations. Furthermore, WEKA leverages S3 for storing snapshots, providing an additional layer of data resilience and enabling robust disaster recovery.

<div data-with-frame="true"><figure><img src="https://content.gitbook.com/content/ZW262oqYA8pNNfGvXjHa/blobs/s5oPzuR3nPK3iVEA5Yy3/PCluster-WEKA-Arch.png" alt=""><figcaption><p>Slurm-based architecture with AWS ParallelCluster</p></figcaption></figure></div>

## Deployment workflow for AWS ParallelCluster cluster

1. [Deploy WEKA Cluster using Terraform](#deploy-weka-cluster-using-terraform).
2. [Prepare for AWS ParallelCluster deployment](#prepare-for-aws-parallelcluster-deployment).
3. [Verify security group configurations](#verify-security-group-configurations).
4. [Deploy AWS ParallelCluster cluster](#deploy-aws-parallelcluster-cluster).

### Deploy WEKA Cluster using Terraform

You can deploy a WEKA cluster using Terraform by choosing one of the following methods:

* Refer to the comprehensive documentation provided in [WEKA installation on AWS using Terraform](https://app.gitbook.com/s/VJsIYq2tJgf6IfttPZ6j/planning-and-installation/aws/weka-installation-on-aws-using-terraform "mention").\
  This option is ideal if you are new to the process and require detailed guidance, including explanations of the concepts involved.
* Follow the concise step-by-step instructions provided below.\
  This option assumes you are already familiar with the subject and are looking for a quick reference.

**Step 1: Set up the Terraform working directory**

1. Create a directory to use as your Terraform working directory.
2. Inside this directory, create a file named `main.tf` and paste the following configuration:

<pre class="language-hcl"><code class="lang-hcl">provider "aws" {
    region = "&#x3C;AWS region>"
}

<strong>module "deploy_weka" {
</strong>  source                        = "weka/weka/aws"
  weka_version                  = "&#x3C;WEKA SW Version>"
  get_weka_io_token             = "&#x3C;get weka token>"
  key_pair_name                 = "&#x3C;key pair>"
  prefix                        = "&#x3C;cluster prefix>"
  cluster_name                  = "&#x3C;cluster name>"
  cluster_size                  = 6
  instance_type                 = "i3en.6xlarge"
  sg_ids                        = ["sg-xxxxxxxxxxxxxxxxx"]
  subnet_ids                    = ["subnet-xxxxxxxxxxxxxxxxx"]
  vpc_id                        = "vpc-xxxxxxxxxxxxxxxxx"
  alb_additional_subnet_id      = "subnet-yyyyyyyyyyyyyyyyyy"
  use_placement_group           = false
  assign_public_ip              = false
  set_dedicated_fe_container    = false
  secretmanager_create_vpc_endpoint = true
  tiering_enable_obs_integration = true
}

output "deploy_weka_output" {
  value = module.deploy_weka
}
</code></pre>

**Step 2: Update the Terraform configuration file**

Update the following variables in the `main.tf` file with the required values:

| Variable                   | Description                                                                                                                                                                                                                                         | Example/Options                                                                               |
| -------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- |
| AWS region                 | AWS region to deploy WEKA                                                                                                                                                                                                                           | Example: eu-west-1                                                                            |
| `weka_version`             | WEKA software version to deploy. Must be version `4.2.15` or later. Available at <https://get.weka.io/>.                                                                                                                                            | Example: `4.4.2`                                                                              |
| `get_weka_io_token`        | Token retrieved from <https://get.weka.io/>.                                                                                                                                                                                                        | Example: `H5BPF1ssQrstCVz@get.weka.io`                                                        |
| `key_pair_name`            | Name of an existing EC2 key pair for SSH access.                                                                                                                                                                                                    |                                                                                               |
| `prefix`                   | <p>A prefix used for naming cluster resources.</p><p>The prefix and <code>cluster\_name</code> are concatenated with a hyphen (<code>-</code>) to form the names of resources created by the WEKA Terraform module.</p>                             | Example: `weka`                                                                               |
| `cluster_name`             | Suffix for cluster resource names.                                                                                                                                                                                                                  | Example: `cluster`                                                                            |
| `cluster_size`             | <p>Number of instances for the WEKA cluster backends.</p><p>Minimum: <code>6</code>.</p>                                                                                                                                                            | Example: `6`                                                                                  |
| `instance_type`            | Instance type for WEKA cluster backends.                                                                                                                                                                                                            | Options: `i3en.2xlarge`, `i3en.3xlarge`, `i3en.6xlarge`, `i3en.12xlarge`, or `i3en.24xlarge`. |
| `sg_ids`                   | A list of security group IDs to attach to the WEKA cluster backends. These security groups must allow traffic between the WEKA backends and the AWS ParallelCluster nodes.                                                                          | Example: `["sg-1d2esy4uf63ps5", "sg-5s2fsgyhug3tps9"]`                                        |
| `subnet_ids`               | <p>A list of private subnet IDs where the WEKA cluster backends will be deployed.</p><p>Use the same VPC and subnet as the AWS ParallelCluster cluster, configured for private communication.</p>                                                   | Example: `["subnet-0a1b2c3d4e5f6g7h8", "subnet-1a2b3c4d5e6f7g8h9"]`                           |
| `vpc_id`                   | <p>The ID of the Virtual Private Cloud (VPC) where the WEKA cluster will be deployed. This is the same VPC used by the AWS ParallelCluster cluster.</p><p>The VPC must accommodate subnets, security groups, and other related resources.</p>       | Example: `vpc-123abc456def`                                                                   |
| `alb_additional_subnet_id` | Private subnet ID in a different availability zone for load balancing.                                                                                                                                                                              | Example: `[subnet-9a8b7c6d5e4f3g2h1]`                                                         |

**Step 3: Deploy the WEKA cluster**

1. Initialize Terraform:

   ```bash
   terraform init
   ```
2. Plan the deployment:

   ```bash
   terraform plan
   ```
3. Apply the configuration to deploy the WEKA cluster:

   <pre class="language-bash"><code class="lang-bash"><strong>terraform apply
   </strong></code></pre>

   Confirm the deployment when prompted.

**Step 4: Verify deployment**

Ensure the WEKA cluster is fully deployed before proceeding. It may take several minutes for the configuration to complete after Terraform finishes executing.

### Prepare for AWS ParallelCluster deployment

**Step 1: AWS ParallelCluster CLI installation**

If you already have the AWS ParallelCluster CLI installed, skip to the next step. Otherwise, follow the procedure [Installing the AWS ParallelCluster command line interface (CLI)](https://docs.aws.amazon.com/parallelcluster/latest/ug/install-v3-parallelcluster.html) in the AWS documentation.

**Step 2: Create S3 bucket to store AWS ParallelCluster integration scripts**

1. Create an S3 bucket in the same region where AWS ParallelCluster will be deployed.

```
aws s3 mb s3://<bucket name> --region <region>
```

2. Verify bucket creation

```
aws s3 ls | grep <bucket name>
```

**Step 3: Clone WEKA Cloud-Solutions repository and copy integration scripts to S3**

1. Clone the WEKA Cloud-Solutions repository and `cd` to the `aws/parallelcluster/` directory:

```
git clone https://github.com/weka/cloud-solutions.git
cd cloud-solutions/aws/parallelcluster
```

2. Upload integration scripts to S3

```
aws s3 cp ./scripts/weka-install.py s3://<bucket name>/scripts/weka-install.py 
aws s3 cp ./scripts/virtualenv-setup.sh s3://<bucket name>/scripts/virtualenv-setup.sh 
```

3. Create IAM policy for WEKA clients using provided template

```
aws iam create-policy --policy-name weka-client-pcluster --policy-document file://./iam/example-pcluster-policy.json
```
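
The ARN of the policy created above is needed later when updating the cluster template. The following is a minimal sketch for looking it up by name, assuming the policy name `weka-client-pcluster` used in the previous command. The function name `weka_policy_arn` is hypothetical and only wraps the standard `aws iam list-policies` call.

```bash
# Hypothetical helper: print the ARN of a customer-managed IAM policy by name.
# Assumes the AWS CLI is installed and configured for the target account.
weka_policy_arn() {
  policy_name="$1"
  aws iam list-policies --scope Local \
    --query "Policies[?PolicyName=='${policy_name}'].Arn" \
    --output text
}

# Usage:
# weka_policy_arn weka-client-pcluster
```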

**Step 4: Modify AWS ParallelCluster template**

The repository includes an example cluster template file. Follow the steps below to modify the template for your AWS environment. If you already have a cluster template, combine it with the example one provided.
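
The `sed -i ''` form used in the following steps is the BSD/macOS syntax. GNU sed on Linux expects `-i` without the empty argument. The sketch below is one possible wrapper that selects the right form at run time; the function name `sed_inplace` is hypothetical and not part of the repository.

```bash
# Hypothetical helper: in-place sed that works with both GNU and BSD sed.
# GNU sed accepts --version; BSD/macOS sed does not, so use that to detect it.
sed_inplace() {
  if sed --version >/dev/null 2>&1; then
    sed -i "$@"        # GNU sed: no backup-suffix argument
  else
    sed -i '' "$@"     # BSD/macOS sed: empty backup suffix
  fi
}

# Usage:
# sed_inplace -e 's/Region: us-east-2/Region: eu-west-1/' pcluster.yaml
```

For values that contain forward slashes, such as a mount point, a different delimiter avoids the backslash escaping, for example `sed_inplace -e 's|--mount-point=/mnt/weka|--mount-point=/mnt/data|g' pcluster.yaml` (the path `/mnt/data` is only an illustration).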

1. Create a copy of the example template file.

```
cp example-pcluster-template.yaml pcluster.yaml
```

2. Update Region.

```
sed -i '' -e 's/Region: us-east-2/Region: <region>/' pcluster.yaml
```

3. Update Networking settings.

```
sed -i '' -e 's/subnet-123456789abcdefg/<your subnet>/g' pcluster.yaml
sed -i '' -e 's/sg-123456789abcdefg/<your security group>/g' pcluster.yaml
```

4. Update SSH KeyName.

```
sed -i '' -e 's/support_key/<your SSH KeyPair Name>/g' pcluster.yaml
```

5. Update S3 bucket name.

```
sed -i '' -e 's/MY-S3-BUCKET/<s3 bucket name>/g' pcluster.yaml
```

6. Update ALB DNS Name.

```
sed -i '' -e 's/internal-weka-lb-12345689.us-east-2.elb.amazonaws.com/<WEKA ALB DNS NAME>/g' pcluster.yaml
```

7. Update WEKA filesystem name (optional).

```
sed -i '' -e 's/--filesystem-name=default/--filesystem-name=<filesystem name>/g' pcluster.yaml
```

8. Update WEKA mount point (optional). (Escape each forward slash in the mount point path with a backslash.)

```
sed -i '' -e 's/--mount-point=\/mnt\/weka/--mount-point=<mount point>/g' pcluster.yaml
```

9. Update IAM Policy ARN. (Escape the forward slash in the AWS ARN with a backslash.)

```
sed -i '' -e 's/arn:aws:iam::123456789:policy\/weka-pcluster-client-policy/<IAM policy ARN>/g' pcluster.yaml
```

10. Update SlurmQueues.

While the example template shows two queues to demonstrate a common customer setup, you can configure as few as one queue. If you do implement multiple queues, update each queue's configuration.

Review and update the following parameters as necessary:

* **Name**
* **ComputeResources > Name**
* **InstanceType**
* **MinCount**
* **MaxCount**
* **CustomSlurmSettings > RealMemory**
  * This value is specific to the instance type. Refer to the table below for the correct value.
* **CustomSlurmSettings > CpuSpecList**
  * This value is specific to the instance type. Refer to the table below for the correct value.
* **OnNodeConfigured > Sequence > Script > Args > --cores**
  * This value is specific to the instance type. Refer to the table below for the correct value.

If the `--cores` argument is defined, the WEKA mount is created using a **DPDK mount** for optimal performance. If the argument is not defined, the WEKA mount defaults to **UDP mode**. **DPDK is preferred** for all instances to achieve higher storage performance. However, certain instances, such as HeadNodes, can use a UDP mount if necessary.

Additional instance types:\
If your instance type is not listed in the table below, contact the [Customer Success Team](https://app.gitbook.com/o/-L7Tp-Uy9BMSCSCx0MlK/s/uOB5D2WMRjXBnChPFTrk/) for assistance.

| Instance Type  | CpuSpecList | RealMemory |
| -------------- | ----------- | ---------- |
| hpc7a.96xlarge | 95,191      | 742110     |
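
Before deploying, it can help to confirm that none of the example values from the template remain in `pcluster.yaml`. The following is a minimal sketch that searches for the placeholder strings replaced in the steps above. The function name `check_placeholders` is hypothetical, and the default filesystem name and mount point are not checked because they are valid values.

```bash
# Hypothetical helper: report example values still present in the template.
# Returns 0 if none are found, 1 otherwise.
check_placeholders() {
  file="$1"
  found=0
  for p in 'subnet-123456789abcdefg' 'sg-123456789abcdefg' 'support_key' \
           'MY-S3-BUCKET' 'internal-weka-lb-12345689' \
           'arn:aws:iam::123456789:policy'; do
    # -F treats the pattern as a fixed string, not a regular expression
    if grep -F -q -e "$p" "$file"; then
      echo "placeholder still present: $p"
      found=1
    fi
  done
  return "$found"
}

# Usage:
# check_placeholders pcluster.yaml && echo "template looks complete"
```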

### Verify security group configurations

Ensure that the AWS ParallelCluster nodes can connect with the WEKA backends on port 14000 for both TCP and UDP. Review your security group settings to confirm that the WEKA clients can communicate with the WEKA backends effectively.
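
As a rough connectivity check, the sketch below attempts a TCP connection to port 14000 from a node in the same subnet. It assumes `bash` and the GNU `timeout` utility are available, and the function name `check_tcp_port` is hypothetical. It covers TCP only; UDP reachability cannot be confirmed this way and still depends on the security group rules.

```bash
# Hypothetical helper: return 0 if a TCP connection to host:port succeeds.
# Arguments: host, port (default 14000), timeout in seconds (default 3).
check_tcp_port() {
  host="$1"
  port="${2:-14000}"
  wait_s="${3:-3}"
  # bash opens a TCP socket when redirecting to /dev/tcp/<host>/<port>
  timeout "$wait_s" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Usage (replace the placeholder with a WEKA backend IP address):
# check_tcp_port <backend ip> && echo "TCP 14000 reachable"
```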

### Deploy AWS ParallelCluster cluster

1. Run create-cluster

{% code overflow="wrap" %}

```
pcluster create-cluster -c pcluster.yaml --cluster-name your-cluster-name --rollback-on-failure FALSE
```

{% endcode %}

2. If errors occur, keep `--rollback-on-failure` set to `FALSE`, as in the command above, so that the resources of the failed deployment remain available for debugging.
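
Cluster creation takes some time. The following is a minimal sketch for polling the cluster status until creation finishes, assuming the AWS ParallelCluster v3 CLI and its `describe-cluster` command with the `--query` option. The function name `wait_for_cluster` and the 30-second default interval are illustrative choices.

```bash
# Hypothetical helper: poll until the cluster reaches CREATE_COMPLETE.
# Arguments: cluster name, polling interval in seconds (default 30).
wait_for_cluster() {
  name="$1"
  interval="${2:-30}"
  while true; do
    # The CLI prints the status as a JSON string; strip quotes and whitespace.
    status="$(pcluster describe-cluster --cluster-name "$name" \
      --query clusterStatus | tr -d '"[:space:]')"
    case "$status" in
      CREATE_COMPLETE) echo "ready"; return 0 ;;
      CREATE_IN_PROGRESS) sleep "$interval" ;;
      *) echo "unexpected status: $status" >&2; return 1 ;;
    esac
  done
}

# Usage:
# wait_for_cluster your-cluster-name
```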
