Azure CycleCloud for SLURM and WEKA Integration

Learn to integrate Azure CycleCloud with the WEKA Data Platform and SLURM scheduler to streamline HPC cluster management and enable high-performance, scalable data solutions for AI, ML, and analytics.

Last updated 5 months ago

Introduction

The integration of Azure CycleCloud with the WEKA Data Platform delivers a robust, high-performance solution tailored for data-intensive workloads in High-Performance Computing (HPC) environments.

Azure CycleCloud simplifies the orchestration and management of HPC clusters on Azure, providing features such as dynamic autoscaling and streamlined configuration management for complex deployments. Paired with WEKA, users benefit from a high-performance, scalable file system designed to handle low-latency, high-throughput workloads, making it ideal for applications in AI, analytics, machine learning, and other HPC domains.

This document provides a step-by-step guide to integrating WEKA with your CycleCloud environment using the SLURM scheduler, enabling seamless data access and management for HPC workloads.

What is Azure CycleCloud?

Azure CycleCloud is a comprehensive solution for orchestrating and managing High-Performance Computing (HPC) environments in Azure. It enables users to:

  • Provision infrastructure: Quickly set up the compute and storage resources required for HPC workloads.

  • Deploy familiar HPC schedulers: Integrate with widely used schedulers like SLURM, Grid Engine, or HPC Pack.

  • Scale efficiently: Automatically scale infrastructure to handle jobs of varying sizes, optimizing resource utilization and cost.

  • Simplify file system integration: Create and mount different types of file systems onto compute cluster nodes to support demanding HPC applications.

CycleCloud also enhances HPC environments by deploying autoscaling plugins on supported schedulers. This eliminates the need for users to develop and manage complex autoscaling logic, allowing them to focus on scheduler-level configurations they already know.

For more details, refer to the official Azure CycleCloud documentation: Azure CycleCloud Overview.

What is SLURM?

SLURM (Simple Linux Utility for Resource Management) is a widely adopted open-source workload manager designed for High-Performance Computing (HPC), Artificial Intelligence (AI), and cloud computing environments. It enables users to efficiently run large-scale parallel and distributed applications across clusters of compute nodes.

Key features of SLURM include:

  • Job scheduling: Manages and prioritizes job execution based on resource availability and user-defined policies.

  • Resource management: Allocates and tracks compute resources such as CPUs, GPUs, and memory.

  • Fault tolerance: Supports mechanisms for recovering jobs and managing failures.

  • Power management: Optimizes energy use by powering nodes up or down based on workload demands.

SLURM is trusted by many of the world’s top supercomputers, research institutes, universities, and enterprises due to its scalability and flexibility.

In the context of Azure CycleCloud and WEKA, SLURM is currently the only scheduler supported for integration. However, support for additional schedulers is planned for future releases.

Solution overview

This architecture demonstrates how Azure CycleCloud integrates with the WEKA Data Platform and SLURM to deliver a scalable, high-performance solution for High-Performance Computing (HPC) and High-Throughput Computing (HTC) workloads. The process includes four key steps:

  1. Job submission: Users submit HPC jobs through the SLURM scheduler, specifying the number of Azure Virtual Machines (VMs) to deploy. Azure CycleCloud provisions the required compute nodes using Virtual Machine Scale Sets (VMSS), ensuring resources match workload demands.

  2. Automatic WEKA mounting: During VM initialization, a cluster-init module automatically mounts the WEKA filesystem on each compute node, enabling seamless access to high-performance storage.

  3. Job execution: Jobs are executed using the WEKA Data Platform, which provides a unified, high-speed storage layer. The platform combines NVMe performance with Azure Blob Storage scalability and cost efficiency, ensuring optimal performance for HPC/HTC workloads.

  4. Data persistence: After job completion, data remains stored on the WEKA platform. This ensures continuity, allowing users to retain data for future analysis, migrate it to Azure Blob Storage for long-term archiving, or redeploy nodes for further computation.

By combining CycleCloud’s dynamic compute provisioning and scaling with WEKA’s advanced storage capabilities, this solution offers a robust and efficient framework for HPC and HTC applications.

Prerequisites

Before proceeding, ensure that both Azure CycleCloud and WEKA are installed and configured in your Azure environment. This document focuses on integrating these two solutions.

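If either solution is not yet installed, refer to the following resources to complete the installation:

  • Azure CycleCloud: Azure CycleCloud Installation Guide

  • WEKA on Azure: WEKA Installation on Azure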

Once both solutions are installed, you can proceed with the integration workflow.

Workflow: Integrate Azure CycleCloud with WEKA

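The integration of Azure CycleCloud with the WEKA Data Platform involves five key steps:

  1. Download the Azure CycleCloud/WEKA integration template: Obtain the pre-built template that simplifies the integration process.

  2. Configure network parameters for DPDK: Adjust the network settings on the Azure CycleCloud nodes to enable the Data Plane Development Kit (DPDK), ensuring optimal data transfer performance.

  3. Deploy the cluster initialization module: Create and deploy the cluster-init module on the Azure CycleCloud nodes to automatically configure WEKA integration.

  4. Set up the WEKA blade: Configure the WEKA blade using the CycleCloud/WEKA template installed in step 1 to finalize the integration.

  5. Test the integration: Verify the integration of Azure CycleCloud, SLURM, and the WEKA Data Platform.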

Step 1: Download the Azure CycleCloud/WEKA integration template

Obtain the pre-built CycleCloud/WEKA template that simplifies the integration process.

Procedure

  1. Log in to the Azure CycleCloud Virtual Machine (VM): Access the VM where Azure CycleCloud is installed.

  2. Retrieve the official template: Browse to https://github.com/themorey/cyclecloud-weka and copy the URL to the clipboard.

  3. Clone the CycleCloud/WEKA integration template repository from GitHub to your CycleCloud instance using the previously copied URL:

     git clone https://github.com/themorey/cyclecloud-weka.git

  4. Import the template: Navigate to the cloned repository directory and import the slurm-weka template into Azure CycleCloud:

     cyclecloud import_template -f /home/weka/cyclecloud-weka/templates/slurm-weka.txt

  5. Verify the template in the CycleCloud GUI: Once the template is successfully imported, it appears in the CycleCloud GUI under the templates section.

  6. Review the template configuration: Click the newly imported template. It includes a section labeled Weka Cluster Info. You will configure this section in a later step of this guide.
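The clone and import commands above can be combined into a small helper. The following is a minimal sketch, not part of the official template: the function name print_import_plan and the default clone directory are hypothetical, and the function only prints the commands so that you can review them before running them.

```shell
#!/bin/bash
# Sketch: print the commands that fetch and import the slurm-weka template.
# REPO_URL and the template path come from this guide; print_import_plan and
# the default clone directory are illustrative assumptions.
REPO_URL="https://github.com/themorey/cyclecloud-weka.git"

print_import_plan() {
  local clone_dir="${1:-$HOME/cyclecloud-weka}"
  # Suggest a clone only when the directory does not exist yet.
  if [ ! -d "$clone_dir" ]; then
    echo "git clone $REPO_URL $clone_dir"
  fi
  echo "cyclecloud import_template -f $clone_dir/templates/slurm-weka.txt"
}

print_import_plan /home/weka/cyclecloud-weka
```

Because the helper skips the clone line when the directory already exists, it can be rerun on a CycleCloud VM that already holds a copy of the repository.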

Step 2: Configure network parameters for DPDK

The WEKA data platform leverages the Data Plane Development Kit (DPDK) to achieve high performance and low latency across all hosts. By using DPDK, WEKA's filesystem (WekaFS) bypasses the host kernel's traditional networking stack. This enables direct communication with the Network Interface Card (NIC) in user space, reducing latency by eliminating context switches and data copying. The result is significantly improved throughput and efficiency.

To fully use DPDK, each host requires two NICs. These dual NICs allow load balancing and facilitate the segregation of network traffic types, such as data traffic and management traffic, ensuring optimal performance.

For step-by-step guidance on enabling DPDK and configuring dual NICs for high-performance scenarios, refer to the WEKA networking topic.

Procedure

  1. Log in to the Azure CycleCloud VM: Access the VM where Azure CycleCloud is installed.

  2. Open the CycleCloud/WEKA template: Navigate to the template downloaded in Step 1, and open it using a text editor.

  3. Modify the template to support dual NICs: Locate the section labeled [[nodearraybase]] and add the following configuration for dual network interfaces:

    [[[network-interface eth0]]]
    AssociatePublicIpAddress = $ExecuteNodesPublic
    SubnetId = $SubnetId
    AcceleratedNetworking = true

    [[[network-interface eth1]]]
    SubnetId = $SubnetId

     • You can apply these network parameters to individual nodes (for example, HPC, HTC, dynamic) or add them to the [[nodearraybase]] configuration.

     • If other nodes reference [[nodearraybase]] using Extends = nodearraybase, they inherit this configuration automatically.

  4. Save and apply the changes: Save the modified template and ensure it is uploaded to your CycleCloud instance.

After completing these steps, your CycleCloud nodes are provisioned with two NICs, enabling DPDK to optimize performance for the WEKA Data Platform.
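To confirm the dual-NIC requirement on a provisioned node, you could count its network interfaces. The following is a minimal sketch under stated assumptions: count_nics is a hypothetical helper, and it relies only on the Linux convention that /sys/class/net holds one entry per interface. Virtual or bonded devices also count, so the result can exceed two.

```shell
#!/bin/bash
# Sketch: count the non-loopback network interfaces visible to a node.
# count_nics is a hypothetical helper; the directory argument defaults to
# /sys/class/net, where Linux lists one entry per network interface.
count_nics() {
  local net_dir="${1:-/sys/class/net}"
  local count=0
  local entry
  for entry in "$net_dir"/*; do
    [ -e "$entry" ] || continue                 # directory is empty or missing
    [ "$(basename "$entry")" = "lo" ] && continue  # skip the loopback device
    count=$((count + 1))
  done
  echo "$count"
}

if [ "$(count_nics)" -ge 2 ]; then
  echo "OK: at least two NICs present"
else
  echo "WARNING: fewer than two NICs found; the dual-NIC setup may be incomplete"
fi
```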

Step 3: Deploy the cluster initialization module

The cluster initialization module ensures that each node in the Azure CycleCloud environment is configured to integrate seamlessly with the WEKA Data Platform.

Procedure

1. Create the cluster initialization script

Copy the following shell script and save it to the scripts directory on your Azure CycleCloud VM:

#!/bin/bash  
set -ex  

# Add specific commands here to configure each node for WEKA  
# Example: mounting storage, installing dependencies, or setting environment variables

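After you save the script, it appears in the scripts directory. For example, for the HTC specification:

weka@cyclecloud-vm://home/weka/cyclecloud-weka/specs/htc/cluster-init/scripts$ ls
001-htc-cluster-init.sh  README.txt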

Depending on your deployment, you may create separate CycleCloud specifications for each node array (for example, HPC and HTC nodes) and provide a distinct cluster-init script for each array.
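The placeholder comment in the script above stands for the node-specific commands, the central one being the WEKA mount. The following is a minimal sketch of how that command might be assembled, not the script shipped with the template. build_weka_mount_cmd is a hypothetical helper, the addresses, filesystem name, mount point, and NIC are placeholders, and the sketch assumes the WEKA client software is already installed on the node. It prints the command instead of running it.

```shell
#!/bin/bash
# Sketch of the mount logic a cluster-init script might contain.
# build_weka_mount_cmd is a hypothetical helper. It only prints the command;
# the addresses, filesystem name, mount point, and NIC below are placeholders.
build_weka_mount_cmd() {
  local addresses="$1"    # comma-separated backend IPs, e.g. 10.0.0.4,10.0.0.5
  local filesystem="$2"   # WEKA filesystem name
  local mount_point="$3"  # local directory to mount on
  local netdev="$4"       # NIC dedicated to DPDK traffic
  echo "mount -t wekafs -o net=$netdev $addresses/$filesystem $mount_point"
}

build_weka_mount_cmd "10.0.0.4,10.0.0.5" "default" "/mnt/weka" "eth1"
```

Keeping the command construction in one function makes it easier to reuse the same logic in separate scripts for HPC and HTC node arrays.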

2. Add the script to the cluster configuration

  1. Access the cluster configuration in the CycleCloud GUI

    1. Log in to the CycleCloud GUI.

    2. Navigate to the cluster configuration you wish to edit.

    3. Click Edit to modify the settings.

  2. Attach the cluster initialization script

    1. In the Advanced Settings section, scroll to the Cluster Init section near the bottom of the page.

    2. Click Browse and navigate to the saved script location on the CycleCloud VM.

    3. Select the script to apply it to the desired node array.

    Example configuration: You can deploy the same script for multiple node arrays (for example, both HTC and HPC nodes) or assign unique scripts to different arrays.

  3. Save changes

    • Click Save and exit the Edit Configuration panel.

After completing these steps, the cluster initialization module is deployed to your CycleCloud nodes, ensuring they are properly configured during startup.

Step 4: Configure the WEKA blade on the CycleCloud/WEKA template

The cluster-init script used in the previous step requires specific configuration parameters, including the IP addresses of the WEKA storage platform, the mount point for the nodes, and the WEKA filesystem name.

Procedure

  1. Retrieve WEKA configuration details

    1. Log in to the WEKA GUI: Navigate to the Cluster Servers section and note the IP addresses of the WEKA backend servers.

    2. Select or create a filesystem:

      1. In the WEKA GUI, go to the Filesystems section.

      2. Identify the filesystem you want to mount to the CycleCloud VMs. You can select an existing filesystem or create a new one for this purpose.

  2. Populate the WEKA Blade in CycleCloud

    1. Open the WEKA Blade Configuration: In the CycleCloud GUI, click Edit on the cluster configuration. Navigate to the WEKA Cluster Info section.

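    2. Fill in the required parameters:

      • WEKA addresses: Enter the IP addresses of the WEKA backend servers from step 1 of this procedure. Separate multiple IP addresses with commas.

      • WEKA filesystem: Enter the name of the selected WEKA filesystem from step 1 of this procedure.

      • Mount point: Specify the desired mount point for the nodes.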

  3. Save the configuration

    1. Click Save to apply the changes.

    2. Exit the Edit Configuration panel and return to the CycleCloud GUI.

Your CycleCloud nodes are now configured to automatically connect to the specified WEKA filesystem during initialization. This completes the integration process.
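A malformed address list is an easy mistake to make when filling in the WEKA Cluster Info section. The following is a minimal sketch of a check you could run on the value before entering it. validate_weka_addresses is a hypothetical helper, not part of the template, and it accepts only a comma-separated list of IPv4 addresses without spaces.

```shell
#!/bin/bash
# Sketch: check that a value is a comma-separated list of IPv4 addresses.
# validate_weka_addresses is a hypothetical helper, not part of the template.
validate_weka_addresses() {
  local list="$1"
  local addr octet
  local -a addrs
  [ -n "$list" ] || return 1
  IFS=',' read -ra addrs <<< "$list"
  for addr in "${addrs[@]}"; do
    # Each entry must be four dot-separated groups of 1 to 3 digits.
    [[ "$addr" =~ ^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})$ ]] || return 1
    for octet in "${BASH_REMATCH[@]:1}"; do
      # Force base 10 so that a value such as 08 is not read as octal.
      [ "$((10#$octet))" -le 255 ] || return 1
    done
  done
}

if validate_weka_addresses "10.0.0.4,10.0.0.5,10.0.0.6"; then
  echo "address list looks valid"
fi
```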

Step 5: Test the integration

To validate the integration, run a SLURM job across multiple nodes and confirm that each node connects directly to the WEKA Data Platform through the specified mount point.

Procedure

  1. Run a SLURM job:

    1. Log in to the Scheduler VM.

    2. Submit a SLURM job. For example, run a batch HTC job using 3 nodes:

      sbatch <job-script>.sh

    3. Verify that 3 HTC nodes are activated in CycleCloud.

  2. Monitor cluster initialization (optional):

    1. Log in to one of the HTC nodes.

    2. Navigate to the cluster-init logs:

      cd /opt/cycle/jetpack/logs/cluster-init/<weka-template>/<spec>/scripts  
    3. Use tail to monitor the script's progress and confirm mounting to WEKA:

      tail -f <script-name>  

  3. Verify on the WEKA GUI:

    1. Access the WEKA GUI.

    2. Navigate to the Clients section to verify that all nodes are connected and mounted to the WEKA Data Platform.

    3. Ensure all nodes display a green status, indicating successful connectivity.

  4. Once the nodes are mounted and operational, the integration is confirmed, and you can proceed with HPC analysis.
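The test job in the procedure above can be any batch script. The following is a minimal sketch of one that reports, per node, whether a WEKA mount is present. make_weka_check_job is a hypothetical helper, /mnt/weka stands in for the mount point you configured, and the htc partition name and the wekafs filesystem type in /proc/mounts are assumptions to verify in your environment.

```shell
#!/bin/bash
# Sketch: generate a Slurm batch script that checks the WEKA mount per node.
# make_weka_check_job is a hypothetical helper; the mount point, node count,
# and partition name are placeholders to adapt to your cluster.
make_weka_check_job() {
  local mount_point="$1"
  local nodes="${2:-3}"
  local partition="${3:-htc}"   # assumed partition name
  cat <<EOF
#!/bin/bash
#SBATCH --job-name=weka-mount-check
#SBATCH --partition=$partition
#SBATCH --nodes=$nodes
#SBATCH --ntasks-per-node=1
# Run one task per node and report whether a wekafs mount is present.
srun bash -c 'grep -q " $mount_point wekafs " /proc/mounts && echo "\$(hostname): mounted" || echo "\$(hostname): NOT mounted"'
EOF
}

make_weka_check_job /mnt/weka 3
```

To use the sketch, redirect the output to a file and submit that file with sbatch. The job output then lists one line per node.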
