NeuralMesh Axon deployment

Deploy NeuralMesh Axon and integrate it with orchestration services such as Kubernetes and Slurm. This content is intended for partners and skilled customers performing independent installations.

NeuralMesh Axon deployment overview

NeuralMesh Axon offers a flexible deployment model, supporting both Slurm and Kubernetes runtime environments. Regardless of the deployment method, specific infrastructure configurations are required to ensure high performance and stability.

Infrastructure prerequisites

Before deploying software, the physical and network environment must be prepared:

  • Network: The topology must be non-blocking with zero over-subscription. Jumbo frames (9k MTU for Ethernet, 4k for InfiniBand) and Source Based Routing policies are required.

  • Hardware: Servers require AMD or Intel CPUs with contention-free allocation per NUMA domain and NVMe storage with Power Loss Protection (PLP).

  • Linux configuration: All deployments require specific Linux kernel settings, including disabling swap partitions, disabling kernel NUMA balancing, and setting noexec on /tmp. BIOS settings must disable hyperthreading and enable maximum performance.

  • Firewall: Ensure traffic is allowed on essential ports, such as 14000-16100 (Core traffic), 443 (Traces), and 8200-8201 (Vault).

Deployment options

Administrators can choose between a Kubernetes-native deployment or a direct installation on servers running Slurm.

Option A: Kubernetes deployment

This method uses the WEKA Operator to manage lifecycle and automation via Custom Resource Definitions (CRDs).

  • WEKA Operator and CSI: The WEKA Operator and CSI plugin are installed through Helm charts to manage the cluster state.

  • Node preparation: Kubernetes nodes are labeled (weka.io/supports-backends=true and weka.io/supports-clients=true) to control pod placement.

  • SELinux: For Kubernetes clusters, a specific SELinux policy (container_use_wekafs) must be installed on nodes if SELinux support is enabled.

  • Custom Resources: The cluster is defined and deployed by applying WekaCluster and WekaClient CRDs, which specify container counts and resources.

Option B: Slurm deployment (without Kubernetes)

This method involves manually distributing the software and running scripts on the servers that run Slurm.

  • Distribution: The software release tarball is downloaded and distributed to all cluster servers using tools like pdcp.

  • Installation: An installation script configures the container topology (Drives, Compute, and Frontend containers), binding specific CPU cores and NICs to NeuralMesh processes.

  • Service management: Systemd overrides are created to ensure graceful shutdown, and the cluster IO is started manually via CLI (weka cluster start-io).

Workload integration

NeuralMesh Axon integrates with schedulers to optimize resource usage. To prevent resource conflicts with Slurm, configure slurm.conf to exclude the specific CPU cores and memory reserved for NeuralMesh Axon using CpuSpecList and MemSpecLimit.

Multi-tenancy

Subdivide the cluster into Organizations to support multi-tenancy. Security policies enforce boundaries based on user roles and network CIDRs.

NeuralMesh Axon deployment requirements

System requirements overview

NeuralMesh Axon requires specific infrastructure configurations for both Linux servers and Kubernetes runtime environments.

  • Network infrastructure: Configure the following network settings:

    • Non-blocking network topology with zero over-subscribed segments.

    • Jumbo frames enabled: 9k MTU for Ethernet, 4k MTU for InfiniBand.

    • Source Based Routing policies applied to each dataplane network device.

  • Hardware requirements: Deploy servers that meet these specifications:

    • CPU architecture: AMD (single socket required) or Intel (dual-socket supported).

    • CPU allocation: Contention-free CPU allocation for each NUMA domain.

    • Memory: Sufficient RAM to allocate HugeTLB pages at runtime. For details, see Memory resource planning.

    • Storage: Adequate NVMe capacity for the selected protection scheme.

  • Software prerequisites: Install and configure these software components:

    • Unified cgroup hierarchy (cgroup v2) enabled.

    • Linux kernel headers matching the booted kernel version.

  • Additional Linux server configuration requirements: see Linux system configuration.

Kubernetes-specific requirements

Ensure your Kubernetes environment meets specific driver and security configurations before deployment to guarantee optimal performance and compatibility.

NIC driver requirements

Review the supported drivers and specific version requirements for Ethernet and InfiniBand configurations.

Mellanox OFED Ethernet and InfiniBand drivers

Kubernetes deployments may use NeuralMesh Axon Core releases that are one major version behind Slurm deployments. Whether a separate Mellanox OFED driver installation is required depends on the Core release, because older releases do not bundle the drivers in the Core codebase:

  • Core versions 5.0 and later: No OFED driver installation is required because the drivers are bundled with Core.

  • Core versions before 5.0: You must install a qualified OFED driver.

    • Latest date-versioned release: OFED 24.04-0.7.0.0.

    • Latest semantic-versioned release: OFED 5.9-0.5.6.0.

For more details, see Ethernet drivers and configurations and InfiniBand drivers and configurations.

ENA drivers

  • Supported versions: 1.0.2 through 2.0.2.

  • Recommendation: Use the current driver from the official OS repository.

ixgbevf drivers

  • Supported versions: 3.2.2 through 4.1.2.

  • Recommendation: Use the current driver from the official OS repository.

Enable SELinux support for NeuralMesh Axon on Kubernetes

Configure SELinux support for NeuralMesh Axon clusters hosted in a Kubernetes runtime environment. This process requires installing a specific SELinux policy on each participating Kubernetes node and updating the deployment configuration.

Procedure:

  1. Install the SELinux policy on the Kubernetes nodes: Apply the policy package to each node participating in the NeuralMesh Axon system using the files from the public GitHub repository.

    1. Clone the repository: git clone https://github.com/weka/csi-wekafs.git

    2. Copy the SELinux files to the relevant nodes (example using pdcp): pdcp -w neuralmesh-axon-[001-200] -r csi-wekafs/selinux ~/

    3. Install the SELinux module (example using pdsh): pdsh -w neuralmesh-axon-[001-200] "sudo semodule -i ~/selinux/csi-wekafs.pp". This command defines a new SELinux boolean labeled container_use_wekafs. If the container_use_wekafs boolean does not appear in the boolean list (for example, in the output of getsebool -a), compile the policy package that matches the target Linux server. See Install a custom SELinux policy.

    4. Enable the SELinux boolean: pdsh -w neuralmesh-axon-[001-200] "sudo setsebool container_use_wekafs=on"

  2. Update the WEKA Operator definition: Set the SELinux support flag in the definition file to either mixed or enforced.

    Run the following Helm command to upgrade the plugin, for example, with the enforced setting:
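
    The following is a minimal sketch. The chart repository URL, release name, and namespace are assumptions based on the public csi-wekafs Helm chart; adjust them to match your existing CSI plugin installation, and verify that the chart exposes the SELinux mode through the selinuxSupport value.

```bash
# Sketch only: repository URL, release name, namespace, and the selinuxSupport
# value name are assumptions; align them with your CSI plugin installation.
helm repo add csi-wekafs https://weka.github.io/csi-wekafs
helm upgrade csi-wekafsplugin csi-wekafs/csi-wekafsplugin \
  --namespace csi-wekafs \
  --set selinuxSupport=enforced
```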

SSD requirements

  • SSD configuration: Determine these specifications for each server:

    • Number of SSDs required.

    • Capacity per SSD (total capacity = number × capacity per SSD).

  • SSD feature requirements: Verify that SSDs provide:

    • Power Loss Protection (PLP).

    • Capacity up to 30 TB.

    • Capacity ratio between smallest and largest SSDs not exceeding 8:1 across the cluster.

Network configuration

Configure firewall rules to allow these ports and protocols on all cluster servers:

| Port/NodePort | Protocol | Purpose |
| --- | --- | --- |
| 14000-14100, 15000-15100, 16000-16100 | TCP + UDP | NeuralMesh Axon Core container traffic |
| 443 | TCP | NeuralMesh Axon Core traces remote viewer |
| 22 | TCP | SSH management access |
| 123 | UDP | NTP management access |
| 8200, 8201 | TCP | HashiCorp Vault traffic |
| 5696 | TCP | KMIP traffic |

CPU core management

NeuralMesh Axon co-locates with AI workloads on shared computational hardware.

For non-Kubernetes/bare-metal deployments, allocate dedicated CPU cores to both NeuralMesh Axon and the job-scheduling service to prevent CPU starvation.

For Kubernetes deployments, CPU core allocation is not required.

Linux system configuration

Apply these configurations to all NeuralMesh Axon deployments, regardless of deployment type.

  • CPU and memory settings:

    • Balance CPU cores across NUMA domains containing dataplane network devices.

    • Install Linux kernel header package matching the booted kernel version.

    • Disable NUMA balancing by setting kernel.numa_balancing = 0 and adding the entry to the /etc/sysctl.conf file to ensure persistence (see the example sysctl snippet after this list).

  • Network configuration:

    • Configure Source Based Routing policies for every dataplane NIC on every dataplane subnet in the OS network scripts to enable access without specifying bind/outbound interfaces.

    • Configure ARP settings for each dataplane network device in the /etc/sysctl.conf file, replacing <device> with the actual network interface name (an illustrative snippet appears after this list).

  • System services and settings: Configure these system requirements:

    • Disable swap partitions.

    • Run NTP with synchronized date and time across all cluster servers.

    • Install and enable the rpcbind service, and ensure XFS and SquashFS support is available.

    • Maintain consistent IOMMU settings (BIOS and bootloader) across all cluster servers.

    • Apply noexec mount option to /tmp.

  • BIOS configuration: Configure the BIOS using WEKA's bios_tool to:

    • Select maximum power and performance settings.

    • Disable hyperthreading (HT).

    • Disable Active Management Technology (AMT).
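
The following is a minimal sketch of the persistent kernel settings described above (NUMA balancing, dataplane ARP behavior, and swap). The ARP parameter names and values are illustrative assumptions rather than a vendor-qualified list, and <device> is a placeholder that must be replaced for each dataplane interface.

```bash
# Illustrative settings only: the ARP values are assumptions; replace <device>
# with each dataplane interface name before applying.
cat >> /etc/sysctl.conf <<'EOF'
kernel.numa_balancing = 0
net.ipv4.conf.<device>.arp_announce = 2
net.ipv4.conf.<device>.arp_ignore = 1
net.ipv4.conf.<device>.arp_filter = 1
EOF
sysctl -p            # apply without a reboot

swapoff -a           # disable swap now; also remove swap entries from /etc/fstab
```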

Install NeuralMesh Axon on a Kubernetes environment

Installing NeuralMesh Axon in a Kubernetes environment uses the WEKA Operator, which simplifies the installation, management, and scaling of NeuralMesh Axon deployments within a Kubernetes cluster.

The WEKA Operator uses Kubernetes Custom Resource Definitions (CRDs) to provide resilience, scalability, and automation for each cluster. It provides a cluster-wide service that monitors each Kubernetes namespace containing NeuralMesh Axon custom resources.

When the WEKA Operator detects that the state of the deployed resources does not match the declared state in the resource definition file, it performs one of the following actions:

  • Logs clear and useful messages in the WEKA Operator controller pod.

  • Bootstraps the resource to return it to the desired state, provided the change is additive and non-destructive.

The WEKA Operator automation never makes subtractive or destructive changes to deployed resources.

Compatibility

Verify the environment meets the following minimum version requirements:

  • Kubernetes: Version 1.25 or higher.

  • OpenShift: Version 4.17 or higher.

Network requirements

The following Kubernetes nodePorts are required. These ports use default ranges unless defined differently in the definition files. These ports are not hard-coded into the source code and can be adjusted to match the target Kubernetes cluster service topology.

| Purpose | Source Container(s) | Destination Container(s) | Default Port Ranges | Protocol |
| --- | --- | --- | --- | --- |
| Client connections | Client | Compute, Drives | 45000-65000 | TCP/UDP |
| Cluster allocation | Operator | Compute, Drives, Client | 35000-35499 | TCP/UDP |
| Core traffic | Compute, Drives | Compute, Drives | 35000-35499 | TCP/UDP |

Workflow

  1. Configure Kubelet settings.

  2. Install NeuralMesh Axon.

  3. Label the nodes.

  4. Create registry secrets.

  5. Install the WEKA Operator and CSI Plugin.

  6. Deploy driver and discovery services.

  7. Deploy NeuralMesh Axon Custom Resources.

1. Configure Kubelet settings

Configure the Kubelet with static CPU management to enable exclusive CPU allocation for NeuralMesh Axon performance.

  1. Define the CPU management policy: Configure the Kubelet with the static CPU manager policy (cpuManagerPolicy: static); see the sketch after these steps.

  2. Identify the active Kubelet configuration: Check which configmap holds the current Kubelet configuration.

    If multiple Kubelet configurations exist, modify the configuration specifically for the worker nodes.

  3. Apply the CPU settings: Edit the Kubelet config map to add the required CPU settings.
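
The following sketch assumes a kubeadm-style cluster where the Kubelet configuration lives in a kubelet-config config map in kube-system; the config map name and the reserved CPU range are placeholders to adapt to your distribution.

```bash
# Sketch only: config map name, namespace, and the reserved CPU range are placeholders.
kubectl get cm -n kube-system | grep -i kubelet     # identify the active Kubelet config map
kubectl edit cm kubelet-config -n kube-system
# In the editor, add the following under the KubeletConfiguration section:
#   cpuManagerPolicy: static
#   reservedSystemCPUs: "0-3"    # placeholder; align with the cores reserved for NeuralMesh Axon
```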

2. Install NeuralMesh Axon

The installation of a NeuralMesh Axon cluster in a Kubernetes environment is an automated process. The solution uses three kinds of Kubernetes namespaces:

  • Global namespace for the NeuralMesh Axon WEKA Operator.

  • Global namespace for the CSI-WekaFS CSI plugin.

  • Individual namespaces for each NeuralMesh Axon cluster.

Reliable pod scheduling for these services requires specific Kubernetes labels and selectors on both nodes and resources.

3. Label the nodes

Apply Kubernetes node labels to assign the CSI and NeuralMesh Axon Core cluster pods to specific nodes. Ensure the labels applied to the nodes match the nodeSelector definitions in the Custom Resource Definitions (CRDs).

The following commands demonstrate how to label nodes using the example values supports-clients and supports-backends:

  1. Label nodes for the CSI:

  2. Label nodes for the NeuralMesh Axon Core cluster:
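
A sketch of the labeling commands; the node names are placeholders, and the label keys match the weka.io/supports-clients and weka.io/supports-backends values referenced in the deployment overview.

```bash
# Step 1: label nodes that run the CSI plugin and mount the filesystem as clients
kubectl label node <client-node-name> weka.io/supports-clients=true

# Step 2: label nodes that host the NeuralMesh Axon Core cluster (Compute and Drives containers)
kubectl label node <backend-node-name> weka.io/supports-backends=true
```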

4. Create registry secrets

Create a docker-registry secret in the global namespace for the WEKA Operator and in each namespace hosting a Core cluster.

The following commands are examples.

  1. Export the required variables:

  2. Create the namespace and secret for the WEKA Operator:

  3. Create the secret for NeuralMesh Axon Core:
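
A minimal sketch using standard kubectl commands; the namespace names, secret name, registry server, and credentials are placeholders and must match the values referenced by your WEKA Operator and cluster definition files.

```bash
# Placeholders throughout: adjust namespaces, secret name, and registry credentials.
export REGISTRY_SERVER=<registry-server>
export REGISTRY_USER=<username>
export REGISTRY_PASSWORD=<password>

# Namespace and secret for the WEKA Operator
kubectl create namespace weka-operator-system
kubectl create secret docker-registry weka-image-pull-secret \
  --namespace weka-operator-system \
  --docker-server="$REGISTRY_SERVER" \
  --docker-username="$REGISTRY_USER" \
  --docker-password="$REGISTRY_PASSWORD"

# Repeat the secret in each namespace that hosts a NeuralMesh Axon Core cluster
kubectl create namespace weka-cluster
kubectl create secret docker-registry weka-image-pull-secret \
  --namespace weka-cluster \
  --docker-server="$REGISTRY_SERVER" \
  --docker-username="$REGISTRY_USER" \
  --docker-password="$REGISTRY_PASSWORD"
```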

5. Install the WEKA Operator and CSI Plugin

Install the WEKA Operator and CSI plugin using Helm to enable storage lifecycle management. This installation sets up the required Custom Resource Definitions (CRDs) and deploys the operator within the specified namespace.

The version numbers in the following commands are examples. Verify the latest version at https://get.weka.io/ui/operator and update the WEKA_OPERATOR_VERSION variable before execution.

  1. Set the WEKA Operator version:

  2. Pull the Helm chart and apply the CRDs:

  3. Install the WEKA Operator and CSI Plugin:
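
A sketch of the Helm workflow; the OCI chart location, release and namespace names, and the untarred directory layout are assumptions, and the version value is a placeholder for the version published at https://get.weka.io/ui/operator.

```bash
# Placeholders and assumptions: verify the chart location and operator version first.
export WEKA_OPERATOR_VERSION=<operator-version>

# Pull the chart locally and apply its CRDs (directory layout is an assumption;
# Helm also installs CRDs packaged with the chart during "helm install")
helm pull oci://quay.io/weka.io/helm/weka-operator --version "$WEKA_OPERATOR_VERSION" --untar
kubectl apply -f weka-operator/crds/

# Install the WEKA Operator
helm install weka-operator oci://quay.io/weka.io/helm/weka-operator \
  --version "$WEKA_OPERATOR_VERSION" \
  --namespace weka-operator-system --create-namespace

# Install the CSI plugin
helm repo add csi-wekafs https://weka.github.io/csi-wekafs
helm install csi-wekafsplugin csi-wekafs/csi-wekafsplugin \
  --namespace csi-wekafs --create-namespace
```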

6. Deploy driver and discovery services

  1. Deploy the Driver Distribution Service: Prepare the variables and apply the configuration.

  2. Deploy the Drive Signer Service:
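
The manifests for these services are provided with the operator release; the file names and namespace below are placeholders for your prepared definition files.

```bash
# Placeholder manifest and namespace names; substitute your prepared definition files.
kubectl apply -f driver-distribution-service.yaml -n weka-operator-system
kubectl apply -f drive-signer-service.yaml -n weka-operator-system
kubectl get pods -n weka-operator-system     # confirm the services are running
```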

7. Deploy NeuralMesh Axon Custom Resources

Install the Custom Resources for the WekaCluster and WekaClient.

  1. Deploy WekaCluster: Update the dynamicTemplate values to match the target cluster size.

  2. Deploy WekaClient:
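
A sketch of applying the edited custom resources; the manifest file names and the namespace are placeholders.

```bash
# Placeholder file and namespace names.
kubectl apply -f wekacluster.yaml -n weka-cluster
kubectl get wekacluster -n weka-cluster -w      # wait for the cluster to report ready

kubectl apply -f wekaclient.yaml -n weka-cluster
kubectl get wekaclient -n weka-cluster
```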

Install NeuralMesh Axon without Kubernetes

Installing the NeuralMesh Axon software involves downloading the installer, configuring system reliability settings, and running an automated script to deploy the cluster topology.

Server configuration prerequisites

Verify the hardware and resource requirements to ensure optimal cluster performance.

The following table outlines the required networking, storage, resources, and protection scheme configurations.

| Component | Requirement |
| --- | --- |
| Networking | Two dataplane network devices (without HA configuration). |
| Storage | 2, 4, or 8 NVMe drives per server. |
| Resources | One Drive core per NVMe drive (up to 8 cores); two Compute cores per Drive core (up to 16 cores); twelve Frontend cores (regardless of the number of drives). |
| Protection scheme | 16+4 protection (16 stripe width, 4 parity) with 2 hot spares. |

NUMA alignment guidelines

Apply the following logic to ensure correct NUMA alignment:

  • Distribute CPU core IDs evenly across NUMA domains that contain a dataplane network device.

  • If hyperthreading is enabled, do not assign sibling cores to the Axon cluster, and exclude them from customer workloads as well.

Workflow

  1. Download and distribute the WEKA release.

  2. Configure graceful shutdown.

  3. Deploy the cluster.

  4. Finalize the installation.

1. Download and distribute the WEKA release

Begin by downloading the NeuralMesh Axon Core software and distributing it to all servers in the cluster.

  1. Download the release: Replace YOUR_WEKA_TOKEN with your assigned download token and WEKA_RELEASE with the specific release version.

  2. Distribute and install the software: Use pdcp and pdsh to copy the tarball to the servers, extract it, and run the installation script.
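
A sketch of the download-and-distribute flow; the download URL format, the extracted directory name, and the host range are assumptions, so copy the exact download command from your get.weka.io account page.

```bash
# Placeholders: release version, token, and host range; the URL format and the
# name of the extracted directory are assumptions, verify them against get.weka.io.
export WEKA_RELEASE=<release-version>
export YOUR_WEKA_TOKEN=<download-token>
curl -fLO "https://$YOUR_WEKA_TOKEN@get.weka.io/dist/v1/pkg/weka-$WEKA_RELEASE.tar"

# Copy, extract, and install on all cluster servers
pdcp -w neuralmesh-axon-[001-200] "weka-$WEKA_RELEASE.tar" /tmp/
pdsh -w neuralmesh-axon-[001-200] "cd /tmp && tar xf weka-$WEKA_RELEASE.tar && cd weka-$WEKA_RELEASE && sudo ./install.sh"
```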

2. Configure graceful shutdown

Configure a systemd override to ensure the system handles shutdowns gracefully by enforcing a specific timeout and reboot action.

  1. Create the configuration file: Create the override.conf file locally.

  2. Apply the configuration to all servers: Distribute the file to the poweroff.target.d directory on all cluster servers and reload the systemd daemon.
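
A minimal sketch of the override; the timeout value and action are assumptions, while the poweroff.target.d destination follows the step above.

```bash
# Create the override locally; the timeout and action values are examples.
cat > override.conf <<'EOF'
[Unit]
JobTimeoutSec=600
JobTimeoutAction=reboot-force
EOF

# Distribute it and reload systemd on all cluster servers
pdcp -w neuralmesh-axon-[001-200] override.conf /tmp/
pdsh -w neuralmesh-axon-[001-200] "sudo mkdir -p /etc/systemd/system/poweroff.target.d && sudo cp /tmp/override.conf /etc/systemd/system/poweroff.target.d/override.conf && sudo systemctl daemon-reload"
```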

3. Deploy the Axon Core cluster

Customize and execute the following installation script to deploy the NeuralMesh Axon Core cluster. This script automates the creation of containers (Drives, Compute, Frontend), forms the cluster, configures data protection, and assigns NVMe drives.

Example installation script

Prerequisites

  • Ensure pdsh is installed and configured for passwordless access to all target servers.

  • Verify all target servers are listed in /etc/hosts with a common identifier (e.g., axon) to facilitate hostname parsing.

Procedure

  1. Copy the example installation script to a file (for example, install_axon.sh).

  2. Edit the Cluster Parameters section to match your hardware configuration:

    • Core IDs: Align COMPUTE_CORES, DRIVE_CORES, and FE_CORES with your specific NUMA topology.

    • NICs: Update COMPUTE_NICS, DRIVE_NICS, and FE_NICS with the correct interface names.

    • Drive identifier: Update DRIVE_MATCH with a unique string (model or size) that identifies your NVMe drives.

  3. Make the script executable and run it:
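
The full example installation script is not reproduced here. The following sketch shows only the Cluster Parameters section named in step 2, with placeholder values, followed by the commands from step 3.

```bash
# --- Cluster Parameters (placeholder values; align with your NUMA topology and NICs) ---
COMPUTE_CORES="2-17"                  # cores for Compute containers
DRIVE_CORES="18-25"                   # cores for Drives containers (one per NVMe drive)
FE_CORES="26-37"                      # cores for Frontend containers
COMPUTE_NICS="ens1f0,ens1f1"
DRIVE_NICS="ens1f0,ens1f1"
FE_NICS="ens1f0,ens1f1"
DRIVE_MATCH="<nvme-model-or-size>"    # unique string identifying the NVMe drives

# Make the script executable and run it
chmod +x install_axon.sh
./install_axon.sh
```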

4. Finalize the installation

After the cluster deployment script completes, enable the observation service and start the IO services.

  1. Enable the Axon Observe service:

  2. Start the cluster IO services:
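
The command that enables the Observe service depends on the packaging of your release and is not shown here; starting cluster IO uses the CLI call referenced earlier in this page.

```bash
# Start the cluster IO services (run from any cluster server)
weka cluster start-io

# Verify that the cluster is up and serving IO
weka status
```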

NeuralMesh Axon workload integration with Slurm

Slurm is an open-source cluster management and job scheduling system tailored for Linux clusters. It manages resource access, workload execution, and monitoring.

In a NeuralMesh Axon architecture, Slurm compute servers often function as NeuralMesh Axon Core clients, requiring a Frontend container to mount the local storage service. Additionally, Compute and Drive containers may run on these servers in converged configurations.

To ensure optimal performance and stability, you must isolate CPU and memory resources for NeuralMesh Axon processes. This prevents conflicts with Slurm-managed user workloads or daemons.

NeuralMesh Axon integration with Slurm

Configure Slurm for resource isolation

Configure Slurm to isolate CPU and memory resources for WEKA processes. This prevents conflicts between Slurm services (primarily slurmd) and user workloads attempting to use the same cores. This configuration uses the task/affinity and task/cgroup plugins to control compute resource exposure and bind tasks to designated resources.

Procedure

  1. Edit the slurm.conf file to enable resource tracking and containment: Add or modify the following settings:

    • ProctrackType: Set to proctrack/cgroup to use cgroups for process tracking.

    • TaskPlugin: Set to task/affinity,task/cgroup to enable resource binding and containment.

    • TaskPluginParam: Set to SlurmdOffSpec to prevent Slurm daemons from running on cores designated for WEKA.

    • SelectType: Set to select/cons_tres to track cores, memory, and GPUs as consumable resources.

    • SelectTypeParameters: Set to CR_Core_Memory to schedule workloads based on core and memory availability.

    • PrologFlags: Set to Contain to enforce cgroup containment for all user jobs.

    • JobAcctGatherType: (Optional) Set to jobacct_gather/cgroup for metrics gathering.

    Example configuration snippet: see the sketch after this procedure.

  2. Edit the cgroup.conf file to enforce resource constraints: Ensure the following parameters are set:

    • ConstrainCores: Set to yes.

    • ConstrainRAMSpace: Set to yes.

  3. Define the compute nodes in slurm.conf to allocate exclusive resources: Set the following parameters for each node definition:

    • RealMemory: The total available memory on the compute node.

    • CpuSpecList: The list of virtual CPU IDs reserved for system use (including WEKA processes). Slurm excludes these cores from user jobs and schedules workloads only on the remaining cores.

    • MemSpecLimit: The amount of memory (in MB) reserved for system use. This is required when SelectTypeParameters is set to CR_Core_Memory.

    Example node definition: This example sets CpuSpecList to 0-46, reserving cores 0 through 46 for WEKA and leaving cores 47 through 95 for Slurm jobs (see the sketch after this procedure).

  4. Mount the filesystem using the resources excluded from Slurm: The following example (in the sketch after this procedure) mounts the filesystem using core 46, which falls within the CpuSpecList range (0-46) reserved for WEKA and is therefore excluded from Slurm user jobs.
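
The following sketches correspond to the examples referenced in steps 1 through 4. The configuration file paths, node name, CPU count, memory sizes, backend hostname, and filesystem name are placeholders, and the mount options assume the wekafs core-pinning mount syntax.

```bash
# Step 1: slurm.conf tracking and containment settings (file path is distribution-dependent)
cat >> /etc/slurm/slurm.conf <<'EOF'
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup
TaskPluginParam=SlurmdOffSpec
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
PrologFlags=Contain
JobAcctGatherType=jobacct_gather/cgroup
EOF

# Step 2: cgroup.conf constraints
cat >> /etc/slurm/cgroup.conf <<'EOF'
ConstrainCores=yes
ConstrainRAMSpace=yes
EOF

# Step 3: example node definition (placeholder hostname, CPU count, and memory;
# CpuSpecList=0-46 reserves those cores for WEKA, leaving 47-95 for Slurm jobs)
cat >> /etc/slurm/slurm.conf <<'EOF'
NodeName=neuralmesh-axon-001 CPUs=96 RealMemory=515000 CpuSpecList=0-46 MemSpecLimit=64000 State=UNKNOWN
EOF

# Step 4: mount the filesystem on core 46, which is reserved via CpuSpecList and
# excluded from Slurm jobs (backend hostname and filesystem name are placeholders)
mount -t wekafs -o core=46 backend-01/default /mnt/weka
```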

Integrate with Kubernetes

NeuralMesh Axon is a Kubernetes-native solution, delivering super-computing storage packaged as a Kubernetes application. You can manage the placement and prioritization of NeuralMesh Axon Core clusters directly from the point of installation using standard Kubernetes mechanisms.

Manage placement and prioritization

Utilize the following Kubernetes-native features to control where and how NeuralMesh Axon resources are deployed:

  • Selectors: Target specific nodes for deployment.

  • Affinity rules: Define complex rules for pod placement relative to other pods or nodes.

  • Priority classes: Ensure critical NeuralMesh Axon components receive scheduling priority.

  • Resource requests: Reserve the necessary compute and memory resources for optimal performance.

NeuralMesh Axon Multi-tenancy

NeuralMesh Axon implements logical multi-tenancy by subdividing cluster controls and resources into secure authorization boundaries called Organizations.

Security enforcement

WEKA security policies enforce the boundaries created by Organizations. These policies use network CIDR information and Organization member roles to evaluate authorization for API calls and data operations.

Policy scope and evaluation

You can apply security policies to the entire cluster, a specific Organization, or individual filesystems within an Organization. To ensure fine-grained enforcement, the system evaluates the following:

  • The authenticated user's Organization membership.

  • The user's Organization permissions.

  • The network CIDR from which the authenticated API call originated.
