NeuralMesh Axon deployment
Deploy NeuralMesh Axon and integrate with orchestration services like Kubernetes and Slurm. For partners and skilled customers performing independent installations.
NeuralMesh Axon deployment overview
NeuralMesh Axon offers a flexible deployment model, supporting both Slurm and Kubernetes runtime environments. Regardless of the deployment method, specific infrastructure configurations are required to ensure high performance and stability.
Infrastructure prerequisites
Before deploying software, the physical and network environment must be prepared:
Network: The topology must be non-blocking with zero over-subscription. Jumbo frames (9k MTU for Ethernet, 4k for InfiniBand) and Source Based Routing policies are required.
Hardware: Servers require AMD or Intel CPUs with contention-free allocation per NUMA domain and NVMe storage with Power Loss Protection (PLP).
Linux configuration: All deployments require specific Linux kernel settings, including disabling swap partitions, disabling kernel NUMA balancing, and setting noexec on /tmp. BIOS settings must disable hyperthreading and enable maximum performance.
Firewall: Ensure traffic is allowed on essential ports, such as 14000-16100 (Core traffic), 443 (Traces), and 8200-8201 (Vault).
Deployment options
Administrators can choose between a Kubernetes-native deployment or a direct installation on servers running Slurm.
Option A: Kubernetes deployment
This method uses the WEKA Operator to manage lifecycle and automation via Custom Resource Definitions (CRDs).
WEKA Operator and CSI: The WEKA Operator and CSI plugin are installed through Helm charts to manage the cluster state.
Node preparation: Kubernetes nodes are labeled (weka.io/supports-backends=true and weka.io/supports-clients=true) to control pod placement.
SELinux: For Kubernetes clusters, a specific SELinux policy (container_use_wekafs) must be installed on nodes if SELinux support is enabled.
Custom Resources: The cluster is defined and deployed by applying WekaCluster and WekaClient CRDs, which specify container counts and resources.
Option B: Slurm deployment (without Kubernetes)
This method involves manual distribution and scripting on the servers running on Slurm.
Distribution: The software release tarball is downloaded and distributed to all cluster servers using tools like pdcp.
Installation: An installation script configures the container topology (Drives, Compute, and Frontend containers), binding specific CPU cores and NICs to NeuralMesh processes.
Service management: Systemd overrides are created to ensure graceful shutdown, and the cluster IO is started manually via CLI (weka cluster start-io).
Workload integration
NeuralMesh Axon integrates with schedulers to optimize resource usage. To prevent resource conflicts with Slurm, configure slurm.conf to exclude the specific CPU cores and memory reserved for NeuralMesh Axon using CpuSpecList and MemSpecLimit.
Multi-tenancy
Subdivide the cluster into Organizations to support multi-tenancy. Security policies enforce boundaries based on user roles and network CIDRs.
NeuralMesh Axon deployment requirements
System requirements overview
NeuralMesh Axon requires specific infrastructure configurations for both Linux servers and Kubernetes runtime environments.
Network infrastructure: Configure the following network settings:
Non-blocking network topology with zero over-subscribed segments.
Jumbo frames enabled: 9k MTU for Ethernet, 4k MTU for Infiniband.
Source Based Routing policies applied to each dataplane network device.
Hardware requirements: Deploy servers that meet these specifications:
CPU architecture: AMD (single socket required) or Intel (dual-socket supported).
CPU allocation: Contention-free CPU allocation for each NUMA domain.
Memory: Sufficient RAM to allocate HugeTLB pages at runtime. For details, see Memory resource planning.
Storage: Adequate NVMe capacity for the selected protection scheme.
Software prerequisites: Install and configure these software components:
Unified cgroup hierarchy (cgroup v2) enabled.
Linux kernel headers matching the booted kernel version.
Additional Linux server configuration requirements: see Linux system configuration.
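The following quick checks, a sketch assuming an RPM-based distribution (use the equivalent apt packages on Debian or Ubuntu), can confirm the cgroup and kernel-header prerequisites:
```
# Verify the unified cgroup hierarchy (cgroup v2) is mounted
stat -fc %T /sys/fs/cgroup          # prints "cgroup2fs" when cgroup v2 is active

# Verify kernel headers matching the booted kernel are installed
uname -r
rpm -q kernel-devel-"$(uname -r)" || sudo dnf install -y kernel-devel-"$(uname -r)"
```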
Kubernetes-specific requirements
Ensure your Kubernetes environment meets specific driver and security configurations before deployment to guarantee optimal performance and compatibility.
NIC driver requirements
Review the supported drivers and specific version requirements for Ethernet and InfiniBand configurations.
Mellanox OFED Ethernet and InfiniBand drivers
Kubernetes deployments use NeuralMesh Axon Core releases that are one major version behind Slurm deployments. Consequently, these releases require the installation of Mellanox OFED drivers rather than using the drivers bundled in the Core codebase.
Core versions 5.0 and later: No OFED driver installation is required as the code is bundled with Core.
Core versions before 5.0: You must install a qualified OFED driver.
Latest date-versioned release: OFED 24.04-0.7.0.0.
Latest semantic-versioned release: OFED 5.9-0.5.6.0.
For more details, see Ethernet drivers and configurations and InfiniBand drivers and configurations.
ENA drivers
Supported versions: 1.0.2 through 2.0.2.
Recommendation: Use the current driver from the official OS repository.
ixgbevf drivers
Supported versions: 3.2.2 through 4.1.2.
Recommendation: Use the current driver from the official OS repository.
Enable SELinux support for NeuralMesh Axon on Kubernetes
Configure SELinux support for NeuralMesh Axon clusters hosted in a Kubernetes runtime environment. This process requires installing a specific SELinux policy on each participating Kubernetes node and updating the deployment configuration.
Procedure:
Install the SELinux policy on the Kubernetes nodes: Apply the policy package to each node participating in the NeuralMesh Axon system using the files from the public GitHub repository.
Clone the repository:
```
git clone https://github.com/weka/csi-wekafs.git
```
Copy the SELinux files to the relevant nodes (example using pdcp):
```
pdcp -w neuralmesh-axon-[001-200] -r csi-wekafs/selinux ~/
```
Install the SELinux module (example using pdsh):
```
pdsh -w neuralmesh-axon-[001-200] "sudo semodule -i ~/selinux/csi-wekafs.pp"
```
This command defines a new SELinux boolean labeled container_use_wekafs. If the container_use_wekafs boolean does not appear in the list, compile the policy package that matches the target Linux server. See Install a custom SELinux policy.
Enable the SELinux boolean:
```
pdsh -w neuralmesh-axon-[001-200] "sudo setsebool container_use_wekafs=on"
```
Update the WEKA Operator definition: Update the definition file with the SELinux support flag set to either mixed or enforced. Run the following Helm command to upgrade the plugin, for example, with the enforced setting:
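A hedged sketch of the upgrade, assuming the publicly available csi-wekafs Helm chart and its selinuxSupport value (chart location, release name, and namespace may differ in your environment):
```
# Upgrade the CSI plugin with SELinux support enforced (illustrative names and values)
helm upgrade csi-wekafs csi-wekafs/csi-wekafsplugin \
  --namespace csi-wekafs \
  --reuse-values \
  --set selinuxSupport=enforced
```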
SSD requirements
SSD configuration: Determine these specifications for each server:
Number of SSDs required.
Capacity per SSD (total capacity = number × capacity per SSD).
SSD feature requirements: Verify that SSDs provide:
Power Loss Protection (PLP).
Capacity up to 30 TB.
Capacity ratio between smallest and largest SSDs not exceeding 8:1 across the cluster.
Network configuration
Configure firewall rules to allow these ports and protocols on all cluster servers:
| Ports | Protocol | Purpose |
| --- | --- | --- |
| 14000-14100, 15000-15100, 16000-16100 | TCP + UDP | NeuralMesh Axon Core container traffic |
| 443 | TCP | NeuralMesh Axon Core traces remote viewer |
| 22 | TCP | SSH management access |
| 123 | TCP | NTP management access |
| 8200, 8201 | TCP | HashiCorp Vault traffic |
| 5696 | TCP | KMIP traffic |
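As an illustration, the rules above can be applied with firewalld on distributions that use it (a sketch; adapt to your firewall tooling and zones):
```
# Open the NeuralMesh Axon Core, management, Vault, and KMIP ports
sudo firewall-cmd --permanent \
  --add-port=14000-14100/tcp --add-port=14000-14100/udp \
  --add-port=15000-15100/tcp --add-port=15000-15100/udp \
  --add-port=16000-16100/tcp --add-port=16000-16100/udp \
  --add-port=443/tcp --add-port=22/tcp --add-port=123/tcp \
  --add-port=8200-8201/tcp --add-port=5696/tcp
sudo firewall-cmd --reload
```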
CPU core management
NeuralMesh Axon co-locates with AI workloads on shared computational hardware.
For non-Kubernetes/bare-metal deployments, allocate dedicated CPU cores to both NeuralMesh Axon and the job-scheduling service to prevent CPU starvation.
For Kubernetes deployments, CPU core allocation is not required.
Linux system configuration
Apply these configurations to all NeuralMesh Axon deployments, regardless of deployment type.
CPU and memory settings:
Balance CPU cores across NUMA domains containing dataplane network devices.
Install Linux kernel header package matching the booted kernel version.
Disable NUMA balancing by setting kernel.numa_balancing = 0 and add the entry to the /etc/sysctl.conf file to ensure persistence.
Network configuration:
Configure Source Based Routing policies for every dataplane NIC on every dataplane subnet in the OS network scripts to enable access without specifying bind/outbound interfaces.
Configure ARP settings for each dataplane network device by setting the following parameters in the /etc/sysctl.conf file, replacing <device> with the actual network interface name.
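The parameter list itself is not reproduced here. As a hedged illustration, a persistent configuration typically combines the NUMA-balancing setting above with per-device ARP settings commonly used on multi-homed hosts; the ARP values shown are assumptions to verify against your release documentation:
```
# Append illustrative settings to /etc/sysctl.conf; replace <device> with the dataplane NIC name
cat <<'EOF' | sudo tee -a /etc/sysctl.conf
kernel.numa_balancing = 0
# Assumed per-device ARP behaviour for multi-homed dataplane interfaces
net.ipv4.conf.<device>.arp_announce = 2
net.ipv4.conf.<device>.arp_ignore = 1
EOF
sudo sysctl -p   # apply the settings without a reboot
```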
System services and settings: Configure these system requirements:
Disable swap partitions.
Run NTP with synchronized date and time across all cluster servers.
Install and enable rpcbind, XFS, and SquashFS.
Maintain consistent IOMMU settings (BIOS and bootloader) across all cluster servers.
Apply the noexec mount option to /tmp.
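Two of these settings lend themselves to a short sketch (illustrative commands; review existing /etc/fstab entries before editing):
```
# Disable swap immediately and comment out swap entries so it stays off after reboot
sudo swapoff -a
sudo sed -i.bak '/\sswap\s/ s/^/#/' /etc/fstab

# Remount /tmp with noexec now; add noexec to the /tmp entry in /etc/fstab for persistence
sudo mount -o remount,noexec /tmp
```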
BIOS configuration: Flash BIOS using WEKA's bios_tool to:
Select maximum power and performance settings.
Disable hyperthreading (HT).
Disable Active Management Technology (AMT).
Install NeuralMesh Axon on a Kubernetes environment
Installing NeuralMesh Axon in a Kubernetes environment relies on the WEKA Operator, which simplifies the installation, management, and scaling of NeuralMesh Axon deployments within a Kubernetes cluster.
The WEKA Operator uses Kubernetes Custom Resource Definitions (CRDs) to provide resilience, scalability, and automation for each cluster. It provides a cluster-wide service that monitors each Kubernetes namespace containing NeuralMesh Axon custom resources.
When the WEKA Operator detects that the state of the deployed resources does not match the declared state in the resource definition file, it performs one of the following actions:
Logs clear and useful messages in the WEKA Operator controller pod.
Bootstraps the resource to return it to the desired state, provided the change is additive and non-destructive.
The WEKA Operator automation never makes subtractive or destructive changes to deployed resources.
Compatibility
Verify the environment meets the following minimum version requirements:
Kubernetes: Version 1.25 or higher.
OpenShift: Version 4.17 or higher.
Network requirements
The following Kubernetes nodePorts are required. These ports use default ranges unless defined differently in the definition files. These ports are not hard-coded into the source code and can be adjusted to match the target Kubernetes cluster service topology.
| Purpose | Source Container(s) | Destination Container(s) | Default Port Ranges | Protocol |
| --- | --- | --- | --- | --- |
| Client connections | Client | Compute, Drives | 45000-65000 | TCP/UDP |
| Cluster allocation | Operator | Compute, Drives, Client | 35000-35499 | TCP/UDP |
| Core traffic | Compute, Drives | Compute, Drives | 35000-35499 | TCP/UDP |
Workflow
Configure Kubelet settings.
Install NeuralMesh Axon.
Label the nodes.
Create registry secrets.
Install the WEKA Operator and CSI Plugin.
Deploy driver and discovery services.
Deploy NeuralMesh Axon Custom Resources.
1. Configure Kubelet settings
Configure the Kubelet with static CPU management to enable exclusive CPU allocation for NeuralMesh Axon performance.
Define the CPU management policy: Configure the Kubelet with the following settings:
Identify the active Kubelet configuration: Check which configmap holds the current Kubelet configuration.
If multiple Kubelet configurations exist, modify the configuration specifically for the worker nodes.
Apply the CPU settings: Edit the Kubelet config map to add the required CPU settings.
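A minimal sketch of the change, assuming a kubeadm-style cluster where the kubelet configuration is held in a kubelet-config ConfigMap (the ConfigMap name and the reserved cores are illustrative):
```
# Identify where the active kubelet configuration is held (name varies by distribution)
kubectl -n kube-system get configmap | grep -i kubelet

# Edit the ConfigMap and add the static CPU manager settings to the KubeletConfiguration, e.g.:
#   cpuManagerPolicy: static
#   cpuManagerReconcilePeriod: 5s
#   reservedSystemCPUs: "0-1"      # example reservation; align with your NUMA layout
kubectl -n kube-system edit configmap kubelet-config
```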
2. Install NeuralMesh Axon
The installation of a NeuralMesh Axon cluster in a Kubernetes environment is an automated process. The solution uses three collections of Kubernetes namespaces:
Global namespace for the NeuralMesh Axon WEKA Operator.
Global namespace for the CSI-WekaFS CSI plugin.
Individual namespaces for each NeuralMesh Axon cluster.
Reliable pod scheduling for these services requires specific Kubernetes labels and selectors on both nodes and resources.
3. Label the nodes
Apply Kubernetes node labels to assign the CSI and NeuralMesh Axon Core cluster pods to specific nodes. Ensure the labels applied to the nodes match the nodeSelector definitions in the Custom Resource Definitions (CRDs).
The following commands demonstrate how to label nodes using the example values supports-clients and supports-backends:
Label nodes for the CSI:
Label nodes for the NeuralMesh Axon Core cluster:
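For example (node names are placeholders; the label keys match the nodeSelector values used throughout this guide):
```
# Nodes that host the CSI plugin and NeuralMesh Axon client pods
kubectl label node <client-node-name> weka.io/supports-clients=true

# Nodes that host NeuralMesh Axon Core (backend) pods
kubectl label node <backend-node-name> weka.io/supports-backends=true
```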
4. Create registry secrets
Create a docker-registry secret in the global namespace for the WEKA Operator and in each namespace hosting a Core cluster.
The following commands are examples.
Export the required variables:
Create the namespace and secret for the WEKA Operator:
Create the secret for NeuralMesh Axon Core:
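A hedged sketch of these steps; the registry server, credential variables, and namespace names are illustrative and should be replaced with the values supplied with your license:
```
# Export the required variables (illustrative names)
export REGISTRY_SERVER="quay.io"
export REGISTRY_USERNAME="<registry-user>"
export REGISTRY_PASSWORD="<registry-password>"
export OPERATOR_NAMESPACE="weka-operator-system"
export CLUSTER_NAMESPACE="weka-cluster"

# Namespace and image-pull secret for the WEKA Operator
kubectl create namespace "$OPERATOR_NAMESPACE"
kubectl -n "$OPERATOR_NAMESPACE" create secret docker-registry weka-image-pull \
  --docker-server="$REGISTRY_SERVER" \
  --docker-username="$REGISTRY_USERNAME" \
  --docker-password="$REGISTRY_PASSWORD"

# Image-pull secret for the NeuralMesh Axon Core cluster namespace
kubectl create namespace "$CLUSTER_NAMESPACE"
kubectl -n "$CLUSTER_NAMESPACE" create secret docker-registry weka-image-pull \
  --docker-server="$REGISTRY_SERVER" \
  --docker-username="$REGISTRY_USERNAME" \
  --docker-password="$REGISTRY_PASSWORD"
```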
5. Install the WEKA Operator and CSI Plugin
Install the WEKA Operator and CSI plugin using Helm to enable storage lifecycle management. This installation sets up the required Custom Resource Definitions (CRDs) and deploys the operator within the specified namespace.
Set the WEKA Operator version:
Pull the Helm chart and apply the CRDs:
Install the WEKA Operator and CSI Plugin:
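A hedged sketch of the Helm steps; the operator chart location is left as a placeholder, and the csi-wekafs chart reference follows its public repository (verify both against your release instructions):
```
# Set the WEKA Operator version (illustrative value)
export WEKA_OPERATOR_VERSION="<operator-version>"

# Pull the operator chart; installing it applies the bundled CRDs
helm pull <weka-operator-chart-location> --version "$WEKA_OPERATOR_VERSION"
helm install weka-operator <weka-operator-chart-location> \
  --version "$WEKA_OPERATOR_VERSION" \
  --namespace "$OPERATOR_NAMESPACE"

# Install the CSI plugin from the public csi-wekafs chart
helm repo add csi-wekafs https://weka.github.io/csi-wekafs
helm install csi-wekafs csi-wekafs/csi-wekafsplugin --namespace csi-wekafs --create-namespace
```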
6. Deploy driver and discovery services
Deploy the Driver Distribution Service: Prepare the variables and apply the configuration.
Deploy the Drive Signer Service:
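A minimal sketch of the apply steps, assuming manifest files prepared from the examples that ship with the WEKA Operator (file names and namespace are placeholders):
```
# Driver Distribution Service
kubectl -n "$OPERATOR_NAMESPACE" apply -f driver-distribution-service.yaml

# Drive Signer Service
kubectl -n "$OPERATOR_NAMESPACE" apply -f drive-signer-service.yaml
```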
7. Deploy NeuralMesh Axon Custom Resources
Install the Custom Resources for the WekaCluster and WekaClient.
Deploy WekaCluster: Update the dynamicTemplate values to match the target cluster size.
Deploy WekaClient:
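A minimal sketch of applying the Custom Resources, assuming manifest files prepared from the operator's examples (file names are placeholders):
```
# Apply the cluster definition, then the client definition, into the cluster namespace
kubectl -n "$CLUSTER_NAMESPACE" apply -f wekacluster.yaml
kubectl -n "$CLUSTER_NAMESPACE" apply -f wekaclient.yaml

# Watch the resources until the operator reconciles them into the desired state
kubectl -n "$CLUSTER_NAMESPACE" get wekacluster,wekaclient -w
```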
Install NeuralMesh Axon without Kubernetes
Installing the NeuralMesh Axon software involves downloading the installer, configuring system reliability settings, and running an automated script to deploy the cluster topology.
Server configuration prerequisites
Verify the hardware and resource requirements to ensure optimal cluster performance.
The following table outlines the required networking, storage, resources, and protection scheme configurations.
| Category | Requirement |
| --- | --- |
| Networking | Two dataplane network devices (without HA configuration). |
| Storage | 2, 4, or 8 NVMe drives per server. |
| Resources | One Drive core per NVMe drive (up to 8 cores); two Compute cores per Drive core (up to 16 cores); twelve Frontend cores (regardless of the number of drives). |
| Protection scheme | 16+4 protection (16 stripe width, 4 parity) with 2 hot spares. |
NUMA alignment guidelines
Apply the following logic to ensure correct NUMA alignment:
Distribute CPU core IDs evenly across NUMA domains that contain a dataplane network device.
If hyperthreading is enabled, do not assign the sibling cores of Axon cluster cores to the cluster, and exclude them from customer workloads as well.
Workflow
Download and distribute the WEKA release.
Configure graceful shutdown.
Deploy the cluster.
Finalize the installation.
1. Download and distribute the WEKA release
Begin by downloading the NeuralMesh Axon Core software and distributing it to all servers in the cluster.
Download the release: Replace YOUR_WEKA_TOKEN with your assigned download token and WEKA_RELEASE with the specific release version.
Distribute and install the software: Use pdcp and pdsh to copy the tarball to the servers, extract it, and run the installation script.
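A hedged sketch of these two steps; the download URL pattern, tarball layout, and installer name are assumptions and should be taken from the instructions supplied with your download token:
```
# Download the release tarball (illustrative URL pattern)
export WEKA_RELEASE="<release-version>"
curl -fLO "https://YOUR_WEKA_TOKEN@get.weka.io/dist/v1/pkg/weka-${WEKA_RELEASE}.tar"

# Distribute, extract, and install on all cluster servers
pdcp -w neuralmesh-axon-[001-200] "weka-${WEKA_RELEASE}.tar" /tmp/
pdsh -w neuralmesh-axon-[001-200] \
  "cd /tmp && tar xf weka-${WEKA_RELEASE}.tar && cd weka-${WEKA_RELEASE} && sudo ./install.sh"
```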
2. Configure graceful shutdown
Configure a systemd override to ensure the system handles shutdowns gracefully by enforcing a specific timeout and reboot action.
Create the configuration file: Create the override.conf file locally.
Apply the configuration to all servers: Distribute the file to the poweroff.target.d directory on all cluster servers and reload the systemd daemon.
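A sketch of one possible override, assuming a fixed timeout with a forced-reboot fallback (the timeout value is a placeholder to adapt):
```
# Create the override file locally
cat > override.conf <<'EOF'
[Unit]
# Allow services time to stop cleanly, then force a reboot if the target hangs (illustrative value)
JobTimeoutSec=600
JobTimeoutAction=reboot-force
EOF

# Distribute it to poweroff.target.d on all servers and reload systemd
pdsh -w neuralmesh-axon-[001-200] "sudo mkdir -p /etc/systemd/system/poweroff.target.d"
pdcp -w neuralmesh-axon-[001-200] override.conf /tmp/override.conf
pdsh -w neuralmesh-axon-[001-200] \
  "sudo mv /tmp/override.conf /etc/systemd/system/poweroff.target.d/override.conf && sudo systemctl daemon-reload"
```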
3. Deploy the Axon Core cluster
Customize and execute the following installation script to deploy the NeuralMesh Axon Core cluster. This script automates the creation of containers (Drives, Compute, Frontend), forms the cluster, configures data protection, and assigns NVMe drives.
Prerequisites
Ensure pdsh is installed and configured for passwordless access to all target servers.
Verify all target servers are listed in /etc/hosts with a common identifier (e.g., axon) to facilitate hostname parsing.
Procedure
Copy the example installation script to a file (for example, install_axon.sh).
Edit the Cluster Parameters section to match your hardware configuration:
Core IDs: Align COMPUTE_CORES, DRIVE_CORES, and FE_CORES with your specific NUMA topology.
NICs: Update COMPUTE_NICS, DRIVE_NICS, and FE_NICS with the correct interface names.
Drive identifier: Update DRIVE_MATCH with a unique string (model or size) that identifies your NVMe drives.
Make the script executable and run it:
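For example (the Cluster Parameters shown in comments are the variables to edit, per the list above):
```
# Review the Cluster Parameters section of the script before running, for example:
#   COMPUTE_CORES=...   DRIVE_CORES=...   FE_CORES=...
#   COMPUTE_NICS=...    DRIVE_NICS=...    FE_NICS=...
#   DRIVE_MATCH="<NVMe model or size string>"
chmod +x install_axon.sh
./install_axon.sh
```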
4. Finalize the installation
After the cluster deployment script completes, enable the observation service and start the IO services.
Enable the Axon Observe service:
Start the cluster IO services:
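A hedged sketch of the finalization commands; the exact command for the observation service is not reproduced here and should be taken from your release documentation:
```
# Enable the Axon Observe service
# <run the enable command documented for your release>

# Start the cluster IO services and confirm the cluster is serving IO
weka cluster start-io
weka status
```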
NeuralMesh Axon workload integration with Slurm
Slurm is an open-source cluster management and job scheduling system tailored for Linux clusters. It manages resource access, workload execution, and monitoring.
In a NeuralMesh Axon architecture, Slurm compute servers often function as NeuralMesh Axon Core clients, requiring a Frontend container to mount the local storage service. Additionally, Compute and Drive containers may run on these servers in converged configurations.
To ensure optimal performance and stability, you must isolate CPU and memory resources for NeuralMesh Axon processes. This prevents conflicts with Slurm-managed user workloads or daemons.

Configure Slurm for resource isolation
Configure Slurm to isolate CPU and memory resources for WEKA processes. This prevents conflicts between Slurm services (primarily slurmd) and user workloads attempting to use the same cores. This configuration uses the task/affinity and task/cgroup plugins to control compute resource exposure and binds nodes to designated resources.
Procedure
Edit the slurm.conf file to enable resource tracking and containment. Add or modify the following settings:
ProctrackType: Set to proctrack/cgroup to use cgroups for process tracking.
TaskPlugin: Set to task/affinity,task/cgroup to enable resource binding and containment.
TaskPluginParam: Set to SlurmdOffSpec to prevent Slurm daemons from running on cores designated for WEKA.
SelectType: Set to select/cons_tres to track cores, memory, and GPUs as consumable resources.
SelectTypeParameters: Set to CR_Core_Memory to schedule workloads based on core and memory availability.
PrologFlags: Set to Contain to enforce cgroup containment for all user jobs.
JobAcctGatherType: (Optional) Set to jobacct_gather/cgroup for metrics gathering.
Example configuration snippet:
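A sketch of such a snippet, combining the settings listed above (append or merge into your existing slurm.conf rather than overwriting it; the file path varies by distribution):
```
cat >> /etc/slurm/slurm.conf <<'EOF'
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup
TaskPluginParam=SlurmdOffSpec
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
PrologFlags=Contain
JobAcctGatherType=jobacct_gather/cgroup
EOF
```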
Edit the cgroup.conf file to enforce resource constraints. Ensure the following parameters are set:
ConstrainCores: Set to yes.
ConstrainRAMSpace: Set to yes.
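Similarly, a sketch for cgroup.conf (same path caveat as above):
```
cat >> /etc/slurm/cgroup.conf <<'EOF'
ConstrainCores=yes
ConstrainRAMSpace=yes
EOF
```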
Define the compute nodes in slurm.conf to allocate exclusive resources: Set the following parameters for each node definition:
RealMemory: The total available memory on the compute node.
CpuSpecList: The list of virtual CPU IDs reserved for system use (including WEKA processes). Slurm uses only the cores defined here and excludes the others from user jobs.
MemSpecLimit: The amount of memory (in MB) reserved for system use. This is required when SelectTypeParameters is set to CR_Core_Memory.
Example node definition: This example reserves cores 47 through 95 for Slurm usage, leaving cores 0 through 46 available for WEKA.
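An illustrative node definition consistent with those ranges; the node names, CPU count, and memory figures are placeholders to replace with your hardware values:
```
# Illustrative slurm.conf node definition (placeholders for names and memory sizes)
cat >> /etc/slurm/slurm.conf <<'EOF'
NodeName=neuralmesh-axon-[001-200] CPUs=96 RealMemory=<total-memory-MB> CpuSpecList=47-95 MemSpecLimit=<reserved-memory-MB>
EOF
```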
Mount the filesystem using the resources excluded from Slurm: The following example mounts the filesystem using core 46, which is outside the CpuSpecList range (47-95) defined for Slurm.
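A hedged sketch of such a mount; the backend address, filesystem name, and mount point are placeholders, and the core and num_cores options should be verified against the mount documentation for your release:
```
# Mount the wekafs filesystem pinned to core 46, outside the core range given to Slurm above
sudo mkdir -p /mnt/weka
sudo mount -t wekafs -o num_cores=1,core=46 <backend-server>/<filesystem-name> /mnt/weka
```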
Integrate with Kubernetes
NeuralMesh Axon is a Kubernetes-native solution, delivering super-computing storage packaged as a Kubernetes application. You can manage the placement and prioritization of NeuralMesh Axon Core clusters directly from the point of installation using standard Kubernetes mechanisms.
Manage placement and prioritization
Utilize the following Kubernetes-native features to control where and how NeuralMesh Axon resources are deployed:
Selectors: Target specific nodes for deployment.
Affinity rules: Define complex rules for pod placement relative to other pods or nodes.
Priority classes: Ensure critical NeuralMesh Axon components receive scheduling priority.
Resource requests: Reserve the necessary compute and memory resources for optimal performance.
NeuralMesh Axon multi-tenancy
NeuralMesh Axon implements logical multi-tenancy by subdividing cluster controls and resources into secure authorization boundaries called Organizations.
Security enforcement
WEKA security policies enforce the boundaries created by Organizations. These policies use network CIDR information and Organization member roles to evaluate authorization for API calls and data operations.
Policy scope and evaluation
You can apply security policies to the entire cluster, a specific Organization, or individual filesystems within an Organization. To ensure fine-grained enforcement, the system evaluates the following:
The authenticated user's Organization membership.
The user's Organization permissions.
The network CIDR from which the authenticated API call originated.