WEKA Operator deployment
Discover how the WEKA Operator streamlines deploying, scaling, and managing the WEKA Data Platform on Kubernetes, delivering high-performance storage for compute-intensive workloads like AI and HPC.
Overview
The WEKA Operator simplifies deploying, managing, and scaling the WEKA Data Platform within a Kubernetes cluster. It provides custom Kubernetes resources that define and manage WEKA components effectively.
By integrating WEKA's high-performance storage into Kubernetes, the Operator supports compute-intensive applications like AI, ML, and HPC. This enhances data access speed and boosts overall performance.
The WEKA Operator automates tasks, enables periodic maintenance, and ensures robust cluster management, providing resilience and scalability across the cluster. Its persistent, high-performance data layer enables efficient management of large datasets.
WEKA Operator backend deployment overview
The WEKA Operator backend deployment integrates various components within a Kubernetes cluster to deploy, manage, and scale the WEKA Data Platform effectively.
How it works
Local Server Setup: This setup integrates Kubernetes with the WekaCluster custom resources (CRDs) and facilitates WEKA Operator installation through Helm. Configuring Helm registry authentication provides access to the necessary CRDs and initiates the operator installation.
WekaCluster CR: The WekaCluster CR defines the WEKA cluster’s configuration, including storage, memory, and resource limits, while optimizing memory and CPU settings to prevent out-of-memory errors. Cluster and container management also support operational tasks through on-demand executions (through WekaManualOperation) and scheduled tasks (through WekaPolicy).
WEKA Operator:
The WEKA Operator retrieves Kubernetes configurations from WekaCluster CRs, grouping multiple WEKA containers to organize WEKA nodes into a single unified cluster.
To enable access to WEKA container images, the Operator retrieves credentials from Kubernetes secrets in each namespace that requires WEKA resources.
Using templates, it calculates the required number of containers and deploys the WEKA cluster on Kubernetes backends through a CRD.
Each node requires specific Kubelet configurations, such as kernel headers, storage allocations, and huge page settings, to optimize memory management for the WEKA containers. Data is stored in the /opt/k8s-weka directory on each node, with CPU and memory allocations determined by the number of WEKA containers and available CPU cores per node.
Driver Distribution Model: This model ensures efficient kernel module loading and compatibility across nodes, supporting scalable deployment for both clients and backends. It operates through three primary roles:
Distribution Service: A central repository storing and serving WEKA drivers for seamless access across nodes.
Drivers Builder: Compiles drivers for specific WEKA versions and kernel targets, uploading them to the Distribution Service. Multiple builders can run concurrently to support the same repository.
Drivers Loader: Automatically detects missing drivers, retrieves them from the Distribution Service, and loads them using modprobe.

WEKA Operator client deployment overview
The WEKA Operator client deployment uses the WekaClient custom resource to manage WEKA containers across a set of designated nodes, similar to a DaemonSet. Each WekaClient instance provisions WEKA containers as individual pods, creating a persistent layer that supports high availability by allowing safe pod recreation when necessary.
How it works
Deployment initiation: The user starts the deployment from a local server, which triggers the process.
Custom resource retrieval: The WEKA Operator retrieves the WekaClient custom resource (CR) configuration. This CR defines which nodes in the Kubernetes cluster run WEKA containers.
WEKA containers deployment: Based on the WekaClient CR, the Operator deploys WEKA containers across the specified Kubernetes client nodes. Each WEKA container instance runs as a single pod, similar to a DaemonSet.
Persistent storage setup: Using the WEKA Container Storage Interface (CSI) plugin, the WEKA Operator sets up a persistent volume (PV) for the clients. This storage is managed by the WEKA Operator and is a prerequisite for clients relying on WEKA.
High availability: The WEKA containers act as a persistent layer, enabling each pod to be safely recreated as needed. This supports high availability by ensuring continuous service even if individual pods are restarted or moved.

WEKA Operator client-only deployment
If the WEKA cluster is outside the Kubernetes cluster but you have workloads inside Kubernetes, you can deploy a WEKA client within the Kubernetes cluster to connect to the external WEKA cluster.
Deployment workflow
Obtain setup information.
Prepare the Kubernetes environment.
Install the WEKA Operator.
Set up driver distribution.
Discover drives for WEKA cluster provisioning.
Install the WekaCluster and WekaClient custom resources.
1. Obtain setup information
To deploy the WEKA Operator in your Kubernetes environment, contact the WEKA Customer Success Team to obtain the necessary setup information.
Container repository (quay.io): Includes image pull secrets and Docker registry credentials.
Placeholders: QUAY_USERNAME, QUAY_PASSWORD, QUAY_SECRET_KEY
Example values: example_user, example_password, quay-io-robot-secret
WEKA Operator version: Placeholder WEKA_OPERATOR_VERSION, example value v1.6.1
WEKA image: Placeholder WEKA_IMAGE_VERSION_TAG, example value 4.4.5.118-k8s.4
By gathering this information in advance, you have all the required values to complete the deployment workflow efficiently. Replace the placeholders with the actual values in the setup files.
2. Prepare Kubernetes environment
Ensure the following requirements are met:
Local server requirements
Kubernetes cluster and node requirements
Kubernetes port requirements
Kubelet requirements
Image pull secrets requirements
Local server requirements
Ensure access to a server for running manual helm install commands, unless a higher-level tool (for example, Argo CD) is used. To install Helm:
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 && chmod 700 get_helm.sh && ./get_helm.sh
Kubernetes cluster and node requirements
Ensure that Kubernetes is correctly set up and configured to handle WEKA workloads.
Minimum Kubernetes version: 1.25
Minimum OpenShift version: 4.17
Kernel headers: Ensure kernel headers on each node match the kernel version.
Storage: Allocate storage on /opt/k8s-weka for WEKA containers. Estimate: ~20 GiB per WEKA container + 10 GiB per CPU core in use.
Huge pages configuration:
Compute core: 3 GiB of huge pages
Drive core: 1.5 GiB of huge pages
Client core: 1.5 GiB of huge pages
Check the current huge pages allocation:
grep Huge /proc/meminfo
Add the appropriate number of huge pages:
sudo sysctl -w vm.nr_hugepages=3000
Set huge pages to persist through reboots:
sudo sh -c 'echo "vm.nr_hugepages = 3000" >> /etc/sysctl.conf'
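As a rough illustration, the storage and huge-page figures above can be combined into a per-node estimate. The container and core counts below are illustrative, and the 2 MiB huge page size is an assumption (check your kernel's default):

```shell
# Sketch: estimate /opt/k8s-weka storage and vm.nr_hugepages for one node.
# Figures from above: ~20 GiB per WEKA container + 10 GiB per CPU core in use;
# huge pages per core: compute 3 GiB, drive 1.5 GiB, client 1.5 GiB.
CONTAINERS=2       # WEKA containers on this node (illustrative)
CORES_IN_USE=4     # CPU cores used by WEKA on this node (illustrative)
COMPUTE_CORES=2
DRIVE_CORES=2
CLIENT_CORES=0

STORAGE_GIB=$(( CONTAINERS * 20 + CORES_IN_USE * 10 ))
# Work in MiB so the 1.5 GiB figures stay integral (1.5 GiB = 1536 MiB).
HUGE_MIB=$(( COMPUTE_CORES * 3072 + (DRIVE_CORES + CLIENT_CORES) * 1536 ))
NR_HUGEPAGES=$(( HUGE_MIB / 2 ))   # assuming 2 MiB huge pages

echo "storage: ${STORAGE_GIB} GiB, vm.nr_hugepages: ${NR_HUGEPAGES}"
```

For this mix the sketch yields 80 GiB of storage and 4608 huge pages; adjust the inputs to match your node layout.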
Kubernetes port requirements
Ensure port availability according to the following (source → destination, ports, protocol):

Client connection: Client → Backend, ports 45000-65000, TCP/UDP. Clients find free ports dynamically within this range; it is not mandatory to define them explicitly.
Cluster allocation: WEKA Operator → cluster nodes, port 35000 (default), TCP/UDP. Default range for cluster allocation; can be overridden, but ensure there are no conflicts.
Backend communication: Backend → Backend, port 35000 (default), TCP/UDP. Ports are allocated by the WEKA Operator but are not guaranteed to be free; ensure the port range is available.
Port override: Operator API → WekaCluster CR, user-defined ports, TCP/UDP. Overrides allow specifying ports manually, mainly useful for migrating non-K8s clusters.
Kubelet requirements
Configure Kubelet with static CPU management to enable exclusive CPU allocation:
reservedSystemCPUs: "0"
cpuManagerPolicy: static
Check which config map holds the kubelet configuration:
kubectl get cm -A | grep kubelet
If there is more than one kubelet config map, modify the one for the worker nodes. Edit the kubelet config map to add the CPU settings:
kubectl edit cm -n kube-system kubelet-config
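For reference, these settings use standard KubeletConfiguration fields. A minimal fragment (surrounding fields omitted) looks like:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static    # enables exclusive CPU allocation for pinned pods
reservedSystemCPUs: "0"     # reserve core 0 for system daemons
```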
Image pull secrets requirements
Set up Kubernetes secrets for secure image pulling across namespaces. Apply the secret in all namespaces where WEKA resources are deployed.
Verify that namespaces are defined and do not overlap to avoid configuration conflicts.
Example:
The following example creates a secret for quay.io authentication in both the weka-operator-system namespace and the default namespace. Repeat as necessary for additional namespaces. Replace the placeholders with the actual values.
export QUAY_USERNAME='QUAY_USERNAME' # Replace with the actual value
export QUAY_PASSWORD='QUAY_PASSWORD' # Replace with the actual value
kubectl create ns weka-operator-system
# Replace QUAY_SECRET_KEY with the actual secret name
kubectl create secret docker-registry QUAY_SECRET_KEY \
--docker-server=quay.io \
--docker-username=$QUAY_USERNAME \
--docker-password=$QUAY_PASSWORD \
--docker-email=$QUAY_USERNAME \
--namespace=weka-operator-system
# Replace QUAY_SECRET_KEY with the actual secret name
kubectl create secret docker-registry QUAY_SECRET_KEY \
--docker-server=quay.io \
--docker-username=$QUAY_USERNAME \
--docker-password=$QUAY_PASSWORD \
--docker-email=$QUAY_USERNAME \
--namespace=default
3. Install the WEKA Operator
Apply WEKA Custom Resource Definitions (CRDs): Download and apply the WEKA Operator CRDs to define WEKA-specific resources in Kubernetes. Replace the version placeholder (WEKA_OPERATOR_VERSION) with the actual value:
helm pull oci://quay.io/weka.io/helm/weka-operator --untar --version <WEKA_OPERATOR_VERSION>
kubectl apply -f weka-operator/crds
Install the WEKA Operator: Deploy the WEKA Operator to the Kubernetes cluster. Specify the namespace, image version, and pull secret to enable WEKA's resources. Replace the version placeholder (WEKA_OPERATOR_VERSION) with the actual value.
helm upgrade --create-namespace \
  --install weka-operator oci://quay.io/weka.io/helm/weka-operator \
  --namespace weka-operator-system \
  --version <WEKA_OPERATOR_VERSION>
Verify the installation: Run the following:
kubectl -n weka-operator-system get pod
The returned results should look similar to this:
NAME READY STATUS RESTARTS AGE
weka-operator-controller-manager-564bfd6b49-p6k7d 2/2 Running 0 13s
4. Set up driver distribution
Driver distribution applies to client and backend entities.
Verify driver distribution prerequisites:
Ensure a WEKA-compatible image (weka-in-container) is accessible through the registry and that the necessary credentials (imagePullSecret) are available.
Define node selection criteria, especially for the Drivers Builder role, to match the kernel requirements of target nodes.
Set up the driver distribution service and drivers builder: Driver distribution is typically included as part of the operator installation process, so it is not necessary to set it up separately if you install the operator as described above.
To build and distribute WEKA drivers, the standard approach involves deploying the following components:
drivers-builder container: One container per combination of WEKA version, kernel version, and architecture.
drivers-dist container: A single container responsible for serving the compiled drivers.
Service: Exposes the drivers-dist container.
This setup supports scenarios such as handling multiple kernel versions and executing custom pre-run scripts.
Important notes:
Deploy multiple drivers-builder containers only if you need to support multiple kernel versions or multiple WEKA versions.
Replace placeholder versions with your target WekaClient and WekaCluster versions.
The image versions used in the builder containers must match the corresponding WEKA versions.
Save the manifest above to weka-driver.yaml, and apply it:
kubectl apply -f weka-driver.yaml
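To illustrate the Service role described above, and assuming the drivers-dist pod is labeled app: weka-drivers-dist and listens on port 60002 (the port used by the driversDistService URL later in this guide), a minimal Service could look like the following sketch:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: weka-drivers-dist          # matches the driversDistService hostname
  namespace: weka-operator-system
spec:
  selector:
    app: weka-drivers-dist         # assumed pod label; adapt to your manifest
  ports:
    - name: https
      port: 60002
      targetPort: 60002
```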
5. Discover drives for WEKA cluster provisioning
To provision drives for a WEKA cluster, each drive must go through a discovery process. This process ensures that all drives are correctly identified, accessible, and ready for use within the cluster.
The discovery process involves the following key actions:
Node updates during discovery
Each node is annotated with a list of known serial IDs for all drives accessible to the operator, providing a unique identifier for each drive.
An extended resource, weka.io/drives, is created to indicate the number of drives that are ready and available on each node.
Available drives
Only healthy, unblocked drives are marked as available. Drives that are manually flagged due to issues such as corruption or other unrecoverable errors are excluded from the available pool to ensure cluster stability.
Drive discovery steps
Sign drives: Each drive receives a WEKA-specific signature, marking it as ready for discovery and integration into the cluster.
Discover drives: The signed drives are detected and prepared for cluster operations. If drives already have the WEKA signature, only the discovery step is required to verify and track them in the cluster.
Drive discovery methods
The WEKA system supports two primary methods for drive discovery:
WekaManualOperation: A one-time operation that performs both drive signing and discovery, suitable for manual provisioning.
WekaPolicy: An automated, policy-driven approach that performs periodic discovery across all matching nodes. The WekaPolicy method operates on an event-driven model, initiating discovery immediately when relevant changes (such as node updates or drive additions) are detected.
6. Install the WekaCluster and WekaClient custom resources
This procedure provides step-by-step instructions for deploying the WekaCluster and WekaClient Custom Resources (CRs) in a Kubernetes cluster. Follow these procedures in sequence if both components are required. Begin with the WekaCluster CR, then create the necessary client secret, and finally deploy the WekaClient CR.
Step 1: Install the WekaCluster CR
To deploy a WEKA cluster backend using the WekaCluster CR, perform the following:
Prerequisites:
Ensure the driver distribution service is configured. This is the same service used by WEKA clients. See 4. Set up driver distribution.
Use either WekaManualOperation (recommended for initial deployments) or WekaPolicy to sign and discover drives. See 5. Discover drives for WEKA cluster provisioning.
Create a manifest file (for example, weka-cluster.yaml) with the required configuration:
apiVersion: weka.weka.io/v1alpha1
kind: WekaCluster
metadata:
  name: cluster-dev
  namespace: default
spec:
  template: dynamic
  dynamicTemplate:
    computeContainers: 6
    driveContainers: 6
    numDrives: 1
  image: quay.io/weka.io/weka-in-container:WEKA_IMAGE_VERSION_TAG # Replace with the actual image tag
  nodeSelector:
    weka.io/supports-backends: "true"
  driversDistService: "https://weka-drivers-dist.weka-operator-system.svc.cluster.local:60002"
  imagePullSecret: "QUAY_SECRET_KEY" # Replace with the actual secret
  network:
    udpMode: true
    ethDevice: br-ex
Apply the WekaCluster CR:
kubectl apply -f weka-cluster.yaml
Step 2: Create WEKA cluster client secret
Before deploying a WekaClient CR, create a Kubernetes Secret with credentials required to join the WEKA cluster.
Prepare the secret YAML: Create a file named secret.yaml:
apiVersion: v1
kind: Secret
metadata:
  name: weka-cluster-dev # The wekaSecretRef in the WekaClient CR must match this secret name
  namespace: weka-operator-system
type: Opaque
data:
  org: <base64-encoded-org>
  join-secret: <base64-encoded-join-secret>
  password: <base64-encoded-password>
  username: <base64-encoded-username>
Apply the secret:
kubectl apply -f secret.yaml
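The values under data must be base64-encoded. A quick way to produce them (the plaintext credentials below are placeholders, not real defaults):

```shell
# Base64-encode each credential for the Secret's data fields.
# Use printf rather than echo to avoid encoding a trailing newline.
printf '%s' 'my-org'         | base64   # org
printf '%s' 'my-user'        | base64   # username
printf '%s' 'my-password'    | base64   # password
printf '%s' 'my-join-secret' | base64   # join-secret
```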
Step 3: Install the WekaClient CR
The WekaClient CR deploys WekaContainers across designated Kubernetes nodes, similar to a DaemonSet but without automatic pod cleanup.
WekaClient specification (reference)
Key configurable fields in the WekaClientSpec:
type WekaClientSpec struct {
    Image               string               `json:"image"`                         // Image to be used for WekaContainer
    ImagePullSecret     string               `json:"imagePullSecret,omitempty"`     // Secret for pulling the image
    Port                int                  `json:"port,omitempty"`                // If unset (0), WEKA selects a free port from PortRange
    AgentPort           int                  `json:"agentPort,omitempty"`           // If unset (0), WEKA selects a free port from PortRange
    PortRange           *PortRange           `json:"portRange,omitempty"`           // Used for dynamic port allocation
    NodeSelector        map[string]string    `json:"nodeSelector,omitempty"`        // Specifies nodes for deployment
    WekaSecretRef       string               `json:"wekaSecretRef,omitempty"`       // Reference to Weka secret
    NetworkSelector     NetworkSelector      `json:"network,omitempty"`             // Defines network configuration
    DriversDistService  string               `json:"driversDistService,omitempty"`  // URL for driver distribution service
    DriversLoaderImage  string               `json:"driversLoaderImage,omitempty"`  // Image for drivers loader
    JoinIps             []string             `json:"joinIpPorts,omitempty"`         // IPs to join for cluster setup
    TargetCluster       ObjectReference      `json:"targetCluster,omitempty"`       // Reference to target cluster
    CpuPolicy           CpuPolicy            `json:"cpuPolicy,omitempty"`           // CPU policy, e.g., "auto," "shared," "dedicated," etc.
    CoresNumber         int                  `json:"coresNum,omitempty"`            // Number of cores to use
    CoreIds             []int                `json:"coreIds,omitempty"`             // Specific core IDs to use
    TracesConfiguration *TracesConfiguration `json:"tracesConfiguration,omitempty"` // Trace settings
    Tolerations         []string             `json:"tolerations,omitempty"`         // Tolerations for nodes
    RawTolerations      []v1.Toleration      `json:"rawTolerations,omitempty"`      // Detailed toleration settings
    AdditionalMemory    int                  `json:"additionalMemory,omitempty"`    // Additional memory allocation
    WekaHomeConfig      WekahomeClientConfig `json:"wekaHomeConfig,omitempty"`      // Deprecated field
    WekaHome            *WekahomeClientConfig `json:"wekaHome,omitempty"`           // Deprecated field
    UpgradePolicy       UpgradePolicy        `json:"upgradePolicy,omitempty"`       // Policy for handling upgrades
}
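Drawing on the field names above, a minimal weka-client.yaml might look like the following sketch. The values are illustrative and must be adapted; the apiVersion is assumed to match the WekaCluster example, and the nodeSelector label is hypothetical:

```yaml
apiVersion: weka.weka.io/v1alpha1
kind: WekaClient
metadata:
  name: client-dev
  namespace: weka-operator-system
spec:
  image: quay.io/weka.io/weka-in-container:WEKA_IMAGE_VERSION_TAG # Replace with the actual image tag
  imagePullSecret: "QUAY_SECRET_KEY"                              # Replace with the actual secret
  nodeSelector:
    weka.io/supports-clients: "true"   # assumed label; use your own node labels
  wekaSecretRef: weka-cluster-dev      # must match the Secret created in Step 2
  driversDistService: "https://weka-drivers-dist.weka-operator-system.svc.cluster.local:60002"
  targetCluster:
    name: cluster-dev
    namespace: default
```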
Apply the manifest:
kubectl apply -f weka-client.yaml
Step 4: Next steps
After deploying the WekaCluster and WekaClient CRs:
Monitor their status using kubectl get wekaClusters and kubectl get wekaClients.
Proceed to install the WEKA CSI Plugin if persistent storage access is required. See the WEKA CSI Plugin topic for plugin setup instructions.
Upgrade the WEKA Operator
Upgrading the WEKA Operator involves updating the Operator and managing wekaClient configurations to ensure all client pods operate on the latest version. Additionally, each WEKA version requires a new builder instance with a unique wekaContainer metadata name, ensuring compatibility and streamlined management of version-specific resources.
Procedure:
Upgrade the WEKA Operator: Follow the steps in Install the WEKA Operator using the latest version. Re-running the installation process with the updated version upgrades the WEKA Operator without requiring additional setup.
Configure upgrade policies for wekaClient: The upgradePolicy parameter in the wekaClient Custom Resource (CR) specification controls how client pods are updated when the WEKA version changes. Options include:
rolling: The operator automatically updates each client pod sequentially, replacing one pod at a time to maintain availability.
manual: No automatic pod replacements are performed by the operator. Manual deletion of each client pod is required, after which the pod restarts with the updated version. Use kubectl delete pod <pod-name> to delete each pod manually.
all-at-once: The operator updates all client pods simultaneously, applying the new version cluster-wide in a single step.
To apply the upgrade, update the weka-in-container version by either:
Directly editing the version with kubectl edit on the wekaClient CR.
Modifying the client configuration manifest, then reapplying it with kubectl apply -f <manifest-file>.
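For example, assuming the policy is expressed as a plain string value (the exact value format is an assumption), selecting the rolling policy in the wekaClient spec would look like this fragment:

```yaml
spec:
  upgradePolicy: rolling   # one of: rolling, manual, all-at-once (value format assumed)
```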
Create a new builder instance for each WEKA version: Rather than updating existing builder instances, create a new builder instance for each WEKA kernel version. Each builder must have a unique wekaContainer metadata name to support version-specific compatibility.
Create a new builder: For each WEKA version, create a new builder instance with an updated wekaContainer metadata name that corresponds to the new version. This ensures that clients and resources linked to specific kernel versions can continue to operate without conflicts.
Clean up outdated builders: Once the upgrade is validated and previous versions are no longer needed, you can delete outdated builder instances associated with those older versions. This cleanup step optimizes resources, but you can keep multiple builder instances if you need to support different kernel versions.
Delete a WekaCluster
When you delete a WekaCluster, the system enforces a 24-hour grace period before completing the removal. To expedite this process and delete the cluster immediately, you can set the graceful destroy duration to zero before initiating the deletion.
Procedure
Run the following command to set the graceful destroy duration to zero:
kubectl patch WekaCluster <cluster name> --type='merge' -p='{"spec":{"gracefulDestroyDuration": "0"}}'
Where:
<cluster name>: Specifies the name of your WekaCluster.
Run the following command to delete the WekaCluster:
kubectl delete WekaCluster <cluster name> --namespace <cluster namespace>
Where:
<cluster name>: Specifies the name of the WekaCluster you want to delete.
<cluster namespace>: Specifies the namespace where the cluster is located.
Best practices
Preloading images
To optimize runtime and minimize delays, preload images during the preparation phase; this can significantly reduce waiting time in subsequent steps. Without preloading, some servers may sit idle while images download, leading to further delays when all servers advance to the next step.
Display custom fields
WEKA Custom Resources enable enhanced observability by marking certain fields for display. While kubectl get displays only a limited set of fields by default, using the -o wide option or exploring with k9s allows you to view all fields.
Example command to quickly assess WekaContainer status:
kubectl get wekacontainer -o wide --all-namespaces
Example output:
NAMESPACE NAME STATUS MODE AGE DRIVES COUNT WEKA CID
weka-operator-system cluster-dev-clients-34.242.2.16 Running client 64s
weka-operator-system cluster-dev-clients-52.51.10.75 Running client 64s 12
weka-operator-system cluster-dev-compute-16fd029f-8aad-487c-be32-c74d70350f69 Running compute 6m49s 9
weka-operator-system cluster-dev-compute-33f54d4b-302d-4d85-9765-f6d9a7a31d02 Running compute 6m50s 8
... (additional rows)
weka-operator-system weka-dsc-34.242.2.16 PodNotRunning discovery 64s
This view provides a quick status overview, showing progress and resource allocation at a glance.
Troubleshooting
This section provides guidance for resolving common deployment issues with WEKA Operator.
Pod stuck in pending state
Describe the pod to identify the scheduling issue (using Kubernetes native reporting).
If the pod is blocked on weka.io/drives, the operator was unable to allocate the required drives for the corresponding WekaContainer. This issue may occur if more drives are requested than are available on the node, or if too many driveContainers are already running.
Ensure the drives are signed and that the number of drives corresponds to the number requested in the WekaCluster spec.
If there is an image pull failure, verify your imagePullSecret. Each customer must have a unique robot secret for quay.io.
Pod in “wekafsio driver not found” loop
Check the pod logs for this message; it typically indicates the WEKA driver could not be loaded. See 4. Set up driver distribution for further steps.
CSI not functioning
Ensure the nodeSelector configurations on both the CSI installation and the WekaClient match.
Appendix: Kubernetes Glossary