WEKA Operator deployments
Discover how the WEKA Operator streamlines deploying, scaling, and managing the WEKA Data Platform on Kubernetes, delivering high-performance storage for compute-intensive workloads like AI and HPC.
Overview
The WEKA Operator simplifies deploying, managing, and scaling the WEKA Data Platform within a Kubernetes cluster. It provides custom Kubernetes resources that define and manage WEKA components effectively.
By integrating WEKA's high-performance storage into Kubernetes, the Operator supports compute-intensive applications like AI, ML, and HPC. This enhances data access speed and boosts overall performance.
The WEKA Operator automates tasks, enables periodic maintenance, and ensures robust cluster management, providing resilience and scalability across the cluster. Its persistent, high-performance data layer enables efficient management of large datasets.
Target audience: This guide is intended exclusively for experienced Kubernetes cluster administrators. It provides detailed procedures for deploying the WEKA Operator on a Kubernetes cluster that meets the requirements specified in the 2. Prepare Kubernetes environment section.
Versions compatibility
The following matrix outlines the minimum version requirements for specific features when managed through the WEKA Kubernetes Operator. To ensure stability, always verify that your WEKA cluster and Operator versions are aligned.
| Feature | Minimum Operator version | Minimum WEKA version | Status |
| --- | --- | --- | --- |
| S3 | v1.7 | 4.4 | Supported |
| NFS | v1.10 | 5.1.0 | Supported |
| Audit | v1.10 | 5.1.0 | Supported |
| SMB-W | — | — | Not supported |
| Data Services | — | — | Not supported |
WEKA Operator backend deployment overview
The WEKA Operator backend deployment integrates various components within a Kubernetes cluster to deploy, manage, and scale the WEKA Data Platform effectively.
How it works
Local Server Setup: This setup integrates Kubernetes with the WekaCluster custom resources (CRDs) and facilitates WEKA Operator installation through Helm. Configuring Helm registry authentication provides access to the necessary CRDs and initiates the operator installation.
WekaCluster CR: The WekaCluster CR defines the WEKA cluster’s configuration, including storage, memory, and resource limits, while optimizing memory and CPU settings to prevent out-of-memory errors. Cluster and container management also support operational tasks through on-demand executions (through WekaManualOperation) and scheduled tasks (through WekaPolicy).
WEKA Operator:
The WEKA Operator retrieves Kubernetes configurations from WekaCluster CRs, grouping multiple WEKA containers to organize WEKA nodes into a single unified cluster.
To enable access to WEKA container images, the Operator retrieves credentials from Kubernetes secrets in each namespace that requires WEKA resources.
Using templates, it calculates the required number of containers and deploys the WEKA cluster on Kubernetes backends through a CRD.
Each node requires specific Kubelet configurations (such as kernel headers, storage allocations, and huge page settings) to optimize memory management for the WEKA containers. Data is stored in the /opt/k8s-weka directory on each node, with CPU and memory allocations determined by the number of WEKA containers and available CPU cores per node.
Driver Distribution Model: This model ensures efficient kernel module loading and compatibility across nodes, supporting scalable deployment for both clients and backends. It operates through three primary roles:
Distribution Service: A central repository storing and serving WEKA drivers for seamless access across nodes.
Drivers Builder: Compiles drivers for specific WEKA versions and kernel targets, uploading them to the Distribution Service. Multiple builders can run concurrently to support the same repository.
Drivers Loader: Automatically detects missing drivers, retrieves them from the Distribution Service, and loads them using modprobe.

WEKA Operator client deployment overview
The WEKA Operator client deployment uses the WekaClient custom resource to manage WEKA containers across a set of designated nodes, similar to a DaemonSet. Each WekaClient instance provisions WEKA containers as individual pods, creating a persistent layer that supports high availability by allowing safe pod recreation when necessary.
How it works
Deployment initiation: The user starts the deployment from a local server, which triggers the process.
Custom resource retrieval: The WEKA Operator retrieves the WekaClient custom resource (CR) configuration. This CR defines which nodes in the Kubernetes cluster run WEKA containers.
WEKA containers deployment: Based on the WekaClient CR, the Operator deploys WEKA containers across the specified Kubernetes client nodes. Each WEKA container instance runs as a single pod, similar to a DaemonSet.
Persistent storage setup: The WEKA Operator automates the deployment of the WEKA Container Storage Interface (CSI) plugin, which is the standard way to provide persistent storage for applications within Kubernetes. This plugin enables pods (clients) to dynamically provision and mount Persistent Volumes (PVs) from the WEKA system.
Starting with Operator version 1.7.0, the deployment process has been streamlined:
Embedded CSI plugin: The CSI plugin is now embedded directly within the WekaClient CR, simplifying its management.
Co-located cluster requirement: This integrated CSI deployment is only supported when the WEKA cluster and the WEKA clients reside within the same Kubernetes cluster. This is configured by referencing the WEKA cluster in the targetCluster field of the WekaClient CR.
High availability: The WEKA containers act as a persistent layer, enabling each pod to be safely recreated as needed. This supports high availability by ensuring continuous service even if individual pods are restarted or moved.

WEKA Operator client-only deployment
If the WEKA cluster is outside the Kubernetes cluster but you have workloads inside Kubernetes, you can deploy a WEKA client within the Kubernetes cluster to connect to the external WEKA cluster.
Client Pod CLI restrictions
Cluster-level WEKA CLI commands are supported only from the Compute or Drive pods.
The WEKA Operator client and application client pods operate with restricted permissions intended for data-path access only. Running cluster CLI commands, such as weka status, from these contexts is not supported and results in authorization errors.
Deployment workflow
Obtain setup information: Collect registry credentials and version tags.
Prepare Kubernetes environment: Configure servers, huge pages, and kubelet policies.
Install the WEKA Operator: Deploy the controller and define drive type ratios.
Manage driver distribution: Configure local or external driver building services.
Discover and sign drives: Identify physical storage and configure sharing policies.
Provision WEKA resources: Deploy the WekaCluster (backend) and WekaClient (frontend).
Manage resources and label propagation: Monitor the health of your WEKA resources.
Manage the WEKA cluster management proxy: Access WEKA management and service endpoints using Kubernetes Ingress resources.
Perform post-deployment storage configuration: Configure the CSI plugin and storage classes based on your operator version to enable persistent volume provisioning.
WEKA Operator currently supports only x86 architecture.
1. Obtain setup information
Identify and record the credentials required to pull WEKA container images and the specific version tags for your deployment.
Before you begin
Contact the WEKA Customer Success Team to receive your authorized registry credentials.
Procedure
Access get.weka.io/ui/operator to identify the latest WEKA_OPERATOR_VERSION and WEKA_IMAGE_VERSION_TAG.
Record the following credentials for your image pull secret:
Registry: quay.io
QUAY_USERNAME
QUAY_PASSWORD
QUAY_SECRET_KEY: Typically quay-io-robot-secret.
Replace all placeholders in your setup files with these values to ensure a consistent deployment.

2. Prepare Kubernetes environment
Ensure the infrastructure meets the performance and resiliency requirements of the WEKA data plane.
Local server requirements
Ensure access to a server for manual Helm installation, unless you use a higher-level deployment tool such as Argo CD.
Control plane high availability
Configure the Kubernetes control plane for high availability (HA) to match WEKA resiliency. HA depends on etcd quorum.
Quorum rule: etcd requires an odd number of members (N) and tolerates failures up to (N-1)/2.
Recommendation: Use five or nine etcd members for production storage backends.
Consider using an external etcd cluster or distributing control plane components across multiple failure domains. For more information, see the official Kubernetes HA topology guidance.
Node hardware and software requirements
Verify that every node in the cluster adheres to these specifications:
Kubernetes version: 1.25 or later (OpenShift 4.17 or later).
Storage allocation: Reserve ~20 GiB per WEKA container plus 10 GiB per allocated CPU core in /opt/k8s-weka.
Kernel headers: Ensure kernel headers exactly match the running kernel version to allow driver compilation.
Configure HugePages for Kubernetes worker nodes
Configure HugePages on worker nodes to ensure the WEKA process has the required memory allocation for high-performance data operations.
Memory allocation requirements
The WEKA process requires dedicated memory in the form of HugePages. The allocation size depends on the drive capacity and the number of CPU cores assigned to the process.
Server capacity: The sum of the usable capacity of all drives assigned to the server and allocated for WEKA.
Cores for WEKA: The number of CPU cores dedicated to the WEKA process on the container.
WEKA Container factor: The standard allocation of 1.7 GiB of HugePages per WEKA container.
Metadata ratio: The relationship between metadata requirements and HugePages consumption. The default value is 1000. You can increase this up to 2000 to preserve non-HugePages Resident Set Size based on server memory availability.
Headroom: A 10% multiplier (1.1) to account for memory fragmentation and operational variance.
HugePages calculation reference
Use the following formula to determine the required memory (reconstructed from the example below):
Total GiB = (Server capacity in GiB ÷ Metadata ratio + Cores for WEKA × 1.7 GiB) × 1.1
To calculate the total required HugePages:
Convert the Total GiB value to MiB by multiplying by 1024.
Divide the result by 2 to get the total number of 2 MiB HugePages needed.
Example calculation
The following example demonstrates how to calculate the required HugePages for a high-performance server configuration.
Server specifications:
CPU cores: 64 total.
Cores for WEKA: 63 cores dedicated to the WEKA process.
Storage configuration: 16 drives, each with 15.3 TiB.
Server capacity: 244.8 TiB usable capacity (250,675 GiB).
Step-by-step calculation:
Calculate total GiB: (250,675 ÷ 1000 + 63 × 1.7) × 1.1 = (250.675 + 107.1) × 1.1 ≈ 393.55 GiB.
Convert to MiB: 393.55 × 1024 ≈ 402,998 MiB.
Calculate the total required HugePages (2 MiB per HugePage): 402,998 ÷ 2 ≈ 201,499.
Final requirement: 201,500 HugePages (rounded up).
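The arithmetic above can be sketched as a small helper. The values below come from the example server; the formula itself is reconstructed from the components listed earlier, so treat it as a planning aid rather than an authoritative sizing tool.

```python
import math

def required_hugepages(capacity_gib, cores_for_weka, metadata_ratio=1000,
                       container_factor_gib=1.7, headroom=1.1, page_mib=2):
    """Return the number of 2 MiB HugePages needed for a WEKA backend node."""
    total_gib = (capacity_gib / metadata_ratio
                 + cores_for_weka * container_factor_gib) * headroom
    total_mib = total_gib * 1024
    return math.ceil(total_mib / page_mib)

# Example server: 250,675 GiB usable capacity, 63 cores dedicated to WEKA
print(required_hugepages(250_675, 63))  # 201499; the guide rounds up to 201,500
```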
Apply HugePages settings
Before you begin:
Identify the number of drives and CPU cores allocated to WEKA on the server.
Ensure you have root or sudo permissions on the worker nodes.
Procedure:
Check the current HugePages status on the server:
Apply the required HugePages value:
Persist the setting to ensure it remains active after a reboot:
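A minimal command sequence for these three steps might look like the following. The HugePages count (201500) is taken from the example above, and the sysctl drop-in file name is a common convention, not a WEKA requirement:

```shell
# Check the current HugePages status on the server
grep HugePages /proc/meminfo

# Apply the required HugePages value at runtime (requires root)
sysctl -w vm.nr_hugepages=201500

# Persist the setting so it remains active after a reboot
echo "vm.nr_hugepages=201500" > /etc/sysctl.d/99-weka-hugepages.conf
sysctl --system
```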
Identify Kubernetes port requirements
Manage port allocations for the WEKA Operator and client services to ensure reliable network communication. The WEKA Operator automates port allocation to prevent collisions within multi-cluster environments.
Starting with WEKA Operator 1.10 and WEKA 5.1.0, the operator maintains a port pool starting at port 35000. It allocates a contiguous range of 260 ports per cluster. Earlier versions of the operator and WEKA software allocate 500 ports for this purpose.
WEKA clients discover and connect to services using a separate default range that starts at port 45000. The system handles these allocations internally. Manual configuration is typically unnecessary unless specific infrastructure or policy requirements apply.
Port allocation summary
| Component | Starting port | Allocation |
| --- | --- | --- |
| WEKA Operator (v1.10+) / WEKA (v5.1.0+) | 35000 | 260 ports per cluster |
| WEKA Operator / WEKA (previous versions) | 35000 | 500 ports per cluster |
| WEKA client connectivity | 45000 | Internal allocation |
Configure Kubelet requirements
Configure the Kubelet CPU Manager with a static policy to ensure predictable, high-performance behavior for WEKA data-plane processes. This setting enables Kubernetes to assign dedicated CPU cores to Guaranteed-QoS pods, which prevents CPU contention and eliminates scheduler jitter.
Before you begin
Identify the location of the Kubelet configuration file on each worker node.
kubeadm clusters: The configuration is typically located at /var/lib/kubelet/config.yaml and managed via the kube-system/kubelet-config ConfigMap.
Other systems: Check the --config= flag in the Kubelet command line by running ps -ef | grep kubelet or systemctl status kubelet.
Procedure
Apply static core allocation to each worker node separately.
Edit the Kubelet configuration file to include the static policy and reserve a CPU for system processes.
Example: Kubelet configuration for static core allocation
In this example, static CPU management is enabled and CPU 0 (represented as 1000m) is reserved for the system to ensure the WEKA data-plane pods do not compete with OS processes.
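A sketch of the relevant fields, assuming a kubeadm-style /var/lib/kubelet/config.yaml; only cpuManagerPolicy and the reserved CPU amount are taken from the text above, the rest is standard KubeletConfiguration boilerplate:

```yaml
# /var/lib/kubelet/config.yaml (excerpt)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static      # assign dedicated cores to Guaranteed-QoS pods
systemReserved:
  cpu: "1000m"                # reserve one core's worth of CPU for OS processes
```

After changing cpuManagerPolicy, you typically must delete /var/lib/kubelet/cpu_manager_state and restart the kubelet for the new policy to take effect.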
Related information
Control CPU Management Policies on the Node
Configure image pull secrets
Set up Kubernetes secrets to enable secure image pulling from the WEKA container registry. These secrets must exist in every namespace where WEKA resources are deployed to avoid authorization failures.
Before you begin
Identify your QUAY_USERNAME, QUAY_PASSWORD, and the desired QUAY_SECRET_KEY name obtained during the setup information phase.
Procedure
Define the target namespaces and ensure they do not overlap to prevent configuration conflicts.
Create the secret for quay.io authentication in both the weka-operator-system and the default namespaces. Repeat this process for any additional namespaces as required.
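One way to create such a secret is shown below. The secret name quay-io-robot-secret follows the default mentioned in the setup information step; substitute your recorded credentials for the environment variables:

```shell
# Create the registry pull secret in each namespace that deploys WEKA resources
for ns in weka-operator-system default; do
  kubectl create secret docker-registry quay-io-robot-secret \
    --docker-server=quay.io \
    --docker-username="$QUAY_USERNAME" \
    --docker-password="$QUAY_PASSWORD" \
    --namespace "$ns"
done
```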
3. Install the WEKA Operator
Manage the lifecycle of WEKA resources by installing the WEKA Operator. This process involves applying Custom Resource Definitions (CRDs) and deploying the operator controller with specific configurations for the Container Storage Interface (CSI) and drive types.
Before you begin
Ensure the QUAY_SECRET_KEY is created in the weka-operator-system namespace.
Verify that helm and kubectl are installed on your server.
Procedure
Apply the WEKA CRDs: Download and apply the definitions required for the Kubernetes API to recognize WEKA resources. Replace <WEKA_OPERATOR_VERSION> with your specific version.
Define drive allocation ratios: Starting with version 1.10, you must specify the ratio between QLC and TLC drive types. This is essential for Hybrid Flash environments. Use the following parameters according to your storage configuration:
Hybrid Flash: --set driveSharing.driveTypesRatio='{tlc: 9, qlc: 1}'
Result: Allocates 1/10 of capacity to QLC and 9/10 to TLC.
Single drive type: --set driveSharing.driveTypesRatio='{qlc: 0}'
Result: Disables hybrid allocation.
Deploy the WEKA Operator: Execute the Helm command to install the operator. For versions 1.7.0 and later, include the CSI plugin enablement flag.
For operator versions earlier than 1.7.0, omit the --set csi.installationEnabled=true parameter.
Verify the installation: Ensure the operator pod is running.
The expected output shows the weka-operator-controller-manager pod with a Running status.
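The install and verify steps might be sketched as follows. The chart reference is left as a placeholder to be replaced with the location provided by WEKA; the --set flags are the ones described in the steps above:

```shell
# Install or upgrade the WEKA Operator (replace the chart reference and version)
helm upgrade --install weka-operator <HELM_CHART_REFERENCE> \
  --version <WEKA_OPERATOR_VERSION> \
  --namespace weka-operator-system \
  --set csi.installationEnabled=true \
  --set driveSharing.driveTypesRatio='{tlc: 9, qlc: 1}'

# Verify the operator pod is running
kubectl get pods -n weka-operator-system
```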
4. Manage driver distribution
Configure the distribution of WEKA drivers to both client and backend entities. The WEKA Operator provides a streamlined mechanism to build and serve drivers, ensuring compatibility across different kernel versions and architectures.
Before you begin
External service recommendation: WEKA recommends using the external driver distribution service at https://drivers.weka.io for most use cases. This service covers common operating systems and kernel versions.
Local distribution: If you operate in an air-gapped environment or use a custom OS build, you must configure a local driver distribution service.
Registry access: Ensure a WEKA-compatible image (weka-in-container) is accessible in your registry with a valid imagePullSecret.
Version matching: The image versions used in the builder containers must match the target WekaClient and WekaCluster versions. If an LTS image tag is available, use it.
Local driver distribution components
To build and distribute drivers locally, the system deploys the following components:
drivers-builder: One process per combination of WEKA version, kernel version, and architecture.
drivers-dist: A single process responsible for serving compiled drivers.
Service: A Kubernetes Service that exposes the drivers-dist process.
Procedure
Define node selection: Ensure that nodeSelector or nodeAffinity matches nodes that meet the kernel requirements for the target build.
Create the distribution policy: For WEKA Operator 1.6.0 and later, use a WekaPolicy to automate the deployment of these components.
Apply the configuration: Save the manifest as weka-drivers.yaml and apply it to the cluster.
Reference: WekaPolicy attributes
image: The WEKA container image used for the distributor and default builder.
interval: How often the operator reconciles the policy. Default: 1m
builderPreRunScript: Optional script to run (for example, installing a compiler) before the build.
ensureImages: List of additional WEKA images to prebuild drivers for.
Examples: Driver distribution service for WEKA Operator using WekaPolicy, starting from version 1.6.0
The WEKA operator supports driver distribution deployment using the WEKA policy. When a valid policy is applied, the operator automatically creates the required resources as shown in the examples.
Requirements: When configuring driver distribution, the following elements must be preserved exactly as shown in the provided configuration snippets:
Ports
Network modes
Core configurations
Container name (spec.name)
Example 1: Minimal policy for drivers distribution (typical)
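A hypothetical minimal manifest is sketched below using the attribute names from the reference above. The apiVersion, kind spelling, and type value are assumptions and must be checked against the CRDs shipped with your operator version:

```yaml
# weka-drivers.yaml - minimal drivers-distribution policy (sketch; apiVersion
# and the type field are assumptions, not confirmed by this guide)
apiVersion: weka.weka.io/v1alpha1
kind: WekaPolicy
metadata:
  name: drivers-distribution
  namespace: weka-operator-system
spec:
  type: drivers-distribution
  image: <registry>/weka-in-container:<WEKA_IMAGE_VERSION_TAG>
  interval: 1m
```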
WekaPolicy additional attributes
You can use the following attributes, if needed, in addition to the minimal policy:
ensureNICsPayload: Defines the configuration for ensuring a specific number of data NICs on selected nodes.
interval: Defines how often to reconcile the policy.
signDrivesPayload: Configures parameters to scan and sign drives for WEKA backend containers.
Example 2: Manual deployment of WEKA drivers distribution and builder containers
Example 3: WekaPolicy for enabling local drivers distribution
5. Discover and sign drives
Identify and prepare physical storage devices for use within the WEKA cluster. This process ensures all drives are uniquely identified, healthy, and ready for integration.
The discovery process performs the following actions:
Node annotation: Each node is updated with a list of known serial IDs for all accessible drives.
Resource creation: An extended resource, weka.io/drives, is created on each node to indicate the count of ready drives.
Health verification: Only healthy, unblocked drives are marked as available. Drives with errors or manual blocks are excluded to maintain cluster stability.
Drive discovery methods
Identify the appropriate method for your environment:
WekaManualOperation: A one-time action that signs and discovers drives. Use this for initial manual provisioning.
WekaPolicy: An automated approach that performs periodic discovery. It initiates discovery immediately when it detects node updates or hardware additions.
Procedure
Define drive sharing and signing: Apply a WekaPolicy to sign compatible drives. For WEKA 5.1.0 and Operator 1.10, enable drive sharing to support composable clusters.
Initiate discovery: Use a WekaManualOperation to detect signed drives across the cluster. Replace placeholders with your recorded version and secret key.
Verify discovery: Confirm that the weka.io/drives extended resource is present on the target nodes.
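One way to perform this check; the custom-columns expression is illustrative:

```shell
# Show the weka.io/drives extended resource reported by each node
kubectl get nodes \
  -o custom-columns='NODE:.metadata.name,DRIVES:.status.allocatable.weka\.io/drives'
```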
Reference: Drive selection types
all-not-root: Signs all detected block devices except the root device.
aws-all: Detects NVMe devices using AWS PCI identifiers.
device-paths: Targets specific device paths listed in the manifest.
6. Provision WEKA resources
Deploy the WekaCluster and WekaClient Custom Resources (CRs) to provision the backend storage and connect your Kubernetes nodes.
Perform these steps in sequence:
Install the WekaCluster CR.
Create the WEKA cluster client secret.
Install the WekaClient CR.
1. Install the WekaCluster CR
Provision the WEKA cluster backend using the WekaCluster CR. This resource defines the storage containers, drive configurations, and networking for the cluster.
Before you begin
Drive discovery: Ensure you have signed and discovered drives.
Driver distribution: Verify the driver distribution service is accessible. WEKA recommends the external service at https://drivers.weka.io.
Drive sharing: If using WEKA 5.1.0 and Operator 1.10 onwards, use the containerCapacity parameter instead of numDrives.
Procedure
Create a manifest file named weka-cluster.yaml.
Configure the resource using the following template. Replace the image tag and secret key placeholders with your recorded values.
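A sketch of such a manifest, using only parameters documented in the reference below; the apiVersion, container counts, and node label are assumptions for illustration:

```yaml
# weka-cluster.yaml (sketch; apiVersion, sizing values, and the node label
# are assumptions, not confirmed by this guide)
apiVersion: weka.weka.io/v1alpha1
kind: WekaCluster
metadata:
  name: cluster-dev
  namespace: weka-operator-system
spec:
  template: dynamic
  dynamicTemplate:
    computeContainers: 5
    driveContainers: 5
    numDrives: 1          # use containerCapacity instead for WEKA 5.1.0 / Operator 1.10+
  image: <registry>/weka-in-container:<WEKA_IMAGE_VERSION_TAG>
  imagePullSecret: quay-io-robot-secret
  driversDistService: https://drivers.weka.io
```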
Apply the manifest:
Reference: WekaCluster parameters
Identify and configure the parameters for the WekaCluster Custom Resource (CR) to define the backend storage environment.
template: Specifies the deployment template. Currently, only dynamic is supported. Default: dynamic
dynamicTemplate: Defines the scale of the cluster, including the number of computeContainers, driveContainers, and numDrives (or containerCapacity for v1.10+).
image: The WEKA container image version to deploy.
imagePullSecret: The Kubernetes secret name used to authenticate with the image registry.
driversDistService: The URL of the driver distribution service (for example, https://drivers.weka.io).
nodeSelector: A map of key-value pairs used to select the nodes for the cluster pods.
roleNodeSelector: Defines specific node scheduling for compute, drive, and s3 roles.
wekaHome: Configures the endpoint and cacertSecret for WEKA Home connectivity.
ipv6: Enables or disables IPv6 networking.
additionalMemory: Specifies additional memory allocation per role beyond the default. Default: 0
ports: Overrides default port assignments, typically used for cluster migration.
operatorSecretRef: Reference to a secret used for migration-by-healing from non-Kubernetes environments.
expandEndpoints: Enables endpoint expansion during migration scenarios. Default: false
hugepagesOffsets: Specifies memory offsets for hugepage allocations (for example, driveHugepagesOffset).
tolerations: A list of strings that expand to standard Kubernetes tolerations.
rawTolerations: A list of structured Kubernetes toleration objects for advanced scheduling.
network: Configures networking modes, such as udpMode or specific ethDevice settings.
2. Create the WEKA cluster client secret
Create a Kubernetes Secret containing the credentials required for clients to join the WEKA cluster.
Before you begin
Obtain the org, join-secret, password, and username from your WEKA backend.
Procedure
Encode each credential value to base64.
Linux/macOS example:
echo -n 'my_password' | base64
Create a file named secret.yaml and populate it with the encoded values.
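A sketch of such a file; the secret name weka-cluster-dev matches the example used later in this guide, and the key names come from the credentials listed above:

```yaml
# secret.yaml (sketch; replace each value with your base64-encoded credential)
apiVersion: v1
kind: Secret
metadata:
  name: weka-cluster-dev
  namespace: weka-operator-system
type: Opaque
data:
  org: <base64-encoded-org>
  join-secret: <base64-encoded-join-secret>
  password: <base64-encoded-password>
  username: <base64-encoded-username>
```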
Apply the secret:
3. Install the WekaClient CR
Deploy the WekaClient Custom Resource (CR) to manage WEKA containers across designated Kubernetes nodes. The WekaClient CR operates similarly to a DaemonSet, provisioning individual pods that maintain a persistent data-plane layer for your workloads.
Before you begin
Label nodes: Apply the following label to every worker node intended to host WEKA client pods: kubectl label nodes <node-name> weka.io/supports-clients=true
Verify secrets: Ensure a Kubernetes Secret (for example, weka-cluster-dev) exists in the weka-operator-system namespace. The secret must contain base64-encoded cluster credentials (org, join-secret, password, and username).
Identify drivers service: Identify whether you are using the external driver distribution service (https://drivers.weka.io) or a local service endpoint.
Procedure
Create a manifest file named weka-client.yaml.
Configure the WekaClient resource based on your environment. Use the targetCluster field for internal Kubernetes-managed clusters or joinIpPorts for clusters external to the environment.
Example: Internal cluster connection
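A sketch of an internal connection, using fields from the reference below; the apiVersion and the shape of the targetCluster reference are assumptions:

```yaml
# weka-client.yaml - internal cluster connection (sketch; apiVersion and the
# targetCluster structure are assumptions, not confirmed by this guide)
apiVersion: weka.weka.io/v1alpha1
kind: WekaClient
metadata:
  name: client-dev
  namespace: weka-operator-system
spec:
  image: <registry>/weka-in-container:<WEKA_IMAGE_VERSION_TAG>
  imagePullSecret: quay-io-robot-secret
  driversDistService: https://drivers.weka.io
  nodeSelector:
    weka.io/supports-clients: "true"
  targetCluster:
    name: cluster-dev
    namespace: weka-operator-system
```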
Example: External cluster connection
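A sketch of an external connection, again using fields from the reference below; the apiVersion, the joinIpPorts entry format, and the port number are assumptions:

```yaml
# weka-client.yaml - external cluster connection (sketch; apiVersion, the
# joinIpPorts format, and the example port are assumptions)
apiVersion: weka.weka.io/v1alpha1
kind: WekaClient
metadata:
  name: client-external
  namespace: weka-operator-system
spec:
  image: <registry>/weka-in-container:<WEKA_IMAGE_VERSION_TAG>
  imagePullSecret: quay-io-robot-secret
  driversDistService: https://drivers.weka.io
  nodeSelector:
    weka.io/supports-clients: "true"
  joinIpPorts:
    - <backend-ip>:<port>
  wekaSecretRef: weka-cluster-dev
```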
Apply the manifest.
Reference: WekaClient parameters
Identify the configurable fields within the WekaClient specification to customize your deployment.
image: The WEKA container image version to deploy.
imagePullSecret: Secret name used to authenticate with the image registry.
port: Defines a range of 100 ports for the container. Default: Dynamic
agentPort: Specifies a single port used by the agent process. Default: Dynamic
portRange: Defines a basePort (for example, 45000) for automatic allocation.
nodeSelector: Selects the nodes where WEKA containers are scheduled.
network: Defines the network device (for example, mlnx0). Default: UDP mode
driversDistService: URL for the driver distribution service.
targetCluster: Reference to a WekaCluster CR within the same environment.
joinIpPorts: IP addresses used to join a cluster outside the local environment.
wekaSecretRef: Reference to the Kubernetes Secret containing cluster credentials.
coresNum: Number of physical CPU cores to allocate to each container. Default: 1
cpuPolicy: Defines core allocation behavior (auto, manual, shared, dedicated). Default: auto
upgradePolicy: Sets the upgrade strategy (rolling, manual, all-at-once). Default: rolling
gracefulDestroyDuration: Pause duration for local data/drive allocations during pod deletion. Default: 24H
7. Manage resources and label propagation
Monitor the health of your WEKA resources and understand how configuration metadata flows through the system.
Label propagation
The WEKA Operator automatically propagates labels from parent objects to children to maintain consistent metadata across the environment:
WekaClient > WekaContainer > Pod
WekaPolicy > WekaContainer
WekaCluster > WekaContainer
Resource monitoring
Run the following commands to verify the status of your deployment:
Monitor cluster status: kubectl get wekaClusters
Monitor client status: kubectl get wekaClients
8. Manage the WEKA cluster management proxy
Access WEKA management and service endpoints using Kubernetes Ingress resources. WEKA exposes these endpoints to provide a unified interface for system administration and monitoring.
To enable external access, the Kubernetes environment typically requires the following infrastructure:
Ingress controller: A controller such as NGINX or Traefik to manage incoming traffic.
External connectivity: A load balancer or equivalent mechanism to route traffic from outside the cluster.
DNS resolution: Configured hostnames that resolve to the Ingress controller's external IP.
TLS termination: Optional platform-managed certificate management for secure HTTPS communication.
Ingress configuration
WEKA simplifies basic setups by managing Ingress configuration through a single ingressClass setting. For advanced or highly customized networking scenarios, you can wrap or modify the service using standard Kubernetes Ingress resources.
WEKA does not install or configure Ingress controllers, external load balancers, DNS records, or TLS certificates. These components remain the responsibility of the platform administrator.
9. Perform post-deployment storage configuration
Configure the CSI plugin and storage classes based on your operator version to enable persistent volume provisioning.
v1.7.0 and newer
CSI plugin and StorageClass are configured automatically.
Proceed to create a Persistent Volume Claim (PVC). See Dynamic and static provisioning.
v1.6.2 and older
CSI plugin requires manual installation.
Manually install the WEKA CSI Plugin. See WEKA CSI Plugin.
For v1.7.0+, the operator creates storage classes following the pattern weka-<groupName>-<fsName>. To disable this, set csi.storageClassCreationDisabled: true in your Helm values.
Upgrade the WEKA Operator
Upgrading the WEKA Operator involves updating the Operator and managing wekaClient configurations to ensure all client pods operate on the latest version. Additionally, each WEKA version requires a new builder instance with a unique wekaContainer metadata name, ensuring compatibility and streamlined management of version-specific resources.
Procedure:
Upgrade the WEKA Operator: Follow the steps in Install the WEKA Operator using the latest version. Re-running the installation process with the updated version upgrades the WEKA Operator without requiring additional setup.
Configure upgrade policies for wekaClient: The upgradePolicy parameter in the wekaClient Custom Resource (CR) specification controls how client pods are updated when the WEKA version changes. Options include:
rolling: The operator automatically updates each client pod sequentially, replacing one pod at a time to maintain availability.
manual: No automatic pod replacements are performed by the operator. Manual deletion of each client pod is required, after which the pod restarts with the updated version. Use kubectl delete pod <pod-name> to delete each pod manually.
all-at-once: The operator updates all client pods simultaneously, applying the new version cluster-wide in a single step.
To apply the upgrade, update the weka-in-container version in one of two ways:
Edit the version with kubectl edit on the wekaClient CR.
Modify the client configuration manifest, then reapply it with kubectl apply -f <manifest-file>.
Create a new builder instance for each WEKA version: Rather than updating existing builder instances, create a new instance of the builder for each WEKA kernel version. Each builder must have a unique wekaContainer metadata name to support version-specific compatibility.
Create a new builder: For each WEKA version, create a new builder instance with an updated wekaContainer metadata name that corresponds to the new version. This ensures that clients and resources linked to specific kernel versions can continue to operate without conflicts.
Clean up outdated builders: Once the upgrade is validated and previous versions are no longer needed, you can delete outdated builder instances associated with those older versions. This cleanup step optimizes resources, but you can maintain multiple builder instances if supporting different kernel versions is required.
Delete a WekaCluster
When you delete a WekaCluster, the system enforces a 24-hour grace period before completing the removal. To expedite this process and delete the cluster immediately, you can set the graceful destroy duration to zero before initiating the deletion.
Procedure
Run the following command to set the graceful destroy duration to zero:
Where:
<cluster name>: Specifies the name of your WekaCluster.
Run the following command to delete the WekaCluster:
Where:
<cluster name>: Specifies the name of the WekaCluster you want to delete.
<cluster namespace>: Specifies the namespace where the cluster is located.
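The two steps might be sketched as follows. The patch assumes the gracefulDestroyDuration field sits at the top level of the spec, which should be verified against your CRD before use:

```shell
# Set the graceful destroy duration to zero (field path is an assumption)
kubectl patch wekacluster <cluster name> -n <cluster namespace> \
  --type merge -p '{"spec":{"gracefulDestroyDuration":"0"}}'

# Delete the WekaCluster
kubectl delete wekacluster <cluster name> -n <cluster namespace>
```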
Migrate a WEKA client to a Kubernetes Operator-controlled client
To migrate a WEKA client running directly on a worker node to a Kubernetes Operator-controlled client, select either the container name override approach or a clean installation based on your environment's needs. Choose the container name override approach for minimal operational impact, or opt for a clean installation if you prefer a fresh environment without legacy components.
Migrate with container name override
This approach smoothly migrates the WEKA client without interrupting workloads by using container name overrides.
Before you begin
Ensure the environment does not use local mounts.
To prevent client duplication conflicts, ensure that quick manual removal of containers is possible.
Anticipate a maximum of two minutes of I/O stalls during the switchover process.
When WEKA modifies cgroups, the allocated CPU cores are not automatically freed. Reclaiming them in Kubernetes typically requires a node reboot, although a restart of the Kubernetes service may sometimes recover these resources, depending on the configuration. Until the node is rebooted, the CPUs remain double-allocated.
Procedure
Identify the standalone container name: Run the following command on the worker node to locate the active WEKA client container.
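For example, the WEKA CLI can list the locally running containers:

```shell
weka local ps
```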
Example output:
Note the name in the CONTAINER column, for example, client.
Configure the deployment manifest: Update the wekaclients YAML file with the exact container name identified in the previous step. Insert the name into the overrides section under the WekaClient spec:
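A minimal sketch of the relevant manifest fragment, assuming the override key is wekaContainerName (the name referenced later in this guide) and that the apiVersion and namespace shown match your installed CRDs; verify both with `kubectl explain wekaclient.spec` in your environment:

```yaml
apiVersion: weka.weka.io/v1alpha1    # assumption: check your installed CRD version
kind: WekaClient
metadata:
  name: client-migration
  namespace: weka-operator-system    # hypothetical namespace
spec:
  overrides:
    wekaContainerName: client        # the exact name reported by `weka local ps`
```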
Apply the configuration: Deploy the updated WEKA client file to the Kubernetes cluster to initiate the Operator-based client.
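For example, assuming the manifest file is named wekaclients.yaml:

```shell
kubectl apply -f wekaclients.yaml
```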
Remove the standalone container: Run the following commands on the worker node immediately after applying the new configuration. Complete these steps within two minutes to avoid crashes caused by duplicate clients.
Stop the container: weka local stop <container_name> (use --force if needed)
Remove the container: weka local rm <container_name>
Service cleanup: After a successful deployment, if the legacy WEKA service is no longer required, manually remove it from the Kubernetes worker node that runs the WEKA client.
Migrate with a clean installation
This approach evicts the workload from the node and performs a clean installation of the WEKA client through the Kubernetes Operator, ensuring a fresh environment without requiring a container name override.
Before you begin
Ensure the cluster has sufficient resources to handle workloads during node eviction.
The environment must not use local mounts. Use only CSI.
This procedure may cause a temporary disruption to the node being migrated. Anticipate up to two minutes of I/O delays during the switchover process as the Operator-based client establishes connectivity.
Procedure
Evict the node: Use the Kubernetes eviction process to move all running pods to other healthy worker nodes in the cluster. This prevents data access errors for active applications during the client removal.
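Eviction uses the standard Kubernetes drain workflow, for example:

```shell
# Cordon the node and evict all pods so they reschedule onto healthy nodes
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```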
Uninstall the standalone client: Log in to the Kubernetes worker node that runs the WEKA client and remove the existing WEKA service and its components. Use the following command to ensure a complete cleanup.
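A hedged sketch of the cleanup, using the weka local commands shown elsewhere in this guide; depending on how the client was originally installed, an additional agent-level uninstall may also be required:

```shell
# Stop and remove the standalone client container on the worker node
weka local stop <container_name> --force
weka local rm <container_name>
```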
Verify container removal: Ensure no legacy WEKA processes remain active on the node. Run:
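For example:

```shell
weka local ps
```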
Confirm that no WEKA containers are running.
Install the Operator-managed client: Apply the wekaclients YAML manifest to the cluster. The Operator now manages the new container lifecycle, eliminating the need for the wekaContainerName override.
Monitor the switchover: Observe the system as the Operator pulls the necessary images and starts the client processes.
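One way to watch the rollout, assuming the client pods run in the Operator's namespace (the namespace name here is hypothetical):

```shell
kubectl get pods -n weka-operator-system -w
```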
Best practices
Preloading images
To optimize runtime and minimize delays, preload images during the preparation phase; this can significantly reduce waiting time in subsequent steps. Without preloading, some servers may sit idle while images download, causing further delays when all servers advance to the next step.
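A hedged sketch of preloading on a containerd-based node with crictl; the image reference is a placeholder for the WEKA image your deployment actually pulls:

```shell
# Pull the image ahead of time on each worker node so pods start without
# waiting for the download (requires crictl configured for the node's runtime)
crictl pull quay.io/<your-org>/<weka-image>:<version>
```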
Display custom fields
WEKA Custom Resources enable enhanced observability by marking certain display fields. While kubectl get displays only a limited set of fields by default, using the -o wide option or exploring through k9s allows you to view all fields.
Example command to quickly assess WekaContainer status:
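For example, assuming the CRD's plural resource name is wekacontainers:

```shell
kubectl get wekacontainers -A -o wide
```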
Example output:
This view provides a quick status overview, showing progress and resource allocation at a glance.
Troubleshooting
This section provides guidance for resolving common deployment issues with WEKA Operator.
Pod stuck in pending state
Describe the pod to identify the scheduling issue (using Kubernetes native reporting).
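For example:

```shell
kubectl describe pod <pod-name> -n <namespace>
```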
If the pod is blocked on weka.io/drives, it indicates that the operator was unable to allocate the required drives for the corresponding WekaContainer. This issue may occur if the user has requested more drives than are available on the node or if there are too many driveContainers already running.
Ensure the drives are signed and that the number of drives matches the number requested in the WekaCluster spec.
If there’s an image pull failure, verify your imagePullSecret. Each customer must have a unique robot secret for quay.io.
Pod in “wekafsio driver not found” loop
Check the logs for this message and see the related section for further steps.
CSI not functioning
Ensure the nodeSelector configurations on both the CSI installation and the WekaClient match.
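One way to compare the two selectors side by side; the CSI DaemonSet and namespace names below are placeholders for your actual installation:

```shell
# Print the nodeSelector of the WekaClient resource
kubectl get wekaclient <name> -o jsonpath='{.spec.nodeSelector}'

# Print the nodeSelector used by the CSI node DaemonSet
kubectl get daemonset <csi-node-daemonset> -n <csi-namespace> \
  -o jsonpath='{.spec.template.spec.nodeSelector}'
```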
Appendix: Kubernetes Glossary
Learning Kubernetes is outside the scope of this document. This glossary covers essential Kubernetes components and concepts to support understanding of the environment. It is provided for convenience only and does not replace the requirement for Kubernetes knowledge and experience.
Pod
A Pod is the smallest, most basic deployable unit in Kubernetes. It represents a single instance of a running process in a cluster, typically containing one or more containers that share storage, network, and a single IP address. Pods are usually ephemeral; when they fail, a new Pod is created to replace them.
Node
A Node is a physical or virtual machine that serves as a worker in a Kubernetes cluster, running Pods and providing the necessary compute resources. Each Node is managed by the Kubernetes control plane and runs components like kubelet, kube-proxy, and a container runtime.
Namespace
A Namespace is a Kubernetes resource that divides a cluster into virtual sub-clusters, allowing for isolated environments within a single physical cluster. Namespaces help organize resources, manage permissions, and enable resource quotas within a cluster.
Label
Labels are key-value pairs attached to Kubernetes objects, like Pods and Nodes, used for identification and grouping. Labels facilitate organizing, selecting, and operating on resources, such as scheduling workloads based on specific node labels.
Taint
Taints are properties applied to Nodes to restrict Pod scheduling. A taint on a Node prevents Pods without a matching toleration from being scheduled there. Taints often prevent certain workloads from running on specific Nodes unless explicitly permitted.
Toleration
A Toleration is a property of Pods that enables them to be scheduled on Nodes with matching taints. Tolerations work with taints to control which workloads can run on specific Nodes in the cluster.
Affinity and Anti-Affinity
Affinity rules allow administrators to specify which Nodes or other Pods a given Pod should run nearby. Anti-affinity rules define the opposite: which Pods should not be scheduled near each other. These rules help with optimal resource allocation and reliability.
Selector
Selectors are expressions that enable filtering and selecting specific resources within the Kubernetes API. Node selectors, for example, specify the Nodes on which a Pod can run by matching their labels.
Deployment
A Deployment is a higher-level object for managing and scaling applications in Kubernetes. It defines the desired state for Pods and ensures they are created, updated, and scaled to maintain that state.
DaemonSet
A DaemonSet ensures that a specific Pod runs on all (or some) Nodes in the cluster, often used for tasks like logging, monitoring, or networking, where each Node requires the same component.
ReplicaSet
A ReplicaSet ensures a specified number of replicas of a Pod are running at any given time, allowing for redundancy and high availability. It is often managed by a Deployment, which abstracts the ReplicaSet management.
Service
A Service is an abstraction that defines a logical set of Pods and provides a stable network endpoint for access. It enables reliable communication between different Pods or external services, regardless of the individual Pods’ IP addresses.
ConfigMap
A ConfigMap is a Kubernetes resource used to store application configuration data. It separates configuration from application code, enabling easy updates without redeploying the entire application.
Secret
A Secret is a Kubernetes object used to store sensitive information, such as passwords, tokens, or keys. Like ConfigMaps, secrets are designed for confidential data, and Kubernetes provides mechanisms for securely managing and accessing them.
Persistent Volume (PV)
A Persistent Volume is a storage resource in Kubernetes that exists independently of any particular Pod. PVs provide long-term storage that persists beyond the lifecycle of individual Pods.
Persistent Volume Claim (PVC)
A Persistent Volume Claim is a request for storage made by a Pod. PVCs allow Pods to use persistent storage resources, which are dynamically or statically provisioned in the cluster.
Ingress
Ingress is a Kubernetes resource that manages external access to services within a cluster, typically via HTTP/HTTPS. Ingress enables load balancing, SSL termination, and routing to various services based on the request path.
Container Runtime
The container runtime is the underlying software that runs containers on a Node. Kubernetes supports multiple container runtimes, such as Docker, containerd, and CRI-O.
Operator
An Operator is a method of packaging, deploying, and managing a Kubernetes application or service. It often provides automated management and monitoring for complex applications in Kubernetes clusters.