WEKA v4.4 documentation

WEKA Operator deployment

Discover how the WEKA Operator streamlines deploying, scaling, and managing the WEKA Data Platform on Kubernetes, delivering high-performance storage for compute-intensive workloads like AI and HPC.

Overview

The WEKA Operator simplifies deploying, managing, and scaling the WEKA Data Platform within a Kubernetes cluster. It provides custom Kubernetes resources that define and manage WEKA components effectively.

By integrating WEKA's high-performance storage into Kubernetes, the Operator supports compute-intensive applications like AI, ML, and HPC. This enhances data access speed and boosts overall performance.

The WEKA Operator automates tasks, enables periodic maintenance, and ensures robust cluster management, providing resilience and scalability across the cluster. Its persistent, high-performance data layer enables efficient management of large datasets.

Target audience: This guide is intended exclusively for experienced Kubernetes cluster administrators. It provides detailed procedures for deploying the WEKA Operator on a Kubernetes cluster that meets the specified requirements.

WEKA Operator backend deployment overview

The WEKA Operator backend deployment integrates various components within a Kubernetes cluster to deploy, manage, and scale the WEKA Data Platform effectively.

How it works

  • Local Server Setup: This setup integrates Kubernetes with the WekaCluster custom resource definitions (CRDs) and facilitates WEKA Operator installation through Helm. Configuring Helm registry authentication provides access to the necessary CRDs and initiates the operator installation.

  • WekaCluster CR: The WekaCluster CR defines the WEKA cluster’s configuration, including storage, memory, and resource limits, while optimizing memory and CPU settings to prevent out-of-memory errors. Cluster and container management also support operational tasks through on-demand executions (through WekaManualOperation) and scheduled tasks (through WekaPolicy).

  • WEKA Operator:

    • The WEKA Operator retrieves Kubernetes configurations from WekaCluster CRs, grouping multiple WEKA containers to organize WEKA nodes into a single unified cluster.

    • To enable access to WEKA container images, the Operator retrieves credentials from Kubernetes secrets in each namespace that requires WEKA resources.

    • Using templates, it calculates the required number of containers and deploys the WEKA cluster on Kubernetes backends through a CRD.

    • Each node requires specific Kubelet configurations—such as kernel headers, storage allocations, and huge page settings—to optimize memory management for the WEKA containers. Data is stored in the /opt/k8s-weka directory on each node, with CPU and memory allocations determined by the number of WEKA containers and available CPU cores per node.

  • Driver Distribution Model: This model ensures efficient kernel module loading and compatibility across nodes, supporting scalable deployment for both clients and backends. It operates through three primary roles:

    • Distribution Service: A central repository storing and serving WEKA drivers for seamless access across nodes.

    • Drivers Builder: Compiles drivers for specific WEKA versions and kernel targets, uploading them to the Distribution Service. Multiple builders can run concurrently to support the same repository.

    • Drivers Loader: Automatically detects missing drivers, retrieves them from the Distribution Service, and loads them using modprobe.

WEKA Operator client deployment overview

The WEKA Operator client deployment uses the WekaClient custom resource to manage WEKA containers across a set of designated nodes, similar to a DaemonSet. Each WekaClient instance provisions WEKA containers as individual pods, creating a persistent layer that supports high availability by allowing safe pod recreation when necessary.

How it works

  • Deployment initiation: The user starts the deployment from a local server, which triggers the process.

  • Custom resource retrieval: The WEKA Operator retrieves the WekaClient custom resource (CR) configuration. This CR defines which nodes in the Kubernetes cluster run WEKA containers.

  • WEKA containers deployment: Based on the WekaClient CR, the Operator deploys WEKA containers across the specified Kubernetes client nodes. Each WEKA container instance runs as a single pod, similar to a DaemonSet.

  • Persistent storage setup: Using the WEKA Container Storage Interface (CSI) plugin, the WEKA Operator sets up a persistent volume (PV) for the clients. This storage is managed by the WEKA Operator and is a prerequisite for clients relying on WEKA.

  • High availability: The WEKA containers act as a persistent layer, enabling each pod to be safely recreated as needed. This supports high availability by ensuring continuous service even if individual pods are restarted or moved.

WEKA Operator client-only deployment

If the WEKA cluster is outside the Kubernetes cluster but you have workloads inside Kubernetes, you can deploy a WEKA client within the Kubernetes cluster to connect to the external WEKA cluster.

Deployment workflow

  1. Obtain setup information.

  2. Prepare Kubernetes environment.

  3. Install the WEKA Operator.

  4. Set up driver distribution.

  5. Discover drives for WEKA cluster provisioning.

  6. Install the WekaCluster and WekaClient custom resources.

1. Obtain setup information

To deploy the WEKA Operator in your Kubernetes environment, contact the WEKA Customer Success Team to obtain the necessary setup information.

  • Component: Quay credentials (image pull secrets and Docker)
    Parameters: QUAY_USERNAME, QUAY_PASSWORD, QUAY_SECRET_KEY
    Examples: example_user, example_password, quay-io-robot-secret

  • Component: WEKA Operator version
    Parameter: WEKA_OPERATOR_VERSION
    Example: v1.4.0

  • Component: WEKA image
    Parameter: WEKA_IMAGE_VERSION_TAG
    Example: 4.3.5.105-dist-drivers.5

By gathering this information in advance, you have all the required values to complete the deployment workflow efficiently. Replace the placeholders with the actual values in the setup files.
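The deployment steps below reference these placeholders repeatedly. One way to keep the values at hand is a small environment file (the variable names come from the table above; the values shown are the table's examples, not real credentials):

```shell
# Illustrative environment file using the placeholder names from the table.
# Replace every value with the one provided by the WEKA Customer Success Team.
export QUAY_USERNAME='example_user'
export QUAY_PASSWORD='example_password'
export QUAY_SECRET_KEY='quay-io-robot-secret'
export WEKA_OPERATOR_VERSION='v1.4.0'
export WEKA_IMAGE_VERSION_TAG='4.3.5.105-dist-drivers.5'

echo "Operator $WEKA_OPERATOR_VERSION with image tag $WEKA_IMAGE_VERSION_TAG"
```

Sourcing such a file before running the later kubectl and helm commands keeps the placeholder substitutions consistent.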

2. Prepare Kubernetes environment

Ensure the following requirements are met:

  • Local server requirements

  • Kubernetes cluster and node requirements

  • Kubernetes port requirements

  • Kubelet requirements

  • Image pull secrets requirements

Local server requirements

  1. Ensure access to a server for manual helm install, unless a higher-level tool (for example, Argo CD) is used.

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 && chmod 700 get_helm.sh && ./get_helm.sh

Kubernetes cluster and node requirements

Ensure that Kubernetes is correctly set up and configured to handle WEKA workloads.

  • Minimum Kubernetes version: 1.25

  • Minimum OpenShift version: 4.17

  1. Kernel headers: Ensure kernel headers on each node match the kernel version.

  2. Storage: Allocate storage on /opt/k8s-weka for WEKA containers. Estimate: ~20 GiB per WEKA container + 10 GiB per CPU core in use.

  3. Huge pages configuration:

    • Compute core: 3 GiB of huge pages

    • Drive core: 1.5 GiB of huge pages

    • Client core: 1.5 GiB of huge pages

    • Check the current huge pages configuration: grep Huge /proc/meminfo

    • Add the appropriate number of huge pages: sudo sysctl -w vm.nr_hugepages=3000

    • Set huge pages to persist through reboots: sudo sh -c 'echo "vm.nr_hugepages = 3000" >> /etc/sysctl.conf'
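As a back-of-the-envelope aid, the storage and huge-page figures above can be turned into per-node estimates. This is a sketch assuming the default 2 MiB huge page size; the container and core counts are illustrative, not prescribed values:

```shell
# Per-node sizing sketch using the figures above:
#   storage: ~20 GiB per WEKA container + 10 GiB per CPU core in use
#   huge pages: 3 GiB per compute core, 1.5 GiB per drive or client core
# Assumes the default 2 MiB huge page size; the counts below are examples.
WEKA_CONTAINERS=2
CORES_IN_USE=6
COMPUTE_CORES=4
DRIVE_CORES=2

STORAGE_GIB=$(( WEKA_CONTAINERS * 20 + CORES_IN_USE * 10 ))

# 3 GiB = 1536 x 2 MiB pages; 1.5 GiB = 768 x 2 MiB pages
HUGE_PAGES=$(( COMPUTE_CORES * 1536 + DRIVE_CORES * 768 ))

echo "/opt/k8s-weka storage estimate: ${STORAGE_GIB} GiB"
echo "vm.nr_hugepages estimate: ${HUGE_PAGES}"
```

The resulting page count is what you would pass to sysctl, as shown in the steps above.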

Kubernetes port requirements

Ensure port availability according to the following table:

  • Client connection: Client → Backend, target ports 45000-65000, TCP/UDP. Clients find free ports dynamically within this range; it is not mandatory to define them explicitly.

  • Cluster allocation: WEKA Operator → Cluster nodes, target port 35000 (default), TCP/UDP. Default range for cluster allocation; can be overridden, but ensure there are no conflicts.

  • Backend communication: Backend → Backend, target port 35000 (default), TCP/UDP. Ports are allocated within the WEKA Operator but are not guaranteed to be free; ensure the port range is available.

  • Port override: Operator API → WekaCluster CR, user-defined target ports, TCP/UDP. Overrides allow specifying ports manually, mainly useful for migrating non-K8s clusters.

Kubelet requirements

  1. Configure the Kubelet with static CPU management to enable exclusive CPU allocation:

     reservedSystemCPUs: "0"
     cpuManagerPolicy: static

  2. Check which ConfigMap holds the kubelet configuration: kubectl get cm -A | grep kubelet
     If there is more than one kubelet config, modify the config for the worker nodes.

  3. Edit the kubelet ConfigMap to add the CPU settings: kubectl edit cm -n kube-system kubelet-config

Image pull secrets requirements

  • Set up Kubernetes secrets for secure image pulling across namespaces. Apply the secret in all namespaces where WEKA resources are deployed.

  • Verify that namespaces are defined and do not overlap to avoid configuration conflicts.

Example:

The following example creates a secret for quay.io authentication in both the weka-operator-system namespace and the default namespace. Repeat as necessary for additional namespaces. Replace the placeholders with the actual values.

export QUAY_USERNAME='QUAY_USERNAME' # Replace with the actual value
export QUAY_PASSWORD='QUAY_PASSWORD' # Replace with the actual value

kubectl create ns weka-operator-system

# Replace QUAY_SECRET_KEY with the actual value in both commands
kubectl create secret docker-registry QUAY_SECRET_KEY \
  --docker-server=quay.io \
  --docker-username=$QUAY_USERNAME \
  --docker-password=$QUAY_PASSWORD \
  --docker-email=$QUAY_USERNAME \
  --namespace=weka-operator-system

kubectl create secret docker-registry QUAY_SECRET_KEY \
  --docker-server=quay.io \
  --docker-username=$QUAY_USERNAME \
  --docker-password=$QUAY_PASSWORD \
  --docker-email=$QUAY_USERNAME \
  --namespace=default
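Since the two commands above differ only in the namespace, the repetition can be scripted. This sketch only prints the commands so they can be reviewed before piping the output to sh; it assumes the QUAY_* variables are exported as shown earlier:

```shell
# Print one "kubectl create secret" command per target namespace so the
# block above need not be repeated by hand. Review the output, then pipe
# it to sh. Uses QUAY_USERNAME, QUAY_PASSWORD, and QUAY_SECRET_KEY.
for ns in weka-operator-system default; do
  printf 'kubectl create secret docker-registry %s --docker-server=quay.io --docker-username=%s --docker-password=%s --docker-email=%s --namespace=%s\n' \
    "$QUAY_SECRET_KEY" "$QUAY_USERNAME" "$QUAY_PASSWORD" "$QUAY_USERNAME" "$ns"
done
```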

3. Install the WEKA Operator

  1. Apply WEKA Custom Resource Definitions (CRDs): Download and apply the WEKA Operator CRDs to define WEKA-specific resources in Kubernetes. Replace the version placeholder (WEKA_OPERATOR_VERSION) with the actual value.

helm pull oci://quay.io/weka.io/helm/weka-operator --untar --version <WEKA_OPERATOR_VERSION>
kubectl apply -f weka-operator/crds

  2. Install the WEKA Operator: Deploy the WEKA Operator to the Kubernetes cluster. Specify the namespace, image version, and pull secret to enable WEKA’s resources. Replace the version placeholder (WEKA_OPERATOR_VERSION) with the actual value.

helm upgrade --create-namespace \
    --install weka-operator oci://quay.io/weka.io/helm/weka-operator \
    --namespace weka-operator-system \
    --version <WEKA_OPERATOR_VERSION>

  3. Verify the installation: Run the following: kubectl -n weka-operator-system get pod
     The returned results should look similar to this:

NAME                                               READY  STATUS  RESTARTS   AGE
weka-operator-controller-manager-564bfd6b49-p6k7d   2/2   Running     0      13s

4. Set up driver distribution

Driver distribution applies to client and backend entities.

  1. Verify driver distribution prerequisites:

    1. Ensure a WEKA-compatible image (weka-in-container) is accessible through the registry and has the necessary credentials (imagePullSecret).

    2. Define node selection criteria, especially for the Driver Builder role, to match the kernel requirements of target nodes.

  2. Set up the driver distribution service and driver builder: Replace the container version tag (WEKA_IMAGE_VERSION_TAG) placeholders with the actual values:

apiVersion: weka.weka.io/v1alpha1
kind: WekaContainer
metadata:
  name: weka-drivers-dist
  namespace: default
  labels:
    app: weka-drivers-dist
spec:
  agentPort: 60001
  image: quay.io/weka.io/weka-in-container:<WEKA_IMAGE_VERSION_TAG> # Replace with the actual value
  imagePullSecret: "<QUAY_SECRET_KEY>" # Replace with the actual value
  mode: "drivers-dist"
  name: dist
  numCores: 1
  port: 60002
---
apiVersion: v1
kind: Service
metadata:
  name: weka-drivers-dist
  namespace: default
spec:
  type: ClusterIP
  ports:
    - name: weka-drivers-dist
      port: 60002
      targetPort: 60002
  selector:
    app: weka-drivers-dist
---
apiVersion: weka.weka.io/v1alpha1
kind: WekaContainer
metadata:
  name: weka-drivers-builder
  namespace: default
spec:
  agentPort: 60001
  image: quay.io/weka.io/weka-in-container:<WEKA_IMAGE_VERSION_TAG> # Replace with the actual value
  imagePullSecret: "<QUAY_SECRET_KEY>" # Replace with the actual value
  mode: "drivers-builder"
  name: dist 
  numCores: 1
  port: 60002

Ensure that nodeSelector or nodeAffinity aligns with the kernel requirements of the build nodes.

  3. Save the manifest above to weka-driver.yaml, and apply it: kubectl apply -f weka-driver.yaml

5. Discover drives for WEKA cluster provisioning

To provision drives for a WEKA cluster, each drive must go through a discovery process. This process ensures that all drives are correctly identified, accessible, and ready for use within the cluster.

The discovery process involves the following key actions:

  • Node updates during discovery

    • Each node is annotated with a list of known serial IDs for all drives accessible to the operator, providing a unique identifier for each drive.

    • An extended resource, weka.io/drives, is created to indicate the number of drives that are ready and available on each node.

  • Available drives

    • Only healthy, unblocked drives are marked as available. Drives that are manually flagged due to issues such as corruption or other unrecoverable errors are excluded from the available pool to ensure cluster stability.

Drive discovery steps

  1. Sign drives: Each drive receives a WEKA-specific signature, marking it as ready for discovery and integration into the cluster.

  2. Discover drives: The signed drives are detected and prepared for cluster operations. If drives already have the WEKA signature, only the discovery step is required to verify and track them in the cluster.

Drive discovery methods

The WEKA system supports two primary methods for drive discovery:

  • WekaManualOperation: A one-time operation that performs both drive signing and discovery, suitable for manual provisioning.

  • WekaPolicy: An automated, policy-driven approach that performs periodic discovery across all matching nodes. The WekaPolicy method operates on an event-driven model, initiating discovery immediately when relevant changes (such as node updates or drive additions) are detected.

Manual operations example:

The following operation signs specific drives:

apiVersion: weka.weka.io/v1alpha1
kind: WekaManualOperation
metadata:
  name: sign-specific-drives
  namespace: weka-operator-system
spec:
  action: "sign-drives"
  image: quay.io/weka.io/weka-in-container:WEKA_IMAGE_VERSION_TAG # Replace with the actual value
  imagePullSecret: "QUAY_SECRET_KEY" # Replace with the actual value
  payload:
    signDrivesPayload:
      type: device-paths
      nodeSelector:
        weka.io/supports-backends: "true"
      devicePaths:
        - /dev/nvme0n1
        - /dev/nvme1n1
        - /dev/nvme2n1
        - /dev/nvme3n1
        - /dev/nvme4n1
        - /dev/nvme5n1
        - /dev/nvme6n1
        - /dev/nvme7n1

Drive selection types:

  • all-not-root: Selects all block devices except the root device.

  • aws-all: AWS-specific, detects NVMe devices by AWS PCI identifiers.

  • device-paths: Lists specific device paths, as shown in the example. Each node presents its subset of this list.
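For the device-paths type, long path lists can be generated instead of typed by hand. This is a sketch for nodes whose drives follow the /dev/nvme<N>n1 naming used in the signing example above; adjust the count and naming to the actual hardware:

```shell
# Emit the YAML devicePaths entries for 8 NVMe namespaces
# (/dev/nvme0n1 .. /dev/nvme7n1), matching the example manifest's
# indentation. DRIVES is an illustrative count.
DRIVES=8
for i in $(seq 0 $((DRIVES - 1))); do
  printf '        - /dev/nvme%dn1\n' "$i"
done
```

The output can be pasted directly under devicePaths in the manifest.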

Drive discovery example:

The following example initiates a drive discovery operation:

apiVersion: weka.weka.io/v1alpha1
kind: WekaManualOperation
metadata:
  name: discover-drives
  namespace: weka-operator-system
spec:
  action: "discover-drives"
  image: quay.io/weka.io/weka-in-container:WEKA_IMAGE_VERSION_TAG # Replace with the actual value
  imagePullSecret: "QUAY_SECRET_KEY" # Replace with the actual value
  payload:
    discoverDrivesPayload:
      nodeSelector:
        weka.io/supports-backends: "true"

Key fields:

  • nodeSelector (payload): Limits the operation to specific nodes.

  • tolerations (spec): Supports Kubernetes tolerations for high-level objects like WekaCluster and WekaClient. Only tolerations are supported for WekaManualOperation, WekaContainer, and WekaPolicy.

6. Install the WekaCluster and WekaClient custom resources

This procedure provides step-by-step instructions for deploying the WekaCluster and WekaClient Custom Resources (CRs) in a Kubernetes cluster. Follow these procedures in sequence if both components are required. Begin with the WekaCluster CR, then create the necessary client secret, and finally deploy the WekaClient CR.

Step 1: Install the WekaCluster CR

To deploy a WEKA cluster backend using the WekaCluster CR, perform the following:

  1. Prerequisites:

    1. Ensure the driver distribution service is configured. This is the same service used by WEKA clients. See 4. Set up driver distribution.

    2. Use either the WekaManualOperation (recommended for initial deployments) or WekaPolicy to sign and discover drives. See 5. Discover drives for WEKA cluster provisioning.

  2. Create a manifest file (for example, weka-cluster.yaml) with the required configuration:

    apiVersion: weka.weka.io/v1alpha1
    kind: WekaCluster
    metadata:
      name: cluster-dev
      namespace: default
    spec:
      template: dynamic
      dynamicTemplate:
        computeContainers: 6
        driveContainers: 6
        numDrives: 1
      image: quay.io/weka.io/weka-in-container:WEKA_IMAGE_VERSION_TAG # Replace with actual image tag
      nodeSelector:
        weka.io/supports-backends: "true"
      driversDistService: "https://weka-drivers-dist.default.svc.cluster.local:60002"
      imagePullSecret: "QUAY_SECRET_KEY" # Replace with the actual secret
      network:
        udpMode: true
        ethDevice: br-ex
WekaCluster key parameters and configurations
  • template: Only dynamic is currently supported. Future templates will include capacity and performance.

  • dynamicTemplate: Configure dynamic settings for compute and drive containers within this template.

    dynamicTemplate:
      computeContainers: <number>
      driveContainers: <number>
      numDrives: <number>
  • image, imagePullSecret, driversDistService, nodeSelector, tolerations, rawTolerations, and network are configured similarly to the WekaClient CR.

  • roleNodeSelector: Defines scheduling by role (compute, drive, s3) through a map of node selectors.

  • WekaHome Configuration: Sets the WekaHome endpoint and certificate.

    wekaHome:
      endpoint: "https://custom-domain.lan:30443"
      cacertSecret: "weka-home-cacert"
  • ipv6: Enables IPv6 (default is false).

  • additionalMemory: Adds memory per role beyond default allocations.

  • ports: Override default port assignments if needed, such as for cluster migration.

  • operatorSecretRef and expandEndpoints: Parameters used exclusively for migration, supporting migration-by-healing from a non-K8s environment to K8s.

  • Hugepages Offsets: Specifies offsets for hugepage allocations for drives, compute, and S3 (for example, driveHugepagesOffset).

  3. Apply the WekaCluster CR:

kubectl apply -f weka-cluster.yaml

Step 2: Create WEKA cluster client secret

Before deploying a WekaClient CR, create a Kubernetes Secret with credentials required to join the WEKA cluster.

  1. Prepare the secret YAML: Create a file named secret.yaml:

apiVersion: v1
kind: Secret
metadata:
  name: weka-cluster-dev  # The wekaSecretRef in the WekaClient CR must match this secret name
  namespace: weka-operator-system
type: Opaque
data:
  org: <base64-encoded-org>
  join-secret: <base64-encoded-join-secret>
  password: <base64-encoded-password>
  username: <base64-encoded-username>

Replace all placeholders with the base64-encoded values provided by your WEKA backend.
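The base64 values themselves can be produced with the standard base64 tool. The strings below are illustrative placeholders, not real credentials; printf is used so that no trailing newline gets encoded:

```shell
# Produce base64-encoded values for the Secret's data fields.
# 'Root' and 'clusteradmin' are illustrative placeholder values only.
ORG_B64=$(printf '%s' 'Root' | base64)
USERNAME_B64=$(printf '%s' 'clusteradmin' | base64)

echo "org: $ORG_B64"
echo "username: $USERNAME_B64"
```

Repeat the same pattern for the join-secret and password fields.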

  2. Apply the secret:

kubectl apply -f secret.yaml

Step 3: Install the WekaClient CR

The WekaClient CR deploys WekaContainers across designated Kubernetes nodes, similar to a DaemonSet but without automatic pod cleanup.

WekaClient specification (reference): Key configurable fields in the WekaClientSpec:

type WekaClientSpec struct {
    Image               string            `json:"image"`                   // Image to be used for WekaContainer
    ImagePullSecret     string            `json:"imagePullSecret,omitempty"` // Secret for pulling the image
    Port                int               `json:"port,omitempty"`           // If unset (0), WEKA selects a free port from PortRange
    AgentPort           int               `json:"agentPort,omitempty"`      // If unset (0), WEKA selects a free port from PortRange
    PortRange           *PortRange        `json:"portRange,omitempty"`      // Used for dynamic port allocation
    NodeSelector        map[string]string `json:"nodeSelector,omitempty"`   // Specifies nodes for deployment
    WekaSecretRef       string            `json:"wekaSecretRef,omitempty"`  // Reference to Weka secret
    NetworkSelector     NetworkSelector   `json:"network,omitempty"`        // Defines network configuration
    DriversDistService  string            `json:"driversDistService,omitempty"` // URL for driver distribution service
    DriversLoaderImage  string            `json:"driversLoaderImage,omitempty"` // Image for drivers loader
    JoinIps             []string          `json:"joinIpPorts,omitempty"`    // IPs to join for cluster setup
    TargetCluster       ObjectReference   `json:"targetCluster,omitempty"`  // Reference to target cluster
    CpuPolicy           CpuPolicy         `json:"cpuPolicy,omitempty"`      // CPU policy, e.g., "auto," "shared," "dedicated," etc.
    CoresNumber         int               `json:"coresNum,omitempty"`       // Number of cores to use
    CoreIds             []int             `json:"coreIds,omitempty"`        // Specific core IDs to use
    TracesConfiguration *TracesConfiguration `json:"tracesConfiguration,omitempty"` // Trace settings
    Tolerations         []string          `json:"tolerations,omitempty"`    // Tolerations for nodes
    RawTolerations      []v1.Toleration   `json:"rawTolerations,omitempty"` // Detailed toleration settings
    AdditionalMemory    int               `json:"additionalMemory,omitempty"` // Additional memory allocation
    WekaHomeConfig      WekahomeClientConfig  `json:"wekaHomeConfig,omitempty"` // Deprecated field
    WekaHome            *WekahomeClientConfig `json:"wekaHome,omitempty"`       // Weka Home configuration
    UpgradePolicy       UpgradePolicy     `json:"upgradePolicy,omitempty"`   // Policy for handling upgrades
}
WekaClient key parameters and configurations
  • image: Specifies the image to use for the container.

  • imagePullSecret: Defines the secret to use for pulling the image, which is propagated into the pod.

  • port and agentPort:

    • agentPort: A single port used by the agent.

    • port: Represents a range of 100 ports. This range may be reduced in the future, as it is not fully utilized by clients and is shared on the WEKA side.

  • portRange: Instead of specifying individual ports, a range can be defined. The operator automatically finds an available port instead of using the same one across all servers.

    portRange:
      basePort: 45000
  • nodeSelector: Selects the node where the WekaContainer will be scheduled.

  • network: Defines the network device for WEKA to use. By default, WEKA runs in UDP mode if no network device is specified. If using an Ethernet device, specify the device name (for example, mlnx0).

    network:
      ethDevice: mlnx0
  • driversDistService: A reference to the distribution service for drivers.

  • joinIpPorts: Used when the WEKA cluster and WEKA clients are not in the same Kubernetes cluster.

    joinIpPorts: ["10.0.1.168:16101"]
  • targetCluster: Used when the WEKA cluster and WEKA clients are in the same Kubernetes cluster.

    targetCluster:
      name: cluster-dev
      namespace: default
  • coresNum: Specifies the number of full cores to use for each WekaContainer.

  • cpuPolicy: Default value is auto, which automatically detects whether nodes are running with hyperthreading and allocates cores accordingly.

    • Example: 2 WEKA cores = 2 full cores, reserving 5 hyperthreads for a pod.

    • coreIds: Used in combination with cpuPolicy: manual for manual core allocation. Note: Unless advised by WEKA support, avoid using any policy other than auto.
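
    For example, a manual allocation sketch (the core IDs are illustrative and node-specific; use manual only when advised by WEKA support):

    cpuPolicy: manual
    coreIds: [2, 3]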

  • tracesConfiguration: Configures trace capacity allocations.

  • tolerations and rawTolerations:

    • tolerations: A list of strings that expand to NoSchedule and NoExecute tolerations for existing keys.

    • rawTolerations: A list of Kubernetes toleration objects.

    tolerations:
      - simple-toleration
      - another-one
    rawTolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "weka-cluster"
        effect: "NoSchedule"
  • additionalMemory: Specifies additional memory in megabytes for cases when default memory allocation is insufficient. Note: Default memory allocations are typically set for 90%+ utilization.
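
    For example, to grant a hypothetical extra 2 GiB to each WekaContainer:

    additionalMemory: 2048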

  • wekaHome: Configures the Weka home directory to use. Defaults to the Weka cloud home. The primary configuration of Weka home is in the WekaCluster CR, but WekaClient can also specify a cacert for the client. This certificate is placed on client pods to connect to Weka Home.

    wekaHome:
      cacertSecret: "weka-home-cacert"
  • upgradePolicy: Defines how the WekaContainers are upgraded.

    • rolling (default): WekaContainers are updated one by one.

    • manual: WekaContainers are set to a new version, but the pod is not deleted until manually triggered. This gives the user control over when to update.

    • all-at-once: All WekaContainers are upgraded simultaneously after the image is changed.
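
    For example, assuming the policy is set as a simple value, to defer pod replacement until manually triggered:

    upgradePolicy: manual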

  • gracefulDestroyDuration: Specifies the duration for which the cluster remains in a paused state, keeping local data and drive allocations while deleting all pods.

    • Default: 24 hours.

    • Note: In case of accidental cluster deletion, override this duration with a larger value and contact Weka support for recovery procedures. This is a safety measure, not a pause/unpause feature.

    To override the graceful destroy duration:

kubectl patch WekaCluster cluster-dev -n weka-operator-system --type='merge' -p='{"status":{"overrideGracefulDestroyDuration": "10000h"}}' --subresource=status

To release the cluster (allow full deletion):

kubectl patch WekaCluster cluster-dev -n weka-operator-system --type='merge' -p='{"status":{"overrideGracefulDestroyDuration": "0"}}' --subresource=status

Example: Connecting to internal WEKA cluster

apiVersion: weka.weka.io/v1alpha1
kind: WekaClient
metadata:
  name: cluster-dev-clients
spec:
  image: quay.io/weka.io/weka-in-container:WEKA_IMAGE_PLACEHOLDER
  imagePullSecret: "QUAY_SECRET_KEY" # Replace with the actual value
  driversDistService: "https://weka-drivers-dist.weka-operator-system.svc.cluster.local:60002"
  portRange:
    basePort: 46000
  nodeSelector:
    weka.io/supports-clients: "true"
  wekaSecretRef: weka-cluster-dev # Must match secret name created using secret yaml 
  targetCluster:
    name: cluster-dev
    namespace: default
  network:
    ethDevice: mlnx0

Example: Connecting to external WEKA cluster

apiVersion: weka.weka.io/v1alpha1
kind: WekaClient
metadata:
  name: cluster-dev-clients
spec:
  image: quay.io/weka.io/weka-in-container:WEKA_IMAGE_PLACEHOLDER
  imagePullSecret: "QUAY_SECRET_KEY" # Replace with the actual value
  driversDistService: "https://weka-drivers-dist.weka-operator-system.svc.cluster.local:60002"
  portRange:
    basePort: 46000
  nodeSelector:
    weka.io/supports-clients: "true"
  wekaSecretRef: weka-cluster-dev # Must match secret name created using secret yaml 
  joinIpPorts: ["10.0.2.137:16101"] # Replace with backend or LB IP:port
  network:
    ethDevice: mlnx0

Apply the manifest:

kubectl apply -f weka-client.yaml

Step 4: Next steps

After deploying the WekaCluster and WekaClient CRs:

  • Monitor their status using kubectl get wekaClusters and kubectl get wekaClients.
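
  For example, to list both CR types across all namespaces:

kubectl get wekaclusters -A
kubectl get wekaclients -A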

Upgrade the WEKA Operator

Upgrading the WEKA Operator involves updating the Operator and managing wekaClient configurations to ensure all client pods operate on the latest version. Additionally, each WEKA version requires a new builder instance with a unique wekaContainer metadata name, ensuring compatibility and streamlined management of version-specific resources.

Procedure:

  1. Configure upgrade policies for wekaClient: The upgradePolicy parameter in the wekaClient Custom Resource (CR) specification controls how client pods are updated when the WEKA version changes. Options include:

    • rolling: The operator automatically updates each client pod sequentially, replacing one pod at a time to maintain availability.

    • manual: No automatic pod replacements are performed by the operator. Manual deletion of each client pod is required, after which the pod restarts with the updated version. Use kubectl delete pod <pod-name> to delete each pod manually.

    • all-at-once: The operator updates all client pods simultaneously, applying the new version cluster-wide in a single step.

    To apply the upgrade, update the weka-in-container version by:

    • Directly editing the version with kubectl edit on the wekaClient CR.

    • Modifying the client configuration manifest, then reapplying it with kubectl apply -f <manifest-file>.
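
    For example, the relevant manifest change is the image tag (the tag shown is a placeholder):

    spec:
      image: quay.io/weka.io/weka-in-container:NEW_WEKA_IMAGE_TAG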

  2. Create a new builder instance for each WEKA version: Rather than updating existing builder instances, create a new builder instance for each WEKA kernel version. Each builder must have a unique wekaContainer metadata name to support version-specific compatibility.

    • Create a new builder: For each WEKA version, create a new builder instance with an updated wekaContainer meta name that corresponds to the new version. This ensures that clients and resources linked to specific kernel versions can continue to operate without conflicts.

    • Cleanup outdated builders: Once the upgrade is validated and previous versions are no longer needed, you can delete outdated builder instances associated with those older versions. This cleanup step optimizes resources but allows you to maintain multiple builder instances if supporting different kernel versions is required.

Best practices

Preloading images

To optimize runtime and minimize delays, preload images during the preparation phase; this significantly reduces waiting time in subsequent steps. Without preloading, some servers may sit idle while images download, leading to further delays when all servers advance to the next step.

Sample DaemonSet configuration for preloading images:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: weka-preload
  namespace: default
spec:
  selector:
    matchLabels:
      app: weka-preload
  template:
    metadata:
      labels:
        app: weka-preload
    spec:
      imagePullSecrets:
        - name: quay-secret # Replace with the actual secret name
      nodeSelector:
        weka.io/supports-backends: "true"
      tolerations:
        - key: "key1"
          operator: "Equal"
          value: "value1"
          effect: "NoSchedule"
        - key: "key2"
          operator: "Exists"
          effect: "NoExecute"
      containers:
        - name: weka-preload
          image: quay.io/weka.io/weka-in-container:WEKA_IMAGE_VERSION_TAG # Replace with the actual value
          command: ["sleep", "infinity"]
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "256Mi"

Display custom fields

WEKA Custom Resources enable enhanced observability by marking certain display fields. While kubectl get displays only a limited set of fields by default, using the -o wide option or exploring through k9s allows you to view all fields.

Example command to quickly assess WekaContainer status:

kubectl get wekacontainer -o wide --all-namespaces

Example output:

NAMESPACE              NAME                                                       STATUS          MODE              AGE     DRIVES COUNT   WEKA CID
weka-operator-system   cluster-dev-clients-34.242.2.16                            Running         client            64s
weka-operator-system   cluster-dev-clients-52.51.10.75                            Running         client            64s                    12
weka-operator-system   cluster-dev-compute-16fd029f-8aad-487c-be32-c74d70350f69   Running         compute           6m49s                  9
weka-operator-system   cluster-dev-compute-33f54d4b-302d-4d85-9765-f6d9a7a31d02   Running         compute           6m50s                  8

... (additional rows)

weka-operator-system   weka-dsc-34.242.2.16                                       PodNotRunning   discovery         64s

This view provides a quick status overview, showing progress and resource allocation at a glance.

Troubleshooting

This section provides guidance for resolving common deployment issues with WEKA Operator.

Pod stuck in pending state

Describe the pod to identify the scheduling issue (using Kubernetes native reporting).
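
For example (the pod name is a placeholder):

kubectl describe pod <pod-name> -n weka-operator-system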

If the pod is blocked on weka.io/drives, it indicates that the operator was unable to allocate the required drives for the corresponding WekaContainer. This issue may occur if the user has requested more drives than are available on the node or if there are too many driveContainers already running.

Ensure the drives are signed and that the number of drives matches the number requested in the WekaCluster spec.

Pod in “wekafsio driver not found” loop

Check the pod logs for this message and follow the indicated remediation steps.

CSI not functioning

Ensure the nodeSelector configurations on both the CSI installation and the WekaClient match.

Appendix: Kubernetes Glossary

Kubernetes Glossary

Learning Kubernetes is outside the scope of this document. This glossary covers essential Kubernetes components and concepts to support understanding of the environment. It is provided for convenience only and does not replace the requirement for Kubernetes knowledge and experience.

Pod

A Pod is the smallest, most basic deployable unit in Kubernetes. It represents a single instance of a running process in a cluster, typically containing one or more containers that share storage, network, and a single IP address. Pods are usually ephemeral; when they fail, a new Pod is created to replace them.

Node

A Node is a physical or virtual machine that serves as a worker in a Kubernetes cluster, running Pods and providing the necessary compute resources. Each Node is managed by the Kubernetes control plane and runs components like kubelet, kube-proxy, and a container runtime.

Namespace

A Namespace is a Kubernetes resource that divides a cluster into virtual sub-clusters, allowing for isolated environments within a single physical cluster. Namespaces help organize resources, manage permissions, and enable resource quotas within a cluster.

Label

Labels are key-value pairs attached to Kubernetes objects, like Pods and Nodes, used for identification and grouping. Labels facilitate organizing, selecting, and operating on resources, such as scheduling workloads based on specific node labels.

Taint

Taints are properties applied to Nodes to restrict the scheduling of Pods. A taint on a Node prevents Pods without a matching toleration from being scheduled there. Taints often prevent certain workloads from running on specific Nodes unless explicitly permitted.

Toleration

A Toleration is a property of Pods that enables them to be scheduled on Nodes with matching taints. Tolerations work with taints to control which workloads can run on specific Nodes in the cluster.

Affinity and Anti-Affinity

Affinity rules allow administrators to specify the Nodes on which a given Pod should run, or which other Pods it should run near. Anti-affinity rules define the opposite: which Pods should not be scheduled near each other. These rules help with optimal resource allocation and reliability.

Selector

Selectors are expressions that enable filtering and selecting specific resources within the Kubernetes API. Node selectors, for example, specify the Nodes on which a Pod can run by matching their labels.

Deployment

A Deployment is a higher-level object for managing and scaling applications in Kubernetes. It defines the desired state for Pods and ensures they are created, updated, and scaled to maintain that state.

DaemonSet

A DaemonSet ensures that a specific Pod runs on all (or some) Nodes in the cluster, often used for tasks like logging, monitoring, or networking, where each Node requires the same component.

ReplicaSet

A ReplicaSet ensures a specified number of replicas of a Pod are running at any given time, allowing for redundancy and high availability. It is often managed by a Deployment, which abstracts the ReplicaSet management.

Service

A Service is an abstraction that defines a logical set of Pods and provides a stable network endpoint for access. It enables reliable communication between different Pods or external services, regardless of the individual Pods’ IP addresses.

ConfigMap

A ConfigMap is a Kubernetes resource used to store application configuration data. It separates configuration from application code, enabling easy updates without redeploying the entire application.

Secret

A Secret is a Kubernetes object used to store sensitive information, such as passwords, tokens, or keys. Like ConfigMaps, secrets are designed for confidential data, and Kubernetes provides mechanisms for securely managing and accessing them.

Persistent Volume (PV)

A Persistent Volume is a storage resource in Kubernetes that exists independently of any particular Pod. PVs provide long-term storage that persists beyond the lifecycle of individual Pods.

Persistent Volume Claim (PVC)

A Persistent Volume Claim is a request for storage made by a Pod. PVCs allow Pods to use persistent storage resources, which are dynamically or statically provisioned in the cluster.

Ingress

Ingress is a Kubernetes resource that manages external access to services within a cluster, typically via HTTP/HTTPS. Ingress enables load balancing, SSL termination, and routing to various services based on the request path.

Container Runtime

The container runtime is the underlying software that runs containers on a Node. Kubernetes supports multiple container runtimes, such as Docker, containerd, and CRI-O.

Operator

An Operator is a method of packaging, deploying, and managing a Kubernetes application or service. It often provides automated management and monitoring for complex applications in Kubernetes clusters.

Proceed to install the WEKA CSI Plugin if persistent storage access is required. See the WEKA CSI Plugin topic for setup instructions.

To upgrade the WEKA Operator, follow the steps in Install the WEKA Operator using the latest version. Re-running the installation process with the updated version upgrades the WEKA Operator without requiring additional setup.

If there is an image pull failure, verify your imagePullSecret. Each customer must have a unique robot secret for quay.io.
