WEKA Operator day-2 operations

Manage hardware, scale clusters, and optimize resources to ensure system stability and performance.

WEKA Operator day-2 operations maintain and optimize WEKA environments in Kubernetes clusters by focusing on these core areas:

  • Observability and monitoring

    • Scraping and visualizing metrics with standard Kubernetes monitoring tools such as Prometheus and Grafana.

  • Hardware maintenance

    • Component replacement and node management

    • Hardware failure remediation

  • Cluster scaling

    • Resource allocation optimization

    • Controlled cluster expansion

  • Cluster maintenance

    • Configuration updates

    • Pod rotation

    • Token secret management

  • WekaContainer lifecycle management

Administrators execute both planned maintenance and emergency responses while following standardized procedures to ensure high availability and minimize service disruption.


Observability and monitoring

Starting with version v1.7.0, the WEKA Operator exposes health and performance metrics for WEKA clusters, including throughput, CPU utilization, IOPS, and API requests. These metrics are available by default and can be collected and visualized using standard Kubernetes monitoring tools such as Prometheus and Grafana. No additional installation flags or custom Prometheus configurations are required.
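For example, if you already run the Prometheus Operator, a ServiceMonitor along the following lines could scrape these metrics; the namespace, label selector, and port name are illustrative assumptions, so match them to the Services your deployment actually creates:

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: weka-metrics                  # illustrative name
      namespace: weka-operator-system
    spec:
      selector:
        matchLabels:
          app: weka-operator              # assumption: use the labels on your metrics Service
      endpoints:
        - port: metrics                   # assumption: use the port name the Service exposes
          interval: 30s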

Related topic

Monitor WEKA clusters in Kubernetes with Prometheus and Grafana

Hardware maintenance

Hardware maintenance operations ensure cluster reliability and performance through systematic component management and failure response procedures. These operations span from routine preventive maintenance to critical component replacements.

Key operations:

  • Node management

    • Graceful and forced node reboots.

    • Node replacement and removal.

    • Complete rack decommissioning procedures.

  • Container operations

    • Container migration from failed nodes.

    • Container replacement on active nodes.

    • Container management on denylisted nodes.

  • Storage management

    • Drive replacement in converged setups.

    • Storage integrity verification.

    • Component failure recovery.

Each procedure follows established protocols to maintain system stability and minimize service disruption during maintenance activities. The documented procedures enable administrators to execute both planned maintenance and emergency responses while preserving data integrity and system performance.

Before you begin

Before performing any hardware maintenance or replacement tasks, ensure you have:

  • Administrative access to your Kubernetes cluster.

  • SSH access to the cluster nodes.

  • kubectl command-line tool installed and configured.

  • Proper backup of any critical data on the affected components.

  • Required replacement hardware (if applicable).

  • Maintenance window scheduled (if required).


Perform standard verification steps

This procedure describes the standard verification steps for checking WEKA cluster health. Multiple procedures in this documentation refer to these verification steps to confirm successful completion of their respective tasks.

Procedure

  1. Log in to the wekacontainer:

  2. Check the WEKA cluster status:

Example
  3. Check cluster containers.

Example
  4. Check the WEKA filesystem status.

Example
  5. Verify the status of the WEKA cluster processes is UP.

Example
  6. Check all pods are up and running.

Example
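A consolidated sketch of these checks, assuming a backend WEKA container pod named weka-cluster-drive-0 in the weka-operator-system namespace (substitute your own pod and namespace names):

    # 1. Log in to a WEKA container
    kubectl exec -it -n weka-operator-system weka-cluster-drive-0 -- bash

    # 2. Check the WEKA cluster status
    weka status

    # 3. Check cluster containers
    weka cluster container

    # 4. Check the filesystem status
    weka fs

    # 5. Verify the cluster processes are UP
    weka cluster process

    # 6. Check all pods (run from a workstation with kubectl access)
    kubectl get pods -A -o wide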

Force reboot a machine

A force reboot may be necessary when a machine becomes unresponsive or encounters a critical error that cannot be resolved through standard troubleshooting. This task ensures the machine restarts and resumes normal operation.

Procedure

Phase 1: Perform standard verification steps.

Phase 2: Cordon and evict backend k8s nodes.

To cordon and evict a node, run the following commands. Replace <k8s_node_IP> with the target k8s node's IP address.

  1. Cordon the backend k8s node:

Example:

  2. Evict the running pods, ensuring local data is removed. For example, drain the backend k8s node:

Example
  3. Validate node status:

Example
  4. Verify pod statuses across namespaces:

Example
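For instance (the node IP is a placeholder):

    # Cordon the backend k8s node
    kubectl cordon <k8s_node_IP>

    # Drain it, ignoring DaemonSet pods and removing local emptyDir data
    kubectl drain <k8s_node_IP> --ignore-daemonsets --delete-emptydir-data

    # Validate node status (expect SchedulingDisabled)
    kubectl get node <k8s_node_IP>

    # Verify pod statuses across namespaces
    kubectl get pods -A -o wide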

Phase 3: Ensure the WEKA containers are marked as drained.

  1. List the cluster backend containers. Run the following command to display the current status of all WEKA containers on the k8s nodes:

  2. Check the status of the WEKA containers. In the command output, locate the STATUS column for the relevant containers. Verify that it displays DRAINED for the host and backend container.

Example

Phase 4: Force a reboot on all backend k8s nodes. Use the reboot -f command to force a reboot on each backend k8s node.

Example:
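For example, assuming SSH access as root to the node:

    ssh root@<k8s_node_IP> 'reboot -f'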

After running this command, the machine restarts immediately. Repeat for all backend k8s nodes, one by one, in your environment.

Phase 5: Uncordon the backend k8s node and verify WEKA cluster status.

  1. Uncordon the backend k8s node:

Example:

  2. Access the WEKA Operator in the backend k8s node:

  3. Verify the weka drives status:

Example
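A short sketch, assuming a backend pod named weka-cluster-drive-0 in the weka-operator-system namespace (adjust to your environment):

    # Uncordon the backend k8s node
    kubectl uncordon <k8s_node_IP>

    # Open a shell in a WEKA container on that node
    kubectl exec -it -n weka-operator-system weka-cluster-drive-0 -- bash

    # Verify the drives are ACTIVE
    weka cluster drive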

Ensure all pods and WEKA containers are up, the cluster is in a healthy state (Fully Protected), and IO operations are running (STARTED). Monitor the redistribution progress and alerts.

Phase 6: Cordon and drain all client k8s nodes.

To cordon and drain a node, run the following commands. Replace <k8s_node_IP> with the target k8s node's IP address.

  1. Cordon the client k8s node to mark it as unschedulable:

Example:

  2. Evict the workload. For example, drain the client k8s node to evict running pods, ensuring local data is removed:

Example
  3. Force reboot all client nodes. Example for one client k8s node:

  4. After the client k8s nodes are up, uncordon the client k8s node. Example for one client k8s node:

Example:

  5. Verify that after uncordoning all client Kubernetes nodes:

  • All regular pods remain scheduled and running on those nodes.

  • All client containers within the cluster are joined and operational.

  • Only pods designated for data I/O operations are evicted.

Example

Phase 7: Force a reboot on all client k8s nodes. Use the reboot -f command to force a reboot on each client k8s node.

Example for one client k8s node:

After running this command, the client node restarts immediately. Repeat for all client nodes in your environment.

Phase 8: Uncordon all client k8s nodes.

  1. Once the client k8s nodes are back online, uncordon them to restore their availability for scheduling workloads. Example command for uncordoning a single client k8s node:

Example
  2. Verify pod status across all k8s nodes to confirm that all pods are running as expected:

  3. Validate WEKA cluster status to ensure all containers are operational:

See examples in Perform standard verification steps.


Remove a rack or Kubernetes node

Removing a rack or Kubernetes (k8s) node is necessary when you need to decommission hardware, replace failed components, or reconfigure your cluster. This procedure guides you through safely removing nodes without disrupting your system operations.

Procedure

  1. Create failure domain labels for your nodes:

    1. Label nodes with two machines per failure domain:

    2. Label nodes with one machine per failure domain:

  2. Apply the NoSchedule taint to nodes in failure domains:

  3. Remove WEKA labels from the untainted node:

  4. Configure the WekaCluster:

    1. Create a configuration file named cluster.yaml with the following content:

    2. Apply the configuration:

  5. Verify failure domain configuration:

    1. Check container distribution across failure domains using the WEKA cluster container.

    2. Test failure domain behavior by draining nodes that share the same failure domain (FD):

    3. Reboot the drained nodes:

    4. Monitor workload redistribution: check that workloads are redistributed to other failure domains while the nodes in one FD are down (see the command sketch below).
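A command sketch covering steps 1 to 4; the failure-domain and taint keys shown here (weka.io/failure-domain, weka.io/maintenance) are hypothetical, so replace them with the keys your cluster actually uses:

    # Step 1: label nodes with their failure domain
    kubectl label node <k8s_node_IP> weka.io/failure-domain=fd-1

    # Step 2: taint the nodes in the failure domains being removed
    kubectl taint nodes <k8s_node_IP> weka.io/maintenance=true:NoSchedule

    # Step 3: remove the WEKA backend label from the untainted node
    kubectl label node <k8s_node_IP> weka.io/supports-backends-

    # Step 4: apply the updated WekaCluster configuration
    kubectl apply -f cluster.yaml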

Expected results

After completing this procedure:

  • Your nodes are properly configured with failure domains.

  • Workloads are distributed according to the failure domain configuration.

  • The system is ready for node removal with minimal disruption.

Troubleshooting

If workloads do not redistribute as expected after node drain:

  1. Check node labels and taints.

  2. Verify the WekaCluster configuration.

  3. Review the Kubernetes scheduler logs for any errors.


Perform a graceful node reboot on client nodes

A graceful node reboot ensures minimal service disruption when you need to restart a node for maintenance, updates, or configuration changes. The procedure involves cordoning the node, draining workloads, performing the reboot, and then returning the node to service.

Procedure

  1. Cordon the Kubernetes node to prevent new workloads from being scheduled:

  2. Drain the node to safely evict all pods:

The system displays warnings about DaemonSet-managed pods being ignored. This is expected behavior.

  3. Verify the node status shows as SchedulingDisabled:

  4. Reboot the target node:

  5. Wait for the node to complete its reboot cycle and return to a Ready state:

  6. Uncordon the node to allow new workloads to be scheduled:

  7. Verify that pods are running correctly on the node:

See examples in Perform standard verification steps; a consolidated command sketch of this procedure follows.
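A consolidated sketch of the procedure (the node IP is a placeholder):

    kubectl cordon <k8s_node_IP>
    kubectl drain <k8s_node_IP> --ignore-daemonsets --delete-emptydir-data
    kubectl get node <k8s_node_IP>            # expect SchedulingDisabled
    ssh root@<k8s_node_IP> 'reboot'           # graceful reboot
    kubectl get node <k8s_node_IP> -w         # wait for Ready
    kubectl uncordon <k8s_node_IP>
    kubectl get pods -A -o wide | grep <k8s_node_IP>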

Expected results

After completing this procedure:

  • The node has completed a clean reboot cycle.

  • All pods are rescheduled and running.

  • The node is available for new workload scheduling.

Troubleshooting

If pods fail to start after the reboot:

  1. Check pod status and events using kubectl describe pod <pod-name>.

  2. Review node conditions using kubectl describe node <node-ip>.

  3. Examine system logs for any errors or warnings.


Replace a drive in a converged setup

Drive replacement is necessary when hardware failures occur or system upgrades are required. Following this procedure ensures minimal system disruption while maintaining data integrity.

Before you begin

  • Ensure a replacement drive is ready for installation.

  • Identify the node and drive that needs replacement.

  • Ensure you have the necessary permissions to execute Kubernetes commands.

  • Back up any critical data if necessary.

Procedure

  1. List and record drive information:

    1. List the available drives on the target node:

    2. Identify the serial ID of the drives:

    3. Record the current drive configuration:

    4. Save the serial ID of the drive being replaced for later use.

Example
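For example, assuming SSH access to the target node for the device view and a shell inside a WEKA container for the cluster view:

    # On the target node: list block devices with their serial numbers
    ssh root@<k8s_node_IP> 'lsblk -o NAME,SIZE,SERIAL,MODEL'

    # Inside a WEKA container: record the current drive configuration
    weka cluster drive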
  2. Remove node label: Remove the WEKA backend support label from the target node:

Example
  3. Delete drive container: Delete the WEKA container object associated with the drive. Then, verify that the container pod enters a pending state and the drive is removed from the cluster.

Example
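A sketch of steps 2 and 3, assuming the backend label weka.io/supports-backends and a drive container named weka-cluster-drive-3 in the weka-operator-system namespace (substitute your own names):

    # Step 2: remove the WEKA backend support label from the node
    kubectl label node <k8s_node_IP> weka.io/supports-backends-

    # Step 3: delete the WekaContainer object associated with the drive
    kubectl delete wekacontainer weka-cluster-drive-3 -n weka-operator-system

    # Watch the replacement pod stay in Pending state
    kubectl get pods -n weka-operator-system -w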
  4. Sign the new drive:

    1. Create a YAML configuration file for drive signing:

    2. Apply the configuration:

Example
  5. Block the old drive:

    1. Create a YAML configuration file for blocking the old drive:

    2. Apply the configuration:

Example
  6. Restore node label: Re-add the WEKA backend support label to the node:

Example
  7. Verify the replacement:

    1. Check the cluster drive status:

    2. Verify that:

      • The new drive appears in the cluster.

      • The drive status is ACTIVE.

      • The serial ID matches the replacement drive.

Troubleshooting

  • If the container pod remains in a pending state, check the pod events and logs.

  • If drive signing fails, verify the device path and node selector.

  • If the old drive remains visible, ensure the block operation completed successfully.

  • Maintain system stability by replacing one drive at a time.

  • Keep track of all serial IDs involved in the replacement process.

  • Monitor system health throughout the procedure.


Replace a Kubernetes node

This procedure enables systematic node replacement while maintaining cluster functionality and minimizing service interruption, addressing performance issues, hardware failures, or routine maintenance needs.

Prerequisites

  • Identification of the node to be replaced.

  • A new node prepared for integration into the cluster.

Procedure

  1. Remove node deployment label: Remove the existing label used to deploy the cluster from the node:

Example
  2. List existing WEKA containers to identify containers on the node:

Example
  3. Delete the compute and drive containers specific to the node:

Example
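For instance, assuming compute and drive containers named weka-cluster-compute-5 and weka-cluster-drive-5 in the weka-operator-system namespace:

    # List WekaContainers and identify the ones on the old node
    kubectl get wekacontainers -n weka-operator-system -o wide

    # Delete the compute and drive containers that belong to that node
    kubectl delete wekacontainer weka-cluster-compute-5 weka-cluster-drive-5 -n weka-operator-system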
  4. Verify container deletion:

    1. Verify containers are in PodNotRunning status.

    2. Confirm no containers are running on the old node.

      Look for:

      • STATUS column showing PodNotRunning.

      • No containers associated with the old node.

Example
  5. Add backend label to new node: Label the new node to support backends:

Example
  6. Sign drives on new node:

    1. Create a WekaManualOperation configuration to sign drives:

    2. Apply the configuration:

Example
  7. Verification steps:

    1. Verify WEKA containers are rescheduled.

    2. Check that new containers are running on the new node's IP.

    3. Validate cluster status using WEKA CLI.

    For details, see WEKA Operator day-2 operations.

Non-functional node replacement: When a node becomes unresponsive or faulty, delete the non-functional node: kubectl delete node <node-name>

Kubernetes automatically handles the following:

  • Detects node failure.

  • Removes affected containers.

  • Reschedules containers to available nodes.

Troubleshooting

If containers fail to reschedule, check:

  • Node labels

  • Drive signing process

  • Cluster resource availability

  • Network connectivity


Remove WEKA container from a failed node

Removing a WEKA container from a failed node is necessary to maintain cluster health and prevent any negative impact on system performance. This procedure ensures that the container is removed safely and the cluster remains operational.

Procedure: Remove WEKA container from an active node

To remove a WEKA container when the node is responsive, run the following:
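For example, assuming a container named weka-cluster-drive-2 in the weka-operator-system namespace:

    kubectl delete wekacontainer weka-cluster-drive-2 -n weka-operator-system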

Procedure: Remove WEKA container from a failed node (unresponsive)

  1. Apply the configuration:

  2. If the resign drives operation fails with the error "container node is not ready, cannot perform resign drives operation", set the skip flag:

  3. Wait for the pod to enter the Terminating state.

If the failed node is removed from the Kubernetes cluster, the WEKA container and corresponding stuck pod are automatically removed.

Resign drives manually

If you need to manually resign specific drives, create and apply the following YAML configuration:

Example: wekacontainer conditions added on deletion

Verification

You can verify the removal process by checking the WEKA container conditions. A successful removal shows the following conditions in order:

  1. ContainerDrivesDeactivated

  2. ContainerDeactivated

  3. ContainerDrivesRemoved

  4. ContainerRemoved

  5. ContainerDrivesResigned


Replace a container on an active node

Replacing a container on an active node allows for system upgrades or failure recovery without shutting down services. This procedure ensures that the replacement is performed smoothly, keeping the cluster operational while the container is swapped out.

Procedure

Phase 1: Delete the existing container

  1. Identify the container to be replaced:

  2. Delete the selected container:

Example

Phase 2: Monitor deactivation process

  1. Verify that the container and its drives are being deactivated:

    Expected status: The container shows DRAINED (DOWN) under the STATUS column.

  2. Check the process status:

    Expected status: The processes associated with the container show DOWN status.

  3. For drive containers, verify drive status:

    Look for:

    • Drive status changes from ACTIVE to FAILED for the affected container.

    • All other drives remain ACTIVE.

Example

Phase 3: Monitor container recreation

  1. Watch for the new container creation:

  2. Verify the new container's integration with the cluster:

    Expected result: A new container appears with UP status.

  3. Verify the new container's running status:

    Expected status: Running.

  4. Confirm the container's integration with the WEKA cluster:

    Expected status: UP.

  5. For drive containers, verify drive activity:

    Expected status: All drives display ACTIVE status.

See examples in Perform standard verification steps.

Troubleshooting

If the container remains in Terminating state:

  1. Check the container events:

  2. Review the operator logs for error messages.

  3. Verify resource availability for the new container.

For failed container starts, check:

  • Node resource availability

  • Network connectivity

  • Service status


Replace a container on a denylisted node

Replacing a container on a denylisted node is necessary when the node is flagged as problematic and impacts cluster performance. This procedure ensures safe container replacement, restoring system stability.

Procedure

  1. Remove the backend label from the node that is hosting the WEKA container (for example, weka.io/supports-backends) to prevent it from being chosen for the new container.

Example
  2. Delete the pod containing the WEKA container. This action prompts the WEKA cluster to recreate the container, ensuring it is not placed on the labeled node.

  3. Monitor the container recreation and pod scheduling status. The container remains in a pending state due to the label being removed.

Example
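A sketch of the three steps, assuming the backend label weka.io/supports-backends and a container pod named weka-cluster-compute-4 in the weka-operator-system namespace:

    # 1. Remove the backend label from the denylisted node
    kubectl label node <k8s_node_IP> weka.io/supports-backends-

    # 2. Delete the pod that hosts the WEKA container
    kubectl delete pod weka-cluster-compute-4 -n weka-operator-system

    # 3. Monitor recreation; the new pod should remain Pending
    kubectl get pods -n weka-operator-system -w
    kubectl describe pod <new-pod-name> -n weka-operator-system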

Expected results

  • The container pod enters Pending state.

  • Pod scheduling fails with message: "nodes are available: x node(s) didn't match Pod's node affinity/selector".

  • The container is prevented from running on the denied node.

Troubleshooting

If the pod schedules successfully on the denied node:

  • Verify the backend support label was removed successfully.

  • Check node taints and tolerations.

  • Review pod scheduling policies and constraints.


Cluster scaling

Adjusting the size of a WEKA cluster ensures optimal performance and cost efficiency. Expand to meet growing workloads or shrink to reduce resources as demand decreases.

Expand a cluster

Cluster expansion enhances system resources and storage capacity while maintaining cluster stability. This procedure describes how to expand a WEKA cluster by increasing the number of compute and drive containers.

This procedure exemplifies expanding a cluster from 6 compute and 6 drive containers to 7 compute and 7 drive containers. Each driveContainer has one driveCore.

Before you begin

Verify the following:

  • Ensure sufficient resources are available.

  • Ensure valid Quay.io credentials for WEKA container images.

  • Ensure access to the WEKA operator namespace.

  • Check the number of available Kubernetes nodes using kubectl get nodes.

  • Ensure all existing WEKA containers are in Running state.

  • Confirm your cluster is healthy with weka status.

Procedure

  1. Update the cluster configuration by increasing the compute and drive container counts in your YAML file:

  2. Apply the updated configuration:

Example
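A minimal sketch of the change, assuming the container counts in your cluster.yaml are named computeContainers and driveContainers (use whatever field names already appear in your file):

    # cluster.yaml (excerpt)
    spec:
      computeContainers: 7   # previously 6 (assumed field name)
      driveContainers: 7     # previously 6 (assumed field name)

    # Apply the updated configuration
    kubectl apply -f cluster.yaml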

Expected results

  • Total of 14 backend containers (7 compute + 7 drive).

  • All new containers show status as UP.

  • Weka status shows increased storage capacity.

  • Protection status remains Fully protected.

Troubleshooting

  • If containers remain in Pending state, verify available node capacity.

  • Check for sufficient resources across Kubernetes nodes.

  • Review WEKA operator logs for expansion-related issues.

Considerations

  • The number of containers cannot exceed available Kubernetes nodes.

  • Pending containers indicate resource constraints or node availability issues.

  • Each expansion requires sufficient system resources across the cluster.

If your cluster has resource constraints or insufficient nodes, container creation may remain in a pending state until additional nodes become available.


Expand an S3 cluster

Expanding an S3 cluster is necessary when additional storage or improved performance is required. Follow the steps below to expand the cluster while maintaining data availability and integrity.

Procedure

  1. Update cluster YAML: Increase the number of S3 containers in the cluster YAML file and re-deploy the configuration. Example YAML update:

    Apply the changes:

Example
  2. Verify new pods: Confirm that additional S3 and Envoy pods are created and running. Use the following command to list all pods:

Ensure two new S3 and Envoy pods appear in the output and are in the Running state.

Example
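For instance:

    kubectl get pods -n weka-operator-system -o wide | grep -Ei 's3|envoy'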
  3. Validate expansion: Verify the S3 cluster has expanded to include the updated number of containers. Check the cluster status and ensure no errors are present. Use these commands for validation:

Confirm the updated configuration reflects four S3 containers and all components are operational.

Example

Shrink a cluster

A WEKA cluster shrink operation reduces compute and drive containers to optimize resources and system footprint. Shrinking may free resources, lower costs, align capacity with demand, or decommission infrastructure. Perform carefully to ensure data integrity and service availability.

Before you begin

Verify the following:

  • Cluster is in a healthy state before beginning.

  • The WEKA cluster is operational with sufficient redundancy.

  • At least one hot spare configured for safe container removal.

Procedure

  1. Modify the cluster configuration:

  2. Apply the updated configuration:

Example
  3. Verify the desired state change:

Replace <cluster-name> with your specific value.

Example
  4. Remove specific containers:

    • Identify the containers to remove.

    • Delete the compute container:

    • Delete the drive container:

  5. Verify cluster stability:

    • Check container status.

    • Monitor cluster health.

    • Verify data protection status.

Expected results

  • Reduced number of active containers and related pods.

  • Cluster status shows Running.

  • All remaining containers running properly.

  • Data protection maintained.

  • No service disruption.

Troubleshooting

  • If cluster shows degraded status, verify hot spare availability.

  • Check operator logs for potential issues.

  • Ensure proper container termination.

  • Verify resource redistribution.

Limitations

  • Manual container removal required.

  • Must maintain minimum required containers for protection level.

  • Hot spare needed for safe removal.

  • Cannot remove containers below protection requirement.

Expand and shrink cluster resources


Increase client cores

When system demands increase, you may need to add more processing power by increasing the number of client cores. This procedure shows how to increase client cores from 1 to 2 cores to improve system performance while maintaining stability.

Prerequisites

Sufficient hugepage memory (1500MiB per core).

Procedure

  1. Update the WekaClient object configuration in your client YAML file:

AWS DPDK on EKS is not supported for this configuration.

  2. Apply the updated client configuration:

Example
  3. Verify the new client core is added:

Replace <cluster-name> with your specific value.

Example
  4. Delete all client container pods to trigger the reconfiguration:

Replace <client-name> and <ip-address> with your specific values.

Example for one node
  5. Verify the client containers have restarted and rejoined the cluster:

Look for pods with your client name prefix to confirm they are in Running state.

Example
  6. Confirm the core increase in the WEKA cluster using the following commands:

Example
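A consolidated sketch of this procedure, assuming the core count field in your client.yaml is coresNumber and the client pods run in the weka-operator-system namespace (both assumptions; keep your file's actual field names and namespace):

    # client.yaml (excerpt) - raise the core count
    spec:
      coresNumber: 2   # previously 1 (assumed field name)

    # Apply the change and roll the client pods so it takes effect
    kubectl apply -f client.yaml
    kubectl delete pod <client-name>-<ip-address> -n weka-operator-system

    # Confirm the pods restart and the new core count is applied
    kubectl get pods -n weka-operator-system | grep <client-name>
    weka cluster container    # run inside a WEKA container; check the CORES column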

Verification

After completing these steps, verify that:

  • All client pods are in Running state.

  • The CORES value shows 2 for client containers.

  • The clients have successfully rejoined the cluster.

  • The system status shows no errors using weka status.

Troubleshooting

If clients fail to restart:

  • Ensure sufficient hugepage memory is available.

  • Check pod events for specific error messages.

  • Verify the client configuration in the YAML file is correct.


Increase backend cores

Increase the number of cores allocated to compute and drive containers to improve processing capacity for intensive workloads.

The following procedure exemplifies increasing computeCores and driveCores from 1 to 2 cores.

Procedure

  1. Modify the cluster YAML configuration to update core allocation:

  2. Apply the updated configuration:

Example
  3. Verify the changes are applied to the cluster configuration:

Example
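A minimal sketch, assuming the cluster.yaml fields are computeCores and driveCores as referenced above:

    # cluster.yaml (excerpt)
    spec:
      computeCores: 2   # previously 1
      driveCores: 2     # previously 1

    # Apply and confirm the change
    kubectl apply -f cluster.yaml
    kubectl get wekacluster <cluster-name> -n weka-operator-system -o yaml | grep -i cores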

Troubleshooting

If core values are not updated after applying changes:

  1. Verify the YAML syntax is correct.

  2. Ensure the cluster configuration was successfully applied.

  3. Check for any error messages in the cluster events:

  • Core allocation changes may require additional steps for full implementation.

  • Monitor cluster performance after making changes.

  • Consider testing in a non-production environment first.

  • Contact support if core values persist at previous settings after applying changes.


Cluster maintenance

Cluster maintenance ensures optimal performance, security, and reliability through regular updates. Key tasks include updating WekaCluster and WekaClient configurations, rotating pods to apply changes, and creating a token secret for WekaClient.

Update WekaCluster configuration

This topic explains how to update WekaCluster configuration parameters to enhance cluster performance or resolve issues.

You can update the following WekaCluster parameters:

  • AdditionalMemory (spec.AdditionalMemory)

  • Tolerations (spec.Tolerations)

  • RawTolerations (spec.RawTolerations)

  • DriversDistService (spec.DriversDistService)

  • ImagePullSecret (spec.ImagePullSecret)

After completing each of the following procedures, all pods restart within a few minutes to apply the new configuration.

Procedure: Update AdditionalMemory

  1. Open your cluster.yaml file and update the additional memory values (a configuration sketch follows this procedure):

  2. Apply the updated configuration:

  3. Delete the WekaContainer pods:

  4. Verify that the memory values have been updated to the new settings.
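A sketch of this flow with an assumed additionalMemory layout; keep whatever structure your existing cluster.yaml already uses:

    # cluster.yaml (excerpt, assumed layout and illustrative values in MiB)
    spec:
      additionalMemory:
        compute: 2048
        drive: 1024

    # Apply and roll the WekaContainer pods
    kubectl apply -f cluster.yaml
    kubectl delete pod -n weka-operator-system <compute-and-drive-pod-names>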

Procedure: Update Tolerations

  1. Open your cluster.yaml file and update the toleration values:

  2. Apply the updated configuration:

  3. Delete all WekaContainer pods:

Procedure: Update DriversDistService

  1. Open your cluster.yaml file and update the DriversDistService value:

  2. Apply the updated configuration:

  3. Delete the WEKA driver distribution pods:

Procedure: Update ImagePullSecret

  1. Open your cluster.yaml file and update the ImagePullSecret value:

  2. Apply the updated configuration:

  3. Delete all WekaContainer pods:

Troubleshooting

If pods do not restart automatically or the new configuration is not applied, verify:

  • The syntax in your cluster.yaml file is correct.

  • You have the necessary permissions to modify the cluster configuration.

  • The cluster is in a healthy state.


Update WekaClient configuration

This topic explains how to update WekaClient configuration parameters to ensure optimal client interactions with the cluster.

You can update the following WekaClient parameters:

  • DriversDistService (spec.DriversDistService)

  • ImagePullSecret (spec.ImagePullSecret)

  • WekaSecretRef (spec.WekaSecretRef)

  • AdditionalMemory (spec.AdditionalMemory)

  • UpgradePolicy (spec.UpgradePolicy)

  • DriversLoaderImage (spec.DriversLoaderImage)

  • Port (spec.Port)

  • AgentPort (spec.AgentPort)

  • PortRange (spec.PortRange)

  • CoresNumber (spec.CoresNumber)

  • Tolerations (spec.Tolerations)

  • RawTolerations (spec.RawTolerations)

After completing each of the following procedures, all pods restart within a few minutes to apply the new configuration.

Before you begin

Before updating any WekaClient configuration:

  • Ensure you have access to the client.yaml configuration file or client CRD.

  • Verify you have the necessary permissions to modify client configurations.

  • Back up your current configuration.

  • Ensure the cluster is in a healthy state and accessible to clients.

Procedure: Update DriversDistService

  1. Open your client.yaml file and update the DriversDistService value:

  2. Apply the updated configuration:

  3. Delete the client pods:

Procedure: Update ImagePullSecret

  1. Open your client.yaml file and update the ImagePullSecret value:

  2. Apply the updated configuration:

  3. Delete the client pods:

Procedure: Update Additional Memory

  1. Open your client.yaml file and update the additional memory values:

  2. Apply the updated configuration:

  3. Delete the client pods:

Procedure: Update Tolerations

  1. Open your client.yaml file and update the toleration values:

  2. Apply the updated configuration:

  3. Delete the client pods:

Procedure: Update WekaSecretRef

  1. Open your client.yaml file and update the WekaSecretRef value:

  2. Apply the updated configuration:

  3. Delete the client pods:

Procedure: Update Port Configuration

This procedure demonstrates how to migrate from specific port and agentPort configurations to a portRange configuration for Weka clients.

  1. Deploy the initial client configuration with specific ports:

  2. Apply the initial configuration:

  3. Verify the clients are running with the initial port configuration:

  4. Update the client YAML by removing the port and agentPort specifications and adding portRange (see the sketch after this procedure):

  5. Apply the updated configuration:

  6. Delete the existing client container pods to trigger reconfiguration:

    Replace <client-name> and <ip-address> with your specific values.

  7. Verify that the pods have restarted and rejoined the cluster:
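A before-and-after sketch of the client spec, using the field names listed above; the port values and the portRange format are illustrative, so check your WekaClient CRD for the exact syntax:

    # Before: fixed ports (illustrative values)
    spec:
      port: 14000
      agentPort: 14101

    # After: replace them with a port range
    spec:
      portRange: "14000-14100"   # assumed value format

    # Apply and roll the client pods
    kubectl apply -f client.yaml
    kubectl delete pod <client-name>-<ip-address> -n weka-operator-system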

Procedure: Update CoresNumber

  1. Open your client.yaml file and update the CoresNumber value:

  2. Apply the updated configuration:

  3. Delete the client pods:

Troubleshooting

If pods do not restart automatically or the new configuration is not applied, verify:

  • The syntax in your client.yaml file is correct.

  • You have the necessary permissions to modify the client configuration.

  • The cluster is in a healthy state and accessible to clients.

  • The specified ports are available and not blocked by network policies.


Rotate all pods when applying changes

Rotating pods after updating cluster configuration ensures changes are properly applied across all containers.

Procedure

  1. Apply the updated cluster configuration:

Example: update cluster.yaml and apply
  2. Delete all container pods and verify that all pods restart and reach the Running state within a few minutes. In the following commands, replace the * with the actual container names.

Example
  3. Delete the drive pods:

Example
  4. Delete the S3 pods:

Example
  5. Delete the envoy pods:

Example
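For example, assuming the pods run in the weka-operator-system namespace and their names contain compute, drive, s3, or envoy (kubectl does not expand * itself, so select the pods by name):

    # Delete the compute pods (repeat the pattern for drive, s3, and envoy pods)
    kubectl get pods -n weka-operator-system -o name | grep compute | xargs kubectl delete -n weka-operator-system

    # Watch everything return to Running
    kubectl get pods -n weka-operator-system -w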

Verification

  1. Monitor pod status until all pods return to Running state:

Example
  2. Verify the configuration changes are applied by checking pod resources:

Example

Expected results

  • All pods return to Running state within a few minutes.

  • Resource configurations match the updated values in the cluster configuration.

  • No service disruption during the rotation process.

  • Pods automatically restart after deletion.

  • The system maintains availability during pod rotation.

  • Wait for each set of pods to begin restarting before proceeding to the next set.


Create token secret for WekaClient

WekaClient tokens used for cluster authentication have a limited lifespan and will eventually expire. This guide walks you through the process of generating a new token, encoding it properly, and creating the necessary Kubernetes secret to maintain WekaClient connectivity.

Prerequisites

  • Access to a running WEKA cluster with backend servers

  • Kubernetes cluster with WEKA Operator deployed

  • kubectl access with appropriate permissions

  • Access to the weka-operator-system namespace

Step 1: Generate a new join token and encode it

The join token must be generated from within one of the WEKA backend containers. Follow these steps to create a long-lived token:

  1. List the available pods in the weka-operator-system namespace:

  2. Connect to a backend pod and generate the token:

    This command creates a token that remains valid for 52 weeks (one year). The system outputs a JWT token similar to:

  3. Encode the token: The generated token must be base64-encoded before use in the Kubernetes secret:

Save the base64-encoded output for use in the secret configuration.

Example
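A sketch of these steps; the pod name is a placeholder, and the token-generation command is left as a comment because it depends on your WEKA version (see the WEKA CLI reference):

    # List backend pods and open a shell in one of them
    kubectl get pods -n weka-operator-system
    kubectl exec -it -n weka-operator-system <backend-pod-name> -- bash

    # Inside the pod: generate a join token valid for 52 weeks using the WEKA CLI

    # Back on your workstation: base64-encode the token for the Kubernetes secret
    echo -n '<JWT-token>' | base64 -w0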

Step 2: Create the Kubernetes secret

Option A: Using YAML template

Create a YAML file with the following template, replacing the placeholder values:

Configuration notes:

  • join-secret: Use the base64-encoded token from Step 1

  • org, username, password: Copy these values from the existing secret or create new base64-encoded values

  • namespace: Use default or specify your target namespace
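A template sketch using the key names described above; the secret name matches the example later in this section, and every data value must be base64-encoded:

    apiVersion: v1
    kind: Secret
    metadata:
      name: new-weka-client-secret-cluster1
      namespace: default
    type: Opaque
    data:
      join-secret: <base64-encoded-token>
      org: <base64-encoded-org>
      username: <base64-encoded-username>
      password: <base64-encoded-password>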

Option B: Copy from existing secret

To preserve existing credentials, export the current secret and modify only the token:

Edit the file to update the join-secret field with your new base64-encoded token.

Step 3: Apply the secret

Deploy the new secret to your Kubernetes cluster:

Verify the secret creation:

Example

In the following example, the secret is created in the default namespace with the name new-weka-client-secret-cluster1.

The following YAML file creates a new client named new-cluster1-clients in the default namespace using the new-weka-client-secret-cluster1 secret.

Step 4: Update WekaClient configuration

Remove any existing client instances and ensure no pods are actively using WEKA storage on the target node.

Remove active workloads

  1. Identify pods using Weka on the target node:

  2. Stop workloads using Weka storage:

Remove existing WekaClient

  1. List current WekaClient instances:

  2. Delete the existing client:

Deploy new WekaClient

Create a new WekaClient configuration that references your updated secret:

Apply the configuration:

Step 5: Verify client status

Monitor the new WekaClient deployment:

The new client should show a Running status. CSI pods may temporarily enter CrashLoopBackOff state while the client initializes, but will recover automatically once the client is ready.

Example

Troubleshooting

CSI Pods in CrashLoopBackOff

If CSI pods remain in a failed state after the WekaClient is running, manually restart them:

Token validation

To verify your token is working correctly, check the WekaClient logs:

Secret verification

Confirm your secret contains the correct base64-encoded values:
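For example, to decode the stored token from the secret created earlier:

    kubectl get secret new-weka-client-secret-cluster1 -n default -o jsonpath='{.data.join-secret}' | base64 -d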

Best practices

  • Token Lifetime: Generate tokens with appropriate expiration times based on your maintenance schedule

  • Secret Management: Store secrets in appropriate namespaces with proper RBAC controls

  • Documentation: Maintain records of token generation dates and expiration times

  • Monitoring: Implement alerts for token expiration to prevent service disruptions

  • Testing: Validate new tokens in non-production environments before deploying to production

Security considerations

  • Limit access to token generation commands to authorized personnel only

  • Use namespaces to isolate secrets from different environments

  • Regularly rotate tokens as part of your security policy

  • Monitor and audit secret access and modifications


WekaContainer lifecycle management

The WekaContainer serves as a critical persistence layer within a Kubernetes environment. Understanding its lifecycle is crucial because deleting a WekaContainer, whether gracefully or forcefully, results in the permanent loss of all associated data.

The following diagram provides a visual overview of the WekaContainer's lifecycle in Kubernetes, illustrating the flow from creation through running states and the various paths taken during deletion. The subsequent sections elaborate on the specific states, processes, and decision points shown.

WekaContainer lifecycle in Kubernetes

Key deletion states

The deletion process involves two primary states the container can enter:

  • Deleting: This state signifies a graceful shutdown process triggered by standard Kubernetes deletion or pod deletion timeouts. It involves the controlled Deactivation sequence shown in the diagram before the container is removed.

  • Destroying: This state represents a forced, immediate removal, bypassing the deactivation steps. As the diagram shows, this is typically triggered by a Cluster destroy event.

Deletion triggers and paths

The specific path taken upon deletion depends on the trigger:

  • Kubernetes resource deletion: When a user deletes the WekaContainer custom resource directly (for example, kubectl delete wekacontainer...), Kubernetes initiates the process leading to the Deleting state, starting the graceful deactivation cycle.

  • Pod termination (user-initiated or node drain): As shown in the Pod termination path, if the specific Pod hosting the WekaContainer is terminated, while the WekaContainer Custom Resource (CR) still exists (for example, due to node failure, eviction, or direct kubectl delete pod):

    • Kubernetes first attempts to gracefully stop the Weka process within that pod using weka local stop, allowing a 5-minute grace period.

    • If successful, the process stops cleanly. If weka local stop times out or fails, the specific Weka container instance tied to that terminating pod may transition to the Deleting state (as per the diagram) to ensure proper deactivation and removal from the Weka cluster's perspective (leading to data loss for that instance).

    • Important: Because the WekaContainer CR itself has not been deleted and still defines the desired state, the WEKA Operator detects that the required pod is missing. Consequently, the Operator automatically attempts to create a new pod to replace the terminated one, aiming to bring the system back to the Running state defined by the CR. This new pod starts fresh.

  • Cluster destruction: A cluster destroy operation does not immediately transition containers to the Destroying state. By default, WekaCluster uses a graceful termination period (spec.gracefulDestroyDuration, set to 24 hours). When the WekaCluster custom resource is deleted, WekaContainers first enter a Paused state (pods are terminated), but the containers and their data remain intact. After the graceful period ends, containers transition to the Destroying state for forced removal, bypassing any graceful shutdown attempts.

The deactivation process (graceful deletion)

When a WekaContainer follows the path into the Deleting state, it undergoes the multi-step Deactivation process shown before drives are resigned. This sequence ensures safe removal from the WEKA cluster and includes:

  • Cluster deactivation.

  • Removal from the S3 cluster (if applicable).

  • Removal from the main WEKA cluster.

  • Skipping deactivation: By setting overrides.skipDeactivate=true, you can bypass the deactivation steps and route the flow directly to Resigned drives. However, this is considered unsafe.

Drive management

Regardless of whether the path taken was Deleting (with or without deactivation) or Destroying, the process ends with the storage drives being resigned. This makes them available for reuse.

Health state and replacement

In this flow diagram, it is crucial to understand that WekaContainers in the Deleting or Destroying states are deemed unhealthy. This informs Kubernetes and the WEKA Operator that the container is non-functional, typically prompting replacement attempts based on the deployment configuration. However, the data from the deleted container is permanently lost.
