WEKA Operator day-2 operations maintain and optimize WEKA environments in Kubernetes clusters: managing hardware, scaling clusters, and optimizing resources to ensure system stability and performance. These operations fall into three core areas:
Hardware maintenance: component replacement and node management, hardware failure remediation.
Cluster scaling: resource allocation optimization, controlled cluster expansion.
Cluster maintenance: configuration updates, pod rotation.
Administrators execute both planned maintenance and emergency responses while following standardized procedures to ensure high availability and minimize service disruption.
Hardware maintenance
Hardware maintenance operations ensure cluster reliability and performance through systematic component management and failure response procedures. These operations span from routine preventive maintenance to critical component replacements.
Key operations:
Node management
Graceful and forced node reboots.
Node replacement and removal.
Complete rack decommissioning procedures.
Container operations
Container migration from failed nodes.
Container replacement on active nodes.
Container management on denylisted nodes.
Storage management
Drive replacement in converged setups.
Storage integrity verification.
Component failure recovery.
Each procedure follows established protocols to maintain system stability and minimize service disruption during maintenance activities. The documented procedures enable administrators to execute both planned maintenance and emergency responses while preserving data integrity and system performance.
Before you begin
Before performing any hardware maintenance or replacement tasks, ensure you have:
Administrative access to your Kubernetes cluster.
SSH access to the cluster nodes.
kubectl command-line tool installed and configured.
Proper backup of any critical data on the affected components.
Required replacement hardware (if applicable).
Maintenance window scheduled (if required).
Perform standard verification steps
This procedure describes the standard verification steps for checking WEKA cluster health. Multiple procedures in this documentation refer to these verification steps to confirm successful completion of their respective tasks.
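The exact checks vary by procedure, but a typical verification pass uses the commands shown throughout this guide (namespace weka-operator-system as in the examples below):
kubectl get pods -n weka-operator-system
weka status
weka cluster container
weka cluster drive
Confirm that all pods are Running, the cluster status is OK with Fully protected data, the I/O status is STARTED, and all containers and drives report UP and ACTIVE.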
Perform a force reboot
A force reboot may be necessary when a machine becomes unresponsive or encounters a critical error that cannot be resolved through standard troubleshooting. This task ensures the machine restarts and resumes normal operation.
Phase 3: Ensure the WEKA containers are marked as drained.
List the cluster backend containers.
Run the following command to display the current status of all WEKA containers on the k8s nodes:
weka cluster container
Check the status of the WEKA containers.
In the command output, locate the STATUS column for the relevant containers. Verify that it displays DRAINED for the backend containers on the host being rebooted.
Example
HOST ID HOSTNAME CONTAINER IPS STATUS REQUESTED ACTION RELEASE FAILURE DOMAIN CORES MEMORY UPTIME LAST FAILURE REQUESTED ACTION FAILURE
0 ip-10-0-64-189 drivex8d9a7bcex5994x4566x975fx3c3fbc7bf017 10.0.64.189 DRAINED (DOWN) STOP 4.4.1 AUTO 1 1.54 GB Action requested (1 hour ago)
1 ip-10-0-119-18 drivexffc61f84x840ax4b4cx9944xabdf5d50ffac 10.0.119.18 UP NONE 4.4.1 AUTO 1 1.54 GB 2:06:45h
2 ip-10-0-96-125 drivexddded148xa18cx4d31xa785xfd19e87ae858 10.0.96.125 UP NONE 4.4.1 AUTO 1 1.54 GB 1:51:40h Action requested (1 hour ago)
3 ip-10-0-119-18 computex62e21720xd6efx4797xa1cbx2c134c316c95 10.0.119.18 UP NONE 4.4.1 AUTO 1 2.94 GB 2:06:47h
4 ip-10-0-68-19 drivex89230786x4d11x4364x9eb8x9d061baad12b 10.0.68.19 UP NONE 4.4.1 AUTO 1 1.54 GB 2:06:45h
5 ip-10-0-64-189 computexcc3cb8d2x4a7dx4f38x9087xa99c021dc51b 10.0.64.189 DRAINED (DOWN) STOP 4.4.1 AUTO 1 2.94 GB Action requested (1 hour ago)
6 ip-10-0-78-157 s3x86193fa8xd55ax4bb1xb8c7xc77f2841c76b 10.0.78.157 UP NONE 4.4.1 AUTO 1 1.26 GB 2:06:44h
7 ip-10-0-96-125 s3x1783ddc3x91a5x4fa9xa7aax83a88515418d 10.0.96.125 UP NONE 4.4.1 AUTO 1 1.26 GB 1:56:49h Action requested (1 hour ago)
8 ip-10-0-85-230 computexe4edde12xf27dx4745xafacx907dd046d202 10.0.85.230 UP NONE 4.4.1 AUTO 1 2.94 GB 2:06:48h
9 ip-10-0-68-19 computexb1c3fad3xe566x4cd0xa7e9xe1cd885e4f43 10.0.68.19 UP NONE 4.4.1 AUTO 1 2.94 GB 2:06:48h
10 ip-10-0-78-157 drivex34be6929x2711x42e9xa4afx532411fe3290 10.0.78.157 UP NONE 4.4.1 AUTO 1 1.54 GB 2:06:46h
11 ip-10-0-85-230 drivexa994b565x00c7x4d12x887ax611926a9f6cc 10.0.85.230 UP NONE 4.4.1 AUTO 1 1.54 GB 2:06:47h
12 ip-10-0-78-157 computex27909856x161ax4d29x9b86x233d94ff5c21 10.0.78.157 UP NONE 4.4.1 AUTO 1 2.94 GB 2:06:46h
13 ip-10-0-96-125 computex4769a665x1c29x4fcax899axc8f55b7207f8 10.0.96.125 UP NONE 4.4.1 AUTO 1 2.94 GB 1:53:08h Action requested (1 hour ago)
14 ip-10-0-74-235 22b88fbd24c8client 10.0.74.235 UP NONE 4.4.1 1 1.36 GB 1:21:58h
15 ip-10-0-94-134 22b88fbd24c8client 10.0.94.134 UP NONE 4.4.1 1 1.36 GB 0:41:26h
16 ip-10-0-85-177 22b88fbd24c8client 10.0.85.177 UP NONE 4.4.1 1 1.36 GB 1:21:55h
17 ip-10-0-71-46 22b88fbd24c8client 10.0.71.46 UP NONE 4.4.1 1 1.36 GB 1:21:48h
18 ip-10-0-71-140 22b88fbd24c8client 10.0.71.140 UP NONE 4.4.1 1 1.36 GB 0:40:53h
19 ip-10-0-92-127 22b88fbd24c8client 10.0.92.127 UP NONE 4.4.1 1 1.36 GB 1:21:40h
Phase 4: Force a reboot on all backend k8s nodes.
Use the reboot -f command to force a reboot on each backend k8s node.
Example:
sudo reboot -f
Rebooting.
After running this command, the node restarts immediately. Repeat for all backend k8s nodes, one at a time, in your environment.
Phase 5: Uncordon the backend k8s node and verify WEKA cluster status.
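For example, uncordon each backend node after it comes back online, then recheck the cluster; <k8s_node_IP> follows the same placeholder convention as Phase 6:
kubectl uncordon <k8s_node_IP>
weka status
weka cluster drive
Example of the weka cluster drive output, where one drive is still PHASING_IN while the others are ACTIVE: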
DISK ID UUID HOSTNAME NODE ID SIZE STATUS LIFETIME % USED ATTACHMENT DRIVE STATUS
0 8406a082-cc8d-40a1-87b4-90f7053dc3f2 ip-10-0-119-18 21 6.82 TiB ACTIVE 0 OK OK
1 01738d49-f2eb-49aa-8e39-6b5ac7f12727 ip-10-0-68-19 81 6.82 TiB ACTIVE 0 OK OK
2 8684695a-9a7f-4a39-801c-7019bd5fd4ea ip-10-0-64-189 1 6.82 TiB PHASING_IN 0 OK OK
3 ca4cacf6-4355-420d-a790-e0e58670e0ec ip-10-0-96-125 41 6.82 TiB ACTIVE 0 OK OK
4 b410544f-9fef-42a9-a24d-c73ffd33cefc ip-10-0-78-157 201 6.82 TiB ACTIVE 0 OK OK
5 16c55940-cf2c-4bf9-8d4b-d05f61be7264 ip-10-0-85-230 221 6.82 TiB ACTIVE 0 OK OK
Ensure all pods and WEKA containers are up, the cluster is in a healthy state (Fully protected), and I/O operations are running (STARTED). Monitor the redistribution progress and alerts.
Phase 6: Cordon and drain all client k8s nodes.
To cordon and drain a node, run the following commands. Replace <k8s_node_IP> with the target k8s node's IP address.
Cordon the client k8s node to mark it as unschedulable:
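A minimal example, assuming the commonly used drain options for nodes that run DaemonSets and pods with emptyDir storage:
kubectl cordon <k8s_node_IP>
kubectl drain <k8s_node_IP> --ignore-daemonsets --delete-emptydir-data
Repeat for each client k8s node before proceeding to the reboot phase.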
Phase 7: Force a reboot on all client k8s nodes.
Use the reboot -f command to force a reboot on each client k8s node.
Example for one client k8s node:
sudo reboot -f
Rebooting.
After running this command, the client node restarts immediately. Repeat for all client nodes in your environment.
Phase 8: Uncordon all client k8s nodes.
Once the client k8s nodes are back online, uncordon them to restore their availability for scheduling workloads. Example command for uncordoning a single client k8s node:
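For example:
kubectl uncordon <k8s_node_IP>
Repeat for each client k8s node.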
Remove a rack or k8s node
Removing a rack or Kubernetes (k8s) node is necessary when you need to decommission hardware, replace failed components, or reconfigure your cluster. This procedure guides you through safely removing nodes without disrupting system operations.
Monitor workload redistribution:
Check that workloads are redistributed to other failure domains while the nodes in one failure domain (FD) are down.
kubectl get pods -o wide
Expected results
After completing this procedure:
Your nodes are properly configured with failure domains.
Workloads are distributed according to the failure domain configuration.
The system is ready for node removal with minimal disruption.
Troubleshooting
If workloads do not redistribute as expected after node drain:
Check node labels and taints.
Verify the WekaCluster configuration.
Review the Kubernetes scheduler logs for any errors.
Perform a graceful node reboot on client nodes
A graceful node reboot ensures minimal service disruption when you need to restart a node for maintenance, updates, or configuration changes. The procedure involves cordoning the node, draining workloads, performing the reboot, and then returning the node to service.
Procedure
Cordon the Kubernetes node to prevent new workloads from being scheduled:
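A sketch of the full sequence (cordon, drain, reboot, uncordon), assuming the standard drain options; <node-ip> is a placeholder for the target node:
kubectl cordon <node-ip>
kubectl drain <node-ip> --ignore-daemonsets --delete-emptydir-data
sudo reboot
kubectl uncordon <node-ip>
Run the uncordon command only after the node is back online and reports Ready in kubectl get nodes.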
After uncordoning, the node is available for new workload scheduling.
Troubleshooting
If pods fail to start after the reboot:
Check pod status and events using kubectl describe pod <pod-name>.
Review node conditions using kubectl describe node <node-ip>.
Examine system logs for any errors or warnings.
Replace a drive in a converged setup
Drive replacement is necessary when hardware failures occur or system upgrades are required. Following this procedure ensures minimal system disruption while maintaining data integrity.
Before you begin
Ensure a replacement drive is ready for installation.
Identify the node and drive that need replacement.
Ensure you have the necessary permissions to execute Kubernetes commands.
Back up any critical data if necessary.
Procedure
List and record drive information:
List the available drives on the target node:
lsblk
Identify the serial ID of the drives:
ls -l /dev/disk/by-id | grep nvme
Record the current drive configuration:
weka cluster drive --verbose
Save the serial ID of the drive being replaced for later use.
Example
$ lsblk
$ ls -l /dev/disk/by-id | grep nvme
lrwxrwxrwx 1 root root 13 Jan 21 06:22 nvme-Amazon_EC2_NVMe_Instance_Storage_AWS22956E1E147546CE0 -> ../../nvme1n1
lrwxrwxrwx 1 root root 13 Jan 21 06:22 nvme-Amazon_EC2_NVMe_Instance_Storage_AWS22956E1E147546CE0_1 -> ../../nvme1n1
lrwxrwxrwx 1 root root 13 Jan 21 06:22 nvme-Amazon_EC2_NVMe_Instance_Storage_AWS22B489297A8BDAE28 -> ../../nvme2n1
lrwxrwxrwx 1 root root 13 Jan 21 06:22 nvme-Amazon_EC2_NVMe_Instance_Storage_AWS22B489297A8BDAE28_1 -> ../../nvme2n1
root@ip-10-0-86-72:/# weka cluster drive --verbose
UID DISK ID UUID HOST ID HOSTNAME NODE ID DEVICE PATH SIZE STATUS STATUS TIME FAILURE DOMAIN FAILURE DOMAIN ID WRITABLE LIFETIME % USED NVKV % USED ATTACHMENT VENDOR FIRMWARE SERIAL NUMBER MODEL ADDED REMOVED BLOCK SIZE SPARES REMAINING SPARES THRESHOLD DRIVE STATUS
cee03321-4874-c49d-d33b-ab3286e9e5e9 0 cc29f627-d90b-4ffc-9cb6-bd00c0859ba6 1 ip-10-0-86-72 21 0000:00:1e.0 6.82 TiB ACTIVE 0:16:12h AUTO 1 Writable 0 1 OK AMAZON 0 AWS1B0D53508C5F4ADC7 Amazon EC2 NVMe Instance Storage 0:16:46h 512 100 0 OK
c7590021-e14a-1469-2dc6-fa23e3b458b0 1 98e66c84-1b2f-4c9c-98e6-175bc693daf8 2 ip-10-0-108-188 41 0000:00:1e.0 6.82 TiB ACTIVE 0:16:12h AUTO 4 Writable 0 1 OK AMAZON 0 AWS19912EA78EC9334FC Amazon EC2 NVMe Instance Storage 0:16:46h 512 100 0 OK
af29dd72-47f2-1ba5-49a3-cf5bb6de8bdd 2 a00537e5-128d-4a8f-9ce4-4a410c5dedb7 3 ip-10-0-107-76 61 0000:00:1e.0 6.82 TiB ACTIVE 0:16:12h AUTO 2 Writable 0 1 OK AMAZON 0 AWS2283053DCF0B24EB8 Amazon EC2 NVMe Instance Storage 0:16:46h 512 100 0 OK
6b97cc0e-4a9a-6dae-cc16-b55e63105c9a 3 58fd7006-6727-46cd-b408-95693c70b525 0 ip-10-0-115-154 1 0000:00:1e.0 6.82 TiB ACTIVE 0:16:12h AUTO 0 Writable 0 1 OK AMAZON 0 AWS11FCE0C874C4432A2 Amazon EC2 NVMe Instance Storage 0:16:46h 512 100 0 OK
17c58212-351b-88e2-bc2e-ead01ab4b525 4 92197281-8d39-43aa-86c2-bb8901ac2f2f 7 ip-10-0-106-169 141 0000:00:1e.0 6.82 TiB ACTIVE 0:16:12h AUTO 3 Writable 0 1 OK AMAZON 0 AWS228BDF37DD8C9F561 Amazon EC2 NVMe Instance Storage 0:16:45h 512 100 0 OK
34fc77be-1041-02a8-f491-43ef142ced47 5 5b87de1e-1cf6-4c95-9ba8-03381e206384 6 ip-10-0-97-8 121 0000:00:1e.0 6.82 TiB ACTIVE 0:16:12h AUTO 5 Writable 0 1 OK AMAZON 0 AWS22956E1E147546CE0 Amazon EC2 NVMe Instance Storage 0:16:44h 512 100 0 OK
Remove node label: Remove the WEKA backend support label from the target node:
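For example, mirroring the unlabel command shown in the denylisted-node procedure later on this page (the trailing dash removes the label; <node-name> is a placeholder for the target node):
kubectl label nodes <node-name> weka.io/supports-backends-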
Delete drive container: Delete the WEKA container object associated with the drive. Then, verify that the container pod enters a pending state and the drive is removed from the cluster.
$ kubectl apply -f sign_devicepath_drive.yaml
wekamanualoperation.weka.weka.io/sign-specific-drives created
$ kubectl get wekamanualoperation --all-namespaces
NAMESPACE NAME ACTION STATUS AGE
weka-operator-system sign-specific-drives sign-drives 30s 23s
Block the old drive:
Create a YAML configuration file for blocking the old drive:
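The exact schema depends on your operator version; the following is only a hypothetical sketch modeled on the sign-drives manual operation shown above. The action name, spec fields, and serial number are placeholders to adapt from the WEKA Operator reference:
apiVersion: weka.weka.io/v1alpha1   # same API group as the sign-drives example above
kind: WekaManualOperation           # assumed resource kind, as implied by the earlier example
metadata:
  name: block-old-drive             # hypothetical name
  namespace: weka-operator-system
spec:
  action: block-drives              # hypothetical action name; confirm against your operator reference
  serials:                          # hypothetical field; use the serial ID recorded earlier
    - <old-drive-serial-id>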
If the container pod remains in a pending state, check the pod events and logs.
If drive signing fails, verify the device path and node selector.
If the old drive remains visible, ensure the block operation completed successfully.
Maintain system stability by replacing one drive at a time.
Keep track of all serial IDs involved in the replacement process.
Monitor system health throughout the procedure.
Remove WEKA container from a failed node
Removing a WEKA container from a failed node is necessary to maintain cluster health and prevent any negative impact on system performance. This procedure ensures that the container is removed safely and the cluster remains operational.
Procedure: Remove WEKA container from an active node
Follow these steps to remove a WEKA container when the node is responsive:
Request WEKA container deletion by setting the deletion timestamp:
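Deleting the WekaContainer custom resource is what sets its deletion timestamp. A minimal sketch, assuming the container object is exposed as a wekacontainer resource in the weka-operator-system namespace (verify the exact resource name with kubectl api-resources):
kubectl get wekacontainer -n weka-operator-system
kubectl delete wekacontainer <container-name> -n weka-operator-system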
You can verify the removal process by checking the WEKA container conditions. A successful removal shows the following conditions in order:
ContainerDrivesDeactivated
ContainerDeactivated
ContainerDrivesRemoved
ContainerRemoved
ContainerDrivesResigned
Replace a container on an active node
Replacing a container on an active node allows for system upgrades or failure recovery without shutting down services. This procedure ensures that the replacement is performed smoothly, keeping the cluster operational while the container is swapped out.
$ kubectl delete pod cluster-dev-drive-05ddc629-b7f5-4090-8736-b9fc3b48ad82 -n weka-operator-system
pod "cluster-dev-drive-05ddc629-b7f5-4090-8736-b9fc3b48ad82" deleted
Phase 2: Monitor deactivation process
Verify that the container and its drives are being deactivated:
weka cluster container
Expected status: The container shows DRAINED (DOWN) under the STATUS column.
Check the process status:
weka cluster process
Expected status: The processes associated with the container show DOWN status.
For drive containers, verify drive status:
weka cluster drive
Look for:
Drive status changes from ACTIVE to FAILED for the affected container.
All other drives remain ACTIVE.
Example
root@ip-10-0-79-159:/# weka cluster container
HOST ID HOSTNAME CONTAINER IPS STATUS REQUESTED ACTION RELEASE FAILURE DOMAIN CORES MEMORY UPTIME LAST FAILURE REQUESTED ACTION FAILURE
0 ip-10-0-116-144 drivexcbc24786xdce1x4f0dx93ecx174a06206c6e 10.0.116.144 UP NONE 4.4.1.89-k8s-beta x5 1 1.54 GB 0:10:03h
1 ip-10-0-118-174 drivex49d0f816x2a9bx45f2x844ax05ddd6182f3e 10.0.118.174 UP NONE 4.4.1.89-k8s-beta x2 1 1.54 GB 0:10:04h
2 ip-10-0-81-44 computex45f8ef2bx2b74x4d7fxa01bx096c857b6740 10.0.81.44 UP NONE 4.4.1.89-k8s-beta x8 1 2.94 GB 0:10:04h
3 ip-10-0-116-144 computexe207f94ax6b5ex4217xb963x5573c0f5201d 10.0.116.144 UP NONE 4.4.1.89-k8s-beta x5 1 2.94 GB 0:10:02h
4 ip-10-0-100-147 computex3f265472x00bfx4172xbaaex5e5f20364aa8 10.0.100.147 UP NONE 4.4.1.89-k8s-beta x6 1 2.94 GB 0:07:15h
5 ip-10-0-117-96 drivex69ffc965x6cd5x489cxb44bx9abacbce7a98 10.0.117.96 UP NONE 4.4.1.89-k8s-beta x2 1 1.54 GB 0:07:22h
6 ip-10-0-79-159 drivexd012285ex65a9x4dc9xa2d6x32a4c834f02f 10.0.79.159 UP NONE 4.4.1.89-k8s-beta x1 1 1.54 GB 0:10:00h
7 ip-10-0-81-44 drivex0657c388xd04cx4c0bx8633x445671d86657 10.0.81.44 UP NONE 4.4.1.89-k8s-beta x8 1 1.54 GB 0:09:56h
8 ip-10-0-82-71 drivexaac103bexbbfbx48e0x8c49x7d4c57bc1730 10.0.82.71 UP NONE 4.4.1.89-k8s-beta x1 1 1.54 GB 0:09:56h
9 ip-10-0-99-7 computex0a01896fx1436x4314x9bcdx59d59da50257 10.0.99.7 UP NONE 4.4.1.89-k8s-beta x6 1 2.94 GB 0:09:59h
10 ip-10-0-121-214 computex09541214x6d47x49e3xabfbx2410ee6c1e2b 10.0.121.214 UP NONE 4.4.1.89-k8s-beta x5 1 2.94 GB 0:09:50h
11 ip-10-0-117-96 computex13e5e78cx9600x4c33x98b3x7d5fc16420b7 10.0.117.96 UP NONE 4.4.1.89-k8s-beta x2 1 2.94 GB 0:09:58h
12 ip-10-0-82-208 drivex9d266897x9dfbx4714xa5d9xc2db327dbee0 10.0.82.208 UP NONE 4.4.1.89-k8s-beta x3 1 1.54 GB 0:09:57h
13 ip-10-0-79-159 s3x1770abeaxba16x46aex9f4ax91aefb70cf1e 10.0.79.159 UP NONE 4.4.1.89-k8s-beta x1 1 1.26 GB 0:07:23h
14 ip-10-0-100-147 drivexd8f97316x1c7cx4610xa892xd66d900e6cbe 10.0.100.147 UP NONE 4.4.1.89-k8s-beta x6 1 1.54 GB 0:10:09h
15 ip-10-0-99-7 s3x9cd9970fxfaf9x46fcxb9c7xb5a53a5a0ccc 10.0.99.7 UP NONE 4.4.1.89-k8s-beta x6 1 1.26 GB 0:09:53h
16 ip-10-0-93-213 drivex6ca2fb06xf38ax4335x9861x22b43fe4f8a6 10.0.93.213 UP NONE 4.4.1.89-k8s-beta x3 1 1.54 GB 0:10:09h
17 ip-10-0-93-213 computexc25efa58x2120x490cxb5c1x09fe57e45cdc 10.0.93.213 UP NONE 4.4.1.89-k8s-beta x3 1 2.94 GB 0:10:08h
18 ip-10-0-121-214 drivex05ddc629xb7f5x4090x8736xb9fc3b48ad82 10.0.121.214 DRAINED (DOWN) STOP 4.4.1.89-k8s-beta x5 1 1.54 GB
19 ip-10-0-66-157 drivexb5168f94x8cd4x4b6dx843cx65cd6919d55f 10.0.66.157 UP NONE 4.4.1.89-k8s-beta x7 1 1.54 GB 0:10:06h
20 ip-10-0-88-165 drivex816a0286xe173x44b4xb528x9cbde1f698c7 10.0.88.165 UP NONE 4.4.1.89-k8s-beta x4 1 1.54 GB 0:10:05h
21 ip-10-0-118-174 computex8ddb4d56x338fx483fxaac2x9064bccbce17 10.0.118.174 UP NONE 4.4.1.89-k8s-beta x2 1 2.94 GB 0:10:06h
22 ip-10-0-88-165 computex310353c8xa349x479axb34cxa172531d3927 10.0.88.165 UP NONE 4.4.1.89-k8s-beta x4 1 2.94 GB 0:09:55h
23 ip-10-0-66-157 computex334e89efx1230x43e1xa4c2x7c4774c5383e 10.0.66.157 UP NONE 4.4.1.89-k8s-beta x7 1 2.94 GB 0:09:50h
24 ip-10-0-70-16 computexfc0fd513xf1ccx49eexa228xd209579a22cb 10.0.70.16 UP NONE 4.4.1.89-k8s-beta x4 1 2.94 GB 0:10:04h
25 ip-10-0-82-71 computexcc79625ax5291x44bdx850ax92320ea1a791 10.0.82.71 UP NONE 4.4.1.89-k8s-beta x1 1 2.94 GB 0:10:03h
root@ip-10-0-79-159:/# weka cluster process
PROCESS ID CONTAINER ID SLOT IN HOST HOSTNAME CONTAINER IPS STATUS RELEASE ROLES NETWORK CPU MEMORY UPTIME LAST FAILURE
0 0 0 ip-10-0-116-144 drivexcbc24786xdce1x4f0dx93ecx174a06206c6e 10.0.116.144 UP 4.4.1.89-k8s-beta MANAGEMENT UDP N/A 0:06:57h Host joined a new cluster (7 minutes ago)
1 0 1 ip-10-0-116-144 drivexcbc24786xdce1x4f0dx93ecx174a06206c6e 10.0.116.144 UP 4.4.1.89-k8s-beta DRIVES UDP 1 1.54 GB 0:06:52h
20 1 0 ip-10-0-118-174 drivex49d0f816x2a9bx45f2x844ax05ddd6182f3e 10.0.118.174 UP 4.4.1.89-k8s-beta MANAGEMENT UDP N/A 0:06:57h Host joined a new cluster (7 minutes ago)
21 1 1 ip-10-0-118-174 drivex49d0f816x2a9bx45f2x844ax05ddd6182f3e 10.0.118.174 UP 4.4.1.89-k8s-beta DRIVES UDP 1 1.54 GB 0:06:53h
40 2 0 ip-10-0-81-44 computex45f8ef2bx2b74x4d7fxa01bx096c857b6740 10.0.81.44 UP 4.4.1.89-k8s-beta MANAGEMENT UDP N/A 0:06:57h Host joined a new cluster (7 minutes ago)
41 2 1 ip-10-0-81-44 computex45f8ef2bx2b74x4d7fxa01bx096c857b6740 10.0.81.44 UP 4.4.1.89-k8s-beta COMPUTE UDP 2 2.94 GB 0:06:52h
60 3 0 ip-10-0-116-144 computexe207f94ax6b5ex4217xb963x5573c0f5201d 10.0.116.144 UP 4.4.1.89-k8s-beta MANAGEMENT UDP N/A 0:06:57h Host joined a new cluster (7 minutes ago)
61 3 1 ip-10-0-116-144 computexe207f94ax6b5ex4217xb963x5573c0f5201d 10.0.116.144 UP 4.4.1.89-k8s-beta COMPUTE UDP 2 2.94 GB 0:06:52h
80 4 0 ip-10-0-100-147 computex3f265472x00bfx4172xbaaex5e5f20364aa8 10.0.100.147 UP 4.4.1.89-k8s-beta MANAGEMENT UDP N/A 0:06:59h
81 4 1 ip-10-0-100-147 computex3f265472x00bfx4172xbaaex5e5f20364aa8 10.0.100.147 UP 4.4.1.89-k8s-beta COMPUTE UDP 2 2.94 GB 0:06:52h
100 5 0 ip-10-0-117-96 drivex69ffc965x6cd5x489cxb44bx9abacbce7a98 10.0.117.96 UP 4.4.1.89-k8s-beta MANAGEMENT UDP N/A 0:06:58h Host joined a new cluster (7 minutes ago)
101 5 1 ip-10-0-117-96 drivex69ffc965x6cd5x489cxb44bx9abacbce7a98 10.0.117.96 UP 4.4.1.89-k8s-beta DRIVES UDP 1 1.54 GB 0:06:53h
120 6 0 ip-10-0-79-159 drivexd012285ex65a9x4dc9xa2d6x32a4c834f02f 10.0.79.159 UP 4.4.1.89-k8s-beta MANAGEMENT UDP N/A 0:06:57h Host joined a new cluster (7 minutes ago)
121 6 1 ip-10-0-79-159 drivexd012285ex65a9x4dc9xa2d6x32a4c834f02f 10.0.79.159 UP 4.4.1.89-k8s-beta DRIVES UDP 1 1.54 GB 0:06:52h
140 7 0 ip-10-0-81-44 drivex0657c388xd04cx4c0bx8633x445671d86657 10.0.81.44 UP 4.4.1.89-k8s-beta MANAGEMENT UDP N/A 0:06:57h Host joined a new cluster (7 minutes ago)
141 7 1 ip-10-0-81-44 drivex0657c388xd04cx4c0bx8633x445671d86657 10.0.81.44 UP 4.4.1.89-k8s-beta DRIVES UDP 1 1.54 GB 0:06:52h
160 8 0 ip-10-0-82-71 drivexaac103bexbbfbx48e0x8c49x7d4c57bc1730 10.0.82.71 UP 4.4.1.89-k8s-beta MANAGEMENT UDP N/A 0:06:57h Host joined a new cluster (7 minutes ago)
161 8 1 ip-10-0-82-71 drivexaac103bexbbfbx48e0x8c49x7d4c57bc1730 10.0.82.71 UP 4.4.1.89-k8s-beta DRIVES UDP 1 1.54 GB 0:06:53h
180 9 0 ip-10-0-99-7 computex0a01896fx1436x4314x9bcdx59d59da50257 10.0.99.7 UP 4.4.1.89-k8s-beta MANAGEMENT UDP N/A 0:06:57h Host joined a new cluster (7 minutes ago)
181 9 1 ip-10-0-99-7 computex0a01896fx1436x4314x9bcdx59d59da50257 10.0.99.7 UP 4.4.1.89-k8s-beta COMPUTE UDP 1 2.94 GB 0:06:52h
200 10 0 ip-10-0-121-214 computex09541214x6d47x49e3xabfbx2410ee6c1e2b 10.0.121.214 UP 4.4.1.89-k8s-beta MANAGEMENT UDP N/A 0:06:57h Host joined a new cluster (7 minutes ago)
201 10 1 ip-10-0-121-214 computex09541214x6d47x49e3xabfbx2410ee6c1e2b 10.0.121.214 UP 4.4.1.89-k8s-beta COMPUTE UDP 2 2.94 GB 0:06:52h
220 11 0 ip-10-0-117-96 computex13e5e78cx9600x4c33x98b3x7d5fc16420b7 10.0.117.96 UP 4.4.1.89-k8s-beta MANAGEMENT UDP N/A 0:06:57h Host joined a new cluster (7 minutes ago)
221 11 1 ip-10-0-117-96 computex13e5e78cx9600x4c33x98b3x7d5fc16420b7 10.0.117.96 UP 4.4.1.89-k8s-beta COMPUTE UDP 2 2.94 GB 0:06:52h
240 12 0 ip-10-0-82-208 drivex9d266897x9dfbx4714xa5d9xc2db327dbee0 10.0.82.208 UP 4.4.1.89-k8s-beta MANAGEMENT UDP N/A 0:06:48h Host joined a new cluster (7 minutes ago)
241 12 1 ip-10-0-82-208 drivex9d266897x9dfbx4714xa5d9xc2db327dbee0 10.0.82.208 UP 4.4.1.89-k8s-beta DRIVES UDP 1 1.54 GB 0:06:52h
260 13 0 ip-10-0-79-159 s3x1770abeaxba16x46aex9f4ax91aefb70cf1e 10.0.79.159 UP 4.4.1.89-k8s-beta MANAGEMENT UDP N/A 0:06:57h Host joined a new cluster (7 minutes ago)
261 13 1 ip-10-0-79-159 s3x1770abeaxba16x46aex9f4ax91aefb70cf1e 10.0.79.159 UP 4.4.1.89-k8s-beta FRONTEND UDP 2 1.26 GB 0:06:52h
280 14 0 ip-10-0-100-147 drivexd8f97316x1c7cx4610xa892xd66d900e6cbe 10.0.100.147 UP 4.4.1.89-k8s-beta MANAGEMENT UDP N/A 0:06:58h Host joined a new cluster (7 minutes ago)
281 14 1 ip-10-0-100-147 drivexd8f97316x1c7cx4610xa892xd66d900e6cbe 10.0.100.147 UP 4.4.1.89-k8s-beta DRIVES UDP 1 1.54 GB 0:06:53h
300 15 0 ip-10-0-99-7 s3x9cd9970fxfaf9x46fcxb9c7xb5a53a5a0ccc 10.0.99.7 UP 4.4.1.89-k8s-beta MANAGEMENT UDP N/A 0:06:57h Host joined a new cluster (7 minutes ago)
301 15 1 ip-10-0-99-7 s3x9cd9970fxfaf9x46fcxb9c7xb5a53a5a0ccc 10.0.99.7 UP 4.4.1.89-k8s-beta FRONTEND UDP 2 1.26 GB 0:06:52h
320 16 0 ip-10-0-93-213 drivex6ca2fb06xf38ax4335x9861x22b43fe4f8a6 10.0.93.213 UP 4.4.1.89-k8s-beta MANAGEMENT UDP N/A 0:06:58h Host joined a new cluster (7 minutes ago)
321 16 1 ip-10-0-93-213 drivex6ca2fb06xf38ax4335x9861x22b43fe4f8a6 10.0.93.213 UP 4.4.1.89-k8s-beta DRIVES UDP 1 1.54 GB 0:06:47h
340 17 0 ip-10-0-93-213 computexc25efa58x2120x490cxb5c1x09fe57e45cdc 10.0.93.213 UP 4.4.1.89-k8s-beta MANAGEMENT UDP N/A 0:06:57h Host joined a new cluster (7 minutes ago)
341 17 1 ip-10-0-93-213 computexc25efa58x2120x490cxb5c1x09fe57e45cdc 10.0.93.213 UP 4.4.1.89-k8s-beta COMPUTE UDP 2 2.94 GB 0:06:50h
360 18 0 ip-10-0-121-214 drivex05ddc629xb7f5x4090x8736xb9fc3b48ad82 10.0.121.214 DOWN 4.4.1.89-k8s-beta MANAGEMENT UDP N/A Host joined a new cluster (7 minutes ago)
361 18 1 ip-10-0-121-214 drivex05ddc629xb7f5x4090x8736xb9fc3b48ad82 10.0.121.214 DOWN 4.4.1.89-k8s-beta DRIVES UDP 1 1.54 GB
380 19 0 ip-10-0-66-157 drivexb5168f94x8cd4x4b6dx843cx65cd6919d55f 10.0.66.157 UP 4.4.1.89-k8s-beta MANAGEMENT UDP N/A 0:06:57h Host joined a new cluster (7 minutes ago)
381 19 1 ip-10-0-66-157 drivexb5168f94x8cd4x4b6dx843cx65cd6919d55f 10.0.66.157 UP 4.4.1.89-k8s-beta DRIVES UDP 1 1.54 GB 0:06:53h
400 20 0 ip-10-0-88-165 drivex816a0286xe173x44b4xb528x9cbde1f698c7 10.0.88.165 UP 4.4.1.89-k8s-beta MANAGEMENT UDP N/A 0:06:57h Host joined a new cluster (7 minutes ago)
401 20 1 ip-10-0-88-165 drivex816a0286xe173x44b4xb528x9cbde1f698c7 10.0.88.165 UP 4.4.1.89-k8s-beta DRIVES UDP 1 1.54 GB 0:06:53h
420 21 0 ip-10-0-118-174 computex8ddb4d56x338fx483fxaac2x9064bccbce17 10.0.118.174 UP 4.4.1.89-k8s-beta MANAGEMENT UDP N/A 0:06:57h Host joined a new cluster (7 minutes ago)
421 21 1 ip-10-0-118-174 computex8ddb4d56x338fx483fxaac2x9064bccbce17 10.0.118.174 UP 4.4.1.89-k8s-beta COMPUTE UDP 2 2.94 GB 0:06:52h
440 22 0 ip-10-0-88-165 computex310353c8xa349x479axb34cxa172531d3927 10.0.88.165 UP 4.4.1.89-k8s-beta MANAGEMENT UDP N/A 0:06:57h Host joined a new cluster (7 minutes ago)
441 22 1 ip-10-0-88-165 computex310353c8xa349x479axb34cxa172531d3927 10.0.88.165 UP 4.4.1.89-k8s-beta COMPUTE UDP 2 2.94 GB 0:06:52h
460 23 0 ip-10-0-66-157 computex334e89efx1230x43e1xa4c2x7c4774c5383e 10.0.66.157 UP 4.4.1.89-k8s-beta MANAGEMENT UDP N/A 0:06:57h Host joined a new cluster (7 minutes ago)
461 23 1 ip-10-0-66-157 computex334e89efx1230x43e1xa4c2x7c4774c5383e 10.0.66.157 UP 4.4.1.89-k8s-beta COMPUTE UDP 2 2.94 GB 0:06:52h
480 24 0 ip-10-0-70-16 computexfc0fd513xf1ccx49eexa228xd209579a22cb 10.0.70.16 UP 4.4.1.89-k8s-beta MANAGEMENT UDP N/A 0:06:57h Host joined a new cluster (7 minutes ago)
481 24 1 ip-10-0-70-16 computexfc0fd513xf1ccx49eexa228xd209579a22cb 10.0.70.16 UP 4.4.1.89-k8s-beta COMPUTE UDP 1 2.94 GB 0:06:52h
500 25 0 ip-10-0-82-71 computexcc79625ax5291x44bdx850ax92320ea1a791 10.0.82.71 UP 4.4.1.89-k8s-beta MANAGEMENT UDP N/A 0:06:57h Host joined a new cluster (7 minutes ago)
501 25 1 ip-10-0-82-71 computexcc79625ax5291x44bdx850ax92320ea1a791 10.0.82.71 UP 4.4.1.89-k8s-beta COMPUTE UDP 2 2.94 GB 0:06:52h
root@ip-10-0-79-159:/# weka cluster drive
DISK ID UUID HOSTNAME NODE ID SIZE STATUS LIFETIME % USED ATTACHMENT DRIVE STATUS
0 2c5a2e45-72dd-481a-ae58-9672ab52fe86 ip-10-0-118-174 21 6.82 TiB ACTIVE 0 OK OK
1 12d86b77-24fa-4c08-9b25-b91758e17e99 ip-10-0-121-214 361 6.82 TiB FAILED 0 OK OK
2 0123005a-aa60-4d74-95da-dda248d41f6a ip-10-0-93-213 321 6.82 TiB ACTIVE 0 OK OK
3 a6b749a7-6f7d-4099-a750-d8368bbc6174 ip-10-0-82-208 241 6.82 TiB ACTIVE 0 OK OK
4 8f9be465-c5e0-4a92-8c1f-c9b963a8a596 ip-10-0-79-159 121 6.82 TiB ACTIVE 0 OK OK
5 1850a4bb-9cbd-4e83-8428-ac2fee28ec7f ip-10-0-116-144 1 6.82 TiB ACTIVE 0 OK OK
6 f45d8354-a294-4176-9d00-7ab61bc19225 ip-10-0-66-157 381 6.82 TiB ACTIVE 0 OK OK
7 7950caad-628e-48bf-b224-3571af78fc38 ip-10-0-81-44 141 6.82 TiB ACTIVE 0 OK OK
8 3270402b-794a-403d-a8ab-595afa60bb35 ip-10-0-117-96 101 6.82 TiB ACTIVE 0 OK OK
9 7dc6657d-3038-4b87-a4b6-c36d2fa2a08a ip-10-0-100-147 281 6.82 TiB ACTIVE 0 OK OK
10 f310387c-16b7-4252-92ff-a1475b5bbbf6 ip-10-0-82-71 161 6.82 TiB ACTIVE 0 OK OK
11 6e3e1c8b-da59-4106-9e65-e27b586a0c71 ip-10-0-88-165 401 6.82 TiB ACTIVE 0 OK OK
Phase 3: Monitor container recreation
Watch for the new container creation:
kubectl get pods -o wide -n weka-operator-system -w
Verify the new container's integration with the cluster:
weka cluster container
Expected result: A new container appears with UP status.
Verify the new container's running status:
kubectl get pods -n weka-operator-system
Expected status: Running.
Confirm the container's integration with the WEKA cluster:
weka cluster host
Expected status: UP.
For drive containers, verify drive activity:
weka cluster drive
Expected status: All drives display ACTIVE status.
If the new container fails to start, verify resource availability and check:
Node resource availability
Network connectivity
Service status
Replace a container on a denylisted node
Replacing a container on a denylisted node is necessary when the node is flagged as problematic and impacts cluster performance. This procedure ensures safe container replacement, restoring system stability.
Procedure
Remove the backend label from the node that is hosting the WEKA container (for example, weka.io/supports-backends) to prevent it from being chosen for the new container:
$ kubectl get nodes 18.201.172.13 --show-labels
NAME STATUS ROLES AGE VERSION LABELS
18.201.172.13 Ready control-plane,etcd,master 161m v1.30.6+k3s1 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=k3s,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=18.201.172.13,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=true,node-role.kubernetes.io/etcd=true,node-role.kubernetes.io/master=true,node.kubernetes.io/instance-type=k3s,p2p.k3s.cattle.io/enabled=true,weka.io/failure-domain=x1,weka.io/supports-backends=true,weka.io/supports-builds=true,weka.io/supports-clients=true
$ kubectl label nodes 18.201.172.13 weka.io/supports-backends-
node/18.201.172.13 unlabeled
$ kubectl get nodes 18.201.172.13 --show-labels
NAME STATUS ROLES AGE VERSION LABELS
18.201.172.13 Ready control-plane,etcd,master 162m v1.30.6+k3s1 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=k3s,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=18.201.172.13,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=true,node-role.kubernetes.io/etcd=true,node-role.kubernetes.io/master=true,node.kubernetes.io/instance-type=k3s,p2p.k3s.cattle.io/enabled=true,weka.io/failure-domain=x1,weka.io/supports-builds=true,weka.io/supports-clients=true
Delete the pod containing the WEKA container.
This action prompts the WEKA cluster to recreate the container, ensuring it is not placed on the labeled node.
kubectl delete pod <pod-name> -n weka-operator-system
Monitor the container recreation and pod scheduling status.
The container remains in a pending state due to the label being removed.
Pod scheduling fails with message: "nodes are available: x node(s) didn't match Pod's node affinity/selector".
The container is prevented from running on the denied node.
Troubleshooting
If the pod schedules successfully on the denied node:
Verify the backend support label was removed successfully.
Check node taints and tolerations.
Review pod scheduling policies and constraints.
Cluster scaling
Adjusting the size of a WEKA cluster ensures optimal performance and cost efficiency. Expand to meet growing workloads or shrink to reduce resources as demand decreases.
Expand a cluster
Cluster expansion enhances system resources and storage capacity while maintaining cluster stability. This procedure describes how to expand a WEKA cluster by increasing the number of compute and drive containers.
This procedure shows an example of expanding a cluster from 6 compute and 6 drive containers to 7 compute and 7 drive containers. Each driveContainer has one driveCore.
Before you begin
Verify the following:
Ensure sufficient resources are available.
Ensure valid Quay.io credentials for WEKA container images.
Ensure access to the WEKA operator namespace.
Check the number of available Kubernetes nodes using kubectl get nodes.
Ensure all existing WEKA containers are in Running state.
Confirm your cluster is healthy with weka status.
Procedure
Update the cluster configuration by increasing the container counts from their previous values in your YAML file:
cluster.yaml
spec:
  template: dynamic
  dynamicTemplate:
    computeContainers: 7  # Increase from previous value
    driveContainers: 7    # Increase from previous value
    computeCores: 1
    driveCores: 1
    numDrives: 1
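After saving the file, apply it so the operator reconciles the new container counts:
kubectl apply -f cluster.yaml
Expected results after the expansion completes: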
Total of 14 backend containers (7 compute + 7 drive).
All new containers show status as UP.
Weka status shows increased storage capacity.
Protection status remains Fully protected.
Troubleshooting
If containers remain in Pending state, verify available node capacity.
Check for sufficient resources across Kubernetes nodes.
Review WEKA operator logs for expansion-related issues.
Considerations
The number of containers cannot exceed available Kubernetes nodes.
Pending containers indicate resource constraints or node availability issues.
Each expansion requires sufficient system resources across the cluster.
If your cluster has resource constraints or insufficient nodes, container creation may remain in a pending state until additional nodes become available.
Expand an S3 cluster
Expanding an S3 cluster is necessary when additional storage or improved performance is required. Follow the steps below to expand the cluster while maintaining data availability and integrity.
Procedure
Update cluster YAML: Increase the number of S3 containers in the cluster YAML file and re-deploy the configuration.
Example YAML update:
spec:
  template: dynamic
  dynamicTemplate:
    computeContainers: 6
    driveContainers: 6
    computeCores: 1
    driveCores: 1
    numDrives: 1
    s3Containers: 4  # Increase from previous value
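Apply the updated file to redeploy the configuration:
kubectl apply -f cluster.yaml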
Validate expansion: Verify the S3 cluster has expanded to include the updated number of containers. Check the cluster status and ensure no errors are present.
Use these commands for validation:
kubectl describe wekacluster -n weka-operator-system
Confirm the updated configuration reflects four S3 containers and all components are operational.
Example
$ kubectl describe wekacluster -n weka-operator-system
Name: cluster-dev
Namespace: weka-operator-system
Labels: <none>
Annotations: <none>
API Version: weka.weka.io/v1alpha1
Kind: WekaCluster
Metadata:
Creation Timestamp: 2024-11-16T11:13:19Z
Finalizers:
weka.weka.io/finalizer
Generation: 3
Resource Version: 10445
UID: 844cec7a-f41d-45cd-9c59-9810bbd199fe
Spec:
Additional Memory:
Compute: 500
Drive: 1000
s3: 200
Cpu Policy: auto
Drivers Dist Service: https://weka-driver-dist.weka-operator-system.svc.cluster.local:60002
Dynamic Template:
Compute Containers: 6
Compute Cores: 1
Drive Containers: 6
Drive Cores: 1
Num Drives: 1
s3Containers: 4
Graceful Destroy Duration: 24h0m0s
Hot Spare: 0
Image: quay.io/weka.io/weka-in-container:4.4.1
Image Pull Secret: quay-io-robot-secret
Network:
Node Selector:
weka.io/supports-backends: true
Ports:
Role Node Selector:
Template: dynamic
Status:
Cluster ID: cd596d28-be9a-4864-b34b-dbe45e8914cc
Conditions:
Last Transition Time: 2024-11-16T11:13:19Z
Message: Cluster secrets are created
Reason: Init
Status: True
Type: ClusterSecretsCreated
Last Transition Time: 2024-11-16T11:13:21Z
Message: Completed successfully
Reason: Init
Status: True
Type: PodsCreated
Last Transition Time: 2024-11-16T11:22:41Z
Message: Completed successfully
Reason: Init
Status: True
Type: ContainerResourcesAllocated
Last Transition Time: 2024-11-16T11:19:06Z
Message: Completed successfully
Reason: Init
Status: True
Type: PodsReady
Last Transition Time: 2024-11-16T11:19:29Z
Message: Completed successfully
Reason: Init
Status: True
Type: ClusterCreated
Last Transition Time: 2024-11-16T11:19:29Z
Message: Completed successfully
Reason: Init
Status: True
Type: JoinedCluster
Last Transition Time: 2024-11-16T11:19:30Z
Message: Completed successfully
Reason: Init
Status: True
Type: DrivesAdded
Last Transition Time: 2024-11-16T11:20:11Z
Message: Completed successfully
Reason: Init
Status: True
Type: IoStarted
Last Transition Time: 2024-11-16T11:20:12Z
Message: Completed successfully
Reason: Init
Status: True
Type: ClusterSecretsApplied
Last Transition Time: 2024-11-16T11:20:14Z
Message: Completed successfully
Reason: Init
Status: True
Type: CondDefaultFsCreated
Last Transition Time: 2024-11-16T11:20:14Z
Message: Completed successfully
Reason: Init
Status: True
Type: CondS3ClusterCreated
Last Transition Time: 2024-11-16T11:20:15Z
Message: Completed successfully
Reason: Init
Status: True
Type: ClusterClientsSecretsCreated
Last Transition Time: 2024-11-16T11:20:15Z
Message: Completed successfully
Reason: Init
Status: True
Type: ClusterClientsSecretsApplied
Last Transition Time: 2024-11-16T11:20:15Z
Message: Completed successfully
Reason: Init
Status: True
Type: ClusterCSIsSecretsCreated
Last Transition Time: 2024-11-16T11:20:16Z
Message: Completed successfully
Reason: Init
Status: True
Type: ClusterCSIsSecretsApplied
Last Transition Time: 2024-11-16T11:20:16Z
Message: Completed successfully
Reason: Init
Status: True
Type: WekaHomeConfigured
Last Transition Time: 2024-11-16T11:20:16Z
Message: Completed successfully
Reason: Init
Status: True
Type: ClusterIsReady
Last Applied Image: quay.io/weka.io/weka-in-container:4.4.1
Last Applied Spec: 571bbabc250fb7cfb19ed709bc40b8cb752931a7b3080e300e1b58db8c9559ee
Ports:
Base Port: 15000
Lb Admin Port: 15301
Lb Port: 15300
Port Range: 500
s3Port: 15302
Status: Ready
Throughput:
Events: <none>
$ kubectl exec -it cluster-dev-compute-05a6a09a-432d-42fe-9df4-c129780aa410 -n weka-operator-system -- /bin/bash
root@ip-10-0-93-212:/# weka status
WekaIO v4.4.1 (CLI build 4.4.1)
cluster: cluster-dev (cd596d28-be9a-4864-b34b-dbe45e8914cc)
status: OK (16 backend containers UP, 6 drives UP)
protection: 3+2 (Fully protected)
hot spare: 0 failure domains
drive storage: 22.09 TiB total, 21.86 TiB unprovisioned
cloud: connected
license: Unlicensed
io status: STARTED 5 minutes ago (16 io-nodes UP, 138 Buckets UP)
link layer: Ethernet
clients: 0 connected
reads: 0 B/s (0 IO/s)
writes: 0 B/s (0 IO/s)
operations: 12 ops/s
alerts: 31 active alerts, use `weka alerts` to list them
root@ip-10-0-93-212:/# weka cluster host
HOST ID HOSTNAME CONTAINER IPS STATUS REQUESTED ACTION RELEASE FAILURE DOMAIN CORES MEMORY UPTIME LAST FAILURE REQUESTED ACTION FAILURE
0 ip-10-0-124-65 drivexac3824eexcb66x469exbca9xf2c7274db4ec 10.0.124.65 UP NONE 4.4.1 AUTO 1 1.54 GB 0:06:42h
1 ip-10-0-102-61 computex723230f4x6ed1x4ff3x94dfxe8c7ebcede75 10.0.102.61 UP NONE 4.4.1 AUTO 1 2.94 GB 0:06:40h
2 ip-10-0-93-212 computex05a6a09ax432dx42fex9df4xc129780aa410 10.0.93.212 UP NONE 4.4.1 AUTO 1 2.94 GB 0:06:26h
3 ip-10-0-79-87 s3xce450cebx58c9x4049x986ax75327fa0d76a 10.0.79.87 UP NONE 4.4.1 AUTO 1 1.26 GB 0:06:36h
4 ip-10-0-113-26 drivex65376aa9x24f0x4eb2x9dfexd72e408916e0 10.0.113.26 UP NONE 4.4.1 AUTO 1 1.54 GB 0:06:33h
5 ip-10-0-64-53 computexf3df3e56xaad6x4d29xb303xfdc60132f870 10.0.64.53 UP NONE 4.4.1 AUTO 1 2.94 GB 0:06:35h
6 ip-10-0-107-12 s3x78a7332fx1a54x429cxa34exaa96fbbff216 10.0.107.12 UP NONE 4.4.1 AUTO 1 1.26 GB 0:06:43h
7 ip-10-0-93-212 drivexd7414597x3a96x459ex99a0x7965345c3fa0 10.0.93.212 UP NONE 4.4.1 AUTO 1 1.54 GB 0:06:26h
8 ip-10-0-124-65 computexe6269c4exb392x4951xb41bxa401a59fb11a 10.0.124.65 UP NONE 4.4.1 AUTO 1 2.94 GB 0:06:42h
9 ip-10-0-113-26 computexf25ee328x7ea6x4d83x9e53x113996c91a78 10.0.113.26 UP NONE 4.4.1 AUTO 1 2.94 GB 0:06:33h
10 ip-10-0-64-53 drivexdad14164xf118x4cfdx9401x6e061f44209d 10.0.64.53 UP NONE 4.4.1 AUTO 1 1.54 GB 0:06:35h
11 ip-10-0-79-87 drivexaf464a29x7180x445ax869dx64e274b47993 10.0.79.87 UP NONE 4.4.1 AUTO 1 1.54 GB 0:06:36h
12 ip-10-0-102-61 drivex79df7254xcee6x4411xb78ax1e503e331e9f 10.0.102.61 UP NONE 4.4.1 AUTO 1 1.54 GB 0:06:39h
13 ip-10-0-107-12 computex6d9f3d37xb8e7x4db6x9df1x5aa2a82e423e 10.0.107.12 UP NONE 4.4.1 AUTO 1 2.94 GB 0:06:43h
14 ip-10-0-124-65 s3x75aaeac7xda47x44bbx82d3xc3c273575bd3 10.0.124.65 UP NONE 4.4.1 AUTO 1 1.26 GB 0:02:34h
15 ip-10-0-102-61 s3x1ffc8818xe647x4e5cxbbb3x95dcd8ca96f8 10.0.102.61 UP NONE 4.4.1 AUTO 1 1.26 GB 0:02:34h
The command 'weka cluster host' is deprecated. Please use 'weka cluster container' instead.
root@ip-10-0-93-212:/# weka s3 cluster
S3 Cluster Info
Status: Online
All Hosts: off
Port: 15300
Filesystem: default
S3 Hosts: HostId<14>, HostId<3>, HostId<6>, HostId<15>
root@ip-10-0-93-212:/# weka s3 cluster -v
S3 Cluster Info
Status: Online
All Hosts: off
Port: 15300
Filesystem: default
Config FS: .config_fs
S3 Hosts: HostId<14>, HostId<3>, HostId<6>, HostId<15>
Mount Options: rw,relatime,readcache,readahead_kb=32768,dentry_max_age_positive=1000,dentry_max_age_negative=0,container_name=s3xce450cebx58c9x4049x986ax75327fa0d76a
TLS: on
ILM: on
Creator Owner: off
Max Buckets Limit: 10000
MPU Background: on
ILM Hosts: HostId<3>
Anonymous Posix UID/GID: 65534/65534
Internal Port: 15302
SLB Admin Port: 15301
SLB Max Connections: 1024
SLB Max Pending Requests: 1024
SLB Max Requests: 1024
root@ip-10-0-93-212:/# weka s3 cluster status
ID HOSTNAME S3 STATUS IP PORT VERSION UPTIME ACTIVE REQUESTS LAST FAILURE
14 ip-10-0-124-65 Ready 10.0.124.65 15300 4.4.1 0:02:33h 0
15 ip-10-0-102-61 Ready 10.0.102.61 15300 4.4.1 0:02:31h 0
3 ip-10-0-79-87 Ready 10.0.79.87 15300 4.4.1 0:05:26h 0
6 ip-10-0-107-12 Ready 10.0.107.12 15300 4.4.1 0:05:26h 0
Shrink a cluster
A WEKA cluster shrink operation reduces the number of compute and drive containers to optimize resources and system footprint. Shrinking can free resources, lower costs, align capacity with demand, or support decommissioning infrastructure. Perform this operation carefully to preserve data integrity and service availability.
Before you begin
Verify the following:
The cluster is in a healthy state before you begin.
The WEKA cluster is operational with sufficient redundancy.
At least one hot spare is configured for safe container removal.
Procedure
Modify the cluster configuration:
cluster.yaml
spec:
  template: dynamic
  dynamicTemplate:
    computeContainers: 6  # Reduce from previous value
    driveContainers: 6    # Reduce from previous value
    computeCores: 1
    driveCores: 1
    numDrives: 1
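Apply the reduced configuration and monitor the cluster while the operator removes the extra containers; these are the same commands used throughout this guide:
kubectl apply -f cluster.yaml
weka cluster container
weka status
Confirm the removed containers no longer appear and the protection status remains Fully protected.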
Increase client cores
When system demands increase, you may need to add more processing power by increasing the number of client cores. This procedure shows how to increase client cores from 1 to 2 to improve system performance while maintaining stability.
Prerequisites
Sufficient hugepage memory (1500MiB per core).
Procedure
Update the WekaClient object configuration in your client YAML file:
coresNum: 2  # increase num of cores
AWS DPDK on EKS is not supported for this configuration.
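Apply the updated WekaClient object; client.yaml here is an assumed file name, use whichever file defines your WekaClient:
kubectl apply -f client.yaml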
Confirm the core increase in the WEKA cluster using the following commands:
weka cluster container
weka cluster process
weka status
Example
root@ip-10-0-98-109:/# weka cluster container
HOST ID HOSTNAME CONTAINER IPS STATUS REQUESTED ACTION RELEASE FAILURE DOMAIN CORES MEMORY UPTIME LAST FAILURE REQUESTED ACTION FAILURE
0 ip-10-0-83-118 drivex92d620e9xffc0x4d14x823ex6444a0d2a823 10.0.83.118 UP NONE 4.4.2.144-k8s AUTO 1 1.54 GB 2:17:27h
1 ip-10-0-83-118 s3x58e05049x0c4ex44b8x9b99xccff4dc364db 10.0.83.118 UP NONE 4.4.2.144-k8s AUTO 1 1.26 GB 2:17:25h
2 ip-10-0-125-187 drivex3cc5580dx303ex4c15xba6cx82ccc04a898d 10.0.125.187 UP NONE 4.4.2.144-k8s AUTO 1 1.54 GB 2:17:25h
3 ip-10-0-65-133 drivex4cca8165xfcaax438dx9a75xf372efc7a497 10.0.65.133 UP NONE 4.4.2.144-k8s AUTO 1 1.54 GB 2:17:28h
4 ip-10-0-65-133 computex133b97bcx5bf6x4612x8f2axd09fc42bb573 10.0.65.133 UP NONE 4.4.2.144-k8s AUTO 1 2.94 GB 2:17:25h
5 ip-10-0-110-144 computexc0ac9647xf6a5x4d77x909ax07e865469af1 10.0.110.144 UP NONE 4.4.2.144-k8s AUTO 1 2.94 GB 2:17:24h
6 ip-10-0-107-84 computex1b0db083x8986x45e0x96e3xe739b58808ee 10.0.107.84 UP NONE 4.4.2.144-k8s AUTO 1 2.94 GB 2:17:27h
7 ip-10-0-98-109 computex1d5f9c03x5b35x401exa71ex3786135e7a66 10.0.98.109 UP NONE 4.4.2.144-k8s AUTO 1 2.94 GB 2:17:23h
8 ip-10-0-110-144 drivex3cdd7cd0xecbbx4240x9f85xb5881b2f276c 10.0.110.144 UP NONE 4.4.2.144-k8s AUTO 1 1.54 GB 2:17:24h
9 ip-10-0-125-187 computexcc4c114exb720x49a7xa964xb050041220d1 10.0.125.187 UP NONE 4.4.2.144-k8s AUTO 1 2.94 GB 2:17:26h
10 ip-10-0-83-118 computexac9ca615x9425x48eexaafex756ee3e8e8aa 10.0.83.118 UP NONE 4.4.2.144-k8s AUTO 1 2.94 GB 2:17:24h
11 ip-10-0-65-133 s3xea3f0926x5063x4a8dx956cx9d8828f31232 10.0.65.133 UP NONE 4.4.2.144-k8s AUTO 1 1.26 GB 2:17:28h
12 ip-10-0-107-84 drivexfbc81c00xa6dex442fxb5b3x3ed4dd433741 10.0.107.84 UP NONE 4.4.2.144-k8s AUTO 1 1.54 GB 2:17:24h
13 ip-10-0-98-109 drivex07ddd04bx85cdx43acxbea0xb0f2c339f335 10.0.98.109 UP NONE 4.4.2.144-k8s AUTO 1 1.54 GB 2:17:23h
14 ip-10-0-103-75 c138a41e8d33client 10.0.103.75 UP NONE 4.4.2.144-k8s 2 2.94 GB 0:01:13h
15 ip-10-0-113-108 c138a41e8d33client 10.0.113.108 UP NONE 4.4.2.144-k8s 2 2.94 GB 0:01:19h
16 ip-10-0-96-250 c138a41e8d33client 10.0.96.250 UP NONE 4.4.2.144-k8s 2 2.94 GB 0:01:20h
17 ip-10-0-66-16 c138a41e8d33client 10.0.66.16 UP NONE 4.4.2.144-k8s 2 2.94 GB 0:01:28h
18 ip-10-0-94-223 c138a41e8d33client 10.0.94.223 UP NONE 4.4.2.144-k8s 2 2.94 GB 0:01:03h
19 ip-10-0-79-235 c138a41e8d33client 10.0.79.235 UP NONE 4.4.2.144-k8s 2 2.94 GB 0:01:23h
root@ip-10-0-98-109:/# weka cluster process
PROCESS ID CONTAINER ID SLOT IN HOST HOSTNAME CONTAINER IPS STATUS RELEASE ROLES NETWORK CPU MEMORY UPTIME LAST FAILURE
0 0 0 ip-10-0-83-118 drivex92d620e9xffc0x4d14x823ex6444a0d2a823 10.0.83.118 UP 4.4.2.144-k8s MANAGEMENT UDP N/A 2:17:15h
1 0 1 ip-10-0-83-118 drivex92d620e9xffc0x4d14x823ex6444a0d2a823 10.0.83.118 UP 4.4.2.144-k8s DRIVES UDP 6 1.54 GB 2:17:09h
20 1 0 ip-10-0-83-118 s3x58e05049x0c4ex44b8x9b99xccff4dc364db 10.0.83.118 UP 4.4.2.144-k8s MANAGEMENT UDP N/A 2:17:14h Host joined a new cluster (2 hours ago)
21 1 1 ip-10-0-83-118 s3x58e05049x0c4ex44b8x9b99xccff4dc364db 10.0.83.118 UP 4.4.2.144-k8s FRONTEND UDP 3 1.26 GB 2:17:09h
40 2 0 ip-10-0-125-187 drivex3cc5580dx303ex4c15xba6cx82ccc04a898d 10.0.125.187 UP 4.4.2.144-k8s MANAGEMENT UDP N/A 2:17:14h Host joined a new cluster (2 hours ago)
41 2 1 ip-10-0-125-187 drivex3cc5580dx303ex4c15xba6cx82ccc04a898d 10.0.125.187 UP 4.4.2.144-k8s DRIVES UDP 3 1.54 GB 2:17:08h
60 3 0 ip-10-0-65-133 drivex4cca8165xfcaax438dx9a75xf372efc7a497 10.0.65.133 UP 4.4.2.144-k8s MANAGEMENT UDP N/A 2:17:14h Host joined a new cluster (2 hours ago)
61 3 1 ip-10-0-65-133 drivex4cca8165xfcaax438dx9a75xf372efc7a497 10.0.65.133 UP 4.4.2.144-k8s DRIVES UDP 6 1.54 GB 2:17:09h
80 4 0 ip-10-0-65-133 computex133b97bcx5bf6x4612x8f2axd09fc42bb573 10.0.65.133 UP 4.4.2.144-k8s MANAGEMENT UDP N/A 2:17:14h Host joined a new cluster (2 hours ago)
81 4 1 ip-10-0-65-133 computex133b97bcx5bf6x4612x8f2axd09fc42bb573 10.0.65.133 UP 4.4.2.144-k8s COMPUTE UDP 2 2.94 GB 2:17:07h
100 5 0 ip-10-0-110-144 computexc0ac9647xf6a5x4d77x909ax07e865469af1 10.0.110.144 UP 4.4.2.144-k8s MANAGEMENT UDP N/A 2:17:13h Host joined a new cluster (2 hours ago)
101 5 1 ip-10-0-110-144 computexc0ac9647xf6a5x4d77x909ax07e865469af1 10.0.110.144 UP 4.4.2.144-k8s COMPUTE UDP 1 2.94 GB 2:17:07h
120 6 0 ip-10-0-107-84 computex1b0db083x8986x45e0x96e3xe739b58808ee 10.0.107.84 UP 4.4.2.144-k8s MANAGEMENT UDP N/A 2:17:13h Host joined a new cluster (2 hours ago)
121 6 1 ip-10-0-107-84 computex1b0db083x8986x45e0x96e3xe739b58808ee 10.0.107.84 UP 4.4.2.144-k8s COMPUTE UDP 1 2.94 GB 2:17:07h
140 7 0 ip-10-0-98-109 computex1d5f9c03x5b35x401exa71ex3786135e7a66 10.0.98.109 UP 4.4.2.144-k8s MANAGEMENT UDP N/A 2:17:13h Host joined a new cluster (2 hours ago)
141 7 1 ip-10-0-98-109 computex1d5f9c03x5b35x401exa71ex3786135e7a66 10.0.98.109 UP 4.4.2.144-k8s COMPUTE UDP 1 2.94 GB 2:17:07h
160 8 0 ip-10-0-110-144 drivex3cdd7cd0xecbbx4240x9f85xb5881b2f276c 10.0.110.144 UP 4.4.2.144-k8s MANAGEMENT UDP N/A 2:17:13h Host joined a new cluster (2 hours ago)
161 8 1 ip-10-0-110-144 drivex3cdd7cd0xecbbx4240x9f85xb5881b2f276c 10.0.110.144 UP 4.4.2.144-k8s DRIVES UDP 3 1.54 GB 2:17:07h
180 9 0 ip-10-0-125-187 computexcc4c114exb720x49a7xa964xb050041220d1 10.0.125.187 UP 4.4.2.144-k8s MANAGEMENT UDP N/A 2:17:14h Host joined a new cluster (2 hours ago)
181 9 1 ip-10-0-125-187 computexcc4c114exb720x49a7xa964xb050041220d1 10.0.125.187 UP 4.4.2.144-k8s COMPUTE UDP 1 2.94 GB 2:17:07h
200 10 0 ip-10-0-83-118 computexac9ca615x9425x48eexaafex756ee3e8e8aa 10.0.83.118 UP 4.4.2.144-k8s MANAGEMENT UDP N/A 2:17:14h Host joined a new cluster (2 hours ago)
201 10 1 ip-10-0-83-118 computexac9ca615x9425x48eexaafex756ee3e8e8aa 10.0.83.118 UP 4.4.2.144-k8s COMPUTE UDP 2 2.94 GB 2:17:08h
220 11 0 ip-10-0-65-133 s3xea3f0926x5063x4a8dx956cx9d8828f31232 10.0.65.133 UP 4.4.2.144-k8s MANAGEMENT UDP N/A 2:17:13h Host joined a new cluster (2 hours ago)
221 11 1 ip-10-0-65-133 s3xea3f0926x5063x4a8dx956cx9d8828f31232 10.0.65.133 UP 4.4.2.144-k8s FRONTEND UDP 3 1.26 GB 2:17:09h
240 12 0 ip-10-0-107-84 drivexfbc81c00xa6dex442fxb5b3x3ed4dd433741 10.0.107.84 UP 4.4.2.144-k8s MANAGEMENT UDP N/A 2:17:13h Host joined a new cluster (2 hours ago)
241 12 1 ip-10-0-107-84 drivexfbc81c00xa6dex442fxb5b3x3ed4dd433741 10.0.107.84 UP 4.4.2.144-k8s DRIVES UDP 3 1.54 GB 2:17:07h
260 13 0 ip-10-0-98-109 drivex07ddd04bx85cdx43acxbea0xb0f2c339f335 10.0.98.109 UP 4.4.2.144-k8s MANAGEMENT UDP N/A 2:17:14h Host joined a new cluster (2 hours ago)
261 13 1 ip-10-0-98-109 drivex07ddd04bx85cdx43acxbea0xb0f2c339f335 10.0.98.109 UP 4.4.2.144-k8s DRIVES UDP 3 1.54 GB 2:17:08h
280 14 0 ip-10-0-103-75 c138a41e8d33client 10.0.103.75 UP 4.4.2.144-k8s MANAGEMENT UDP N/A 0:01:09h Configuration snapshot pulled (1 minute ago)
281 14 1 ip-10-0-103-75 c138a41e8d33client 10.0.103.75 UP 4.4.2.144-k8s FRONTEND UDP 1 1.47 GB 0:01:03h
282 14 2 ip-10-0-103-75 c138a41e8d33client 10.0.103.75 UP 4.4.2.144-k8s FRONTEND UDP 2 1.47 GB 0:01:03h
300 15 0 ip-10-0-113-108 c138a41e8d33client 10.0.113.108 UP 4.4.2.144-k8s MANAGEMENT UDP N/A 0:01:14h Configuration snapshot pulled (1 minute ago)
301 15 1 ip-10-0-113-108 c138a41e8d33client 10.0.113.108 UP 4.4.2.144-k8s FRONTEND UDP 1 1.47 GB 0:01:10h
302 15 2 ip-10-0-113-108 c138a41e8d33client 10.0.113.108 UP 4.4.2.144-k8s FRONTEND UDP 2 1.47 GB 0:01:10h
320 16 0 ip-10-0-96-250 c138a41e8d33client 10.0.96.250 UP 4.4.2.144-k8s MANAGEMENT UDP N/A 0:01:18h Configuration snapshot pulled (1 minute ago)
321 16 1 ip-10-0-96-250 c138a41e8d33client 10.0.96.250 UP 4.4.2.144-k8s FRONTEND UDP 1 1.47 GB 0:01:13h
322 16 2 ip-10-0-96-250 c138a41e8d33client 10.0.96.250 UP 4.4.2.144-k8s FRONTEND UDP 2 1.47 GB 0:01:13h
340 17 0 ip-10-0-66-16 c138a41e8d33client 10.0.66.16 UP 4.4.2.144-k8s MANAGEMENT UDP N/A 0:01:22h Configuration snapshot pulled (1 minute ago)
341 17 1 ip-10-0-66-16 c138a41e8d33client 10.0.66.16 UP 4.4.2.144-k8s FRONTEND UDP 1 1.47 GB 0:01:19h
342 17 2 ip-10-0-66-16 c138a41e8d33client 10.0.66.16 UP 4.4.2.144-k8s FRONTEND UDP 2 1.47 GB 0:01:19h
360 18 0 ip-10-0-94-223 c138a41e8d33client 10.0.94.223 UP 4.4.2.144-k8s MANAGEMENT UDP N/A 57.13s Configuration snapshot pulled (1 minute ago)
361 18 1 ip-10-0-94-223 c138a41e8d33client 10.0.94.223 UP 4.4.2.144-k8s FRONTEND UDP 1 1.47 GB 54.63s
362 18 2 ip-10-0-94-223 c138a41e8d33client 10.0.94.223 UP 4.4.2.144-k8s FRONTEND UDP 2 1.47 GB 54.13s
380 19 0 ip-10-0-79-235 c138a41e8d33client 10.0.79.235 UP 4.4.2.144-k8s MANAGEMENT UDP N/A 0:01:18h Configuration snapshot pulled (1 minute ago)
381 19 1 ip-10-0-79-235 c138a41e8d33client 10.0.79.235 UP 4.4.2.144-k8s FRONTEND UDP 1 1.47 GB 0:01:13h
382 19 2 ip-10-0-79-235 c138a41e8d33client 10.0.79.235 UP 4.4.2.144-k8s FRONTEND UDP 2 1.47 GB 0:01:13h
root@ip-10-0-98-109:/# weka status
WekaIO v4.4.2.144-k8s (CLI build 4.4.2.144-k8s)
cluster: cluster-dev (10d5d634-0aa2-4858-8fef-409254bdf74f)
status: OK (14 backend containers UP, 6 drives UP)
protection: 3+2 (Fully protected)
hot spare: 0 failure domains
drive storage: 22.09 TiB total, 21.86 TiB unprovisioned
cloud: connected
license: Unlicensed
io status: STARTED 2 hours ago (14 io-nodes UP, 78 Buckets UP)
link layer: Ethernet
clients: 6 connected
reads: 0 B/s (0 IO/s)
writes: 0 B/s (0 IO/s)
operations: 0 ops/s
alerts: 35 active alerts, use `weka alerts` to list them
Verification
After completing these steps, verify that:
All client pods are in Running state.
The CORES value shows 2 for client containers.
The clients have successfully rejoined the cluster.
The system status shows no errors using weka status.
Troubleshooting
If clients fail to restart:
Ensure sufficient hugepage memory is available.
Check pod events for specific error messages.
Verify the client configuration in the YAML file is correct.
Increase backend cores
Increase the number of cores allocated to compute and drive containers to improve processing capacity for intensive workloads.
The following procedure demonstrates increasing computeCores and driveCores from 1 to 2.
Procedure
Modify the cluster YAML configuration to update core allocation:
template: dynamic
dynamicTemplate:
  computeContainers: 6
  driveContainers: 6
  computeCores: 2  # Increased from 1
  driveCores: 2    # Increased from 1
  numDrives: 1
  s3Containers: 2
  s3Cores: 1
  envoyCores: 1
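After applying the updated configuration (kubectl apply -f cluster.yaml), you can confirm the new core allocation with the same commands used for client cores; the CORES and CPU columns should reflect the increase:
weka cluster container
weka cluster process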
Core allocation changes may require additional steps for full implementation.
Monitor cluster performance after making changes.
Consider testing in a non-production environment first.
Contact support if core values persist at previous settings after applying changes.
Cluster maintenance
Cluster maintenance ensures optimal performance, security, and reliability through regular updates. Key tasks include updating WekaCluster and WekaClient configurations and rotating pods to apply changes.
Update WekaCluster configuration
This topic explains how to update WekaCluster configuration parameters to enhance cluster performance or resolve issues.
You can update the following WekaCluster parameters:
AdditionalMemory (spec.AdditionalMemory)
Tolerations (spec.Tolerations)
RawTolerations (spec.RawTolerations)
DriversDistService (spec.DriversDistService)
ImagePullSecret (spec.ImagePullSecret)
After completing each of the following procedures, all pods restart within a few minutes to apply the new configuration.
Procedure: Update AdditionalMemory
Open your cluster.yaml file and update the additional memory values:
additionalMemory:
  compute: 100
  s3: 200
  drive: 300
Apply the updated configuration:
kubectl apply -f cluster.yaml
Delete the WekaContainer pods:
kubectl delete pod <wekacontainer-pod-name>
Verify that the memory values have been updated to the new settings.
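One way to spot-check the change is the MEMORY column in the container listing and the pod description; both commands appear elsewhere in this guide:
weka cluster container
kubectl describe pod <wekacontainer-pod-name> -n weka-operator-system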
Procedure: Update Tolerations
Open your cluster.yaml file and update the toleration values:
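Tolerations use the standard Kubernetes schema; the key, value, and effect below are placeholders only, and the block is assumed to sit under spec alongside the other parameters, with the camelCase field naming used elsewhere in the spec:
tolerations:
  - key: "example-key"        # placeholder taint key
    operator: "Equal"
    value: "example-value"    # placeholder value
    effect: "NoSchedule"
Apply the file with kubectl apply -f cluster.yaml as in the previous procedure, then continue with the step below.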
Delete all container pods and verify that all pods restart and reach the Running state within a few minutes.
In the following commands, replace the * with the actual container names.