Convert cluster to multi-container backend
Professional services workflow for converting the cluster architecture from a single-container backend to a multi-container backend.
Since WEKA introduced the multi-container backend (MCB, also written MBC) architecture, existing clusters that use the single-container backend (SCB) architecture must be converted to MCB.
In SCB, the drive, compute, and frontend processes are in the same container. In MCB, a server includes multiple containers, each running a specific process type. The MCB offers benefits such as:
Non-disruptive upgrades
Effective use of hardware cores
Less disruptive maintenance
Conversion to MCB is supported from version 4.0.2 and above.
For more details about MCB, see the WEKA containers architecture overview.
This workflow is intended for experienced professional services personnel who are familiar with WEKA concepts and maintenance procedures.
SCB to MCB conversion workflow
The conversion runs on one server at a time (rolling) and takes about 4.5 minutes per server, so the impact on cluster performance is minimal. Even so, it is recommended (not mandatory) to perform this workflow during the corporate maintenance window.
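To size the maintenance window, the per-server figure above can be turned into a rough total. The following is a minimal sketch, not part of the WEKA tooling: the function name and the optional extra_minutes_per_server allowance (for example, for the rebuild between servers) are illustrative assumptions.

```python
# Rough wall-clock estimate for a rolling SCB-to-MCB conversion.
# The 4.5-minute figure comes from this workflow; everything else is illustrative.
MINUTES_PER_SERVER = 4.5


def estimate_conversion_minutes(num_servers, extra_minutes_per_server=0.0):
    """Estimate total minutes when servers are converted one at a time."""
    if num_servers < 0:
        raise ValueError("num_servers must be non-negative")
    return num_servers * (MINUTES_PER_SERVER + extra_minutes_per_server)


if __name__ == "__main__":
    # 20 backends at 4.5 minutes each
    print(estimate_conversion_minutes(20))  # 90.0
```

Treat the result as a lower bound; the actual duration depends on cluster size and load.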
1. Prepare the source cluster for conversion
Ensure the source cluster meets the following requirements:
The source cluster is version 4.0.2 or higher.
The cluster must not download a snapshot during the conversion (snapshot upload is allowed).
You must have passwordless SSH access to all backend servers. This access can be granted to either the root user or a regular user. If you opt for a non-root user, it must also have passwordless sudo privileges.
Download the latest Tools Repository. It is recommended to pull the latest version before starting the migration.
Copy the following scripts from the downloaded tools repository to the /tmp directory on the server from which you plan to run them:
All the following conversion scripts:
Verify that all backends are up and with no rebuilds in progress.
Ensure no WEKA filesystem mounts exist on the backends. If required, run umount -a -t wekafs.
The backend container must not have converged processes (nodes) on the same core. Each process must be either frontend, compute, or drive; these cannot share a core. Some clusters, especially in AWS, share the frontend and compute processes. For such a cluster, first use the core_allocation script to change the core allocations.
Ensure the root user is logged into the WEKA cluster on all backends. Otherwise, copy /root/.weka/auth-token.json to all backend servers. The script runs weka commands. Without this file, the commands do not complete, and the message is “error: Authentication Failed”.
The conversion script starts running on three containers on ports 14000, 14200, and 14300. Ensure no other processes use ports in the range 14000 to 14299.
Changing the /opt/weka/logs loop device to 2 GB is recommended. After the MCB conversion, each container has its own set of logs, so the space required is tripled. Visit the Support Portal and search for the KB: How-to-increase-the-size-of-opt-weka-logs.
If the cluster has network device names in the old schema, convert these names to real NIC names. To identify the network devices, run weka cluster host net -b. If the result shows network device names such as host0net0, it is the old schema.
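Two of the checks above lend themselves to a small script: the minimum version and the old-schema device names. The following is a minimal sketch under stated assumptions. It assumes the version string starts with a numeric x.y.z prefix, and that every old-schema name follows the host0net0 pattern shown above. The function names are illustrative and are not part of the tools repository.

```python
import re

MIN_VERSION = (4, 0, 2)  # minimum version stated in this workflow

# Assumption: old-schema names all look like the host0net0 example.
OLD_SCHEMA = re.compile(r"^host\d+net\d+$")


def version_supported(version):
    """Return True if the leading x.y.z of the version is at least 4.0.2."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups()) >= MIN_VERSION


def old_schema_names(names):
    """Return the device names that appear to use the old schema."""
    return [name for name in names if OLD_SCHEMA.match(name)]


if __name__ == "__main__":
    print(version_supported("4.0.2"))                # True
    print(old_schema_names(["host0net0", "ens5"]))   # ['host0net0']
```

The device names would be taken manually from the output of weka cluster host net -b; the sketch does not parse that output because its exact layout is not described here.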
2. Remove the protocol cluster configurations (if they exist)
If protocol cluster configurations are set, remove them if possible. Otherwise, once you convert some containers (later in this workflow), you can move the protocol containers to the converted containers.
Using the protocols script (from the tools repository), perform the following:
Back up the configuration of the protocol clusters.
Destroy the configuration of the protocol clusters.
During the conversion process, the HostIDs are changed. After the conversion, manually change the HostIDs in the configuration backup file.
3. Ensure failure domains are named
Only clusters with named failure domains can be converted. The conversion script does not support automatic or invalid failure domains.
Check the failure domain names. Run the following command line:
If the cluster has automatic or invalid failure domains, do the following:
Ensure there are no filesystem mounts on the backends.
Run the change_failure_domains_to_manual.py script from the /tmp directory.
This script converts each backend to a named failure domain and restarts it (rolling conversion). This operation causes a short rebuild.
4. Convert the cluster to MCB
Before running the conversion script (./convert_cluster_to_mbc.sh -<flag>), adhere to the following:
The script includes flags that change the allocated cores for each process type. It is helpful if you want to increase the number of cores allocated for the compute, frontend (for protocols), and drive processes. Leave at least two cores for the OS and protocols. If you run the script without any options, it preserves the existing core settings.
Do not use the -s flag without development approval.
If the cluster configuration uses IB PKEY, running the conversion script with the -p flag is mandatory.
It is recommended to convert a single backend first, using the -b flag. Do not continue converting the rest of the cluster until you know the conversion works for that single backend. After the conversion, ensure all the WEKA buckets are available and the default container is removed.
If a previous conversion fails, remove the file /root/resources.json.backup from all backends. Otherwise, use the -f flag in the following conversion attempt.
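Before choosing new core counts, it can help to confirm that the plan leaves at least two cores for the OS and protocols, as required above. The following is a minimal sketch. The function and parameter names are illustrative assumptions, and it does not name any conversion-script flags.

```python
def check_core_plan(total_cores, drive, compute, frontend, reserved=2):
    """Return the spare core count, or raise if fewer than `reserved` remain.

    `reserved` defaults to the two cores this workflow asks to leave
    for the OS and protocols.
    """
    for name, value in (("drive", drive), ("compute", compute), ("frontend", frontend)):
        if value < 0:
            raise ValueError(f"{name} core count must be non-negative")
    spare = total_cores - (drive + compute + frontend)
    if spare < reserved:
        raise ValueError(
            f"plan leaves {spare} spare cores; at least {reserved} are required"
        )
    return spare


if __name__ == "__main__":
    # 24-core server: 6 drive, 12 compute, 2 frontend
    print(check_core_plan(24, 6, 12, 2))  # 4
```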
SCB to MCB conversion flags
5. Restore the protocol cluster configurations (if required)
If you have destroyed the protocol clusters configuration before the conversion, restore them as follows:
Open the backup file of the protocol cluster configurations created before the conversion. Search for the HostId lines and replace them to match the frontend0 container HostIDs. To retrieve the new HostIDs, run the following command line: weka cluster container -b | grep frontend0.
Run the protocols script from the /tmp directory.
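The HostId replacement can be scripted once the mapping from old to new IDs is known. The following is a minimal sketch under a loud assumption: the layout of the backup file is not described here, so the code only assumes that each relevant line contains the token HostId followed by an integer on the same line. The function name and the example mapping are hypothetical. Build the mapping manually from the frontend0 HostIDs and review the result before using it.

```python
import re

# Assumption: the ID is the first integer after "HostId" on the same line.
HOST_ID = re.compile(r"(HostId[^\d\n]*)(\d+)")


def remap_host_ids(backup_text, id_map):
    """Replace old HostIDs with new ones in a single pass.

    A single pass avoids chained replacements (for example 0 -> 3, then 3 -> 6).
    IDs missing from id_map are left unchanged.
    """
    def swap(match):
        old_id = int(match.group(2))
        return match.group(1) + str(id_map.get(old_id, old_id))

    return HOST_ID.sub(swap, backup_text)


if __name__ == "__main__":
    sample = "HostId: 0\nHostId: 3\n"
    print(remap_host_ids(sample, {0: 12, 3: 15}))
```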
Troubleshooting
MCB script fails
No pause between host deactivation and removal:
During the conversion, the script failed to remove the old host/container because it deactivated the host and did not wait for the host/container to become INACTIVE before removing it.
The conversion of SMB hosts failed.
Corrective action
Clean up the host state using the following commands:
Continue with the conversion process.
Clients in a bad state cause new backends to get stuck in SYNCING state
On a large cluster with about 2000 clients or more, the rebuild between each MCB conversion hangs because the baseline configuration fails to sync to all the clients at the end of the rebuild.
Corrective action
Deactivate and remove the clients that were down or degraded.
Wait for the sync to finish.
If the issue persists, look at the WEKA cluster events and search for NoConnectivityToLivingNode events where the event is a peer.
Translate the node ID to a HostID and add the host to the denylist.
Drives are not activated correctly during phase-out
When two or four drives phase out, the conversion script enters a loop with the error message:
“2023-09-04 14:55:38 mbc divider - ERROR: Error querying container status and invoking scan: The following drives did not move to the new container, ['4a7d4250-179e-42b4-ab3f-90cfcb968f64'] retrying”
This symptom occurs because the script assumes the drives still belong to the default container, which is already deleted from the host.
Corrective action
Deactivate the phased-out drives and immediately activate them. The drives automatically move to the correct container. The default container is automatically removed from the cluster.
If the default container is not removed as expected, manually remove it from the cluster configuration. Run the command weka cluster container remove with the --no-unimprint flag.
Check the host with the error.
If the containers do not start at boot, run the command weka local enable.
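The drive UUIDs needed for the deactivate and activate step can be read from the looping error message. The following is a minimal sketch that extracts them and prints candidate commands for review. The helper names are illustrative. The weka cluster drive deactivate and weka cluster drive activate syntax used in the generated strings is an assumption to confirm against the CLI help before running anything.

```python
import re

UUID = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def stuck_drive_uuids(log_line):
    """Return drive UUIDs from a 'did not move to the new container' error."""
    if "did not move to the new container" not in log_line:
        return []
    return UUID.findall(log_line)


def reactivation_commands(uuids):
    """Build deactivate/activate command strings (assumed CLI syntax)."""
    commands = []
    for uuid in uuids:
        commands.append(f"weka cluster drive deactivate {uuid}")
        commands.append(f"weka cluster drive activate {uuid}")
    return commands


if __name__ == "__main__":
    line = (
        "2023-09-04 14:55:38 mbc divider - ERROR: Error querying container "
        "status and invoking scan: The following drives did not move to the "
        "new container, ['4a7d4250-179e-42b4-ab3f-90cfcb968f64'] retrying"
    )
    for command in reactivation_commands(stuck_drive_uuids(line)):
        print(command)
```

The sketch only prints the commands, so an operator can check them before running anything against the cluster.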
Too much memory on the Compute container
The compute container takes too long to start due to RAM allocation.
Corrective action
Do one of the following:
Before starting the conversion again, reduce the memory size of each container. When the conversion completes, change the memory size back.
Change the memory at the MCB conversion using the flag -m 150. This flag sets the RAM to 150 GiB. When the conversion completes, change the memory size back.
To change the RAM to the previous size, run the command weka local resources for the compute container. The drive and frontend containers do not need more memory.
When this error occurs, the drive container is still active, with all the drives, and the default container is stopped.
To reverse, create the other containers manually or remove the drives container and restart the default container. If the drives are phased out or unavailable, deactivate the phased-out drives and immediately activate them. Alternatively, remove the drives and re-add them. See Drives are not activated correctly during phase-out.
Compute0 conversion failed on VMware hosts
On a WEKA cluster deployed on vSphere, where each container is a VM, running the conversion script succeeds with drives0 but fails on compute0 with the following error:
Corrective action
Run the conversion script with the -v flag: ./convert_cluster_to_mbc.sh -a -v
Missing NICs were passed issue
The conversion failed due to network interface names that contained uppercase letters. For example, bond4-100G.
The resources_generator.py script contains the following line that converts the incoming values to lowercase:
402 parser.add_argument("--net", nargs="+", type=str.lower, metavar="net-devices"
Corrective action
Do one of the following:
Change the network interface names to lowercase. For example, from bond4-100G to bond4-100g.
Add the adapter resource to the host and re-run the conversion script.
Edit the resources_generator.py script line 402, replace type=str.lower with type=str, and re-run the conversion script.
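The effect of type=str.lower can be reproduced with a standalone parser. The following is a minimal sketch that reuses only the --net argument definition quoted above. The parse_net helper and its lowercase switch are illustrative and do not appear in resources_generator.py.

```python
import argparse


def parse_net(argv, lowercase):
    """Parse --net values with or without the lowercase conversion."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--net",
        nargs="+",
        type=str.lower if lowercase else str,  # the one-token difference
        metavar="net-devices",
    )
    return parser.parse_args(argv).net


if __name__ == "__main__":
    args = ["--net", "bond4-100G"]
    print(parse_net(args, lowercase=True))   # ['bond4-100g']
    print(parse_net(args, lowercase=False))  # ['bond4-100G']
```

With the lowercase conversion, the parsed name no longer matches an interface that is actually named bond4-100G, which is consistent with the missing NICs error described above.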