Convert cluster to multi-container backend

Professional services workflow for converting the cluster architecture from a single-container backend to a multi-container backend.


Since WEKA introduced the multi-container backend (MCB, also referred to as MBC) architecture, existing clusters that use the single-container backend (SCB) architecture must be converted to MCB.

In SCB, the drive, compute, and frontend processes run in the same container. In MCB, a server includes multiple containers, each running a specific process type. MCB offers benefits such as:

  • Non-disruptive upgrades

  • More efficient use of hardware cores

  • Less disruptive maintenance

Conversion to MCB is supported from version 4.0.2 and above.

For more details about MCB, see the WEKA containers architecture overview.

This workflow is intended for experienced professional services personnel who are familiar with WEKA concepts and maintenance procedures.

SCB to MCB conversion workflow

The conversion runs on one server at a time (rolling) and takes about 4.5 minutes per server, so the impact on cluster performance is minimal. Nevertheless, it is recommended (not mandatory) to perform this workflow during a corporate maintenance window.

1. Prepare the source cluster for conversion

  1. Ensure the source cluster meets the following requirements:

    • The source cluster is version 4.0.2 or higher.

    • The cluster must not download a snapshot during the conversion (snapshot upload is allowed).

    • You must have passwordless SSH access to all backend servers. This access can be granted to either the root user or a regular user. If you opt for a non-root user, it must also have passwordless sudo privileges.

  2. Copy the following conversion scripts from the downloaded tools repository to the /tmp directory on the server from which you plan to run them. It is recommended to pull the latest version of the tools repository before starting the migration:

    • MBC cluster conversion (convert_cluster_to_mbc.sh)

    • Change failure domains to manual (change_failure_domains_to_manual.py)

    • Protocols

    • Resources generator (resources_generator.py)

  3. Verify that all backends are up and that no rebuilds are in progress (a pre-check sketch follows this list).

  4. Ensure no WEKA filesystem mounts exist on the backends. If required, run umount -a -t wekafs.

  5. The backend container must not have converged processes (nodes) on the same core. Each process must be either frontend, compute, or drive; these process types cannot share a core. Some clusters, especially in AWS, may share the frontend and compute processes. If you have one of these clusters, you must first use the core_allocation script to change the core allocations.

  6. Ensure the root user is logged into the WEKA cluster on all backends. Otherwise, copy /root/.weka/auth-token.json to all backend servers. The script runs weka commands; without this file, the commands fail with the message "error: Authentication Failed".

  7. The conversion script creates three containers that use ports 14000, 14200, and 14300. Ensure no other processes use ports in the range 14000 to 14399.

  8. It is recommended to increase the /opt/weka/logs loop device to 2 GB. After the MCB conversion, each container has its own set of logs, so the required space triples. Visit the Support Portal and search for the KB article: How-to-increase-the-size-of-opt-weka-logs.

  9. If the cluster has network device names in the old schema, convert these names to real NIC names. To identify the network devices, run weka cluster host net -b. If the result shows network device names such as host0net0, it is the old schema.
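A minimal pre-check sketch covering items 3, 4, and 6 through 9 above; the port filter and size check are illustrative assumptions, not official validation commands:

# Cluster-wide: confirm all backends are UP and no rebuild is in progress.
weka status
weka cluster host -b

# On each backend: no WEKA filesystem mounts should remain.
mount -t wekafs
# If any are listed: umount -a -t wekafs

# On each backend: the auth token must exist so the script can run weka commands.
ls -l /root/.weka/auth-token.json

# On each backend: no other process should listen on ports 14000-14399.
ss -tlnp | awk '$4 ~ /:14[0-3][0-9][0-9]$/'

# On each backend: check the size of the /opt/weka/logs loop device (2 GB recommended).
df -h /opt/weka/logs

# Cluster-wide: look for old-schema network device names such as host0net0.
weka cluster host net -b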

2. Remove the protocol cluster configurations (if they exist)

If protocol cluster configurations are set, remove them if possible. Otherwise, once you convert some containers (later in this workflow), you can move the protocol containers to the converted containers.

Using the protocols script (from the tools repository), perform the following:

  1. Back up the configuration of the protocol clusters.

  2. Destroy the configuration of the protocol clusters.

During the conversion process, the HostIDs are changed. After the conversion, manually change the HostIDs in the configuration backup file.

3. Ensure failure domains are named

Only clusters with named failure domains can be converted. The conversion script does not support automatic or invalid failure domains.

  1. Check the failure domain names. Run the following command line:

weka cluster host -b -v --output hostname,fd,fdName,fdType,fdId
Example: Named failure domains
$ weka cluster host -b -v --output hostname,fd,fdName,fdType,fdId
HOSTNAME                       FAILURE DOMAIN  FAILURE DOMAIN NAME  FAILURE DOMAIN TYPE  FAILURE DOMAIN ID
ip-172-31-39-174.ec2.internal  FD_16           FD_16                USER                 16
ip-172-31-36-151.ec2.internal  FD_25           FD_25                USER                 25
ip-172-31-33-66.ec2.internal   FD_63           FD_63                USER                 63
ip-172-31-39-101.ec2.internal  FD_0            FD_0                 USER                 0
ip-172-31-34-33.ec2.internal   FD_23           FD_23                USER                 23
ip-172-31-46-217.ec2.internal  FD_20           FD_20                USER                 20
ip-172-31-45-42.ec2.internal   FD_31           FD_31                USER                 31
ip-172-31-36-77.ec2.internal   FD_30           FD_30                USER                 30
Example: Automatic failure domains
# weka cluster host -b -v --output hostname,fd,fdName,fdType,fdId
HOSTNAME  FAILURE DOMAIN  FAILURE DOMAIN NAME  FAILURE DOMAIN TYPE  FAILURE DOMAIN ID
cst1      AUTO                                 AUTO                 0
cst2      AUTO                                 AUTO                 4
cst5      AUTO                                 AUTO                 2
cst6      AUTO                                 AUTO                 3
cst7      AUTO                                 AUTO                 6
Example: Invalid failure domains

Invalid failure domains may appear in clusters started on older WEKA versions.

HOST ID  HOSTNAME         CONTAINER NAME  STATUS  VERSION  MODE     FAILURE DOMAIN  FAILURE DOMAIN NAME  FAILURE DOMAIN TYPE  FAILURE DOMAIN ID
0        drp-srcf-ffb001  default         UP      3.13.6   backend                                       INVALID              10
1        drp-srcf-ffb002  default         UP      3.13.6   backend                                       INVALID              2
2        drp-srcf-ffb003  default         UP      3.13.6   backend                                       INVALID              0
3        drp-srcf-ffb004  default         UP      3.13.6   backend                                       INVALID              3
4        drp-srcf-ffb005  default         UP      3.13.6   backend                                       INVALID              13
5        drp-srcf-ffb006  default         UP      3.13.6   backend                                       INVALID              11
6        drp-srcf-ffb007  default         UP      3.13.6   backend                                       INVALID              8

If the cluster has automatic or invalid failure domains, do the following:

  • Ensure there are no filesystem mounts on the backends.

  • Run the change_failure_domains_to_manual.py script from the /tmp directory.

This script converts each backend to a named failure domain and restarts it (rolling conversion). This operation causes a short rebuild.

Example: Change failure domains to manual (named failure domains)
[ec2-user@ip-172-31-34-33 postinstall]$ ./change_failure_domains_to_manual.py
2023-01-19 15:09:31 LOG: Queried ip-172-31-39-174.ec2.internal: currently running with failure domain type AUTO (id: 16, name=)
No rebuild is currently in progress

Data in each protection level:

2 Protections [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] 32.33 TiB / 32.33 TiB
1 Protections [                                        ] 0 B / 32.33 TiB
0 Protections [                                        ] 0 B / 32.33 TiB
2023-01-19 15:09:31 LOG: Cluster is fully protected (status OK)
2023-01-19 15:09:31 LOG: Change ip-172-31-39-174.ec2.internal:default failure domain to manual? [y]es / [s]kip / all>
y
2023-01-19 15:09:42 LOG: Failure domain ID of ip-172-31-39-174.ec2.internal:default is currently 16 (type=AUTO)
2023-01-19 15:09:42 LOG: New failure domain selected for ip-172-31-39-174.ec2.internal:default is FD_16 (based on ID 16)
2023-01-19 15:09:42 LOG: Changing of failure-domain on ip-172-31-39-174.ec2.internal to FD_16
2023-01-19 15:09:42 LOG: Running 'weka local resources failure-domain --name FD_16' on ip-172-31-39-174.ec2.internal (172.31.39.174) via ssh
Set failure_domain of default to FD_16
2023-01-19 15:09:43 LOG: Applying resources ip-172-31-39-174.ec2.internal
2023-01-19 15:09:43 LOG: Running 'weka local resources apply -f' on ip-172-31-39-174.ec2.internal (172.31.39.174) via ssh
default: Allocated network device "eth1" (with identifier "0000:00:06.0") to slots [1] on "ip-172-31-39-174.ec2.internal":"default" (1/7)
default: Allocated network device "eth2" (with identifier "0000:00:07.0") to slots [2] on "ip-172-31-39-174.ec2.internal":"default" (2/7)
default: Allocated network device "eth3" (with identifier "0000:00:08.0") to slots [3] on "ip-172-31-39-174.ec2.internal":"default" (3/7)
default: Allocated network device "eth4" (with identifier "0000:00:09.0") to slots [4] on "ip-172-31-39-174.ec2.internal":"default" (4/7)
default: Allocated network device "eth5" (with identifier "0000:00:0a.0") to slots [5] on "ip-172-31-39-174.ec2.internal":"default" (5/7)
default: Allocated network device "eth6" (with identifier "0000:00:0b.0") to slots [6] on "ip-172-31-39-174.ec2.internal":"default" (6/7)
default: Allocated network device "eth7" (with identifier "0000:00:0c.0") to slots [7] on "ip-172-31-39-174.ec2.internal":"default" (7/7)
default: Allocated core 6 to slot 6 on "ip-172-31-39-174.ec2.internal":"default" (1/7)
default: Allocated core 7 to slot 7 on "ip-172-31-39-174.ec2.internal":"default" (2/7)
default: Allocated core 1 to slot 1 on "ip-172-31-39-174.ec2.internal":"default" (3/7)
default: Allocated core 3 to slot 3 on "ip-172-31-39-174.ec2.internal":"default" (4/7)
default: Allocated core 2 to slot 2 on "ip-172-31-39-174.ec2.internal":"default" (5/7)
default: Allocated core 5 to slot 5 on "ip-172-31-39-174.ec2.internal":"default" (6/7)
default: Allocated core 4 to slot 4 on "ip-172-31-39-174.ec2.internal":"default" (7/7)
default: Starting hugepages allocation for "ip-172-31-39-174.ec2.internal":"default"
default: Allocated 139456MB hugepages memory from 1 NUMA nodes for "ip-172-31-39-174.ec2.internal":"default"
default: Bandwidth of "ip-172-31-39-174.ec2.internal":"default" set to unlimited
Container "default" is ready (pid = 8497)
Container "default" is RUNNING (pid = 8497)
2023-01-19 15:09:57 LOG: Waiting for container to become ready on ip-172-31-39-174.ec2.internal
2023-01-19 15:09:57 LOG: Running 'weka local status -J' on ip-172-31-39-174.ec2.internal (172.31.39.174) via ssh, capturing output
2023-01-19 15:10:01 LOG: Container on ip-172-31-39-174.ec2.internal is now ready
2023-01-19 15:10:01 LOG: Getting host-id from the current container on ip-172-31-39-174.ec2.internal
2023-01-19 15:10:01 LOG: Running 'weka debug manhole -s 0 getServerInfo' on ip-172-31-39-174.ec2.internal (172.31.39.174) via ssh, capturing output
2023-01-19 15:10:02 LOG: Waiting for host with ip-172-31-39-174.ec2.internal to become UP in 'weka cluster host'
2023-01-19 15:10:02 LOG: Running 'weka cluster host -J -F id=0' on ip-172-31-39-174.ec2.internal (172.31.39.174) via ssh, capturing output
2023-01-19 15:10:02 LOG: Validate the failure domain in the stable resources failure domain is FD_16, which means the container loaded properly with the right resources
2023-01-19 15:10:02 LOG: Running 'weka local resources --stable -J' on ip-172-31-39-174.ec2.internal (172.31.39.174) via ssh, capturing output
2023-01-19 15:10:03 LOG: Timed out waiting for the cluster to become unhealthy - Assuming it's healthy
Rebuild about to start...

Data in each protection level:

2 Protections [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] 32.85 TiB / 32.87 TiB
1 Protections [■                                       ] 18 GiB / 32.87 TiB
0 Protections [                                        ] 0 B / 32.87 TiB
2023-01-19 15:10:08 LOG: Rebuilding at rate of 128MiB/sec (scrubber rate)
2023-01-19 15:10:08 LOG: Still has failures (status REBUILDING)

4. Convert the cluster to MCB

Before running the conversion script (./convert_cluster_to_mbc.sh -<flag>), adhere to the following:

  • The script includes flags that change the allocated cores for each process type. They are useful if you want to increase the number of cores allocated to the compute, frontend (for protocols), or drive processes. Leave at least two cores for the OS and protocols. If you run the script without any options, it preserves the existing core settings.

  • Do not use the -s flag without development approval.

  • If the cluster configuration uses IB PKEY, running the conversion script with the -p flag is mandatory.

  • It is recommended to convert a single backend first. If the conversion is successful, convert the rest of the cluster. Do not continue converting the cluster until you know the conversion works correctly on a single backend. Use the -b flag to convert a single host (see the post-check sketch after this list). Ensure all the WEKA buckets are available after the conversion and that the default container is removed.

  • If a previous conversion attempt failed, remove the file /root/resources.json.backup from all backends, or use the -f flag in the next conversion attempt.
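A hedged post-check sketch after converting the first backend with the -b flag; weka cluster container appears later in this guide, and weka status is a standard health check:

# Verify overall cluster health after the single-backend conversion.
weka status

# Confirm the old default container no longer appears in the cluster.
weka cluster container | grep -w default || echo "no default container found"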

SCB to MCB conversion flags

# ./convert_cluster_to_mbc.sh -v -h
  -v force resources generator to use nics as VFs.
  -f force override of backup resources file if exist.
  -a run with active alerts.
  -s skip failed hosts (this is a dangerous flag, use with caution!).
  -d override drain grace period for s3 in seconds.
  -b to perform conversion on a single host.
  -p use only the nic identifier and don't resolve the PCI address.
  -l log file will be saved to this location instead of the current dir
  -D assign drive dedicated cores (use only for on-prem deployment, this flag overrides pinned cores).
  -F assign frontend dedicated cores (use only for on-prem deployment, this flag overrides pinned cores).
  -C assign compute dedicated cores (use only for on-prem deployment, this flag overrides pinned cores).
  -m override max memory assignment after conversion. Specify the value in GiB.
  -i path to the ssh identity file.
  -h show this help.
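For illustration only, the following invocation combines several of the documented flags; the values are placeholders, not recommendations:

# Placeholder values: run with active alerts, use a 300-second S3 drain grace
# period, cap the compute container memory at 150 GiB, and save the log to /tmp.
./convert_cluster_to_mbc.sh -a -d 300 -m 150 -l /tmp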
  

5. Restore the protocol cluster configurations (if required)

If you destroyed the protocol cluster configurations before the conversion, restore them as follows:

  1. Open the backup file of the protocol cluster configurations created before the conversion. Search for the HostId lines and replace them to match the frontend0 container HostIDs. To retrieve the new HostIDs, run the following command line: weka cluster container -b | grep frontend0 (see the sketch after this list).

  2. Run the protocols script from the /tmp directory.
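A brief sketch of locating the values to update; the backup file name below is hypothetical and depends on how you saved the protocol configuration backup earlier in this workflow:

# New HostIDs of the frontend0 containers after the conversion.
weka cluster container -b | grep frontend0

# Hypothetical backup file name: locate the HostId entries that must be updated.
grep -n HostId /tmp/protocols_config_backup.json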

Troubleshooting

MCB script fails

  • No pause between host deactivation and removal: the script deactivated the host and then attempted to remove the old host/container without waiting for it to become INACTIVE, so the removal failed.

  • The conversion failed on SMB hosts.

Corrective action

  1. Clean up the host state using the following commands:

# Clean up the cluster (change XXX to host-id of OLD host, which must be DOWN)
weka cluster host deactivate XXX
weka cluster host remove XXX --no-unimprint

# Clean up the host (remove the old container)
weka local ps
sudo weka local rm default
sudo weka local enable
  2. Continue with the conversion process.

Clients in a bad state cause new backends to get stuck in SYNCING state

On a large cluster with about 2000 clients or more, the rebuild between each MCB conversion hangs because the baseline configuration fails to sync to all the clients at the end of the rebuild.

Corrective action

  1. Deactivate and remove the clients that were down or degraded (see the command sketch after this list).

  2. Wait for the sync to finish.

  3. If the issue persists, look at the WEKA cluster events and search for NoConnectivityToLivingNode events where the event is a peer.

  4. Translate the node ID to a HostID and add the host to the denylist.
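A hedged sketch of this cleanup, assuming the host IDs of the stale clients are already known; the denylist step is cluster-specific and is not shown:

# List cluster hosts and spot clients that are DOWN or degraded.
weka cluster host | grep -i client

# Deactivate and remove a stale client (replace <host-id> with the real ID).
weka cluster host deactivate <host-id>
weka cluster host remove <host-id>

# Search recent cluster events for the connectivity failures mentioned above.
weka events | grep NoConnectivityToLivingNode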

Drives are not activated correctly during phase-out

When two or four drives phase out, the conversion script enters a loop with the following error message:

2023-09-04 14:55:38 mbc divider - ERROR: Error querying container status and invoking scan: The following drives did not move to the new container, ['4a7d4250-179e-42b4-ab3f-90cfcb968f64'] retrying

This symptom occurs because the script assumes the drives belong to the default container, which has already been deleted from the host.

Corrective action

  1. Deactivate the phased-out drives and immediately activate them (see the command sketch after this list). The drives automatically move to the correct container, and the default container is automatically removed from the cluster.

  2. If the default container is not removed as expected, manually remove it from the cluster configuration. Run the weka cluster container remove command with the --no-unimprint flag.

  3. Check the host that reported the error.

  4. If the containers do not start at boot, run the command: weka local enable.
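A hedged command sketch for this corrective action; the drive UUID is taken from the error message above, and the container ID placeholder must be replaced with the ID of the old default container:

# Deactivate the phased-out drive, then activate it again so it moves to the
# correct container (UUID taken from the error message above).
weka cluster drive deactivate 4a7d4250-179e-42b4-ab3f-90cfcb968f64
weka cluster drive activate 4a7d4250-179e-42b4-ab3f-90cfcb968f64

# If the default container is not removed automatically, remove it manually.
weka cluster container remove <container-id> --no-unimprint

# If containers do not start at boot on the affected host, re-enable them.
weka local enable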

Too much memory on the Compute container

The compute container takes too long to start due to RAM allocation.

Corrective action

Do one of the following:

  • Before starting the conversion again, reduce the memory size of each container. When the conversion completes, change the memory size back.

  • Change the memory during the MCB conversion using the -m flag; for example, -m 150 sets the RAM to 150 GiB. When the conversion completes, change the memory size back.

To change the RAM to the previous size, run the command weka local resources for the compute container. The drive and frontend containers do not need more memory.

When this error occurs, the drive container is still active, with all the drives, and the default container is stopped.

Compute0 conversion failed on VMware hosts

On a WEKA cluster deployed on vSphere, where each container is a VM, the conversion script succeeds with drives0 but fails on compute0 with the following error:

error: Container "compute0" has run into an error: Cannot resolve PCI 0000:0c:00.0 to a valid network device

Corrective action

Run the conversion script with the -v flag: ./convert_cluster_to_mbc.sh -a -v

Missing NICs were passed issue

The conversion failed due to network interface names that contained uppercase letters. For example, bond4-100G.

The resources_generator.py script contains the following line that converts the incoming values to lowercase:

402 parser.add_argument("--net", nargs="+", type=str.lower, metavar="net-devices"

Corrective action

Do one of the following:

  • Change the network interface names to lowercase. For example, from bond4-100G to bond4-100g. Add the adapter resource to the host and re-run the conversion script.

  • Edit line 402 of the resources_generator.py script, replace type=str.lower with type=str, and re-run the conversion script (a sed sketch follows this list).
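A minimal sketch of the second option, assuming the script was copied to /tmp as described in the preparation steps:

# Remove the lowercase coercion from line 402 of the resources generator, then
# re-run the conversion script with the same flags as before.
sed -i 's/type=str\.lower/type=str/' /tmp/resources_generator.py

For reference, the original failure output looks like the following: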

2023-05-02 10:57:13 mbc divider - INFO: Running resources-generator with cmd: /bin/sh -c "/tmp/resources_generator.py --net  bond4-100G/10.18.4.1/24/255.255.255.255  --compute-dedicated-cores 11 --drive-dedicated-cores 6 --frontend-dedicated-cores 2 -f"
ERROR:resources generator:Detected net devices: {'enp1s0f1': {'mac': '1a:75:4d:14:bf:2c', 'master': 'bond4-100G'}, 'usb0': {'mac': '42:06:bf:33:9e:b5'}, 'bond4-100G': {'mac': '1a:75:4d:14:bf:2c'}, 'enp129s0f1': {'mac': '5a:e4:39:5f:be:b3', 'master': 'bond4-1G'}, 'enp1s0f0': {'mac': '1a:75:4d:14:bf:2c', 'master': 'bond4-100G'}, 'bond4-1G': {'mac': '5a:e4:39:5f:be:b3'}, 'enp129s0f0': {'mac': '5a:e4:39:5f:be:b3', 'master': 'bond4-1G'}}
ERROR:resources generator:Missing NICs were passed: {'bond4-100g'}
2023-05-02 10:57:14 mbc divider - WARNING: Something went wrong running: /bin/sh -c "/tmp/resources_generator.py --net  bond4-100G/10.18.4.1/24/255.255.255.255  --compute-dedicated-cores 11 --drive-dedicated-cores 6 --frontend-dedicated-cores 2 -f"
2023-05-02 10:57:14 mbc divider - WARNING: Return Code: 1
2023-05-02 10:57:14 mbc divider - WARNING: Output: b''
2023-05-02 10:57:14 mbc divider - WARNING: Stderr: None
Traceback (most recent call last):
  File "/tmp/mbc_divider_script.py", line 787, in <module>
    main()
  File "/tmp/mbc_divider_script.py", line 625, in main
    run_shell_command(resource_generator_command)
  File "/tmp/mbc_divider_script.py", line 44, in run_shell_command
    raise Exception("Error running command (exit code {}): {}".format(process.returncode, command))
Exception: Error running command (exit code 1): /bin/sh -c "/tmp/resources_generator.py --net  bond4-100G/10.18.4.1/24/255.255.255.255  --compute-dedicated-cores 11 --drive-dedicated-cores 6 --frontend-dedicated-cores 2 -f"

To reverse the conversion on a host, create the other containers manually, or remove the drives container and restart the default container. If the drives are phased out or unavailable, deactivate the phased-out drives and immediately activate them; alternatively, remove the drives and re-add them. See Drives are not activated correctly during phase-out.

Attachment: successfull_conversion_example.txt (245 KB), a successful SCB to MCB conversion example.