
Redundancy optimization in WEKA



Redundancy levels

WEKA’s distributed RAID supports a range of redundancy configurations, from 3+2 to 16+4. It uses a D+P model, where D is the number of data drives and P is the number of parity drives per stripe. Supported protection levels are expressed as N+2, N+3, and N+4. The number of data drives must exceed the number of parity drives, so configurations such as 3+3 are not allowed.

Choosing the appropriate redundancy level balances fault tolerance, usable capacity, and performance:

  • N+2: Recommended for most environments; provides standard fault tolerance.

  • N+3: Offers increased protection; suitable for higher availability requirements.

  • N+4: Designed for large-scale clusters (100+ backends) or critical data scenarios requiring maximum redundancy.
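The constraints above can be checked with a small arithmetic sketch. This is illustrative only, not a WEKA API; the function names are hypothetical. It validates a D+P scheme and reports the fraction of raw stripe capacity that holds data rather than parity.

```python
# Illustrative sketch (not a WEKA API): validate a D+P redundancy scheme
# against the documented constraints and report its usable-capacity fraction.
# Function names are hypothetical.

def validate_scheme(data: int, parity: int) -> None:
    """Check a D+P scheme against the constraints described above."""
    if parity not in (2, 3, 4):
        raise ValueError("parity must be 2, 3, or 4 (N+2, N+3, or N+4)")
    if not 3 <= data <= 16:
        raise ValueError("data drives per stripe must be between 3 and 16")
    if data <= parity:
        raise ValueError("data drives must outnumber parity drives (3+3 is invalid)")

def usable_fraction(data: int, parity: int) -> float:
    """Fraction of stripe capacity that stores data rather than parity."""
    validate_scheme(data, parity)
    return data / (data + parity)

for d, p in [(3, 2), (8, 2), (16, 4)]:
    print(f"{d}+{p}: {usable_fraction(d, p):.0%} usable")
```

Running the loop shows why wider stripes are attractive: 3+2 yields 60% usable capacity, while 8+2 and 16+4 both yield 80%.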

Stripe width and capacity optimization

Stripe width (the number of drives participating in each distributed RAID stripe) is configurable between 3 and 16 and affects both capacity efficiency and rebuild performance. Wider stripes increase net usable capacity by reducing parity overhead but may degrade rebuild speed, as more drives must be read from concurrently during recovery operations.

For deployments with stringent performance or data protection requirements, it is recommended to consult with the Customer Success Team to determine the optimal configuration.

Hot spare capacity

By default, WEKA reserves 1/N of the total capacity as virtual hot spare capacity, where N is the stripe width plus one. For example, a 3+2 configuration is deployed as 3+2+1, reserving one-sixth of the cluster’s capacity for the spare. This proactive provisioning keeps capacity immediately available to absorb a failure.
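The reservation arithmetic from the 3+2 example can be sketched as follows. This is a worked illustration under the assumption that a D+P scheme is provisioned as D+P+S with S spare shares; the function name is hypothetical, not part of WEKA.

```python
# Illustrative arithmetic (not a WEKA API): share of raw capacity reserved
# when a D+P scheme is provisioned as D+P+S with S virtual hot spare shares.
from fractions import Fraction

def hot_spare_fraction(data: int, parity: int, spares: int = 1) -> Fraction:
    """Reserved fraction of raw capacity for S spare shares per stripe."""
    return Fraction(spares, data + parity + spares)

# 3+2 deployed as 3+2+1 reserves one-sixth of the cluster's capacity.
print(hot_spare_fraction(3, 2))  # prints 1/6
```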

In cases where hardware failures persist and components are not replaced promptly, WEKA employs a mechanism called failure domain folding to maintain write availability. This mechanism temporarily relaxes the requirement that each RAID stripe must span only distinct failure domains (for example, one per backend storage server). It allows a single failure domain to appear multiple times in a stripe, enabling the system to allocate new stripes and continue accepting write operations even in degraded states.

Failure domain folding process

Failure domain folding is automatically triggered when the number of active failure domains becomes insufficient to satisfy the original stripe width, typically due to server deactivation or loss of multiple drives. This approach ensures that the system can remain operational during extended fault conditions without immediate hardware replacement.

The following illustration demonstrates how WEKA maintains write capability during sustained component failures by applying failure domain folding. It shows three stages:

  • Stage A: Normal operation with all drives active: In the initial state, all backend storage servers are operational. Each server is treated as a distinct failure domain, represented vertically in the diagram. RAID stripes span horizontally across all failure domains, combining data (yellow) and parity (purple) blocks. Although each block is shown as one NVMe drive for clarity, actual allocation occurs at the stripe level. To tolerate hardware failures, WEKA is configured with hot spare capacity, equivalent to the capacity of two full servers (6 drives). This reserve is not tied to specific drives but is notionally allocated across the system. At this stage, new stripe allocations proceed normally using free space across all failure domains.

  • Stage B: Write blocking after a drive failure: When a single drive fails, any new stripe that must span all failure domains can no longer be allocated if any one domain lacks space. Even though only one drive has failed, this strict requirement effectively blocks new writes, since the allocation rule cannot be satisfied. This results in a disproportionate loss of writable capacity relative to the size of the failure, particularly in systems with fewer drives per server.

  • Stage C: Write recovery through failure domain folding: To mitigate the blocked write condition, the affected storage server can be manually deactivated. This allows WEKA to apply failure domain folding, which permits the reuse of the same failure domain within a stripe. By relaxing the one-domain-per-stripe rule, new stripes can once again be allocated despite the failed drive. This folding mechanism restores write capability without immediate hardware replacement, ensuring continued system operation under degraded conditions.
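The three stages above can be modeled with a toy allocator. This sketch is not WEKA's internal logic; the data structures and function name are hypothetical. It only demonstrates how relaxing the one-domain-per-stripe rule turns a blocked allocation into a successful one.

```python
# Toy model (not WEKA internals) of stripe allocation across failure domains,
# with and without failure-domain folding.

def allocate_stripe(free_per_domain: dict, width: int, allow_folding: bool):
    """Pick `width` placements; without folding, each must be a distinct domain."""
    placements = []
    free = dict(free_per_domain)
    for _ in range(width):
        candidates = {d: f for d, f in free.items() if f > 0}
        if not allow_folding:
            # Strict rule: a failure domain may appear at most once per stripe.
            candidates = {d: f for d, f in candidates.items() if d not in placements}
        if not candidates:
            return None  # allocation blocked: new writes cannot proceed
        domain = max(candidates, key=candidates.get)  # most free space first
        placements.append(domain)
        free[domain] -= 1
    return placements

# Stage B: one of five domains has no free space, so a width-5 stripe is blocked.
free_space = {"srv1": 4, "srv2": 4, "srv3": 4, "srv4": 4, "srv5": 0}
print(allocate_stripe(free_space, width=5, allow_folding=False))  # prints None
# Stage C: folding lets a healthy domain appear twice, so writes resume.
print(allocate_stripe(free_space, width=5, allow_folding=True))
```

Note how the single exhausted domain blocks every new stripe under the strict rule, matching the disproportionate capacity loss described in Stage B, while folding restores allocation without any hardware replacement.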

Performance during data rebuilds

Rebuild operations in WEKA are primarily read-intensive, as the system reconstructs missing data by reading from all drives in the stripe. While read performance may degrade slightly during this process, write performance remains unaffected, as the system continues writing to available backends.
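A toy single-parity example makes the read-heavy nature of rebuilds concrete. WEKA uses its own erasure coding across N+2/N+3/N+4 stripes, not plain XOR; this sketch only illustrates that recovering one lost block requires reading every surviving block in the stripe.

```python
# Toy sketch of why rebuilds are read-intensive: recovering one lost block in
# a single-parity stripe means reading every surviving block. (WEKA's actual
# erasure coding differs; XOR here just shows the read pattern.)
import functools

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return functools.reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

data_blocks = [b"\x01\x02", b"\x0f\x00", b"\xa0\x55"]
parity = xor_blocks(data_blocks)

lost = data_blocks[1]                               # one drive fails
survivors = [data_blocks[0], data_blocks[2], parity]  # every survivor is read
rebuilt = xor_blocks(survivors)
assert rebuilt == lost
```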

WEKA provides a critical optimization during rebuilds. If a failed component, such as a drive or a server, comes back online after a rebuild has started, the rebuild is automatically aborted. This approach prevents unnecessary data movement and quickly restores normal operations in the case of temporary failures, such as servers returning from maintenance. This behavior significantly differentiates WEKA from traditional systems that continue rebuilding even after transient faults are resolved.

Write performance and stripe width

Larger stripe widths improve write throughput by reducing the proportion of parity overhead in write operations. This benefit is especially important for high-ingest workloads, such as initial data loading or write-heavy applications.
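The parity share of a full-stripe write quantifies this effect. A brief sketch, assuming full-stripe writes at a fixed parity level of P=2 (the function name is illustrative, not a WEKA API):

```python
# Illustrative arithmetic: fraction of each full-stripe write spent on parity
# for a fixed P=2 as stripe width grows. Wider stripes pay less parity overhead.

def parity_overhead(data: int, parity: int) -> float:
    """Parity's share of bytes written in a full-stripe write."""
    return parity / (data + parity)

for d in (3, 8, 16):
    print(f"{d}+2: {parity_overhead(d, 2):.0%} of each full-stripe write is parity")
```

The overhead falls from 40% at 3+2 to roughly 11% at 16+2, which is why wider stripes benefit high-ingest, write-heavy workloads.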
