SSD capacity management

Understand the key terms of WEKA system capacity management and the formula for calculating the net data storage capacity.

Raw capacity

Raw capacity is the total capacity of all SSDs assigned to a WEKA system cluster. For example, 10 SSDs of one terabyte each provide a total raw capacity of 10 terabytes. This represents the total capacity available for the WEKA system. This capacity automatically adjusts when more servers or SSDs are added.

Net capacity

Net capacity is the space available for user data on the SSDs in a configured WEKA system. It is derived from the raw capacity minus the WEKA filesystem overheads for redundancy protection and other requirements. This capacity automatically adjusts when more servers or SSDs are added.

Stripe width

The stripe width is the number of blocks within a common protection set, ranging from 3 to 16. The WEKA system employs distributed any-to-any protection. In a system with a stripe width of 8, many groups of 8 data units spread across various servers protect each other, rather than a fixed group of 8 servers forming a protection group.

The stripe width is set during cluster formation and cannot be changed. The choice of stripe width impacts performance and net capacity.

If not configured, the stripe width is automatically set to the number of failure domains minus the protection level minus 1 (#Failure Domains - Protection Level - 1).
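The default rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names are hypothetical and are not part of any WEKA tool. The source does not say what happens when the computed value falls outside the 3 to 16 range, so the sketch only reports whether the result is in range.

```python
# Supported stripe width range, as stated in this section.
MIN_STRIPE_WIDTH = 3
MAX_STRIPE_WIDTH = 16


def default_stripe_width(failure_domains: int, protection_level: int = 2) -> int:
    """Apply the documented default: #Failure Domains - Protection Level - 1."""
    return failure_domains - protection_level - 1


def in_supported_range(stripe_width: int) -> bool:
    """Check the value against the documented 3..16 range (hypothetical helper)."""
    return MIN_STRIPE_WIDTH <= stripe_width <= MAX_STRIPE_WIDTH


if __name__ == "__main__":
    for failure_domains in (6, 10, 19):
        width = default_stripe_width(failure_domains)
        print(failure_domains, width, in_supported_range(width))
```

For example, 10 failure domains with the default protection level of 2 give a default stripe width of 7.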

Protection level

Protection level refers to the number of extra protection blocks added to each data stripe in your storage system. These blocks help protect your data against hardware failures. The protection levels available are:

  • Protection level 2: Can survive 2 concurrent disk or server failures.

  • Protection level 4: Can survive 4 concurrent disk failures or 2 concurrent server failures.

A higher protection level means better data durability and availability but requires more storage space and can affect performance.

Key points:

  • Durability:

    • Higher protection levels offer better data protection.

    • Level 4 is more durable than level 2.

  • Availability:

    • Ensures system availability during hardware failures.

    • Level 4 maintains availability through more extensive failures compared to level 2.

  • Space and performance:

    • Higher protection levels use more storage space.

    • They can also slow down the system due to additional processing.

  • Configuration:

    • The protection level is set during cluster formation and cannot be changed later.

    • If not configured, the system defaults to protection level 2.
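To make the space cost of a higher protection level concrete, the following sketch computes the share of each stripe that holds data, using the D/(D+P) term from the net capacity calculation later on this page. The function name is hypothetical, and the sketch covers only the space cost, not the performance effect.

```python
def data_fraction(data_blocks: int, protection_blocks: int) -> float:
    """Share of a stripe that stores data rather than protection blocks."""
    return data_blocks / (data_blocks + protection_blocks)


if __name__ == "__main__":
    # Same number of data blocks, two protection levels.
    for protection_level in (2, 4):
        share = data_fraction(16, protection_level)
        print(f"16+{protection_level}: {share:.1%} of each stripe holds data")
```

With 16 data blocks, protection level 4 leaves a smaller share of each stripe for data than protection level 2, which is the extra storage space the key points refer to.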

Resilience to serial failures

Beyond the +2 or +4 concurrent failure protection, a WEKA cluster is also resilient to serial failures of additional failure domains, provided that each data rebuild completes successfully and there is sufficient free NVMe capacity in the cluster. Under these conditions, the cluster is resilient to an additional server failure, even if the failure would reduce the number of available servers beyond what is expected by the concurrent protection level.

Example of failure resilience after rebuild completion

Consider a cluster of 20 servers with a stripe width of 18 (16+2). After rebuilding from a concurrent failure of 2 servers, the cluster is still resilient to two additional concurrent server failures.

In the event of subsequent server failures, the cluster rebuilds with the remaining healthy servers to support a stripe width of 18. If further serial server failures occur, the system rebuilds its data stripes as those individual servers fail, subject to sufficient NVMe space, until the lower limit of 9 servers is reached (in this case). Failures beyond this point result in the filesystem going offline.

In the event of serial server failures and insufficient NVMe capacity, the cluster attempts to tier data that currently resides in NVMe out to its object stores if configured. In contrast to the usual age-related orderly tiering that occurs in normal usage, this does not consider data's age when making tiering decisions, and instead will tier data in an approximately random fashion. This is not the desired mode of operation, but it ensures data integrity in the event of continually-decreasing NVMe capacity by offloading data to an object store.

Resilience level and minimum required healthy servers

The stripe width and protection level determine the minimum number of required healthy servers. This can be represented by the following formula:

H = Roundup((D + P) / P)

Where:

  • D is the data blocks in the stripe.

  • P is the protection blocks in the stripe.

  • H is the minimum number of healthy servers.

The following are a few examples:

Stripe width (D+P)    Minimum required healthy servers (H)
5+2                   4
16+2                  9
5+4                   3
16+4                  5
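The examples above can be reproduced with a short Python sketch that rounds (D+P)/P up to the next whole number. The function name is hypothetical; integer arithmetic is used for the ceiling so that no floating-point rounding is involved.

```python
def min_healthy_servers(data_blocks: int, protection_blocks: int) -> int:
    """Return H = Roundup((D + P) / P) using integer ceiling division."""
    stripe = data_blocks + protection_blocks
    return -(-stripe // protection_blocks)


if __name__ == "__main__":
    for d, p in ((5, 2), (16, 2), (5, 4), (16, 4)):
        print(f"{d}+{p}: H = {min_healthy_servers(d, p)}")
```

For the 16+2 case this gives 9, which matches the lower limit of 9 servers in the 20-server example earlier in this section.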

Failure domains (optional)

A failure domain is a set of WEKA servers susceptible to simultaneous failure due to a single root cause, such as a power circuit or network switch malfunction.

A cluster can be configured with either explicit or implicit failure domains:

  • Explicit failure domains: In this setup, blocks that offer mutual protection are distributed across distinct failure domains.

  • Implicit failure domains: Here, blocks are distributed across multiple servers, with each server considered a separate failure domain. Additional failure domains and servers can be integrated into existing or new failure domains.

Hot spare

A hot spare is reserved capacity designed to handle data rebuilds while maintaining the system’s net capacity, even in the event of failure domains being lost. It represents the number of failure domains the system can afford to lose and still perform a complete data rebuild successfully.

All failure domains actively contribute to data storage, and the hot spare capacity is evenly distributed among them. While a higher hot spare count requires additional hardware to maintain the same net capacity, it provides greater flexibility for IT maintenance and hardware replacements.

If not configured, the hot spare is automatically set to 1.

WEKA filesystem overhead

After accounting for protection and hot spare capacity, only 90% of the remaining capacity is available as net user capacity, with the other 10% reserved for the WEKA filesystems. This is a fixed formula and cannot be configured.

Provisioned capacity

Provisioned capacity is the total capacity assigned to filesystems, including both SSD and object store capacity.

Available capacity

Available capacity is the total capacity that can be used to allocate new filesystems, calculated as net capacity minus provisioned capacity.

Deductions from raw capacity to obtain net storage capacity

The net capacity of the WEKA system is determined by making the following three deductions during configuration:

  • Protection level: Storage capacity dedicated to system protection.

  • Hot spare(s): Storage capacity reserved for redundancy and rebuilding following component failures.

  • WEKA filesystem overhead: Storage capacity allocated to enhance overall performance.

SSD net storage capacity calculation

Examples:

Scenario 1: A homogeneous system of 10 servers, each with one terabyte of raw SSD capacity, one hot spare, and a protection scheme of 6+2.

SSD Net Capacity = 10 TB * (10-1) / 10 * 6/(6+2) * 0.9 = 6.075 TB

Scenario 2: A homogeneous system of 20 servers, each with one terabyte of raw SSD capacity, two hot spares, and a protection scheme of 16+2.

SSD Net Capacity = 20 TB * (20-2) / 20 * 16/(16+2) * 0.9 = 14.4 TB

This documentation assumes a homogeneous WEKA system deployment, meaning an equal number of servers and identical SSD capacities per server in each failure domain. For guidance on heterogeneous configurations, contact the Customer Success Team.
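The two scenarios follow the same pattern: raw capacity, reduced by the hot spare share, then by the protection share, then by the 10% filesystem overhead. The Python sketch below applies those three deductions in order. The function and parameter names are hypothetical, and the sketch assumes a homogeneous system in which each server is its own failure domain, as in the scenarios.

```python
def ssd_net_capacity_tb(
    raw_tb: float,
    failure_domains: int,
    hot_spares: int,
    data_blocks: int,
    protection_blocks: int,
    fs_overhead: float = 0.10,
) -> float:
    """Apply the three deductions from raw capacity to obtain net capacity."""
    # 1. Hot spare: capacity equal to `hot_spares` failure domains is reserved.
    after_hot_spare = raw_tb * (failure_domains - hot_spares) / failure_domains
    # 2. Protection level: only D out of every D+P blocks hold data.
    after_protection = after_hot_spare * data_blocks / (data_blocks + protection_blocks)
    # 3. WEKA filesystem overhead: a fixed 10% is reserved.
    return after_protection * (1 - fs_overhead)


if __name__ == "__main__":
    scenario_1 = ssd_net_capacity_tb(10, 10, 1, 6, 2)
    scenario_2 = ssd_net_capacity_tb(20, 20, 2, 16, 2)
    print(f"Scenario 1: {scenario_1:.3f} TB")
    print(f"Scenario 2: {scenario_2:.3f} TB")
```

The results are formatted to three decimal places because the intermediate values are floating-point numbers.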