
Data lifecycle management

This page covers the principles for data lifecycle management and how data storage is managed in SSD-only and tiered Weka system configurations.


Media options for data storage in the Weka system

In the Weka system, data can be stored on two forms of media:

  1. On locally-attached SSDs, which are an integral part of the Weka system configuration.

  2. On object-store systems external to the Weka system, which are either third-party solutions, cloud services, or part of the Weka system.

The Weka system can be configured either as an SSD-only system or as a data management system consisting of both SSDs and object stores. By nature, SSDs provide high-performance, low-latency storage, while object stores compromise on performance and latency but are the most cost-effective storage solution available. Consequently, users focused purely on high performance should consider an SSD-only Weka system configuration, while users seeking to balance performance and cost should consider a tiered data management system, with the assurance that the Weka system controls the allocation of hot data on SSDs and warm data on object stores, thereby optimizing both the overall user experience and the budget.

Note: In SSD-only configurations, the Weka system will sometimes use an external object store for backup, as explained in Snap-To-Object.

Guidelines for data storage in tiered Weka system configurations

In tiered Weka system configurations, data is managed across the storage media as follows:

  1. Metadata is stored only on SSDs.

  2. Writing of new files, adding data to existing files, or modifying the content of files is always performed on the SSD, irrespective of whether the file is currently stored on the SSD or tiered to an object store.

  3. When reading the content of a file, data can be accessed from either the SSD (if it is available on the SSD) or rehydrated from the object store (if it is not available on the SSD).

This approach of storing data on one of two possible media requires system planning to ensure that the most commonly used data (hot data) resides on the SSDs for high performance, while less-used data (warm data) is stored on the object store. In the Weka system, this determination of the data storage media is a completely seamless, automatic, and transparent process; users and applications are unaware of the transfer of data from SSDs to object stores, or from object stores to SSDs. The data is accessible at all times through the same strongly-consistent POSIX filesystem API, irrespective of where it is stored. Only latency, throughput, and IOPS are affected by the actual storage media.

Furthermore, the Weka system tiers data in chunks, rather than complete files. This enables the smart tiering of subsets of a file (and not only complete files) between SSDs and object stores.

The network resources allocated to the object store connections can be controlled. This enables cost control when using cloud-based object storage services, since the cost of data stored in the cloud depends on the quantity stored and the number of access requests made.

States in the Weka system data management storage process

The data management state represents the media being used for the storage of the data. In tiered Weka system configurations, data can exist in one of three possible states:

  1. SSD-only: When data is created, it exists only on the SSDs.

  2. SSD-cached: A tiered copy of the data exists on both the SSD and the object store.

  3. Object store only: Data resides only on the object store.

Note: These states represent the lifecycle of data and not the lifecycle of a file. When a file is modified, each modification creates a separate data lifecycle for the modified data.

The Data Lifecycle Diagram represents the transitions of data between the above states, where #1 is the Tiering operation, #2 is the Releasing operation, and #3 is the Rehydrating operation:

  1. Tiering of data from the SSD to create a replica in the object store. A guideline for the tiering of data is based on a user-defined, time-based policy (the Tiering Cue).

  2. Releasing data from the SSD, leaving only the object-store copy (based on the demand for more space for data on the SSD). A guideline for the release of data is based on a user-defined, time-based policy (the Retention Period).

  3. Rehydrating data from the object store to the SSD, for the purpose of data access.

In order to read data residing only on an object store, the data must first be rehydrated back to the SSD.

In the Weka system, file modification is never implemented as an in-place write, but rather as a write to a new area located on the SSD, together with the relevant modification of the metadata. Consequently, write operations are never associated with object store operations.
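These transitions normally happen automatically, driven by the policies described below, but data can also be fetched or released on demand. The following is a minimal sketch using the weka fs tier commands covered under Manual fetch and release of data; the mount path and file name are hypothetical, and the exact syntax may vary between versions:

  # Rehydrate (fetch) a tiered file back to the SSD ahead of access
  weka fs tier fetch /mnt/weka/project/dataset.bin

  # Release the SSD copy of a file, leaving only the object-store copy
  weka fs tier release /mnt/weka/project/dataset.bin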

The role of SSDs in tiered Weka configurations

All writing in the Weka system is performed to SSDs. The data residing on SSDs is hot data, i.e., data that is currently in use. In tiered Weka configurations, SSDs have three primary roles in accelerating performance: metadata processing, serving as a staging area for writes, and serving as a read cache.

Metadata processing

Filesystem metadata operations are by nature numerous, and each involves only a small number of bytes. Keeping the metadata on SSDs therefore accelerates file operations in the Weka system.

SSD as a staging area

Writing directly to an object store incurs high latency while waiting for acknowledgment that the data has been written, so the Weka system never writes directly to object stores. Writes are performed to the SSDs, with very low latency and therefore much better performance. Consequently, in the Weka system the SSDs serve as a staging area, providing a buffer that is big enough to hold written data until it is later tiered to the object store. On completion of writing, the Weka system is responsible for tiering the data to the object store and for releasing it from the SSD.

SSD as a cache

Recently accessed or modified data is stored on SSDs, and most read operations target such data and are served from the SSDs. The cache is managed by a single, large LRU clearing policy that ensures optimal read performance.

Note: On a tiered filesystem, the total capacity determines the maximum capacity that can be used to store data. Due to the SSD roles described above and the time-based policies described below, it is possible for all of the data to reside on the object store.

For example, consider a filesystem with a total capacity of 100 TB and an SSD capacity of 10 TB. It is possible for all 100 TB of data to reside on the object store, in which case no new writes are allowed even though the SSD space is not completely used (until files are deleted or the filesystem total size is increased), leaving the SSDs for metadata and caching only.
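To illustrate this sizing, the following sketch creates such a tiered filesystem with the weka fs create command (see Manage filesystems using the CLI). The filesystem and group names are hypothetical, and the exact flags may differ between versions:

  # Create a 100 TB filesystem backed by 10 TB of SSD capacity
  # (a tiered filesystem also needs an object store bucket attached;
  # see Attach or detach object store buckets)
  weka fs create fs01 fs_group_tiered 100TB --ssd-capacity 10TB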

Time-based policies for the control of data storage location

The Weka system includes user-defined policies that serve as guidelines to control data storage management. They are derived from a number of factors:

  1. The rate at which data is written to the system and the quantity of data.

  2. The capacity of the SSDs configured to the Weka system.

  3. The speed of the network between the Weka system and the object store, and the performance and capacity of the object store itself, for example, how much data it can actually contain.

Filesystem groups are used to define these policies; a tiered filesystem is placed in a filesystem group according to the desired policy.

For tiered filesystems, the following parameters should be defined per filesystem:

  1. The size of the filesystem.

  2. The amount of filesystem data to be stored on the SSD.

The following parameters should be defined per filesystem group:

  1. The Drive Retention Period Policy: a time-based policy defining the target period for data to remain on the SSD after creation, modification, or access, before it is released from the SSD, even if it has already been tiered to the object store, for metadata processing and SSD caching purposes. This is only a target; the actual release schedule depends on the amount of available SSD space.

  2. The Tiering Cue Policy: a time-based policy that determines the minimum amount of time that data remains on the SSD before it is considered for tiering to the object store. As a rule of thumb, configure the Tiering Cue to a third of the Retention Period; in most cases, this works well. The Tiering Cue is important because it is pointless to tier a file that is about to be modified or deleted.

For example:

When writing log files that are processed every month but retained forever: It is recommended to define a Retention Period of 1 month, a Tiering Cue of 1 day, and ensure that there is sufficient SSD capacity to hold 1 month of log files.

When storing genomic data which is frequently accessed during the first 3 months after creation, requires a scratch space for 6 hours of processing, and requires output to be retained forever: It is recommended to define a Retention Period of 3 months and to allocate an SSD capacity that will be sufficient for 3 months of output data and the scratch space. The Tiering Cue should be defined as 1 day, in order to avoid a situation where the scratch space data is tiered to an object store and released from the SSD immediately afterward.
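As an illustration of the log-file example above, the following sketch sets these policies when creating a filesystem group. The group name is hypothetical, and the flag names (--target-ssd-retention for the Retention Period and --start-demote for the Tiering Cue) are assumptions; see Manage filesystem groups using the CLI for the authoritative syntax:

  # Filesystem group for monthly-processed log files:
  # keep data on SSD for about 1 month, consider tiering after 1 day
  # (flag names are assumptions; verify against your Weka version)
  weka fs group create logs_group --target-ssd-retention 30d --start-demote 1d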

Bypassing the time-based policies

Note: Using the Snap-To-Object feature causes data to be tiered regardless of the tiering policies.

Regardless of the time-based policies, it is possible to use the obs_direct mount option to bypass them. Any file created or written from a mount point with this option is marked for release as soon as possible, ahead of the retention policy applied to other files.
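As a rough illustration, a mount with this option might look as follows, using the WekaFS mount syntax described in Mount filesystems; the backend name, filesystem name, and mount point are hypothetical:

  # Files written through this mount point are marked for release
  # from the SSD as soon as possible
  mount -t wekafs -o obs_direct backend-01/fs01 /mnt/weka-scratch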

For a more in-depth explanation, refer to Advanced Data Lifecycle Management.
