Data Lifecycle Management
This page covers the principles for data lifecycle management and how data storage is managed in SSD-only and tiered Weka system configurations.
Media Options for Data Storage in the Weka System
In the Weka system, data can be stored on two forms of media:
On locally-attached SSDs, which are an integral part of the Weka system configuration.
On object-store systems external to the Weka system, which are either third-party solutions, cloud services, or part of the Weka system.
The Weka system can be configured either as an SSD-only system or as a tiered data management system consisting of both SSDs and object stores. By nature, SSDs provide high-performance, low-latency storage, while object stores trade performance and latency for the most cost-effective storage available. Consequently, users focused purely on high performance should consider an SSD-only Weka system configuration, while users seeking to balance performance and cost should consider a tiered configuration, with the assurance that the Weka system will keep hot data on SSDs and warm data on object stores, thereby optimizing both the overall user experience and the budget.
Note: In SSD-only configurations, the Weka system will sometimes use an external object store for backup, as explained in Snap-To-Object Data Lifecycle Management.
Guidelines for Data Storage in Tiered Weka System Configurations
In tiered Weka system configurations, there are various locations for data storage as follows:
Metadata is stored only on the SSDs.
Writing new files, appending data to existing files, or modifying file content always takes place on the SSD, irrespective of whether the file currently resides on the SSD or is tiered to an object store.
When reading a file, data is served from the SSD if it is available there; otherwise, it is rehydrated from the object store.
Storing data on one of two possible media requires system planning to ensure that the most commonly used data (hot data) resides on the SSD for high performance, while less-used data (warm data) is stored on the object store. In the Weka system, this placement is completely seamless, automatic, and transparent: users and applications are unaware of data moving between SSDs and object stores. The data remains accessible at all times through the same strongly-consistent POSIX filesystem API, irrespective of where it is stored; only latency, throughput, and IOPS are affected by the actual storage media.
Furthermore, the Weka system tiers data in chunks rather than complete files. This enables smart tiering of subsets of a file (and not only complete files) between SSDs and object stores, as illustrated in the sketch below.
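To make the chunk-level behavior concrete, the following is a minimal, illustrative Python model of a file whose chunks are tiered independently. The 1 MiB chunk size and all names are assumptions made for the sketch, not Weka's actual layout or implementation:

```python
from enum import Enum

# Illustrative only: the real chunk size and internal layout are not
# exposed here; 1 MiB is an arbitrary assumption for this sketch.
CHUNK_SIZE = 1 << 20  # 1 MiB

class Location(Enum):
    SSD_ONLY = "ssd-only"        # data exists only on the SSD
    SSD_CACHED = "ssd-cached"    # a tiered copy exists on both media
    OBJECT_ONLY = "object-only"  # data was released from the SSD

class TieredFile:
    """A file tracked as a list of independently tiered chunks."""

    def __init__(self, size: int):
        nchunks = (size + CHUNK_SIZE - 1) // CHUNK_SIZE
        self.chunks = [Location.SSD_ONLY] * nchunks  # new data starts on SSD

    def locations(self, offset: int, length: int) -> set:
        """Media that must be consulted to read the given byte range."""
        first = offset // CHUNK_SIZE
        last = (offset + length - 1) // CHUNK_SIZE
        return {self.chunks[i] for i in range(first, last + 1)}

f = TieredFile(10 * CHUNK_SIZE)
f.chunks[0] = Location.OBJECT_ONLY     # a cold prefix was tiered and released
print(f.locations(0, 2 * CHUNK_SIZE))  # reading this range spans both media
```

Because placement is tracked per chunk, a single large file can have its hot regions on SSD and its cold regions on the object store at the same time.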
The network resources allocated to the object store connections can be controlled. This enables cost control when using cloud-based object storage services since the cost of data stored in the cloud depends on the quantity stored and the number of requests for access made.
States in the Weka System Data Management Storage Process
The data management state indicates the media on which data is stored. In tiered Weka system configurations, data can exist in one of three possible states:
SSD-only: When data is created, it exists only on the SSDs.
SSD-cached: A tiered copy of the data exists on both the SSD and the object-store.
Object Store only: Data resides only on the object-store.
Note: These states represent the lifecycle of data and not the lifecycle of a file. When a file is modified, each modification creates a separate data lifecycle for the modified data.
The data lifecycle diagram represents the transitions of data between the above states, where #1 is the Tiering operation, #2 is the Releasing operation, and #3 is the Rehydrating operation:
Tiering of data from the SSD creates a replica in the object store. Tiering is guided by a user-defined, time-based policy (the Tiering Cue).
Releasing data from the SSD leaves only the object-store copy (driven by the demand for more data space on the SSD). Release is guided by a user-defined, time-based policy (the Retention Period).
Rehydrating data from the object store back to the SSD enables access to that data.
In order to read data residing only on an object store, the data must first be rehydrated back to the SSD.
In the Weka system, file modification is never implemented as an in-place write, but rather as a write to a new area on the SSD, followed by the relevant metadata update. Consequently, write operations are never associated with object-store operations.
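The three states and three operations above can be summarized as a small state machine. The following is an illustrative Python sketch of the documented transitions, not Weka code:

```python
# (current state, operation) -> next state, per the lifecycle diagram.
TRANSITIONS = {
    ("ssd-only",    "tier"):      "ssd-cached",   # 1: copy data to the object store
    ("ssd-cached",  "release"):   "object-only",  # 2: drop the SSD copy
    ("object-only", "rehydrate"): "ssd-cached",   # 3: copy back to SSD for reads
}

def apply(state: str, op: str) -> str:
    try:
        return TRANSITIONS[(state, op)]
    except KeyError:
        raise ValueError(f"operation {op!r} is invalid in state {state!r}")

# New data always starts on the SSD; reading object-only data forces rehydration.
state = "ssd-only"
for op in ("tier", "release", "rehydrate"):
    state = apply(state, op)
    print(op, "->", state)
```

Note that no transition writes to the object store as part of a file modification: a modification simply starts a new lifecycle for the modified data, beginning in the SSD-only state.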
The Role of SSDs in Tiered Weka Configurations
All writing in the Weka system is performed to SSDs. The data residing on SSDs is hot data, i.e., data that is currently in use. In tiered Weka configurations, SSDs have three primary roles in accelerating performance: metadata processing, serving as a staging area for writes, and serving as a cache for reads.
Metadata Processing
Filesystem metadata by nature involves a large number of update operations, each touching a small number of bytes, so storing metadata on SSDs accelerates file operations in the Weka system.
SSD as a Staging Area
Writing directly to an object store incurs high latency while waiting for acknowledgment that the data has been written, so the Weka system never writes directly to object stores. Instead, writes go to the SSDs, with very low latency and therefore much better performance. Consequently, in the Weka system, the SSDs serve as a staging area, providing a buffer large enough to absorb writes until the data is later tiered to the object store. On completion of writing, the Weka system is responsible for tiering the data to the object store and for releasing it from the SSD.
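The staging flow can be sketched as follows. This is a conceptual model only (the dictionaries and queue are stand-ins, not Weka internals), showing that a write is acknowledged once the data reaches the SSD, while the slow object-store upload happens in the background:

```python
import queue
import threading

ssd: dict = {}           # stand-in for the SSD tier
object_store: dict = {}  # stand-in for the object store
tier_queue: queue.Queue = queue.Queue()

def write(path: str, data: bytes) -> None:
    ssd[path] = data      # fast, low-latency persist to SSD
    tier_queue.put(path)  # schedule background tiering; the caller never waits
    # <- the write is acknowledged here, with no object-store round trip

def tiering_worker() -> None:
    while True:
        path = tier_queue.get()
        object_store[path] = ssd[path]  # slow upload, off the write path
        tier_queue.task_done()

threading.Thread(target=tiering_worker, daemon=True).start()
write("/logs/app.log", b"hello")
tier_queue.join()  # in this sketch, wait for the copy to reach the object store
print("tiered:", "/logs/app.log" in object_store)
```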
SSD as a Cache
Recently accessed or modified data is stored on SSDs, so most read operations are served from them. The cache is managed with a single, large, least-recently-used (LRU) clearing policy, which ensures optimal read performance.
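The following toy sketch shows the general shape of an LRU clearing policy. It is illustrative only, and it omits the constraint that released data must already have an object-store copy:

```python
from collections import OrderedDict

class LRUCache:
    """Toy LRU policy: recently used entries survive, the oldest are dropped.

    In Weka, only data that already has an object-store copy can be
    released from the SSD; this toy omits that check.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries: OrderedDict = OrderedDict()

    def touch(self, key: str, value: bytes) -> None:
        self.entries[key] = value
        self.entries.move_to_end(key)  # mark as most recently used
        while len(self.entries) > self.capacity:
            evicted, _ = self.entries.popitem(last=False)  # drop the LRU entry
            print("released from SSD:", evicted)

cache = LRUCache(capacity=2)
for chunk in ("a", "b", "c"):  # "a" becomes least recently used and is released
    cache.touch(chunk, b"...")
```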
Note: On a tiered filesystem, the total capacity determines the maximum amount of data that can be stored, and all of it may end up residing on the object store, due to the SSD roles described above and the time-based policies described below.
For example, consider a filesystem with a total capacity of 100 TB and an SSD capacity of 10 TB. All 100 TB of data may come to reside on the object store, at which point no new writes are allowed even though the SSD space is not fully used (until files are deleted or the filesystem's total size is increased), leaving the SSD for metadata and caching only.
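The arithmetic behind this note can be made explicit with a minimal sketch; the variable names and figures below simply restate the example above and are not part of any Weka API:

```python
# Units: terabytes. The writable limit is the filesystem's total capacity,
# not the free SSD space.
TOTAL_CAPACITY = 100
SSD_CAPACITY = 10

data_stored = 100  # everything written so far, wherever it resides
on_ssd = 4         # say most data was tiered and released from the SSD

can_write_new_data = data_stored < TOTAL_CAPACITY
ssd_has_free_space = on_ssd < SSD_CAPACITY
print(can_write_new_data, ssd_has_free_space)  # False True: writes refused
```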
Time-based Policies for the Control of Data Storage Location
The Weka system includes user-defined policies that serve as guidelines for data storage management. They are derived from a number of factors:
The rate at which data is written to the system and the quantity of data.
The capacity of the SSDs configured to the Weka system.
The speed of the network between the Weka system and the object store, and the performance capabilities of the object store itself (e.g., how much data the object store can actually contain).
These policies are defined through filesystem groups; a tiered filesystem is placed in the filesystem group that implements its desired policy.
For tiered filesystems, the following parameters should be defined per filesystem:
The size of the filesystem.
The amount of filesystem data to be stored on the SSD.
The following parameters should be defined per filesystem group:
The Drive Retention Period Policy: a time-based policy that defines the target period for data to remain on the SSD after its creation, modification, or last access, before it is released from the SSD, even if it has already been tiered to the object store, to support metadata processing and SSD caching. This is only a target; the actual release schedule depends on the amount of available SSD space.
The Tiering Cue Policy: a time-based policy that defines the minimum time data must remain on the SSD before it is considered for tiering to the object store. As a rule of thumb, setting it to a third of the Retention Period works well in most cases. The Tiering Cue is important because it is pointless to tier data that is about to be modified or deleted. (Both policies are illustrated in the sketch after the examples below.)
For example:
When writing log files that are processed every month but retained forever: it is recommended to define a Retention Period of 1 month, a Tiering Cue of 1 day, and to ensure sufficient SSD capacity to hold 1 month of log files.
When storing genomic data which is frequently accessed during the first 3 months after creation, requires a scratch space for 6 hours of processing, and requires output to be retained forever: It is recommended to define a Retention Period of 3 months and to allocate an SSD capacity that will be sufficient for 3 months of output data and the scratch space. The Tiering Cue should be defined as 1 day, in order to avoid a situation where the scratch space data is tiered to an object store and released from the SSD immediately afterward.
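As referenced above, here is a minimal, illustrative sketch of how the two group policies interact. The function and field names are hypothetical, and the real scheduler is also driven by available SSD space rather than by age alone:

```python
from dataclasses import dataclass

DAY = 24 * 3600  # seconds

@dataclass
class GroupPolicy:
    retention_period: int  # target time on SSD; actual release is space-driven
    tiering_cue: int       # minimum age before data is considered for tiering

def tier_candidate(age: int, policy: GroupPolicy) -> bool:
    """Old enough that tiering it is unlikely to be wasted work."""
    return age >= policy.tiering_cue

def release_candidate(age: int, tiered: bool, policy: GroupPolicy) -> bool:
    """Only data that already has an object-store copy can be released,
    and only once it has aged past the retention target."""
    return tiered and age >= policy.retention_period

# The log-file example above: 1-month retention, 1-day tiering cue.
logs = GroupPolicy(retention_period=30 * DAY, tiering_cue=1 * DAY)
print(tier_candidate(2 * DAY, logs))              # True: eligible for tiering
print(release_candidate(2 * DAY, True, logs))     # False: retention target not reached
```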
Note: Using the Snap-To-Object feature causes data to be tiered regardless of the tiering policies.
Bypassing the time-based policies
Regardless of the time-based policies, the special obs_direct mount option can be used to bypass them. Any file created or written through a mount point with this option is marked for release as soon as possible, taking precedence over the retention policy applied to other files.
For a more in-depth explanation, refer to Advanced Data Lifecycle Management.