System congestion

This page describes possible congestion issues in the WEKA system.

Overview

The WEKA system is designed for efficiency, delivering maximum performance and fully utilizing network links. However, when specific resource limits are reached, the system may slow down I/O operations or even block new I/Os until the congestion on the affected resource is alleviated.

While these situations are often temporary and may resolve themselves quickly, persistent congestion can indicate an underlying issue, such as a workload overwhelming the cluster's resources. In such cases, consider expanding the cluster's resources, as detailed in Expand and shrink cluster resources. For further assistance, contact the Customer Success Team.
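The distinction between temporary and persistent congestion can be made concrete with a small helper. The sketch below is only an illustration, not part of WEKA: the function name `is_persistent`, the thresholds `min_duration` and `max_gap`, and the idea of feeding it a list of event timestamps are all assumptions chosen for the example. It treats congestion as persistent when events keep recurring, without a long quiet gap, for at least a given duration.

```python
from datetime import datetime, timedelta


def is_persistent(timestamps,
                  min_duration=timedelta(minutes=15),
                  max_gap=timedelta(minutes=2)):
    """Return True if congestion events recur continuously for min_duration.

    timestamps: datetimes at which a congestion event was observed.
    min_duration and max_gap are hypothetical thresholds, not WEKA defaults.
    """
    ordered = sorted(timestamps)
    if not ordered:
        return False

    run_start = ordered[0]  # start of the current uninterrupted run of events
    previous = ordered[0]
    for ts in ordered[1:]:
        if ts - previous > max_gap:
            # A quiet period longer than max_gap ends the run; start a new one.
            run_start = ts
        if ts - run_start >= min_duration:
            return True
        previous = ts
    return False


if __name__ == "__main__":
    base = datetime(2024, 1, 1, 12, 0, 0)
    burst = [base, base + timedelta(minutes=1)]
    sustained = [base + timedelta(minutes=i) for i in range(20)]
    print(is_persistent(burst))      # short burst, likely transient
    print(is_persistent(sustained))  # recurring for 19 minutes
```

A short burst that clears on its own is reported as not persistent, while a run of events that continues past the chosen duration is flagged as a candidate for expanding cluster resources. Suitable thresholds depend on the workload and would need tuning.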

System congestion events and alerts

The WEKA system can issue several types of congestion events and alerts:

| Type | Description | Actions |
| --- | --- | --- |
| FIBERS | Extreme load of concurrent system operations on a process. | This is typically a transient condition caused by system load. If the load is persistent, consider adding more resources, such as servers or cores. See Add a backend server or Add CPU cores to a container for guidance. |
| DESTAGER | Excessive pending I/O operations waiting to be written for a specific process. | This situation is often temporary due to system load. If the condition persists, increase the number of servers in the cluster. See Add a backend server or Expand specific resources of a container for guidance. |
| SSD | An excessive number of pending I/O operations to the SSD. | If only one SSD is affected, it may be faulty and require replacement. If multiple SSDs are involved, the system load is too high. To manage the load, add more SSDs to the system. See Expand specific resources of a container for guidance. |
| RAID_NOT_OK | I/O failures exceed the system's handling capacity, and I/Os cannot be processed. | Ensure all servers are operational. If any server is down, bring it up. If all servers are active and the issue persists, contact the Customer Success Team. |
| XDESTAGE | Auxiliary cluster resources are low. | This is usually a temporary condition due to system load. If the problem persists, add more servers to the cluster. See Add a backend server for guidance or contact the Customer Success Team. |
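For runbooks or alert handlers, the congestion types and actions listed above can be encoded as a simple lookup. The following is a minimal sketch under stated assumptions: the names `CONGESTION_GUIDANCE` and `recommended_action` are hypothetical, the action strings are condensed summaries of the guidance on this page, and the code does not call any WEKA API or CLI.

```python
# Hypothetical lookup of congestion types to condensed guidance from this page.
CONGESTION_GUIDANCE = {
    "FIBERS": (
        "Extreme load of concurrent system operations on a process.",
        "Usually transient. If persistent, add servers or CPU cores.",
    ),
    "DESTAGER": (
        "Excessive pending I/O operations waiting to be written for a process.",
        "Often temporary. If persistent, add servers to the cluster.",
    ),
    "SSD": (
        "An excessive number of pending I/O operations to the SSD.",
        "Single SSD: it may be faulty, consider replacement. "
        "Multiple SSDs: load is too high, add more SSDs.",
    ),
    "RAID_NOT_OK": (
        "I/O failures exceed the system's handling capacity.",
        "Ensure all servers are up. If the issue persists, "
        "contact the Customer Success Team.",
    ),
    "XDESTAGE": (
        "Auxiliary cluster resources are low.",
        "Usually temporary. If persistent, add servers or "
        "contact the Customer Success Team.",
    ),
}


def recommended_action(congestion_type):
    """Return the condensed action text for a congestion type.

    Unknown types fall back to contacting the Customer Success Team.
    """
    entry = CONGESTION_GUIDANCE.get(congestion_type.strip().upper())
    if entry is None:
        return "Unknown congestion type. Contact the Customer Success Team."
    _description, action = entry
    return action


if __name__ == "__main__":
    for name in ("FIBERS", "ssd", "OTHER"):
        print(f"{name}: {recommended_action(name)}")
```

Keeping the mapping as plain data makes it easy to review against this page whenever the guidance changes.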
