LWH stats: sizing and performance optimization

Size and tune Local WEKA Home (LWH) stats components for high throughput by configuring resource allocations based on cluster scale. Proper sizing prevents stats stream saturation, processing delays, and NATS backpressure.

Stats workload principles

Statistical data volume in large clusters can exceed default configurations. When stats workers fail to process incoming data quickly, the stats stream reaches its 3 GiB capacity, causing NATS to reject new messages.

Effective sizing requires a clear understanding of the primary components:

  • api.stats: Manages the ingestion and exposure of statistical data.

  • workers.stats: Performs heavy processing of statistics. This component is typically the primary bottleneck in large environments.

  • workers.forwarding: Handles the transmission of processed data. These processes require fewer CPU resources but still scale with the cluster size.

Load scales linearly with the number of unique (host_id, node_id) metric pairs.

workers.stats capacity references

Use these values to determine the necessary CPU resources for a cluster.

| Metric | Theoretical maximum | Recommended safe value |
| --- | --- | --- |
| Pairs per 1 CPU core | 750 | 550 |
| Target utilization | 100% | 70% |

Sizing by cluster scale

| Cluster size | Unique pairs | Estimated CPU | Recommended number of pods |
| --- | --- | --- | --- |
| Small | Up to 1,500 | 2 cores | 1 |
| Medium | 1,500 to 5,000 | 2 to 8 cores | 1 to 2 |
| Large | 5,000 to 10,000 | 8 to 14 cores | 2+ |

Calculate required replicas

Determine the required number of pod replicas in a specific environment using the following formulas.

Prerequisites

  • Identify the total number of unique (host_id, node_id) pairs in the cluster.

  • Define the CPU limit per pod.

Procedure

  1. Calculate the required CPU cores.

Required_CPU = Number_of_pairs / Pairs_per_1_CPU_core

  2. Calculate the required replicas based on the pod CPU limit.

Required_replicas = Required_CPU / CPU_limit_per_pod

Example

For a cluster with 10,000 unique pairs and a limit of 16 CPU cores per pod:

  1. Required CPU cores: 10,000 / 750 ≈ 13.3 cores.

  2. Required replicas: 13.3 / 16 ≈ 1 (maximum utilization).

To stay within the 70% safe utilization target, size with the safe value instead: 10,000 / 550 ≈ 18.2 cores, which rounds up to 2 replicas at a 16-core limit per pod.
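The calculation above can be scripted. The following is a minimal sketch using awk; the pair count, pairs-per-core value, and pod CPU limit are the example's inputs, not fixed defaults.

```shell
# Sizing sketch: compute required CPU and replicas from the formulas above.
# Inputs match the example at the safe value: 10,000 pairs, 550 pairs/core, 16-core pod limit.
pairs=10000
pairs_per_core=550
cpu_limit=16

awk -v p="$pairs" -v ppc="$pairs_per_core" -v lim="$cpu_limit" 'BEGIN {
  cpu = p / ppc                  # Required_CPU = Number_of_pairs / Pairs_per_1_CPU_core
  rep = int(cpu / lim)           # Required_replicas = Required_CPU / CPU_limit_per_pod,
  if (rep * lim < cpu) rep++     # rounded up to the next whole replica
  printf "Required CPU: %.1f cores\n", cpu
  printf "Required replicas: %d\n", rep
}'
```

With the example inputs this prints 18.2 required cores and 2 replicas, matching the safe-value result above.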

Configure resource overrides for high stats throughput

Modify the Helm configuration for either Kubernetes (K8s) or K3s to support high stats throughput by defining resource requests, limits, and autoscaling parameters.

  • K8s Helm values: Manage performance tuning in K8s environments through a values.yaml file. Define overrides within the api and workers sections to govern resources for the entire cluster. Use the Helm CLI to apply these settings and update the deployment state.

  • K3s configuration JSON: Manage performance tuning in K3s environments, typically running on a WEKA Management Station (WMS) or a dedicated server. Define overrides within the helmOverrides block of the /opt/wekahome/config/config.json file. The homecli local upgrade command ingests this JSON to apply the specified CPU and memory limits to the local containers.

Example for K8s: api and workers sections with default values

Example for K3s: helmOverrides section with default values
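The K8s overrides described above can be sketched as an illustrative values.yaml fragment. The key names below (stats, forwarding, resources, autoscaling) follow common Helm chart conventions and are assumptions, not the chart's confirmed schema; verify the exact structure and default values against your installed LWH chart before applying.

```yaml
# Illustrative values.yaml fragment for high stats throughput.
# Key names and values are assumed examples; check your chart's
# default values for the authoritative structure.
workers:
  stats:
    resources:
      requests:
        cpu: "4"
        memory: 4Gi
      limits:
        cpu: "8"
        memory: 8Gi
    autoscaling:
      enabled: true
      minReplicas: 2   # baseline for stability
      maxReplicas: 4   # headroom for traffic bursts
  forwarding:
    resources:
      requests:
        cpu: "1"
      limits:
        cpu: "2"
api:
  stats:
    resources:
      limits:
        cpu: "2"
        memory: 2Gi
```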

Before you begin

  • Calculate the required resources based on the sizing formulas provided in the Calculate required replicas section.

  • Ensure the Helm CLI is configured with the correct cluster context and namespace permissions.

Procedure

  1. Open the configuration file depending on your LWH environment:

    • K8s: Update the api and workers sections in the values.yaml file.

    • K3s: Update the helmOverrides section in the config.json file.

  2. Define the resources and autoscaling blocks for api.stats, workers.stats, and workers.forwarding.

  3. Set the minReplicas to a baseline value that ensures stability and the maxReplicas to a level that accounts for traffic bursts.

  4. Apply the configuration:

    • K8s: Run the helm upgrade command specifying your values file.

    • K3s: Run homecli local upgrade.
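As an illustration of the K3s path in the procedure above, a helmOverrides fragment in /opt/wekahome/config/config.json might look like the following. The nested key names are assumptions based on common Helm value conventions, not the confirmed LWH schema; confirm them against your installed chart before editing the file.

```json
{
  "helmOverrides": {
    "workers": {
      "stats": {
        "resources": {
          "limits": { "cpu": "8", "memory": "8Gi" }
        },
        "autoscaling": { "minReplicas": 2, "maxReplicas": 4 }
      }
    }
  }
}
```

After saving the file, run homecli local upgrade so the specified CPU and memory limits are applied to the local containers.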

Related topics

Upgrade the Local WEKA Home

Operational maintenance

Monitor the environment to ensure performance remains within expected limits:

  • Track stats worker CPU usage and queue depth.

  • Monitor the stats stream size and message backlog.

  • Increase replicas before increasing memory if CPU saturation occurs.

  • Verify Horizontal Pod Autoscaler (HPA) behavior during peak load periods.
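In a K8s deployment, the checks above can be performed with standard kubectl commands. The namespace and label selector below are placeholders, not values confirmed by this document; substitute the ones used by your LWH installation.

```shell
# Placeholders: replace "home" and the label selector with your deployment's values.
kubectl top pods -n home                              # stats worker CPU/memory usage
kubectl get hpa -n home                               # HPA current vs. desired replicas
kubectl logs -n home -l app=workers-stats --tail=50   # recent worker activity
```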
