
List of alerts and corrective actions

Check WEKA system alerts and take necessary actions based on severity and nature.

Alert name
Description
Corrective actions

AdminDefaultPassword

Default admin password in use

Change the admin user password to ensure only authorized users can access the cluster.

AgentNotRunning

The local agent is not running

Restart the local agent on the specified server using the command 'service weka-agent start'.

ApproachingClientsUnavailability

Approaching connected clients limit

Ensure all backend containers are up or expand the cluster with more backend containers or servers.

ApproachingSystemLimit

Approaching a system limit

See the corrective action in the alert.

AutoRemoveTimeoutTooLow

Stateless Client auto-remove timeout too low

Remount the host with a higher auto-remove timeout value.

BackendNumaBalancingEnabled

NUMA balancing is enabled on a backend server

Disable automatic NUMA balancing by running the command 'echo 0 > /proc/sys/kernel/numa_balancing' on the backend server.
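As a minimal sketch, the check and the fix for BackendNumaBalancingEnabled can be wrapped in a small shell function. The function name disable_numa_balancing and its optional path argument are illustrative assumptions, not part of the WEKA CLI; the path defaults to the kernel setting named in the corrective action. Writing to it requires root privileges, and a value written under /proc/sys applies only until the next reboot.

```shell
# Hypothetical helper (not a WEKA command): report and disable automatic NUMA balancing.
# The optional argument overrides the kernel path, which is useful for dry runs.
disable_numa_balancing() {
    numa_file="${1:-/proc/sys/kernel/numa_balancing}"
    if [ ! -e "$numa_file" ]; then
        echo "not supported: $numa_file is missing"
        return 2
    fi
    current="$(cat "$numa_file")"
    if [ "$current" = "0" ]; then
        echo "already disabled"
        return 0
    fi
    # Same action as the documented command; needs root on a real server.
    echo 0 > "$numa_file" && echo "disabled (was $current)"
}
```

Calling disable_numa_balancing with no argument on a backend server acts on the real kernel setting; passing a scratch file lets you rehearse the logic without touching the server.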

BackendVersionsMismatch

Backends mismatch cluster version

Upgrade all the backends to match the cluster's version.

BadDisksCapacityRatio

Bad ratio between smallest and biggest drive

Replace drives so that there is no significant difference in drive sizes.

BlockedJrpcMethod

JRPC method is blocked

Unblock the JRPC method by running the 'blocked_jrpc_methods_remove' or 'blocked_jrpc_methods_clear' manhole (internal maintenance command).

BondInterfaceCompromised

Network high availability interface compromised

Ensure the proper operation of the network configuration, cables, and NICs.

BucketCapacityExhausting

Buckets are nearing exhaustion of their maximum capacity

Consider migration to a cluster with more buckets.

BucketHasNoQuorum

Too many compute processes are down

Ensure the compute processes on the containers {hosts} are up and running and connected. If the issue is not resolved, contact the Customer Success Team.

BucketUnresponsive

Compute resource failure

Check the connectivity and status of the drives of the leader container and ensure the compute processes are running and connected. If the issue is not resolved, contact the Customer Success Team.

CPUFrequentStarvation

Frequent CPU starvation detected in the last minute

Check the logs of the relevant containers for potential hardware or core allocation problems.

CPUStarvation

CPU starvation detected in the last minute

Check the logs of the relevant containers for potential problems. To convert the reported addresses into symbol names, run /weka/weka_addr2line within the reported WEKA container.

CWTaskAbortionStuck

CWTask abortion stuck

Start IO to allow the task to complete aborting.

ChokingDetected

High congestion level

For more information, see the System congestion topic in the documentation.

ClientVersionsMismatch

Clients mismatch cluster version

Upgrade the clients to the same version as the cluster by running 'weka local upgrade' locally.

ClockSkew

Clock skew on server

Ensure the NTP is configured correctly on the containers and that their clocks are synchronized.

CloudHealth

WEKA Home disconnected

Check that the server has Internet connectivity and is connected to the WEKA Home. See the WEKA Home - The WEKA support cloud topic in the documentation.

CloudStatsError

Statistics upload failed

See the event details in the System Events.

ClusterInitializationError

Cluster initialization error

Search for the underlying problem causing the error and act accordingly to start IO operations. To clear this alert, run 'weka cluster stop-io'.

ClusterIsUpgrading

Cluster is upgrading

If the upgrade doesn't finish successfully, contact the Customer Success Team.

CoreOverlapping

Core Overlapping

Contact the Customer Success Team.

DataIntegrity

Data integrity problem found

Contact the Customer Success Team.

DataProtection

Partial data protection

Check which process, container, or drive is down and act accordingly.

DedicatedWatchdog

A dedicated server requires the installation of a hardware watchdog driver.

Ensure a hardware watchdog driver is available at /dev/watchdog. For details, search the Knowledge Base in the WEKA support portal.
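The DedicatedWatchdog check can be sketched as a one-line test for a character device at the path given above. The function name check_watchdog and its optional path argument are illustrative assumptions, not part of the WEKA CLI.

```shell
# Hypothetical helper (not a WEKA command): verify that a watchdog device node exists.
# The optional argument overrides the device path for testing.
check_watchdog() {
    dev="${1:-/dev/watchdog}"
    if [ -c "$dev" ]; then
        echo "watchdog device present: $dev"
        return 0
    fi
    echo "watchdog device missing: $dev"
    return 1
}
```

A non-zero return value indicates that no character device exists at the path, which usually means the hardware watchdog driver is not loaded.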

DriveCriticalWarnings

Drive critical warnings

Deactivate the drive using the command 'weka cluster drive deactivate' and replace it.

DriveDown

Drive down

Contact the Customer Success Team to check if the drive requires a replacement.

DriveEndurancePercentageUsed

Drive exceeds its life expectancy

Replace the specified drive before it fails.

DriveEnduranceSparesRemaining

Drive internal spares run too low

Replace the specified drive before it fails.

DriveNVKVRunningLow

Drive nearing exhaustion of internal resources

Contact the Customer Success Team.

DriveNeedsPhaseout

A drive has too many errors

Deactivate the drive using the command 'weka cluster drive deactivate', and replace it if necessary.

ExampleAlert

Example Alert

Disable this alert by running the set_example_alert_off manhole (internal maintenance command).

ExceptionsDuringAlertsEvaluation

Exceptions thrown during alerts evaluation

Check the Assertion failures events, which may reveal the source of the problem. Contact the Customer Success Team if help is required.

FaultsEnabled

Faults are enabled

Contact the Customer Success Team.

FilesystemHasTooManyFiles

Insufficient SSD capacity for metadata on the filesystem

Consider expanding the filesystem size or removing data and directories. If you have previously configured max-files settings, contact the Customer Success Team for assistance.

FilesystemsThinProvisioningLowSpace

Filesystems thin provisioning low space

Consider adding SSD capacity to the organization containing these filesystems.

FilesystemsThinProvisioningReserveReached

Filesystems thin provisioning capacity reserve reached

You can create a filesystem or expand the filesystem capacity using the reserved capacity.

HangingCacheSync

Cache sync is hanging

Consider using the command 'weka debug fs drop-dirty-cache' to drop the cache and enable other clients to access the file (unsynchronized writes will be lost).

HangingClusterTasks

Cluster background task progress is hanging

Contact the Customer Success Team.

HangingIos

Some IOs stop responding

Ensure the compute processes are up and running and connected. If a backend object store is configured, ensure it is connected and responsive. If the issue is not resolved, contact the Customer Success Team.

HighDrivesCapacity

SSD capacity overflow

Free up space on the SSDs or add more SSDs to the cluster. To add SSDs, see the Expand specific resources of a container topic in the documentation.

HighLevelOfUnreclaimedCapacityInObjectStore

High level of unreclaimed space in an object store

HotspotInodes

Some files have a long waiting queue for IOs

Contact the Customer Success Team to help with resolution.

IBNotEnhanced

Enhanced IB mode disabled

Contact the Customer Success Team to correct this issue.

ImbalancedCpuUsage

Imbalanced CPU usage detected in cluster processes

Examine the system configuration for abnormalities causing the CPU usage imbalance.

JumboConnectivity

A container cannot send jumbo frames

Check the container network settings and the switch to which the container is connected, and ensure jumbo frames are enabled. This setting improves performance.
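One way to confirm that jumbo frames pass end to end is to send pings that are too large for a standard frame and must not be fragmented. The sketch below only computes the payload size for such a ping. The function name jumbo_ping_payload, the default MTU of 9000, and the TARGET_IP placeholder are assumptions for illustration; use the MTU actually configured on your network.

```shell
# Hypothetical helper (not a WEKA command): largest ICMP payload that fits in one IPv4 frame.
# 20 bytes of IPv4 header plus 8 bytes of ICMP header are subtracted from the MTU.
jumbo_ping_payload() {
    mtu="${1:-9000}"   # 9000 is an assumed jumbo MTU, not a value taken from this page
    echo $((mtu - 28))
}

# Example (not executed here): send non-fragmentable pings sized for the assumed MTU.
# TARGET_IP is a placeholder for the address of another server on the same network.
#   ping -M do -c 3 -s "$(jumbo_ping_payload 9000)" "$TARGET_IP"
```

If the ping fails with a "message too long" error while smaller pings succeed, a NIC or switch port on the path is likely not configured for jumbo frames.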

KMSError

KMS Error

Review the KMS configuration and connectivity.

LeaderPreparedForUpgrade

Leader prepared for upgrade

The leader state automatically returns to normal after the upgrade. If this alert persists, contact the Customer Success Team.

LegacyManualOverridesActive

Legacy manual overrides are active

Contact the Customer Success Team.

LicenseError

License error

Ensure the cluster uses the correct license, the license has not expired, and the allocated space does not exceed the license limits.

LowDiskSpace

Low disk space

See the event details in the System Events.

ManualOverridesActive

Manual overrides are active

Contact the Customer Success Team.

ManualOverridesForced

Manual overrides are forced

Contact the Customer Success Team.

MismatchedDriveFailureDomain

A drive failure domain does not match the failure domain of its attached container

Do one of the following: a) Connect the mismatched drive to a container with a matching failure domain. b) Re-provision the drive to erase its failure domain.

NegativeUnprovisionedCapacity

Negative unprovisioned capacity

Resize one or more filesystems to reclaim capacity. For more information, contact the Customer Success Team.

NetworkInterfaceLinkDown

Network interface link status down

Check the connectivity to the specified network interface. Verify that nothing blocks it.

NfsLocksDisabled

NFS Locks disabled

Configure config fs using weka nfs global-config set --config-fs=.

NfsServiceDownAlert

NFS Service Down

If down services persist, contact the Customer Success Team.

NoCgroupsConfigured

No cgroups configured warnings

Take measures to enable cgroups if possible.

NoClusterLicense

No license assigned

Obtain and install a license from get.weka.io.

NodeBlacklisted

A process cannot rejoin the cluster

To allow the process to rejoin the cluster, run the command 'weka debug blacklist disable'.

NodeDisconnected

Process disconnected

Check network connectivity to ensure the processes can communicate with the cluster.

NodeNetworkUnstable

A process with an unstable network detected

Ensure proper network connectivity in the cluster. If the problem is not resolved, contact the Customer Success Team.

NodeRDMANotActive

A process with supported RDMA is inactive

Ensure Mellanox OFED version 4.6 or later is installed on the server and at least one RDMA-capable device exists.

NodeTieringConnectivity

A process cannot connect to an object store

Check the connectivity with the object store and ensure the process communicates with it.

NonTlsApisAllowed

Non-TLS APIs are allowed

Update TLS strictness to enforce encrypted TLS APIs over HTTP.

NotEnoughActiveDrives

Reduced data protection

Check the connectivity and server status. Replace failed drives and expand the cluster with new failure domains.

NotEnoughMemoryForFilesystemOperation

Insufficient cluster-wide RAM for proper filesystem operation

Increase the cluster-wide RAM to meet the filesystem (RAID) RAM requirements, or remove drives contributing to SSD capacity.

NotEnoughSSDCapacity

Some provisioned capacity is unavailable due to failed drives

Check for down drives.

PartialConnectivityTrackingDisabled

Partial connectivity tracking is disabled

To turn on the Grim Reaper, contact the WEKA Support Team.

PartiallyConnectedNode

A partially connected process detected

Ensure proper network connectivity in the cluster. If the problem is not resolved, contact the Customer Success Team.

PassedClientsAvailabilityThreshold

Reached connected clients limit

Add more backend containers or servers to the cluster, check whether the backends are down, or disconnect some clients.

PathsDegraded

Degraded Paths

Contact the Customer Success Team to review path connectivity.

PerformanceDegradedLowRAM

Server low RAM

Ensure all the compute processes are up. Add more servers to the cluster or add RAM to the backend servers.

QuotasHardLimitReached

Directory quota hard limit exceeded

Run 'weka fs quota list' to get the list of directories exceeding their hard quota limits. Clear some space for these directories or increase their hard quota limit.

QuotasSoftLimitReached

Directory quota soft limit exceeded

Run 'weka fs quota list' to get the list of directories exceeding their soft quota limits. Clear some space for these directories or increase their soft quota limit.

RAIDCapacityExhaustion

RAID capacity exhaustion

If the situation is not resolved within minutes, contact the Customer Success Team.

RequestedActionFailure

Requested action failure

Check the logs for more information.

ResourcesNotApplied

Resource changes are not applied

Apply the resource changes by running the command 'weka cluster container apply '.

S3EtcdMigrationAlert

S3 etcd migration

Contact the Customer Success Team to migrate this cluster configuration storage from etcd to the new built-in WEKA solution.

SSDCapacityDiscrepancy

Used SSD capacity mismatches the expected range

Monitor the compute processes' stability and contact the Customer Success Team.

SSDCapacityTooHigh

Available capacity cannot be fully utilized

For improved SSD capacity usage, contact the Customer Success Team for assistance.

SystemDefinedTLS

TLS certificate is not user-defined

Replace the auto-generated self-signed certificate with a user-defined certificate by running the command 'weka security tls set'.

TLSCertificateExpired

TLS certificate expired

Replace the existing certificate by running the command 'weka security tls set'.

TLSCertificateExpiresSoon

TLS certificate is about to expire

Replace the existing certificate by running the command 'weka security tls set'.
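Before installing a replacement certificate with 'weka security tls set', it can help to confirm that the new PEM file is not itself close to expiry. The following is a minimal sketch using the OpenSSL command-line tool, which is assumed to be installed. The function name cert_expires_within and the 30-day default window are illustrative assumptions, not part of the WEKA CLI.

```shell
# Hypothetical helper (not a WEKA command): check a PEM certificate with OpenSSL.
# Usage: cert_expires_within <cert.pem> [days]
# Returns 0 when the certificate expires within the window (or has already expired),
# 1 when it stays valid beyond the window, and 2 when the file cannot be read.
cert_expires_within() {
    cert="$1"
    days="${2:-30}"
    if [ ! -r "$cert" ]; then
        return 2
    fi
    # 'openssl x509 -checkend N' exits 0 if the certificate is still valid N seconds from now.
    if openssl x509 -checkend $((days * 86400)) -noout -in "$cert" >/dev/null 2>&1; then
        return 1
    fi
    return 0
}
```

For example, 'cert_expires_within new-cert.pem 60 && echo "renew soon"' prints a reminder only when the file (a placeholder name here) expires within 60 days.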

TelemetryStatusFault

Telemetry status is not streaming

Check your telemetry sinks configuration.

TieredFilesystemOverfillingSSD

Tiered filesystems' SSD capacity overfilling

To address this issue, consider expanding the filesystem size or removing data and directories. Identify and resolve connectivity problems with the configured Object Store and increase the upload bandwidth if required.

TraceDumperDown

Trace dumper is down

Contact the Customer Success Team to restart the trace dumper.

TracesDisabled

Traces are disabled

To turn on the cluster traces, run the command 'weka debug traces start'. For more information, see the Traces management topic in the documentation.

TracesFreezePeriodActive

Freeze traces feature is active

If the problem persists after the case is resolved, contact the Customer Success Team.

UdpModePerformanceWarning

A backend container is configured in UDP mode

If this is a misconfiguration, add network devices to the specified backend container using the command 'weka cluster container net add'.

UnwritableDisksConfigured

A drive is set to unwritable

If the drive remains unwritable after maintenance, contact the Customer Success Team.
