List of alerts and corrective actions
Review WEKA system alerts and take the corrective action appropriate to each alert's severity and nature.
Alert name | Description | Corrective actions |
---|---|---|
AdminDefaultPassword | Default admin password in use | Change the admin user password to ensure only authorized users can access the cluster. |
AgentNotRunning | The local agent is not running | Restart the local agent on the specified server using the command 'service weka-agent start'. |
ApproachingClientsUnavailability | Approaching connected clients limit | Ensure all backend containers are up or expand the cluster with more backend containers or servers. |
ApproachingSystemLimit | Approaching a system limit | See the corrective action in the alert. |
AutoRemoveTimeoutTooLow | Stateless Client auto-remove timeout too low | Remount the host with a higher auto-remove timeout value. |
BackendNumaBalancingEnabled | NUMA balancing is enabled on a backend server | Disable automatic NUMA balancing by running the command 'echo 0 > /proc/sys/kernel/numa_balancing' on the backend server. |
BackendVersionsMismatch | Backends mismatch cluster version | Upgrade all the backends to match the cluster's version. |
BadDisksCapacityRatio | Bad ratio between smallest and biggest drive | Replace drives so that drive sizes do not differ significantly. |
BlockedJrpcMethod | JRPC method is blocked | Unblock the JRPC method by running the 'blocked_jrpc_methods_remove' or 'blocked_jrpc_methods_clear' manhole. |
BondInterfaceCompromised | Network high availability interface compromised | Ensure the proper operation of the network configuration, cables, and NICs. |
BucketCapacityExhausting | Buckets are nearing exhaustion of their maximum capacity | Consider migration to a cluster with more buckets. |
BucketHasNoQuorum | Too many compute processes are down | Ensure the compute processes on the specified containers are up and running and connected. If the issue is not resolved, contact the Customer Success Team. |
BucketUnresponsive | Compute resource failure | Check the connectivity and status of the drives of the leader container and ensure the compute processes are running and connected. If the issue is not resolved, contact the Customer Success Team. |
CPUFrequentStarvation | Frequent CPU starvation detected in the last minute | Check the logs of the relevant containers for potential hardware or core allocation problems. |
CPUStarvation | CPU starvation detected in the last minute | Check the logs of the relevant containers for potential problems. To convert the reported addresses into symbol names, run /weka/weka_addr2line within the reported WEKA container. |
CWTaskAbortionStuck | CWTask abortion stuck | Start IO to allow the task to complete aborting. |
ChokingDetected | High congestion level | For more information, see the System congestion topic in the documentation. |
ClientVersionsMismatch | Clients mismatch cluster version | Upgrade the clients to the same version as the cluster by running 'weka local upgrade' locally. |
ClockSkew | Clock skew on server | Ensure the NTP is configured correctly on the containers and that their clocks are synchronized. |
CloudHealth | WEKA Home disconnected | Check that the server has Internet connectivity and is connected to WEKA Home. See the WEKA Home - The WEKA support cloud topic in the documentation. |
CloudStatsError | Statistics upload failed | See the event details in the System Events. |
ClusterInitializationError | Cluster initialization error | Search for the underlying problem causing the error and act accordingly to start IO operations. To clear this alert, run 'weka cluster stop-io'. |
ClusterIsUpgrading | Cluster is upgrading | If the upgrade doesn't finish successfully, contact the Customer Success Team. |
CoreOverlapping | Core Overlapping | Contact the Customer Success Team. |
DataIntegrity | Data integrity problem found | Contact the Customer Success Team. |
DataProtection | Partial data protection | Check which process, container, or drive is down and act accordingly. |
DedicatedWatchdog | A dedicated server requires the installation of a hardware watchdog driver | Ensure a hardware watchdog driver is available at /dev/watchdog. For details, search the Knowledge Base in the WEKA support portal. |
DriveCriticalWarnings | Drive critical warnings | Deactivate the drive using the command 'weka cluster drive deactivate' and replace it. |
DriveDown | Drive down | Contact the Customer Success Team to check if the drive requires a replacement. |
DriveEndurancePercentageUsed | Drive exceeds its life expectancy | Replace the specified drive before it fails. |
DriveEnduranceSparesRemaining | Drive internal spares run too low | Replace the specified drive before it fails. |
DriveNVKVRunningLow | Drive nearing exhaustion of internal resources | Contact the Customer Success Team. |
DriveNeedsPhaseout | A drive has too many errors | Deactivate the drive using the command 'weka cluster drive deactivate', and replace it if necessary. |
ExampleAlert | Example Alert | Disable this alert by running the set_example_alert_off manhole (internal maintenance command). |
ExceptionsDuringAlertsEvaluation | Exceptions thrown during alerts evaluation | Check the Assertion failures events, which may reveal the source of the problem. Contact the Customer Success Team if help is required. |
FaultsEnabled | Faults are enabled | Contact the Customer Success Team. |
FilesystemHasTooManyFiles | Insufficient SSD capacity for metadata on the filesystem | Consider expanding the filesystem size or removing data and directories. If you have previously configured max-files settings, contact the Customer Success Team for assistance. |
FilesystemsThinProvisioningLowSpace | Filesystems thin provisioning low space | Consider adding SSD capacity to the organization containing these filesystems. |
FilesystemsThinProvisioningReserveReached | Filesystems thin provisioning capacity reserve reached | You can create a filesystem or expand the filesystem capacity using the reserved capacity. |
HangingCacheSync | Cache sync is hanging | Consider using |
HangingClusterTasks | Cluster background task progress is hanging | Contact the Customer Success Team. |
HangingIos | Some IOs stop responding | Ensure the compute processes are up and running and connected. If a backend object store is configured, ensure it is connected and responsive. If the issue is not resolved, contact the Customer Success Team. |
HighDrivesCapacity | SSD capacity overflow | Free up space on the SSDs or add more SSDs to the cluster. To add SSDs, see the Expand specific resources of a container topic in the documentation. |
HighLevelOfUnreclaimedCapacityInObjectStore | High level of unreclaimed space in an object store | |
HotspotInodes | Some files have a long waiting queue for IOs | Contact the Customer Success Team to help with resolution. |
IBNotEnhanced | Enhanced IB mode disabled | Contact the Customer Success Team to correct this issue. |
ImbalancedCpuUsage | Imbalanced CPU usage detected in cluster processes | Examine the system configuration for abnormalities causing the CPU usage imbalance. |
JumboConnectivity | A container cannot send jumbo frames | Check the container network settings and the switch to which the container is connected, and ensure jumbo frames are enabled. This setting improves performance. |
KMSError | KMS Error | Review the KMS configuration and connectivity. |
LeaderPreparedForUpgrade | Leader prepared for upgrade | The leader state automatically returns to normal after the upgrade. If this alert persists, contact the Customer Success Team. |
LegacyManualOverridesActive | Legacy manual overrides are active | Contact the Customer Success Team. |
LicenseError | License error | Ensure the cluster uses the correct license, the license has not expired, and the allocated space does not exceed the license limits. |
LowDiskSpace | Low disk space | See the event details in the System Events. |
ManualOverridesActive | Manual overrides are active | Contact the Customer Success Team. |
ManualOverridesForced | Manual overrides are forced | Contact the Customer Success Team. |
MismatchedDriveFailureDomain | A drive failure domain does not match the failure domain of its attached container | Do one of the following: a) Connect the mismatched drive to a container with a matching failure domain. b) Re-provision the drive to erase its failure domain. |
NegativeUnprovisionedCapacity | Negative unprovisioned capacity | Resize one or more filesystems to reclaim capacity. For more information, contact the Customer Success Team. |
NetworkInterfaceLinkDown | Network interface link status down | Check the connectivity to the specified network interface. Verify that nothing blocks it. |
NfsLocksDisabled | NFS locks disabled | Configure the config filesystem by running the command 'weka nfs global-config set --config-fs='. |
NfsServiceDownAlert | NFS Service Down | If down services persist, contact the Customer Success Team. |
NoCgroupsConfigured | No cgroups configured warnings | Take measures to enable cgroups if possible. |
NoClusterLicense | No license assigned | Obtain and install a license from get.weka.io. |
NodeBlacklisted | A process cannot rejoin the cluster | To enable the process to rejoin the cluster, allow it by running the command 'weka debug blacklist disable'. |
NodeDisconnected | Process disconnected | Check network connectivity to ensure the processes can communicate with the cluster. |
NodeNetworkUnstable | A process with an unstable network detected | Ensure proper network connectivity in the cluster. If the problem is not resolved, contact the Customer Success Team. |
NodeRDMANotActive | A process with supported RDMA is inactive | Ensure Mellanox OFED version 4.6 or later is installed on the server and at least one RDMA-capable device exists. |
NodeTieringConnectivity | A process cannot connect to an object store | Check the connectivity with the object store and ensure the process communicates with it. |
NonTlsApisAllowed | Non-TLS APIs are allowed | Update TLS strictness to enforce encrypted TLS APIs over HTTP. |
NotEnoughActiveDrives | Reduced data protection | Check the connectivity and server status. Replace failed drives and expand the cluster with new failure domains. |
NotEnoughMemoryForFilesystemOperation | Insufficient cluster-wide RAM for proper filesystem operation | Increase cluster-wide RAM to meet the filesystems' (RAID) RAM requirements, or remove drives contributing to SSD capacity. |
NotEnoughSSDCapacity | Some provisioned capacity is unavailable due to failed drives | Check for down drives. |
PartialConnectivityTrackingDisabled | Partial connectivity tracking is disabled | To turn on the Grim Reaper, contact the WEKA Support Team. |
PartiallyConnectedNode | A partially connected process detected | Ensure proper network connectivity in the cluster. If the problem is not resolved, contact the Customer Success Team. |
PassedClientsAvailabilityThreshold | Reached connected clients limit | Add more backend containers or servers to the cluster, check whether the backends are down, or disconnect some clients. |
PathsDegraded | Degraded Paths | Contact the Customer Success Team to review path connectivity. |
PerformanceDegradedLowRAM | Server low RAM | Ensure all the compute processes are up. Add more servers to the cluster or add RAM to the backend servers. |
QuotasHardLimitReached | Directory quota hard limit exceeded | Run 'weka fs quota list' to get the list of directories exceeding their hard quota limits. Clear some space for these directories or increase their hard quota limit. |
QuotasSoftLimitReached | Directory quota soft limit exceeded | Run 'weka fs quota list' to get the list of directories exceeding their soft quota limits. Clear some space for these directories or increase their soft quota limit. |
RAIDCapacityExhaustion | RAID capacity exhaustion | If the situation is not resolved within minutes, contact the Customer Success Team. |
RequestedActionFailure | Requested action failure | Check the logs for more information. |
ResourcesNotApplied | Resource changes are not applied | Apply the resource changes by running the command 'weka cluster container apply '. |
S3EtcdMigrationAlert | S3 etcd migration | Contact the Customer Success Team to migrate this cluster's configuration storage from etcd to the new built-in WEKA solution. |
SSDCapacityDiscrepancy | Used SSD capacity mismatches the expected range | Monitor the compute processes' stability and contact the Customer Success Team. |
SSDCapacityTooHigh | Available capacity cannot be fully utilized | For improved SSD capacity usage, contact the Customer Success Team for assistance. |
SystemDefinedTLS | TLS certificate is not user-defined | Replace the auto-generated self-signed certificate with a user-defined certificate by running the command 'weka security tls set'. |
TLSCertificateExpired | TLS certificate expired | Replace the existing certificate by running the command 'weka security tls set'. |
TLSCertificateExpiresSoon | TLS certificate is about to expire | Replace the existing certificate by running the command 'weka security tls set'. |
TelemetryStatusFault | Telemetry status is not streaming | Check your telemetry sinks configuration. |
TieredFilesystemOverfillingSSD | Tiered filesystems' SSD capacity overfilling | To address this issue, consider expanding the filesystem size or removing data and directories. Identify and resolve connectivity problems with the configured Object Store and increase the upload bandwidth if required. |
TraceDumperDown | Trace dumper is down | Contact the Customer Success Team to restart the trace dumper. |
TracesDisabled | Traces are disabled | To turn on the cluster traces, run the command 'weka debug traces start'. For more information, see the Traces management topic in the documentation. |
TracesFreezePeriodActive | Freeze traces feature is active | If the problem persists after the case is resolved, contact the Customer Success Team. |
UdpModePerformanceWarning | A backend container is configured in UDP mode | If this is a misconfiguration, add network devices to the specified backend container using the command 'weka cluster container net add'. |
UnwritableDisksConfigured | A drive is set to unwritable | If the drive remains unwritable after maintenance, contact the Customer Success Team. |
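
Two of the alerts above, BackendNumaBalancingEnabled and DedicatedWatchdog, describe conditions that can be checked locally on a Linux backend server. The following is a minimal Python sketch of such a check, not a WEKA tool. The function names (`numa_balancing_enabled`, `watchdog_available`, `preflight`) are hypothetical; the only things taken from the table are the two paths, /proc/sys/kernel/numa_balancing and /dev/watchdog, and the 'echo 0' command. It assumes that any value other than 0 in the NUMA balancing file means balancing is active.

```python
from pathlib import Path

# Paths named in the alert table; passed as parameters so they can be overridden.
NUMA_BALANCING_PATH = "/proc/sys/kernel/numa_balancing"
WATCHDOG_PATH = "/dev/watchdog"


def numa_balancing_enabled(path=NUMA_BALANCING_PATH):
    """Return True if NUMA balancing is on, False if off, None if unreadable."""
    try:
        value = Path(path).read_text().strip()
    except OSError:
        return None
    # Assumption: "0" means disabled; any other value is treated as enabled.
    return value != "0"


def watchdog_available(path=WATCHDOG_PATH):
    """Return True if a watchdog device node exists at the given path."""
    return Path(path).exists()


def preflight(numa_path=NUMA_BALANCING_PATH, watchdog_path=WATCHDOG_PATH):
    """Collect human-readable findings for the two locally checkable alerts."""
    findings = []
    state = numa_balancing_enabled(numa_path)
    if state is True:
        findings.append(
            "BackendNumaBalancingEnabled: run "
            "'echo 0 > /proc/sys/kernel/numa_balancing' on this server"
        )
    elif state is None:
        findings.append(f"Could not read {numa_path}; NUMA balancing state unknown")
    if not watchdog_available(watchdog_path):
        findings.append(f"DedicatedWatchdog: no watchdog device found at {watchdog_path}")
    return findings


if __name__ == "__main__":
    for finding in preflight():
        print(finding)
```

Run on a backend server, the script prints one line per finding and nothing when both checks pass. It only reports state; applying the corrective action from the table is left to the administrator.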