W E K A
4.2
4.2
  • WEKA v4.2 documentation
    • Documentation revision history
  • WEKA System Overview
    • Introduction
      • WEKA system functionality features
      • Converged WEKA system deployment
      • Optimize redundancy in WEKA deployments
    • SSD capacity management
    • Filesystems, object stores, and filesystem groups
    • WEKA networking
    • Data lifecycle management
    • WEKA client and mount modes
    • WEKA containers architecture overview
    • Glossary
  • Planning and Installation
    • Prerequisites and compatibility
    • WEKA cluster installation on bare metal servers
      • Plan the WEKA system hardware requirements
      • Obtain the WEKA installation packages
      • Install the WEKA cluster using the WMS with WSA
      • Install the WEKA cluster using the WSA
      • Manually install OS and WEKA on servers
      • Manually prepare the system for WEKA configuration
        • Broadcom adapter setup for WEKA system
        • Enable the SR-IOV
      • Configure the WEKA cluster using the WEKA Configurator
      • Manually configure the WEKA cluster using the resources generator
      • Perform post-configuration procedures
      • Add clients
    • WEKA installation on AWS
      • WEKA installation on AWS using Terraform
        • Terraform-AWS-WEKA module description
        • Deployment on AWS using Terraform
        • Required services and supported regions
        • Supported EC2 instance types using Terraform
        • WEKA cluster auto-scaling in AWS
        • Detailed deployment tutorial: WEKA on AWS using Terraform
      • WEKA installation on AWS using the Cloud Formation
        • Self-service portal
        • CloudFormation template generator
        • Deployment types
        • AWS Outposts deployment
        • Supported EC2 instance types using Cloud Formation
        • Add clients
        • Auto scaling group
        • Troubleshooting
    • WEKA installation on Azure
    • WEKA installation on GCP
      • WEKA project description
      • GCP-WEKA deployment Terraform package description
      • Deployment on GCP using Terraform
      • Required services and supported regions
      • Supported machine types and storage
      • Auto-scale instances in GCP
      • Add clients
      • Troubleshooting
  • Getting Started with WEKA
    • Manage the system using the WEKA GUI
    • Manage the system using the WEKA CLI
      • WEKA CLI hierarchy
      • CLI reference guide
    • Run first IOs with WEKA filesystem
    • Getting started with WEKA REST API
    • WEKA REST API and equivalent CLI commands
  • Performance
    • WEKA performance tests
      • Test environment details
  • WEKA Filesystems & Object Stores
    • Manage object stores
      • Manage object stores using the GUI
      • Manage object stores using the CLI
    • Manage filesystem groups
      • Manage filesystem groups using the GUI
      • Manage filesystem groups using the CLI
    • Manage filesystems
      • Manage filesystems using the GUI
      • Manage filesystems using the CLI
    • Attach or detach object store buckets
      • Attach or detach object store bucket using the GUI
      • Attach or detach object store buckets using the CLI
    • Advanced data lifecycle management
      • Advanced time-based policies for data storage location
      • Data management in tiered filesystems
      • Transition between tiered and SSD-only filesystems
      • Manual fetch and release of data
    • Mount filesystems
      • Mount filesystems from Single Client to Multiple Clusters (SCMC)
      • Manage authentication across multiple clusters with connection profiles
    • Snapshots
      • Manage snapshots using the GUI
      • Manage snapshots using the CLI
    • Snap-To-Object
      • Manage Snap-To-Object using the GUI
      • Manage Snap-To-Object using the CLI
    • Quota management
      • Manage quotas using the GUI
      • Manage quotas using the CLI
  • Additional Protocols
    • Additional protocol containers
    • Manage the NFS protocol
      • Supported NFS client mount parameters
      • Manage NFS networking using the GUI
      • Manage NFS networking using the CLI
    • Manage the S3 protocol
      • S3 cluster management
        • Manage the S3 service using the GUI
        • Manage the S3 service using the CLI
      • S3 buckets management
        • Manage S3 buckets using the GUI
        • Manage S3 buckets using the CLI
      • S3 users and authentication
        • Manage S3 users and authentication using the CLI
        • Manage S3 service accounts using the CLI
      • S3 rules information lifecycle management (ILM)
        • Manage S3 lifecycle rules using the GUI
        • Manage S3 lifecycle rules using the CLI
      • Audit S3 APIs
        • Configure audit webhook using the GUI
        • Configure audit webhook using the CLI
        • Example: How to use Splunk to audit S3
      • S3 supported APIs and limitations
      • S3 examples using boto3
    • Manage the SMB protocol
      • Manage SMB using the GUI
      • Manage SMB using the CLI
  • Operation Guide
    • Alerts
      • Manage alerts using the GUI
      • Manage alerts using the CLI
      • List of alerts and corrective actions
    • Events
      • Manage events using the GUI
      • Manage events using the CLI
      • List of events
    • Statistics
      • Manage statistics using the GUI
      • Manage statistics using the CLI
      • List of statistics
    • Insights
    • System congestion
    • Security management
      • Obtain authentication tokens
      • KMS management
        • Manage KMS using the GUI
        • Manage KMS using the CLI
      • TLS certificate management
        • Manage the TLS certificate using the GUI
        • Manage the TLS certificate using the CLI
      • CA certificate management
        • Manage the CA certificate using the GUI
        • Manage the CA certificate using the CLI
      • Account lockout threshold policy management
        • Manage the account lockout threshold policy using GUI
        • Manage the account lockout threshold policy using CLI
      • Manage the login banner
        • Manage the login banner using the GUI
        • Manage the login banner using the CLI
    • User management
      • Manage users using the GUI
      • Manage users using the CLI
    • Organizations management
      • Manage organizations using the GUI
      • Manage organizations using the CLI
      • Mount authentication for organization filesystems
    • Expand and shrink cluster resources
      • Add a backend server
      • Expand specific resources of a container
      • Shrink a cluster
    • Background tasks
      • Manage background tasks using the GUI
      • Manage background tasks using the CLI
    • Upgrade WEKA versions
  • Billing & Licensing
    • License overview
    • Classic license
  • Monitor the WEKA Cluster
    • Deploy monitoring tools using the WEKA Management Station (WMS)
    • WEKA Home - The WEKA support cloud
      • Local WEKA Home overview
      • Deploy Local WEKA Home v3.0 or higher
      • Deploy Local WEKA Home v2.x
      • Explore cluster insights and statistics
      • Manage alerts and integrations
      • Enforce security and compliance
      • Optimize support and data management
    • Set up the WEKAmon external monitoring
    • Set up the SnapTool external snapshots manager
  • Support
    • Get support for your WEKA system
    • Diagnostics management
      • Traces management
        • Manage traces using the GUI
        • Manage traces using the CLI
      • Protocols debug level management
        • Manage protocols debug level using the GUI
        • Manage protocols debug level using the CLI
      • Diagnostics data management
  • Best Practice Guides
    • WEKA and Slurm integration
      • Avoid conflicting CPU allocations
    • Storage expansion best practice
  • Appendices
    • WEKA CSI Plugin
      • Deployment
      • Storage class configurations
      • Tailor your storage class configuration with mount options
      • Dynamic and static provisioning
      • Launch an application using WEKA as the POD's storage
      • Add SELinux support
      • NFS transport failback
      • Upgrade legacy persistent volumes for capacity enforcement
      • Troubleshooting
    • Convert cluster to multi-container backend
    • Create a client image
Powered by GitBook
On this page
  • Installation logs
  • AWS account limits
  • AWS instance launch error
  • Launch in placement group error
  • Instance type not supported in AZ
  • ClusterBootCondition timeout
  • Clients failed to join
  • Health monitoring
  1. Planning and Installation
  2. WEKA installation on AWS
  3. WEKA installation on AWS using the Cloud Formation

Troubleshooting

This page details common errors that can occur when deploying WEKA in AWS using CloudFormation and what can be done to resolve them.

PreviousAuto scaling groupNextWEKA installation on Azure

Last updated 1 year ago

Using CloudFormation deployment saves a lot of potential errors that may occur during installation, especially in the configuration of security groups and other connectivity-related issues. However, the following errors related to the following subjects may occur during installation:

  • Installation logs

  • AWS account limits

  • AWS instance launch error

  • Launch in placement group error

  • Instance type not supported in AZ

  • ClusterBootCondition timeout

  • Clients failed to join cluster

Installation logs

As explained in , each instance launched in a WEKA CloudFormation template starts by installing WEKA on itself. This is performed using a script named wekaio-instance-boot.sh and launched by cloud-init. All logs generated by this script are written to the instance’s Syslog.

Additionally, the is installed on each instance, dumping Syslog to CloudWatch under a log-group named/wekaio/<stack-name>. For example, if the stack is namedcluster1, a log-group named /wekaio/cluster1 should appear in CloudWatch a few moments after the template shows the instances have reached CREATE_COMPLETE state.

Under the log-group, there should be a log-stream for each instance Syslog matching the instance name in the CloudFormation template. For example, in a cluster with 6 backend instances, log-streams named Backend0-syslog through Backend5-syslog should be observed.

AWS account limits

When deploying the stack, this error may be received in the description of a CREATE_FAILED event for one or more instances, indicating that more instances (N) have been requested than that permitted by the current instance limit of L for the specified instance type. To request an adjustment to this limit, go to to open a support case with AWS.

AWS instance launch error

If the error Instance i-0a41ba7327062338e failed to stabilize. Current state: shutting-down. Reason: Server.InternalError: Internal error on launch is received, one of the instances was unable to start. This is an internal AWS error and it is necessary to try to deploy the stack again.

Launch in placement group error

If the error We currently do not have sufficient capacity to launch all of the additional requested instances into Placement Group 'PG' is received, it was not possible to place all the requested instances in one placement-group.

The CloudFormation template creates all instances in one placement-group to guarantee best performance. Consequently, if the deployment fails with this error, try to deploy in another AZ.

Instance type not supported in AZ

If the error The requested configuration is currently not supported. Please check the documentation for supported configurations or Your requested instance type (T) is not supported in your requested Availability Zone (AZ). Please retry your request by not specifying an Availability Zone or choosing az1, az2, az3 is received, the instance type that you tried to provision is not supported in the specified AZ. Try selecting another subnet to deploy the cluster in, which will implicitly select another AZ.

ClusterBootCondition timeout

When a ClusterBootCondition timeout occurs, there was a problem creating the initial WEKA system cluster. To debug this error, look in the Backend0-syslog log-stream (as described above). The first backend instance is responsible for creating the cluster and therefore, its log should provide the information necessary to debug this error.

Clients failed to join

When the message Clients failed to join for uniqueId: ClientN is received while in the WaitCondition, one of the clients was unable to join the cluster. Look at the Syslog of the client specified in uniqueId as described above.

Example: If the error message specifies that client 3 failed to join, a message ending with uniqueId: Client3 should be displayed. Look at the log-stream named Client3-syslog.

Health monitoring

You can monitor the cluster instances by checking the cluster EC2 instances in the AWS EC2 service. You can set up as external monitoring to the cluster.

Connecting to the WEKA cluster GUI provides a where you can see if any component is not properly functioning and view system , , and .

Self-Service Installation
CloudWatch Logs Agent
aws.amazon.com
Cloud Watch
alerts
events
statistics
system dashboard