Configure data catalog

Deploy and configure the data catalog to enable high-performance data indexing and metadata management across filesystems.

Data catalog architecture

Learn how the data catalog components work together to index and query filesystem metadata at scale. Dedicated catalog services scan modified files concurrently and store searchable metadata in a centralized index database.

Catalog components

The catalog feature integrates several components within the WEKA cluster to manage and query metadata:

  • Data service containers: The compute units that power catalog services. The catalog requires a minimum of five data service containers: one serves as the coordinator and the others function as workers.

  • Difflist: A service that runs on the data service containers to detect changes in the filesystems.

  • Data manager: The component that manages rolling snapshots used by the difflist to track data changes.

  • Index database: The central repository that receives indexed fields and stores results for metadata queries.

  • Index filesystem (.indexfs): A dedicated filesystem that stores catalog index data.

  • Query API and GUI: Interfaces used to access and visualize the indexed metadata.

Related topics

Set up a Data Services container for background tasks

Track filesystem changes with the DiffList REST API

Catalog workflow

The data catalog maintains synchronization through a structured three-step process:

  1. Snapshot management: The Data manager creates rolling snapshots of the filesystems to provide a reference point for the Difflist.

  2. Parallel scanning: The system identifies changed files and scans them in parallel across the data service containers to maintain high performance.

  3. Indexing and storage: The service indexes common metadata fields and sends the results to the Index database. This data is stored on the index filesystem for long-term retention and querying.
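The parallel scanning and indexing steps can be pictured with a small conceptual sketch. This is not the WEKA implementation: the function names, the in-memory dictionary that stands in for the Index database, and the thread pool that stands in for the worker containers are all assumptions made for illustration. The metadata field names are borrowed from the catalog query fields.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def scan_file(path):
    """Collect a few common metadata fields for one changed file."""
    st = os.stat(path)
    return {
        "filepath": path,
        "filename": os.path.basename(path),
        "size": st.st_size,
        "modify_time": st.st_mtime,
    }


def index_changed_files(changed_paths, workers=4):
    """Scan changed files in parallel and return an index keyed by path.

    The thread pool is a stand-in for the worker containers (hypothetical).
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        records = list(pool.map(scan_file, changed_paths))
    return {rec["filepath"]: rec for rec in records}


# Demo: three small files stand in for the list of changed files.
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = []
    for i in range(3):
        path = os.path.join(tmp, f"file{i}.log")
        with open(path, "w") as fh:
            fh.write("x" * (i + 1))
        demo_paths.append(path)
    demo_index = index_changed_files(demo_paths)

demo_sizes = sorted(rec["size"] for rec in demo_index.values())
print(demo_sizes)  # [1, 2, 3]
```

Only the changed files are passed to the scanner, which is the reason the change list from the previous step matters: the cost of each indexing cycle follows the amount of change rather than the size of the filesystem.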

Figure: Catalog architecture

Deploy the catalog services

Configure the infrastructure and filesystems required to activate catalog services for your data.

Before you begin

  • The catalog services require at least five backend servers. Ensure each server has 32 GB of free memory and connectivity on port 14400.

  • Each server that runs catalog services must run both a frontend container and a data service container.

  • For optimal operation, ensure a minimum of 500 GB of available storage for the index filesystem (.indexfs). See Sizing guidelines.

Procedure

  1. Create the index filesystem: Add the .indexfs.

  2. Deploy data service containers:

    Run the following command on each server, whether dedicated backend servers or existing backend servers:

    You can use the same name on all servers or increment it across servers (for example, dataserv0, dataserv1, and so on).

    Dedicated backend servers only: To avoid impact on client workload I/Os, also run the following command on each dedicated backend server to add a frontend container (existing backend servers already contain frontend containers):

  3. Set the global configuration of the data service containers by running the following command on any cluster server:

  4. Initialize the catalog services:

    1. Add the newly created dataserv container IDs to the catalog cluster. You can specify the container IDs or provide the --all-servers flag.

    2. Wait for 30 seconds and then check if the catalog services are active:

      Output example:

  5. Enable indexing:

    1. Enable the catalog feature on your specified filesystem. Replace fs-name with the name of the filesystem you want to index.

    2. View the filesystem listing as seen from the catalog perspective:

      Output example:

    3. Verify catalog configuration:

      Output example:

  6. Configure index interval and snapshot retention period: Adjust these settings to match your workload needs. They dictate how often data is indexed and the duration for which point-in-time snapshots are kept. By default, the --index-interval is set to 1 day, and the --retention-period is 30 days. Use the following command to update these configurations:

    Supported time units:

    • --index-interval: Accepts values in minutes, hours, or days (for example: 30m, 2h, 1d5h, 3d8h30m, 7d). Valid range: 30m to 7d.

    • --retention-period: Accepts values in minutes, hours, or days (for example: 30d, 45d, 90d, 180d or 366d). Must be at least --index-interval and no more than 366d.
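The constraints above can be checked before running the update command. The following is a minimal sketch, assuming that duration strings always list days, then hours, then minutes, as in the examples. The helper names to_minutes and check_settings are hypothetical and are not part of the WEKA CLI.

```python
import re

# Days, hours, and minutes in that order, each part optional (assumed format).
DURATION_RE = re.compile(r"(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)m)?")


def to_minutes(text):
    """Convert a duration such as '3d8h30m' to a number of minutes."""
    match = DURATION_RE.fullmatch(text)
    if not match or not any(match.groups()):
        raise ValueError(f"invalid duration: {text!r}")
    days, hours, minutes = (int(g) if g else 0 for g in match.groups())
    return days * 24 * 60 + hours * 60 + minutes


def check_settings(index_interval, retention_period):
    """Validate both settings against the documented ranges."""
    interval = to_minutes(index_interval)
    retention = to_minutes(retention_period)
    if not to_minutes("30m") <= interval <= to_minutes("7d"):
        raise ValueError("--index-interval must be between 30m and 7d")
    if not interval <= retention <= to_minutes("366d"):
        raise ValueError(
            "--retention-period must be at least --index-interval "
            "and no more than 366d"
        )
    return interval, retention


print(check_settings("1d", "30d"))  # (1440, 43200)
```

For example, check_settings("10m", "30d") raises ValueError because the interval is below the 30m minimum.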

The duration of the initial Data Catalog snapshot creation is proportional to the total number of objects (files and directories) in the filesystem. For approximate baseline and differential snapshot creation times, refer to Sizing for baseline indexing time below. The Data Catalog UI remains unpopulated until the first snapshot has been successfully created.

Troubleshoot catalog deployment issues

Resolve issues related to container deployment and catalog services operations by following these resolution steps.

Troubleshoot container deployment

Container with name already exists

Resolution:

  1. If a container exists but is not functional, remove it manually by running weka local stop <name> and weka local rm <name>.

  2. Re-run the script after removal.

Container not found after creation

Resolution: The system waits 30 seconds and retries after 15 seconds. If containers do not appear in weka cluster container, verify the following:

  • Network connectivity to the cluster leader process.

  • The accuracy of the --join-ips value.

Troubleshoot catalog cluster operations

Catalog cluster status shows inactive

Resolution:

  1. Check the cluster state with weka catalog cluster status.

  2. Verify that all data service containers are in the UP state with weka cluster container.

  3. For any failed container, SSH to the server and restart it:

Indexing does not progress

Resolution:

  1. Verify indexing is enabled by checking the Indexing column in weka catalog fs status.

  2. Check the index interval with weka catalog config show.

  3. Ensure the catalog services have a sufficient number of servers configured and running with active status.

Catalog tasks are stuck or slow

Resolution:

  1. Run weka cluster task --show-catalog to view the progress of each ingestion phase.

  2. Identify the phase with the longest elapsed time and investigate resource availability on those specific containers.

Catalog diagnostic command reference

To monitor and diagnose the health of the catalog cluster, use these CLI commands. Review the example outputs to understand the expected results for these diagnostic commands.

Display catalog cluster status

weka catalog cluster status

Check the overall health of the catalog services. Ensure each container displays a status of active during normal operation. Use this as the initial step when troubleshooting catalog-related issues.

Example

List cluster containers

weka cluster container

Lists all containers in the cluster, including data service (dataserv) containers, with the status, host, and resource allocation of each container. Use it to verify that all dataserv containers are running and properly registered.

Example

Show catalog configuration

weka catalog config show

Displays the catalog configuration settings, including the index filesystem status, snapshot scheduling frequency, retention period, and the maximum number of concurrent ingest tasks.

Example

List catalog snapshots

weka catalog metadata show <fs-name>

Lists all catalog point-in-time snapshots (not filesystem snapshots) for the specified filesystem. The catalog ingestion pipeline uses these snapshots to discover the changes between two points in time.

Example

Monitor ingestion tasks

weka cluster task --show-catalog

Use this command to monitor catalog ingestion progress. It displays the ongoing ingestion phases, the elapsed time for each phase, and the percentage of completion. Use this data to pinpoint bottlenecks and track overall ingestion progress.

Example

Sizing guidelines

Use the following guidelines to determine the hardware resources required for the catalog deployment before enabling indexing. Resource requirements scale with the number of filesystem objects and the anticipated data growth rate.

Download the filestats.sh script and run it on your filesystem to measure its object count and directory-to-file ratio before sizing your hardware:

Script example output

Use the Total Items count from the script output to select the appropriate column in the sizing tables below. For example, 40 million objects falls under the 100+ million objects column.

Use the Dir-to-file Ratio to estimate where your baseline ingest time falls within the range shown in the Sizing for baseline indexing time table below. For example, a ratio of 0.0589 falls under the Baseline (full ingest) row, 8–9 hours.

Sizing for the catalog deployment

The table below maps filesystem scale to the minimum recommended resources. These values are tested sizing guidance based on a typical dataset (files and directories).

Each specification assumes approximately 10% monthly data growth over a 12 to 18 month horizon and a directory-to-file ratio between 0.05 and 0.50.

  • 0.05 means about 1 directory for every 20 files.

  • 0.50 means about 1 directory for every 2 files.
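As a quick check of the ratio arithmetic, the sketch below computes the directory-to-file ratio from two counts and tests it against the 0.05 to 0.50 range assumed by the sizing values. The function names are hypothetical. The counts would come from the filestats.sh output, whose exact format is not shown here.

```python
def dir_to_file_ratio(directories, files):
    """Return the number of directories per file."""
    if files <= 0:
        raise ValueError("files must be a positive count")
    return directories / files


def within_sizing_assumptions(ratio, low=0.05, high=0.50):
    """True when the ratio is inside the range the sizing values assume."""
    return low <= ratio <= high


print(dir_to_file_ratio(1, 20))  # 0.05
print(dir_to_file_ratio(1, 2))   # 0.5
print(within_sizing_assumptions(dir_to_file_ratio(1, 20)))  # True
```

A ratio outside this range does not prevent deployment, but the tested sizing values may then be less representative of your dataset.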

Review and adjust catalog resource allocations every 6–9 months to account for filesystem growth. Run the filestats.sh script to get the current object count per filesystem, then use the sizing table to determine whether your existing resources still meet the requirements.

| Parameter | 100+ million objects | 200+ million objects | 500+ million objects | 1+ billion objects |
|---|---|---|---|---|
| Data service containers | 5 total: 4 workers + 1 coordinator | 5 total: 4 workers + 1 coordinator | 10 total: 9 workers + 1 coordinator | 10 total: 9 workers + 1 coordinator |
| CPU | 2 spare cores per server (minimum) | 2 spare cores per server (minimum) | 2 spare cores per server (minimum) | 2 spare cores per server (minimum) |
| Memory | 32 GB free per server | 32 GB free per server | 32 GB free per server | 64 GB free per server |
| Disk (index filesystem) | 100 GB – 250 GB (30 days – 1 year retention) | 150 GB – 500 GB (30 days – 1 year retention) | 250 GB – 1.5 TB (30 days – 1 year retention) | 1 TB – 5 TB (30 days – 1 year retention) |

The following considerations apply to all deployment sizes:

  • The catalog supports either manual selection of specific data service containers or automatic assignment of all backend servers. The coordinator and worker roles cannot be assigned manually.

  • Deploy one data service container per server.

  • Disk sizing for the index filesystem depends on the snapshot retention period configured with --retention-period.
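The column lookup can be expressed as a short sketch. It assumes that an object count maps to the largest column whose threshold it reaches, and that counts below 100 million use the first column, which is consistent with the 40 million example given earlier. The names SIZING_COLUMNS and sizing_column are hypothetical. The container and memory values are copied from the sizing table.

```python
# (threshold, column label, data service containers, free memory per server in GB)
SIZING_COLUMNS = [
    (100_000_000, "100+ million objects", 5, 32),
    (200_000_000, "200+ million objects", 5, 32),
    (500_000_000, "500+ million objects", 10, 32),
    (1_000_000_000, "1+ billion objects", 10, 64),
]


def sizing_column(total_items):
    """Pick the sizing column for an object count (assumed lookup rule)."""
    chosen = SIZING_COLUMNS[0]
    for column in SIZING_COLUMNS:
        if total_items >= column[0]:
            chosen = column
    _, label, containers, memory_gb = chosen
    return {"column": label, "containers": containers, "memory_gb": memory_gb}


print(sizing_column(40_000_000)["column"])     # 100+ million objects
print(sizing_column(1_200_000_000)["column"])  # 1+ billion objects
```

If the count sits close to the next threshold, or growth is expected, choosing the next column up is the more conservative reading of the table.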

Sizing for baseline indexing time

The table below provides estimated durations for the initial full ingest and ongoing incremental (delta) indexing.

  • Baseline ingest time depends on the directory-to-file ratio: a lower ratio means fewer directory scans and faster ingestion.

  • Delta estimates assume 1% change per day.

| Operation | 100+ million objects | 200+ million objects | 500+ million objects | 1+ billion objects |
|---|---|---|---|---|
| Baseline (full ingest) | 8–9 hours | 8–22 hours | 24–40 hours | 48–60 hours |
| Delta (incremental changes) | 20–25 minutes | 40–50 minutes | 1–2 hours | 3–5 hours |

To monitor ingest progress in real time, run weka cluster task --show-catalog.

Catalog REST API examples

The following examples show request and response payloads for each catalog REST API endpoint.

For the catalog REST API reference, see Catalog.

Run a catalog query

Query the catalog index to retrieve files and directories that match the specified conditions. This is the API equivalent of the Catalog Discovery feature. Use the criteria field to filter results with nested AND/OR conditions.

Results are paginated: when the response includes a next_cookie value, pass it as resume_cookie in the next request to retrieve the following page. Repeat until next_cookie is empty, which indicates the last page.

POST /catalog/query

The following fields are available in select_fields:

inode, filepath, filename, size, file_type, uid, gid, file_extension, mode, birth_time, access_time, modify_time, change_time
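A client can check requested field names against this list before sending a query. The sketch below is a local convenience only. The helper name unknown_fields is hypothetical, and how the API itself responds to an unknown field is not covered here.

```python
# Field names copied from the list of fields available in select_fields.
SELECT_FIELDS = {
    "inode", "filepath", "filename", "size", "file_type", "uid", "gid",
    "file_extension", "mode", "birth_time", "access_time", "modify_time",
    "change_time",
}


def unknown_fields(requested):
    """Return the requested names that are not valid select_fields values."""
    return [name for name in requested if name not in SELECT_FIELDS]


print(unknown_fields(["filename", "size", "owner"]))  # ['owner']
```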

Basic example with pagination

This example searches for all files named test.log across the filesystem, returning 100 results per page.

Request:

Response (page 1):

To retrieve the next page, resubmit the same request body with resume_cookie set to the value of next_cookie from the previous response:

When next_cookie is empty in the response, all pages have been retrieved.
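The pagination loop can be sketched as follows. The field names resume_cookie and next_cookie come from the description above. Everything else is an assumption for illustration: post_query stands for any callable that sends a body to POST /catalog/query and returns the decoded JSON response, and the "results" key is a hypothetical name for the list of rows in a page.

```python
def fetch_all_pages(post_query, body):
    """Collect rows from every page of a catalog query.

    post_query: callable(dict) -> dict, a stand-in for the HTTP call.
    body: the request body for the first page, without resume_cookie.
    """
    rows = []
    request = dict(body)
    while True:
        response = post_query(request)
        rows.extend(response.get("results", []))  # "results" is an assumed key
        cookie = response.get("next_cookie")
        if not cookie:  # empty or missing cookie marks the last page
            return rows
        # Resubmit the same body with resume_cookie set to the last next_cookie.
        request = dict(body, resume_cookie=cookie)


# Fake two-page backend used in place of a real cluster.
FAKE_PAGES = {
    None: {"results": ["a", "b"], "next_cookie": "c1"},
    "c1": {"results": ["c"], "next_cookie": ""},
}


def fake_post(request):
    return FAKE_PAGES[request.get("resume_cookie")]


all_rows = fetch_all_pages(fake_post, {"select_fields": ["filename"]})
print(all_rows)  # ['a', 'b', 'c']
```

Passing the transport in as a callable keeps the loop independent of any particular HTTP client or authentication scheme.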

Complex example 1: AND query with multiple field filters

This example filters for regular files named test.log under /data/logs that are larger than 1024 bytes, sorted by filename ascending.

Request:

Complex example 2: OR query across multiple paths

This example retrieves files from two directories — /data/logs and /data/tmp — in a single query using an OR operator.

Request:

Response:



Get changes between two point-in-time snapshots

Return the files and directories added, deleted, or modified between two point-in-time snapshots.
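Conceptually, the three categories correspond to a comparison of two listings. The sketch below illustrates the classification only and says nothing about how the endpoint computes it. The function name and the use of a path-to-modify_time mapping for each snapshot are assumptions.

```python
def classify_changes(older, newer):
    """Compare two {filepath: modify_time} listings.

    Returns the paths that were added, deleted, or modified between them.
    """
    added = sorted(p for p in newer if p not in older)
    deleted = sorted(p for p in older if p not in newer)
    modified = sorted(p for p in newer if p in older and newer[p] != older[p])
    return {"added": added, "deleted": deleted, "modified": modified}


snapshot_a = {"/data/a.log": 100, "/data/b.log": 100, "/data/c.log": 100}
snapshot_b = {"/data/a.log": 100, "/data/b.log": 250, "/data/d.log": 300}

print(classify_changes(snapshot_a, snapshot_b))
# {'added': ['/data/d.log'], 'deleted': ['/data/c.log'], 'modified': ['/data/b.log']}
```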

POST /catalog/query/diff

Response example:

Get data usage by user ID

Query filesystem usage statistics grouped by user, returning file count and total size per User ID (UID).
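The aggregation this endpoint describes amounts to a group-by over file records. The sketch below shows the idea on a few in-memory records. The function name and the output keys file_count and total_size are hypothetical and may not match the actual response fields.

```python
from collections import defaultdict


def usage_by_uid(records):
    """Group file records by uid, counting files and summing sizes."""
    usage = defaultdict(lambda: {"file_count": 0, "total_size": 0})
    for rec in records:
        entry = usage[rec["uid"]]
        entry["file_count"] += 1
        entry["total_size"] += rec["size"]
    return dict(usage)


sample_records = [
    {"uid": 1000, "size": 2048},
    {"uid": 1000, "size": 1024},
    {"uid": 1001, "size": 512},
]

print(usage_by_uid(sample_records))
# {1000: {'file_count': 2, 'total_size': 3072}, 1001: {'file_count': 1, 'total_size': 512}}
```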

GET /catalog/stats/usageByUser

Response example:

Get data usage by group ID

Query filesystem usage statistics grouped by group, returning file count and total size per Group ID (GID).

GET /catalog/stats/usageByGroup

Response example:

Get list of snapshot metadata available for a filesystem

List the catalog snapshots available for a filesystem, including the start and end timestamps of each data ingestion cycle into the index filesystem.

GET /catalog/snapshots/{fs_uuid}

Response example:

Get file distribution by extension types

Query file count and total size grouped by file extension types across the filesystem.

GET /catalog/stats/distributionByExtension

Response example:

Get data by file size ranges

Query file distribution grouped by size ranges, from 0–1 KB up to 10 TB and above.

GET /catalog/stats/filesBySize

Response example:

Get capacity by file age

Query total file capacity grouped by file age based on modification time, from under one week to five years and older.

GET /catalog/stats/capacityByFileAge

Response example:

Get dashboard statistics

Get a top-level summary of filesystem statistics, including total file count, directory count, capacity, and the top users ranked by file count and total size.

GET /catalog/stats/dashboard

Response example:

Get hierarchical directory tree with size statistics

Retrieve a directory tree up to a specified depth, showing direct and recursive size aggregations for each node. All sizes are returned in bytes.
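The difference between direct and recursive sizes can be illustrated with a short sketch. It assumes absolute POSIX paths and treats the root directory as depth 0. The function name and the keys direct_size and recursive_size are hypothetical and may differ from the fields in the actual response.

```python
import posixpath
from collections import defaultdict


def directory_sizes(files, max_depth):
    """Aggregate file sizes per directory.

    files: iterable of (absolute filepath, size in bytes).
    direct_size counts files located directly in a directory;
    recursive_size also includes everything in its subdirectories.
    """
    direct = defaultdict(int)
    recursive = defaultdict(int)
    for filepath, size in files:
        node = posixpath.dirname(filepath)
        direct[node] += size
        while True:  # add the size to every ancestor up to the root
            recursive[node] += size
            if node in ("/", ""):
                break
            node = posixpath.dirname(node)

    def depth(path):
        return 0 if path == "/" else path.count("/")

    return {
        d: {"direct_size": direct.get(d, 0), "recursive_size": recursive[d]}
        for d in sorted(recursive)
        if depth(d) <= max_depth
    }


sample_files = [
    ("/data/logs/a.log", 100),
    ("/data/logs/b.log", 50),
    ("/data/tmp/c.bin", 25),
    ("/data/readme.txt", 5),
]

print(directory_sizes(sample_files, max_depth=1))
# {'/': {'direct_size': 0, 'recursive_size': 180}, '/data': {'direct_size': 5, 'recursive_size': 180}}
```

Limiting the depth only trims the nodes that are returned. The recursive totals of the remaining nodes still include the deeper directories.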

GET /catalog/stats/directoryTree

Response example:

Get filesystem capacity metadata history

Query filesystem capacity metadata showing SSD and total capacity trends across multiple snapshots over time.

GET /catalog/filesystem/metadata

Response example:

Get point-in-time filesystem capacity metadata

Query filesystem capacity metadata for a specific snapshot access point. If access_point is provided, the response returns metadata for that snapshot. If access_point is omitted, the response returns the most recent metadata.

GET /catalog/filesystem/metadata/point-in-time

Response example:
