Configure data catalog
Deploy and configure data catalog to enable high-performance data indexing and metadata management across filesystems.
Data catalog architecture
Learn how the data catalog components work together to index and query filesystem metadata at scale. Dedicated catalog services scan modified files concurrently and store searchable metadata in a centralized index database.
Catalog components
The catalog feature integrates several components within the WEKA cluster to manage and query metadata:
Data service containers: The compute units that power catalog services. The catalog requires a minimum of five data service containers: one serves as the coordinator and the others function as workers.
Difflist: A service that runs on the data service containers to detect changes in the filesystems.
Data manager: The component that manages rolling snapshots used by the difflist to track data changes.
Index database: The central repository that receives indexed fields and stores results for metadata queries.
Index filesystem (.indexfs): A dedicated filesystem that stores catalog index data.
Query API and GUI: Interfaces used to access and visualize the indexed metadata.
Related topics
Set up a Data Services container for background tasks
Track filesystem changes with the DiffList REST API
Catalog workflow
The data catalog maintains synchronization through a structured three-step process:
Snapshot management: The Data manager creates rolling snapshots of the filesystems to provide a reference point for the Difflist.
Parallel scanning: The system identifies changed files and scans them in parallel across the data service containers to maintain high performance.
Indexing and storage: The service indexes common metadata fields and sends the results to the Index database. This data is stored on the index filesystem for long-term retention and querying.
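Steps 2 and 3 of this workflow can be pictured with a minimal Python sketch. It is an illustration only, not the catalog's implementation: the function names are hypothetical, the thread pool stands in for the data service containers, and an in-memory dictionary stands in for the Index database.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def scan_file(path):
    """Collect a few common metadata fields for one changed file."""
    st = os.stat(path)
    return {
        "filepath": path,
        "filename": os.path.basename(path),
        "size": st.st_size,
        "modify_time": st.st_mtime,
    }


def index_changed_files(changed_paths, workers=4):
    """Scan changed files in parallel and return a filepath -> metadata map.

    Assumption: the thread pool is a stand-in for the worker containers and
    the returned dict is a stand-in for the Index database.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        records = list(pool.map(scan_file, changed_paths))
    return {rec["filepath"]: rec for rec in records}


if __name__ == "__main__":
    # Create a few temporary files to play the role of "changed files".
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(3):
            path = os.path.join(tmp, f"file{i}.log")
            with open(path, "w") as fh:
                fh.write("x" * (i + 1))
            paths.append(path)
        for rec in index_changed_files(paths).values():
            print(rec["filename"], rec["size"])
```

Only the files reported as changed are scanned, which is why the cost of an incremental cycle tracks the change rate rather than the total object count.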

Deploy the catalog services
Configure the infrastructure and filesystems required to activate catalog services for your data.
Before you begin
The catalog services require at least five backend servers. Ensure each server has 32 GB of free memory and connectivity on port 14400.
Each server that runs catalog services must run both a frontend container and a data service container.
For optimal operation, ensure a minimum of 500 GB of available storage for the index filesystem (.indexfs). See Sizing guidelines.
Procedure
Create the index filesystem: Add the .indexfs.
Deploy data service containers:
Run the following command on each server, whether dedicated backend servers or existing backend servers:
You can use the same name on all servers or increment it across servers (for example, dataserv0, dataserv1, and so on).
Dedicated backend servers only: To avoid impact on client workload I/Os, also run the following command on each dedicated backend server to add a frontend container (existing backend servers already contain frontend containers):
Set the global configuration of the data service containers by running the following command on any cluster server:
Initialize the catalog services:
Add the newly created dataserv container IDs to the catalog cluster. You can specify the container IDs or provide the --all-servers flag.
Wait for 30 seconds and then check if the catalog services are active:
Output example:
Enable indexing:
Enable the catalog feature on your specified filesystem. Replace fs-name with the name of the filesystem you want to index.
View the filesystem listing as seen from the catalog perspective:
Output example:
Verify catalog configuration:
Output example:
Configure index interval and snapshot retention period: Adjust these settings to match your workload needs. They dictate how often data is indexed and the duration for which point-in-time snapshots are kept. By default, the --index-interval is set to 1 day, and the --retention-period is 30 days. Use the following command to update these configurations:
Supported time units:
--index-interval: Accepts values in minutes, hours, or days (for example: 30m, 2h, 1d5h, 3d8h30m, 7d). Valid range: 30m–7d.
--retention-period: Accepts values in minutes, hours, or days (for example: 30d, 45d, 90d, 180d, or 366d). Must be at least --index-interval and no more than 366d.
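The limits above can be checked before running the command. The following Python sketch parses the documented duration format and applies the documented ranges. It is not the CLI's own parser: the helper names are hypothetical, and it assumes units always appear in days, hours, minutes order, as they do in the examples above.

```python
import re

# Optional days, hours, and minutes components, in that order (assumed).
_DURATION_RE = re.compile(r"^(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)m)?$")


def parse_minutes(text):
    """Convert a string such as '1d5h' or '3d8h30m' to a number of minutes."""
    match = _DURATION_RE.match(text)
    if not text or match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    days, hours, minutes = (int(g) if g else 0 for g in match.groups())
    return days * 24 * 60 + hours * 60 + minutes


def validate(index_interval, retention_period):
    """Return both values in minutes, or raise ValueError if out of range."""
    interval = parse_minutes(index_interval)
    retention = parse_minutes(retention_period)
    if not parse_minutes("30m") <= interval <= parse_minutes("7d"):
        raise ValueError("--index-interval must be between 30m and 7d")
    if not interval <= retention <= parse_minutes("366d"):
        raise ValueError(
            "--retention-period must be at least --index-interval "
            "and no more than 366d"
        )
    return interval, retention


if __name__ == "__main__":
    # The documented defaults: 1 day interval, 30 days retention.
    print(validate("1d", "30d"))
```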
The duration of the initial Data Catalog snapshot creation is proportional to the total number of objects (files and directories) in the filesystem. For approximate baseline and differential snapshot creation times, refer to Sizing for baseline indexing time below. The Data Catalog UI remains unpopulated until the first snapshot has been successfully created.
Troubleshoot catalog deployment issues
Resolve issues related to container deployment and catalog services operations by following these resolution steps.
Troubleshoot container deployment
Container with name already exists
Resolution:
If a container exists but is not functional, remove it manually by running weka local stop <name> and weka local rm <name>.
Re-run the script after removal.
Container not found after creation
Resolution: The system waits 30 seconds and retries after 15 seconds. If containers do not appear in the output of weka cluster container, verify the following:
Network connectivity to the cluster leader process.
The accuracy of the --join-ips value.
Troubleshoot catalog cluster operations
Catalog cluster status shows inactive
Resolution:
Check the cluster state:
Verify that all data service containers are in the UP state:
For any failed container, SSH to the server and restart it:
Indexing does not progress
Resolution:
Verify indexing is enabled by checking the Indexing column in weka catalog fs status.
Check the index interval with weka catalog config show.
Ensure the catalog services have a sufficient number of servers configured and running with active status.
Catalog tasks are stuck or slow
Resolution:
Run weka cluster task --show-catalog to view the progress of each ingestion phase.
Identify the phase with the longest elapsed time and investigate resource availability on those specific containers.
Catalog diagnostic command reference
To monitor and diagnose the health of the catalog cluster, use these CLI commands. Review the example outputs to understand the expected results for these diagnostic commands.
Display catalog cluster status
weka catalog cluster status
Check the overall health of the catalog services. Ensure each container displays a status of active during normal operation. Use this as the initial step when troubleshooting catalog-related issues.
List cluster containers
weka cluster container
Lists all containers in the cluster, including data service (dataserv) containers, with the status, host, and resource allocation of each. Use it to verify that all dataserv containers are running and properly registered.
Show catalog configuration
weka catalog config show
Displays the catalog configuration settings, including the index filesystem status, snapshot scheduling frequency, retention period, and the maximum number of concurrent ingest tasks.
List catalog snapshots
weka catalog metadata show <fs-name>
Lists all catalog point-in-time snapshots (not filesystem snapshots) for the specified filesystem. The catalog indexing pipeline uses these snapshots to discover the changes between two points in time, so this list shows which reference points are available for ingestion.
Monitor ingestion tasks
weka cluster task --show-catalog
Monitors catalog ingestion progress. It displays the ongoing ingestion phases, the elapsed time for each phase, and the percentage of completion. Use this data to pinpoint bottlenecks and track overall ingestion progress.
Sizing guidelines
Use the following guidelines to determine the hardware resources required for the catalog deployment before enabling indexing. Resource requirements scale with the number of filesystem objects and the anticipated data growth rate.
Download the filestats.sh script and run it on your filesystem to measure its object count and directory-to-file ratio before sizing your hardware:
Script example output
Use the Total Items count from the script output to select the appropriate column in the sizing tables below. For example, 40 million objects falls under the 100+ million objects column.
Use the Dir-to-file Ratio to estimate where your baseline ingest time falls within the range shown in the Sizing for baseline indexing time table below. For example, a ratio of 0.0589 falls under the Baseline (full ingest) row, 8–9 hours.
Sizing for the catalog deployment
The table below maps filesystem scale to the minimum recommended resources. These values are tested sizing guidance based on a typical dataset (files and directories).
Each specification assumes approximately 10% monthly data growth over a 12 to 18 month horizon and a directory-to-file ratio between 0.05 and 0.50.
0.05 means about 1 directory for every 20 files.
0.50 means about 1 directory for every 2 files.
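The ratio arithmetic above can be reproduced with a short Python sketch. The function names are hypothetical, and the directory and file counts would come from your own measurement of the filesystem.

```python
def dir_to_file_ratio(directories, files):
    """Directory-to-file ratio: number of directories divided by number of files."""
    if files <= 0:
        raise ValueError("files must be a positive count")
    return directories / files


def files_per_directory(ratio):
    """Inverse view of the ratio: roughly how many files exist per directory."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 1 / ratio


def within_sizing_assumption(ratio):
    """True if the ratio lies in the 0.05 to 0.50 range the sizing table assumes."""
    return 0.05 <= ratio <= 0.50


if __name__ == "__main__":
    ratio = dir_to_file_ratio(directories=1, files=20)
    print(ratio, files_per_directory(ratio), within_sizing_assumption(ratio))
```

A ratio outside the assumed range does not prevent deployment; it only means the table values are a weaker guide for that filesystem.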
Review and adjust catalog resource allocations every 6–9 months to account for filesystem growth. Run the filestats.sh script to get the current object count per filesystem, then use the sizing table to determine whether your existing resources still meet the requirements.
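To see why a periodic review matters, the growth assumption can be projected forward. The sketch below is a rough planning aid, not a sizing tool: it assumes the 10% monthly growth compounds month over month, which the guidance above does not state explicitly, and the function name is hypothetical.

```python
def projected_objects(current_objects, months, monthly_growth=0.10):
    """Project the object count after a number of months.

    Assumption: growth compounds monthly at the given rate.
    """
    if months < 0:
        raise ValueError("months must not be negative")
    return round(current_objects * (1 + monthly_growth) ** months)


if __name__ == "__main__":
    # Project a 40-million-object filesystem over the 12 to 18 month horizon.
    for months in (6, 9, 12, 18):
        print(months, projected_objects(40_000_000, months))
```

Under this assumption the object count roughly triples within a year, so a filesystem can move into a larger sizing column well before the end of the planning horizon.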
Data service containers | 5 total: 4 workers + 1 coordinator | 5 total: 4 workers + 1 coordinator | 10 total: 9 workers + 1 coordinator | 10 total: 9 workers + 1 coordinator
CPU | 2 spare cores per server (minimum) | 2 spare cores per server (minimum) | 2 spare cores per server (minimum) | 2 spare cores per server (minimum)
Memory | 32 GB free per server | 32 GB free per server | 32 GB free per server | 64 GB free per server
Disk (index filesystem) | 100 GB – 250 GB (30 days – 1 year retention) | 150 GB – 500 GB (30 days – 1 year retention) | 250 GB – 1.5 TB (30 days – 1 year retention) | 1 TB – 5 TB (30 days – 1 year retention)
The following considerations apply to all deployment sizes:
The catalog supports selective manual assignment of data service containers or automatic assignment of all backend servers. The coordinator and worker roles cannot be assigned manually.
Deploy one data service container per server.
Disk sizing for the index filesystem depends on the snapshot retention period configured with --retention-period.
Sizing for baseline indexing time
The table below provides estimated durations for the initial full ingest and ongoing incremental (delta) indexing.
Baseline ingest time depends on the directory-to-file ratio: a lower ratio means fewer directory scans and faster ingestion.
Delta estimates assume 1% change per day.
Baseline (full ingest) | 8–9 hours | 8–22 hours | 24–40 hours | 48–60 hours
Delta (incremental changes) | 20–25 minutes | 40–50 minutes | 1–2 hours | 3–5 hours
To monitor ingest progress in real time, run weka cluster task --show-catalog.
Catalog REST API examples
The following examples show request and response payloads for each catalog REST API endpoint.
For the catalog REST API reference, see Catalog.
Run a catalog query
Query the catalog index to retrieve files and directories that match the specified conditions. This is the API equivalent of the Catalog Discovery feature. Use the criteria field to filter results with nested AND/OR conditions.
Results are paginated: when the response includes a next_cookie value, pass it as resume_cookie in the next request to retrieve the following page. Repeat until next_cookie is empty, which indicates the last page.
POST /catalog/query
The following fields are available in select_fields:
inode, filepath, filename, size, file_type, uid, gid, file_extension, mode, birth_time, access_time, modify_time, change_time
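The nested AND/OR filtering that criteria expresses can be illustrated with a small local evaluator in Python. This is a conceptual sketch only: the dictionary layout (operator, conditions, field, op, value) and the operator names are hypothetical and are not the request schema, which is defined in the Catalog REST API reference. Only the field names come from the list above.

```python
import operator

# Hypothetical comparison operators for leaf conditions.
_OPS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "lt": operator.lt,
    "startswith": lambda value, prefix: str(value).startswith(prefix),
}


def matches(record, criteria):
    """Return True if a record satisfies a nested AND/OR criteria tree."""
    if "conditions" in criteria:  # group node: combine child results
        results = (matches(record, child) for child in criteria["conditions"])
        if criteria["operator"] == "AND":
            return all(results)
        if criteria["operator"] == "OR":
            return any(results)
        raise ValueError(f"unknown operator: {criteria['operator']!r}")
    # Leaf node: compare one field against one value.
    compare = _OPS[criteria["op"]]
    return compare(record[criteria["field"]], criteria["value"])


if __name__ == "__main__":
    record = {"filename": "test.log", "filepath": "/data/logs/test.log", "size": 2048}
    query = {
        "operator": "AND",
        "conditions": [
            {"field": "filename", "op": "eq", "value": "test.log"},
            {"operator": "OR", "conditions": [
                {"field": "filepath", "op": "startswith", "value": "/data/logs"},
                {"field": "filepath", "op": "startswith", "value": "/data/tmp"},
            ]},
        ],
    }
    print(matches(record, query))
```

Because a group node can contain other group nodes, an OR across paths can sit inside an AND with other field filters, which is the pattern the complex examples in this section combine.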
Basic example with pagination
This example searches for all files named test.log across the filesystem, returning 100 results per page.
Request:
Response (page 1):
To retrieve the next page, resubmit the same request body with resume_cookie set to the value of next_cookie from the previous response:
When next_cookie is empty in the response, all pages have been retrieved.
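The pagination loop can be sketched in Python as follows. The sketch stays independent of any HTTP library: fetch_page is a caller-supplied function that sends a request body to POST /catalog/query and returns the decoded response as a dictionary. That function, the results key, and the limit field in the usage example are assumptions for illustration; only next_cookie and resume_cookie come from the description above.

```python
def iter_results(fetch_page, body):
    """Yield every result by following next_cookie until it is empty.

    fetch_page: hypothetical callable taking a request body (dict) and
    returning the decoded response (dict).
    """
    request = dict(body)
    while True:
        page = fetch_page(request)
        # Assumption: result rows are returned under a "results" key.
        yield from page.get("results", [])
        cookie = page.get("next_cookie")
        if not cookie:
            break  # an empty next_cookie marks the last page
        # Resubmit the same body with resume_cookie set.
        request = dict(body, resume_cookie=cookie)


if __name__ == "__main__":
    # A fake two-page backend, used only to demonstrate the loop.
    fake_pages = {
        None: {"results": ["a", "b"], "next_cookie": "page-2"},
        "page-2": {"results": ["c"], "next_cookie": ""},
    }

    def fake_fetch(request):
        return fake_pages[request.get("resume_cookie")]

    print(list(iter_results(fake_fetch, {"limit": 2})))
```

Rebuilding the request from the original body on every iteration keeps the filter and page size identical across pages, with only resume_cookie changing.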
Complex example 1: AND query with multiple field filters
This example filters for regular files named test.log under /data/logs that are larger than 1024 bytes, sorted by filename ascending.
Request:
Complex example 2: OR query across multiple paths
This example retrieves files from two directories — /data/logs and /data/tmp — in a single query using an OR operator.
Request:
Response:
Get changes between two point-in-time snapshots
Return the files and directories added, deleted, or modified between two point-in-time snapshots.
POST /catalog/query/diff
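Conceptually, the diff classifies each path by comparing two snapshot listings. The Python sketch below shows that classification on plain dictionaries. It is not how the endpoint is implemented, and the input shape (filepath mapped to modify_time) and function name are assumptions chosen for the illustration.

```python
def diff_snapshots(old, new):
    """Classify paths as added, deleted, or modified between two listings.

    old, new: dicts mapping filepath -> modify_time (assumed input shape).
    """
    added = sorted(new.keys() - old.keys())
    deleted = sorted(old.keys() - new.keys())
    # A path present in both listings counts as modified if its value changed.
    modified = sorted(p for p in old.keys() & new.keys() if old[p] != new[p])
    return {"added": added, "deleted": deleted, "modified": modified}


if __name__ == "__main__":
    earlier = {"/data/a.log": 100, "/data/b.log": 100, "/data/c.log": 100}
    later = {"/data/a.log": 100, "/data/b.log": 250, "/data/d.log": 300}
    print(diff_snapshots(earlier, later))
```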
Response example:
Get data usage by user ID
Query filesystem usage statistics grouped by user, returning file count and total size per User ID (UID).
GET /catalog/stats/usageByUser
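The grouping this endpoint performs can be pictured as a simple aggregation over file records. The Python sketch below runs on local sample data. The record fields uid and size match the catalog field names listed earlier, but the output keys file_count and total_size and the function name are hypothetical, not the endpoint's response format.

```python
from collections import defaultdict


def usage_by_user(records):
    """Aggregate file count and total size in bytes per UID."""
    totals = defaultdict(lambda: {"file_count": 0, "total_size": 0})
    for rec in records:
        entry = totals[rec["uid"]]
        entry["file_count"] += 1
        entry["total_size"] += rec["size"]
    return dict(totals)


if __name__ == "__main__":
    sample = [
        {"uid": 1000, "size": 4096},
        {"uid": 1000, "size": 1024},
        {"uid": 1001, "size": 512},
    ]
    for uid, stats in sorted(usage_by_user(sample).items()):
        print(uid, stats["file_count"], stats["total_size"])
```

Grouping by gid instead of uid gives the per-group view in the same way.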
Response example:
Get data usage by group ID
Query filesystem usage statistics grouped by group, returning file count and total size per Group ID (GID).
GET /catalog/stats/usageByGroup
Response example:
Get list of snapshot metadata available for a filesystem
List the catalog snapshots available for a filesystem, including the start and end timestamps of each data ingestion cycle into the index filesystem.
GET /catalog/snapshots/{fs_uuid}
Response example:
Get file distribution by extension types
Query file count and total size grouped by file extension types across the filesystem.
GET /catalog/stats/distributionByExtension
Response example:
Get data by file size ranges
Query file distribution grouped by size ranges, from 0–1 KB up to 10 TB and above.
GET /catalog/stats/filesBySize
Response example:
Get capacity by file age
Query total file capacity grouped by file age based on modification time, from under one week to five years and older.
GET /catalog/stats/capacityByFileAge
Response example:
Get dashboard statistics
Get a top-level summary of filesystem statistics, including total file count, directory count, capacity, and the top users ranked by file count and total size.
GET /catalog/stats/dashboard
Response example:
Get hierarchical directory tree with size statistics
Retrieve a directory tree up to a specified depth, showing direct and recursive size aggregations for each node. All sizes are returned in bytes.
GET /catalog/stats/directoryTree
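The difference between direct and recursive sizes is easiest to see in a small computation. In the Python sketch below, a directory's direct size counts only the files immediately inside it, and its recursive size also counts every file in its subdirectories. The function name, input shape, and output keys are hypothetical, and the depth convention (root is depth 0) is an assumption rather than the endpoint's documented behavior.

```python
import posixpath
from collections import defaultdict


def directory_sizes(files, max_depth):
    """Compute direct and recursive byte totals per directory.

    files: iterable of (absolute POSIX filepath, size in bytes).
    Returns only directories at depth <= max_depth, where "/" is depth 0.
    """
    direct = defaultdict(int)
    recursive = defaultdict(int)
    for path, size in files:
        if not path.startswith("/"):
            raise ValueError(f"expected an absolute path: {path!r}")
        parent = posixpath.dirname(path)
        direct[parent] += size
        # Credit the file to its parent and every ancestor up to the root.
        current = parent
        while True:
            recursive[current] += size
            if current == "/":
                break
            current = posixpath.dirname(current)

    def depth(directory):
        return 0 if directory == "/" else directory.count("/")

    return {
        d: {"direct": direct[d], "recursive": recursive[d]}
        for d in sorted(recursive)
        if depth(d) <= max_depth
    }


if __name__ == "__main__":
    sample = [
        ("/data/logs/a.log", 100),
        ("/data/logs/b.log", 50),
        ("/data/tmp/c.tmp", 25),
        ("/data/readme", 5),
    ]
    for directory, sizes in directory_sizes(sample, max_depth=2).items():
        print(directory, sizes)
```

Limiting the depth trims the output but not the totals: a directory at the cutoff still reports the recursive size of everything beneath it.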
Response example:
Get filesystem capacity metadata history
Query filesystem capacity metadata showing SSD and total capacity trends across multiple snapshots over time.
GET /catalog/filesystem/metadata
Response example:
Get point-in-time filesystem capacity metadata
Query filesystem capacity metadata for a specific snapshot access point. If access_point is provided, the response returns metadata for that snapshot. If access_point is omitted, the response returns the most recent metadata.
GET /catalog/filesystem/metadata/point-in-time
Response example: