Manage CPU allocations for WEKA and Slurm

Configure efficient CPU allocation to prevent conflicts between the WEKA filesystem and Slurm job scheduling. Improper CPU allocation can lead to performance degradation, CPU starvation, or resource contention.

Follow these steps to ensure WEKA and Slurm coexist by managing CPUsets and NUMA node allocations.

1. Disable WEKA CPUset isolation

Ensure that WEKA's default CPUset isolation is disabled to avoid conflicts with Slurm.

grep isolate_cpusets /etc/wekaio/service.conf

Example result:

isolate_cpusets = false
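
If the command returns isolate_cpusets=true, set the value to false in /etc/wekaio/service.conf and restart the WEKA agent so the change takes effect (a minimal sketch, assuming a systemd host):

systemctl restart weka-agent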

2. Verify hyperthreading and NUMA configuration

Verify the hyperthreading and NUMA configuration of the server. Hyperthreading is typically disabled in Slurm-managed environments.

lscpu | egrep 'Thread|NUMA'

Example result: In this example, hyperthreading is disabled (1 thread per core), and there are four NUMA nodes.

Thread(s) per core:  1
NUMA node(s):        4
NUMA node0 CPU(s):   0-13
NUMA node1 CPU(s):   14-27
NUMA node2 CPU(s):   28-41
NUMA node3 CPU(s):   42-55

3. Identify the NUMA node of the dataplane network interface

Determine the NUMA node associated with the dataplane network interface.
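
For example, assuming the dataplane interface is ib0, read its NUMA node from sysfs:

cat /sys/class/net/ib0/device/numa_node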

Example result: In this example, the interface ib0 is located in NUMA node 1.
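
1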

Verify the local CPU list for the interface:
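
For example, for ib0:

cat /sys/class/net/ib0/device/local_cpulist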

Example result:
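
14-27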

4. Assign CPU cores to WEKA

When you mount the WEKA filesystem, specify the CPU cores for the WEKA client, as shown in the sketch after the following guidelines:

  • Select cores located in the same NUMA node as the network interface.

  • Avoid using core 0.

  • Select the last cores in the NUMA node.
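
A minimal sketch, assuming the topology above (ib0 in NUMA node 1, CPUs 14-27) and hypothetical backend, filesystem, and mount-point names (backend-0, fs01, /mnt/weka); the client takes the last two cores of the NUMA node, 26 and 27:

mount -t wekafs -o net=ib0 -o core=26 -o core=27 backend-0/fs01 /mnt/weka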

Run the following command to confirm the cores and network interfaces used by WEKA:
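
weka local resources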

Example result: The output lists the cores and dataplane network devices allocated to the local WEKA container; confirm that they match the cores and interface you selected.

5. Configure Slurm to exclude WEKA's cores

Configure Slurm to exclude the cores assigned to WEKA from user jobs by setting the CPUSpecList parameter.
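
An illustrative slurm.conf snippet, assuming the node name node01 and the example cores 26 and 27 selected in Step 4; note that CPUSpecList is enforced only with TaskPlugin=task/cgroup in slurm.conf and ConstrainCores=yes in cgroup.conf:

NodeName=node01 CPUs=56 ThreadsPerCore=1 CPUSpecList=26,27
TaskPlugin=task/cgroup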

Run the following command to verify the configuration:
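
scontrol show node node01 | grep -i cpuspeclist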

Example result:
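
CPUSpecList=26,27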

6. Verify CPUset configuration

Ensure that the Slurm CPUset excludes the cores assigned to the WEKA client.

WEKA client:
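
A sketch, assuming the WEKA client processes appear as wekanode and hold cores 26 and 27; check which CPUs each process may run on:

for pid in $(pgrep wekanode); do grep Cpus_allowed_list /proc/$pid/status; done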

Example result:
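
Cpus_allowed_list:	26-27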

Slurm:
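
For example, run a trivial job and print its allowed CPUs (the illustrative output below assumes WEKA holds cores 26 and 27):

srun -N1 grep Cpus_allowed_list /proc/self/status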

Example result:
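
Cpus_allowed_list:	0-25,28-55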

7. Manage hyperthreading

If hyperthreading is enabled, identify the sibling CPUs and include them in both the WEKA mount options and the Slurm CPUSpecList. Although WEKA automatically reserves these CPUs, explicit specification helps prevent potential issues.
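
For example, assuming WEKA uses core 26, list its siblings from sysfs (the sibling ID in the example result below is hypothetical):

cat /sys/devices/system/cpu/cpu26/topology/thread_siblings_list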

Example result:
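
26,82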

8. Address logical and physical CPU index mismatch

Environmental factors, such as BIOS or hypervisor settings, may cause discrepancies between logical CPU numbers and physical or OS-assigned numbers. This mismatch can result in the Slurm CPUset mistakenly including CPUs that must remain reserved for the WEKA client.

Identify conflicts

If the CPUset configuration shows that Slurm does not correctly exclude the WEKA-assigned CPUs, you might observe an overlap. In the following troubleshooting example, WEKA is assigned physical cores 56-63, but they appear in the Slurm CPUset, causing conflicts.

WEKA:
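
Check which CPUs the WEKA client processes may use (a sketch, assuming wekanode processes, as in Step 6):

for pid in $(pgrep wekanode); do grep Cpus_allowed_list /proc/$pid/status; done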

Example result:
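
Cpus_allowed_list:	56-63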

Slurm:
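
srun -N1 grep Cpus_allowed_list /proc/self/status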

Example result:
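
Cpus_allowed_list:	0-27,32-59

In this illustrative output, CPUs 56-59 appear in both lists, so Slurm jobs can be scheduled onto cores reserved for the WEKA client.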

This issue often arises from non-sequential CPU numbering, where CPUs are interleaved between NUMA nodes. Check the layout with lscpu | grep NUMA; the following output is illustrative:

Example result:
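
NUMA node0 CPU(s):   0-27,56-59
NUMA node1 CPU(s):   28-55,60-63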

Resolve index mismatches

To address this mismatch, perform the following actions:

  1. Ensure that the WEKA agent isolate_cpusets=false setting is applied (see Step 1) and restart the agent.

  2. Use hwloc-ls or lstopo-no-graphics to map the logical index (L#) to the physical/OS index (P#) for the CPUs assigned to WEKA.


Use the following command to verify the logical index numbers provided by user-space tools:
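
lstopo-no-graphics --no-io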

Example result: The following output indicates a mismatch between the Logical Index (L#) and Physical Index (P#):
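
(Trimmed, illustrative excerpt consistent with the example below:)

NUMANode L#0 (P#0)
  Core L#28 + PU L#28 (P#56)
  Core L#29 + PU L#29 (P#57)
  Core L#30 + PU L#30 (P#58)
  Core L#31 + PU L#31 (P#59)
NUMANode L#1 (P#1)
  Core L#60 + PU L#60 (P#60)
  Core L#61 + PU L#61 (P#61)
  Core L#62 + PU L#62 (P#62)
  Core L#63 + PU L#63 (P#63)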

If the logical and physical indexes do not match, use the logical index numbers in the Slurm CPUSpecList parameter.

In this example, although WEKA uses physical cores 56-63, you must set the Slurm CPUSpecList to 28-31,60-63 to correctly allocate the CPUs based on their logical index.
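
For example, in slurm.conf (node name assumed):

NodeName=node01 CPUs=64 ThreadsPerCore=1 CPUSpecList=28-31,60-63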

Related information

Slurm GRES documentation (for more details on logical and physical core index mapping)
