Manage CPU allocations for WEKA and Slurm

Configure efficient CPU allocation to prevent conflicts between the WEKA filesystem and Slurm job scheduling. Improper CPU allocation can lead to performance degradation, CPU starvation, or resource contention.

Follow these steps to ensure WEKA and Slurm coexist by managing CPUsets and NUMA node allocations.

1. Disable WEKA CPUset isolation

Ensure WEKA's default CPUset isolation is disabled to avoid conflicts with Slurm's resource management.

grep 'isolate_cpusets=' /etc/wekaio/service.conf

Example result:

isolate_cpusets=false
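If the setting is missing or set to true, edit the configuration and restart the WEKA agent. A minimal sketch, assuming the service name weka-agent (verify the exact name and restart procedure for your installation):

```shell
# Check whether CPUset isolation is disabled in the WEKA service config.
# The weka-agent service name is an assumption -- adjust to your install.
CONF=/etc/wekaio/service.conf
if grep -q 'isolate_cpusets=false' "$CONF" 2>/dev/null; then
  echo "CPUset isolation is already disabled"
else
  echo "Set isolate_cpusets=false in $CONF, then restart the agent:"
  echo "  sudo systemctl restart weka-agent"
fi
```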

2. Verify hyperthreading and NUMA configuration

Verify the server's topology. Hyperthreading is typically disabled in Slurm-managed environments.

lscpu | egrep 'Thread|NUMA'

Example result (hyperthreading disabled, four NUMA nodes):

Thread(s) per core:  1
NUMA node(s):        4
NUMA node0 CPU(s):   0-13
NUMA node1 CPU(s):   14-27
NUMA node2 CPU(s):   28-41
NUMA node3 CPU(s):   42-55

3. Identify the dataplane network NUMA node

Determine the NUMA node associated with the dataplane network interface (for example: ib0).
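One way to find it is through sysfs (ib0 is an example interface name; a value of -1 means the kernel reports no NUMA affinity for the device):

```shell
# Report the NUMA node of the dataplane interface (ib0 is an example name).
IFACE=ib0
if [ -e "/sys/class/net/$IFACE/device/numa_node" ]; then
  cat "/sys/class/net/$IFACE/device/numa_node"
else
  echo "No NUMA information for $IFACE (interface missing or virtual)"
fi
```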

4. Assign CPU cores to WEKA

When mounting, select cores in the same NUMA node as the dataplane network interface. Avoid core 0, and prefer the highest-numbered cores in that node.

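As a sketch, assuming the interface from step 3 sits on NUMA node3 (CPUs 42-55 in the example topology above), a mount might reserve the last two cores of that node. The core mount option, hostname, and filesystem name below are illustrative; check the mount-option reference for your WEKA version before use:

```shell
# Sketch: reserve the last two cores of NUMA node3 (42-55) for WEKA.
# Option names, host, and filesystem name are illustrative.
WEKA_CORES="54,55"
MOUNT_CMD="mount -t wekafs -o core=54 -o core=55 backend-host/fsname /mnt/weka"
echo "$MOUNT_CMD"
# Afterwards, 'weka local resources' should list the reserved cores.
```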

5. Configure Slurm to exclude WEKA's cores

Set the CPUSpecList parameter in Slurm to exclude the cores reserved for WEKA.

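A slurm.conf fragment and a verification command might look like the following. The node name and core IDs are illustrative; CPUSpecList takes effect only with core specialization support enabled in your Slurm build:

```shell
# Illustrative slurm.conf fragment: reserve cores 54-55 (WEKA's cores)
# on node "node01" so Slurm never schedules jobs on them.
SPEC_LIST="54,55"   # WEKA-reserved cores from step 4 (example values)
cat <<EOF
NodeName=node01 CPUs=56 CPUSpecList=$SPEC_LIST
EOF
# Verify on a node where Slurm tools are installed:
if command -v scontrol >/dev/null 2>&1; then
  scontrol show node node01 | grep -i CpuSpecList
else
  echo "scontrol not available here"
fi
```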

6. Verify CPUset configuration (v1 and v2)

Ensure the Slurm CPUset correctly excludes the WEKA cores. Check the path corresponding to your OS cgroup version.

Read the cpuset file for both the WEKA client and Slurm (the v1 and v2 paths differ). The Slurm list must not contain any of the cores listed in the WEKA client list.
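The exact cgroup paths vary by distribution and by how the services are configured. A version-agnostic sketch is to resolve each process's cgroup via /proc and then read the cpuset file; the process names below (wekanode, slurmd) are assumptions to adjust for your environment:

```shell
# Read the effective cpuset of a running process via /proc.
# Process names are assumptions -- substitute your WEKA and Slurm daemons.
for name in wekanode slurmd; do
  pid=$(pgrep -o "$name" 2>/dev/null) || { echo "$name not running"; continue; }
  cg=$(cat "/proc/$pid/cpuset")
  # cgroup v1 mounts the cpuset controller under /sys/fs/cgroup/cpuset;
  # cgroup v2 uses a unified hierarchy with cpuset.cpus.effective.
  for f in "/sys/fs/cgroup/cpuset$cg/cpuset.cpus" \
           "/sys/fs/cgroup$cg/cpuset.cpus.effective"; do
    [ -f "$f" ] && echo "$name: $(cat "$f")"
  done
done
```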

7. Manage hyperthreading

If hyperthreading is enabled, identify the sibling CPUs and include them in both the WEKA mount options and the Slurm CPUSpecList. Although WEKA automatically reserves these CPUs, explicit specification helps prevent potential issues.

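The kernel exposes sibling threads per CPU in sysfs. A sketch, using the example core IDs 54 and 55 from step 4:

```shell
# List hyperthread siblings for each CPU that WEKA reserves.
# 54 and 55 are example core IDs -- use the cores you selected in step 4.
CPUS="54 55"
for cpu in $CPUS; do
  f="/sys/devices/system/cpu/cpu$cpu/topology/thread_siblings_list"
  [ -f "$f" ] && echo "cpu$cpu siblings: $(cat "$f")" || echo "cpu$cpu not present"
done
```

Add any siblings reported here to both the WEKA mount options and the Slurm CPUSpecList.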

8. Address logical and physical CPU index mismatch

Environmental factors (BIOS or hypervisor settings) can cause logical CPU numbers to differ from the physical, OS-assigned numbers. When that happens, Slurm can mistakenly include WEKA-reserved CPUs.

Identify conflicts

If the CPUset configuration shows that Slurm does not correctly exclude the WEKA-assigned CPUs, you have an overlap. For example, WEKA might be assigned physical cores 56-63, yet those cores still appear in the Slurm CPUset, causing conflicts.


This issue often arises from non-sequential CPU numbering, where CPUs are interleaved between NUMA nodes.
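One way to spot such a conflict is to expand both cpulist strings and intersect them. The two list strings below are example values; substitute the output of the step 6 checks:

```shell
# Detect overlap between the WEKA and Slurm cpusets.
# Expand a kernel cpulist string like "0-3,7" into one CPU per line.
expand_cpulist() {
  awk -v list="$1" 'BEGIN {
    n = split(list, parts, ",")
    for (i = 1; i <= n; i++) {
      m = split(parts[i], r, "-")
      lo = r[1] + 0
      hi = (m == 2) ? r[2] + 0 : lo
      for (c = lo; c <= hi; c++) print c
    }
  }'
}
weka_list="56-63"        # WEKA-reserved cores (example)
slurm_list="0-55,56-63"  # Slurm cpuset (example, incorrectly overlapping)
expand_cpulist "$weka_list"  | sort > /tmp/weka_cpus.$$
expand_cpulist "$slurm_list" | sort > /tmp/slurm_cpus.$$
overlap=$(comm -12 /tmp/weka_cpus.$$ /tmp/slurm_cpus.$$)
rm -f /tmp/weka_cpus.$$ /tmp/slurm_cpus.$$
if [ -n "$overlap" ]; then
  echo "Conflict: CPUs in both lists:" $overlap
else
  echo "No overlap"
fi
```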

Resolve index mismatches

  1. Re-verify Step 1: Ensure isolate_cpusets=false is set and the agent is restarted.

  2. Map Indexes: Use hwloc-ls to map Logical (L#) to Physical (P#).


A mismatch exists when the Logical Index (L#) and Physical Index (P#) of a CPU differ in the hwloc-ls output.

  3. Correct Slurm configuration: If a mismatch exists, use the Logical Index numbers in the Slurm CPUSpecList.

    • Example: WEKA uses physical cores 56-63, but hwloc shows these are logical 28-31 and 60-63. Set CPUSpecList=28-31,60-63.

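The mapping step can be sketched as follows. hwloc-ls ships with the hwloc package (it is the text-mode form of lstopo):

```shell
# Map logical (L#) to physical (P#) CPU indexes with hwloc.
if command -v hwloc-ls >/dev/null 2>&1; then
  have_hwloc=1
  hwloc-ls --only pu    # one line per CPU thread, e.g. "PU L#28 (P#56)"
else
  have_hwloc=0
  echo "hwloc-ls not found: install the hwloc package"
fi
```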

Related information

Slurm GRES documentation (for more details on logical and physical core index mapping)
