Avoid conflicting CPU allocations
In a WEKA and Slurm integration, efficient CPU allocation is crucial to prevent conflicts between the WEKA filesystem and Slurm job scheduling. Improper CPU allocation can lead to performance degradation, CPU starvation, or resource contention. This section outlines best practices to ensure WEKA and Slurm coexist harmoniously by carefully managing CPUsets and NUMA node allocations.
1. Disable WEKA CPUset isolation
Disable WEKA's default CPUset isolation to avoid conflicts with Slurm: apply the WEKA agent's isolate_cpuset=false setting, then restart the agent.
2. Verify hyperthreading and NUMA configuration
Verify your system's hyperthreading and NUMA configuration. Hyperthreading is typically disabled in Slurm-managed environments. In this example, hyperthreading is disabled, and there are four NUMA nodes.
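One way to script this check is to read the Linux sysfs topology files directly. The following is a minimal sketch, not a WEKA or Slurm tool: the function name smt_and_numa and the sysfs_root parameter are hypothetical, and it assumes a Linux kernel recent enough to expose cpu/smt/active.

```python
from pathlib import Path


def smt_and_numa(sysfs_root="/sys"):
    """Return (smt_active, numa_node_ids) as reported by Linux sysfs."""
    system = Path(sysfs_root) / "devices" / "system"

    # cpu/smt/active holds "1" when hyperthreading (SMT) is on and "0" when off.
    # The file is absent on older kernels, so report None in that case.
    smt_file = system / "cpu" / "smt" / "active"
    smt_active = smt_file.read_text().strip() == "1" if smt_file.exists() else None

    # Each NUMA node appears as a directory named node<N>.
    node_ids = sorted(
        int(p.name[4:])
        for p in (system / "node").glob("node[0-9]*")
        if p.name[4:].isdigit()
    )
    return smt_active, node_ids
```

On a host that matches this example, calling smt_and_numa() would be expected to return False and four node IDs.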
3. Identify the NUMA node of the dataplane network interface
Determine the NUMA node associated with the dataplane network interface. For instance, ib0 is located in NUMA node 1.
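The kernel exposes a device's NUMA affinity under /sys/class/net/<interface>/device/numa_node. The sketch below reads that file; the function name nic_numa_node and the sysfs_root parameter are hypothetical and exist only for illustration.

```python
from pathlib import Path


def nic_numa_node(interface, sysfs_root="/sys"):
    """Return the NUMA node of a network interface, or None if unknown."""
    node_file = Path(sysfs_root) / "class" / "net" / interface / "device" / "numa_node"
    if not node_file.exists():
        return None  # for example, a virtual interface with no backing device
    node = int(node_file.read_text().strip())
    # The kernel reports -1 when the device has no NUMA affinity.
    return node if node >= 0 else None
```

For the interface in this example, nic_numa_node("ib0") would be expected to return 1.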
4. Assign CPU cores to WEKA
When mounting the WEKA filesystem, specify the CPU cores for the WEKA client to use. These cores should be in the same NUMA node as the network interface.
Avoid using core 0. Typically, the last cores in the NUMA node are chosen. For example:
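As a minimal sketch of that selection rule, the helper below takes a NUMA node's CPU list (in the format the kernel prints in /sys/devices/system/node/node<N>/cpulist) and returns its last cores, skipping core 0. The function names and the 16-31 layout are hypothetical and are not taken from a real system.

```python
def parse_cpulist(text):
    """Expand a kernel-style CPU list such as "16-31" or "0,2,4-6" into a sorted list."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)


def pick_weka_cores(node_cpulist, count):
    """Pick the last `count` cores of a NUMA node, never core 0."""
    candidates = [c for c in parse_cpulist(node_cpulist) if c != 0]
    if not 0 <= count <= len(candidates):
        raise ValueError("NUMA node does not have that many usable cores")
    return candidates[len(candidates) - count:]


# Hypothetical layout: NUMA node 1 owns cores 16-31.
print(pick_weka_cores("16-31", 4))  # [28, 29, 30, 31]
```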
After mounting, confirm the cores and network interfaces used by WEKA:
5. Configure Slurm to exclude WEKA's cores
Configure Slurm to exclude WEKA's cores from those available for user jobs by setting the CPUSpecList parameter.
Verify the configuration with:
6. Verify CPUset configuration
Ensure that the Slurm CPUset excludes the cores assigned to the WEKA client.
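This comparison can be automated once you have the contents of the two cpuset.cpus files. The sketch below takes the two CPU lists as strings rather than file paths, because the cgroup paths depend on the cgroup version and site configuration. The function names and sample values are illustrative assumptions.

```python
def parse_cpulist(text):
    """Expand a kernel-style CPU list such as "0-55" or "56,58,60" into a sorted list."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)


def cpuset_overlap(slurm_cpus, weka_cpus):
    """Return the CPUs present in both cpusets; an empty list means no conflict."""
    return sorted(set(parse_cpulist(slurm_cpus)) & set(parse_cpulist(weka_cpus)))


# Illustrative values only.
print(cpuset_overlap("0-55", "56-63"))         # []
print(cpuset_overlap("0-62", "56,58,60,62"))   # [56, 58, 60, 62]
```

Any non-empty result means Slurm can schedule jobs onto CPUs that the WEKA client is using.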
7. Manage hyperthreading
If hyperthreading is enabled, identify the sibling CPUs and include them in both the WEKA mount options and Slurm’s CPUSpecList. WEKA reserves these sibling CPUs automatically, but specifying them explicitly keeps the configuration unambiguous and can help avoid potential issues.
In this example, hyperthreading is disabled, so no additional CPUs are required:
8. Address logical and physical CPU index mismatch
In certain situations, environmental factors like BIOS or hypervisor settings may cause discrepancies between logical CPU numbers and the physical or OS-assigned numbers. This can result in the Slurm CPUset mistakenly including CPUs that should be reserved for the WEKA client, potentially leading to resource conflicts such as CPU starvation.
For example, if the CPUset configuration shows that Slurm is not correctly excluding the WEKA-assigned CPUs, you might see something like this, where CPUs 56, 58, 60, and 62 are listed in both CPUsets, which will cause conflicts:
The issue may arise from non-sequential CPU numbering, where CPUs are interleaved between NUMA nodes:
To address this, do the following:
Ensure that the WEKA agent's isolate_cpuset=false setting is applied (see Step 1), and that the agent has been restarted.
Use hwloc-ls or lstopo-no-graphics to map the logical index (L#) to the physical/OS index (P#) for the CPUs assigned to WEKA. If the logical and physical indexes don’t match, use the logical index numbers in Slurm’s CPUSpecList parameter.
Examine the output of hwloc-ls or lstopo-no-graphics before starting any WEKA container or mounting any WekaFS filesystem. Checking at this point verifies the logical index numbers that these user-space tools report. Skipping the check may result in incorrect logical index mappings, which can lead to configuration or performance issues.
In this example, the output indicates a mismatch between the L# and P#:
Although WEKA uses physical cores 56-63, set Slurm’s CPUSpecList to 28-31,60-63 to correctly allocate the CPUs based on their logical index.
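The translation from physical to logical indexes can be scripted by parsing the "PU L#n (P#m)" entries that hwloc-ls and lstopo-no-graphics print. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the sample text is synthetic, built to mimic a two-node host whose even-numbered CPUs belong to one NUMA node and odd-numbered CPUs to the other. On a real system, feed it the captured tool output instead.

```python
import re

# Matches entries such as "PU L#28 (P#56)" anywhere in the tool output.
PU_PATTERN = re.compile(r"PU L#(\d+) \(P#(\d+)\)")


def physical_to_logical(lstopo_text):
    """Map each physical/OS CPU index (P#) to its logical index (L#)."""
    return {int(p): int(l) for l, p in PU_PATTERN.findall(lstopo_text)}


def to_cpulist(cpus):
    """Compress CPU numbers into range notation, e.g. [28, 29, 30] -> "28-30"."""
    ranges = []
    for cpu in sorted(set(cpus)):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu  # extend the current run
        else:
            ranges.append([cpu, cpu])  # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def cpu_spec_list(lstopo_text, weka_physical_cpus):
    """Return the logical-index CPU list for the given physical CPUs."""
    mapping = physical_to_logical(lstopo_text)
    return to_cpulist(mapping[p] for p in weka_physical_cpus)


# Synthetic sample: even P# values get L#0-31, odd P# values get L#32-63.
sample = "\n".join(
    f"PU L#{l} (P#{p})"
    for l, p in enumerate(list(range(0, 64, 2)) + list(range(1, 64, 2)))
)
print(cpu_spec_list(sample, range(56, 64)))  # 28-31,60-63
```

With this synthetic layout, physical cores 56-63 translate to the same 28-31,60-63 value used in the example above.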
Related information
Slurm GRES documentation (for more details on logical and physical core index mapping)