...
Examples for using the Intel extension for JAX can be found here.
Distributed Training
Multi-GPU and multi-node jobs can be run by adding the following setup to a job submission script:
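```shell
# Load the Intel compiler and Intel MPI environments
module load intel/2024.0.0
module load impi
# Point to the oneCCL installation and make the Intel MPI libraries visible
export CCL_ROOT=/sw/compiler/intel/oneapi/ccl/2021.12
export LD_LIBRARY_PATH=$I_MPI_ROOT/lib:$LD_LIBRARY_PATH
# Use the first node of the allocation as the master and look up its IP address
hnode=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$(scontrol getaddrs "$hnode" | cut -d' ' -f 2 | cut -d':' -f 1)
export MASTER_PORT=29500
```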
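
To illustrate how the `MASTER_ADDR` line works, here is a minimal sketch of the same `cut` pipeline applied to one sample line. It assumes that `scontrol getaddrs` prints lines of the form `NODE: IP:PORT`, which is what the pipeline above implies. The helper name `extract_addr` and the sample line are hypothetical and serve only as an illustration.

```shell
# Hypothetical helper (not part of Slurm): extract the IP address from one
# line of output, assuming the line has the form "NODE: IP:PORT".
extract_addr() {
    # Field 2 (space-delimited) is "IP:PORT"; field 1 of that (colon-delimited) is the IP.
    printf '%s\n' "$1" | cut -d' ' -f 2 | cut -d':' -f 1
}

# Made-up sample line, not real cluster output.
sample_line="node001: 10.0.0.1:6818"
extract_addr "$sample_line"   # prints 10.0.0.1
```

Because the second `cut` splits on colons, this approach only works for IPv4 addresses.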
It is advantageous to define the GPU tile usage explicitly using an affinity mask (each Intel Max 1550 has two compute “tiles”). Each entry in the mask has the format GPU_ID.TILE_ID (zero-based indices) and specifies which GPU and tile to use. For example, to use two GPUs and four tiles, one can specify:
```shell
export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
export ZE_AFFINITY_MASK=0.0,0.1,1.0,1.1
```
To use four GPUs and eight tiles, one would specify:
```shell
export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
export ZE_AFFINITY_MASK=0.0,0.1,1.0,1.1,2.0,2.1,3.0,3.1
```
These specifications are applied to all nodes of a job.
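
Since the mask follows a regular GPU_ID.TILE_ID pattern, it can also be generated instead of typed out by hand. The following is a minimal sketch, assuming that the first N GPUs of a node are to be used and that every GPU exposes the same number of tiles (two by default, as on the Max 1550). The function name `make_affinity_mask` is hypothetical and not part of any Intel tool.

```shell
# Hypothetical helper: build a ZE_AFFINITY_MASK value for the first N GPUs.
# $1 = number of GPUs, $2 = tiles per GPU (assumed default: 2).
make_affinity_mask() {
    ngpus=$1
    ntiles=${2:-2}
    mask=""
    gpu=0
    while [ "$gpu" -lt "$ngpus" ]; do
        tile=0
        while [ "$tile" -lt "$ntiles" ]; do
            # Append "GPU.TILE", adding a comma only if the mask is non-empty.
            mask="${mask:+$mask,}$gpu.$tile"
            tile=$((tile + 1))
        done
        gpu=$((gpu + 1))
    done
    printf '%s\n' "$mask"
}

export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
export ZE_AFFINITY_MASK=$(make_affinity_mask 2)
echo "$ZE_AFFINITY_MASK"   # prints 0.0,0.1,1.0,1.1
```

Calling `make_affinity_mask 4` yields the eight-tile mask shown above.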