...

Examples for using the Intel extension for JAX can be found here.

Distributed Training

Multi-GPU and multi-node jobs can be run by adding the following setup to a job submission script:

Codeblock

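# Load the Intel compiler and Intel MPI modules
module load intel/2024.0.0
module load impi

# Point to the oneCCL installation and make the Intel MPI libraries visible
export CCL_ROOT=/sw/compiler/intel/oneapi/ccl/2021.12
export LD_LIBRARY_PATH=$I_MPI_ROOT/lib:$LD_LIBRARY_PATH
# Use the first node of the allocation as the master and resolve its address
hnode=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$(scontrol getaddrs "$hnode" | cut -d' ' -f 2 | cut -d':' -f 1)
export MASTER_PORT=29500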
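
If the address lookup fails, MASTER_ADDR ends up empty and the job typically fails later with a less obvious error. The following is a minimal sketch of an early sanity check, not part of the original recipe. The function name check_rendezvous is hypothetical, and the address in the example call is only a placeholder.

```shell
# Hypothetical helper: verify that a rendezvous address and port look usable.
# Arguments: $1 = master address, $2 = master port
check_rendezvous() {
    if [ -z "$1" ]; then
        echo "master address is empty" >&2
        return 1
    fi
    case "$2" in
        ''|*[!0-9]*)
            # Empty or containing a non-digit character
            echo "master port must be a number" >&2
            return 1
            ;;
    esac
    return 0
}

# Example call with placeholder values (assumption, for illustration only)
check_rendezvous "10.0.0.1" "29500" && echo "rendezvous settings look valid"
```

In a job script, one could call check_rendezvous "$MASTER_ADDR" "$MASTER_PORT" || exit 1 directly after the exports above.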

It is advantageous to define the GPU tile usage with affinity masks (each Intel Max 1550 has two compute “tiles”). The mask format GPU_ID.TILE_ID (zero-based indices) specifies which GPU(s) and tile(s) to use. For example, to use two GPUs and four tiles, one can specify:

Codeblock
export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
export ZE_AFFINITY_MASK=0.0,0.1,1.0,1.1

To use four GPUs and eight tiles, one would specify:

Codeblock
export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
export ZE_AFFINITY_MASK=0.0,0.1,1.0,1.1,2.0,2.1,3.0,3.1

These specifications are applied to all nodes of a job.
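
Because the mask follows a regular GPU_ID.TILE_ID pattern, it can also be generated instead of typed by hand. The sketch below assumes the job uses the first N GPUs of each node with all of their tiles. The function build_affinity_mask is a hypothetical helper, not part of any Intel tool.

```shell
# Hypothetical helper: print a ZE_AFFINITY_MASK value covering the first
# $1 GPUs with $2 tiles each (default 2, as on the Intel Max 1550).
# Written for POSIX sh, so the loop variables are global.
build_affinity_mask() {
    num_gpus=$1
    tiles_per_gpu=${2:-2}
    mask=""
    gpu=0
    while [ "$gpu" -lt "$num_gpus" ]; do
        tile=0
        while [ "$tile" -lt "$tiles_per_gpu" ]; do
            # Add a comma separator only when the mask is non-empty
            mask="${mask:+$mask,}$gpu.$tile"
            tile=$((tile + 1))
        done
        gpu=$((gpu + 1))
    done
    printf '%s\n' "$mask"
}

export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
ZE_AFFINITY_MASK="$(build_affinity_mask 2)"
export ZE_AFFINITY_MASK
echo "$ZE_AFFINITY_MASK"   # prints 0.0,0.1,1.0,1.1
```

Calling build_affinity_mask 4 yields the eight-tile mask shown above for four GPUs.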
