PVC AI Tools and Frameworks
Popular tools such as PyTorch, TensorFlow, and JAX can be used with the Intel Distribution for Python (use the offline installer on the login nodes) together with certain framework-specific extensions. Environments can be prepared separately for each framework below for use with Intel GPUs. Note that the module intel/2024.0.0 (under sw.pvc) must be loaded for these frameworks to be installed or run properly.
We also offer a standalone module (intel_AI_tools/2024.0.0) that loads a conda installation with the following pre-installed, Intel GPU/XPU-ready environments:
intel_pytorch_2.1.0a0
intel_tensorflow_2.14.0
intel_jax_0.4.20
Please note that the PVC nodes currently run Rocky Linux 8, so only Python versions <= 3.9 are supported.
NumPy 2.0.0 breaks binary backwards compatibility. If NumPy-related runtime errors are encountered, please consider downgrading to a version < 2.0.0.
PyTorch
Load the Intel oneAPI module and create a new conda environment within your Intel Python distribution:
module load intel/2024.0.0
conda create -n intel_pytorch_gpu python=3.9
conda activate intel_pytorch_gpu
Once the new environment has been activated, the following command installs PyTorch:
python -m pip install torch==2.1.0a0 torchvision==0.16.0a0 torchaudio==2.1.0a0 intel-extension-for-pytorch==2.1.10+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
This installs PyTorch together with the Intel Extension for PyTorch, which is required to run (non-CUDA) operations on Intel GPUs. On a compute node, the presence of GPUs can be assessed:
Python 3.9.18 (tags/v3.9.18-26-g6b320c3b2f6-dirty:6b320c3b2f6, Sep 28 2023, 00:35:27)
[GCC 13.2.0] :: Intel Corporation on linux
Type "help", "copyright", "credits" or "license" for more information.
Intel(R) Distribution for Python is brought to you by Intel Corporation.
Please check out: https://software.intel.com/en-us/python-distribution
>>> import torch
>>> import intel_extension_for_pytorch as ipex
My guessed rank = 0
>>> [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())]
[0]: _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=512, gpu_eu_count=512)
[1]: _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=512, gpu_eu_count=512)
[2]: _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=512, gpu_eu_count=512)
[3]: _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=512, gpu_eu_count=512)
[4]: _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=512, gpu_eu_count=512)
[5]: _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=512, gpu_eu_count=512)
[6]: _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=512, gpu_eu_count=512)
[7]: _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=512, gpu_eu_count=512)
[None, None, None, None, None, None, None, None]
Examples of how to use the Intel Extension for PyTorch can be found here.
TensorFlow
Similar to PyTorch, an Intel extension for TensorFlow exists. To prepare a TensorFlow environment for use with Intel GPUs, first create a new conda environment:
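For example, mirroring the PyTorch setup above (the environment name intel_tensorflow_gpu is only a placeholder):

module load intel/2024.0.0
conda create -n intel_tensorflow_gpu python=3.9
conda activate intel_tensorflow_gpu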
Once the new environment has been activated, the following commands install TensorFlow:
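A sketch of the install, assuming the intel-extension-for-tensorflow package from PyPI with the [xpu] extra; the version pin is illustrative and should match the intel_tensorflow_2.14.0 environment listed above:

python -m pip install tensorflow==2.14.0
python -m pip install --upgrade "intel-extension-for-tensorflow[xpu]"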
This installs TensorFlow together with its Intel extension, which is required to run (non-CUDA) operations on Intel GPUs. On a compute node, the presence of GPUs can be assessed:
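For instance, from a Python session on the compute node (assuming the extension registers Intel GPUs under the XPU device type):

import tensorflow as tf

# Intel GPUs are exposed by the extension as devices of type "XPU"
print(tf.config.list_physical_devices('XPU'))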
Examples of how to use the Intel extension for TensorFlow can be found here.
JAX
Intel XPU support for JAX is still experimental as of version 0.4.20.
Like PyTorch and TensorFlow, JAX also has an extension via OpenXLA. To prepare a JAX environment for use with Intel GPUs, first create a new conda environment:
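For example (the environment name intel_jax_gpu is only a placeholder):

module load intel/2024.0.0
conda create -n intel_jax_gpu python=3.9
conda activate intel_jax_gpu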
Once the environment is activated, the following commands install JAX:
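A sketch of the install, assuming the intel-extension-for-openxla package from PyPI; the version pins are illustrative and should match the intel_jax_0.4.20 environment listed above:

python -m pip install jax==0.4.20 jaxlib==0.4.20
python -m pip install --upgrade intel-extension-for-openxla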
This installs JAX together with its Intel extension, which is required to run (non-CUDA) operations on Intel GPUs. On a compute node, the presence of GPUs can be assessed:
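For instance, from a Python session on the compute node:

import jax

# With the OpenXLA extension installed, Intel GPUs should be listed as xpu devices
print(jax.devices())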
Examples for using the Intel extension for JAX can be found here.
Distributed Training
Multi-GPU and multi-node jobs can be executed using the following strategy in a job submission script:
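A minimal sketch, assuming a Slurm-style batch script (the resource counts and environment name are placeholders; adapt to this system's scheduler settings):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=01:00:00

module load intel/2024.0.0
conda activate intel_pytorch_gpu

# GPU/tile selection and the MPI launch line are described below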
It is advantageous to define the GPU tile usage (each Intel Max 1550 has two compute “tiles”) using affinity masks, where the format GPU_ID.TILE_ID (zero-based index) specifies which GPU(s) and tile(s) to use. E.g., to use two GPUs and four tiles, one can specify:
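For example, assuming the standard Level Zero ZE_AFFINITY_MASK environment variable:

export ZE_AFFINITY_MASK=0.0,0.1,1.0,1.1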
To use four GPUs and eight tiles, one would specify:
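Again assuming ZE_AFFINITY_MASK:

export ZE_AFFINITY_MASK=0.0,0.1,1.0,1.1,2.0,2.1,3.0,3.1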
These specifications are applied to all nodes of a job. For more information, and alternative modes, please see the Intel Level Zero documentation.
Intel MPI can then be used to distribute and run your job, e.g.:
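A sketch of a launch line (the total rank count, ranks per node, and script name train.py are placeholders):

mpirun -n 16 -ppn 8 python train.py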