User Manual

    TensorFlow

    Updated Feb. 02, 2024

     


     


    TensorFlow is a powerful Python package for deep learning, automatic differentiation, and optimization that supports eager execution and JIT compilation on both CPUs and GPU accelerators. It can be loaded in a Python environment, and the presence of GPU accelerators can be checked as follows:

    Python 3.9.18 (main, Sep 11 2023, 13:41:44) [GCC 11.2.0] :: Anaconda, Inc. on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import tensorflow as tf
    >>> dl = tf.config.list_physical_devices()
    >>> for d in dl:
    ...     print(d)
    ...
    PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')
    PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
    PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')
    PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU')
    PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')

    Extensions

    The anaconda3/2023.09 module also contains some useful TensorFlow-related packages:

    • Keras - Python API for building and training TensorFlow models with less boilerplate.

    • Horovod - Python package for distributed, multinode training with TensorFlow (as well as other deep learning frameworks).
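As an illustration of the Keras API, a small convolutional classifier for 28x28 grayscale images (as in Fashion-MNIST) can be built and compiled in a few lines. This is a minimal sketch; the layer sizes here are illustrative and are not the architecture used in the examples below.

```python
import tensorflow as tf

# Minimal Keras model for 28x28 grayscale images (illustrative sizes only).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),  # 3x3 convolution
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),   # 10 Fashion-MNIST classes
])

# compile() attaches the optimizer, loss, and metrics in one call.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```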

    Examples

    Examples of CPU and (multi-)GPU training tasks for HPC environments can be found here. Below, we reproduce examples for training convolutional neural network image classification models on the Fashion-MNIST dataset.

    Currently, there is no Lightning support for TensorFlow. However, users may still find the same config-parsing backend, jsonargparse, useful for developing models and conducting machine learning experiments on the compute nodes.

    Setup (on login node):

    This installs the example package and its dependencies:

    $ module load anaconda3/2023.09
    $ conda activate base
    $ git clone https://github.com/Ruunyox/tf-hpc
    $ cd tf-hpc
    $ pip install --user .

    1. Single node, single GPU:

    We start with a training YAML file (config_conv_gpu.yaml) appropriate for Keras. Since only one GPU is needed, it is better to use the gpu-a100:shared partition and request a single GPU (gres=gpu:A100:1) rather than queuing for a full node with 4 GPUs. The following SLURM submission script details the options:

    #! /bin/bash
    #SBATCH -J tf_cli_conv_test_gpu
    #SBATCH -o tf_cli_conv_test_gpu.out
    #SBATCH --time=00:30:00
    #SBATCH --partition=gpu-a100
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --gres=gpu:A100:1
    #SBATCH --mem-per-cpu=1G
    #SBATCH --cpus-per-task=4

    module load sw.a100
    module load cuda/11.8
    module load anaconda3/2023.09
    conda activate base

    export TF_CPP_MIN_LOG_LEVEL=2
    export XLA_FLAGS=--xla_gpu_cuda_data_dir=/sw/compiler/cuda/11.8/a100/install

    tfhpc --config config_conv_gpu.yaml

    and can be run using:

    $ sbatch cli_test_conv_gpu.sh

    The results can be inspected using the TensorBoard package (also included in the anaconda3/2023.09 module):

    $ tensorboard --logdir ./fashionmnist_conv_gpu/tensorboard --port 8877

    which can be viewed on your local machine via SSH tunneling:

    $ ssh -NL 8877:localhost:8877 your_hlrn_username@your_login_address

    Note: you may change the port 8877 to something else if needed. Alternatively, you may copy your events* logfiles to your local machine and inspect them with TensorBoard there.

    2. Single node, multiple GPUs

    Adding more GPUs with Keras is as simple as setting:

    strategy:
      name: mirrored_strategy
      opts:
        devices: ["/gpu:0", "/gpu:1", "/gpu:2", "/gpu:3"]
        cross_device_ops:
          op: hierarchical_copy_all_reduce
          opts: null

    in the training YAML (see config_conv_multi_gpu.yaml), and requesting a non-shared partition in the SBATCH options:

    #SBATCH --partition=gpu-a100
    #SBATCH --gres=gpu:A100:4

    Remember that the number of GPUs requested through SLURM must match the number requested in the Keras training YAML.
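The strategy block above maps onto TensorFlow's tf.distribute API. As a rough programmatic sketch (the exact mapping is done internally by the tf-hpc package):

```python
import tensorflow as tf

# Programmatic counterpart of the mirrored_strategy YAML block (a sketch).
# With no devices argument, MirroredStrategy uses all visible GPUs (or the
# CPU if none are present); pass devices=["/gpu:0", ..., "/gpu:3"] to restrict.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()
)

with strategy.scope():
    # Variables created inside the scope are replicated across the devices.
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(optimizer="adam", loss="mse")

print(strategy.num_replicas_in_sync)
```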

    3. Multiple node, multiple GPUs

    For training across multiple nodes with TensorFlow, we direct users to the Horovod examples.
