While in most cases a simple pip install tensorflow
works just fine, certain hardware configurations may be incompatible with the precompiled tensorflow package. In this brief tutorial I will build the latest tensorflow 2.3.1 python package from source. This tutorial may also be helpful for those who want to run the latest tensorflow version on older GPUs, because support for older hardware has been removed from the precompiled packages since 2.3.0.
Prepare the building environment
Obtain the following docker container:
docker pull tensorflow/tensorflow:devel-gpu
Choose a place and create a directory that you will share with the container. In my case, I will use /home/alexandr/temp/tensorflow
. Then enter the working directory and start the docker container
cd /home/alexandr/temp/tensorflow
docker run -it -w /tensorflow_src -v $(pwd):/share tensorflow/tensorflow:devel-gpu bash
Update the repository within the container and choose the latest stable branch (at the time of writing, that was 2.3)
git pull
git checkout r2.3
Next, upgrade pip and install a few python dependencies
/usr/bin/python3 -m pip install --upgrade pip
pip3 install six numpy wheel keras_applications keras_preprocessing
Figure out the CPU limitations of the target machine
You need to tell the compiler which instructions to avoid in the final binaries. This is not an obvious step, as these limitations are machine-specific. You may want to consult the internet and even use some trial and error to see which flags are required to make the binaries stable on your particular machine.
If you are compiling on the target machine, -march=native
should suffice, as it enables all instructions your CPU supports. If you cross-compile, as I do in this example, you need to dig deeper.
One approach is to look at grep flags /proc/cpuinfo | head -n 1
which shows the CPU feature flags. In my case, I have a laptop where the latest tensorflow works out of the box and a desktop PC with a GPU where it does not. Comparing the lists from the two machines, I find that the desktop PC lacks the following items: 'ida', 'bmi2', 'smep', 'rtm', 'bmi1', 'fma', 'f16c', 'hle', 'avx2', 'smx', 'adx', 'avx', 'mpx'.
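You can automate this comparison by dumping each machine's flag list to a file and diffing the two with comm. A minimal sketch (the file names laptop_flags.txt and desktop_flags.txt are my choice, nothing standard):

```shell
# Dump this machine's CPU feature flags, one per line, sorted
# (comm requires sorted input); the first token "flags:" is filtered out:
grep -m1 '^flags' /proc/cpuinfo | tr -s ' ' '\n' | grep -v ':' | sort > laptop_flags.txt

# Produce desktop_flags.txt the same way on the other machine, copy it here,
# then list the features the laptop has but the desktop lacks:
# comm -23 laptop_flags.txt desktop_flags.txt
```

For the two machines in this example, the comm output would be exactly the list of missing features shown above.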
Here we can see that ida
stands for Intel Dynamic Acceleration and is part of the CPU’s Thermal and Power Management, so it is not very likely to be the breaking factor in my case. In the same vein, the smep, rtm, hle, smx,
and mpx
features are unlikely to affect tensorflow execution. I could not easily find whether f16c
and adx
are used by the tensorflow binaries.
On the other hand, avx
and avx2
(Advanced Vector Extensions), bmi1 and bmi2
(1st and 2nd bit manipulation instruction sets), and fma
(fused multiply-add) seem to be quite important for tensorflow. Therefore, I will use the following combination of flags to build the tensorflow binaries
-march=native -mno-avx -mno-avx2 -mno-fma -mno-bmi -mno-bmi2
Configure tensorflow’s build chain
In the docker container execute
python3 configure.py
to start a configuration manager. For most questions you can just select the default answer.
One important question is about the compute capability of your GPU. Since the default option may not include your GPU type, it is better to check it ahead of time here and enter it in the field provided. If you misspecify your GPU compute capability, you are likely to get the following error message when you attempt to use CUDA in tensorflow:
InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid
When you get to the question about optimization flags, enter the flags we came up with in the previous section. In my case, I also added the -Wno-sign-compare
flag. Once you are done, you can start the build process by running
bazel build //tensorflow/tools/pip_package:build_pip_package --local_ram_resources=16384
If bazel complains about its version, it will likely also provide you with a one-liner to update it.
The build process requires a substantial amount of RAM, especially on machines with many cores, so you may want to limit RAM usage with the flag --local_ram_resources=16384
. In my case, I limit it to 16 GiB out of the 24 GiB available on the machine. Another way of limiting resource usage is to restrict the number of parallel build jobs with the flag --jobs 4
.
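Note that --local_ram_resources takes its value in megabytes, which is why 16384 corresponds to 16 GiB. If you would rather derive the cap from the machine you are building on, here is a sketch (the two-thirds ratio is my arbitrary choice, not a bazel recommendation):

```shell
# Read total RAM in kB from the kernel, convert to MB,
# and keep two thirds of it for bazel:
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
echo "--local_ram_resources=$(( total_kb / 1024 * 2 / 3 ))"
```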
The build will take quite a long time.
Prepare and install a python package
Execute the following command to assemble a python package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /share
Now you should see a .whl
file in the mounted directory on the host machine. Copy that file to the target machine and then install it with pip
. The filename indicates which python version you should use on the target machine. If you have a different version, you can use conda environments to create a separate environment for tensorflow with the required version of python.
conda create -n "tensorflow2" python=3.6
conda activate tensorflow2
pip install tensorflow-2.3.1-cp36-cp36m-linux_x86_64.whl
Now you should be able to import and use tensorflow in this new environment.
That is it!