
CentOS 7.9.2009 fails with Tensorflow 1.15.5+nv21.06: bad URI https://oauth2:[email protected]/cudnn/cudnn_frontend.git #64

Open
kognat-docs opened this issue Jul 2, 2022 · 4 comments

Comments


kognat-docs commented Jul 2, 2022

Please make sure that this is a build/installation issue. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:build_template

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS 7.9.2009
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): source
  • TensorFlow version: 1.15.5+nv21.06
  • Python version: 3.8
  • Installed using virtualenv? pip? conda?:
  • Bazel version (if compiling from source): 0.24.1
  • GCC/Compiler version (if compiling from source): 7.3.0
  • CUDA/cuDNN version: CUDA 11.3.1 CUDNN 8.2.1
  • GPU model and memory: RTX 3090

Describe the problem

Cannot access the URL here:
https://github.com/NVIDIA/tensorflow/blob/r1.15.5%2Bnv21.06/tensorflow/workspace.bzl#L135

    new_git_repository(
        name = "cudnn_frontend_archive",
        build_file = clean_dep("//third_party:cudnn_frontend.BUILD"),
        patches = [clean_dep("//third_party:cudnn_frontend_header_fix.patch")],
        patch_args = ['-p1'],
        commit = "e9ad21cc61f8427bbaed98045b7e4f24bad57619",
        remote = "https://oauth2:[email protected]/cudnn/cudnn_frontend.git"
    )

Provide the exact sequence of commands / steps that you executed before running into the problem

Running `bazel build` to compile from source (linking against GLIBC 2.17) fails with the following error.

git clone https://github.com/NVIDIA/tensorflow.git
cd tensorflow
git checkout r1.15.5+nv21.06

Install a working build toolchain: a conda environment with libstdc++ 9.5, GCC 7.3.0 from the SCL devtoolset, Bazel 0.24.1, and Python 3.8 with the other Python dependencies.

./configure
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.google.protobuf.UnsafeUtil (file:/home/sam/.cache/bazel/_bazel_sam/install/96b7e79a4e60cc1d7fbf4394c4acc8a6/_embedded_binaries/A-server.jar) to field java.nio.Buffer.address
WARNING: Please consider reporting this to the maintainers of com.google.protobuf.UnsafeUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
You have bazel 0.24.1- (@non-git) installed.
Please specify the location of python. [Default is /home/sam/Documents/env-tf-1.15.5-nv21.06-centos/bin/python]: 


Found possible Python library paths:
  /home/sam/Documents/env-tf-1.15.5-nv21.06-centos/lib/python3.8/site-packages
Please input the desired Python library path to use.  Default is [/home/sam/Documents/env-tf-1.15.5-nv21.06-centos/lib/python3.8/site-packages]

Do you wish to build TensorFlow with XLA JIT support? [Y/n]: 
XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: 
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with ROCm support? [y/N]: 
No ROCm support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Do you wish to build TensorFlow with TensorRT support? [y/N]: 
No TensorRT support will be enabled for TensorFlow.

Could not find any cuda.h matching version '' in any subdirectory:
        ''
        'include'
        'include/cuda'
        'include/*-linux-gnu'
        'extras/CUPTI/include'
        'include/cuda/CUPTI'
of:
        '/lib64'
        '/usr'
        '/usr/lib64//bind9-export'
        '/usr/lib64/atlas'
        '/usr/lib64/dyninst'
        '/usr/lib64/iscsi'
        '/usr/lib64/mysql'
        '/usr/lib64/qt-3.3/lib'
Asking for detailed CUDA configuration...

Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 10]: 11.3


Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7]: 8.2


Please specify the locally installed NCCL version you want to use. [Leave empty to use http://github.com/nvidia/nccl]: 2.11


Please specify the comma-separated list of base paths to look for CUDA libraries and headers. [Leave empty to use the default]: /home/sam/Documents/env-tf-1.15.5-nv21.06-centos


Found CUDA 11.3 in:
    /home/sam/Documents/env-tf-1.15.5-nv21.06-centos/lib
    /home/sam/Documents/env-tf-1.15.5-nv21.06-centos/include
Found cuDNN 8 in:
    /home/sam/Documents/env-tf-1.15.5-nv21.06-centos/lib
    /home/sam/Documents/env-tf-1.15.5-nv21.06-centos/include
Found NCCL 2 in:
    /home/sam/Documents/env-tf-1.15.5-nv21.06-centos/lib
    /home/sam/Documents/env-tf-1.15.5-nv21.06-centos/include


Please specify a list of comma-separated CUDA compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size, and that TensorFlow only supports compute capabilities >= 3.5 [Default is: 3.5,7.0]: 3.5,3.7,5.0,5.2,6.0,6.1,7.0,7.5,8.0,8.6


Do you want to use clang as CUDA compiler? [y/N]: 
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /opt/rh/devtoolset-7/root/usr/bin/gcc]: 


Do you wish to build TensorFlow with MPI support? [y/N]: 
No MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native -Wno-sign-compare]: 


Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: 
Not configuring the WORKSPACE for Android builds.

CC=/opt/rh/devtoolset-7/root/usr/bin/gcc CXX=/opt/rh/devtoolset-7/root/usr/bin/g++ bazel build --config=v1 //tensorflow/tools/pip_package:build_pip_package

more .tf_configure.bazelrc

build --action_env PYTHON_BIN_PATH="/home/sam/Documents/env-tf-1.15.5-nv21.06-centos/bin/python"
build --action_env PYTHON_LIB_PATH="/home/sam/Documents/env-tf-1.15.5-nv21.06-centos/lib/python3.8/site-packages"
build --python_path="/home/sam/Documents/env-tf-1.15.5-nv21.06-centos/bin/python"
build:xla --define with_xla_support=true
build --config=xla
build --action_env TF_USE_CCACHE="0"
build --action_env TF_CUDA_VERSION="11.3"
build --action_env TF_CUDNN_VERSION="8.2"
build --action_env TF_NCCL_VERSION="2.11"
build --action_env TF_CUDA_PATHS="/home/sam/Documents/env-tf-1.15.5-nv21.06-centos"
build --action_env CUDA_TOOLKIT_PATH="/home/sam/Documents/env-tf-1.15.5-nv21.06-centos"
build --action_env TF_CUDA_COMPUTE_CAPABILITIES="3.5,3.7,5.0,5.2,6.0,6.1,7.0,7.5,8.0,8.6"
build --action_env LD_LIBRARY_PATH="/home/sam/Documents/env-tf-1.15.5-nv21.06-centos/lib"
build --action_env GCC_HOST_COMPILER_PATH="/opt/rh/devtoolset-7/root/usr/bin/gcc"
build --config=cuda
build --copt=-march=native
build --copt=-Wno-sign-compare
build:opt --define with_default_optimizations=true
build:v2 --define=tf_api_version=2
test --flaky_test_attempts=3
test --test_size_filters=small,medium
test --test_tag_filters=-benchmark-test,-no_oss,-oss_serial
test --build_tag_filters=-benchmark-test,-no_oss
test --test_tag_filters=-gpu
test --build_tag_filters=-gpu
build --action_env TF_CONFIGURE_IOS="0"

Any other info / logs
Would prefer to build with 1.15+nv21.06 as it is the oldest version with NVIDIA library support via conda.

My clients are not interested in installing more current kernel drivers.

My clients are not interested in using Docker containers.

Error log:

ERROR: Analysis of target '//tensorflow/tools/pip_package:build_pip_package' failed; build aborted: no such package '@cudnn_frontend_archive//': Traceback (most recent call last):
	File "/home/sam/.cache/bazel/_bazel_sam/a31973e2a597fcbee537e0e8a93418a8/external/bazel_tools/tools/build_defs/repo/git.bzl", line 157
		_clone_or_update(ctx)
	File "/home/sam/.cache/bazel/_bazel_sam/a31973e2a597fcbee537e0e8a93418a8/external/bazel_tools/tools/build_defs/repo/git.bzl", line 74, in _clone_or_update
		fail(("error cloning %s:\n%s" % (ctx....)))
error cloning cudnn_frontend_archive:
+ cd /home/sam/.cache/bazel/_bazel_sam/a31973e2a597fcbee537e0e8a93418a8/external
+ rm -rf /home/sam/.cache/bazel/_bazel_sam/a31973e2a597fcbee537e0e8a93418a8/external/cudnn_frontend_archive /home/sam/.cache/bazel/_bazel_sam/a31973e2a597fcbee537e0e8a93418a8/external/cudnn_frontend_archive
+ git clone https://oauth2:[email protected]/cudnn/cudnn_frontend.git /home/sam/.cache/bazel/_bazel_sam/a31973e2a597fcbee537e0e8a93418a8/external/cudnn_frontend_archive
Cloning into '/home/sam/.cache/bazel/_bazel_sam/a31973e2a597fcbee537e0e8a93418a8/external/cudnn_frontend_archive'...
fatal: unable to access 'https://gitlab-master.nvidia.com/cudnn/cudnn_frontend.git/': Could not resolve host: gitlab-master.nvidia.com; Unknown error
+ git clone https://oauth2:[email protected]/cudnn/cudnn_frontend.git /home/sam/.cache/bazel/_bazel_sam/a31973e2a597fcbee537e0e8a93418a8/external/cudnn_frontend_archive
Cloning into '/home/sam/.cache/bazel/_bazel_sam/a31973e2a597fcbee537e0e8a93418a8/external/cudnn_frontend_archive'...
fatal: unable to access 'https://gitlab-master.nvidia.com/cudnn/cudnn_frontend.git/': Could not resolve host: gitlab-master.nvidia.com; Unknown error

samhodge commented Jul 2, 2022

The workaround is to clone https://github.com/NVIDIA/cudnn-frontend.git into the directory one level above the tensorflow repository and to use the r1.15.5+nv22.05 branch source code.

kognat-docs (Author)

This is functional on RHEL 7 with glibc 2.17

nluehr (Contributor) commented Nov 17, 2022

The "workaround" of cloning cudnn-frontend from NVIDIA/cudnn-frontend noted above is the correct procedure as documented in the build instructions here.

The reference to the private repository (https://gitlab-master.nvidia.com/cudnn/cudnn_frontend.git) was replaced starting in 21.08 with the following.

    native.new_local_repository(
        name = "cudnn_frontend_archive",
        build_file = clean_dep("//third_party:cudnn_frontend.BUILD"),
        path = "../cudnn-frontend",
    )
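Since `new_local_repository` takes a plain filesystem path, the only requirement is the relative layout. A minimal check of that layout (directory names here are just for illustration):

```shell
# Recreate the expected layout: cudnn-frontend one level above
# the workspace directory that contains workspace.bzl / WORKSPACE.
mkdir -p demo/cudnn-frontend demo/tensorflow
cd demo/tensorflow
# From the workspace root, the "../cudnn-frontend" path must exist:
test -d ../cudnn-frontend && echo "cudnn-frontend found"
```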

kognat-docs (Author)

Sorry, looks like I needed to get my glasses out and read the fine print.

Thanks for the update

Sam
