Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tweak libpaths in TensorFlow easyblock by adding directory containing libnccl.so.2 #3497

Open
wants to merge 6 commits into
base: develop
Choose a base branch
from

Conversation

trz42
Copy link
Contributor

@trz42 trz42 commented Oct 29, 2024

When building TensorFlow-2.15.1-foss-2023a-CUDA-12.1.1.eb for EESSI we hit an error that libnccl.so.2 wasn't found in 533 test cases. In the backtrace we saw the following (# number of occurrences):

    from tensorflow.python.platform import _pywrap_cpu_feature_guard # 515
    from tensorflow.python import _pywrap_py_exception_registry # 16
    from tensorflow.python.framework import _errors_test_helper # 2

The corresponding shared libraries for the imports are

  • _pywrap_cpu_feature_guard.so,
  • _pywrap_py_exception_registry.so, and
  • _errors_test_helper.so.

While _pywrap_py_exception_registry.so depends on libnccl.so.2 directly, the other two depend on _pywrap_tensorflow_internal.so which depends on libnccl.so.2 directly. Regardless of these dependency graph differences, the underlying issue causing the import error seems to be the same.

The problem seems to be caused by a combination of factors (below the list we illustrate the issue):

  • The RPATH section of _pywrap_tensorflow_internal.so or _pywrap_py_exception_registry.so contains a relative path to a location that contains libnccl.so.2 (via something like $ORIGIN/../...)
  • However, it seems that this relative path is applied to the wrong base directory. In that base directory, the shared library _pywrap_tensorflow_internal.so or _pywrap_py_exception_registry.so is actually a symbolic link to another directory.
  • Following the relative path (e.g., some $ORIGIN/../... in the RPATH header) from that other directory, the shared library libnccl.so.2 is found.
  • The easyblock implemented by tensorflow.py does not add the absolute directory that contains libnccl.so.2 and which is provided by EESSI (e.g., /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software/NCCL/2.18.3-GCCcore-12.3.0-CUDA-12.1.1/lib) to system_libs_info[2] and thus this absolute location doesn't get added to LIBRARY_PATH which is then used by the RPATH wrapper linker script. The absolute location to the NCCL installation is derived in the tensorflow.py easyblock (see variable nccl_root) and stored in some NCCL_INSTALL_PATH variable in many/all command invocations. The directory containing libnccl.so.2 could easily be obtained as os.path.join(nccl_root, 'lib').
  • The installation procedure for TensorFlow knows the location of libnccl.so.2 and seems to copy the file to some "internal" location and then uses that location to add some relative paths to the RPATH section in libraries that depend on it. However, since the absolute location is fixed for EESSI (for specific architecture), this absolute location could be used.

This PR does the latter, it adds the known absolute location to libnccl.so.2 to the RPATH section of shared libraries (via the RPATH linker wrappers).

Maybe there are other/better ways to do this - using the known absolute path to libnccl.so.2 and make the RPATH linker wrapper aware of it.

Illustrating the issue (via the import error for _pywrap_py_exception_registry):

  • Error
FAIL: //tensorflow/python/kernel_tests/sparse_ops:sparse_slice_op_test (see /dev/shm/tmp/easybuild/build/TensorFlow/2.15.1/foss-2023a-CUDA-12.1.1/TensorFlow/bazel-root/118f384e922c3d547bdefb95531c44fe/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/python/kernel_tests/sparse_ops/sparse_slice_op_test/test.log)
INFO: From Testing //tensorflow/python/kernel_tests/sparse_ops:sparse_slice_op_test:
==================== Test output for //tensorflow/python/kernel_tests/sparse_ops:sparse_slice_op_test:
Traceback (most recent call last):
  File "/dev/shm/tmp/easybuild/build/TensorFlow/2.15.1/foss-2023a-CUDA-12.1.1/TensorFlow/bazel-root/118f384e922c3d547bdefb95531c44fe/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/sparse_ops/sparse_slice_op_test.runfiles/org_tensorflow/tensorflow/python/kernel_tests/sparse_ops/sparse_slice_op_test.py", line 19, in <module>
    from tensorflow.python.framework import errors
  File "/dev/shm/tmp/easybuild/build/TensorFlow/2.15.1/foss-2023a-CUDA-12.1.1/TensorFlow/bazel-root/118f384e922c3d547bdefb95531c44fe/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/sparse_ops/sparse_slice_op_test.runfiles/org_tensorflow/tensorflow/python/framework/errors.py", line 21, in <module>
    from tensorflow.python.framework import errors_impl as _impl
  File "/dev/shm/tmp/easybuild/build/TensorFlow/2.15.1/foss-2023a-CUDA-12.1.1/TensorFlow/bazel-root/118f384e922c3d547bdefb95531c44fe/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/sparse_ops/sparse_slice_op_test.runfiles/org_tensorflow/tensorflow/python/framework/errors_impl.py", line 21, in <module>
    from tensorflow.python import _pywrap_py_exception_registry
ImportError: libnccl.so.2: cannot open shared object file: No such file or directory
  • Checking ldd on the imported shared library
cd /dev/shm/tmp/easybuild/build/TensorFlow/2.15.1/foss-2023a-CUDA-12.1.1/TensorFlow/bazel-root/118f384e922c3d547bdefb95531c44fe/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/sparse_ops/sparse_slice_op_test.runfiles/org_tensorflow/tensorflow/python; ldd _pywrap_py_exception_registry.so | grep libnccl.so.2
        libnccl.so.2 => not found
        libnccl.so.2 => not found
        libnccl.so.2 => not found
  • Determining target of symlink
cd /dev/shm/tmp/easybuild/build/TensorFlow/2.15.1/foss-2023a-CUDA-12.1.1/TensorFlow/bazel-root/118f384e922c3d547bdefb95531c44fe/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/sparse_ops/sparse_slice_op_test.runfiles/org_tensorflow/tensorflow/python; realpath _pywrap_py_exception_registry.so
/dev/shm/tmp/easybuild/build/TensorFlow/2.15.1/foss-2023a-CUDA-12.1.1/TensorFlow/bazel-root/118f384e922c3d547bdefb95531c44fe/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/_pywrap_py_exception_registry.so
  • Checking ldd from the directory for the target of the symlink
cd /dev/shm/tmp/easybuild/build/TensorFlow/2.15.1/foss-2023a-CUDA-12.1.1/TensorFlow/bazel-root/118f384e922c3d547bdefb95531c44fe/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python; ldd _pywrap_py_exception_registry.so | grep libnccl.so.2
        libnccl.so.2 => /dev/shm/tmp/easybuild/build/TensorFlow/2.15.1/foss-2023a-CUDA-12.1.1/TensorFlow/bazel-root/118f384e922c3d547bdefb95531c44fe/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/./../../_solib_local/_U@local_Uconfig_Unccl_S_S_Cnccl___Uexternal_Slocal_Uconfig_Unccl/libnccl.so.2 (0x00007eff6832c000)

@ocaisa
Copy link
Member

ocaisa commented Oct 29, 2024

I think what you need to do here is add NCCL to

and this should trigger the same inclusion...but I'm not sure of the syntax. Maybe @Flamefire can make a suggestion

@trz42
Copy link
Contributor Author

trz42 commented Oct 29, 2024

I think what you need to do here is add NCCL to

and this should trigger the same inclusion...but I'm not sure of the syntax. Maybe @Flamefire can make a suggestion

That would be nice, but I'm unsure. All those packages are non-GPU (CUDA) packages. If I add NCCL with CUDA to the list, would that not imply that also the non-CUDA version of TensorFlow depends on this?

I can give it a try.

@trz42
Copy link
Contributor Author

trz42 commented Oct 29, 2024

After commit 0f331a8 the libnccl.so.2 issue is gone.

@trz42
Copy link
Contributor Author

trz42 commented Oct 30, 2024

I think what you need to do here is add NCCL to

and this should trigger the same inclusion...but I'm not sure of the syntax. Maybe @Flamefire can make a suggestion

That would be nice, but I'm unsure. All those packages are non-GPU (CUDA) packages. If I add NCCL with CUDA to the list, would that not imply that also the non-CUDA version of TensorFlow depends on this?

I can give it a try.

Adding the line

        ('NCCL', '2.18.3:'): 'nccl',

to

results in the same error (libnccl.so.2 cannot be found).

The setting for LIBRARY_PATH that is used to run commands by Bazel does not contain any path to NCCL, it is just

    LIBRARY_PATH=/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/cURL/8.0.1-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/double-conversion/3.3.0-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/flatbuffers/23.5.26-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/giflib/5.2.1-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/hwloc/2.9.1-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/ICU/73.2-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/JsonCpp/1.9.5-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/libjpeg-turbo/2.1.5.1-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/libpng/1.6.39-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/nsync/1.26.0-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/pybind11/2.11.1-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/snappy/1.1.10-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/SQLite/3.42.0-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/OpenSSL/1.1/lib \

while running with the easyblock in this PR it is

    LIBRARY_PATH=/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/cURL/8.0.1-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/double-conversion/3.3.0-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/flatbuffers/23.5.26-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/giflib/5.2.1-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/hwloc/2.9.1-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/ICU/73.2-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/JsonCpp/1.9.5-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/libjpeg-turbo/2.1.5.1-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/libpng/1.6.39-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/nsync/1.26.0-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/pybind11/2.11.1-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/snappy/1.1.10-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/SQLite/3.42.0-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/OpenSSL/1.1/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software/NCCL/2.18.3-GCCcore-12.3.0-CUDA-12.1.1/lib \

Copy link
Member

@ocaisa ocaisa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very nervous about making changes to the TF easyblock as there is a lot going on there, an opinion from @Flamefire would be welcome

@@ -562,6 +562,15 @@ def configure_step(self):
tensorrt_root = get_software_root('TensorRT')
nccl_root = get_software_root('NCCL')

# add path to libnccl.so.2 directory provided by NCCL when both sysroot
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an if block for NCCL further down, best to keep this stuff together there (and the libpath part doesn't get used until a later step anyway)

Comment on lines 568 to 572
system_libs_info_as_list = list(self.system_libs_info)
new_libpaths = system_libs_info_as_list[2]
new_libpaths.append(os.path.join(nccl_root, 'lib'))
system_libs_info_as_list[2] = new_libpaths
self.system_libs_info = tuple(system_libs_info_as_list)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
system_libs_info_as_list = list(self.system_libs_info)
new_libpaths = system_libs_info_as_list[2]
new_libpaths.append(os.path.join(nccl_root, 'lib'))
system_libs_info_as_list[2] = new_libpaths
self.system_libs_info = tuple(system_libs_info_as_list)
system_libs_info_as_list = list(self.system_libs_info)
system_libs_info_as_list[2].append(get_software_libdir('NCCL'))
self.system_libs_info = tuple(system_libs_info_as_list)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will test if that works incl moving the code to the if block for nccl_root.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This almost worked. get_software_libdir('NCCL') returns the directory name, not the full path. Slightly modified this, see f6a9afd The latter version worked.


# add path to libnccl.so.2 directory provided by NCCL when both sysroot
# and RPATH are used (such as in EESSI)
if build_option('sysroot') and self.toolchain.use_rpath:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is build_option('sysroot') a necessary condition here? The core issue is use of rpath and lack of LD_LIBRARY_PATH

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s not a necessary condition. It rather limits when this code would be executed.

An alternative approach could be to use some environment variable which could contain paths to be added to LIBRARY_PATH. In this easyblock we could check if it is set and append the paths to the third tuple element. In EESSI, we could then set this in a hook.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a general issue though, if you have that configuration (rpath and no LD_LIBRARY_PATH) then you will run into this problem. The solution is specific to NCCL, and that is fine. I wouldn't introduce something arbitrary that would affect reproducibility.

You could use something similar to

filtered_env_vars = build_option('filter_env_vars') or []
if 'LD_LIBRARY_PATH' in filtered_env_vars and 'LIBRARY_PATH' not in filtered_env_vars:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, if this is not specific to using an alternate sysroot, then remove that part of the condition, there's no need to artificially restrict this to the EESSI context...

Copy link
Contributor Author

@trz42 trz42 Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right, it seems not specific to using an alternate sysroot.

Changed the conditional expression (see edc9bfe) and tested this. After the change, libnccl.so.2 is found.

Note, there is another issue building TensorFlow with EESSI, but I may know how to fix that. It likely requires changing the shebang so scripts use env from the compat layer. Changing the scripts will need some addition work to circumvent some sanity check run by Bazel (see https://stackoverflow.com/questions/47775668/bazel-how-to-skip-corrupt-installation-on-centos6).

The latter fix can be done in the easyconfig or in a hook.

@boegel boegel changed the title tweak libpaths by adding directory containing libnccl.so.2 tweak libpaths in TensorFlow easyblock by adding directory containing libnccl.so.2 Nov 6, 2024
@boegel boegel added the bug fix label Nov 6, 2024
@boegel boegel modified the milestones: 5.0, release after 4.9.4 Nov 6, 2024
Comment on lines 718 to 720
system_libs_info_as_list = list(self.system_libs_info)
system_libs_info_as_list[2].append(os.path.join(nccl_root, get_software_libdir('NCCL')))
self.system_libs_info = tuple(system_libs_info_as_list)
Copy link
Contributor

@Flamefire Flamefire Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That part looks very fishy and I had to read the full code to verify this is correct. We should make get_systems_libs and hence self.system_libs_info a named tuple instead to make it easier to understand.

However from a semantic POV this is the wrong place to add NCCL: "System libs" in the context of tensorflow are dependencies that can be vendored in a way TF understands. I.e. https://github.com/tensorflow/tensorflow/blob/master/third_party/systemlibs/syslibs_configure.bzl#L11

I'd rather put this into the build_step where action_env['LIBRARY_PATH'] is set. The easiest way would be to (conditionally) append to libpaths right after cpaths, libpaths = self.system_libs_info[1:]

This is also easier to understand due to the comment there: "Make TF find our modules. LD_LIBRARY_PATH gets automatically added by configure.py"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds right to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made an attempt in f697d97

Not tested yet. Will let you know if it works or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in f697d97 worked.

easybuild/easyblocks/t/tensorflow.py Outdated Show resolved Hide resolved
@boegel
Copy link
Member

boegel commented Nov 19, 2024

@boegelbot please test @ jsc-zen3
EB_BRANCH=5.0.x
EB_ARGS="TensorFlow-2.15.1-foss-2023a.eb --filter-env-vars=LD_LIBRARY_PATH --installpath /tmp/$USER"

@boegelbot
Copy link

@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ "5.0.x" != 'develop' ]]; then EB_BRANCH="5.0.x" ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/"5.0.x" source init_env_easybuild_develop.sh; fi; EB_PR=3497 EB_ARGS="TensorFlow-2.15.1-foss-2023a.eb --filter-env-vars=LD_LIBRARY_PATH --installpath /tmp/$USER" EB_REPO=easybuild-easyblocks EB_BRANCH="5.0.x" /opt/software/slurm/bin/sbatch --job-name test_PR_3497 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 5306

Test results coming soon (I hope)...

- notification for comment with ID 2485680132 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug fix EESSI Related to EESSI project
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants