-
Notifications
You must be signed in to change notification settings - Fork 283
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tweak libpaths in TensorFlow easyblock by adding directory containing libnccl.so.2 #3497
base: develop
Are you sure you want to change the base?
Conversation
I think what you need to do here is add NCCL to
and this should trigger the same inclusion...but I'm not sure of the syntax. Maybe @Flamefire can make a suggestion |
That would be nice, but I'm unsure. All those packages are non-GPU (CUDA) packages. If I add NCCL with CUDA to the list, would that not imply that also the non-CUDA version of TensorFlow depends on this? I can give it a try. |
After commit 0f331a8 the |
Adding the line
to
libnccl.so.2 cannot be found).
The setting for
while running with the easyblock in this PR it is
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm very nervous about making changes to the TF easyblock as there is a lot going on there, an opinion from @Flamefire would be welcome
easybuild/easyblocks/t/tensorflow.py
Outdated
@@ -562,6 +562,15 @@ def configure_step(self): | |||
tensorrt_root = get_software_root('TensorRT') | |||
nccl_root = get_software_root('NCCL') | |||
|
|||
# add path to libnccl.so.2 directory provided by NCCL when both sysroot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's an if
block for NCCL further down, best to keep this stuff together there (and the libpath
part doesn't get used until a later step anyway)
easybuild/easyblocks/t/tensorflow.py
Outdated
system_libs_info_as_list = list(self.system_libs_info) | ||
new_libpaths = system_libs_info_as_list[2] | ||
new_libpaths.append(os.path.join(nccl_root, 'lib')) | ||
system_libs_info_as_list[2] = new_libpaths | ||
self.system_libs_info = tuple(system_libs_info_as_list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
system_libs_info_as_list = list(self.system_libs_info) | |
new_libpaths = system_libs_info_as_list[2] | |
new_libpaths.append(os.path.join(nccl_root, 'lib')) | |
system_libs_info_as_list[2] = new_libpaths | |
self.system_libs_info = tuple(system_libs_info_as_list) | |
system_libs_info_as_list = list(self.system_libs_info) | |
system_libs_info_as_list[2].append(get_software_libdir('NCCL')) | |
self.system_libs_info = tuple(system_libs_info_as_list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will test if that works incl moving the code to the if
block for nccl_root
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This almost worked. get_software_libdir('NCCL')
returns the directory name, not the full path. Slightly modified this, see f6a9afd The latter version worked.
easybuild/easyblocks/t/tensorflow.py
Outdated
|
||
# add path to libnccl.so.2 directory provided by NCCL when both sysroot | ||
# and RPATH are used (such as in EESSI) | ||
if build_option('sysroot') and self.toolchain.use_rpath: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is build_option('sysroot')
a necessary condition here? The core issue is use of rpath and lack of LD_LIBRARY_PATH
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s not a necessary condition. It rather limits when this code would be executed.
An alternative approach could be to use some environment variable which could contain paths to be added to LIBRARY_PATH
. In this easyblock we could check if it is set and append the paths to the third tuple element. In EESSI, we could then set this in a hook.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a general issue though, if you have that configuration (rpath and no LD_LIBRARY_PATH
) then you will run into this problem. The solution is specific to NCCL, and that is fine. I wouldn't introduce something arbitrary that would affect reproducibility.
You could use something similar to
easybuild-easyblocks/easybuild/easyblocks/p/python.py
Lines 199 to 200 in 2ab3cbc
filtered_env_vars = build_option('filter_env_vars') or [] | |
if 'LD_LIBRARY_PATH' in filtered_env_vars and 'LIBRARY_PATH' not in filtered_env_vars: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, if this is not specific to using an alternate sysroot, then remove that part of the condition, there's no need to artificially restrict this to the EESSI context...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right, it seems not specific to using an alternate sysroot.
Changed the conditional expression (see edc9bfe) and tested this. After the change, libnccl.so.2
is found.
Note, there is another issue building TensorFlow with EESSI, but I may know how to fix that. It likely requires changing the shebang so scripts use env
from the compat layer. Changing the scripts will need some addition work to circumvent some sanity check run by Bazel (see https://stackoverflow.com/questions/47775668/bazel-how-to-skip-corrupt-installation-on-centos6).
The latter fix can be done in the easyconfig or in a hook.
easybuild/easyblocks/t/tensorflow.py
Outdated
system_libs_info_as_list = list(self.system_libs_info) | ||
system_libs_info_as_list[2].append(os.path.join(nccl_root, get_software_libdir('NCCL'))) | ||
self.system_libs_info = tuple(system_libs_info_as_list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That part looks very fishy and I had to read the full code to verify this is correct. We should make get_systems_libs
and hence self.system_libs_info
a named tuple instead to make it easier to understand.
However from a semantic POV this is the wrong place to add NCCL: "System libs" in the context of tensorflow are dependencies that can be vendored in a way TF understands. I.e. https://github.com/tensorflow/tensorflow/blob/master/third_party/systemlibs/syslibs_configure.bzl#L11
I'd rather put this into the build_step
where action_env['LIBRARY_PATH']
is set. The easiest way would be to (conditionally) append to libpaths
right after cpaths, libpaths = self.system_libs_info[1:]
This is also easier to understand due to the comment there: "Make TF find our modules. LD_LIBRARY_PATH gets automatically added by configure.py"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds right to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made an attempt in f697d97
Not tested yet. Will let you know if it works or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes in f697d97 worked.
Co-authored-by: Alexander Grund <[email protected]>
@boegelbot please test @ jsc-zen3 |
@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de PR test command '
Test results coming soon (I hope)... - notification for comment with ID 2485680132 processed Message to humans: this is just bookkeeping information for me, |
When building
TensorFlow-2.15.1-foss-2023a-CUDA-12.1.1.eb
for EESSI we hit an error thatlibnccl.so.2
wasn't found in 533 test cases. In the backtrace we saw the following (# number of occurrences):The corresponding shared libraries for the imports are
_pywrap_cpu_feature_guard.so
,_pywrap_py_exception_registry.so
, and_errors_test_helper.so
.While
_pywrap_py_exception_registry.so
depends onlibnccl.so.2
directly, the other two depend on_pywrap_tensorflow_internal.so
which depends onlibnccl.so.2
directly. Regardless of these dependency graph differences, the underlying issue causing the import error seems to be the same.The problem seems to be caused by a combination of factors (below the list we illustrate the issue):
_pywrap_tensorflow_internal.so
or_pywrap_py_exception_registry.so
contains a relative path to a location that containslibnccl.so.2
(via something like$ORIGIN/../...
)_pywrap_tensorflow_internal.so
or_pywrap_py_exception_registry.so
is actually a symbolic link to another directory.$ORIGIN/../...
in the RPATH header) from that other directory, the shared librarylibnccl.so.2
is found.tensorflow.py
does not add the absolute directory that containslibnccl.so.2
and which is provided by EESSI (e.g.,/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software/NCCL/2.18.3-GCCcore-12.3.0-CUDA-12.1.1/lib
) tosystem_libs_info[2]
and thus this absolute location doesn't get added toLIBRARY_PATH
which is then used by the RPATH wrapper linker script. The absolute location to the NCCL installation is derived in thetensorflow.py
easyblock (see variablenccl_root
) and stored in someNCCL_INSTALL_PATH
variable in many/all command invocations. The directory containinglibnccl.so.2
could easily be obtained asos.path.join(nccl_root, 'lib')
.libnccl.so.2
and seems to copy the file to some "internal" location and then uses that location to add some relative paths to the RPATH section in libraries that depend on it. However, since the absolute location is fixed for EESSI (for specific architecture), this absolute location could be used.This PR does the latter, it adds the known absolute location to
libnccl.so.2
to the RPATH section of shared libraries (via the RPATH linker wrappers).Maybe there are other/better ways to do this - using the known absolute path to
libnccl.so.2
and make the RPATH linker wrapper aware of it.Illustrating the issue (via the import error for
_pywrap_py_exception_registry
):ldd
on the imported shared libraryldd
from the directory for the target of the symlink