[Issue]: Could not load /opt/rocm-6.1.3/lib/rocblas/library/TensileLibrary.dat #856

Open
unclemusclez opened this issue Jun 21, 2024 · 11 comments
Labels
bug Something isn't working Under Investigation

Comments

@unclemusclez

unclemusclez commented Jun 21, 2024

Problem Description

rocblaslt error: Could not load /opt/rocm-6.1.3/lib/hipblaslt/library/TensileLibrary.dat
Segmentation fault

As a workaround I ran cp /opt/rocm-6.1.3/lib/hipblaslt/library/TensileLibrary_gfx1100.dat /opt/rocm-6.1.3/lib/hipblaslt/library/TensileLibrary.dat; otherwise I don't know how to get this file.

I built hipBLASLt by running ./install.sh -idc --architecture 'gfx1100' --merge-files --static from the hipBLASLt repository.
The driver was installed via amdgpu-install -y --usecase=wsl,rocm --no-dkms
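
For anyone triaging a similar report, it can help to first see which Tensile library files are actually present. The following is a minimal sketch, not part of the original report: the function name `describe_tensile_dir` is hypothetical, and the `TensileLibrary_<arch>.dat` naming pattern is assumed from the single file name quoted above.

```python
from pathlib import Path


def describe_tensile_dir(lib_dir, arch):
    """Report which TensileLibrary*.dat files exist in lib_dir.

    Assumption: per-architecture files are named TensileLibrary_<arch>.dat,
    as in the file name quoted in the report.
    """
    lib_dir = Path(lib_dir)
    generic = lib_dir / "TensileLibrary.dat"
    per_arch = lib_dir / f"TensileLibrary_{arch}.dat"
    # Guard against a missing directory so the helper never raises.
    found = sorted(p.name for p in lib_dir.glob("TensileLibrary*.dat")) if lib_dir.is_dir() else []
    return {
        "generic_exists": generic.is_file(),
        "per_arch_exists": per_arch.is_file(),
        "all_dat_files": found,
    }


if __name__ == "__main__":
    # Path and architecture are taken from the report; adjust for your install.
    print(describe_tensile_dir("/opt/rocm-6.1.3/lib/hipblaslt/library", "gfx1100"))
```

If `generic_exists` is false while `per_arch_exists` is true, the install matches the situation described in this report.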

Operating System

WSL2 Ubuntu 22.04 Windows 11

CPU

7800x3d

GPU

AMD Radeon RX 7900 XT

Other

No response

ROCm Version

ROCm 6.1.3

ROCm Component

hipBLASLt

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@KKyang added the bug label Jun 24, 2024
@minzhezhou

minzhezhou commented Jul 7, 2024

Did you load it from PyTorch? You may also need to replace the libhipblaslt.so that ships with torch.

There was a similar issue raised by @lhl: #831

@unclemusclez
Author

Re #831: it seems it is not installed with the 6.1.3 software; there was no official release in the repository.

@sleppyrobot

I get this error on Linux as well:
Ubuntu 24.04
7900 XTX
ROCm 6.2

Pointing HIPBLASLT_TENSILE_LIBPATH to hipBLASLt/build/release/Tensile/library
causes the error below.

rocblaslt error: Cannot read /home/adminl/hipBLASLt/build/release/Tensile/library/TensileLibrary.dat: No such file or directory

rocblaslt error: Could not load /home/adminl/hipBLASLt/build/release/Tensile/library/TensileLibrary.dat
Segmentation fault (core dumped)
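
The "No such file or directory" message suggests the directory did not contain the file the loader was looking for. A small pre-flight check along these lines can confirm that before launching the application. This is only a sketch: `check_tensile_libpath` is a hypothetical helper, the variable name HIPBLASLT_TENSILE_LIBPATH comes from the comment above, and the `TensileLibrary*.dat` glob is an assumption about which files matter.

```python
import os
from pathlib import Path


def check_tensile_libpath(environ=os.environ):
    """Return (ok, message) describing the directory named by HIPBLASLT_TENSILE_LIBPATH."""
    raw = environ.get("HIPBLASLT_TENSILE_LIBPATH")
    if not raw:
        return False, "HIPBLASLT_TENSILE_LIBPATH is not set"
    lib_dir = Path(raw)
    if not lib_dir.is_dir():
        return False, f"{lib_dir} is not a directory"
    # Assumption: the loader looks for files matching TensileLibrary*.dat.
    dat_files = sorted(p.name for p in lib_dir.glob("TensileLibrary*.dat"))
    if not dat_files:
        return False, f"no TensileLibrary*.dat files in {lib_dir}"
    return True, "found: " + ", ".join(dat_files)


if __name__ == "__main__":
    ok, message = check_tensile_libpath()
    print("OK" if ok else "PROBLEM", "-", message)
```

The check only inspects the filesystem; it does not load hipBLASLt, so it cannot confirm that the files are valid for the installed GPU.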

@ppanchad-amd

Hi @unclemusclez. Internal ticket has been created to investigate your issue. Thanks!

@unclemusclez
Author

> Hi @unclemusclez. Internal ticket has been created to investigate your issue. Thanks!

@ppanchad-amd this may be fixable by symlinking (ln -s) the corresponding lazy-load *.dat file to .../TensileLibrary.dat

Thank you for looking into this.
I am curious about any progress. At the moment I am exploring multiple GFX platforms, including gfx906 and gfx1100.
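
The symlink idea mentioned in this comment could be scripted roughly as follows. This is a sketch of an unconfirmed workaround, not a fix endorsed by the maintainers: `link_generic_tensile_library` is a hypothetical helper, and the caller has to supply the name of the per-architecture file, since the exact lazy-load file name is not given in the thread.

```python
from pathlib import Path


def link_generic_tensile_library(lib_dir, source_name):
    """Create lib_dir/TensileLibrary.dat as a symlink to lib_dir/source_name.

    Returns True if a link was created, False if TensileLibrary.dat already
    exists (it is left untouched). Raises FileNotFoundError if the source
    file is missing.
    """
    lib_dir = Path(lib_dir)
    source = lib_dir / source_name
    target = lib_dir / "TensileLibrary.dat"
    if not source.is_file():
        raise FileNotFoundError(source)
    if target.exists() or target.is_symlink():
        return False
    # Relative link, so it stays valid if the whole directory is moved.
    target.symlink_to(source.name)
    return True


if __name__ == "__main__":
    # Directory and file name are taken from the original report; writing
    # under /opt usually requires elevated permissions.
    created = link_generic_tensile_library(
        "/opt/rocm-6.1.3/lib/hipblaslt/library", "TensileLibrary_gfx1100.dat"
    )
    print("link created" if created else "TensileLibrary.dat already present")
```

Compared with the cp approach from the original report, a symlink avoids a stale copy if the per-architecture file is later rebuilt.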

@schung-amd

Hi @unclemusclez, when is this issue occurring? I can see the place in the source code where this error is emitted, and it looks like it should be picking up TensileLibrary_gfx1100.dat; I'm not sure yet why it isn't, so I'll try to reproduce this.

@unclemusclez
Author

@schung-amd it's been some time since I tried to compile ROCm for my Windows WSL machine. I think this might be an issue with bitsandbytes, but I don't remember at this point; this issue is from 4 months ago. Perhaps you cannot replicate this because it relates to the kernel, which does not exist on WSL Linux.

From my experience, if it's not working, I just don't worry about it until there is a new WSL-Windows driver update for ROCm.

ROCm 6.1.2 is nice, but really we need 6.2 on Windows. That will bring everything up to date with the modern capabilities of PyTorch, and CUDA Cooperative Groups are supported. The current Windows drivers for GPU are not even working correctly. We have to downgrade or shared memory is used by default. It's very difficult to troubleshoot the versions/source/environment of things when I'm actively trying to do work.

I'll follow up with this at some point when i come across it again.

@schung-amd

This should be addressed in ROCm 6.2 with lazy loading (28eb825), so hopefully once WSL for 6.2 is released this is fixed.

> ROCm 6.1.2 is nice, but really we need 6.2 on Windows. That will bring everything up to date with the modern capabilities of PyTorch, and CUDA Cooperative Groups are supported.

Unfortunately we have no plans at this time to add cooperative groups support on Windows.

@sleppyrobot Are you still encountering this error on Ubuntu? If so, can you provide some steps to reproduce it?

@sleppyrobot

Hey, no, the issue went away when I changed the PyTorch version.

As for the steps to reproduce: I was using ComfyUI and an SDXL model
with ROCm 6.2 and PyTorch 2.5; any build from August to early September would trigger the error.
You also need to link or point to the hipBLASLt library.
@schung-amd
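
Since the error here depended on the PyTorch build in use, recording the exact versions alongside a report makes it easier to reproduce. A minimal sketch, assuming only that PyTorch may or may not be installed; `collect_repro_info` is a hypothetical helper name.

```python
import os
import platform


def collect_repro_info():
    """Gather version details worth attaching to a bug report."""
    info = {
        "platform": platform.platform(),
        "HIPBLASLT_TENSILE_LIBPATH": os.environ.get("HIPBLASLT_TENSILE_LIBPATH"),
        "torch_version": None,
        "torch_hip_version": None,
    }
    try:
        import torch
    except (ImportError, OSError):
        # PyTorch is missing or its shared libraries failed to load.
        return info
    info["torch_version"] = torch.__version__
    # torch.version.hip is None on builds that were not compiled for ROCm.
    info["torch_hip_version"] = getattr(torch.version, "hip", None)
    return info


if __name__ == "__main__":
    for key, value in collect_repro_info().items():
        print(f"{key}: {value}")
```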

@unclemusclez
Author

> Unfortunately we have no plans at this time to add cooperative groups support on Windows.

@schung-amd this is a necessity for a lot of video and 3d AI python applications due to their dependency on https://github.com/graphdeco-inria/diff-gaussian-rasterization

Is there any way to have this reprioritized or looked at? For Unreal/Blender pipelines this would be incredible. It is a major reason why I am considering switching to an entirely Linux platform at the moment. I just don't have the resources or time to switch everything.

Of course, there was ZLUDA.

@schung-amd

I've seen other requests for cooperative groups support on Windows and am reaching out internally to push for support if feasible. That being said, I am unaware of the reason we are not supporting it at this time (i.e. there may be technical barriers) and wouldn't expect support in the near future.
