
Problems encountered during building from scratch #10

913887524gsd opened this issue Nov 13, 2024 · 2 comments

@913887524gsd

Nice project!

This issue records the obstacles I ran into while building the project from scratch, along with the solutions. I hope the maintainers can update the build script accordingly to make the process smoother.

Environment

docker: 27.1.0
image: nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
setup command:

sudo docker run -dit --gpus all                                         \
            -v .:/root                                                  \
            --privileged --network=host --ipc=host                      \
            --name phos nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04

Build hangs waiting for user input

I used the command from the README to build:

./build.sh -3 -i

The build got stuck while installing software-properties-common: the tzdata configuration step prompts for time zone confirmation, but there is no way to provide input during the scripted build.

Solution: install software-properties-common manually beforehand, or set the TZ and DEBIAN_FRONTEND environment variables.
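
For reference, a minimal non-interactive setup (a sketch assuming a Debian/Ubuntu shell inside the container; the TZ value is just an example):

# Suppress the interactive tzdata prompt during apt installs
export DEBIAN_FRONTEND=noninteractive
export TZ=Etc/UTC
apt-get update && apt-get install -y software-properties-common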

Missing ~/.cargo/env

After completing the first stage of the installation, the script prompted me to source ~/.bashrc. However, after sourcing it, I found that ~/.cargo/env was missing.

Solution: install the Rust toolchain via rustup:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
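
After the installer finishes, ~/.cargo/env should exist and can be loaded into the current shell without restarting it:

# Load the rustup environment and verify the toolchain is on PATH
source ~/.cargo/env
cargo --version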

Missing header files

When building the Autogen and Remoting components, the process failed, and the log indicated that some header files were missing (see build_log/build_PhOS-Autogen.log and build_log/build_PhOS-Remoting.log for details):

../../pos/cuda_impl/utils/fatbin.h:26:10: fatal error: libelf.h: No such file or directory
   26 | #include <libelf.h>
      |          ^~~~~~~~~~
cpu-utils.c:9:10: fatal error: openssl/md5.h: No such file or directory
    9 | #include <openssl/md5.h>
      |          ^~~~~~~~~~~~~~~
cpu-client-driver.c:7:10: fatal error: vdpau/vdpau.h: No such file or directory
    7 | #include <vdpau/vdpau.h>
      |          ^~~~~~~~~~~~~~~

Solution: install the missing development packages:

apt-get install -y libelf-dev libgl1-mesa-dev libssl-dev libvdpau-dev
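
To double-check that the headers landed in the standard Ubuntu include locations for these packages:

# Each of these files is provided by one of the -dev packages above
ls /usr/include/libelf.h /usr/include/openssl/md5.h /usr/include/vdpau/vdpau.h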

Missing dynamic library

After completing the installation, I tried to launch the hijack library using LD_PRELOAD, but it failed due to a missing libtirpc.so.3. I could only find /usr/lib/x86_64-linux-gnu/libtirpc.so.

Solution: run ldconfig to regenerate the shared-library symlinks, including libtirpc.so.3.
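
For reference (the exact version suffix depends on the installed libtirpc package):

# Rebuild the linker cache and regenerate SONAME symlinks such as libtirpc.so.3
ldconfig
ls -l /usr/lib/x86_64-linux-gnu/libtirpc.so*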

Hijacking failed

I tested the hijack with a hello-world CUDA program, but no runtime APIs were hijacked. Running ldd on the binary showed no CUDA runtime library among its dynamic dependencies: nvcc statically links the CUDA runtime into the user binary by default.

Solution: add the --cudart=shared argument to nvcc to force dynamic linking of the CUDA runtime in the user program.
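
For example, with a stand-in hello.cu (the file name here is arbitrary):

# Link the shared CUDA runtime instead of the static default
nvcc --cudart=shared -o hello hello.cu
# libcudart should now show up as a dynamic dependency
ldd ./hello | grep cudart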

@wxdwfc

wxdwfc commented Nov 14, 2024

Thank you so much for your troubleshooting! We will check that and revise the doc accordingly :)

@913887524gsd (Author)

When running the llama-2 example code, I found that some arguments were incorrect. Here is a patch:

diff --git a/examples/llama2-13b-chat-hf/download.py b/examples/llama2-13b-chat-hf/download.py
index 65e6020..3c6072c 100644
--- a/examples/llama2-13b-chat-hf/download.py
+++ b/examples/llama2-13b-chat-hf/download.py
@@ -32,7 +32,7 @@ model.save_pretrained(model_path)

 # download tokenizer parameter
 if not os.path.exists(tokenizer_path):
-    os.makedirs(model_path)
+    os.makedirs(tokenizer_path)
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 tokenizer.save_pretrained(tokenizer_path)

diff --git a/examples/llama2-13b-chat-hf/inference.py b/examples/llama2-13b-chat-hf/inference.py
index 1fdac76..d7e7770 100755
--- a/examples/llama2-13b-chat-hf/inference.py
+++ b/examples/llama2-13b-chat-hf/inference.py
@@ -18,8 +18,8 @@ import transformers
 import time
 from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

-model = AutoModelForCausalLM.from_pretrained('/nvme/huggingface/hub/models--meta-llama--Llama-2-13b-chat-hf/snapshots/a2cb7a712bb6e5e736ca7f8cd98167f81a0b5bd8/').to('cuda:0')
-tokenizer = AutoTokenizer.from_pretrained('/nvme/huggingface/hub/models--meta-llama--Llama-2-13b-chat-hf/snapshots/a2cb7a712bb6e5e736ca7f8cd98167f81a0b5bd8/')
+model = AutoModelForCausalLM.from_pretrained('./model').to('cuda:0')
+tokenizer = AutoTokenizer.from_pretrained('./tokenizer')

 print(f"process id: {os.getpid()}")
