
Problems encountered during building from scratch #10

913887524gsd opened this issue Nov 13, 2024 · 2 comments

@913887524gsd

Nice project!

This issue records the obstacles I ran into while building the project from scratch, along with the solutions. I hope the maintainers can update the build script accordingly to make the process smoother.

Environment

docker: 27.1.0
image: nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
setup command:

sudo docker run -dit --gpus all                                         \
            -v .:/root                                                  \
            --privileged --network=host --ipc=host                      \
            --name phos nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04

Build hangs waiting for user input

I used the command from the README to build:

./build.sh -3 -i

The build got stuck while installing software-properties-common: the tzdata configuration step prompts for time zone confirmation, but there is no way to provide input during the scripted build.

Solution: install software-properties-common manually beforehand, or set the TZ and DEBIAN_FRONTEND environment variables.
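
For reference, a minimal non-interactive setup (a sketch assuming a Debian/Ubuntu shell inside the container; the TZ value is just an example):

# Suppress the interactive tzdata prompt during apt installs
export DEBIAN_FRONTEND=noninteractive
export TZ=Etc/UTC
apt-get update && apt-get install -y software-properties-common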

Missing ~/.cargo/env

After completing the first stage of the installation, the script prompted me to source ~/.bashrc. However, after sourcing it, I found that ~/.cargo/env was missing.

Solution: install the Rust toolchain via rustup:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
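
After the installer finishes, ~/.cargo/env should exist and can be loaded into the current shell without restarting it:

# Load the rustup environment and verify the toolchain is on PATH
source ~/.cargo/env
cargo --version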

Missing header files

When building the Autogen and Remoting components, the process failed, and the log indicated that some header files were missing (see build_log/build_PhOS-Autogen.log and build_log/build_PhOS-Remoting.log for details):

../../pos/cuda_impl/utils/fatbin.h:26:10: fatal error: libelf.h: No such file or directory
   26 | #include <libelf.h>
      |          ^~~~~~~~~~
cpu-utils.c:9:10: fatal error: openssl/md5.h: No such file or directory
    9 | #include <openssl/md5.h>
      |          ^~~~~~~~~~~~~~~
cpu-client-driver.c:7:10: fatal error: vdpau/vdpau.h: No such file or directory
    7 | #include <vdpau/vdpau.h>
      |          ^~~~~~~~~~~~~~~

Solution: install the missing development packages:

apt-get install -y libelf-dev libgl1-mesa-dev libssl-dev libvdpau-dev
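
To double-check that the headers landed in the standard Ubuntu include locations for these packages:

# Each of these files is provided by one of the -dev packages above
ls /usr/include/libelf.h /usr/include/openssl/md5.h /usr/include/vdpau/vdpau.h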

Missing dynamic library

After completing the installation, I tried to launch the hijack library using LD_PRELOAD, but it failed due to a missing libtirpc.so.3. I could only find /usr/lib/x86_64-linux-gnu/libtirpc.so.

Solution: run ldconfig to regenerate the shared-library symlinks, including libtirpc.so.3.
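
For reference (the exact version suffix depends on the installed libtirpc package):

# Rebuild the linker cache and regenerate SONAME symlinks such as libtirpc.so.3
ldconfig
ls -l /usr/lib/x86_64-linux-gnu/libtirpc.so*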

Hijacking failed

I tested the hijack with a hello-world CUDA program, but no runtime APIs were hijacked. Running ldd on the binary showed no CUDA runtime library among its dynamic dependencies: nvcc statically links the CUDA runtime into the user binary by default.

Solution: add the --cudart=shared argument to nvcc to force dynamic linking of the CUDA runtime in the user program.
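
For example, with a stand-in hello.cu (the file name here is arbitrary):

# Link the shared CUDA runtime instead of the static default
nvcc --cudart=shared -o hello hello.cu
# libcudart should now show up as a dynamic dependency
ldd ./hello | grep cudart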

@wxdwfc

wxdwfc commented Nov 14, 2024

Thank you so much for your troubleshooting! We will check that and revise the doc accordingly :)

@913887524gsd (Author)

When running the llama-2 example code, I found that some arguments were incorrect. Here is a patch:

diff --git a/examples/llama2-13b-chat-hf/download.py b/examples/llama2-13b-chat-hf/download.py
index 65e6020..3c6072c 100644
--- a/examples/llama2-13b-chat-hf/download.py
+++ b/examples/llama2-13b-chat-hf/download.py
@@ -32,7 +32,7 @@ model.save_pretrained(model_path)

 # download tokenizer parameter
 if not os.path.exists(tokenizer_path):
-    os.makedirs(model_path)
+    os.makedirs(tokenizer_path)
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 tokenizer.save_pretrained(tokenizer_path)

diff --git a/examples/llama2-13b-chat-hf/inference.py b/examples/llama2-13b-chat-hf/inference.py
index 1fdac76..d7e7770 100755
--- a/examples/llama2-13b-chat-hf/inference.py
+++ b/examples/llama2-13b-chat-hf/inference.py
@@ -18,8 +18,8 @@ import transformers
 import time
 from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

-model = AutoModelForCausalLM.from_pretrained('/nvme/huggingface/hub/models--meta-llama--Llama-2-13b-chat-hf/snapshots/a2cb7a712bb6e5e736ca7f8cd98167f81a0b5bd8/').to('cuda:0')
-tokenizer = AutoTokenizer.from_pretrained('/nvme/huggingface/hub/models--meta-llama--Llama-2-13b-chat-hf/snapshots/a2cb7a712bb6e5e736ca7f8cd98167f81a0b5bd8/')
+model = AutoModelForCausalLM.from_pretrained('./model').to('cuda:0')
+tokenizer = AutoTokenizer.from_pretrained('./tokenizer')

 print(f"process id: {os.getpid()}")
