Can't run Ollama in a Docker container with iGPU on Linux #12363

Open
user7z opened this issue Nov 7, 2024 · 8 comments
user7z commented Nov 7, 2024

Here are the container parameters:

export DOCKER_IMAGE=intelanalytics/ipex-llm-inference-cpp-xpu:latest
export CONTAINER_NAME=ipex-llm-inference-cpp-xpu-container
podman run -itd \
  --net=host \
  --device=/dev/dri \
  -v /home/user/.ollama:/root/.ollama \
  -e no_proxy=localhost,127.0.0.1 \
  --memory="32G" \
  --name=$CONTAINER_NAME \
  -e DEVICE=iGPU \
  --shm-size="16g" \
  $DOCKER_IMAGE
cd scripts
bash start-ollama.sh

source ipex-llm-init --gpu --device $DEVICE
found oneapi in /opt/intel/oneapi/setvars.sh

:: initializing oneAPI environment ...
bash: BASH_VERSION = 5.1.16(1)-release
args: Using "$@" for setvars.sh arguments: --force
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: vtune -- latest
:: oneAPI environment initialized ::

/usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/usr/local/lib/python3.11/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
root@lp:/llm/scripts# bash start-ollama.sh
root@lp:/llm/scripts# 2024/11/08 00:35:57 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-11-08T00:35:57.378+08:00 level=INFO source=images.go:753 msg="total blobs: 6"
time=2024-11-08T00:35:57.378+08:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
time=2024-11-08T00:35:57.379+08:00 level=INFO source=routes.go:1172 msg="Listening on 127.0.0.1:11434 (version 0.3.6-ipexllm-20241106)"
time=2024-11-08T00:35:57.380+08:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama272927415/runners
time=2024-11-08T00:35:57.504+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx2 cpu cpu_avx]"
time=2024-11-08T00:36:09.351+08:00 level=INFO source=gpu.go:168 msg="looking for compatible GPUs"
time=2024-11-08T00:36:09.351+08:00 level=WARN source=gpu.go:560 msg="unable to locate gpu dependency libraries"
time=2024-11-08T00:36:09.351+08:00 level=WARN source=gpu.go:560 msg="unable to locate gpu dependency libraries"
time=2024-11-08T00:36:09.357+08:00 level=WARN source=gpu.go:560 msg="unable to locate gpu dependency libraries"
time=2024-11-08T00:36:09.360+08:00 level=INFO source=gpu.go:280 msg="no compatible GPUs were discovered"
time=2024-11-08T00:36:09.378+08:00 level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=999 layers.model=31 layers.offload=0 layers.split="" memory.available="[26.2 GiB]" memory.required.full="434.7 MiB" memory.required.partial="0 B" memory.required.kv="180.0 MiB" memory.required.allocations="[434.7 MiB]" memory.weights.total="233.7 MiB" memory.weights.repeating="205.0 MiB" memory.weights.nonrepeating="28.7 MiB" memory.graph.full="164.5 MiB" memory.graph.partial="168.4 MiB"
time=2024-11-08T00:36:09.379+08:00 level=INFO source=server.go:395 msg="starting llama server" cmd="/tmp/ollama272927415/runners/cpu_avx2/ollama_llama_server --model /root/.ollama/models/blobs/sha256-55aa88ddac43adce6af0e9be8d6cdff2337a3835cd9b50bbcd7a894eb66dfc75 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --no-mmap --parallel 4 --port 36063"
time=2024-11-08T00:36:09.380+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2024-11-08T00:36:09.380+08:00 level=INFO source=server.go:595 msg="waiting for llama runner to start responding"
time=2024-11-08T00:36:09.380+08:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server error"
llama_model_loader: loaded meta data with 33 key-value pairs and 272 tensors from /root/.ollama/models/blobs/sha256-55aa88ddac43adce6af0e9be8d6cdff2337a3835cd9b50bbcd7a894eb66dfc75 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Smollm2 135M 8k Lc100K Mix1 Ep2
llama_model_loader: - kv 3: general.organization str = HuggingFaceTB
llama_model_loader: - kv 4: general.finetune str = 8k-lc100k-mix1-ep2
llama_model_loader: - kv 5: general.basename str = smollm2
llama_model_loader: - kv 6: general.size_label str = 135M
llama_model_loader: - kv 7: general.license str = apache-2.0
llama_model_loader: - kv 8: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 9: llama.block_count u32 = 30
llama_model_loader: - kv 10: llama.context_length u32 = 8192
llama_model_loader: - kv 11: llama.embedding_length u32 = 576
llama_model_loader: - kv 12: llama.feed_forward_length u32 = 1536
llama_model_loader: - kv 13: llama.attention.head_count u32 = 9
llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 3
llama_model_loader: - kv 15: llama.rope.freq_base f32 = 100000.000000
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: general.file_type u32 = 10
llama_model_loader: - kv 18: llama.vocab_size u32 = 49152
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 20: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 21: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = smollm
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,49152] = ["<|endoftext|>", "<|im_start|>", "<|...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,49152] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,48900] = ["Ġ t", "Ġ a", "i n", "h e", "Ġ Ġ...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 31: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - kv 32: general.quantization_version u32 = 2
llama_model_loader: - type f32: 61 tensors
llama_model_loader: - type q8_0: 1 tensors
llama_model_loader: - type q3_K: 30 tensors
llama_model_loader: - type iq4_nl: 180 tensors
llm_load_vocab: special tokens cache size = 17
llm_load_vocab: token to piece cache size = 0.3170 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 49152
llm_load_print_meta: n_merges = 48900
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 576
llm_load_print_meta: n_layer = 30
llm_load_print_meta: n_head = 9
llm_load_print_meta: n_head_kv = 3
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 3
llm_load_print_meta: n_embd_k_gqa = 192
llm_load_print_meta: n_embd_v_gqa = 192
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 1536
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 100000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q2_K - Medium
llm_load_print_meta: model params = 134.52 M
llm_load_print_meta: model size = 82.41 MiB (5.14 BPW)
llm_load_print_meta: general.name = Smollm2 135M 8k Lc100K Mix1 Ep2
llm_load_print_meta: BOS token = 1 '<|im_start|>'
llm_load_print_meta: EOS token = 2 '<|im_end|>'
llm_load_print_meta: UNK token = 0 '<|endoftext|>'
llm_load_print_meta: PAD token = 2 '<|im_end|>'
llm_load_print_meta: LF token = 143 'Ä'
llm_load_print_meta: EOT token = 0 '<|endoftext|>'
llm_load_print_meta: EOG token = 0 '<|endoftext|>'
llm_load_print_meta: EOG token = 2 '<|im_end|>'
llm_load_print_meta: max token length = 162
time=2024-11-08T00:36:09.632+08:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server loading model"
ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
llm_load_tensors: ggml ctx size = 0.25 MiB
llm_load_tensors: offloading 30 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 31/31 layers to GPU
llm_load_tensors: SYCL0 buffer size = 82.46 MiB
llm_load_tensors: SYCL_Host buffer size = 28.69 MiB
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 100000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
| ID | Device Type        | Name                    | Version | Max compute units | Max work group | Max sub group | Global mem size | Driver version |
|----|--------------------|-------------------------|---------|-------------------|----------------|---------------|-----------------|----------------|
|  0 | [level_zero:gpu:0] | Intel Graphics [0x46a8] | 1.3     | 80                | 512            | 32            | 26651M          | 1.3.26241      |
llama_kv_cache_init: SYCL0 KV buffer size = 180.00 MiB
llama_new_context_with_model: KV self size = 180.00 MiB, K (f16): 90.00 MiB, V (f16): 90.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0.76 MiB
llama_new_context_with_model: SYCL0 compute buffer size = 97.12 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 17.13 MiB
llama_new_context_with_model: graph nodes = 846
llama_new_context_with_model: graph splits = 2
time=2024-11-08T00:36:15.414+08:00 level=INFO source=server.go:634 msg="llama runner started in 6.03 seconds"
ollama_llama_server: /home/runner/_work/llm.cpp/llm.cpp/llm.cpp/bigdl-core-xe/llama_backend/sdp_xmx_kernel.cpp:439: auto ggml_sycl_op_sdp_xmx_casual(fp16 *, fp16 *, fp16 *, fp16 *, fp16 *, float *, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, float *, float, sycl::queue &)::(anonymous class)::operator()() const: Assertion `false' failed.
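Note that the log above shows upstream Ollama's GPU probe reporting "no compatible GPUs were discovered" while the SYCL backend still finds one level_zero device. A minimal sketch for checking whether the iGPU and its driver are actually visible inside the container (sycl-ls and the /dev/dri nodes are standard oneAPI/Linux tooling, not specific to this image; the exact render-node name is an assumption):

# inside the container
ls -l /dev/dri                        # the render node (e.g. renderD128) must be passed through
source /opt/intel/oneapi/setvars.sh   # set up the oneAPI environment, as in the log above
sycl-ls                               # should list a [level_zero:gpu] entry for the Iris Xe iGPU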
sgwhat (Contributor) commented Nov 8, 2024

Hi @user7z, could you provide your device configuration information?

user7z (Author) commented Nov 8, 2024

@sgwhat It's an i5-1235U (Alder Lake) with an Iris Xe graphics card. I got it to work for llama3.2, but it didn't work with, for example, smollm2. For llama there is a bad accuracy regression; try chatting with it, or just say hello, and you'll see. And when it is used within Open WebUI, it fails immediately.
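A minimal sketch of commands that could capture the requested device details from inside the container (generic Linux/oneAPI tools; whether lspci and clinfo are pre-installed in this image is an assumption):

lscpu | grep "Model name"            # CPU model, e.g. 12th Gen Intel(R) Core(TM) i5-1235U
lspci | grep -iE "vga|display"       # confirms the Iris Xe iGPU is visible on the PCI bus
clinfo | grep "Device Name"          # OpenCL view of the GPU and its driver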

user7z (Author) commented Nov 10, 2024

@sgwhat gemma2 is the only one that works, and it does poorly. phi3.5 at least launches; qwen2.5, the mistral models, and llama3.2 do not work. One of the mistral models responds to my first message, but after that I get the Assertion 'false' failed error. I only experience this with this Docker image. Also, the official Open WebUI container works great, so I see no need to bloat the already gigantic Docker image with it. It would be great if you provided an image that just has a working Ollama, without all the bloat; the bloat might also be causing the poor performance with gemma2.

sgwhat (Contributor) commented Nov 11, 2024

Which oneAPI version have you installed in your container?

user7z (Author) commented Nov 11, 2024

@sgwhat It's a container; it comes with oneAPI. The version is the one you support under Linux.
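Since the toolkit ships inside the image, a minimal sketch for reporting the exact oneAPI version from within the container (assuming the standard /opt/intel/oneapi layout shown in the log; icpx is the oneAPI DPC++/C++ compiler driver):

ls /opt/intel/oneapi/compiler/        # version-numbered directories, e.g. 2024.x
source /opt/intel/oneapi/setvars.sh
icpx --version                        # prints the installed DPC++/C++ compiler (oneAPI) version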

hzjane (Contributor) commented Nov 11, 2024

I can't reproduce the Assertion 'false' failed error; maybe you could provide more information about how to reproduce it. And I do see the incorrect output issue even outside the Docker image; we will fix it later.

user7z (Author) commented Nov 11, 2024

@hzjane To reproduce:
Image: docker.io/intelanalytics/ipex-llm-inference-cpp-xpu:latest
1. Run the container.
2. Go inside it and run:
cd scripts
bash start-ollama.sh
3. Open another terminal and do the same, but instead of running ollama, run:
bash start-openwebui.sh
4. Go to Open WebUI in your browser and try these models (see the sketch after this list):
- smollm2: didn't work at all
- Llama 3.2: works for a few chats (one or two)
- Mistral: same thing
- Qwen2.5
Those are the models I tested; I also tested Gemma2, and it did work.
You will notice an accuracy regression and a performance hit compared to the local setup. This was tested on an up-to-date Linux system with the Iris Xe integrated GPU found in Intel CPUs; mine is an i5-1235U.
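A minimal shell sketch of these steps, reusing the container name and the /llm/scripts directory from the first comment (the model names at the end are taken from the list above; the exact Ollama tags are assumptions):

# terminal 1: enter the running container and start Ollama
podman exec -it ipex-llm-inference-cpp-xpu-container bash
cd /llm/scripts
bash start-ollama.sh

# terminal 2: enter the same container and start Open WebUI
podman exec -it ipex-llm-inference-cpp-xpu-container bash
cd /llm/scripts
bash start-openwebui.sh

# then open Open WebUI in the browser and try, for example:
#   smollm2, llama3.2, mistral, qwen2.5, gemma2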

hzjane (Contributor) commented Nov 13, 2024

The issue where Smollm2 or Gemma2 didn't work is fixed by pr-12386. The output accuracy issue is still being fixed by @sgwhat.
