Can't run ollama in docker container with iGPU on Linux #12363
Comments
Hi @user7z, could you provide your device configuration information?
@sgwhat It's an i5-1235U (Alder Lake) with an Iris Xe graphics card. I got it working with llama3.2, but it didn't work with, for example, smollm2. For llama there is a bad accuracy regression; try chatting with it, or just say hello/hi, and you'll see. And when it's used with the old open-webui, it fails directly.
@sgwhat gemma2 is the only one that works, and it does poorly. phi3.5 at least launches; qwen2.5, the mistral models, and llama3.2 do not work. One of the mistral models responded to my first message, but after that I get an "assertion 'false' failed" error. I only experience this with this docker image; the official open-webui container works great, so I think there is no need to bloat the gigantic docker image with it. It would be great if you provided one that just has a working ollama, without all the bloat; that bloat might be causing the poor performance with gemma2.
Which oneAPI version have you installed in your container?
@sgwhat It's your container and it comes with oneAPI; the version is the one you support under Linux.
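For reference, one way to check the oneAPI version actually shipped in the container; this is only a sketch, and apart from the /opt/intel/oneapi/setvars.sh path (which appears in the log below) the exact commands are assumptions about what the image contains:

# run inside the container
source /opt/intel/oneapi/setvars.sh
# the DPC++/C++ compiler reports the oneAPI release it belongs to
icpx --version
# the versioned directories under the install root also show what is installed
ls /opt/intel/oneapi/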
I can't reproduce the issue.
@hzjane To reproduce:
Here are the container parameters (a device-visibility check is sketched after the command):
export DOCKER_IMAGE=intelanalytics/ipex-llm-inference-cpp-xpu:latest
export CONTAINER_NAME=ipex-llm-inference-cpp-xpu-container
podman run -itd \
        --net=host \
        --device=/dev/dri \
        -v /home/user/.ollama:/root/.ollama \
        -e no_proxy=localhost,127.0.0.1 \
        --memory="32G" \
        --name=$CONTAINER_NAME \
        -e DEVICE=iGPU \
        --shm-size="16g" \
        $DOCKER_IMAGE
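Before starting ollama it may help to confirm the iGPU is actually visible inside the container; a minimal sketch (sycl-ls ships with oneAPI, and the exact render node names can differ between systems):

# the DRM nodes passed through via --device=/dev/dri should be listed here
ls -l /dev/dri
# list the devices the SYCL runtime can see; the Iris Xe iGPU should show up
# as a Level Zero and/or OpenCL GPU device
source /opt/intel/oneapi/setvars.sh
sycl-ls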
cd scripts
bash start-ollama.sh
source ipex-llm-init --gpu --device $DEVICE
found oneapi in /opt/intel/oneapi/setvars.sh
:: initializing oneAPI environment ...
bash: BASH_VERSION = 5.1.16(1)-release
args: Using "$@" for setvars.sh arguments: --force
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: vtune -- latest
:: oneAPI environment initialized ::
/usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/usr/local/lib/python3.11/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
root@lp:/llm/scripts# bash start-ollama.sh
root@lp:/llm/scripts# 2024/11/08 00:35:57 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-11-08T00:35:57.378+08:00 level=INFO source=images.go:753 msg="total blobs: 6"
time=2024-11-08T00:35:57.378+08:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
time=2024-11-08T00:35:57.379+08:00 level=INFO source=routes.go:1172 msg="Listening on 127.0.0.1:11434 (version 0.3.6-ipexllm-20241106)"
time=2024-11-08T00:35:57.380+08:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama272927415/runners
time=2024-11-08T00:35:57.504+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx2 cpu cpu_avx]"
time=2024-11-08T00:36:09.351+08:00 level=INFO source=gpu.go:168 msg="looking for compatible GPUs"
time=2024-11-08T00:36:09.351+08:00 level=WARN source=gpu.go:560 msg="unable to locate gpu dependency libraries"
time=2024-11-08T00:36:09.351+08:00 level=WARN source=gpu.go:560 msg="unable to locate gpu dependency libraries"
time=2024-11-08T00:36:09.357+08:00 level=WARN source=gpu.go:560 msg="unable to locate gpu dependency libraries"
time=2024-11-08T00:36:09.360+08:00 level=INFO source=gpu.go:280 msg="no compatible GPUs were discovered"
time=2024-11-08T00:36:09.378+08:00 level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=999 layers.model=31 layers.offload=0 layers.split="" memory.available="[26.2 GiB]" memory.required.full="434.7 MiB" memory.required.partial="0 B" memory.required.kv="180.0 MiB" memory.required.allocations="[434.7 MiB]" memory.weights.total="233.7 MiB" memory.weights.repeating="205.0 MiB" memory.weights.nonrepeating="28.7 MiB" memory.graph.full="164.5 MiB" memory.graph.partial="168.4 MiB"
time=2024-11-08T00:36:09.379+08:00 level=INFO source=server.go:395 msg="starting llama server" cmd="/tmp/ollama272927415/runners/cpu_avx2/ollama_llama_server --model /root/.ollama/models/blobs/sha256-55aa88ddac43adce6af0e9be8d6cdff2337a3835cd9b50bbcd7a894eb66dfc75 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --no-mmap --parallel 4 --port 36063"
time=2024-11-08T00:36:09.380+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2024-11-08T00:36:09.380+08:00 level=INFO source=server.go:595 msg="waiting for llama runner to start responding"
time=2024-11-08T00:36:09.380+08:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server error"
llama_model_loader: loaded meta data with 33 key-value pairs and 272 tensors from /root/.ollama/models/blobs/sha256-55aa88ddac43adce6af0e9be8d6cdff2337a3835cd9b50bbcd7a894eb66dfc75 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Smollm2 135M 8k Lc100K Mix1 Ep2
llama_model_loader: - kv 3: general.organization str = HuggingFaceTB
llama_model_loader: - kv 4: general.finetune str = 8k-lc100k-mix1-ep2
llama_model_loader: - kv 5: general.basename str = smollm2
llama_model_loader: - kv 6: general.size_label str = 135M
llama_model_loader: - kv 7: general.license str = apache-2.0
llama_model_loader: - kv 8: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 9: llama.block_count u32 = 30
llama_model_loader: - kv 10: llama.context_length u32 = 8192
llama_model_loader: - kv 11: llama.embedding_length u32 = 576
llama_model_loader: - kv 12: llama.feed_forward_length u32 = 1536
llama_model_loader: - kv 13: llama.attention.head_count u32 = 9
llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 3
llama_model_loader: - kv 15: llama.rope.freq_base f32 = 100000.000000
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: general.file_type u32 = 10
llama_model_loader: - kv 18: llama.vocab_size u32 = 49152
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 20: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 21: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = smollm
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,49152] = ["<|endoftext|>", "<|im_start|>", "<|...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,49152] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,48900] = ["Ġ t", "Ġ a", "i n", "h e", "Ġ Ġ...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 31: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - kv 32: general.quantization_version u32 = 2
llama_model_loader: - type f32: 61 tensors
llama_model_loader: - type q8_0: 1 tensors
llama_model_loader: - type q3_K: 30 tensors
llama_model_loader: - type iq4_nl: 180 tensors
llm_load_vocab: special tokens cache size = 17
llm_load_vocab: token to piece cache size = 0.3170 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 49152
llm_load_print_meta: n_merges = 48900
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 576
llm_load_print_meta: n_layer = 30
llm_load_print_meta: n_head = 9
llm_load_print_meta: n_head_kv = 3
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 3
llm_load_print_meta: n_embd_k_gqa = 192
llm_load_print_meta: n_embd_v_gqa = 192
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 1536
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 100000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q2_K - Medium
llm_load_print_meta: model params = 134.52 M
llm_load_print_meta: model size = 82.41 MiB (5.14 BPW)
llm_load_print_meta: general.name = Smollm2 135M 8k Lc100K Mix1 Ep2
llm_load_print_meta: BOS token = 1 '<|im_start|>'
llm_load_print_meta: EOS token = 2 '<|im_end|>'
llm_load_print_meta: UNK token = 0 '<|endoftext|>'
llm_load_print_meta: PAD token = 2 '<|im_end|>'
llm_load_print_meta: LF token = 143 'Ä'
llm_load_print_meta: EOT token = 0 '<|endoftext|>'
llm_load_print_meta: EOG token = 0 '<|endoftext|>'
llm_load_print_meta: EOG token = 2 '<|im_end|>'
llm_load_print_meta: max token length = 162
time=2024-11-08T00:36:09.632+08:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server loading model"
ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
llm_load_tensors: ggml ctx size = 0.25 MiB
llm_load_tensors: offloading 30 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 31/31 layers to GPU
llm_load_tensors: SYCL0 buffer size = 82.46 MiB
llm_load_tensors: SYCL_Host buffer size = 28.69 MiB
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 100000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
[SYCL device table header truncated in the original log; the device row was not captured]