
Disable XMX #12426

Open
NikosDi opened this issue Nov 21, 2024 · 6 comments

NikosDi commented Nov 21, 2024

Hello.
I have an Intel Arc A380 and I'm running Ollama with IPEX-LLM on Ubuntu using this script:

#!/bin/bash

# Activate conda environment
source /home/nikos/miniforge3/etc/profile.d/conda.sh  # Update this with the correct Conda path
conda activate llm-cpp

# Ensure init-ollama is in the PATH (adjust as needed)
export PATH="/home/nikos/llm_env/bin:$PATH"

# Initialize Ollama
init-ollama

# Set environment variables
export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
export OLLAMA_NUM_PARALLEL=1

# Start Ollama
./ollama serve

It works fine.

For testing purposes I want to disable the use of the XMX engine (DPAS).

I added these two environment variables at the end of the script:

export BIGDL_LLM_XMX_DISABLED=1
export SYCL_USE_XMX=0

so the environment-variable section of the script now reads:

# Set environment variables
export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
export OLLAMA_NUM_PARALLEL=1
export BIGDL_LLM_XMX_DISABLED=1
export SYCL_USE_XMX=0
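
(A quick way to confirm the exports actually reach the running server process — a minimal sketch, assuming pgrep and a procfs-based Linux system:)

# Sanity check (sketch): print the XMX-related variables as seen by the
# environment of the running "ollama serve" process.
OLLAMA_PID=$(pgrep -f "ollama serve" | head -n 1)
tr '\0' '\n' < /proc/"$OLLAMA_PID"/environ | grep XMX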

Unfortunately, when I run Ollama with IPEX-LLM, the server log still shows this:

ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  3850.02 MiB
llm_load_tensors:  SYCL_Host buffer size =    72.00 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A380 Graphics|    1.3|    128|    1024|   32|  6064M|            1.3.29735|
llama_kv_cache_init:      SYCL0 KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.14 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =    84.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    12.01 MiB
llama_new_context_with_model: graph nodes  = 902
llama_new_context_with_model: graph splits = 2
[1732185101] warming up the model with an empty run

It clearly says:
ggml_sycl_init: SYCL_USE_XMX: yes

Is it possible to disable the XMX engine?

Thank you.

sgwhat commented Nov 25, 2024

Hi @NikosDi, SYCL_USE_XMX cannot be directly disabled. May I know the reason you need to disable XMX?

NikosDi commented Nov 25, 2024

Hello.
As I said above, it's for testing purposes.

I want to know the impact of XMX on overall performance.

But how could someone whose system has only an iGPU without XMX, and no discrete card, use IPEX-LLM?

Could XMX be disabled in some other way?

sgwhat commented Nov 26, 2024

Hi @NikosDi, XMX cannot be disabled through other methods. If your device does not have XMX, optimizations related to XMX will be disabled.

NikosDi commented Nov 26, 2024

Hello.
The parameter BIGDL_LLM_XMX_DISABLED existed in previous versions, and you had to set it for iGPUs in order to run BigDL-LLM.

I think it would be useful to offer a similar parameter in IPEX-LLM, besides the automatic behavior you describe.

sgwhat commented Nov 27, 2024

BIGDL_LLM_XMX_DISABLED is only applicable to the ipex-llm transformers optimization path.
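
In other words (a minimal sketch — generate.py is a hypothetical script, and this assumes a model loaded through the ipex_llm.transformers Python API rather than the Ollama/llama.cpp runner), the flag would be used like this:

# Sketch: BIGDL_LLM_XMX_DISABLED is read by the ipex-llm transformers
# (Python) optimization path, not by the SYCL backend that Ollama uses.
export BIGDL_LLM_XMX_DISABLED=1
python generate.py   # hypothetical script using ipex_llm.transformers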

NikosDi commented Nov 27, 2024

Well, using BIGDL_LLM_XMX_DISABLED=1 in the script as described above, I didn't see any speed difference on my Arc A380.

Is this expected behavior?

Is it possible to offer an option to disable XMX completely on a discrete GPU, like you do for iGPUs?
