
Disable XMX #12426

Open
NikosDi opened this issue Nov 21, 2024 · 6 comments

NikosDi commented Nov 21, 2024

Hello.
I have an Intel Arc A380 and I'm running Ollama with IPEX-LLM on Ubuntu using this script:

#!/bin/bash

# Activate conda environment
source /home/nikos/miniforge3/etc/profile.d/conda.sh  # Update this with the correct Conda path
conda activate llm-cpp

# Ensure init-ollama is in the PATH (adjust as needed)
export PATH="/home/nikos/llm_env/bin:$PATH"

# Initialize Ollama
init-ollama

# Set environment variables
export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
export OLLAMA_NUM_PARALLEL=1

# Start Ollama
./ollama serve

It works fine.

For testing purposes I want to disable the use of the XMX engine (DPAS).

I added these two environment variables at the end of the script:

export BIGDL_LLM_XMX_DISABLED=1
export SYCL_USE_XMX=0

so the environment-variable section of the script now reads:

# Set environment variables
export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
export OLLAMA_NUM_PARALLEL=1
export BIGDL_LLM_XMX_DISABLED=1
export SYCL_USE_XMX=0
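
(A quick way to confirm the exports actually reach the running server process — a minimal sketch, assuming pgrep and a procfs-based Linux system:)

# Sanity check (sketch): print the XMX-related variables as seen by the
# environment of the running "ollama serve" process.
OLLAMA_PID=$(pgrep -f "ollama serve" | head -n 1)
tr '\0' '\n' < /proc/"$OLLAMA_PID"/environ | grep XMX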

Unfortunately, when I run Ollama with IPEX-LLM, the server log still shows this:

ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  3850.02 MiB
llm_load_tensors:  SYCL_Host buffer size =    72.00 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A380 Graphics|    1.3|    128|    1024|   32|  6064M|            1.3.29735|
llama_kv_cache_init:      SYCL0 KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.14 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =    84.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    12.01 MiB
llama_new_context_with_model: graph nodes  = 902
llama_new_context_with_model: graph splits = 2
[1732185101] warming up the model with an empty run

It clearly says:
ggml_sycl_init: SYCL_USE_XMX: yes

Is it possible to disable the XMX engine?

Thank you.

sgwhat commented Nov 25, 2024

Hi @NikosDi, SYCL_USE_XMX cannot be directly disabled. May I know the reason you need to disable XMX?

NikosDi commented Nov 25, 2024

Hello.
As I said above, it's for testing purposes.

I want to know the impact of XMX on overall performance.

But how could someone whose system has only an iGPU without XMX, and no discrete card, use IPEX-LLM?

Could XMX be disabled in some other way?

sgwhat commented Nov 26, 2024

Hi @NikosDi, XMX cannot be disabled through other methods. If your device does not have XMX, optimizations related to XMX will be disabled.

NikosDi commented Nov 26, 2024

Hello.
The parameter BIGDL_LLM_XMX_DISABLED existed in previous versions, and you had to set it for iGPUs in order to run BigDL-LLM.

I think it would be useful to offer a similar parameter in IPEX-LLM, besides the automatic behavior you describe.

sgwhat commented Nov 27, 2024

BIGDL_LLM_XMX_DISABLED is only applicable to the ipex-llm transformers optimization path.
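
In other words (a minimal sketch — generate.py is a hypothetical script, and this assumes a model loaded through the ipex_llm.transformers Python API rather than the Ollama/llama.cpp runner), the flag would be used like this:

# Sketch: BIGDL_LLM_XMX_DISABLED is read by the ipex-llm transformers
# (Python) optimization path, not by the SYCL backend that Ollama uses.
export BIGDL_LLM_XMX_DISABLED=1
python generate.py   # hypothetical script using ipex_llm.transformers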

NikosDi commented Nov 27, 2024

Well, using BIGDL_LLM_XMX_DISABLED=1 in the script as described above, I didn't see any speed difference on my Arc A380.

Is this expected behavior?

Is it possible to offer an option to disable XMX completely on a discrete GPU, like you do for iGPUs?
