The pure C++ text-to-image pipeline, driven by the OpenVINO native C++ API for Stable Diffusion v1.5 with LMS Discrete Scheduler, supports both static and dynamic model inference. It includes advanced features like LoRA integration with safetensors and OpenVINO Tokenizers. Loading openvino_tokenizers to ov::Core enables tokenization. The sample uses diffusers for image generation and imwrite for saving .bmp images. This demo has been tested on Windows and Unix platforms. There is also a Jupyter notebook which provides an example of image generation in Python.
Note
This tutorial assumes that the current working directory is <openvino.genai repo>/image_generation/stable_diffusion_1_5/cpp/ and all paths are relative to this folder.
Prerequisites:
- Conda (installation guide)
C++ Packages:
- CMake: Cross-platform build tool
- OpenVINO: Model inference. master and possibly the latest releases/* branch correspond to not yet released OpenVINO versions. The nightly packages at https://storage.openvinotoolkit.org/repositories/openvino/packages/nightly/ can be used for early testing of these branches.
- Prepare a Python environment and install dependencies:
conda create -n openvino_sd_cpp python==3.10
conda activate openvino_sd_cpp
conda install -c conda-forge openvino=2024.2.0 c-compiler cxx-compiler git make cmake
# Ensure that Conda standard libraries are used
conda env config vars set LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
- Install dependencies to import models from HuggingFace:
git submodule update --init
# Reactivate Conda environment after installing dependencies and setting env vars
conda activate openvino_sd_cpp
python -m pip install -r ../../requirements.txt
python -m pip install ../../../thirdparty/openvino_tokenizers/[transformers]
- Download the model from HuggingFace and convert it to OpenVINO IR via optimum-intel CLI.
Example models to download:
Example command for downloading the dreamlike-art/dreamlike-anime-1.0 model and exporting it with FP16 precision:
optimum-cli export openvino --model dreamlike-art/dreamlike-anime-1.0 --task stable-diffusion --weight-format fp16 models/dreamlike_anime_1_0_ov/FP16
You can also choose another precision and export an FP32 or INT8 model.
Refer to the official websites for 🤗 Optimum and optimum-intel for more details.
If https://huggingface.co/ is down, the script won't be able to download the model.
Note
The pipeline currently supports batch size = 1 only, i.e. a static model with shape (1, 3, 512, 512).
Low-Rank Adaptation (LoRA) is a technique introduced to reduce the cost of fine-tuning diffusion models and Large Language Models (LLMs). In the case of Stable Diffusion fine-tuning, LoRA can be applied to the cross-attention layers that relate the image representations to the prompts that describe them.
LoRA weights can be enabled for the UNet model of the Stable Diffusion pipeline to generate images with different styles.
In this sample LoRA weights are used in safetensors format. Safetensors is a serialization format developed by Hugging Face that is specifically designed for efficiently storing and loading large tensors. It provides a lightweight and efficient way to serialize tensors, making it easier to store and load machine learning models.
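As an illustration of the file layout only, here is a minimal sketch of reading the header of a safetensors buffer. It relies on the published layout: an 8-byte little-endian unsigned header length, followed by that many bytes of JSON metadata, followed by the raw tensor data. The function name read_safetensors_header is hypothetical and is not taken from this sample.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: returns the JSON header of a safetensors buffer.
// Layout: [8-byte little-endian header length N][N bytes of JSON][tensor data].
std::string read_safetensors_header(const std::vector<std::uint8_t>& bytes) {
    if (bytes.size() < 8) {
        throw std::runtime_error("buffer too small to hold the header length");
    }
    // Assemble the 64-bit length from the least significant byte upward.
    std::uint64_t header_len = 0;
    for (int i = 7; i >= 0; --i) {
        header_len = (header_len << 8) | bytes[static_cast<std::size_t>(i)];
    }
    if (header_len > bytes.size() - 8) {
        throw std::runtime_error("header length exceeds buffer size");
    }
    const auto first = bytes.begin() + 8;
    return std::string(first, first + static_cast<std::ptrdiff_t>(header_len));
}
```

The JSON header maps each tensor name to its dtype, shape, and byte offsets, which is what allows individual tensors to be located without parsing the whole file.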
The LoRA safetensors model is loaded via safetensors.h. The layer names and weights are modified with the Eigen library and inserted into the SD models with ov::pass::MatcherPass in the file common/diffusers/src/lora.cpp.
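The general idea of a LoRA weight update can be sketched as W' = W + alpha * (up x down), where up and down are the two low-rank matrices. The code below is a simplified illustration using plain std::vector instead of Eigen. The function name merge_lora and the row-major layout are assumptions made for this sketch, and the actual logic in lora.cpp may differ (for example, in how scaling is applied).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: adds alpha * (up x down) to a weight matrix in place.
// All matrices are row-major: weight is out x in, up is out x rank,
// down is rank x in.
void merge_lora(std::vector<float>& weight,
                const std::vector<float>& up,
                const std::vector<float>& down,
                std::size_t out, std::size_t in, std::size_t rank,
                float alpha) {
    for (std::size_t o = 0; o < out; ++o) {
        for (std::size_t i = 0; i < in; ++i) {
            // One element of the low-rank product up x down.
            float delta = 0.0f;
            for (std::size_t r = 0; r < rank; ++r) {
                delta += up[o * rank + r] * down[r * in + i];
            }
            weight[o * in + i] += alpha * delta;
        }
    }
}
```

Because rank is much smaller than out and in, the two low-rank matrices are far smaller than the weight they modify, which is why LoRA files stay compact.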
There are various LoRA models on https://civitai.com/tag/lora and on HuggingFace; you can choose your own LoRA model in safetensors format. For example, you can use the LoRA soulcard model.
Download the LoRA safetensors model and put it into the models directory. When running the built sample, provide the path to the LoRA model with the -l, --loraPath arg argument.
conda activate openvino_sd_cpp
cmake -DCMAKE_BUILD_TYPE=Release -S . -B build
cmake --build build --parallel
./build/stable_diffusion [-p <posPrompt>] [-n <negPrompt>] [-s <seed>] [--height <output image>] [--width <output image>] [-d <device>] [-r <readNPLatent>] [-l <lora.safetensors>] [-a <alpha>] [-h <help>] [-m <modelPath>] [-t <modelType>] [--guidanceScale <guidanceScale>] [--dynamic]
Usage:
stable_diffusion [OPTION...]
- -p, --posPrompt arg: Initial positive prompt for SD (default: "cyberpunk cityscape like Tokyo New York with tall buildings at dusk golden hour cinematic lighting")
- -n, --negPrompt arg: The prompt to guide the image generation away from. Ignored when not using guidance (--guidanceScale is less than 1) (default: "")
- -d, --device arg: AUTO, CPU, or GPU. Doesn't apply to the Tokenizer model, since OpenVINO Tokenizers can be inferred on a CPU device only (default: CPU)
- --step arg: Number of diffusion steps (default: 20)
- -s, --seed arg: Random seed used to generate the latent (default: 42)
- --guidanceScale arg: A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality (default: 7.5)
- --num arg: Number of output images (default: 1)
- --height arg: Height of output image (default: 512)
- --width arg: Width of output image (default: 512)
- -c, --useCache: Use model caching
- -r, --readNPLatent: Read numpy generated latents from file
- -m, --modelPath arg: Specify path of SD model IR (default: ../models/dreamlike_anime_1_0_ov)
- -t, --type arg: Specify the type of SD model IRs (FP32, FP16 or INT8) (default: FP16)
- --dynamic: Specify the model input shape to use dynamic shape
- -l, --loraPath arg: Specify path of LoRA file (*.safetensors) (default: )
- -a, --alpha arg: Alpha for LoRA (default: 0.75)
- -h, --help: Print usage
Note
The tokenizer model will always be loaded to CPU: OpenVINO Tokenizers can be inferred on a CPU device only.
Positive prompt: cyberpunk cityscape like Tokyo New York with tall buildings at dusk golden hour cinematic lighting
Negative prompt: (empty, check the Notes for details)
To read the NumPy-generated latent from a file instead of generating it with the C++ standard library, for alignment with the Python pipeline, use the -r, --readNPLatent argument.
- Generate an image without LoRA:
./build/stable_diffusion -r
- Generate an image with the soulcard LoRA:
./build/stable_diffusion -r -l path/to/soulcard.safetensors
- Generate an image of a different size with the dynamic model (latent generated by the C++ standard library):
./build/stable_diffusion -m ./models/dreamlike_anime_1_0_ov -t FP16 --dynamic --height 448 --width 704
For generation quality, be careful with the negative prompt and random latent generation. C++ random generation with MT19937 produces results that differ from numpy.random.randn(). Hence, use -r, --readNPLatent for alignment with Python (this latent file is for a 512x512 output image only).
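A minimal sketch of seeded latent generation with the C++ standard library is shown below. The function name make_latent is hypothetical, and the latent layout (4 channels at 1/8 of the image resolution, as commonly used by Stable Diffusion v1.5) is an assumption of this sketch rather than something taken from the sample code.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <stdexcept>
#include <vector>

// Hypothetical helper: fills a latent tensor with standard normal noise.
// Assumed shape: (1, 4, height / 8, width / 8), flattened into one vector.
std::vector<float> make_latent(std::uint32_t seed,
                               std::size_t height, std::size_t width) {
    if (height % 8 != 0 || width % 8 != 0) {
        throw std::invalid_argument("height and width must be multiples of 8");
    }
    const std::size_t count = 4 * (height / 8) * (width / 8);
    std::mt19937 gen(seed);  // Mersenne Twister, seeded for repeatability
    std::normal_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> latent(count);
    for (float& value : latent) {
        value = dist(gen);
    }
    return latent;
}
```

The same seed reproduces the same latent within one build, but the values are not expected to match numpy.random.randn(), since the conversion from raw generator output to normal samples is not the same. The exact output of std::normal_distribution is also not fixed by the C++ standard, so it may vary between standard library implementations.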
Guidance scale controls how similar the generated image will be to the prompt. A higher guidance scale means the model will try to generate an image that follows the prompt more strictly. A lower guidance scale means the model will have more creativity.
guidance_scale is a way to increase the adherence to the conditional signal that guides the generation (text, in this case) as well as overall sample quality. It is also known as classifier-free guidance.
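The usual classifier-free guidance formula combines two noise predictions per step: one conditioned on the text prompt and one unconditional (or, when a negative prompt is given, conditioned on the negative prompt). A minimal sketch follows; the function name apply_guidance is hypothetical and the sample's own implementation may be organized differently.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper implementing the usual classifier-free guidance formula:
// result = uncond + guidance_scale * (text - uncond)
std::vector<float> apply_guidance(const std::vector<float>& noise_uncond,
                                  const std::vector<float>& noise_text,
                                  float guidance_scale) {
    if (noise_uncond.size() != noise_text.size()) {
        throw std::invalid_argument("noise predictions must have equal size");
    }
    std::vector<float> result(noise_uncond.size());
    for (std::size_t i = 0; i < result.size(); ++i) {
        result[i] = noise_uncond[i] +
                    guidance_scale * (noise_text[i] - noise_uncond[i]);
    }
    return result;
}
```

With guidance_scale equal to 1 the result reduces to the text-conditioned prediction, so the unconditional (or negative prompt) branch has no effect. Larger values push the result further away from that branch and toward the prompt.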
To improve image generation quality, the model supports negative prompting. Technically, the positive prompt steers the diffusion toward the images associated with it, while the negative prompt steers the diffusion away from it. In other words, the negative prompt declares undesired concepts for the generated image. For example, if we want a colorful and bright image, a grayscale image is a result we want to avoid, so grayscale can be used as the negative prompt. The positive and negative prompts are on equal footing: you can always use one with or without the other. More explanation of how it works can be found in this article.
Note
Negative prompting is applicable only when the guidance scale is greater than 1.
Refer to the OpenVINO blog to get more information on enabling LoRA weights.