Retrieval Augmented Generation (RAG) generates up-to-date and domain-specific answers by connecting a Large Language Model (LLM) to your enterprise data.
- QA Chatbot -- No-GPU using NVIDIA AI Foundation
- QA Chatbot -- A100/H100/L40S
- QA Chatbot -- Multi-GPU
- QA Chatbot -- Quantized LLM model
- QA Chatbot -- Task Decomposition
- QA Chatbot -- NemoTron Model
This example deploys a developer RAG pipeline for chat QA and serves inference via the NVIDIA AI Foundation endpoint.
Developers get free credits for 10K requests to any of the available models.
Model | Embedding | Framework | Description | Multi-GPU | TRT-LLM | NVIDIA AI Foundation | Triton | Vector Database |
---|---|---|---|---|---|---|---|---|
mixtral_8x7b | nvolveqa_40k | Langchain | QA chatbot | NO | NO | YES | NO | FAISS |
This example uses the NVIDIA AI Foundation inference endpoint.
- Follow steps 1 - 5 in the "Prepare the environment" section of example 02.
Follow these instructions to sign up for an NVIDIA AI Foundation developer account and deploy this example.
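Once your API key is set, you can sanity-check access to the hosted model from Python before deploying the full pipeline. This is a minimal sketch, assuming the `langchain-nvidia-ai-endpoints` package (`pip install langchain-nvidia-ai-endpoints`) and an `NVIDIA_API_KEY` environment variable from your NVIDIA AI Foundation account; it is illustrative, not part of the deployed workflow.

```python
# Minimal sketch: query the hosted mixtral_8x7b endpoint directly.
# Assumes NVIDIA_API_KEY is exported in the environment.
from langchain_nvidia_ai_endpoints import ChatNVIDIA

llm = ChatNVIDIA(model="mixtral_8x7b", temperature=0.2, max_tokens=300)
response = llm.invoke("What is Retrieval Augmented Generation?")
print(response.content)
```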
This example deploys a developer RAG pipeline for chat QA and serves inference via the NeMo Framework inference container.
⚠️ NOTE: This example requires an A100, H100, or L40S GPU. Refer to the support matrix to understand memory requirements for the model you are deploying.
Model | Embedding | Framework | Description | Multi-GPU | TRT-LLM | NVIDIA AI Foundation | Triton | Vector Database |
---|---|---|---|---|---|---|---|---|
llama-2 | e5-large-v2 | Llamaindex | QA chatbot | NO | YES | NO | YES | Milvus |
llama-2 | e5-large-v2 | Llamaindex | QA chatbot | NO | YES | NO | YES | pgvector |
- Verify that NVIDIA GPU driver version 535 or later is installed.
Note: This step is not required for the NVIDIA AI Foundation workflow.
$ nvidia-smi -q -d compute
==============NVSMI LOG==============
Timestamp : Sun Nov 26 21:17:25 2023
Driver Version : 535.129.03
CUDA Version : 12.2
Attached GPUs : 1
GPU 00000000:CA:00.0
Compute Mode : Default
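If you prefer a scripted check over reading the log above, a short Python snippet can assert the minimum driver version (assumes `nvidia-smi` is on the PATH):

```python
# Hedged helper: verify the installed driver meets the 535 minimum.
import subprocess

version = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    text=True,
).splitlines()[0].strip()
assert int(version.split(".")[0]) >= 535, f"Driver {version} is older than 535"
print(f"Driver {version} OK")
```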
Reference: NVIDIA Container Toolkit and NVIDIA Linux driver installation instructions
- Clone the Generative AI examples Git repository.
⚠️ NOTE: This example requires Git Large File Support (LFS)
sudo apt -y install git-lfs
git clone [email protected]:NVIDIA/GenerativeAIExamples.git
cd GenerativeAIExamples/
git lfs pull
- Verify the NVIDIA container toolkit is installed and configured as the default container runtime.
Note: This step is not required for the NVIDIA AI Foundation workflow.
$ cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
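As an optional sanity check, the same configuration can be validated programmatically before running the `docker run` test below:

```python
# Confirm daemon.json registers nvidia as the default container runtime.
import json

with open("/etc/docker/daemon.json") as f:
    cfg = json.load(f)
assert cfg.get("default-runtime") == "nvidia", "nvidia is not the default runtime"
assert "nvidia" in cfg.get("runtimes", {}), "nvidia runtime is not registered"
print("NVIDIA container runtime is configured as the default")
```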
$ sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi -L
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-d8ce95c1-12f7-3174-6395-e573163a2ace)
- Create an NGC Account and API Key.
Refer to the instructions to create an account and generate an NGC API key.
Log in to nvcr.io using the following command:
docker login nvcr.io
- [Optional] Enable Riva ASR and TTS.
a. To launch a Riva server locally, please refer to the instructions in the Riva Quick Start Guide.
  - In the provided `config.sh` script, set `service_enabled_asr=true` and `service_enabled_tts=true`, and select the desired ASR and TTS languages by adding the appropriate language codes to `asr_language_code` and `tts_language_code`.
  - Once the server is running, assign its IP address (or hostname) and port (50051 by default) to `RIVA_API_URI` in `deploy/compose/compose.env`.
b. Alternatively, you can use a hosted Riva API endpoint. You might need to obtain an API key and/or Function ID for access.
  - In `deploy/compose/compose.env`, make the following assignments as necessary:
    export RIVA_API_URI="<Riva API address/hostname>:<Port>"
    export RIVA_API_KEY="<Riva API key>"
    export RIVA_FUNCTION_ID="<Riva Function ID>"
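Optionally, you can confirm the Riva server is reachable before launching the workflow. A minimal sketch, assuming a locally launched server; substitute the host and port you assigned to `RIVA_API_URI`:

```python
# Probe the Riva gRPC port (50051 by default) with a plain TCP connect.
import socket

riva_host, riva_port = "localhost", 50051  # match RIVA_API_URI
try:
    socket.create_connection((riva_host, riva_port), timeout=5).close()
    print(f"Riva reachable at {riva_host}:{riva_port}")
except OSError as err:
    print(f"Cannot reach Riva at {riva_host}:{riva_port}: {err}")
```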
Reference:
You can download the model weights from either Hugging Face or Meta.
The steps mentioned here explain how to download from Meta. If you are interested in downloading the model checkpoints from Hugging Face, follow the steps here instead.
- Clone the Llama GitHub repository.
git clone https://github.com/facebookresearch/llama.git
cd llama/
- Fill out Meta's Llama request access form.
- Download the model weights.
- Select the Llama 2 and Llama Chat text boxes.
- After verifying your email, Meta will email you a download link.
- Download the llama-2-13b-chat model when prompted.
$ ./download.sh
Enter the URL from email: < https://download.llamameta.net/… etc>
Enter the list of models to download without spaces (7B,13B,70B,7B-chat,13B-chat,70B-chat), or press Enter for all: 13B-chat
- Copy the tokenizer to the model directory.
$ mv tokenizer* llama-2-13b-chat/
$ ls ~/git/llama/llama-2-13b-chat/
checklist.chk consolidated.00.pth consolidated.01.pth params.json tokenizer.model tokenizer_checklist.chk
- Set the absolute path to the model location in compose.env.
$ cd ~/git/GenerativeAIExamples
$ grep MODEL deploy/compose/compose.env | grep -v \#
export MODEL_DIRECTORY="/home/nvidia/git/llama/llama-2-13b-chat/"
export MODEL_ARCHITECTURE="llama"
export MODEL_NAME="Llama-2-13b-chat"
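Optionally, sanity-check the model directory before building so that a wrong `MODEL_DIRECTORY` fails fast rather than inside the inference container. A short sketch using the example path from above; substitute your own:

```python
# Verify the staged Llama 2 checkpoint directory looks complete.
from pathlib import Path

model_dir = Path("/home/nvidia/git/llama/llama-2-13b-chat/")  # match MODEL_DIRECTORY
missing = [f for f in ("params.json", "tokenizer.model") if not (model_dir / f).exists()]
shards = list(model_dir.glob("consolidated.*.pth"))
assert not missing, f"Missing files: {missing}"
assert shards, "No consolidated.*.pth checkpoint shards found"
print(f"Found {len(shards)} checkpoint shard(s); directory looks complete")
```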
- Deploy the developer RAG example via Docker compose using the Milvus vector store. Steps to deploy the RAG example with the pgvector vector store are here.
⚠️ NOTE: It may take up to 5 minutes for the Triton server to start. The `-d` flag starts the services in the background.
$ source deploy/compose/compose.env; docker compose -f deploy/compose/docker-compose.yaml build
$ docker compose -f deploy/compose/docker-compose.yaml up -d
$ docker ps --format "table {{.ID}}\t{{.Names}}\t{{.Status}}"
CONTAINER ID NAMES STATUS
256da0ecdb7b llm-playground Up 48 minutes
2974aa4fb2ce chain-server Up 48 minutes
4a8c4aebe4ad notebook-server Up 48 minutes
5be2b57bb5c1 milvus-standalone Up 48 minutes (healthy)
ecf674c8139c llm-inference-server Up 48 minutes (healthy)
a6609c22c171 milvus-minio Up 48 minutes (healthy)
b23c0858c4d4 milvus-etcd Up 48 minutes (healthy)
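Rather than guessing when the Triton server has finished starting, you can poll its standard health endpoint. A minimal sketch, assuming Triton's default HTTP port (8000) is mapped to the host:

```python
# Poll Triton's KServe v2 readiness endpoint until it reports ready.
import time
import requests

url = "http://localhost:8000/v2/health/ready"
for _ in range(60):
    try:
        if requests.get(url, timeout=2).status_code == 200:
            print("Triton is ready")
            break
    except requests.ConnectionError:
        pass
    time.sleep(10)
else:
    print("Triton did not become ready within 10 minutes")
```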
Reference:
- Connect to the sample web application at `http://host-ip:8090`.
- Check [X] Enable TTS output to allow the web app to read the answers to your queries aloud.
- Select the desired ASR language (`English (en-US)` for this test), TTS language (`English (en-US)` for this test), and TTS voice from the dropdown menus below the checkboxes to utilize the web app's voice-to-voice interaction capabilities.
- In the Converse tab, type "How many cores does the Grace superchip contain?" in the chat box and press Submit. Alternatively, click on the microphone button to the right of the text box and ask your query verbally.
- If you encounter an error message reading "Media devices could not be accessed" when you first attempt to transcribe a voice query, carry out the following steps:
  - Open `chrome://flags` in another browser tab.
  - Search for "insecure origins treated as secure".
  - Copy `http://host-ip:8090` into the associated text box.
  - Select "Enabled" in the adjacent dropdown menu.
  - Click on the "Relaunch" button at the bottom right of the page.
  - Grant `http://host-ip:8090` access to your microphone.
- Upload the sample data set to the Knowledge Base tab.
⚠️ NOTE: `dataset.zip` is located in the `notebooks` directory. Unzip the archive and upload the PDFs.
There is a 10-minute timeout for the ingestion process; ingestion of large files may fail depending on network bandwidth.
- Return to the Converse tab and check [X] Use knowledge base.
- Retype (or re-transcribe) the question: "How many cores does the Grace superchip contain?"
⚠️ NOTE: Default prompts are optimized for the Llama chat model; if you're using a completion model, the prompts need to be fine-tuned accordingly.
Execute the Jupyter notebooks to explore optional features.
Note: The Jupyter notebooks are supported for the default flow, i.e., TRT-LLM with Milvus.
- In a web browser, open Jupyter at `http://host-ip:8888`.
- Execute the notebooks in order:
- Enable streaming responses from the LLM
- Document QA with LangChain
- Document QA with LlamaIndex
- Advanced Document QA with LlamaIndex
- Document QA via REST FastAPI Server
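Outside the notebooks, the chain server can also be exercised with a plain HTTP client. This is a hedged sketch: the port (`8081`), the `/generate` route, and the payload fields are assumptions drawn from the REST FastAPI notebook, so check that notebook for the actual host, routes, and request schema.

```python
# Hypothetical REST query against the chain server; verify the route and
# schema in the "Document QA via REST FastAPI Server" notebook first.
import requests

payload = {
    "question": "How many cores does the Grace superchip contain?",
    "use_knowledge_base": True,
    "num_tokens": 256,
}
with requests.post("http://localhost:8081/generate", json=payload,
                   stream=True, timeout=120) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="")
```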
To uninstall, stop and remove the running containers.
cd deploy/compose
source compose.env
docker compose down
docker compose ps -q
Deploying with pgvector vector store
- Deploy the developer RAG example via Docker compose.
⚠️ NOTE: It may take up to 5 minutes for the Triton server to start. The `-d` flag starts the services in the background.
$ source deploy/compose/compose.env; docker compose -f deploy/compose/docker-compose-pgvector.yaml build
$ docker compose -f deploy/compose/docker-compose-pgvector.yaml up -d
$ docker ps --format "table {{.ID}}\t{{.Names}}\t{{.Status}}"
CONTAINER ID NAMES STATUS
0f6f091d892e llm-playground Up 22 hours
8d0ab09fcb98 chain-server Up 22 hours
85bd98ba3b24 notebook-server Up 22 hours
22f0d405b38b llm-inference-server Up 22 hours (healthy)
cbd3cf65ce7e pgvector Up 22 hours
After the deployment is successful, you can follow the steps in the "Test" section to verify the workflow.
This example deploys a developer RAG pipeline for chat QA and serves inference via the NeMo Framework inference container across multiple GPUs.
Model | Embedding | Framework | Description | Multi-GPU | TRT-LLM | NVIDIA AI Foundation | Triton | Vector Database |
---|---|---|---|---|---|---|---|---|
llama-2 | e5-large-v2 | Llamaindex | QA chatbot | YES | YES | NO | YES | Milvus |
- Follow the steps in the "Prepare the environment" section of example 02.
- Follow steps 1 - 4 in the "Deploy" section of example 02 to stage the model weights.
- Find the GPU device ID. You can check this using the `nvidia-smi` command.
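A quick way to list the candidate device IDs (assumes `nvidia-smi` is on the PATH):

```python
# Print "index, name, total memory" for each GPU so you can pick device_ids.
import subprocess

rows = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=index,name,memory.total", "--format=csv,noheader"],
    text=True,
).strip().splitlines()
for row in rows:
    print(row)  # e.g. "0, NVIDIA A100 80GB PCIe, 81920 MiB"
```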
- Assign LLM inference to specific GPUs by specifying the GPU ID(s) in the docker compose file.
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          # count: ${INFERENCE_GPU_COUNT:-all} # Comment this out
          device_ids: ["0"]
          capabilities: [gpu]
- Follow the steps in the "Deploy the model" section of example 02 to deploy via Docker compose.
- Follow steps 1 - 5 in the "Test" section of example 02.
- Verify the correct GPU is serving the model using `nvidia-smi`.
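One way to script that verification is to list the compute processes per GPU and confirm the inference server appears only on the GPU you pinned (assumes `nvidia-smi` is on the PATH):

```python
# Show which processes are using which GPU.
import subprocess

out = subprocess.check_output(
    ["nvidia-smi",
     "--query-compute-apps=gpu_uuid,pid,process_name,used_memory",
     "--format=csv,noheader"],
    text=True,
)
print(out or "No compute processes found")
```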
- To uninstall, follow the "Uninstall" steps in example 02.
This example deploys a developer RAG pipeline for chat QA and serves inference via the NeMo Framework inference container across multiple GPUs using a quantized version of the Llama2-7b-chat model.
Model | Embedding | Framework | Description | Multi-GPU | TRT-LLM | NVIDIA AI Foundation | Triton | Vector Database |
---|---|---|---|---|---|---|---|---|
llama-2-7b-chat | e5-large-v2 | Llamaindex | QA chatbot | YES | YES | NO | YES | Milvus |
- Follow the steps in the "Prepare the environment" section of example 02.
- Download the Llama2-7b-chat model weights from Hugging Face, because the Meta checkpoint does not include the files required for quantization.
⚠️ NOTE: For this initial version, only the 7B chat model is supported on A100/H100/L40 GPUs.
- To quantize the Llama2 model using AWQ, first clone the TensorRT-LLM repository separately and check out release/0.5.0.
- Also copy the Llama2 model directory downloaded earlier into the TensorRT-LLM repo:
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cp -r <path-to-Llama2-model-directory> TensorRT-LLM/
cd TensorRT-LLM/
git checkout release/0.5.0
- Now set up the TensorRT-LLM repo separately using the steps here.
- Once the model is downloaded and the TensorRT-LLM repo is set up, we can quantize the model using the TensorRT-LLM container.
- Follow the steps from here to quantize using AWQ, running these commands inside the container.
- While running the quantization script, make sure to point `--model_dir` to your downloaded Llama2 model directory.
- Once quantization is complete, copy the generated PyTorch (.pt) file into the model directory.
cp <quantized-checkpoint>.pt <model-dir>
- Now return to our repository and follow the steps below to deploy this quantized model using the inference server.
- Update compose.env with `MODEL_DIRECTORY` pointing to the Llama2 model directory containing the quantized checkpoint.
- Make sure the quantized PyTorch model (.pt) file generated in the steps above is present inside the MODEL_DIRECTORY.
- Uncomment the QUANTIZATION variable, which sets quantization to "int4_awq", inside compose.env.
export QUANTIZATION="int4_awq"
- Deploy the developer RAG example via Docker compose.
⚠️ NOTE: It may take up to 5 minutes for the Triton server to start. The `-d` flag starts the services in the background.
$ source deploy/compose/compose.env; docker compose -f deploy/compose/docker-compose.yaml build
$ docker compose -f deploy/compose/docker-compose.yaml up -d
$ docker ps --format "table {{.ID}}\t{{.Names}}\t{{.Status}}"
CONTAINER ID NAMES STATUS
256da0ecdb7b llm-playground Up 48 minutes
2974aa4fb2ce chain-server Up 48 minutes
4a8c4aebe4ad notebook-server Up 48 minutes
5be2b57bb5c1 milvus-standalone Up 48 minutes (healthy)
ecf674c8139c llm-inference-server Up 48 minutes (healthy)
a6609c22c171 milvus-minio Up 48 minutes (healthy)
b23c0858c4d4 milvus-etcd Up 48 minutes (healthy)
- Follow steps 1 - 5 in the "Test" section of example 02.
- To uninstall, follow the "Uninstall" steps in example 02.
This example deploys a recursive Task Decomposition example for chat QA. It uses the llama2-70b chat model (via the NVIDIA AI Foundation endpoint) for inference.
It showcases how to perform RAG when the agent needs to access information from several different files/chunks or perform some computation on the answers. It uses a custom LangChain agent that recursively breaks down the user's questions into subquestions that it attempts to answer. It has access to two tools: search (which performs standard RAG on a subquestion) and math (which poses a math question to the LLM). The agent continues to break down the question into sub-questions until it has the answers it needs to formulate the final answer.
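To make the control flow concrete, here is a toy sketch of the recursive decomposition loop; it is not the repository's agent implementation, and the LLM-driven steps (`decompose`, `synthesize`) and both tools are stubbed so the example runs self-contained.

```python
# Toy recursive task decomposition: break a question into sub-questions,
# answer atomic ones with a tool, then synthesize a final answer.
from typing import Callable, Dict, List

def decompose(question: str) -> List[str]:
    # Stand-in for the LLM proposing sub-questions; hard-coded for the
    # revenue-comparison example from the Test section below.
    if "greater" in question.lower():
        return ["What was the datacenter revenue for Q4 2023?",
                "What was the sum of datacenter and gaming revenues for Q3 2023?"]
    return []  # atomic question: answer it directly with a tool

def synthesize(question: str, facts: List[str]) -> str:
    # Stand-in for the LLM combining sub-answers (possibly via the math tool).
    return f"Final answer to {question!r}, based on: {facts}"

def answer(question: str, tools: Dict[str, Callable[[str], str]],
           depth: int = 0, max_depth: int = 3) -> str:
    subquestions = decompose(question) if depth < max_depth else []
    if not subquestions:
        return tools["search"](question)  # standard RAG on a sub-question
    facts = [answer(q, tools, depth + 1) for q in subquestions]
    return synthesize(question, facts)

tools = {"search": lambda q: f"<retrieved answer for {q!r}>",
         "math": lambda q: f"<LLM math result for {q!r}>"}
print(answer("Which is greater - NVIDIA's datacenter revenue for Q4 2023 or the "
             "sum of its datacenter and gaming revenues for Q3 2023?", tools))
```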
Model | Embedding | Framework | Description | Multi-GPU | TRT-LLM | NVIDIA AI Foundation | Triton | Vector Database |
---|---|---|---|---|---|---|---|---|
llama2_70b | nvolveqa_40k | Langchain | QA chatbot | NO | NO | YES | NO | FAISS |
- Follow the steps in the "Prepare the environment" section of example 02.
- Follow the "Deploy" section of example 01 to set up your API key.
- Change the RAG example in `deploy/compose/compose.env`:
export RAG_EXAMPLE="query_decomposition_rag"
- Change the LLM in `deploy/compose/docker-compose-nv-ai-foundation.yaml` to `llama2_70b`:
query:
  container_name: chain-server
  ...
  environment:
    APP_LLM_MODELNAME: llama2_70b
    ...
- Deploy the Query Decomposition RAG example via Docker compose.
$ source deploy/compose/compose.env; docker compose -f deploy/compose/docker-compose-nv-ai-foundation.yaml build
$ docker compose -f deploy/compose/docker-compose-nv-ai-foundation.yaml up -d
$ docker ps --format "table {{.ID}}\t{{.Names}}\t{{.Status}}"
CONTAINER ID NAMES STATUS
256da0ecdb7b llm-playground Up 48 minutes
2974aa4fb2ce chain-server Up 48 minutes
- Connect to the sample web application at `http://host-ip:8090`.
- Upload 2 text documents in the Knowledge Base tab. The documents can contain different information - for example, one document can contain a company's revenue analysis for Q3 2023 and the other can contain a similar analysis for Q4 2023.
- Return to the Converse tab and check [X] Use knowledge base.
- Enter the question: "Which is greater - NVIDIA's datacenter revenue for Q4 2023 or the sum of its datacenter and gaming revenues for Q3 2023?" and hit Submit to get the answer.
- To uninstall, follow the "Uninstall" steps in example 02.
This example deploys a developer RAG pipeline for chat QA, serves inference via the NeMo Framework inference container using the Nemotron model, and showcases inference using a sample notebook.
- Follow the steps in the "Prepare the environment" section of example 02.
⚠️ NOTE: This example requires at least 100GB of GPU memory or two A100 GPUs for locally deploying the Nemotron model.
- Download the Nemotron chat checkpoint from Hugging Face:
git-lfs clone https://huggingface.co/nvidia/nemotron-3-8b-chat-4k-sft
- Make sure the absolute model path of the nemotron-3-8b-chat-4k-sft model is updated in `/GenerativeAIExamples/deploy/compose/compose.env`. Set the values below in the `compose.env` file.
export MODEL_DIRECTORY="/home/nvidia/nemotron-3-8b-chat-4k-sft" # Example path
export MODEL_ARCHITECTURE="gptnext"
export MODEL_NAME="nemotron-3-8b-chat-4k-sft"
- Build and deploy the Nemotron workflow:
source deploy/compose/compose.env
docker compose -f deploy/compose/docker-compose-nemotron.yaml build
docker compose -f deploy/compose/docker-compose-nemotron.yaml up -d
- Check the deployment status by printing the logs of the `llm-inference-server` container, for example with `docker logs llm-inference-server`.
A successful TRT-LLM conversion and Triton Inference Server deployment will display the following messages:
I0107 03:03:38.638311 260 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0107 03:03:38.679626 260 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
- Run `02_langchain_simple.ipynb` for document question-answering with LangChain using the Nemotron model.
- [Optional] Run `00-llm-non-streaming-nemotron.ipynb` to send a request to the LLM.
⚠️ NOTE: Nemotron models do not support streaming in this release.
To dive deeper into the different components and workflows used by the examples, refer to the Developer Guide.