
Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) generates up-to-date and domain-specific answers by connecting a Large Language Model (LLM) to your enterprise data.

Developer RAG Examples

  1. QA Chatbot -- No-GPU using NVIDIA AI Foundation
  2. QA Chatbot -- A100/H100/L40S
  3. QA Chatbot -- Multi-GPU
  4. QA Chatbot -- Quantized LLM model
  5. QA Chatbot -- Task Decomposition
  6. QA Chatbot -- NemoTron Model

1: QA Chatbot -- NVIDIA AI Foundation inference endpoint

This example deploys a developer RAG pipeline for chat QA and serves inferencing via the NVIDIA AI Foundation endpoint.

Developers get free credits for 10K requests to any of the available models.

| Model | Embedding | Framework | Description | Multi-GPU | TRT-LLM | NVIDIA AI Foundation | Triton | Vector Database |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mixtral_8x7b | nvolveqa_40k | Langchain | QA chatbot | NO | NO | YES | NO | FAISS |

1.1 Prepare the environment

This example uses the NVIDIA AI Foundation inference endpoint.

  1. Follow steps 1 - 5 in the "Prepare the environment" section of example 02.

1.2 Deploy

Follow these instructions to sign up for an NVIDIA AI Foundation developer account and deploy this example.
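
In outline, deployment follows the same Docker Compose pattern used throughout this guide. The sketch below assumes the API key is exposed through an environment variable named NVIDIA_API_KEY in deploy/compose/compose.env; check compose.env for the exact variable name used by your release.

# Assumed variable name; confirm against deploy/compose/compose.env
export NVIDIA_API_KEY="<your NVIDIA AI Foundation API key>"

source deploy/compose/compose.env
docker compose -f deploy/compose/docker-compose-nv-ai-foundation.yaml build
docker compose -f deploy/compose/docker-compose-nv-ai-foundation.yaml up -d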


2: QA Chatbot -- A100/H100/L40S GPU

This example deploys a developer RAG pipeline for chat QA and serves inferencing via the NeMo Framework inference container.

⚠️ NOTE: This example requires an A100, H100, or L40S GPU. Refer to the support matrix to understand memory requirements for the model you are deploying.

| Model | Embedding | Framework | Description | Multi-GPU | TRT-LLM | NVIDIA AI Foundation | Triton | Vector Database |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama-2 | e5-large-v2 | Llamaindex | QA chatbot | NO | YES | NO | YES | Milvus |
| llama-2 | e5-large-v2 | Llamaindex | QA chatbot | NO | YES | NO | YES | pgvector |

2.1 Prepare the environment

  1. Install Docker Engine and Docker Compose.

  2. Verify NVIDIA GPU driver version 535 or later is installed.

    Note: This step is not required for the NVIDIA AI Foundation workflow.

$ nvidia-smi --query-gpu=driver_version --format=csv,noheader
535.129.03

$ nvidia-smi -q -d compute

==============NVSMI LOG==============

Timestamp                                 : Sun Nov 26 21:17:25 2023
Driver Version                            : 535.129.03
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:CA:00.0
    Compute Mode                          : Default

Reference: NVIDIA Container Toolkit and NVIDIA Linux driver installation instructions

  3. Clone the Generative AI examples Git repository.

⚠️ NOTE: This example requires Git Large File Support (LFS)

sudo apt -y install git-lfs
git clone git@github.com:NVIDIA/GenerativeAIExamples.git
cd GenerativeAIExamples/
git lfs pull

  4. Verify the NVIDIA Container Toolkit is installed and configured as the default container runtime.

    Note: This step is not required for the NVIDIA AI Foundation workflow.

$ cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

$ sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi -L
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-d8ce95c1-12f7-3174-6395-e573163a2ace)

  5. Create an NGC Account and API Key.

Refer to the instructions to create an account and generate an NGC API key.

Log in to nvcr.io using the following command:

docker login nvcr.io
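
When prompted, use $oauthtoken as the username and your NGC API key as the password, for example:

$ docker login nvcr.io
Username: $oauthtoken
Password: <your NGC API key>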

  6. [Optional] Enable Riva ASR and TTS.

    a. To launch a Riva server locally, please refer to the instructions in the Riva Quick Start Guide.

    • In the provided config.sh script, set service_enabled_asr=true and service_enabled_tts=true, and select the desired ASR and TTS languages by adding the appropriate language codes to asr_language_code and tts_language_code.

    • Once the server is running, assign its IP address (or hostname) and port (50051 by default) to RIVA_API_URI in deploy/compose/compose.env.

    b. Alternatively, you can use a hosted Riva API endpoint. You might need to obtain an API key and/or Function ID for access.

    • In deploy/compose/compose.env, make the following assignments as necessary:
    export RIVA_API_URI="<Riva API address/hostname>:<Port>"
    export RIVA_API_KEY="<Riva API key>"
    export RIVA_FUNCTION_ID="<Riva Function ID>"
    


2.2 Deploy

Downloading the model

You can download the model from either Hugging Face or Meta.

The steps here explain how to download from Meta. If you prefer to download the model checkpoints from Hugging Face, follow the steps here instead.

  1. Clone the Llama GitHub repository.
git clone https://github.com/facebookresearch/llama.git
cd llama/

  2. Fill out Meta's Llama request access form.

  3. Download the model weights.

  • Select the Llama 2 and Llama Chat text boxes.
  • After verifying your email, Meta will email you a download link.
  • Download the llama-2-13b-chat model when prompted.
$ ./download.sh
Enter the URL from email: < https://download.llamameta.net/… etc>

Enter the list of models to download without spaces (7B,13B,70B,7B-chat,13B-chat,70B-chat), or press Enter for all: 13B-chat

  4. Copy the tokenizer to the model directory.
$ mv tokenizer* llama-2-13b-chat/

$ ls ~/git/llama/llama-2-13b-chat/
checklist.chk  consolidated.00.pth  consolidated.01.pth  params.json  tokenizer.model  tokenizer_checklist.chk

Deploying the model

  1. Set the absolute path to the model location in compose.env.
$ cd ~/git/GenerativeAIExamples

$ grep MODEL deploy/compose/compose.env | grep -v \#
export MODEL_DIRECTORY="/home/nvidia/git/llama/llama-2-13b-chat/"
export MODEL_ARCHITECTURE="llama"
export MODEL_NAME="Llama-2-13b-chat"

  2. Deploy the developer RAG example via Docker Compose using the Milvus vector store. The steps to deploy the RAG example with the pgvector vector store are described in the "Deploying with pgvector vector store" section below.

⚠️ NOTE: It may take up to 5 minutes for the Triton server to start. The -d flag starts the services in the background.

$ source deploy/compose/compose.env;  docker compose -f deploy/compose/docker-compose.yaml build

$ docker compose -f deploy/compose/docker-compose.yaml up -d

$ docker ps --format "table {{.ID}}\t{{.Names}}\t{{.Status}}"
CONTAINER ID   NAMES                  STATUS
256da0ecdb7b   llm-playground         Up 48 minutes
2974aa4fb2ce   chain-server           Up 48 minutes
4a8c4aebe4ad   notebook-server        Up 48 minutes
5be2b57bb5c1   milvus-standalone      Up 48 minutes (healthy)
ecf674c8139c   llm-inference-server   Up 48 minutes (healthy)
a6609c22c171   milvus-minio           Up 48 minutes (healthy)
b23c0858c4d4   milvus-etcd            Up 48 minutes (healthy)


2.3 Test

  1. Connect to the sample web application at http://host-ip:8090.

  2. Check [X] Enable TTS output to allow the web app to read the answers to your queries aloud.

  3. Select the desired ASR language (English (en-US) for this test), TTS language (English (en-US) for this test) and TTS voice from the dropdown menus below the checkboxes to utilize the web app's voice-to-voice interaction capabilities.

  4. In the Converse tab, type "How many cores does the Grace superchip contain?" in the chat box and press Submit. Alternatively, click on the microphone button to the right of the text box and ask your query verbally.

Grace query failure

  5. If you encounter an error message reading "Media devices could not be accessed" when you first attempt to transcribe a voice query,

Media device access error

carry out the following steps:

  • Open chrome://flags in another browser tab.

  • Search for "insecure origins treated as secure".

  • Copy http://host-ip:8090 into the associated text box.

  • Select "Enabled" in the adjacent dropdown menu.

  • Click on the "Relaunch" button at the bottom right of the page.

  • Grant http://host-ip:8090 access to your microphone.

Fix media device access error in Chrome Flags

  6. Upload the sample data set to the Knowledge Base tab.

⚠️ NOTE: dataset.zip is located in the notebooks directory. Unzip the archive and upload the PDFs.

A 10-minute timeout is set for the ingestion process; uploading large files may fail if ingestion exceeds this limit, depending on network bandwidth.
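
For example (a minimal sketch; the directory layout inside the archive may differ), you can extract the PDFs before uploading them through the web UI:

cd notebooks
unzip dataset.zip -d dataset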

  7. Return to the Converse tab and check [X] Use knowledge base.

  8. Retype (or re-transcribe) the question: "How many cores does the Grace superchip contain?"

Grace query success

⚠️ NOTE: The default prompts are optimized for the Llama chat model. If you are using a completion model, the prompts need to be fine-tuned accordingly.

Learn More

Execute the Jupyter notebooks to explore optional features.

Note: The Jupyter notebooks are supported only for the default flow, i.e., TRT-LLM with Milvus.

  1. In a web browser, open Jupyter at http://host-ip:8888.

  2. Execute the notebooks in order:

2.4 Uninstall

To uninstall, stop and remove the running containers. The final docker compose ps -q command should produce no output, confirming that no containers are still running.

cd deploy/compose
source compose.env
docker compose down
docker compose ps -q

Deploying with pgvector vector store

  1. Deploy the developer RAG example via Docker compose.

⚠️ NOTE: It may take up to 5 minutes for the Triton server to start. The -d flag starts the services in the background.

$ source deploy/compose/compose.env;  docker compose -f deploy/compose/docker-compose-pgvector.yaml build

$ docker compose -f deploy/compose/docker-compose-pgvector.yaml up -d

$ docker ps --format "table {{.ID}}\t{{.Names}}\t{{.Status}}"
CONTAINER ID   NAMES                  STATUS
0f6f091d892e   llm-playground         Up 22 hours
8d0ab09fcb98   chain-server           Up 22 hours
85bd98ba3b24   notebook-server        Up 22 hours
22f0d405b38b   llm-inference-server   Up 22 hours (healthy)
cbd3cf65ce7e   pgvector               Up 22 hours

After deployment succeeds, follow the steps in the Test section to verify the workflow.


3: QA Chatbot Multi-GPU -- A100/H100/L40S

This example deploys a developer RAG pipeline for chat QA and serves inference via the NeMo Framework inference container across multiple GPUs.

| Model | Embedding | Framework | Description | Multi-GPU | TRT-LLM | NVIDIA AI Foundation | Triton | Vector Database |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama-2 | e5-large-v2 | Llamaindex | QA chatbot | YES | YES | NO | YES | Milvus |

3.1 Prepare the environment

  1. Follow the steps in the "Prepare the environment" section of example 02.

3.2 Deploy

  1. Follow steps 1 - 4 in the "Deploy" section of example 02 to stage the model weights.

  2. Find the GPU device ID. You can check this using the nvidia-smi command, as shown in the example after this list.

  3. Assign LLM inference to specific GPUs by specifying the GPU ID(s) in the docker compose file.

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              # count: ${INFERENCE_GPU_COUNT:-all} # Comment this out
              device_ids: ["0"]
              capabilities: [gpu]

  4. Follow the steps in the "Deploying the model" section of example 02 to deploy via Docker Compose.
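
For example, nvidia-smi -L lists each GPU along with its device ID (the number after "GPU"); the output below matches the example shown earlier in this guide:

$ nvidia-smi -L
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-d8ce95c1-12f7-3174-6395-e573163a2ace)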

3.3 Test

  1. Follow steps 1 - 5 in the "Test" section of example 02.

  2. Verify the correct GPU is serving the model using nvidia-smi.

3.4 Uninstall

  1. To uninstall, follow the "Uninstall" steps in example 02.

4: QA Chatbot with Quantized LLM model -- A100/H100/L40S

This example deploys a developer RAG pipeline for chat QA and serves inference via the NeMo Framework inference container across multiple GPUs using a quantized version of the Llama2-7b-chat model.

| Model | Embedding | Framework | Description | Multi-GPU | TRT-LLM | NVIDIA AI Foundation | Triton | Vector Database |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama-2-7b-chat | e5-large-v2 | Llamaindex | QA chatbot | YES | YES | NO | YES | Milvus |

4.1 Prepare the environment

  1. Follow the steps in the "Prepare the environment" section of example 02.

4.2 Deploy

  1. Download the Llama2-7b-chat model weights from Hugging Face, because the Meta checkpoint does not include the files required for quantization.

⚠️ NOTE: For this initial version, only the 7B chat model is supported on A100/H100/L40S GPUs.

  2. To quantize the Llama2 model using AWQ, first clone the TensorRT-LLM repository separately and check out release/0.5.0.

    • Also copy the Llama2 model directory downloaded earlier into the TensorRT-LLM repository:
  git clone https://github.com/NVIDIA/TensorRT-LLM.git
  cp -r <path-to-Llama2-model-directory> TensorRT-LLM/
  cd TensorRT-LLM/
  git checkout release/0.5.0

  3. Now set up the TensorRT-LLM repository separately using the steps here.

  4. Once the model is downloaded and the TensorRT-LLM repository is set up, quantize the model using the TensorRT-LLM container.

  • Follow the steps from here to quantize using AWQ; run these commands inside the container.

  • While running the quantization script, make sure to point --model_dir to your downloaded Llama2 model directory.

  • Once quantization is complete, copy the generated PyTorch (.pt) file into the model directory:

 cp <quantized-checkpoint>.pt <model-dir>

  5. Now return to this repository and follow the steps below to deploy the quantized model using the inference server.

  • Update compose.env with MODEL_DIRECTORY pointing to the Llama2 model directory containing the quantized checkpoint.

  • Make sure the quantized PyTorch model (.pt) file generated in the steps above is present inside the MODEL_DIRECTORY.

  • Uncomment the QUANTIZATION variable, which specifies quantization as "int4_awq", inside compose.env.

  export QUANTIZATION="int4_awq"

  6. Deploy the developer RAG example via Docker Compose.

⚠️ NOTE: It may take up to 5 minutes for the Triton server to start. The -d flag starts the services in the background.

$ source deploy/compose/compose.env;  docker compose -f deploy/compose/docker-compose.yaml build

$ docker compose -f deploy/compose/docker-compose.yaml up -d

$ docker ps --format "table {{.ID}}\t{{.Names}}\t{{.Status}}"
CONTAINER ID   NAMES                  STATUS
256da0ecdb7b   llm-playground         Up 48 minutes
2974aa4fb2ce   chain-server           Up 48 minutes
4a8c4aebe4ad   notebook-server        Up 48 minutes
5be2b57bb5c1   milvus-standalone      Up 48 minutes (healthy)
ecf674c8139c   llm-inference-server   Up 48 minutes (healthy)
a6609c22c171   milvus-minio           Up 48 minutes (healthy)
b23c0858c4d4   milvus-etcd            Up 48 minutes (healthy)

4.3 Test

  1. Follow steps 1 - 5 in the "Test" section of example 02.

4.4 Uninstall

  1. To uninstall, follow the "Uninstall" steps in example 02.

5: QA Chatbot with Task Decomposition example -- A100/H100/L40S

This example deploys a recursive Task Decomposition example for chat QA. It uses the llama2-70b chat model (via the NVIDIA AI Foundation endpoint) for inference.

It showcases how to perform RAG when the agent needs to access information from several different files/chunks or perform some computation on the answers. It uses a custom LangChain agent that recursively breaks down the user's question into subquestions that it attempts to answer. It has access to two tools: search (which performs standard RAG on a subquestion) and math (which poses a math question to the LLM). The agent continues to break down the question into subquestions until it has the answers it needs to formulate the final answer.

| Model | Embedding | Framework | Description | Multi-GPU | TRT-LLM | NVIDIA AI Foundation | Triton | Vector Database |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama2_70b | nvolveqa_40k | Langchain | QA chatbot | NO | NO | YES | NO | FAISS |

5.1 Prepare the environment

  1. Follow the steps in the "Prepare the environment" section of example 02.

5.2 Deploy

  1. Follow the "Deploy" section of example 01 to setup your API key

  2. Change the RAG example in deploy/compose/compose.env.

    export RAG_EXAMPLE="query_decomposition_rag"
  3. Change the LLM in deploy/compose/docker-compose-nv-ai-foundation.yaml to llama2_70b.

    query:
      container_name: chain-server
      ...
      environment:
        APP_LLM_MODELNAME: llama2_70b
        ...
  4. Deploy the Query Decomposition RAG example via Docker compose.

$ source deploy/compose/compose.env;  docker compose -f deploy/compose/docker-compose-nv-ai-foundation.yaml build

$ docker compose -f deploy/compose/docker-compose-nv-ai-foundation.yaml up -d

$ docker ps --format "table {{.ID}}\t{{.Names}}\t{{.Status}}"
CONTAINER ID   NAMES                  STATUS
256da0ecdb7b   llm-playground         Up 48 minutes
2974aa4fb2ce   chain-server           Up 48 minutes

5.3 Test

  1. Connect to the sample web application at http://host-ip:8090.

  2. Upload 2 text documents in the Knowledge Base tab. The documents can contain different information - for example, one document can contain a company's revenue analysis for Q3 2023 and the other can contain a similar analysis for Q4 2023.

  3. Return to the Converse tab and check [X] Use knowledge base.

  4. Enter the question: "Which is greater - NVIDIA's datacenter revenue for Q4 2023 or the sum of its datacenter and gaming revenues for Q3 2023?" and hit submit to get the answer.

5.4 Uninstall

  1. To uninstall, follow the "Uninstall" steps in example 02.

6: QA Chatbot -- NemoTron Model

This example deploys a developer RAG pipeline for chat QA, serves inference via the NeMo Framework inference container using the NemoTron model, and showcases inference using a sample notebook.

6.1 Prepare the environment

  1. Follow the steps in the "Prepare the environment" section of example 02.

⚠️ NOTE: This example requires at least 100 GB of GPU memory, or two A100 GPUs, for locally deploying the NemoTron model.

6.2 Deploy

  1. Download the NemoTron chat checkpoint from Hugging Face.
git-lfs clone https://huggingface.co/nvidia/nemotron-3-8b-chat-4k-sft

  2. Make sure the absolute model path of the nemotron-3-8b-chat-4k-sft model is updated in /GenerativeAIExamples/deploy/compose/compose.env. Set the values below in the compose.env file.
export MODEL_DIRECTORY="/home/nvidia/nemotron-3-8b-chat-4k-sft" # Example path
export MODEL_ARCHITECTURE="gptnext"
export MODEL_NAME="nemotron-3-8b-chat-4k-sft"

  3. Build and deploy the NemoTron workflow.
source deploy/compose/compose.env
docker compose -f deploy/compose/docker-compose-nemotron.yaml build
docker compose -f deploy/compose/docker-compose-nemotron.yaml up -d

  4. Check the deployment status by printing the logs of the llm-inference-server container.
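
For example, you can tail the container logs with docker logs, using the container name shown in the docker ps output earlier in this guide:

$ docker logs -f llm-inference-server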

A successful TRT-LLM conversion and Triton Inference Server deployment will show the following messages in the logs:

I0107 03:03:38.638311 260 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0107 03:03:38.679626 260 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
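
Optionally, if the Triton HTTP port (8000) is published to the host (this depends on your compose configuration), you can confirm readiness with a standard Triton health check:

$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready
200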

6.3 Test

  1. Run 02_langchain_simple.ipynb for LangChain-based document question answering with the NemoTron model.

[Optional] Run 00-llm-non-streaming-nemotron.ipynb to send a request to the LLM.

⚠️ NOTE:

  • Nemotron models do not support streaming in this release.

Learn More

To deep dive into different components and workflow used by the examples, please refer to the Developer Guide.