Userguide changes #7502

Open: wants to merge 91 commits into base: main
Commits (91)
9e6e71a
Restructure Triton User Guide
Aug 6, 2024
add69fb
Add multi level support
statiraju Aug 14, 2024
f2cdf90
Add server documentation
statiraju Aug 15, 2024
4073c1b
Create quick_start.rst
harryskim Aug 20, 2024
d4b7a11
Use sphinx default parser for rendering for new userguide requirements
statiraju Aug 23, 2024
4a8d5d3
Merge branch 'statiraju-userguide' of https://github.com/triton-infer…
statiraju Aug 23, 2024
bba711a
Fix footer to use nvidia_sphinx theme
statiraju Aug 23, 2024
8e4eec3
Create quick_start.rst
harryskim Aug 23, 2024
7b8b46a
Create llm.rst
harryskim Aug 23, 2024
b243350
Create multimodal.rst
harryskim Aug 23, 2024
c49c247
Update llm.rst
harryskim Aug 23, 2024
a10d5e1
Create embedding.rst
harryskim Aug 23, 2024
697e797
Create stable_diffusion.rst
harryskim Aug 23, 2024
cc18c0e
Create vision.rst
harryskim Aug 23, 2024
f05d990
Delete docs/doc1 directory
harryskim Aug 23, 2024
321f0a8
Create architecture.md
harryskim Aug 23, 2024
e37b3b7
Create placeholder.rst
harryskim Aug 23, 2024
0dde7b2
Create perf_analyzer.rst
harryskim Aug 23, 2024
7b581ce
Update perf_analyzer.rst
harryskim Aug 23, 2024
3d491a2
Update perf_analyzer.rst
harryskim Aug 23, 2024
331eda7
Update contents.rst
harryskim Aug 23, 2024
c59a5a7
Create genai_perf.rst
harryskim Aug 24, 2024
a960069
Update contents.rst
harryskim Aug 24, 2024
0578946
Update genai_perf.rst
harryskim Aug 24, 2024
0ecd872
Create model_analyzer.rst
harryskim Aug 24, 2024
9c32d2d
Update contents.rst
harryskim Aug 24, 2024
42a4ef9
Create trt_llm.rst
harryskim Aug 25, 2024
63ab3e1
Update perf_analyzer/model_analyzer with correct rel path
statiraju Aug 25, 2024
c5b71cf
Improve rendering with myst-parser using pandoc
statiraju Aug 26, 2024
b0a2056
Fix version_match and switcher.json
statiraju Aug 27, 2024
3a5d045
Add server documentation
statiraju Aug 27, 2024
766b201
Use default nvidia-sphinx-theme css
statiraju Aug 27, 2024
9169362
Add server side docs
statiraju Aug 28, 2024
0a0d2a3
Add state management to server
statiraju Aug 28, 2024
c42802b
Update quick_start.rst
harryskim Aug 30, 2024
5d72e61
Create quick_deployment_by_backend.rst
harryskim Aug 30, 2024
bab5ba8
Delete docs/getting_started/llm.rst
harryskim Aug 30, 2024
1c47a01
Create llm.md
harryskim Aug 30, 2024
2c9ec6e
Update contents.rst
harryskim Aug 30, 2024
3c8a767
Delete docs/getting_started/embedding.rst
harryskim Aug 30, 2024
42970d8
Delete docs/getting_started/multimodal.rst
harryskim Aug 30, 2024
ba56590
Delete docs/getting_started/stable_diffusion.rst
harryskim Aug 30, 2024
dacac5c
Delete docs/getting_started/vision.rst
harryskim Aug 30, 2024
4a2e969
Update contents.rst for backends
harryskim Aug 30, 2024
ae90b0e
Update title of llm.md
harryskim Aug 31, 2024
e5a9971
Update quick_deployment_by_backend.rst
harryskim Aug 31, 2024
9ebd84a
Measure edits to ensure llm guide flow is sound
harryskim Sep 1, 2024
1097977
Update llm.md
harryskim Sep 1, 2024
9bb8005
Create release_note.md
harryskim Sep 2, 2024
382e287
Create compatibility.md
harryskim Sep 2, 2024
6c398c8
Update contents.rst
harryskim Sep 2, 2024
2d7a7ee
Update release_note.md with 24.08 release
harryskim Sep 2, 2024
f884e7d
fix type for bash command
harryskim Sep 2, 2024
bd39051
Update release_note.md
harryskim Sep 2, 2024
de44f82
Update index.md to include Triton architecture section
harryskim Sep 2, 2024
fb97b27
Add client documentation
statiraju Sep 3, 2024
6b0833b
Merge branch 'main' into statiraju-userguide
statiraju Sep 3, 2024
ddd9552
Fix links in Getting started
statiraju Sep 3, 2024
04758fa
Add model execution to server features
statiraju Sep 3, 2024
d23e070
Update contents.rst
harryskim Sep 3, 2024
b32eba3
Update compatibility.md
harryskim Sep 3, 2024
b58685a
Update compatibility.md
harryskim Sep 3, 2024
6869ef8
Update compatibility.md
harryskim Sep 3, 2024
476b64b
Delete docs/vision.rst
harryskim Sep 4, 2024
9b2f8f7
Delete docs/vlm.rst
harryskim Sep 4, 2024
cec023d
Delete docs/quickstart.rst
harryskim Sep 4, 2024
f6c5be4
Create scaling_guide.rst
harryskim Sep 4, 2024
71dd2ed
Create multi_node.md
harryskim Sep 4, 2024
cae91c4
Create multi_instance.md
harryskim Sep 4, 2024
b961c72
Create mig.md
harryskim Sep 4, 2024
65c8c0c
Update contents.rst
harryskim Sep 4, 2024
a24674d
Delete docs/llminference.rst
harryskim Sep 4, 2024
ec24171
Delete docs/deployment.rst
harryskim Sep 4, 2024
6619355
Delete docs/stablediffusion.rst
harryskim Sep 4, 2024
a9c7e0b
Update scaling_guide.rst
harryskim Sep 4, 2024
1ee50c4
Delete docs/k8.rst
harryskim Sep 4, 2024
bb42c9d
Create index.md
harryskim Sep 4, 2024
798e4f1
Create release_note.md
harryskim Sep 4, 2024
08fb7fd
Create compatibility.md
harryskim Sep 4, 2024
585d32b
Update contents.rst
harryskim Sep 4, 2024
4423e11
Delete docs/release_note.md
harryskim Sep 4, 2024
305946b
Delete docs/compatibility.md
harryskim Sep 4, 2024
8da9e61
Create vllm.rst
harryskim Sep 4, 2024
7c29940
Update contents.rst
harryskim Sep 4, 2024
774b21b
Update compatibility.md
nvda-mesharma Sep 4, 2024
8691de4
Update contents.rst
harryskim Sep 5, 2024
5ba17dc
Update compatibility.md
nvda-mesharma Oct 18, 2024
4298679
Update scaling_guide.rst
harryskim Oct 29, 2024
a9c7a69
Update llm.md
harryskim Oct 29, 2024
be1c27c
Update quick_deployment_by_backend.rst
harryskim Oct 29, 2024
d10b406
Update compatibility.md
nvda-mesharma Nov 21, 2024
6 changes: 6 additions & 0 deletions docs/Dockerfile.docs
@@ -68,11 +68,17 @@ RUN pip3 install \
sphinx-book-theme \
sphinx-copybutton \
sphinx-design \
sphinx-mdinclude \
sphinx-prompt \
sphinx-sitemap \
sphinx-tabs \
sphinxcontrib-bibtex

# install nvidia-sphinx-theme
RUN pip3 install \
--index-url https://urm.nvidia.com/artifactory/api/pypi/ct-omniverse-pypi/simple/ \
nvidia-sphinx-theme

# Set visitor script to be included on every HTML page
ENV VISITS_COUNTING_SCRIPT="//assets.adobedtm.com/b92787824f2e0e9b68dc2e993f9bd995339fe417/satelliteLib-7ba51e58dc61bcb0e9311aadd02a0108ab24cc6c.js"

6 changes: 3 additions & 3 deletions docs/README.md
@@ -124,9 +124,9 @@ Triton supports batching individual inference requests to improve compute resour
- [Queuing Policies](user_guide/model_configuration.md#queue-policy)
- [Ragged Batching](user_guide/ragged_batching.md)
- [Sequence Batcher](user_guide/model_configuration.md#sequence-batcher)
- [Stateful Models](user_guide/architecture.md#stateful-models)
- [Control Inputs](user_guide/architecture.md#control-inputs)
- [Implicit State - Stateful Inference Using a Stateless Model](user_guide/architecture.md#implicit-state-management)
- [Stateful Models](user_guide/model_execution.md#stateful-models)
- [Control Inputs](user_guide/model_execution.md#control-inputs)
- [Implicit State - Stateful Inference Using a Stateless Model](implicit_state_management.md#implicit-state-management)
- [Sequence Scheduling Strategies](user_guide/architecture.md#scheduling-strategies)
- [Direct](user_guide/architecture.md#direct)
- [Oldest](user_guide/architecture.md#oldest)
2 changes: 1 addition & 1 deletion docs/_reference/tritonclient_api.rst
@@ -28,7 +28,7 @@
Python tritonclient Package API
===============================

tritonclient python package is hosted at the `pyPI.org <https://pypi.org/project/tritonclient/>`_. This package documentation for tritonclient is genenerated by sphinx autosummary extension.
tritonclient python package is hosted at the `pyPI.org <https://pypi.org/project/tritonclient/>`_.
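
For orientation, a minimal, hypothetical usage sketch of the package follows; the server URL, model name, and tensor names are placeholders, not part of any particular model repository:

```python
# Hypothetical sketch: install with `pip install tritonclient[http]`.
# The URL, model name, and tensor names are placeholders for illustration.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print(client.is_server_ready())

# Build a single FP32 input tensor and run inference on a placeholder model.
inp = httpclient.InferInput("INPUT0", [1, 16], "FP32")
inp.set_data_from_numpy(np.zeros((1, 16), dtype=np.float32))
result = client.infer(model_name="my_model", inputs=[inp])
print(result.as_numpy("OUTPUT0"))
```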

.. autosummary::
:toctree: tritonclient
31 changes: 31 additions & 0 deletions docs/_static/switcher.json
@@ -0,0 +1,31 @@
[
{
"version": "dev",
"url": "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/"
},
{
"name": "2.49 (stable)",
"version": "2.49",
"url": "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/"
},
{
"name": "2.48",
"version": "2.48",
"url": "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/"
},
{
"name": "2.47",
"version": "2.47",
"url": "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/"
},
{
"name": "2.46",
"version": "2.46",
"url": "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/"
},
{
"name": "2.45",
"version": "2.45",
"url": "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/"
}
]
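
For context, a switcher file like this is typically referenced from the Sphinx `conf.py`. The sketch below assumes the nvidia-sphinx-theme follows the pydata-sphinx-theme conventions for its version switcher, so the theme name, option keys, and values are assumptions:

```python
# Hypothetical conf.py fragment wiring up the version switcher.
# Theme name and option keys are assumed from pydata-sphinx-theme conventions.
html_theme = "nvidia_sphinx_theme"

version_short = "2.49"  # placeholder; normally derived from the package version
html_theme_options = {
    "switcher": {
        # URL or path where switcher.json is published; placeholder value.
        "json_url": "_static/switcher.json",
        "version_match": version_short,
    },
    # Adds the version dropdown to the navigation bar.
    "navbar_end": ["version-switcher"],
}
```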
11 changes: 11 additions & 0 deletions docs/backend/vllm.rst
@@ -0,0 +1,11 @@
########
vLLM
########

.. toctree::
:hidden:
:caption: vLLM
:maxdepth: 2

vllm_backend/README
Multi-LoRA <vllm_backend/docs/llama_multi_lora_tutorial>
15 changes: 15 additions & 0 deletions docs/client_doc/api_reference.rst
@@ -0,0 +1,15 @@

#############
API Reference
#############

.. toctree::
:maxdepth: 1
:hidden:

.. Placeholder Openai Documentation

OpenAI API (BETA) <openai_README.md>
kserve


40 changes: 40 additions & 0 deletions docs/client_doc/in_process.rst
@@ -0,0 +1,40 @@
############################
In-Process Triton Server API
############################


The Triton Inference Server provides a backwards-compatible C API, along with Python and Java bindings, that
allows Triton to be linked directly into a C/C++, Python, or Java application. This API
is called the "Triton Server API" or just "Server API" for short. The
API is implemented in the Triton shared library which is built from
source contained in the `core
repository <https://github.com/triton-inference-server/core>`__. On Linux
this library is libtritonserver.so and on Windows it is
tritonserver.dll. In the Triton Docker image the shared library is
found in /opt/tritonserver/lib. The header file that defines and
documents the Server API is
`tritonserver.h <https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonserver.h>`__.
`Java bindings for In-Process Triton Server API <../customization_guide/inprocess_java_api.html#java-bindings-for-in-process-triton-server-api>`__
are built on top of `tritonserver.h` and can be used by Java applications that
need to use Triton Server in-process.

All capabilities of Triton server are encapsulated in the shared
library and are exposed via the Server API. The `tritonserver`
executable implements HTTP/REST and GRPC endpoints and uses the Server
API to communicate with core Triton logic. The primary source files
for the endpoints are `grpc_server.cc <https://github.com/triton-inference-server/server/blob/main/src/grpc/grpc_server.cc>`__ and
`http_server.cc <https://github.com/triton-inference-server/server/blob/main/src/http_server.cc>`__. In these source files you can
see the Server API being used.

You can use the Server API in your own application as well. A simple
example using the Server API can be found in
`simple.cc <https://github.com/triton-inference-server/server/blob/main/src/simple.cc>`__.
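
As an illustration of the flow described above, here is a minimal, hypothetical sketch in C; the model repository path is a placeholder and error handling is abbreviated, so consult `tritonserver.h` for the authoritative signatures:

```c
// Minimal sketch: create server options, start Triton in-process, then shut down.
// "/models" is a placeholder repository path; most error checks are omitted.
// Compile and link against libtritonserver.so.
#include <stdbool.h>
#include <stdio.h>
#include "triton/core/tritonserver.h"

int main() {
  TRITONSERVER_ServerOptions* options = NULL;
  TRITONSERVER_ServerOptionsNew(&options);
  TRITONSERVER_ServerOptionsSetModelRepositoryPath(options, "/models");

  TRITONSERVER_Server* server = NULL;
  TRITONSERVER_Error* err = TRITONSERVER_ServerNew(&server, options);
  TRITONSERVER_ServerOptionsDelete(options);
  if (err != NULL) {
    fprintf(stderr, "failed to start: %s\n", TRITONSERVER_ErrorMessage(err));
    TRITONSERVER_ErrorDelete(err);
    return 1;
  }

  bool ready = false;
  TRITONSERVER_ServerIsReady(server, &ready);
  printf("server ready: %s\n", ready ? "yes" : "no");

  // Inference requests would be created and executed here via the Server API.
  TRITONSERVER_ServerStop(server);
  TRITONSERVER_ServerDelete(server);
  return 0;
}
```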

.. toctree::
:maxdepth: 1
:hidden:

C/C++ <../customization_guide/inprocess_c_api.md>
python
Java <../customization_guide/inprocess_java_api.md>

15 changes: 15 additions & 0 deletions docs/client_doc/kserve.rst
@@ -0,0 +1,15 @@
##########
KServe API
##########


Triton uses the
`KServe community standard inference protocols <https://github.com/kserve/kserve/tree/master/docs/predict-api/v2>`__
to define HTTP/REST and GRPC APIs plus several extensions.
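
To make the protocols concrete, a hedged example against the KServe v2 HTTP endpoints follows; the model name and tensor metadata are placeholders and must match the target model's configuration:

```bash
# KServe v2 HTTP protocol: health check, then an inference request.
# "my_model", tensor names, shapes, and datatypes are placeholders.
curl -s http://localhost:8000/v2/health/ready

curl -s -X POST http://localhost:8000/v2/models/my_model/infer \
  -H 'Content-Type: application/json' \
  -d '{
        "inputs": [
          {"name": "INPUT0", "shape": [1, 4], "datatype": "FP32", "data": [1.0, 2.0, 3.0, 4.0]}
        ]
      }'
```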

.. toctree::
:maxdepth: 1
:hidden:

HTTP/REST and GRPC Protocol <../customization_guide/inference_protocols.md>
kserve_extension
24 changes: 24 additions & 0 deletions docs/client_doc/kserve_extension.rst
@@ -0,0 +1,24 @@
##########
Extensions
##########

To fully enable all capabilities,
Triton also implements `HTTP/REST and GRPC
extensions <https://github.com/triton-inference-server/server/tree/main/docs/protocol>`__
to the KServe inference protocol.

.. toctree::
:maxdepth: 1
:hidden:

Binary tensor data extension <../protocol/extension_binary_data.md>
Classification extension <../protocol/extension_classification.md>
Schedule policy extension <../protocol/extension_schedule_policy.md>
Sequence extension <../protocol/extension_sequence.md>
Shared-memory extension <../protocol/extension_shared_memory.md>
Model configuration extension <../protocol/extension_model_configuration.md>
Model repository extension <../protocol/extension_model_repository.md>
Statistics extension <../protocol/extension_statistics.md>
Trace extension <../protocol/extension_trace.md>
Logging extension <../protocol/extension_logging.md>
Parameters extension <../protocol/extension_parameters.md>
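
As one concrete example, the model repository extension adds endpoints for indexing the repository and loading or unloading models; the sketch below assumes the server was started with `--model-control-mode=explicit` and uses a placeholder model name:

```bash
# Model repository extension: list the repository, then load/unload a model.
# "my_model" is a placeholder; explicit model control mode is assumed.
curl -s -X POST http://localhost:8000/v2/repository/index

curl -s -X POST http://localhost:8000/v2/repository/models/my_model/load
curl -s -X POST http://localhost:8000/v2/repository/models/my_model/unload
```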
148 changes: 148 additions & 0 deletions docs/client_doc/openai_README.md
@@ -0,0 +1,148 @@
# OpenAI-Compatible Frontend for Triton Inference Server

## Pre-requisites

1. Docker + NVIDIA Container Runtime
2. A correctly configured `HF_TOKEN` for access to HuggingFace models.
- The current examples and testing primarily use the
[`meta-llama/Meta-Llama-3.1-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)
model, but you can manually bring your own models and adjust accordingly.

## VLLM

1. Build and launch the container:
- Mounts the `~/.cache/huggingface` directory for re-use of downloaded models across runs, containers, etc.
- Sets the [`HF_TOKEN`](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hftoken) environment variable to
access gated models; make sure this is set in your local environment if needed.

```bash
docker build -t tritonserver-openai-vllm -f docker/Dockerfile.vllm .

docker run -it --net=host --gpus all --rm \
-v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
-e HF_TOKEN \
tritonserver-openai-vllm
```

2. Launch the OpenAI-compatible Triton Inference Server:
```bash
cd openai/

# NOTE: Adjust the --tokenizer based on the model being used
python3 openai_frontend/main.py --model-repository tests/vllm_models/ --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct
```

3. Send a `/v1/chat/completions` request:
- Note the use of `jq` is optional, but provides a nicely formatted output for JSON responses.
```bash
MODEL="llama-3.1-8b-instruct"
curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "'${MODEL}'",
"messages": [{"role": "user", "content": "Say this is a test!"}]
}' | jq
```

4. Send a `/v1/completions` request:
- Note the use of `jq` is optional, but provides a nicely formatted output for JSON responses.
```bash
MODEL="llama-3.1-8b-instruct"
curl -s http://localhost:8000/v1/completions -H 'Content-Type: application/json' -d '{
"model": "'${MODEL}'",
"prompt": "Machine learning is"
}' | jq
```

5. Benchmark with `genai-perf`:
```bash
MODEL="llama-3.1-8b-instruct"
TOKENIZER="meta-llama/Meta-Llama-3.1-8B-Instruct"
genai-perf \
--model ${MODEL} \
--tokenizer ${TOKENIZER} \
--service-kind openai \
--endpoint-type chat \
--synthetic-input-tokens-mean 256 \
--synthetic-input-tokens-stddev 0 \
--output-tokens-mean 256 \
--output-tokens-stddev 0 \
--streaming
```

6. Use the OpenAI python client directly:
```python
from openai import OpenAI

client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="EMPTY",
)

model = "llama-3.1-8b-instruct"
completion = client.chat.completions.create(
model=model,
messages=[
{
"role": "system",
"content": "You are a helpful assistant.",
},
{"role": "user", "content": "What are LLMs?"},
],
max_tokens=256,
)

print(completion.choices[0].message.content)
```

7. Run tests (NOTE: The server should not be running, the tests will handle starting/stopping the server as necessary):
```bash
pytest -v tests/
```

8. For a list of examples, see the `examples/` folder.

## TensorRT-LLM

**NOTE**: The workflow for preparing TRT-LLM engines, a model repository, etc. for
loading and testing is not yet fleshed out in this README. You can try using the Triton CLI
or follow the existing TRT-LLM backend examples to prepare a model repository, then point
at that model repository when following the examples.

0. Prepare your model repository for a TensorRT-LLM model, build the engine, etc.
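- One possible path is the Triton CLI; the sketch below is illustrative only, and the install step, flag names, and model tag are assumptions to verify against the Triton CLI documentation.
```bash
# Hypothetical sketch: use the Triton CLI to generate a TRT-LLM model repository.
# Flag names and the model tag are assumptions; check the Triton CLI docs.
pip install git+https://github.com/triton-inference-server/triton_cli.git
triton import -m llama-3.1-8b-instruct --backend tensorrtllm
```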

1. Build and launch the container:
- Mounts the openai source files to `/workspace` for simplicity; later on these will be shipped in the container.
- Mounts the `~/.cache/huggingface` directory for re-use of downloaded models across runs, containers, etc.
- Sets the [`HF_TOKEN`](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hftoken) environment variable to
access gated models; make sure this is set in your local environment if needed.

```bash
docker build -t tritonserver-openai-tensorrtllm -f docker/Dockerfile.tensorrtllm ./docker

docker run -it --net=host --gpus all --rm \
-v ${PWD}:/workspace \
-v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
-e HF_TOKEN \
-w /workspace \
tritonserver-openai-tensorrtllm
```

2. Launch the OpenAI server:
```bash
cd openai/

# NOTE: Adjust the --tokenizer based on the model being used
python3 openai_frontend/main.py --model-repository tests/tensorrtllm_models/ --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct
```

3. Send a `/v1/chat/completions` request:
- Note the use of `jq` is optional, but provides a nicely formatted output for JSON responses.
```bash
MODEL="tensorrt_llm_bls"
curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "'${MODEL}'",
"messages": [{"role": "user", "content": "Say this is a test!"}]
}' | jq
```

The other examples are the same as for vLLM, except that you should set `MODEL="tensorrt_llm_bls"`
everywhere applicable, as shown in the example request above.
1 change: 1 addition & 0 deletions docs/client_doc/placeholder.rst
@@ -0,0 +1 @@

12 changes: 12 additions & 0 deletions docs/client_doc/python.rst
@@ -0,0 +1,12 @@
######
Python
######

.. include:: python_readme.rst

.. toctree::
:maxdepth: 1
:hidden:

Kafka I/O <../tutorials/Triton_Inference_Server_Python_API/examples/kafka-io/README.md>
Rayserve <../tutorials/Triton_Inference_Server_Python_API/examples/rayserve/README.md>