* [Writing your own engine in Python](#writing-your-own-engine-in-python)
* [Batch mode](#batch-mode)
...
@@ -437,10 +437,13 @@ Startup can be slow so you may want to `export DYN_LOG=debug` to see progress.
Shutdown: `ray stop`
#### TensorRT-LLM engine
#### trtllm
To run a TRT-LLM model with dynamo-run, we have included a Python-based [async engine](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/engines/agg_engine.py).
This engine uses [TensorRT-LLM's LLM API](https://nvidia.github.io/TensorRT-LLM/llm-api/), a high-level Python API.
To configure the TensorRT-LLM async engine, see [llm_api_config.yaml](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/configs/llm_api_config.yaml). The file defines the options that are passed to the LLM engine. Follow the steps below to serve trtllm with dynamo-run.
You can use `--extra-engine-args` to pass extra arguments to the LLM API engine.
The trtllm engine requires [etcd](https://etcd.io/) and [nats](https://nats.io/) with JetStream (`nats-server -js`) to be running.
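
Before starting the engine, it can help to confirm that both services are reachable. The following is a minimal sketch, not part of dynamo: it assumes etcd and NATS run locally on their default client ports (2379 and 4222), and the helper names `is_listening` and `check_services` are hypothetical. It only tests that a TCP connection succeeds, so it does not verify that JetStream is enabled.

```python
import socket

# Assumption: local services on their default client ports.
# Adjust the hosts and ports to match your deployment.
SERVICES = {"etcd": ("localhost", 2379), "nats": ("localhost", 4222)}


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


def check_services(services=SERVICES) -> dict:
    """Map each service name to whether its port accepts connections."""
    return {
        name: is_listening(host, port)
        for name, (host, port) in services.items()
    }


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name}: {'reachable' if up else 'NOT reachable'}")
```

If either service is reported as not reachable, start it (for NATS, with `nats-server -js`) before launching the engine.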
##### Step 1: Build the environment
...
@@ -454,7 +457,7 @@ See instructions [here](https://github.com/ai-dynamo/dynamo/blob/main/examples/t
Execute the following to load the TensorRT-LLM model specified in the configuration.
"--kv-block-size", type=int, default=32, help="Size of a KV cache block."
)
parser.add_argument(
"--context-length",
type=int,
default=None,
help="This argument is not used by TRTLLM. Please provide max_input_len, max_seq_len and max_output_len in yaml file and point --extra-engine-args to the yaml file.",
)
parser.add_argument(
"--extra-engine-args",
type=str,
...
@@ -241,6 +253,12 @@ def cmd_line_args():
)
args = parser.parse_args()
if args.context_length is not None:
    warnings.warn(
        "--context-length is accepted for compatibility but will be ignored for TensorRT-LLM. Please provide max_input_len, max_seq_len and max_output_len in yaml file and point --extra-engine-args to the yaml file.",