Commit 8964d0d5 authored by Neal Vaidya, committed by GitHub

docs: add GPT-OSS deployment guide (#2297)


Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Co-authored-by: jthomson04 <jwillthomson19@gmail.com>
Co-authored-by: John Thomson (DLAlgo) <jothomson@nvidia.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>
parent 7c8f8fdc
@@ -27,6 +27,10 @@ limitations under the License.
 High-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.
+## Latest News
+* [08/05] Deploy `openai/gpt-oss-120b` with disaggregated serving on NVIDIA Blackwell GPUs using Dynamo [➡️ link](./components/backends/trtllm/gpt-oss.md)
 ## The Era of Multi-GPU, Multi-Node
 <p align="center">
......
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
enable_attention_dp: true
disable_overlap_scheduler: false
moe_config:
  backend: CUTLASS
cuda_graph_config:
  max_batch_size: 128
  enable_padding: true
cache_transceiver_config:
  backend: ucx
  max_tokens_in_buffer: 65536
print_iter_log: false
stream_interval: 10
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
enable_attention_dp: false
disable_overlap_scheduler: true
moe_config:
  backend: CUTLASS
enable_chunked_prefill: true
cuda_graph_config:
  max_batch_size: 32
  enable_padding: true
cache_transceiver_config:
  backend: ucx
  max_tokens_in_buffer: 65536
print_iter_log: false
stream_interval: 10
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Running gpt-oss-120b Disaggregated with TensorRT-LLM
Dynamo supports disaggregated serving of gpt-oss-120b with TensorRT-LLM. This guide demonstrates how to deploy gpt-oss-120b using disaggregated prefill/decode serving on a single B200 node with 8 GPUs, running 1 prefill worker on 4 GPUs and 1 decode worker on 4 GPUs.
## Overview
This deployment uses disaggregated serving in TensorRT-LLM where:
- **Prefill Worker**: Processes input prompts efficiently using 4 GPUs with tensor parallelism
- **Decode Worker**: Generates output tokens using 4 GPUs, optimized for token generation throughput
- **Frontend**: Provides an OpenAI-compatible API endpoint with round-robin routing
The disaggregated approach optimizes for both low-latency (maximizing tokens per second per user) and high-throughput (maximizing total tokens per GPU per second) use cases by separating the compute-intensive prefill phase from the memory-bound decode phase.
## Prerequisites
- 1x NVIDIA B200 node with 8 GPUs (this guide focuses on single-node B200 deployment)
- CUDA Toolkit 12.8 or later
- Docker with [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) installed
- Fast SSD storage for model weights (~240GB required)
- HuggingFace account and [access token](https://huggingface.co/settings/tokens)
- [HuggingFace CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli)
Start the `etcd` and `nats` services with the following command. It runs in the foreground, so use a separate terminal or append `-d` to run the services in the background:
```bash
docker compose -f deploy/docker-compose.yml up
```
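Before continuing, it can help to confirm that both services accept connections. The following is a minimal sketch, assuming the services listen on their upstream default client ports on localhost (2379 for etcd, 4222 for NATS); these port numbers are assumptions about the compose file, and `port_open` is a hypothetical helper, not part of Dynamo.

```bash
#!/usr/bin/env bash
# Sketch: check that etcd and NATS accept TCP connections.
# Ports 2379 (etcd) and 4222 (NATS) are upstream defaults and are assumed here.

# port_open HOST PORT -> exit status 0 if a TCP connection succeeds within 2 seconds
port_open() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

for svc in etcd:2379 nats:4222; do
  name=${svc%%:*}   # text before the colon
  port=${svc##*:}   # text after the colon
  if port_open localhost "$port"; then
    echo "$name is reachable on port $port"
  else
    echo "$name is NOT reachable on port $port"
  fi
done
```

If either service is reported as unreachable, check the compose output for startup errors before launching the workers.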
## Instructions
### 1. Build the Container
**For ARM64 (GB200):**
```bash
# Navigate to the Dynamo repository root
cd $DYNAMO_ROOT
export DYNAMO_CONTAINER_IMAGE=dynamo-gpt-oss-arm64
# Build the container on top of the prebuilt TensorRT-LLM base image (tag: gpt-oss-dev)
docker build --platform linux/arm64 -f container/Dockerfile.tensorrt_llm_prebuilt . \
--build-arg BASE_IMAGE=nvcr.io/nvidia/tensorrt-llm/release \
--build-arg BASE_IMAGE_TAG=gpt-oss-dev \
--build-arg ARCH=arm64 \
--build-arg ARCH_ALT=aarch64 \
-t $DYNAMO_CONTAINER_IMAGE
```
**For x86_64:**
```bash
# Navigate to the Dynamo repository root
cd $DYNAMO_ROOT
export DYNAMO_CONTAINER_IMAGE=dynamo-gpt-oss-amd64
docker build -f container/Dockerfile.tensorrt_llm_prebuilt . \
--build-arg BASE_IMAGE=nvcr.io/nvidia/tensorrt-llm/release \
--build-arg BASE_IMAGE_TAG=gpt-oss-dev \
-t $DYNAMO_CONTAINER_IMAGE
```
### 2. Download the Model
```bash
export MODEL_PATH=<LOCAL_MODEL_DIRECTORY>
huggingface-cli download openai/gpt-oss-120b --include "original/*" --local-dir $MODEL_PATH
```
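Because the prerequisites call for roughly 240GB of storage, a quick free-space check before downloading can save a failed transfer. This is a sketch only: `free_gb` is a hypothetical helper, the threshold simply mirrors the figure quoted in the prerequisites, and the check falls back to the current directory when `MODEL_PATH` is unset.

```bash
#!/usr/bin/env bash
# Sketch: confirm the target filesystem has room for the weights before downloading.
# REQUIRED_GB mirrors the ~240GB figure from the prerequisites.
REQUIRED_GB=240

# free_gb DIR -> prints the free space of DIR's filesystem in whole GiB
free_gb() {
  # -P forces one line per filesystem; column 4 is the available space in KiB
  df -Pk "$1" | awk 'NR==2 { print int($4 / 1024 / 1024) }'
}

target=${MODEL_PATH:-.}
available=$(free_gb "$target")
if [ "$available" -ge "$REQUIRED_GB" ]; then
  echo "OK: ${available} GiB free in $target"
else
  echo "WARNING: only ${available} GiB free in $target (about ${REQUIRED_GB} GB suggested)"
fi
```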
### 3. Run the Container
Launch the Dynamo TensorRT-LLM container with the necessary configurations:
```bash
docker run \
--gpus all \
-it \
--rm \
--network host \
--volume $MODEL_PATH:/model \
--volume $PWD:/workspace/dynamo \
--shm-size=10G \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--ulimit nofile=65536:65536 \
--cap-add CAP_SYS_PTRACE \
--ipc host \
-e HF_TOKEN=$HF_TOKEN \
-e TRTLLM_ENABLE_PDL=1 \
-e TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=True \
$DYNAMO_CONTAINER_IMAGE
```
This command:
- Automatically removes the container when stopped (`--rm`)
- Allows the container to use the host's IPC resources for optimal performance (`--ipc host`)
- Runs the container in interactive mode (`-it`)
- Sets up shared memory and stack limits for optimal performance
- Mounts your model directory into the container at `/model`
- Mounts the current Dynamo workspace into the container at `/workspace/dynamo`
- Enables [PDL](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programmatic-dependent-launch-and-synchronization) and disables parallel weight loading
- Passes the HuggingFace token from the host's `HF_TOKEN` variable into the container
### 4. Understanding the Configuration
The deployment uses configuration files and command-line arguments to control behavior:
#### Configuration Files
**Prefill Configuration (`engine_configs/gpt_oss/prefill.yaml`)**:
- `enable_attention_dp: false` - Attention data parallelism disabled for prefill
- `enable_chunked_prefill: true` - Enables efficient chunked prefill processing
- `moe_config.backend: CUTLASS` - Uses optimized CUTLASS kernels for MoE layers
- `cache_transceiver_config.backend: ucx` - Uses UCX for efficient KV cache transfer
- `cuda_graph_config.max_batch_size: 32` - Maximum batch size for CUDA graphs
**Decode Configuration (`engine_configs/gpt_oss/decode.yaml`)**:
- `enable_attention_dp: true` - Attention data parallelism enabled for decode
- `disable_overlap_scheduler: false` - Enables overlapping for decode efficiency
- `moe_config.backend: CUTLASS` - Uses optimized CUTLASS kernels for MoE layers
- `cache_transceiver_config.backend: ucx` - Uses UCX for efficient KV cache transfer
- `cuda_graph_config.max_batch_size: 128` - Maximum batch size for CUDA graphs
#### Command-Line Arguments
Both workers receive these key arguments:
- `--tensor-parallel-size 4` - Uses 4 GPUs for tensor parallelism
- `--expert-parallel-size 4` - Expert parallelism across 4 GPUs
- `--free-gpu-memory-fraction 0.9` - Allocates 90% of GPU memory
Prefill-specific arguments:
- `--max-num-tokens 20000` - Maximum tokens for prefill processing
- `--max-batch-size 32` - Maximum batch size for prefill
Decode-specific arguments:
- `--max-num-tokens 16384` - Maximum tokens for decode processing
- `--max-batch-size 128` - Maximum batch size for decode
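
With a tensor-parallel size of 4 per worker, the two workers together occupy all 8 GPUs on the node. The sketch below shows how the `CUDA_VISIBLE_DEVICES` lists used later in this guide follow from that size; `gpu_list` is a hypothetical helper written for illustration and is not part of Dynamo.

```bash
#!/usr/bin/env bash
# Sketch: derive the CUDA_VISIBLE_DEVICES list for each worker from the TP size.

# gpu_list FIRST COUNT -> prints COUNT consecutive GPU indices starting at FIRST, comma-separated
gpu_list() {
  seq "$1" "$(( $1 + $2 - 1 ))" | paste -sd, -
}

TP_SIZE=4
echo "prefill: CUDA_VISIBLE_DEVICES=$(gpu_list 0 "$TP_SIZE")"
echo "decode:  CUDA_VISIBLE_DEVICES=$(gpu_list "$TP_SIZE" "$TP_SIZE")"
```

Changing `TP_SIZE` shifts both lists together, which keeps the prefill and decode workers on disjoint GPUs.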
### 5. Launch the Deployment
You can use the provided launch script or run the components manually:
#### Option A: Using the Launch Script
```bash
cd /workspace/dynamo/components/backends/trtllm
./launch/gpt_oss_disagg.sh
```
#### Option B: Manual Launch
1. **Clear namespace and start frontend**:
```bash
cd /workspace/dynamo/components/backends/trtllm
# Clear any existing deployments
python3 utils/clear_namespace.py --namespace dynamo
# Start frontend with round-robin routing
python3 -m dynamo.frontend --router-mode round-robin --http-port 8000 &
```
2. **Launch prefill worker**:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m dynamo.trtllm \
--model-path /model \
--served-model-name gpt-oss-120b \
--extra-engine-args engine_configs/gpt_oss/prefill.yaml \
--disaggregation-mode prefill \
--disaggregation-strategy prefill_first \
--max-num-tokens 20000 \
--max-batch-size 32 \
--free-gpu-memory-fraction 0.9 \
--tensor-parallel-size 4 \
--expert-parallel-size 4 &
```
3. **Launch decode worker**:
```bash
CUDA_VISIBLE_DEVICES=4,5,6,7 python3 -m dynamo.trtllm \
--model-path /model \
--served-model-name gpt-oss-120b \
--extra-engine-args engine_configs/gpt_oss/decode.yaml \
--disaggregation-mode decode \
--disaggregation-strategy prefill_first \
--max-num-tokens 16384 \
--max-batch-size 128 \
--free-gpu-memory-fraction 0.9 \
--tensor-parallel-size 4 \
--expert-parallel-size 4
```
### 6. Test the Deployment
Send a test request to verify the deployment:
```bash
curl -X POST http://localhost:8000/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-oss-120b",
"input": "Explain the concept of disaggregated serving in LLM inference in 3 sentences.",
"max_output_tokens": 200,
"stream": false
}'
```
The server exposes a standard OpenAI-compatible API endpoint that accepts JSON requests. You can adjust parameters such as `max_output_tokens` and `temperature` according to your needs.
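
If the request above is refused, the workers may still be loading weights. One way to handle this is to poll the frontend until the model is listed. The sketch below assumes the frontend serves `GET /v1/models` and includes the served model name in the response body; both are assumptions to verify against your build, and `wait_for_model` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: poll the frontend until the model is listed, then send requests.
# Assumes GET /v1/models returns JSON containing the served model name.

# wait_for_model URL MODEL TIMEOUT_SECONDS -> 0 once MODEL appears in the response, 1 on timeout
wait_for_model() {
  local url=$1 model=$2 deadline=$(( SECONDS + $3 ))
  while [ "$SECONDS" -lt "$deadline" ]; do
    # -s silences progress output, -f makes HTTP errors return a non-zero status
    if curl -sf --max-time 5 "$url" | grep -q "\"$model\""; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Example: wait up to 30 minutes for the workers to register.
# wait_for_model http://localhost:8000/v1/models gpt-oss-120b 1800 && echo "ready"
```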
## Benchmarking
### Performance Testing with GenAI-Perf
The Dynamo container includes [GenAI-Perf](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/genai-perf/README.html), NVIDIA's tool for benchmarking generative AI models. This tool helps measure throughput, latency, and other performance metrics for your deployment.
**Run the following benchmark from inside the container** (after completing the deployment steps above):
```bash
# Create a directory for benchmark results
mkdir -p /tmp/benchmark-results
# Run the benchmark: this command tests the deployment with a high-concurrency synthetic workload
genai-perf profile \
--model gpt-oss-120b \
--tokenizer /model \
--endpoint-type chat \
--endpoint /v1/chat/completions \
--streaming \
--url localhost:8000 \
--synthetic-input-tokens-mean 32000 \
--synthetic-input-tokens-stddev 0 \
--output-tokens-mean 256 \
--output-tokens-stddev 0 \
--extra-inputs max_tokens:256 \
--extra-inputs min_tokens:256 \
--extra-inputs ignore_eos:true \
--extra-inputs "{\"nvext\":{\"ignore_eos\":true}}" \
--concurrency 256 \
--request-count 6144 \
--warmup-request-count 1000 \
--num-dataset-entries 8000 \
--random-seed 100 \
--artifact-dir /tmp/benchmark-results \
-- \
-v \
--max-threads 500 \
-H 'Authorization: Bearer NOT USED' \
-H 'Accept: text/event-stream'
```
### What This Benchmark Does
This command:
- **Tests chat completions** with streaming responses against the disaggregated deployment
- **Simulates high load** with 256 concurrent requests and 6144 total requests
- **Uses long context inputs** (32K tokens) to test prefill performance
- **Generates consistent outputs** (256 tokens) to measure decode throughput
- **Includes warmup period** (1000 requests) to stabilize performance metrics
- **Saves detailed results** to `/tmp/benchmark-results` for analysis
Key parameters you can adjust:
- `--concurrency`: Number of simultaneous requests (impacts GPU utilization)
- `--synthetic-input-tokens-mean`: Average input length (tests prefill capacity)
- `--output-tokens-mean`: Average output length (tests decode throughput)
- `--request-count`: Total number of requests for the benchmark
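
To get a feel for the size of this workload before changing any parameter, the totals can be worked out directly from the flag values above. This is a rough sketch: it ignores the warmup requests, and "rounds" is only an approximation, since the benchmark replaces each finished request immediately rather than waiting for a full batch.

```bash
#!/usr/bin/env bash
# Sketch: back-of-the-envelope totals for the benchmark settings above.
CONCURRENCY=256      # --concurrency
REQUESTS=6144        # --request-count
INPUT_TOKENS=32000   # --synthetic-input-tokens-mean
OUTPUT_TOKENS=256    # --output-tokens-mean

rounds=$(( REQUESTS / CONCURRENCY ))          # equivalent full rounds of concurrent requests
total_input=$(( REQUESTS * INPUT_TOKENS ))    # tokens the prefill worker must process
total_output=$(( REQUESTS * OUTPUT_TOKENS ))  # tokens the decode worker must generate

echo "rounds of $CONCURRENCY requests: $rounds"
echo "total input tokens:  $total_input"
echo "total output tokens: $total_output"
```

With the values above, the measured phase covers 24 rounds, 196608000 input tokens, and 1572864 output tokens, so prefill work dominates this benchmark.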
### Installing GenAI-Perf Outside the Container
If you prefer to run benchmarks from outside the container:
```bash
# Install GenAI-Perf
pip install genai-perf
# Then run the same benchmark command, adjusting the tokenizer path if needed
```
## Architecture Overview
The disaggregated architecture separates prefill and decode phases:
```mermaid
flowchart TD
Client["Users/Clients<br/>(HTTP)"] --> Frontend["Frontend<br/>Round-Robin Router"]
Frontend --> Prefill["Prefill Worker<br/>(GPUs 0-3)"]
Frontend --> Decode["Decode Worker<br/>(GPUs 4-7)"]
Prefill -.->|KV Cache Transfer<br/>via UCX| Decode
```
## Key Features
1. **Disaggregated Serving**: Separates compute-intensive prefill from memory-bound decode operations
2. **Optimized Resource Usage**: Different parallelism strategies for prefill vs decode
3. **Scalable Architecture**: Easy to adjust worker counts based on workload
4. **TensorRT-LLM Optimizations**: Leverages TensorRT-LLM's efficient kernels and memory management
## Troubleshooting
### Common Issues
1. **CUDA Out-of-Memory Errors**
- Reduce `--max-num-tokens` in the launch commands (currently 20000 for prefill, 16384 for decode)
- Lower `--free-gpu-memory-fraction` from 0.9 to 0.8 or 0.7
- Ensure model checkpoints are compatible with the expected format
2. **Workers Not Connecting**
- Ensure etcd and NATS services are running: `docker ps | grep -E "(etcd|nats)"`
- Check network connectivity between containers
- Verify that the `CUDA_VISIBLE_DEVICES` settings match your GPU configuration
- Check that no other processes are using the assigned GPUs
3. **Performance Issues**
- Monitor GPU utilization with `nvidia-smi` while the deployment is running
- Check worker logs for bottlenecks or errors
- Ensure that batch sizes in manual commands match those in configuration files
- Adjust chunked prefill settings based on your workload
- For connection issues, ensure port 8000 is not being used by another application
4. **Container Startup Issues**
- Verify that the NVIDIA Container Toolkit is properly installed
- Check that the Docker daemon is running with GPU support
- Ensure sufficient disk space for model weights and container images
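
The batch-size check mentioned under performance issues can be scripted. The sketch below compares `cuda_graph_config.max_batch_size` in an engine config with the value passed via `--max-batch-size`. Both functions are hypothetical helpers, and the parser relies on the simple two-level, indented YAML layout of the engine configs in this guide rather than on a general YAML parser.

```bash
#!/usr/bin/env bash
# Sketch: compare cuda_graph_config.max_batch_size with the CLI --max-batch-size value.

# graph_batch_size FILE -> prints max_batch_size found under cuda_graph_config
graph_batch_size() {
  awk '
    /^cuda_graph_config:/ { in_block = 1; next }   # entering the block
    /^[^[:space:]#]/      { in_block = 0 }         # any other top-level key ends it
    in_block && $1 == "max_batch_size:" { print $2; exit }
  ' "$1"
}

# check_batch_size FILE CLI_VALUE -> 0 if the values match, 1 otherwise
check_batch_size() {
  local file=$1 cli_value=$2 yaml_value
  yaml_value=$(graph_batch_size "$file")
  if [ "$yaml_value" = "$cli_value" ]; then
    echo "match: $file uses $yaml_value"
  else
    echo "mismatch: $file has '$yaml_value', CLI passes $cli_value"
    return 1
  fi
}

# Example (run from components/backends/trtllm):
# check_batch_size engine_configs/gpt_oss/prefill.yaml 32
# check_batch_size engine_configs/gpt_oss/decode.yaml 128
```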
## Next Steps
- **Production Deployment**: For multi-node deployments, see the [Multi-node Guide](../../examples/basics/multinode/README.md)
- **Advanced Configuration**: Explore TensorRT-LLM engine building options for further optimization
- **Monitoring**: Set up Prometheus and Grafana for production monitoring
- **Performance Benchmarking**: Use GenAI-Perf to measure and optimize your deployment performance
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# Environment variables with defaults
export MODEL_PATH=${MODEL_PATH:-"/model"}
export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"gpt-oss-120b"}
export DISAGGREGATION_STRATEGY=${DISAGGREGATION_STRATEGY:-"prefill_first"}
export PREFILL_ENGINE_ARGS=${PREFILL_ENGINE_ARGS:-"engine_configs/gpt_oss/prefill.yaml"}
export DECODE_ENGINE_ARGS=${DECODE_ENGINE_ARGS:-"engine_configs/gpt_oss/decode.yaml"}
set -e
trap 'echo Cleaning up...; kill 0' EXIT
# run clear_namespace
python3 utils/clear_namespace.py --namespace dynamo
# run frontend
python3 -m dynamo.frontend --router-mode round-robin --http-port 8000 &
# With tensor_parallel_size=4, each worker needs 4 GPUs
# run prefill worker
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m dynamo.trtllm \
--model-path "$MODEL_PATH" \
--served-model-name "$SERVED_MODEL_NAME" \
--extra-engine-args "$PREFILL_ENGINE_ARGS" \
--disaggregation-mode prefill \
--disaggregation-strategy "$DISAGGREGATION_STRATEGY" \
--max-num-tokens 20000 \
--max-batch-size 32 \
--free-gpu-memory-fraction 0.9 \
--tensor-parallel-size 4 \
--expert-parallel-size 4 &
# run decode worker
CUDA_VISIBLE_DEVICES=4,5,6,7 python3 -m dynamo.trtllm \
--model-path "$MODEL_PATH" \
--served-model-name "$SERVED_MODEL_NAME" \
--extra-engine-args "$DECODE_ENGINE_ARGS" \
--disaggregation-mode decode \
--disaggregation-strategy "$DISAGGREGATION_STRATEGY" \
--max-num-tokens 16384 \
--max-batch-size 128 \
--free-gpu-memory-fraction 0.9 \
--tensor-parallel-size 4 \
--expert-parallel-size 4
ARG BASE_IMAGE
ARG BASE_IMAGE_TAG
ARG ARCH=amd64
ARG ARCH_ALT=x86_64
FROM ${BASE_IMAGE}:${BASE_IMAGE_TAG}
ARG ARCH
ARG ARCH_ALT
WORKDIR /workspace
COPY . /workspace
# etcd
ENV ETCD_VERSION="v3.5.21"
RUN wget https://github.com/etcd-io/etcd/releases/download/$ETCD_VERSION/etcd-$ETCD_VERSION-linux-${ARCH}.tar.gz -O /tmp/etcd.tar.gz && \
mkdir -p /usr/local/bin/etcd && \
tar -xvf /tmp/etcd.tar.gz -C /usr/local/bin/etcd --strip-components=1 && \
rm /tmp/etcd.tar.gz
ENV PATH=/usr/local/bin/etcd/:$PATH
# nats
RUN wget --tries=3 --waitretry=5 https://github.com/nats-io/nats-server/releases/download/v2.10.28/nats-server-v2.10.28-${ARCH}.deb && \
dpkg -i nats-server-v2.10.28-${ARCH}.deb && rm nats-server-v2.10.28-${ARCH}.deb
RUN pip install -r ./container/deps/requirements.txt
# Rust build/dev dependencies
RUN apt-get update && \
apt-get install --no-install-recommends -y \
gdb \
protobuf-compiler \
cmake \
libssl-dev \
pkg-config \
libclang-dev
ARG RUSTARCH=${ARCH_ALT}-unknown-linux-gnu
ENV RUSTUP_HOME=/usr/local/rustup \
CARGO_HOME=/usr/local/cargo \
PATH=/usr/local/cargo/bin:$PATH \
RUST_VERSION=1.87.0
# Install Rust using RUSTARCH derived from ARCH_ALT
RUN wget --tries=3 --waitretry=5 "https://static.rust-lang.org/rustup/archive/1.28.1/${RUSTARCH}/rustup-init" && \
# TODO: Add SHA check back based on RUSTARCH
chmod +x rustup-init && \
./rustup-init -y --no-modify-path --profile default --default-toolchain $RUST_VERSION --default-host ${RUSTARCH} && \
rm rustup-init && \
chmod -R a+w $RUSTUP_HOME $CARGO_HOME
RUN cargo build \
--release \
--locked \
--workspace
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
# Build dynamo wheels
RUN uv build --wheel --out-dir /workspace/dist && \
cd /workspace/lib/bindings/python && \
uv build --wheel --out-dir /workspace/dist --python 3.12
RUN mkdir -p /opt/dynamo/bindings/wheels && \
mkdir /opt/dynamo/bindings/lib && \
cp dist/ai_dynamo*cp312*.whl /opt/dynamo/bindings/wheels/
RUN pip install /workspace/dist/ai_dynamo_runtime*cp312*.whl && pip install /workspace/dist/ai_dynamo*any.whl
# Copy files for legal compliance
COPY ATTRIBUTION* LICENSE /workspace/
ENTRYPOINT ["/opt/nvidia/nvidia_entrypoint.sh"]
@@ -91,6 +91,7 @@ TENSORRTLLM_PIP_WHEEL_DIR="/tmp/trtllm_wheel/"
 DEFAULT_EXPERIMENTAL_TRTLLM_COMMIT="69e9f6d48944b2ae0124ff57aa59340aa4dfae15"
 TRTLLM_COMMIT=""
 TRTLLM_USE_NIXL_KVCACHE_EXPERIMENTAL="0"
+TRTLLM_GIT_URL=""
 # TensorRT-LLM PyPI index URL
 TENSORRTLLM_INDEX_URL="https://pypi.python.org/simple"
@@ -187,6 +188,14 @@ get_options() {
 missing_requirement "$1"
 fi
 ;;
+--tensorrtllm-git-url)
+if [ "$2" ]; then
+TRTLLM_GIT_URL=$2
+shift
+else
+missing_requirement "$1"
+fi
+;;
 --base-image)
 if [ "$2" ]; then
 BASE_IMAGE=$2
@@ -360,6 +369,7 @@ show_help() {
 echo " [--use-default-experimental-tensorrtllm-commit] Use the default experimental commit (${DEFAULT_EXPERIMENTAL_TRTLLM_COMMIT}) to build TensorRT-LLM. This is a flag (no argument). Do not combine with --tensorrtllm-commit or --tensorrtllm-pip-wheel."
 echo " [--tensorrtllm-pip-wheel tensorrtllm pip wheel on artifactory]"
 echo " [--tensorrtllm-index-url tensorrtllm PyPI index URL if providing the wheel from artifactory]"
+echo " [--tensorrtllm-git-url tensorrtllm git repository URL for cloning]"
 echo " [--build-arg additional build args to pass to docker build]"
 echo " [--cache-from cache location to start from]"
 echo " [--cache-to location where to cache the build output]"
@@ -489,7 +499,11 @@ if [[ $FRAMEWORK == "TENSORRTLLM" ]]; then
 echo "Checking for TensorRT-LLM wheel in ${TENSORRTLLM_PIP_WHEEL_DIR}"
 if ! check_wheel_file "${TENSORRTLLM_PIP_WHEEL_DIR}" "${ARCH}_${TRTLLM_COMMIT}"; then
 echo "WARN: Valid trtllm wheel file not found in ${TENSORRTLLM_PIP_WHEEL_DIR}, attempting to build from source"
-if ! env -i ${SOURCE_DIR}/build_trtllm_wheel.sh -o ${TENSORRTLLM_PIP_WHEEL_DIR} -c ${TRTLLM_COMMIT} -a ${ARCH} -n ${NIXL_REF}; then
+GIT_URL_ARG=""
+if [ -n "${TRTLLM_GIT_URL}" ]; then
+GIT_URL_ARG="-u ${TRTLLM_GIT_URL}"
+fi
+if ! env -i ${SOURCE_DIR}/build_trtllm_wheel.sh -o ${TENSORRTLLM_PIP_WHEEL_DIR} -c ${TRTLLM_COMMIT} -a ${ARCH} -n ${NIXL_REF} ${GIT_URL_ARG}; then
 error "ERROR: Failed to build TensorRT-LLM wheel"
 fi
 fi
......
@@ -18,17 +18,19 @@
 # This script builds the TRT-LLM base image for Dynamo with TensorRT-LLM.
-while getopts "c:o:a:n:" opt; do
+while getopts "c:o:a:n:u:" opt; do
 case ${opt} in
 c) TRTLLM_COMMIT=$OPTARG ;;
 o) OUTPUT_DIR=$OPTARG ;;
 a) ARCH=$OPTARG ;;
 n) NIXL_COMMIT=$OPTARG ;;
-*) echo "Usage: $(basename $0) [-c commit] [-o output_dir] [-a arch] [-n nixl_commit]"
+u) TRTLLM_GIT_URL=$OPTARG ;;
+*) echo "Usage: $(basename $0) [-c commit] [-o output_dir] [-a arch] [-n nixl_commit] [-u git_url]"
 echo " -c: TensorRT-LLM commit to build"
 echo " -o: Output directory for wheel files"
 echo " -a: Architecture (amd64 or arm64)"
 echo " -n: NIXL commit"
+echo " -u: TensorRT-LLM git URL"
 exit 1 ;;
 esac
 done
@@ -38,13 +40,18 @@ if [ -z "$OUTPUT_DIR" ]; then
 OUTPUT_DIR="/tmp/trtllm_wheel"
 fi
+# Set default TensorRT-LLM git URL if not specified
+if [ -z "$TRTLLM_GIT_URL" ]; then
+TRTLLM_GIT_URL="https://github.com/NVIDIA/TensorRT-LLM.git"
+fi
 # Store directory where script is being launched from
 MAIN_DIR=$(dirname "$(readlink -f "$0")")
 (cd /tmp && \
 # Clone the TensorRT-LLM repository.
 if [ ! -d "TensorRT-LLM" ]; then
-git clone https://github.com/NVIDIA/TensorRT-LLM.git
+git clone "${TRTLLM_GIT_URL}"
 fi
 cd TensorRT-LLM
......