Commit 1995ef9a authored by Yan Ru Pei, committed by GitHub

docs: change docs to default port 8000 (#2876)


Signed-off-by: PeaBrane <yanrpei@gmail.com>
parent 4df2e2d6
...@@ -120,7 +120,7 @@ Dynamo provides a simple way to spin up a local set of inference components incl ...@@ -120,7 +120,7 @@ Dynamo provides a simple way to spin up a local set of inference components incl
``` ```
# Start an OpenAI compatible HTTP server, a pre-processor (prompt templating and tokenization) and a router. # Start an OpenAI compatible HTTP server, a pre-processor (prompt templating and tokenization) and a router.
# Pass the TLS certificate and key paths to use HTTPS instead of HTTP. # Pass the TLS certificate and key paths to use HTTPS instead of HTTP.
python -m dynamo.frontend --http-port 8080 [--tls-cert-path cert.pem] [--tls-key-path key.pem] python -m dynamo.frontend --http-port 8000 [--tls-cert-path cert.pem] [--tls-key-path key.pem]
# Start the SGLang engine, connecting to NATS and etcd to receive requests. You can run several of these, # Start the SGLang engine, connecting to NATS and etcd to receive requests. You can run several of these,
# both for the same model and for multiple models. The frontend node will discover them. # both for the same model and for multiple models. The frontend node will discover them.
...@@ -130,7 +130,7 @@ python -m dynamo.sglang.worker --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B ...@@ -130,7 +130,7 @@ python -m dynamo.sglang.worker --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B
#### Send a Request #### Send a Request
```bash ```bash
curl localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B", "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
"messages": [ "messages": [
{ {
......
...@@ -37,7 +37,7 @@ python -m dynamo.mocker \ ...@@ -37,7 +37,7 @@ python -m dynamo.mocker \
--enable-prefix-caching --enable-prefix-caching
# Start frontend server # Start frontend server
python -m dynamo.frontend --http-port 8080 python -m dynamo.frontend --http-port 8000
``` ```
### Legacy JSON file support: ### Legacy JSON file support:
......
...@@ -26,7 +26,7 @@ node 1 ...@@ -26,7 +26,7 @@ node 1
On node 0 (where the frontend was started) send a test request to verify your deployment: On node 0 (where the frontend was started) send a test request to verify your deployment:
```bash ```bash
curl localhost:8080/v1/chat/completions \ curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
-d '{ -d '{
"model": "deepseek-ai/DeepSeek-R1", "model": "deepseek-ai/DeepSeek-R1",
......
...@@ -197,7 +197,7 @@ See the [vLLM CLI documentation](https://docs.vllm.ai/en/v0.9.2/configuration/se ...@@ -197,7 +197,7 @@ See the [vLLM CLI documentation](https://docs.vllm.ai/en/v0.9.2/configuration/se
Send a test request to verify your deployment: Send a test request to verify your deployment:
```bash ```bash
curl localhost:8080/v1/chat/completions \ curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
-d '{ -d '{
"model": "Qwen/Qwen3-0.6B", "model": "Qwen/Qwen3-0.6B",
......
# Dynamo frontend node. # Dynamo frontend node.
Usage: `python -m dynamo.frontend [--http-port 8080]`. Usage: `python -m dynamo.frontend [--http-port 8000]`.
This runs an OpenAI compliant HTTP server, a pre-processor, and a router in a single process. Engines / workers are auto-discovered when they call `register_llm`. This runs an OpenAI compliant HTTP server, a pre-processor, and a router in a single process. Engines / workers are auto-discovered when they call `register_llm`.
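As a rough illustration of how a client might address the frontend on the new default port, the sketch below builds (but does not send) a Chat Completions request. The helper name `build_chat_request` and its `host` default are assumptions made for this example and are not part of Dynamo; the `/v1/chat/completions` path and the model name are taken from the curl examples in this change.

```python
import json
import urllib.request


def build_chat_request(model, content, host="localhost", port=8000):
    """Build a POST request for the frontend's Chat Completions endpoint.

    `build_chat_request` is a hypothetical helper; port 8000 mirrors the
    documented default and can be overridden to match --http-port.
    """
    url = f"http://{host}:{port}/v1/chat/completions"
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": content}]}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("Qwen/Qwen3-0.6B", "Hello!")
print(req.full_url)
```

Passing the request to `urllib.request.urlopen(req)` would send it, which requires a running frontend and at least one registered worker.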
......
...@@ -48,7 +48,7 @@ tools. ...@@ -48,7 +48,7 @@ tools.
Try the following to begin interacting with a model: Try the following to begin interacting with a model:
> dynamo --help > dynamo --help
> python -m dynamo.frontend [--http-port 8080] > python -m dynamo.frontend [--http-port 8000]
> python -m dynamo.vllm Qwen/Qwen2.5-3B-Instruct > python -m dynamo.vllm Qwen/Qwen2.5-3B-Instruct
To run more complete deployment examples, instances of etcd and nats need to be To run more complete deployment examples, instances of etcd and nats need to be
......
...@@ -23,7 +23,7 @@ graph TD ...@@ -23,7 +23,7 @@ graph TD
PROMETHEUS[Prometheus server :9090] -->|:2379/metrics| ETCD_SERVER[etcd-server :2379, :2380] PROMETHEUS[Prometheus server :9090] -->|:2379/metrics| ETCD_SERVER[etcd-server :2379, :2380]
PROMETHEUS -->|:9401/metrics| DCGM_EXPORTER[dcgm-exporter :9401] PROMETHEUS -->|:9401/metrics| DCGM_EXPORTER[dcgm-exporter :9401]
PROMETHEUS -->|:7777/metrics| NATS_PROM_EXP PROMETHEUS -->|:7777/metrics| NATS_PROM_EXP
PROMETHEUS -->|:8080/metrics| DYNAMOFE[Dynamo HTTP FE :8080] PROMETHEUS -->|:8000/metrics| DYNAMOFE[Dynamo HTTP FE :8000]
PROMETHEUS -->|:8081/metrics| DYNAMOBACKEND[Dynamo backend :8081] PROMETHEUS -->|:8081/metrics| DYNAMOBACKEND[Dynamo backend :8081]
DYNAMOFE --> DYNAMOBACKEND DYNAMOFE --> DYNAMOBACKEND
GRAFANA -->|:9090/query API| PROMETHEUS GRAFANA -->|:9090/query API| PROMETHEUS
......
...@@ -24,7 +24,7 @@ Get started with Dynamo locally in just a few commands: ...@@ -24,7 +24,7 @@ Get started with Dynamo locally in just a few commands:
.. code-block:: bash .. code-block:: bash
# Start the OpenAI compatible frontend (default port is 8080) # Start the OpenAI compatible frontend (default port is 8000)
python -m dynamo.frontend python -m dynamo.frontend
# In another terminal, start an SGLang worker # In another terminal, start an SGLang worker
...@@ -34,7 +34,7 @@ Get started with Dynamo locally in just a few commands: ...@@ -34,7 +34,7 @@ Get started with Dynamo locally in just a few commands:
.. code-block:: bash .. code-block:: bash
curl localhost:8080/v1/chat/completions \ curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
-d '{"model": "Qwen/Qwen3-0.6B", -d '{"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello!"}], "messages": [{"role": "user", "content": "Hello!"}],
......
...@@ -23,7 +23,7 @@ This diagram shows the NVIDIA Dynamo disaggregated inference system as implement ...@@ -23,7 +23,7 @@ This diagram shows the NVIDIA Dynamo disaggregated inference system as implement
The primary user journey through the system: The primary user journey through the system:
1. **Discovery (S1)**: Client discovers the service endpoint 1. **Discovery (S1)**: Client discovers the service endpoint
2. **Request (S2)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8080) 2. **Request (S2)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)
3. **Validate (S3)**: Frontend forwards request to Processor for validation and routing 3. **Validate (S3)**: Frontend forwards request to Processor for validation and routing
4. **Route (S3)**: Processor routes the validated request to appropriate Decode Worker 4. **Route (S3)**: Processor routes the validated request to appropriate Decode Worker
...@@ -84,7 +84,7 @@ graph TD ...@@ -84,7 +84,7 @@ graph TD
%% Top Layer - Client & Frontend %% Top Layer - Client & Frontend
Client["<b>HTTP Client</b>"] Client["<b>HTTP Client</b>"]
S1[["<b>1 DISCOVERY</b>"]] S1[["<b>1 DISCOVERY</b>"]]
Frontend["<b>Frontend</b><br/><i>OpenAI Compatible Server<br/>Port 8080</i>"] Frontend["<b>Frontend</b><br/><i>OpenAI Compatible Server<br/>Port 8000</i>"]
S2[["<b>2 REQUEST</b>"]] S2[["<b>2 REQUEST</b>"]]
%% Processing Layer %% Processing Layer
......
...@@ -14,12 +14,12 @@ The Dynamo KV Router intelligently routes requests by evaluating their computati ...@@ -14,12 +14,12 @@ The Dynamo KV Router intelligently routes requests by evaluating their computati
To launch the Dynamo frontend with the KV Router: To launch the Dynamo frontend with the KV Router:
```bash ```bash
python -m dynamo.frontend --router-mode kv --http-port 8080 python -m dynamo.frontend --router-mode kv --http-port 8000
``` ```
This command: This command:
- Launches the Dynamo frontend service with KV routing enabled - Launches the Dynamo frontend service with KV routing enabled
- Exposes the service on port 8080 (configurable) - Exposes the service on port 8000 (configurable)
- Automatically handles all backend workers registered to the Dynamo endpoint - Automatically handles all backend workers registered to the Dynamo endpoint
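For scripted launches, the command above can be assembled programmatically. This is a minimal sketch that assumes only the two flags shown in this change (`--http-port` and `--router-mode`); the helper name `frontend_argv` is hypothetical.

```python
import sys


def frontend_argv(http_port=8000, router_mode=None):
    """Return the argument list for launching the Dynamo frontend.

    Only flags that appear in the documentation above are emitted.
    """
    argv = [sys.executable, "-m", "dynamo.frontend", "--http-port", str(http_port)]
    if router_mode is not None:
        argv += ["--router-mode", router_mode]
    return argv


print(frontend_argv(router_mode="kv")[1:])
```

The resulting list can be handed to `subprocess.Popen` in an environment where Dynamo is installed.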
Backend workers register themselves using the `register_llm` API, after which the KV Router automatically: Backend workers register themselves using the `register_llm` API, after which the KV Router automatically:
......
...@@ -88,7 +88,7 @@ Here's a template structure based on the examples: ...@@ -88,7 +88,7 @@ Here's a template structure based on the examples:
Consult the corresponding sh file. Each of the python commands to launch a component will go into your yaml spec under the Consult the corresponding sh file. Each of the python commands to launch a component will go into your yaml spec under the
`extraPodSpec: -> mainContainer: -> args:` `extraPodSpec: -> mainContainer: -> args:`
The front end is launched with "python3 -m dynamo.frontend [--http-port 8080] [--router-mode kv]" The front end is launched with "python3 -m dynamo.frontend [--http-port 8000] [--router-mode kv]"
Each worker is launched with a `python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flags` command. Each worker is launched with a `python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flags` command.
If you are a Dynamo contributor, see the [dynamo run guide](../dynamo_run.md) for details on how to run this command. If you are a Dynamo contributor, see the [dynamo run guide](../dynamo_run.md) for details on how to run this command.
......
...@@ -79,7 +79,7 @@ graph TD ...@@ -79,7 +79,7 @@ graph TD
PROMETHEUS[Prometheus server :9090] -->|:2379/metrics| ETCD_SERVER[etcd-server :2379, :2380] PROMETHEUS[Prometheus server :9090] -->|:2379/metrics| ETCD_SERVER[etcd-server :2379, :2380]
PROMETHEUS -->|:9401/metrics| DCGM_EXPORTER[dcgm-exporter :9401] PROMETHEUS -->|:9401/metrics| DCGM_EXPORTER[dcgm-exporter :9401]
PROMETHEUS -->|:7777/metrics| NATS_PROM_EXP PROMETHEUS -->|:7777/metrics| NATS_PROM_EXP
PROMETHEUS -->|:8080/metrics| DYNAMOFE[Dynamo HTTP FE :8080] PROMETHEUS -->|:8000/metrics| DYNAMOFE[Dynamo HTTP FE :8000]
PROMETHEUS -->|:8081/metrics| DYNAMOBACKEND[Dynamo backend :8081] PROMETHEUS -->|:8081/metrics| DYNAMOBACKEND[Dynamo backend :8081]
DYNAMOFE --> DYNAMOBACKEND DYNAMOFE --> DYNAMOBACKEND
GRAFANA -->|:9090/query API| PROMETHEUS GRAFANA -->|:9090/query API| PROMETHEUS
......
...@@ -46,7 +46,7 @@ genai-perf profile \ ...@@ -46,7 +46,7 @@ genai-perf profile \
--tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B \ --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
-m deepseek-ai/DeepSeek-R1-Distill-Llama-8B \ -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--endpoint-type chat \ --endpoint-type chat \
--url http://localhost:8080 \ --url http://localhost:8000 \
--streaming \ --streaming \
--input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl --input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl
``` ```
...@@ -76,7 +76,7 @@ In this example, we use a fixed 2p2d engine as baseline. Planner provides a `--n ...@@ -76,7 +76,7 @@ In this example, we use a fixed 2p2d engine as baseline. Planner provides a `--n
# TODO # TODO
# in terminal 2 # in terminal 2
genai-perf profile --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B --service-kind openai --endpoint-type chat --url http://localhost:8080 --streaming --input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl genai-perf profile --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B --service-kind openai --endpoint-type chat --url http://localhost:8000 --streaming --input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl
``` ```
## Results ## Results
......
...@@ -85,7 +85,7 @@ If you are using a **GPU**, the following GPU models and architectures are suppo ...@@ -85,7 +85,7 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
> [!Caution] > [!Caution]
> ¹ There is a known issue with the TensorRT-LLM framework when running the AL2023 container locally with `docker run --network host ...` due to a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid this issue, replace the `--network host` flag with more precise networking configuration by mapping only the necessary ports (e.g., 4222 for nats, 2379/2380 for etcd, 8080 for frontend). > ¹ There is a known issue with the TensorRT-LLM framework when running the AL2023 container locally with `docker run --network host ...` due to a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid this issue, replace the `--network host` flag with more precise networking configuration by mapping only the necessary ports (e.g., 4222 for nats, 2379/2380 for etcd, 8000 for frontend).
## Build Support ## Build Support
......
...@@ -73,7 +73,7 @@ bash launch/agg.sh --model Qwen/Qwen2.5-VL-7B-Instruct ...@@ -73,7 +73,7 @@ bash launch/agg.sh --model Qwen/Qwen2.5-VL-7B-Instruct
In another terminal: In another terminal:
```bash ```bash
curl http://localhost:8080/v1/chat/completions \ curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
-d '{ -d '{
"model": "llava-hf/llava-1.5-7b-hf", "model": "llava-hf/llava-1.5-7b-hf",
...@@ -146,7 +146,7 @@ bash launch/disagg.sh --model llava-hf/llava-1.5-7b-hf ...@@ -146,7 +146,7 @@ bash launch/disagg.sh --model llava-hf/llava-1.5-7b-hf
In another terminal: In another terminal:
```bash ```bash
curl http://localhost:8080/v1/chat/completions \ curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
-d '{ -d '{
"model": "llava-hf/llava-1.5-7b-hf", "model": "llava-hf/llava-1.5-7b-hf",
...@@ -223,7 +223,7 @@ bash launch/agg_llama.sh ...@@ -223,7 +223,7 @@ bash launch/agg_llama.sh
In another terminal: In another terminal:
```bash ```bash
curl http://localhost:8080/v1/chat/completions \ curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
-d '{ -d '{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8", "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
...@@ -295,7 +295,7 @@ bash launch/disagg_llama.sh ...@@ -295,7 +295,7 @@ bash launch/disagg_llama.sh
In another terminal: In another terminal:
```bash ```bash
curl http://localhost:8080/v1/chat/completions \ curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
-d '{ -d '{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8", "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
...@@ -366,7 +366,7 @@ bash launch/video_agg.sh ...@@ -366,7 +366,7 @@ bash launch/video_agg.sh
In another terminal: In another terminal:
```bash ```bash
curl http://localhost:8080/v1/chat/completions \ curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
-d '{ -d '{
"model": "llava-hf/LLaVA-NeXT-Video-7B-hf", "model": "llava-hf/LLaVA-NeXT-Video-7B-hf",
...@@ -455,7 +455,7 @@ bash launch/video_disagg.sh ...@@ -455,7 +455,7 @@ bash launch/video_disagg.sh
In another terminal: In another terminal:
```bash ```bash
curl http://localhost:8080/v1/chat/completions \ curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
-d '{ -d '{
"model": "llava-hf/LLaVA-NeXT-Video-7B-hf", "model": "llava-hf/LLaVA-NeXT-Video-7B-hf",
......
...@@ -185,7 +185,7 @@ DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 cargo run --bin system_server ...@@ -185,7 +185,7 @@ DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 cargo run --bin system_server
The server will start a system status server on the specified port (8081 in this example) that exposes the Prometheus metrics endpoint at `/metrics`. The server will start a system status server on the specified port (8081 in this example) that exposes the Prometheus metrics endpoint at `/metrics`.
To run an actual LLM frontend + server (aggregated example), launch both of them. By default, the frontend listens on port 8080. To run an actual LLM frontend + server (aggregated example), launch both of them. By default, the frontend listens on port 8000.
``` ```
python -m dynamo.frontend & python -m dynamo.frontend &
...@@ -202,5 +202,5 @@ Once running, you can query the metrics: ...@@ -202,5 +202,5 @@ Once running, you can query the metrics:
curl http://localhost:8081/metrics | grep -E "dynamo_component" curl http://localhost:8081/metrics | grep -E "dynamo_component"
# Get all frontend metrics # Get all frontend metrics
curl http://localhost:8080/metrics | grep -E "dynamo_frontend" curl http://localhost:8000/metrics | grep -E "dynamo_frontend"
``` ```
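The same filtering can be done in a script once the metrics text has been fetched. The sketch below keeps only sample lines whose metric name starts with a given prefix; unlike the `grep` above, it skips `# HELP` and `# TYPE` comment lines. The function name and the metric names in the sample text are hypothetical and chosen only to illustrate the prefixes.

```python
def filter_metrics(exposition_text, prefix):
    """Return sample lines whose metric name starts with `prefix`.

    Comment lines start with '#', so they never match the prefix.
    """
    return [
        line
        for line in exposition_text.splitlines()
        if line.startswith(prefix)
    ]


# Hypothetical metric names, used only to demonstrate the filter.
sample = (
    "# HELP dynamo_frontend_example_total hypothetical counter\n"
    "dynamo_frontend_example_total 3\n"
    "dynamo_component_example_total 5\n"
)
print(filter_metrics(sample, "dynamo_frontend"))
```

In practice the input would be the body returned by `http://localhost:8000/metrics` for the frontend or `http://localhost:8081/metrics` for the backend.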
...@@ -62,13 +62,13 @@ python3 summarize_scores_dynamo.py ...@@ -62,13 +62,13 @@ python3 summarize_scores_dynamo.py
### Baseline Architecture (deploy-baseline-dynamo.sh) ### Baseline Architecture (deploy-baseline-dynamo.sh)
``` ```
HTTP Request → Dynamo Ingress(8080) → Dynamo Worker → Direct Inference HTTP Request → Dynamo Ingress(8000) → Dynamo Worker → Direct Inference
Environment: ENABLE_LMCACHE=0 Environment: ENABLE_LMCACHE=0
``` ```
### LMCache Architecture (deploy-lmcache_enabled-dynamo.sh) ### LMCache Architecture (deploy-lmcache_enabled-dynamo.sh)
``` ```
HTTP Request → Dynamo Ingress(8080) → Dynamo Worker → LMCache-enabled Inference HTTP Request → Dynamo Ingress(8000) → Dynamo Worker → LMCache-enabled Inference
Environment: ENABLE_LMCACHE=1 Environment: ENABLE_LMCACHE=1
LMCACHE_CHUNK_SIZE=256 LMCACHE_CHUNK_SIZE=256
LMCACHE_LOCAL_CPU=True LMCACHE_LOCAL_CPU=True
...@@ -80,7 +80,7 @@ Environment: ENABLE_LMCACHE=1 ...@@ -80,7 +80,7 @@ Environment: ENABLE_LMCACHE=1
Test scripts use Dynamo's Chat Completions API: Test scripts use Dynamo's Chat Completions API:
```bash ```bash
curl -X POST http://localhost:8080/v1/chat/completions \ curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
-d '{ -d '{
"model": "Qwen/Qwen3-0.6B", "model": "Qwen/Qwen3-0.6B",
......
...@@ -18,7 +18,7 @@ ...@@ -18,7 +18,7 @@
# Reference: https://github.com/LMCache/LMCache/blob/dev/.buildkite/correctness/1-mmlu.py # Reference: https://github.com/LMCache/LMCache/blob/dev/.buildkite/correctness/1-mmlu.py
# ASSUMPTIONS: # ASSUMPTIONS:
# 1. dynamo is running (default: localhost:8080) without LMCache # 1. dynamo is running (default: localhost:8000) without LMCache
# 2. the mmlu dataset is in a "data" directory # 2. the mmlu dataset is in a "data" directory
# 3. all invocations of this script should be run in the same directory # 3. all invocations of this script should be run in the same directory
# (for later consolidation) # (for later consolidation)
......
...@@ -17,7 +17,7 @@ ...@@ -17,7 +17,7 @@
# Reference: https://github.com/LMCache/LMCache/blob/dev/.buildkite/correctness/2-mmlu.py # Reference: https://github.com/LMCache/LMCache/blob/dev/.buildkite/correctness/2-mmlu.py
# ASSUMPTIONS: # ASSUMPTIONS:
# 1. dynamo is running (default: localhost:8080) with LMCache enabled # 1. dynamo is running (default: localhost:8000) with LMCache enabled
# 2. the mmlu dataset is in a "data" directory # 2. the mmlu dataset is in a "data" directory
# 3. all invocations of this script should be run in the same directory # 3. all invocations of this script should be run in the same directory
# (for later consolidation) # (for later consolidation)
......