Commit 1995ef9a authored by Yan Ru Pei, committed by GitHub

docs: change docs to default port 8000 (#2876)


Signed-off-by: PeaBrane <yanrpei@gmail.com>
parent 4df2e2d6
......@@ -120,7 +120,7 @@ Dynamo provides a simple way to spin up a local set of inference components incl
```
# Start an OpenAI compatible HTTP server, a pre-processor (prompt templating and tokenization) and a router.
# Pass the TLS certificate and key paths to use HTTPS instead of HTTP.
-python -m dynamo.frontend --http-port 8080 [--tls-cert-path cert.pem] [--tls-key-path key.pem]
+python -m dynamo.frontend --http-port 8000 [--tls-cert-path cert.pem] [--tls-key-path key.pem]
# Start the SGLang engine, connecting to NATS and etcd to receive requests. You can run several of these,
# both for the same model and for multiple models. The frontend node will discover them.
......@@ -130,7 +130,7 @@ python -m dynamo.sglang.worker --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B
#### Send a Request
```bash
-curl localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
+curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
"messages": [
{
......
......@@ -37,7 +37,7 @@ python -m dynamo.mocker \
--enable-prefix-caching
# Start frontend server
-python -m dynamo.frontend --http-port 8080
+python -m dynamo.frontend --http-port 8000
```
### Legacy JSON file support:
......
......@@ -26,7 +26,7 @@ node 1
On node 0 (where the frontend was started) send a test request to verify your deployment:
```bash
-curl localhost:8080/v1/chat/completions \
+curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1",
......
......@@ -197,7 +197,7 @@ See the [vLLM CLI documentation](https://docs.vllm.ai/en/v0.9.2/configuration/se
Send a test request to verify your deployment:
```bash
-curl localhost:8080/v1/chat/completions \
+curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
......
# Dynamo frontend node.
-Usage: `python -m dynamo.frontend [--http-port 8080]`.
+Usage: `python -m dynamo.frontend [--http-port 8000]`.
This runs an OpenAI compliant HTTP server, a pre-processor, and a router in a single process. Engines / workers are auto-discovered when they call `register_llm`.
......
......@@ -48,7 +48,7 @@ tools.
Try the following to begin interacting with a model:
> dynamo --help
-> python -m dynamo.frontend [--http-port 8080]
+> python -m dynamo.frontend [--http-port 8000]
> python -m dynamo.vllm Qwen/Qwen2.5-3B-Instruct
To run more complete deployment examples, instances of etcd and nats need to be
......
......@@ -23,7 +23,7 @@ graph TD
PROMETHEUS[Prometheus server :9090] -->|:2379/metrics| ETCD_SERVER[etcd-server :2379, :2380]
PROMETHEUS -->|:9401/metrics| DCGM_EXPORTER[dcgm-exporter :9401]
PROMETHEUS -->|:7777/metrics| NATS_PROM_EXP
-PROMETHEUS -->|:8080/metrics| DYNAMOFE[Dynamo HTTP FE :8080]
+PROMETHEUS -->|:8000/metrics| DYNAMOFE[Dynamo HTTP FE :8000]
PROMETHEUS -->|:8081/metrics| DYNAMOBACKEND[Dynamo backend :8081]
DYNAMOFE --> DYNAMOBACKEND
GRAFANA -->|:9090/query API| PROMETHEUS
......
......@@ -24,7 +24,7 @@ Get started with Dynamo locally in just a few commands:
.. code-block:: bash
-# Start the OpenAI compatible frontend (default port is 8080)
+# Start the OpenAI compatible frontend (default port is 8000)
python -m dynamo.frontend
# In another terminal, start an SGLang worker
......@@ -34,7 +34,7 @@ Get started with Dynamo locally in just a few commands:
.. code-block:: bash
-curl localhost:8080/v1/chat/completions \
+curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello!"}],
......
......@@ -23,7 +23,7 @@ This diagram shows the NVIDIA Dynamo disaggregated inference system as implement
The primary user journey through the system:
1. **Discovery (S1)**: Client discovers the service endpoint
-2. **Request (S2)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8080)
+2. **Request (S2)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)
3. **Validate (S3)**: Frontend forwards request to Processor for validation and routing
4. **Route (S3)**: Processor routes the validated request to appropriate Decode Worker
......@@ -84,7 +84,7 @@ graph TD
%% Top Layer - Client & Frontend
Client["<b>HTTP Client</b>"]
S1[["<b>1 DISCOVERY</b>"]]
-Frontend["<b>Frontend</b><br/><i>OpenAI Compatible Server<br/>Port 8080</i>"]
+Frontend["<b>Frontend</b><br/><i>OpenAI Compatible Server<br/>Port 8000</i>"]
S2[["<b>2 REQUEST</b>"]]
%% Processing Layer
......
......@@ -14,12 +14,12 @@ The Dynamo KV Router intelligently routes requests by evaluating their computati
To launch the Dynamo frontend with the KV Router:
```bash
-python -m dynamo.frontend --router-mode kv --http-port 8080
+python -m dynamo.frontend --router-mode kv --http-port 8000
```
This command:
- Launches the Dynamo frontend service with KV routing enabled
-- Exposes the service on port 8080 (configurable)
+- Exposes the service on port 8000 (configurable)
- Automatically handles all backend workers registered to the Dynamo endpoint
Backend workers register themselves using the `register_llm` API, after which the KV Router automatically:
......
......@@ -88,7 +88,7 @@ Here's a template structure based on the examples:
Consult the corresponding sh file. Each of the python commands to launch a component will go into your yaml spec under the
`extraPodSpec: -> mainContainer: -> args:`
-The front end is launched with "python3 -m dynamo.frontend [--http-port 8080] [--router-mode kv]"
+The front end is launched with "python3 -m dynamo.frontend [--http-port 8000] [--router-mode kv]"
Each worker will launch `python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flags `command.
If you are a Dynamo contributor the [dynamo run guide](../dynamo_run.md) for details on how to run this command.
......
......@@ -79,7 +79,7 @@ graph TD
PROMETHEUS[Prometheus server :9090] -->|:2379/metrics| ETCD_SERVER[etcd-server :2379, :2380]
PROMETHEUS -->|:9401/metrics| DCGM_EXPORTER[dcgm-exporter :9401]
PROMETHEUS -->|:7777/metrics| NATS_PROM_EXP
-PROMETHEUS -->|:8080/metrics| DYNAMOFE[Dynamo HTTP FE :8080]
+PROMETHEUS -->|:8000/metrics| DYNAMOFE[Dynamo HTTP FE :8000]
PROMETHEUS -->|:8081/metrics| DYNAMOBACKEND[Dynamo backend :8081]
DYNAMOFE --> DYNAMOBACKEND
GRAFANA -->|:9090/query API| PROMETHEUS
......
......@@ -46,7 +46,7 @@ genai-perf profile \
--tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
-m deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--endpoint-type chat \
---url http://localhost:8080 \
+--url http://localhost:8000 \
--streaming \
--input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl
```
......@@ -76,7 +76,7 @@ In this example, we use a fixed 2p2d engine as baseline. Planner provides a `--n
# TODO
# in terminal 2
-genai-perf profile --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B --service-kind openai --endpoint-type chat --url http://localhost:8080 --streaming --input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl
+genai-perf profile --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B --service-kind openai --endpoint-type chat --url http://localhost:8000 --streaming --input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl
```
## Results
......
......@@ -85,7 +85,7 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
> [!Caution]
-> ¹ There is a known issue with the TensorRT-LLM framework when running the AL2023 container locally with `docker run --network host ...` due to a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid this issue, replace the `--network host` flag with more precise networking configuration by mapping only the necessary ports (e.g., 4222 for nats, 2379/2380 for etcd, 8080 for frontend).
+> ¹ There is a known issue with the TensorRT-LLM framework when running the AL2023 container locally with `docker run --network host ...` due to a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid this issue, replace the `--network host` flag with more precise networking configuration by mapping only the necessary ports (e.g., 4222 for nats, 2379/2380 for etcd, 8000 for frontend).
## Build Support
......
......@@ -73,7 +73,7 @@ bash launch/agg.sh --model Qwen/Qwen2.5-VL-7B-Instruct
In another terminal:
```bash
-curl http://localhost:8080/v1/chat/completions \
+curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/llava-1.5-7b-hf",
......@@ -146,7 +146,7 @@ bash launch/disagg.sh --model llava-hf/llava-1.5-7b-hf
In another terminal:
```bash
-curl http://localhost:8080/v1/chat/completions \
+curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/llava-1.5-7b-hf",
......@@ -223,7 +223,7 @@ bash launch/agg_llama.sh
In another terminal:
```bash
-curl http://localhost:8080/v1/chat/completions \
+curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
......@@ -295,7 +295,7 @@ bash launch/disagg_llama.sh
In another terminal:
```bash
-curl http://localhost:8080/v1/chat/completions \
+curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
......@@ -366,7 +366,7 @@ bash launch/video_agg.sh
In another terminal:
```bash
-curl http://localhost:8080/v1/chat/completions \
+curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/LLaVA-NeXT-Video-7B-hf",
......@@ -455,7 +455,7 @@ bash launch/video_disagg.sh
In another terminal:
```bash
-curl http://localhost:8080/v1/chat/completions \
+curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/LLaVA-NeXT-Video-7B-hf",
......
......@@ -185,7 +185,7 @@ DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 cargo run --bin system_server
The server will start an system status server on the specified port (8081 in this example) that exposes the Prometheus metrics endpoint at `/metrics`.
-To Run an actual LLM frontend + server (aggregated example), launch both of them. By default, the frontend listens to port 8080.
+To Run an actual LLM frontend + server (aggregated example), launch both of them. By default, the frontend listens to port 8000.
```
python -m dynamo.frontend &
......@@ -202,5 +202,5 @@ Once running, you can query the metrics:
curl http://localhost:8081/metrics | grep -E "dynamo_component"
# Get all frontend metrics
-curl http://localhost:8080/metrics | grep -E "dynamo_frontend"
+curl http://localhost:8000/metrics | grep -E "dynamo_frontend"
```
......@@ -62,13 +62,13 @@ python3 summarize_scores_dynamo.py
### Baseline Architecture (deploy-baseline-dynamo.sh)
```
-HTTP Request → Dynamo Ingress(8080) → Dynamo Worker → Direct Inference
+HTTP Request → Dynamo Ingress(8000) → Dynamo Worker → Direct Inference
Environment: ENABLE_LMCACHE=0
```
### LMCache Architecture (deploy-lmcache_enabled-dynamo.sh)
```
-HTTP Request → Dynamo Ingress(8080) → Dynamo Worker → LMCache-enabled Inference
+HTTP Request → Dynamo Ingress(8000) → Dynamo Worker → LMCache-enabled Inference
Environment: ENABLE_LMCACHE=1
LMCACHE_CHUNK_SIZE=256
LMCACHE_LOCAL_CPU=True
......@@ -80,7 +80,7 @@ Environment: ENABLE_LMCACHE=1
Test scripts use Dynamo's Chat Completions API:
```bash
-curl -X POST http://localhost:8080/v1/chat/completions \
+curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": Qwen/Qwen3-0.6B,
......
......@@ -18,7 +18,7 @@
# Reference: https://github.com/LMCache/LMCache/blob/dev/.buildkite/correctness/1-mmlu.py
# ASSUMPTIONS:
-# 1. dynamo is running (default: localhost:8080) without LMCache
+# 1. dynamo is running (default: localhost:8000) without LMCache
# 2. the mmlu dataset is in a "data" directory
# 3. all invocations of this script should be run in the same directory
# (for later consolidation)
......
......@@ -17,7 +17,7 @@
# Reference: https://github.com/LMCache/LMCache/blob/dev/.buildkite/correctness/2-mmlu.py
# ASSUMPTIONS:
-# 1. dynamo is running (default: localhost:8080) with LMCache enabled
+# 1. dynamo is running (default: localhost:8000) with LMCache enabled
# 2. the mmlu dataset is in a "data" directory
# 3. all invocations of this script should be run in the same directory
# (for later consolidation)
......
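Review note: the curl examples touched above all post to `/v1/chat/completions` on the frontend port. As a minimal sketch of the same request in Python, assuming a frontend on `localhost` at the new default port 8000: the helper name `build_chat_request` and its parameters are hypothetical and not part of Dynamo. The block only constructs the request; sending it requires a running frontend and worker.

```python
import json
import urllib.request


def build_chat_request(prompt, model="Qwen/Qwen3-0.6B", host="localhost", port=8000):
    """Build (but do not send) an OpenAI-style chat completions request.

    Hypothetical helper for illustration; host, port and model are assumptions
    taken from the curl examples in this diff.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_chat_request("Hello!")
    print(req.get_method(), req.full_url)
    # To send it against a live deployment:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp))
```

Passing `port=8080` reproduces the old behavior for deployments that still set `--http-port 8080` explicitly.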