docs: change docs to default port 8000 (#2876)

Signed-off-by: PeaBrane <yanrpei@gmail.com>

Commit 1995ef9a authored Sep 04, 2025 by Yan Ru Pei; committed by GitHub Sep 05, 2025
19 changed files
--- a/README.md
+++ b/README.md
@@ -120,7 +120,7 @@ Dynamo provides a simple way to spin up a local set of inference components incl
 ```
 # Start an OpenAI compatible HTTP server, a pre-processor (prompt templating and tokenization) and a router.
 # Pass the TLS certificate and key paths to use HTTPS instead of HTTP.
-python -m dynamo.frontend --http-port 8080 [--tls-cert-path cert.pem] [--tls-key-path key.pem]
+python -m dynamo.frontend --http-port 8000 [--tls-cert-path cert.pem] [--tls-key-path key.pem]
 # Start the SGLang engine, connecting to NATS and etcd to receive requests. You can run several of these,
 # both for the same model and for multiple models. The frontend node will discover them.
@@ -130,7 +130,7 @@ python -m dynamo.sglang.worker --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B
 #### Send a Request
 ```bash
-curl localhost:8080/v1/chat/completions   -H "Content-Type: application/json"   -d '{
+curl localhost:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [
    {

--- a/components/backends/mocker/README.md
+++ b/components/backends/mocker/README.md
@@ -37,7 +37,7 @@ python -m dynamo.mocker \
  --enable-prefix-caching
 # Start frontend server
-python -m dynamo.frontend --http-port 8080
+python -m dynamo.frontend --http-port 8000
 ```
 ### Legacy JSON file support:

--- a/components/backends/vllm/deepseek-r1.md
+++ b/components/backends/vllm/deepseek-r1.md
@@ -26,7 +26,7 @@ node 1
 On node 0 (where the frontend was started) send a test request to verify your deployment:
 ```bash
-curl localhost:8080/v1/chat/completions \
+curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1",

--- a/components/backends/vllm/deploy/README.md
+++ b/components/backends/vllm/deploy/README.md
@@ -197,7 +197,7 @@ See the [vLLM CLI documentation](https://docs.vllm.ai/en/v0.9.2/configuration/se
 Send a test request to verify your deployment:
 ```bash
-curl localhost:8080/v1/chat/completions \
+curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",

--- a/components/frontend/README.md
+++ b/components/frontend/README.md
 # Dynamo frontend node.
-Usage: `python -m dynamo.frontend [--http-port 8080]`.
+Usage: `python -m dynamo.frontend [--http-port 8000]`.
 This runs an OpenAI compliant HTTP server, a pre-processor, and a router in a single process. Engines / workers are auto-discovered when they call `register_llm`.

--- a/container/launch_message.txt
+++ b/container/launch_message.txt
@@ -48,7 +48,7 @@ tools.
 Try the following to begin interacting with a model:
 > dynamo --help
-> python -m dynamo.frontend [--http-port 8080]
+> python -m dynamo.frontend [--http-port 8000]
 > python -m dynamo.vllm Qwen/Qwen2.5-3B-Instruct
 To run more complete deployment examples, instances of etcd and nats need to be

--- a/deploy/metrics/README.md
+++ b/deploy/metrics/README.md
@@ -23,7 +23,7 @@ graph TD
        PROMETHEUS[Prometheus server :9090] -->|:2379/metrics| ETCD_SERVER[etcd-server :2379, :2380]
        PROMETHEUS -->|:9401/metrics| DCGM_EXPORTER[dcgm-exporter :9401]
        PROMETHEUS -->|:7777/metrics| NATS_PROM_EXP
-        PROMETHEUS -->|:8080/metrics| DYNAMOFE[Dynamo HTTP FE :8080]
+        PROMETHEUS -->|:8000/metrics| DYNAMOFE[Dynamo HTTP FE :8000]
        PROMETHEUS -->|:8081/metrics| DYNAMOBACKEND[Dynamo backend :8081]
        DYNAMOFE --> DYNAMOBACKEND
        GRAFANA -->|:9090/query API| PROMETHEUS

--- a/docs/_includes/quick_start_local.rst
+++ b/docs/_includes/quick_start_local.rst
@@ -24,7 +24,7 @@ Get started with Dynamo locally in just a few commands:
 .. code-block:: bash
-   # Start the OpenAI compatible frontend (default port is 8080)
+   # Start the OpenAI compatible frontend (default port is 8000)
   python -m dynamo.frontend
   # In another terminal, start an SGLang worker
@@ -34,7 +34,7 @@ Get started with Dynamo locally in just a few commands:
 .. code-block:: bash
-   curl localhost:8080/v1/chat/completions \
+   curl localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{"model": "Qwen/Qwen3-0.6B",
          "messages": [{"role": "user", "content": "Hello!"}],

--- a/docs/architecture/dynamo_flow.md
+++ b/docs/architecture/dynamo_flow.md
@@ -23,7 +23,7 @@ This diagram shows the NVIDIA Dynamo disaggregated inference system as implement
 The primary user journey through the system:
 1. **Discovery (S1)**: Client discovers the service endpoint
-2. **Request (S2)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8080)
+2. **Request (S2)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)
 3. **Validate (S3)**: Frontend forwards request to Processor for validation and routing
 4. **Route (S3)**: Processor routes the validated request to appropriate Decode Worker
@@ -84,7 +84,7 @@ graph TD
    %% Top Layer - Client & Frontend
    Client["<b>HTTP Client</b>"]
    S1[["<b>1 DISCOVERY</b>"]]
-    Frontend["<b>Frontend</b><br/><i>OpenAI Compatible Server<br/>Port 8080</i>"]
+    Frontend["<b>Frontend</b><br/><i>OpenAI Compatible Server<br/>Port 8000</i>"]
    S2[["<b>2 REQUEST</b>"]]
    %% Processing Layer

--- a/docs/components/router/README.md
+++ b/docs/components/router/README.md
@@ -14,12 +14,12 @@ The Dynamo KV Router intelligently routes requests by evaluating their computati
 To launch the Dynamo frontend with the KV Router:
 ```bash
-python -m dynamo.frontend --router-mode kv --http-port 8080
+python -m dynamo.frontend --router-mode kv --http-port 8000
 ```
 This command:
 - Launches the Dynamo frontend service with KV routing enabled
-- Exposes the service on port 8080 (configurable)
+- Exposes the service on port 8000 (configurable)
 - Automatically handles all backend workers registered to the Dynamo endpoint
 Backend workers register themselves using the `register_llm` API, after which the KV Router automatically:

--- a/docs/guides/dynamo_deploy/create_deployment.md
+++ b/docs/guides/dynamo_deploy/create_deployment.md
@@ -88,7 +88,7 @@ Here's a template structure based on the examples:
 Consult the corresponding sh file. Each of the python commands to launch a component will go into your yaml spec under the
 `extraPodSpec: -> mainContainer: -> args:`
-The front end is launched with "python3 -m dynamo.frontend [--http-port 8080] [--router-mode kv]"
+The front end is launched with "python3 -m dynamo.frontend [--http-port 8000] [--router-mode kv]"
 Each worker will launch `python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flags `command.
 If you are a Dynamo contributor the [dynamo run guide](../dynamo_run.md) for details on how to run this command.

--- a/docs/guides/metrics.md
+++ b/docs/guides/metrics.md
@@ -79,7 +79,7 @@ graph TD
        PROMETHEUS[Prometheus server :9090] -->|:2379/metrics| ETCD_SERVER[etcd-server :2379, :2380]
        PROMETHEUS -->|:9401/metrics| DCGM_EXPORTER[dcgm-exporter :9401]
        PROMETHEUS -->|:7777/metrics| NATS_PROM_EXP
-        PROMETHEUS -->|:8080/metrics| DYNAMOFE[Dynamo HTTP FE :8080]
+        PROMETHEUS -->|:8000/metrics| DYNAMOFE[Dynamo HTTP FE :8000]
        PROMETHEUS -->|:8081/metrics| DYNAMOBACKEND[Dynamo backend :8081]
        DYNAMOFE --> DYNAMOBACKEND
        GRAFANA -->|:9090/query API| PROMETHEUS

--- a/docs/guides/planner_benchmark/README.md
+++ b/docs/guides/planner_benchmark/README.md
@@ -46,7 +46,7 @@ genai-perf profile \
    --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --endpoint-type chat \
-    --url http://localhost:8080 \
+    --url http://localhost:8000 \
    --streaming \
    --input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl
 ```
@@ -76,7 +76,7 @@ In this example, we use a fixed 2p2d engine as baseline. Planner provides a `--n
 # TODO
 # in terminal 2
-genai-perf profile --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B --service-kind openai --endpoint-type chat --url http://localhost:8080 --streaming --input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl
+genai-perf profile --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B --service-kind openai --endpoint-type chat --url http://localhost:8000 --streaming --input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl
 ```
 ## Results

--- a/docs/support_matrix.md
+++ b/docs/support_matrix.md
@@ -85,7 +85,7 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
 > [!Caution]
-> ¹ There is a known issue with the TensorRT-LLM framework when running the AL2023 container locally with `docker run --network host ...` due to a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid this issue, replace the `--network host` flag with more precise networking configuration by mapping only the necessary ports (e.g., 4222 for nats, 2379/2380 for etcd, 8080 for frontend).
+> ¹ There is a known issue with the TensorRT-LLM framework when running the AL2023 container locally with `docker run --network host ...` due to a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid this issue, replace the `--network host` flag with more precise networking configuration by mapping only the necessary ports (e.g., 4222 for nats, 2379/2380 for etcd, 8000 for frontend).
 ## Build Support

--- a/examples/multimodal/README.md
+++ b/examples/multimodal/README.md
@@ -73,7 +73,7 @@ bash launch/agg.sh --model Qwen/Qwen2.5-VL-7B-Instruct
 In another terminal:
 ```bash
-curl http://localhost:8080/v1/chat/completions \
+curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
      "model": "llava-hf/llava-1.5-7b-hf",
@@ -146,7 +146,7 @@ bash launch/disagg.sh --model llava-hf/llava-1.5-7b-hf
 In another terminal:
 ```bash
-curl http://localhost:8080/v1/chat/completions \
+curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
      "model": "llava-hf/llava-1.5-7b-hf",
@@ -223,7 +223,7 @@ bash launch/agg_llama.sh
 In another terminal:
 ```bash
-curl http://localhost:8080/v1/chat/completions \
+curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
      "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
@@ -295,7 +295,7 @@ bash launch/disagg_llama.sh
 In another terminal:
 ```bash
-curl http://localhost:8080/v1/chat/completions \
+curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
      "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
@@ -366,7 +366,7 @@ bash launch/video_agg.sh
 In another terminal:
 ```bash
-curl http://localhost:8080/v1/chat/completions \
+curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
      "model": "llava-hf/LLaVA-NeXT-Video-7B-hf",
@@ -455,7 +455,7 @@ bash launch/video_disagg.sh
 In another terminal:
 ```bash
-curl http://localhost:8080/v1/chat/completions \
+curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
      "model": "llava-hf/LLaVA-NeXT-Video-7B-hf",

--- a/lib/runtime/examples/system_metrics/README.md
+++ b/lib/runtime/examples/system_metrics/README.md
@@ -185,7 +185,7 @@ DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 cargo run --bin system_server
 The server will start an system status server on the specified port (8081 in this example) that exposes the Prometheus metrics endpoint at `/metrics`.
-To Run an actual LLM frontend + server (aggregated example), launch both of them. By default, the frontend listens to port 8080.
+To Run an actual LLM frontend + server (aggregated example), launch both of them. By default, the frontend listens to port 8000.
 ```
 python -m dynamo.frontend &
@@ -202,5 +202,5 @@ Once running, you can query the metrics:
 curl http://localhost:8081/metrics | grep -E "dynamo_component"
 # Get all frontend metrics
-curl http://localhost:8080/metrics | grep -E "dynamo_frontend"
+curl http://localhost:8000/metrics | grep -E "dynamo_frontend"
 ```

--- a/tests/lmcache/README.md
+++ b/tests/lmcache/README.md
@@ -62,13 +62,13 @@ python3 summarize_scores_dynamo.py
 ### Baseline Architecture (deploy-baseline-dynamo.sh)
 ```
-HTTP Request → Dynamo Ingress(8080) → Dynamo Worker → Direct Inference
+HTTP Request → Dynamo Ingress(8000) → Dynamo Worker → Direct Inference
 Environment: ENABLE_LMCACHE=0
 ```
 ### LMCache Architecture (deploy-lmcache_enabled-dynamo.sh)
 ```
-HTTP Request → Dynamo Ingress(8080) → Dynamo Worker → LMCache-enabled Inference
+HTTP Request → Dynamo Ingress(8000) → Dynamo Worker → LMCache-enabled Inference
 Environment: ENABLE_LMCACHE=1
            LMCACHE_CHUNK_SIZE=256
            LMCACHE_LOCAL_CPU=True
@@ -80,7 +80,7 @@ Environment: ENABLE_LMCACHE=1
 Test scripts use Dynamo's Chat Completions API:
 ```bash
-curl -X POST http://localhost:8080/v1/chat/completions \
+curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": Qwen/Qwen3-0.6B,

--- a/tests/lmcache/mmlu-baseline-dynamo.py
+++ b/tests/lmcache/mmlu-baseline-dynamo.py
@@ -18,7 +18,7 @@
 # Reference: https://github.com/LMCache/LMCache/blob/dev/.buildkite/correctness/1-mmlu.py
 # ASSUMPTIONS:
-# 1. dynamo is running (default: localhost:8080) without LMCache
+# 1. dynamo is running (default: localhost:8000) without LMCache
 # 2. the mmlu dataset is in a "data" directory
 # 3. all invocations of this script should be run in the same directory
 #    (for later consolidation)

--- a/tests/lmcache/mmlu-lmcache_enabled-dynamo.py
+++ b/tests/lmcache/mmlu-lmcache_enabled-dynamo.py
@@ -17,7 +17,7 @@
 # Reference: https://github.com/LMCache/LMCache/blob/dev/.buildkite/correctness/2-mmlu.py
 # ASSUMPTIONS:
-# 1. dynamo is running (default: localhost:8080) with LMCache enabled
+# 1. dynamo is running (default: localhost:8000) with LMCache enabled
 # 2. the mmlu dataset is in a "data" directory
 # 3. all invocations of this script should be run in the same directory
 #    (for later consolidation)
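
A port rename like this one can be double-checked by scanning the touched files for leftover references to the old port. The snippet below is a minimal sketch of such a check, not part of this commit: the helper name `find_stale_port_lines` and the sample text are hypothetical, and the pattern matches `8080` only as a whole number so that the backend metrics port `8081` seen in the hunks above is not flagged.

```python
import re

# Match the old frontend port only as a standalone number,
# so neighbours such as 8081 or 18080 are left alone.
STALE_PORT = re.compile(r"(?<!\d)8080(?!\d)")


def find_stale_port_lines(text):
    """Return (line_number, line) pairs that still mention port 8080."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if STALE_PORT.search(line)
    ]


# Hypothetical sample: one updated line, one stale line, one unrelated port.
sample = "\n".join([
    "python -m dynamo.frontend --http-port 8000",
    "curl localhost:8080/v1/chat/completions",
    "curl http://localhost:8081/metrics",
])

for number, line in find_stale_port_lines(sample):
    print(f"{number}: {line}")
```

In practice the same function could be applied to the contents of each changed file; reading the files from disk is left out here to keep the sketch self-contained.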