Unverified commit 2449a0af authored by Lianmin Zheng, committed by GitHub

Refactor the docs (#9031)

parent 0f229c07
Multi-Node Deployment
=====================

.. toctree::
   :maxdepth: 1
   :caption: Multi-Node Deployment

   multi_node.md
   deploy_on_k8s.md
   lws_pd/lws_pd_deploy.md

- `Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism on 96 H100 GPUs <https://lmsys.org/blog/2025-05-05-large-scale-ep/>`_
- `Deploying Kimi K2 with PD Disaggregation and Large-Scale Expert Parallelism on 128 H200 GPUs <https://lmsys.org/blog/2025-07-20-k2-large-scale-ep/>`_
Performance Analysis & Optimization
===================================

.. toctree::
   :maxdepth: 1

   benchmark_and_profiling.md
   accuracy_evaluation.md
# Production Metrics
SGLang exposes the following metrics via Prometheus. The metrics are namespaced by `$name` (the model name). You can enable them by adding `--enable-metrics` when you launch the server.
An example of the monitoring dashboard is available in [examples/monitoring/grafana.json](https://github.com/sgl-project/sglang/blob/main/examples/monitoring/grafana/dashboards/json/sglang-dashboard.json).
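For example, a minimal sketch (the model path is illustrative; adjust the flags to your deployment):

```shell
# Launch the server with Prometheus metrics enabled
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --port 30000 \
  --enable-metrics

# Scrape the metrics endpoint exposed by the server
curl http://localhost:30000/metrics
```

The Grafana dashboard linked above can then be imported against a Prometheus instance that scrapes this endpoint.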
......
# Troubleshooting
This page lists common errors and tips for resolving them.
## CUDA Out of Memory
If you encounter out-of-memory (OOM) errors, you can adjust the following parameters (an example launch command combining them follows this list):
- If OOM occurs during prefill, try reducing `--chunked-prefill-size` to `4096` or `2048`. This saves memory but slows down the prefill speed for long prompts.
- If OOM occurs during decoding, try lowering `--max-running-requests`.
- You can also reduce `--mem-fraction-static` to a smaller value, such as 0.8 or 0.7. This decreases the memory usage of the KV cache memory pool and helps prevent OOM errors during both prefill and decoding. However, it limits maximum concurrency and reduces peak throughput.
- Another common cause of OOM is requesting input logprobs for a long prompt, as this requires significant memory. To address it, set `logprob_start_len` in your sampling parameters to include only the necessary parts. If you do need input logprobs for a long prompt, try reducing `--mem-fraction-static`.
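For example, a launch command combining these adjustments (the model path and values are illustrative; tune them for your workload):

```shell
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --chunked-prefill-size 4096 \
  --max-running-requests 128 \
  --mem-fraction-static 0.8
```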
## CUDA Error: Illegal Memory Access Encountered
This error may result from kernel errors or out-of-memory issues:
- If it is a kernel error, resolving it may be challenging. Please file an issue on GitHub.
- If it is an out-of-memory issue, it may sometimes be reported as this error instead of "Out of Memory." Refer to the section above for guidance on avoiding OOM issues.
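When preparing a kernel-error report, it can help to reproduce the failure with synchronous CUDA kernel launches so the stack trace points at the offending kernel. This is a generic CUDA debugging technique, not an SGLang-specific flag, and the model path below is illustrative:

```shell
# Re-run the failing command with synchronous kernel launches
CUDA_LAUNCH_BLOCKING=1 python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --port 30000
```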
......@@ -4,23 +4,23 @@
## Example Launch Command
By default, SGLang uses its own model implementation when one is available and falls back to the Transformers implementation otherwise. You can also force the Transformers implementation by setting `--model-impl` to `transformers`.
```shell
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --host 0.0.0.0 \
  --port 30000 \
  --model-impl transformers
```
## Supported features
### Quantization
The Transformers fallback supports most of the quantization methods available in SGLang (except GGUF). See the [Quantization page](https://docs.sglang.ai/backend/quantization.html) for more information about supported quantization in SGLang.
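For instance, a hedged sketch of serving a model through the fallback with on-the-fly FP8 quantization (whether a given quantization method applies to a given checkpoint depends on the model; see the page linked above):

```shell
# Illustrative only: the model path and quantization choice are placeholders
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --model-impl transformers \
  --quantization fp8 \
  --port 30000
```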
### Remote code
This fallback also means that any model on the Hugging Face Hub that works in `transformers` with `trust_remote_code=True` and correctly implements attention can be used in production!
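For example, a hypothetical launch of a Hub model that ships custom modeling code (the model name below is a placeholder):

```shell
python3 -m sglang.launch_server \
  --model-path your-org/custom-remote-code-model \
  --model-impl transformers \
  --trust-remote-code \
  --port 30000
```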
......
......@@ -32,16 +32,20 @@ from sglang.lang.choices import (
token_length_normalized,
unconditional_likelihood_normalized,
)
# Lazy import some libraries
from sglang.utils import LazyImport
from sglang.version import __version__
Anthropic = LazyImport("sglang.lang.backend.anthropic", "Anthropic")
LiteLLM = LazyImport("sglang.lang.backend.litellm", "LiteLLM")
OpenAI = LazyImport("sglang.lang.backend.openai", "OpenAI")
VertexAI = LazyImport("sglang.lang.backend.vertexai", "VertexAI")
# Runtime Engine APIs
ServerArgs = LazyImport("sglang.srt.server_args", "ServerArgs")
Engine = LazyImport("sglang.srt.entrypoints.engine", "Engine")
__all__ = [
"Engine",
"Runtime",
......
......@@ -2175,10 +2175,6 @@ class ServerArgs:
self.mem_fraction_static = (
original_server_arg_mem_fraction * final_overall_factor
)
logger.warning(
f"Multimodal model: Dynamically adjusted --mem-fraction-static "
f"from: {original_server_arg_mem_fraction:.3f} to: {self.mem_fraction_static:.3f}."
)
def prepare_server_args(argv: List[str]) -> ServerArgs:
......
"""
Do some monkey patching to make the documentation compilation faster and more reliable.
- Avoid port conflicts
- Reduce the server launch time
"""
import weakref
import nest_asyncio
nest_asyncio.apply()
import sglang.srt.server_args as server_args_mod
from sglang.utils import execute_shell_command, reserve_port
DEFAULT_MAX_RUNNING_REQUESTS = 128
DEFAULT_MAX_TOTAL_TOKENS = 20480  # To allow multiple servers on the same machine
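
# Save a reference to the original __post_init__ before monkey-patching it below.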
_original_post_init = server_args_mod.ServerArgs.__post_init__
......@@ -20,7 +26,7 @@ def patched_post_init(self):
self.max_running_requests = DEFAULT_MAX_RUNNING_REQUESTS
if self.max_total_tokens is None:
self.max_total_tokens = DEFAULT_MAX_TOTAL_TOKENS
self.cuda_graph_max_bs = 4
server_args_mod.ServerArgs.__post_init__ = patched_post_init
......@@ -41,7 +47,7 @@ def launch_server_cmd(command: str, host: str = "0.0.0.0", port: int = None):
    extra_flags = (
        f"--max-running-requests {DEFAULT_MAX_RUNNING_REQUESTS} "
        f"--max-total-tokens {DEFAULT_MAX_TOTAL_TOKENS} "
        f"--cuda-graph-max-bs 4"
    )

    full_command = f"{command} --port {port} {extra_flags}"
......
......@@ -458,7 +458,7 @@ def wait_for_server(base_url: str, timeout: int = None) -> None:
NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
We are running those notebooks in a CI environment, so the throughput is not representative of the actual performance.
"""
)
break
......
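For reference, a hypothetical usage sketch of these helpers as they appear in the docs notebooks (the import path, helper names, and model path are assumptions; the real notebooks may import them differently):

```python
from sglang.utils import launch_server_cmd, terminate_process, wait_for_server

# Start a small server with the doc-friendly defaults patched in above;
# the helper reserves a free port to avoid conflicts in parallel CI runs.
server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --host 0.0.0.0"
)

# Block until the server is ready to accept requests.
wait_for_server(f"http://localhost:{port}")

# ... run the notebook cells against http://localhost:{port} ...

terminate_process(server_process)
```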
......@@ -19,22 +19,15 @@ python3 run_suite.py --suite per-commit
## Test Frontend Language
```bash
cd sglang/test/lang
export OPENAI_API_KEY=sk-*****

# Run a single file
python3 test_openai_backend.py
python3 test_srt_backend.py

# Run a single test
python3 -m unittest test_openai_backend.TestOpenAIServer.test_few_shot_qa

# Run a suite with multiple files
python3 run_suite.py --suite per-commit
```
## Adding or Updating Tests in CI
- Create new test files under `test/srt` or `test/lang` depending on the type of test.
- Ensure they are referenced in the respective `run_suite.py` (e.g., `test/srt/run_suite.py` or `test/lang/run_suite.py`) so they're picked up in CI. For most small test cases, they can be added to the `per-commit` suite. Keep the test cases sorted alphabetically.
- The CI runs the `per-commit` and `nightly` suites automatically. If you need special setup or custom test groups, you may modify the workflows in [`.github/workflows/`](https://github.com/sgl-project/sglang/tree/main/.github/workflows).
......@@ -45,3 +38,4 @@ python3 run_suite.py --suite per-commit
- Give tests descriptive names reflecting their purpose.
- Use robust assertions (e.g., assert, unittest methods) to validate outcomes.
- Clean up resources to avoid side effects and preserve test independence.
- Reduce the test time by using smaller models and reusing the server for multiple test cases.
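As a rough illustration of these conventions (the file, class, and test names are hypothetical; real suites typically reuse shared helpers from the test utilities):

```python
import unittest


class TestHypotheticalFeature(unittest.TestCase):
    """Hypothetical example following the guidelines above."""

    @classmethod
    def setUpClass(cls):
        # Launch (or reuse) one small-model server for every test in this class
        # to keep CI time low; the setup details depend on the suite.
        cls.base_url = "http://127.0.0.1:30000"

    def test_basic_generation_returns_text(self):
        # Descriptive name plus robust assertions on the outcome.
        response_text = "dummy response"  # placeholder for a real client call
        self.assertIsInstance(response_text, str)
        self.assertGreater(len(response_text), 0)

    @classmethod
    def tearDownClass(cls):
        # Clean up resources so later tests are unaffected.
        pass


if __name__ == "__main__":
    unittest.main()
```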