Unverified commit 2449a0af authored by Lianmin Zheng, committed by GitHub

Refactor the docs (#9031)

parent 0f229c07
Multi-Node Deployment
=====================

.. toctree::
   :maxdepth: 1
   :caption: Multi-Node Deployment

   multi_node.md
   deploy_on_k8s.md
   lws_pd/lws_pd_deploy.md

- `Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism on 96 H100 GPUs <https://lmsys.org/blog/2025-05-05-large-scale-ep/>`_
- `Deploying Kimi K2 with PD Disaggregation and Large-Scale Expert Parallelism on 128 H200 GPUs <https://lmsys.org/blog/2025-07-20-k2-large-scale-ep/>`_
Performance Analysis & Optimization
===================================

.. toctree::
   :maxdepth: 1

   benchmark_and_profiling.md
   accuracy_evaluation.md
# Production Metrics
SGLang exposes the following metrics via Prometheus. The metrics are namespaced by `$name` (the model name). You can enable them by adding `--enable-metrics` when you launch the server.
An example of the monitoring dashboard is available in [examples/monitoring/grafana.json](https://github.com/sgl-project/sglang/blob/main/examples/monitoring/grafana/dashboards/json/sglang-dashboard.json).
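For example, a minimal sketch (the model path is illustrative; adjust the flags to your deployment):

```shell
# Launch the server with Prometheus metrics enabled
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --port 30000 \
  --enable-metrics

# Scrape the metrics endpoint exposed by the server
curl http://localhost:30000/metrics
```

The Grafana dashboard linked above can then be imported against a Prometheus instance that scrapes this endpoint.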
......
# Troubleshooting
This page lists common errors and tips for resolving them.
## CUDA Out of Memory
If you encounter out-of-memory (OOM) errors, you can adjust the following parameters (an example launch command combining them follows this list):
- If OOM occurs during prefill, try reducing `--chunked-prefill-size` to `4096` or `2048`. This saves memory but slows down the prefill speed for long prompts.
- If OOM occurs during decoding, try lowering `--max-running-requests`.
- You can also reduce `--mem-fraction-static` to a smaller value, such as 0.8 or 0.7. This decreases the memory usage of the KV cache memory pool and helps prevent OOM errors during both prefill and decoding. However, it limits maximum concurrency and reduces peak throughput.
- Another common cause of OOM is requesting input logprobs for a long prompt, as this requires significant memory. To address it, set `logprob_start_len` in your sampling parameters to include only the necessary parts. If you do need input logprobs for a long prompt, try reducing `--mem-fraction-static`.
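For example, a launch command combining these adjustments (the model path and values are illustrative; tune them for your workload):

```shell
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --chunked-prefill-size 4096 \
  --max-running-requests 128 \
  --mem-fraction-static 0.8
```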
## CUDA Error: Illegal Memory Access Encountered
This error may result from kernel errors or out-of-memory issues:
- If it is a kernel error, resolving it may be challenging. Please file an issue on GitHub.
- If it is an out-of-memory issue, it may sometimes be reported as this error instead of "Out of Memory." Refer to the section above for guidance on avoiding OOM issues.
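When preparing a kernel-error report, it can help to reproduce the failure with synchronous CUDA kernel launches so the stack trace points at the offending kernel. This is a generic CUDA debugging technique, not an SGLang-specific flag, and the model path below is illustrative:

```shell
# Re-run the failing command with synchronous kernel launches
CUDA_LAUNCH_BLOCKING=1 python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --port 30000
```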
......@@ -4,23 +4,23 @@
## Example Launch Command
By default, SGLang uses its own model implementation when one is available and falls back to the Transformers implementation otherwise. You can also force the Transformers implementation by setting `--model-impl` to `transformers`.
```shell
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --host 0.0.0.0 \
  --port 30000 \
  --model-impl transformers
```
## Supported features
### Quantization
The Transformers fallback supports most of the quantization methods available in SGLang (except GGUF). See the [Quantization page](https://docs.sglang.ai/backend/quantization.html) for more information about supported quantization in SGLang.
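For instance, a hedged sketch of serving a model through the fallback with on-the-fly FP8 quantization (whether a given quantization method applies to a given checkpoint depends on the model; see the page linked above):

```shell
# Illustrative only: the model path and quantization choice are placeholders
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --model-impl transformers \
  --quantization fp8 \
  --port 30000
```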
### Remote code
This fallback also means that any model on the Hugging Face Hub that works in `transformers` with `trust_remote_code=True` and correctly implements attention can be used in production!
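For example, a hypothetical launch of a Hub model that ships custom modeling code (the model name below is a placeholder):

```shell
python3 -m sglang.launch_server \
  --model-path your-org/custom-remote-code-model \
  --model-impl transformers \
  --trust-remote-code \
  --port 30000
```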
......
......@@ -32,16 +32,20 @@ from sglang.lang.choices import (
token_length_normalized,
unconditional_likelihood_normalized,
)
# Lazy import some libraries
from sglang.utils import LazyImport
from sglang.version import __version__
Anthropic = LazyImport("sglang.lang.backend.anthropic", "Anthropic")
LiteLLM = LazyImport("sglang.lang.backend.litellm", "LiteLLM")
OpenAI = LazyImport("sglang.lang.backend.openai", "OpenAI")
VertexAI = LazyImport("sglang.lang.backend.vertexai", "VertexAI")
# Runtime Engine APIs
ServerArgs = LazyImport("sglang.srt.server_args", "ServerArgs")
Engine = LazyImport("sglang.srt.entrypoints.engine", "Engine")
__all__ = [
"Engine",
"Runtime",
......
......@@ -2175,10 +2175,6 @@ class ServerArgs:
self.mem_fraction_static = (
original_server_arg_mem_fraction * final_overall_factor
)
logger.warning(
f"Multimodal model: Dynamically adjusted --mem-fraction-static "
f"from: {original_server_arg_mem_fraction:.3f} to: {self.mem_fraction_static:.3f}."
)
def prepare_server_args(argv: List[str]) -> ServerArgs:
......
"""
Do some monkey patching to make the documentation compilation faster and more reliable.
- Avoid port conflicts
- Reduce the server launch time
"""
import weakref
import nest_asyncio
nest_asyncio.apply()
import sglang.srt.server_args as server_args_mod
from sglang.utils import execute_shell_command, reserve_port
DEFAULT_MAX_RUNNING_REQUESTS = 128
DEFAULT_MAX_TOTAL_TOKENS = 20480  # To allow multiple servers on the same machine
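
# Save a reference to the original __post_init__ before monkey-patching it below.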
_original_post_init = server_args_mod.ServerArgs.__post_init__
......@@ -20,7 +26,7 @@ def patched_post_init(self):
self.max_running_requests = DEFAULT_MAX_RUNNING_REQUESTS
if self.max_total_tokens is None:
self.max_total_tokens = DEFAULT_MAX_TOTAL_TOKENS
self.cuda_graph_max_bs = 4
server_args_mod.ServerArgs.__post_init__ = patched_post_init
......@@ -41,7 +47,7 @@ def launch_server_cmd(command: str, host: str = "0.0.0.0", port: int = None):
    extra_flags = (
        f"--max-running-requests {DEFAULT_MAX_RUNNING_REQUESTS} "
        f"--max-total-tokens {DEFAULT_MAX_TOTAL_TOKENS} "
        f"--cuda-graph-max-bs 4"
    )

    full_command = f"{command} --port {port} {extra_flags}"
......
......@@ -458,7 +458,7 @@ def wait_for_server(base_url: str, timeout: int = None) -> None:
NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
We are running those notebooks in a CI environment, so the throughput is not representative of the actual performance.
"""
)
break
......
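For reference, a hypothetical usage sketch of these helpers as they appear in the docs notebooks (the import path, helper names, and model path are assumptions; the real notebooks may import them differently):

```python
from sglang.utils import launch_server_cmd, terminate_process, wait_for_server

# Start a small server with the doc-friendly defaults patched in above;
# the helper reserves a free port to avoid conflicts in parallel CI runs.
server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --host 0.0.0.0"
)

# Block until the server is ready to accept requests.
wait_for_server(f"http://localhost:{port}")

# ... run the notebook cells against http://localhost:{port} ...

terminate_process(server_process)
```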
......@@ -19,22 +19,15 @@ python3 run_suite.py --suite per-commit
## Test Frontend Language
```bash
cd sglang/test/lang
export OPENAI_API_KEY=sk-*****

# Run a single file
python3 test_openai_backend.py
python3 test_srt_backend.py

# Run a single test
python3 -m unittest test_openai_backend.TestOpenAIServer.test_few_shot_qa

# Run a suite with multiple files
python3 run_suite.py --suite per-commit
```
## Adding or Updating Tests in CI
- Create new test files under `test/srt` or `test/lang` depending on the type of test.
- Ensure they are referenced in the respective `run_suite.py` (e.g., `test/srt/run_suite.py` or `test/lang/run_suite.py`) so they're picked up in CI. For most small test cases, they can be added to the `per-commit` suite. Keep the test cases sorted alphabetically.
- The CI runs the `per-commit` and `nightly` suites automatically. If you need special setup or custom test groups, you may modify the workflows in [`.github/workflows/`](https://github.com/sgl-project/sglang/tree/main/.github/workflows).
......@@ -45,3 +38,4 @@ python3 run_suite.py --suite per-commit
- Give tests descriptive names reflecting their purpose.
- Use robust assertions (e.g., assert, unittest methods) to validate outcomes.
- Clean up resources to avoid side effects and preserve test independence.
- Reduce the test time by using smaller models and reusing the server for multiple test cases.
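As a rough illustration of these conventions (the file, class, and test names are hypothetical; real suites typically reuse shared helpers from the test utilities):

```python
import unittest


class TestHypotheticalFeature(unittest.TestCase):
    """Hypothetical example following the guidelines above."""

    @classmethod
    def setUpClass(cls):
        # Launch (or reuse) one small-model server for every test in this class
        # to keep CI time low; the setup details depend on the suite.
        cls.base_url = "http://127.0.0.1:30000"

    def test_basic_generation_returns_text(self):
        # Descriptive name plus robust assertions on the outcome.
        response_text = "dummy response"  # placeholder for a real client call
        self.assertIsInstance(response_text, str)
        self.assertGreater(len(response_text), 0)

    @classmethod
    def tearDownClass(cls):
        # Clean up resources so later tests are unaffected.
        pass


if __name__ == "__main__":
    unittest.main()
```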