Unverified Commit 6aa94b96, authored by Lianmin Zheng, committed by GitHub

Update ci workflows (#1804)

parent c2650748
@@ -10,24 +10,28 @@ The core features include:
 - **Extensive Model Support**: Supports a wide range of generative models (Llama 3, Gemma 2, Mistral, QWen, DeepSeek, LLaVA, etc.) and embedding models (e5-mistral), with easy extensibility for integrating new models.
 - **Active Community**: SGLang is open-source and backed by an active community with industry adoption.

 .. toctree::
    :maxdepth: 1
    :caption: Getting Started

    install.md
+   send_request.ipynb

 .. toctree::
    :maxdepth: 1
    :caption: Backend Tutorial

    backend.md

 .. toctree::
    :maxdepth: 1
    :caption: Frontend Tutorial

    frontend.md

 .. toctree::
    :maxdepth: 1
    :caption: References
@@ -39,4 +43,3 @@ The core features include:
    choices_methods.md
    benchmark_and_profiling.md
    troubleshooting.md
-   embedding_model.ipynb
\ No newline at end of file
@@ -48,9 +48,9 @@ docker run --gpus all \
 <summary>More</summary>

 > This method is recommended if you plan to serve it as a service.
-> A better approach is to use the [k8s-sglang-service.yaml](./docker/k8s-sglang-service.yaml).
+> A better approach is to use the [k8s-sglang-service.yaml](https://github.com/sgl-project/sglang/blob/main/docker/k8s-sglang-service.yaml).

-1. Copy the [compose.yml](./docker/compose.yaml) to your local machine
+1. Copy the [compose.yml](https://github.com/sgl-project/sglang/blob/main/docker/compose.yaml) to your local machine
 2. Execute the command `docker compose up -d` in your terminal.
 </details>
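To make the two compose steps above concrete, here is a minimal sketch; the raw-file URL is inferred from the blob link and is an assumption, as is the use of `curl`:

```bash
# Fetch the compose file and start the server detached (URL inferred from the blob link above).
curl -LO https://raw.githubusercontent.com/sgl-project/sglang/main/docker/compose.yaml
docker compose -f compose.yaml up -d
docker compose logs -f        # follow startup logs

# For a cluster, the Kubernetes manifest mentioned above would be applied the standard way:
kubectl apply -f k8s-sglang-service.yaml
```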
+ipykernel
+ipywidgets
+jupyter_client
 markdown>=3.4.0
+matplotlib
 myst-parser
+nbconvert
+nbsphinx
+pandoc
+pillow
+pydantic
 sphinx
 sphinx-book-theme
 sphinx-copybutton
 sphinx-tabs
 sphinxcontrib-mermaid
-pillow
-pydantic
-urllib3<2.0.0
-nbsphinx
-pandoc
\ No newline at end of file
+urllib3<2.0.0
\ No newline at end of file
@@ -194,7 +194,7 @@ Since we compute penalty algorithms through CUDA, the logic stores relevant para
 You can run your own benchmark with desired parameters on your own hardware to make sure it's not OOMing before using.
-Tuning `--mem-fraction-static` and/or `--max-running-requests` will help. See [here](hyperparameter_tuning.md#minor-tune---max-prefill-tokens---mem-fraction-static---max-running-requests) for more information.
+Tuning `--mem-fraction-static` and/or `--max-running-requests` will help.
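As a concrete illustration of such tuning, a launch might look like the sketch below; the model path and both values are illustrative placeholders, not recommendations from this commit:

```bash
# Shrink the static KV-cache pool and cap concurrent requests, then re-run your benchmark.
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --mem-fraction-static 0.8 --max-running-requests 64
```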
 ### Benchmarks
python3 -m http.server -d _build/html
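For context, building and previewing the docs with the dependencies listed above might look like this; the `make html` target and the port are assumptions based on a standard Sphinx layout, not taken from this commit:

```bash
pip install -r requirements.txt               # doc-build dependencies from the diff above
make html                                     # assumes the standard Sphinx Makefile
python3 -m http.server -d _build/html 8000    # preview at http://localhost:8000
```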
@@ -5,9 +5,9 @@ This page lists some common errors and tips for fixing them.
 ## CUDA error: an illegal memory access was encountered
 This error may be due to kernel errors or out-of-memory issues.
 - If it is a kernel error, it is not easy to fix.
-- If it is out-of-memory, sometimes it will report this error instead of "Out-of-memory." In this case, try setting a smaller value for `--mem-fraction-static`. The default value of `--mem-fraction-static` is around 0.8 - 0.9. https://github.com/sgl-project/sglang/blob/1edd4e07d6ad52f4f63e7f6beaa5987c1e1cf621/python/sglang/srt/server_args.py#L92-L102
+- If it is out-of-memory, sometimes it will report this error instead of "Out-of-memory." In this case, try setting a smaller value for `--mem-fraction-static`. The default value of `--mem-fraction-static` is around 0.8 - 0.9.
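For example, a relaunch with a reduced fraction leaves more headroom for activations and CUDA graphs; the value 0.7 is only an illustrative starting point:

```bash
# Hypothetical relaunch with a smaller static memory fraction (0.7 is illustrative).
python3 -m sglang.launch_server --model-path <your-model> --mem-fraction-static 0.7
```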
 ## The server hangs
 If the server hangs, try disabling some optimizations when launching the server.
 - Add `--disable-cuda-graph`.
-- Add `--disable-flashinfer-sampling`.
+- Add `--sampling-backend pytorch`.
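Combining both flags when relaunching helps isolate which optimization causes the hang; a minimal sketch using the flags named above:

```bash
# Relaunch with CUDA graphs disabled and the PyTorch sampling backend.
python3 -m sglang.launch_server --model-path <your-model> \
  --disable-cuda-graph --sampling-backend pytorch
```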
# Dependency installation steps from the updated CI workflow:
pip install --upgrade pip
pip install -e "python[all]"       # install SGLang from the local checkout with all extras
pip install transformers==4.45.2   # pin transformers for reproducible CI runs
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/ --force-reinstall   # FlashInfer wheel for CUDA 12.1 / torch 2.4
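A quick import check after these steps can catch a broken environment before the test jobs run; this is a hypothetical addition, not part of the workflow:

```bash
# Hypothetical post-install sanity check (not from this commit).
python3 -c "import sglang; print(sglang.__version__)"
python3 -c "import flashinfer; print('flashinfer imports OK')"
```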