Unverified Commit 6aa94b96 authored by Lianmin Zheng's avatar Lianmin Zheng Committed by GitHub

Update ci workflows (#1804)

parent c2650748
......@@ -10,24 +10,28 @@ The core features include:
- **Extensive Model Support**: Supports a wide range of generative models (Llama 3, Gemma 2, Mistral, Qwen, DeepSeek, LLaVA, etc.) and embedding models (e5-mistral), with easy extensibility for integrating new models.
- **Active Community**: SGLang is open-source and backed by an active community with industry adoption.
.. toctree::
:maxdepth: 1
:caption: Getting Started
install.md
send_request.ipynb
.. toctree::
:maxdepth: 1
:caption: Backend Tutorial
backend.md
.. toctree::
:maxdepth: 1
:caption: Frontend Tutorial
frontend.md
.. toctree::
:maxdepth: 1
:caption: References
......@@ -39,4 +43,3 @@ The core features include:
choices_methods.md
benchmark_and_profiling.md
troubleshooting.md
embedding_model.ipynb
......@@ -48,9 +48,9 @@ docker run --gpus all \
<summary>More</summary>
> This method is recommended if you plan to serve it as a service.
> A better approach is to use the [k8s-sglang-service.yaml](https://github.com/sgl-project/sglang/blob/main/docker/k8s-sglang-service.yaml).
1. Copy the [compose.yml](https://github.com/sgl-project/sglang/blob/main/docker/compose.yaml) to your local machine.
2. Execute the command `docker compose up -d` in your terminal.
</details>
......
ipykernel
ipywidgets
jupyter_client
markdown>=3.4.0
matplotlib
myst-parser
nbconvert
nbsphinx
pandoc
pillow
pydantic
sphinx
sphinx-book-theme
sphinx-copybutton
sphinx-tabs
sphinxcontrib-mermaid
urllib3<2.0.0
......@@ -194,7 +194,7 @@ Since we compute penalty algorithms through CUDA, the logic stores relevant para
You can run your own benchmark with the desired parameters on your own hardware to make sure it does not OOM before deploying.
Tuning `--mem-fraction-static` and/or `--max-running-requests` will help.
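As a rough, hypothetical illustration of why capping `--max-running-requests` bounds this memory (the shapes and dtype below are assumptions for illustration, not SGLang's actual layout):

```python
# Back-of-envelope sketch: assume the penalty logic keeps roughly one
# float32 per vocabulary entry per running request. These shapes are
# illustrative assumptions, not SGLang's real data layout.
def penalty_buffer_mb(max_running_requests: int,
                      vocab_size: int,
                      bytes_per_elem: int = 4) -> float:
    """Estimated size of the per-request penalty buffer, in MB."""
    return max_running_requests * vocab_size * bytes_per_elem / 1e6

# e.g. 256 concurrent requests with a ~128k-token vocabulary
print(round(penalty_buffer_mb(256, 128_256), 1))
```

Under these assumptions the buffer scales linearly with the request cap, so halving `--max-running-requests` roughly halves this allocation.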
### Benchmarks
......
python3 -m http.server -d _build/html
......@@ -5,9 +5,9 @@ This page lists some common errors and tips for fixing them.
## CUDA error: an illegal memory access was encountered
This error may be due to kernel errors or out-of-memory issues.
- If it is a kernel error, it is not easy to fix.
- If it is out-of-memory, sometimes it will report this error instead of "Out-of-memory." In this case, try setting a smaller value for `--mem-fraction-static`. The default value of `--mem-fraction-static` is around 0.8 - 0.9.
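To see how much headroom a smaller fraction buys, here is a toy calculation (the 80 GB GPU figure is an assumed example, not a requirement):

```python
# Toy arithmetic: --mem-fraction-static reserves that fraction of GPU
# memory for the static pool (weights + KV cache); the remainder is
# headroom for activations and other transient allocations.
# Numbers here are illustrative only.
def headroom_gb(total_gpu_gb: float, mem_fraction_static: float) -> float:
    return total_gpu_gb * (1.0 - mem_fraction_static)

print(round(headroom_gb(80.0, 0.85), 1))  # higher fraction, less headroom
print(round(headroom_gb(80.0, 0.70), 1))  # lower fraction, more headroom
```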
## The server hangs
If the server hangs, try disabling some optimizations when launching the server.
- Add `--disable-cuda-graph`.
- Add `--disable-flashinfer-sampling`.
- Add `--sampling-backend pytorch`.
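The flags above can be combined into one conservative launch command. The sketch below just assembles the argument list; the model path is a placeholder, and it assumes the standard `sglang.launch_server` entry point:

```python
# Assemble a conservative debug launch command using the flags above.
# The model path is a placeholder; substitute your own.
def debug_launch_cmd(model_path: str) -> list[str]:
    return [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--disable-cuda-graph",           # rule out CUDA graph issues
        "--disable-flashinfer-sampling",  # rule out FlashInfer sampling kernels
        "--sampling-backend", "pytorch",  # fall back to PyTorch sampling
    ]

print(" ".join(debug_launch_cmd("meta-llama/Meta-Llama-3-8B-Instruct")))
```

If the hang disappears, re-enable the optimizations one at a time to isolate the culprit.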
pip install --upgrade pip
pip install -e "python[all]"
pip install transformers==4.45.2
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/ --force-reinstall