Update ci workflows (#1804)

Commit 6aa94b96, authored Oct 26, 2024 by Lianmin Zheng; committed by GitHub Oct 26, 2024
13 changed files
--- a/docs/en/hyperparameter_tuning.md
+++ b/docs/en/hyperparameter_tuning.md
--- a/docs/en/index.rst
+++ b/docs/en/index.rst
@@ -10,24 +10,28 @@ The core features include:
 - **Extensive Model Support**: Supports a wide range of generative models (Llama 3, Gemma 2, Mistral, QWen, DeepSeek, LLaVA, etc.) and embedding models (e5-mistral), with easy extensibility for integrating new models.
 - **Active Community**: SGLang is open-source and backed by an active community with industry adoption.

+
 .. toctree::
   :maxdepth: 1
   :caption: Getting Started

   install.md
-   send_request.ipynb
+

 .. toctree::
   :maxdepth: 1
   :caption: Backend Tutorial
+
   backend.md


 .. toctree::
   :maxdepth: 1
   :caption: Frontend Tutorial
+
   frontend.md

+
 .. toctree::
   :maxdepth: 1
   :caption: References
@@ -39,4 +43,3 @@ The core features include:
   choices_methods.md
   benchmark_and_profiling.md
   troubleshooting.md
-   embedding_model.ipynb
\ No newline at end of file
--- a/docs/en/install.md
+++ b/docs/en/install.md
@@ -48,9 +48,9 @@ docker run --gpus all \
 <summary>More</summary>

 > This method is recommended if you plan to serve it as a service.
-> A better approach is to use the [k8s-sglang-service.yaml](./docker/k8s-sglang-service.yaml).
+> A better approach is to use the [k8s-sglang-service.yaml](https://github.com/sgl-project/sglang/blob/main/docker/k8s-sglang-service.yaml).

-1. Copy the [compose.yml](./docker/compose.yaml) to your local machine
+1. Copy the [compose.yml](https://github.com/sgl-project/sglang/blob/main/docker/compose.yaml) to your local machine
 2. Execute the command `docker compose up -d` in your terminal.
 </details>


--- a/docs/en/model_support.md
+++ b/docs/en/model_support.md
--- a/docs/en/release_process.md
+++ b/docs/en/release_process.md
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
+ipykernel
+ipywidgets
+jupyter_client
 markdown>=3.4.0
+matplotlib
 myst-parser
+nbconvert
+nbsphinx
+pandoc
+pillow
+pydantic
 sphinx
 sphinx-book-theme
 sphinx-copybutton
 sphinx-tabs
 sphinxcontrib-mermaid
-pillow
-pydantic
-urllib3<2.0.0
-nbsphinx
-pandoc
\ No newline at end of file
+urllib3<2.0.0
\ No newline at end of file
--- a/docs/en/sampling_params.md
+++ b/docs/en/sampling_params.md
@@ -194,7 +194,7 @@ Since we compute penalty algorithms through CUDA, the logic stores relevant para

 You can run your own benchmark with desired parameters on your own hardware to make sure it's not OOMing before using.

-Tuning `--mem-fraction-static` and/or `--max-running-requests` will help. See [here](hyperparameter_tuning.md#minor-tune---max-prefill-tokens---mem-fraction-static---max-running-requests) for more information.
+Tuning `--mem-fraction-static` and/or `--max-running-requests` will help.

 ### Benchmarks


--- a/docs/en/send_request.ipynb
+++ b/docs/en/send_request.ipynb
--- a/docs/serve.sh
+++ b/docs/serve.sh
+python3 -m http.server --d _build/html
--- a/docs/en/setup_github_runner.md
+++ b/docs/en/setup_github_runner.md
--- a/docs/en/troubleshooting.md
+++ b/docs/en/troubleshooting.md
@@ -5,9 +5,9 @@ This page lists some common errors and tips for fixing them.
 ## CUDA error: an illegal memory access was encountered
 This error may be due to kernel errors or out-of-memory issues.
 - If it is a kernel error, it is not easy to fix.
-- If it is out-of-memory, sometimes it will report this error instead of "Out-of-memory." In this case, try setting a smaller value for `--mem-fraction-static`. The default value of `--mem-fraction-static` is around 0.8 - 0.9. https://github.com/sgl-project/sglang/blob/1edd4e07d6ad52f4f63e7f6beaa5987c1e1cf621/python/sglang/srt/server_args.py#L92-L102
+- If it is out-of-memory, sometimes it will report this error instead of "Out-of-memory." In this case, try setting a smaller value for `--mem-fraction-static`. The default value of `--mem-fraction-static` is around 0.8 - 0.9.

 ## The server hangs
 If the server hangs, try disabling some optimizations when launching the server.
 - Add `--disable-cuda-graph`.
-- Add `--disable-flashinfer-sampling`.
+- Add `--sampling-backend pytorch`.
--- a/scripts/ci_install_dependency.sh
+++ b/scripts/ci_install_dependency.sh
+pip install --upgrade pip
+pip install -e "python[all]"
+pip install transformers==4.45.2
+pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/ --force-reinstall
--- a/test/killall_sglang.sh
+++ b/test/killall_sglang.sh