Commit e71d4ab3 authored by Lianmin Zheng

Update docs (#12)

parent fbf42263
````diff
@@ -153,10 +153,10 @@ def image_qa(s, image_file, question):
 ### Constrained Decoding
 ```python
-@function
+@sgl.function
 def regular_expression_gen(s):
     s += "Q: What is the IP address of the Google DNS servers?\n"
-    s += "A: " + gen(
+    s += "A: " + sgl.gen(
         "answer",
         temperature=0,
         regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?).){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
@@ -197,7 +197,7 @@ for out in state.text_iter():
 ## Backend: SGLang Runtime (SRT)
 The SGLang Runtime (SRT) is designed to work best with the SGLang frontend.
 However, it can also be used as a standalone API server.
-In this case, the RadixAttention can still greatly accelerate many use cases.
+In this case, the [RadixAttention](https://arxiv.org/abs/2312.07104) can still greatly accelerate many use cases.
 ### Usage
 Launch a server
@@ -237,7 +237,7 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port
 - Mixtral-8x7B on NVIDIA A10G, FP16, Tensor Parallelism=8
 ![mixtral_8x7b](assets/mixtral_8x7b.jpg)
-Learn more [here]().
+Learn more [here](docs/benchmark_results.md).
 ## Roadmap
 - [ ] Function call
...
````
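For reference, the renamed frontend API in the first hunk (`sgl.function`, `sgl.gen`) can be exercised end to end against a locally launched SRT server. The following is a minimal sketch, not part of the commit: it assumes the launch command from the README is running on port 30000 and uses the `sglang` helpers `sgl.set_default_backend` and `sgl.RuntimeEndpoint`.

```python
# Minimal usage sketch (assumes an SRT server is already running, e.g. via
#   python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000).
import sglang as sgl

# Point the frontend at the standalone SRT endpoint; the port is an assumption.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))


@sgl.function
def regular_expression_gen(s):
    s += "Q: What is the IP address of the Google DNS servers?\n"
    s += "A: " + sgl.gen(
        "answer",
        temperature=0,
        regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?).){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
    )


state = regular_expression_gen.run()
print(state["answer"])  # the decoded text is constrained to match the IP-address regex
```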
New file `docs/benchmark_results.md`:

## Benchmark Results
We tested our system on the following common LLM workloads and reported the achieved throughput:
- **[MMLU](https://arxiv.org/abs/2009.03300)**: A 5-shot, multi-choice, multi-task benchmark.
- **[HellaSwag](https://arxiv.org/abs/1905.07830)**: A 20-shot, multi-choice sentence completion benchmark.
- **[ReAct Agent](https://arxiv.org/abs/2210.03629)**: An agent task using prompt traces collected from the original ReAct paper.
- **[Tree-of-Thought](https://arxiv.org/pdf/2305.10601.pdf)**: A custom tree search-based prompt for solving GSM-8K problems.
- **JSON Decode**: Extracting information from a Wikipedia page and outputting it in JSON format.
- **Chat (short)**: A synthetic chat benchmark where each conversation includes 4 turns with short LLM outputs.
- **Chat (long)**: A synthetic chat benchmark where each conversation includes 4 turns with long LLM outputs.
- **[DSPy RAG](https://github.com/stanfordnlp/dspy)**: A retrieval-augmented generation pipeline in the DSPy tutorial.
- **[LLaVA Bench](https://github.com/haotian-liu/LLaVA)**: Running LLaVA v1.5, a vision-language model, on the LLaVA-in-the-wild benchmark.
We tested Llama-7B on one NVIDIA A10G GPU (24 GB) and Mixtral-8x7B on 8 NVIDIA A10G GPUs with tensor parallelism, both in FP16 precision. We used vLLM v0.2.5, guidance v0.1.8, and Hugging Face TGI v1.3.0 as baseline systems.
- Llama-7B on NVIDIA A10G, FP16, Tensor Parallelism=1
![llama_7b](../assets/llama_7b.jpg)
- Mixtral-8x7B on NVIDIA A10G, FP16, Tensor Parallelism=8
![mixtral_8x7b](../assets/mixtral_8x7b.jpg)
The benchmark code is available [here](https://github.com/sgl-project/sglang/tree/main/benchmark).
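As a rough illustration of how a workload like the 5-shot MMLU task above maps onto the SGLang frontend (this is a hedged sketch, not the benchmark code linked above), the snippet below expresses one multi-choice query. The names `FEW_SHOT_PREFIX` and `multi_choice`, the sample question, and the port are illustrative assumptions; only `sgl.function`, `sgl.gen`, `sgl.set_default_backend`, and `sgl.RuntimeEndpoint` are sglang APIs. Because every request shares the same few-shot prefix, this is exactly the access pattern that RadixAttention's prefix cache can reuse across requests.

```python
# Hedged sketch of a 5-shot multi-choice query in the style of the MMLU workload.
# NOT the actual benchmark code; FEW_SHOT_PREFIX, multi_choice, and the question
# are illustrative. Assumes a local SRT server on port 30000.
import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

# Shared across all questions -- the part RadixAttention can cache and reuse.
FEW_SHOT_PREFIX = (
    "The following are multiple choice questions (with answers).\n\n"
    # ...five worked examples would be inlined here in the real workload...
)


@sgl.function
def multi_choice(s, question, choices):
    s += FEW_SHOT_PREFIX
    s += "Question: " + question + "\n"
    for label, choice in zip("ABCD", choices):
        s += f"{label}. {choice}\n"
    # Constrain the generated answer to one of the four option labels.
    s += "Answer: " + sgl.gen("answer", choices=["A", "B", "C", "D"])


state = multi_choice.run(
    question="Which protocol translates domain names into IP addresses?",
    choices=["FTP", "DNS", "SMTP", "DHCP"],
)
print(state["answer"])
```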