Update readme (#769)

Co-authored-by: Mingyi <wisclmy0611@gmail.com>

9f95dcc6 · Ying Sheng · GitHub · 0736b270 · 9f95dcc6
Commit 9f95dcc6 authored Jul 27, 2024 by Ying Sheng; committed by GitHub Jul 27, 2024

Showing with 6 additions and 5 deletions

README.md +6 -5

--- a/README.md
+++ b/README.md
@@ -159,16 +159,17 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 ### Run Llama 3.1 405B
 ```bash
-# 2 nodes run 405B fp16
+## Run 405B (fp8) on a single node
+python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8
+## Run 405B (fp16) on two nodes
 # replace the `172.16.4.52:20000` with your own first node ip address and port, disable CUDA Graph temporarily
 # on the first node
 GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 0 --disable-cuda-graph --mem-frac 0.75
 # on the second
 GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 1 --disable-cuda-graph --mem-frac 0.75
-# single node run 405B fp8
-python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8
 ```
 ### Supported Models
@@ -198,7 +199,7 @@ Instructions for supporting a new model are [here](https://github.com/sgl-projec
 ### Benchmark Performance
- Benchmark a single static batch. Run the following command without launching a server. The arguments are the same as those for `launch_server.py`.
+- Benchmark a single static batch by running the following command without launching a server. The arguments are the same as those for `launch_server.py`. This is not a dynamic batching server, so it may run out of memory for a batch size that can run successfully with a real server. This is because a real server will truncate the prefill into several batches/chunks, while this unit test does not do this.
  ```
  python -m sglang.bench_latency --model-path meta-llama/Meta-Llama-3-8B-Instruct --batch 32 --input-len 256 --output-len 32
  ```
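
The new README wording says that `bench_latency` may run out of memory where a real server would not, because the server splits the prefill into chunks. A minimal sketch of that difference, counting only tokens per forward pass, is shown here. The function names and the 2048-token chunk size are hypothetical and are not taken from sglang; actual memory use depends on the model and runtime.

```python
def static_prefill_tokens(batch_size: int, input_len: int) -> int:
    """Tokens in the single forward pass when the whole batch is prefilled at once."""
    return batch_size * input_len


def chunked_prefill_steps(batch_size: int, input_len: int, chunk_tokens: int) -> list[int]:
    """Split the same prefill work into passes of at most `chunk_tokens` tokens.

    `chunk_tokens` is a hypothetical knob used for illustration only.
    """
    total = batch_size * input_len
    return [min(chunk_tokens, total - start) for start in range(0, total, chunk_tokens)]


# Same shape as the README command: --batch 32 --input-len 256
static_peak = static_prefill_tokens(32, 256)
steps = chunked_prefill_steps(32, 256, chunk_tokens=2048)

print(static_peak)             # 8192 tokens in one pass
print(max(steps), len(steps))  # 2048 tokens per pass, 4 passes
```

Under these assumptions the static benchmark pushes all 8192 prompt tokens through one pass, while the chunked variant never exceeds 2048 per pass, which is why a batch size that fits on a server can still fail in the benchmark.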