Unverified commit 455c9ccc authored by Lianmin Zheng, committed by GitHub

Update readme (#434)

parent 39191c85
@@ -326,15 +326,17 @@ response = client.chat.completions.create(
 print(response)
 ```
-In above example, the server uses the chat template specified in the model tokenizer.
-You can override the chat template if needed when launching the server:
+By default, the server uses the chat template specified in the model tokenizer from Hugging Face. It should just work for most official models such as Llama-2/Llama-3.
+If needed, you can also override the chat template when launching the server:
 ```
 python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template llama-2
 ```
 If the chat template you are looking for is missing, you are welcome to contribute it.
-Meanwhile, you can also temporary register your chat template as follows:
+Meanwhile, you can also temporarily register your chat template as follows:
 ```json
 {
...
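The readme hunk above is truncated right after the opening brace of the JSON chat template. For context, sglang's temporarily registered templates are JSON files in a FastChat-style conversation format; the sketch below writes one out. All field values here (the template name, ChatML-style role tokens, separator, and stop string) are illustrative assumptions, not the README's literal content.

```python
# Sketch of writing a custom chat template file for sglang.
# All field values below are illustrative assumptions (ChatML-style tokens);
# adjust them to match your model's actual prompt format.
import json

template = {
    "name": "my_model",                    # hypothetical template name
    "system": "<|im_start|>system",        # system-role prefix
    "user": "<|im_start|>user",            # user-role prefix
    "assistant": "<|im_start|>assistant",  # assistant-role prefix
    "sep_style": "CHATML",                 # separator style
    "sep": "<|im_end|>",                   # turn separator
    "stop_str": ["<|im_end|>"],            # generation stop string(s)
}

with open("my_model_template.json", "w") as f:
    json.dump(template, f, indent=2)
```

The server can then be pointed at the file instead of a built-in template name, e.g. `python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template ./my_model_template.json` (the file path is an assumption following the sketch above).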
@@ -30,7 +30,8 @@ if __name__ == "__main__":
     response = requests.post(
         url + "/generate",
         json={
-            "input_ids": [[1,2,3], [1,2,3]],
+            "text": f"{a}, ",
+            #"input_ids": [[2] * 256] * 196,
             "sampling_params": {
                 "temperature": 0,
                 "max_new_tokens": max_new_tokens,
...
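The test change above switches the `/generate` request body from token ids (`input_ids`) to plain text (`text`). A standalone version of such a request, against a locally running server, might look like the sketch below; the server URL, prompt, and sampling values are illustrative assumptions.

```python
# Minimal sketch of a text-based /generate request to a running sglang server.
# URL, prompt, and sampling values are illustrative assumptions.
import requests

url = "http://localhost:30000"

response = requests.post(
    url + "/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,      # greedy decoding
            "max_new_tokens": 16,  # cap on generated tokens
        },
    },
)
print(response.json())
```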