Update balance-serve.md

6cbe044a · wang jiahao · GitHub · 8a1313ca · 6cbe044a
Unverified Commit 6cbe044a authored Apr 05, 2025 by wang jiahao Committed by GitHub Apr 05, 2025

Showing with 4 additions and 1 deletion

doc/en/balance-serve.md +4 -1

--- a/doc/en/balance-serve.md
+++ b/doc/en/balance-serve.md
@@ -128,14 +128,17 @@ It features the following arguments:

 - `--max_new_tokens`: Maximum number of tokens generated per request.
 - `--cache_lens`: Total length of kvcache allocated by the scheduler. All requests share a kvcache space.
+- `--max_batch_size`: Maximum number of requests (prefill + decode) processed in a single run by the engine. (Supported only by `balance_serve`)
 - `--chunk_size`: Maximum number of tokens processed in a single run by the engine.
  corresponding to 32768 tokens, and the space occupied will be released after the requests are completed.
-- `--max_batch_size`: Maximum number of requests (prefill + decode) processed in a single run by the engine. (Supported only by `balance_serve`)
 - `--backend_type`: `balance_serve` is a multi-concurrency backend engine introduced in version v0.2.4. The original single-concurrency engine is `ktransformers`.
 - `--model_path`: Path to safetensor config path (only config required, not model safetensors).  
  Please note that, since `ver 0.2.4`, the last segment of `${model_path}` directory name **MUST** be one of the model names defined in `ktransformers/configs/model_configs.json`.
 - `--force_think`: Force responding the reasoning tag of `DeepSeek R1`.

+The relationship between `max_batch_size`, `cache_lens`, and `max_new_tokens` should satisfy:
+`cache_lens > max_batch_size * max_new_tokens`, otherwise the concurrency will decrease.
+
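The inequality added above can be checked before launching the server. The following is a minimal sketch that encodes only the stated rule `cache_lens > max_batch_size * max_new_tokens`; the function names are hypothetical helpers and are not part of ktransformers. It does not account for prompt tokens or any other scheduler behavior.

```python
def satisfies_kvcache_rule(cache_lens: int, max_batch_size: int, max_new_tokens: int) -> bool:
    """Return True when cache_lens > max_batch_size * max_new_tokens.

    Hypothetical helper: it only encodes the inequality stated in the doc.
    """
    if min(cache_lens, max_batch_size, max_new_tokens) <= 0:
        raise ValueError("all arguments must be positive integers")
    return cache_lens > max_batch_size * max_new_tokens


def largest_batch_size(cache_lens: int, max_new_tokens: int) -> int:
    """Largest max_batch_size that still satisfies the strict inequality.

    Returns 0 when even a single request would not satisfy the rule.
    """
    if min(cache_lens, max_new_tokens) <= 0:
        raise ValueError("all arguments must be positive integers")
    # Strict '>' means we need batch * max_new_tokens <= cache_lens - 1.
    return (cache_lens - 1) // max_new_tokens


if __name__ == "__main__":
    # Example values chosen for illustration only.
    print(satisfies_kvcache_rule(cache_lens=32768, max_batch_size=4, max_new_tokens=1024))
    print(largest_batch_size(cache_lens=32768, max_new_tokens=1024))
```

With the illustrative values above, a `max_batch_size` of 4 satisfies the rule, while 32 would not, because `32 * 1024` equals `cache_lens` rather than staying below it.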
 ### 2. access server

 ```