Commit 1d57371f authored by myhloli's avatar myhloli

docs: update sglang usage instructions and improve parameter descriptions in README files

parent 016076e0
......@@ -651,13 +651,13 @@ mineru -p <input_path> -o <output_path>
#### 2.3 Using sglang to Accelerate VLM Model Inference
##### Through the sglang-engine Mode
```bash
mineru -p <input_path> -o <output_path> -b vlm-sglang-engine
```
##### Through the sglang-server/client Mode
1. Start the Server:
......@@ -665,11 +665,13 @@ mineru -p <input_path> -o <output_path> -b vlm-sglang-engine
mineru-sglang-server --port 30000
```
> [!TIP]
> sglang-server supports several commonly used configuration parameters:
> - If you have two GPUs with `12GB` or `16GB` VRAM, you can use the Tensor Parallel (TP) mode: `--tp 2`
> - If you have two GPUs with `11GB` VRAM, in addition to Tensor Parallel mode, you need to reduce the KV cache size: `--tp 2 --mem-fraction-static 0.7`
> - If you have more than two GPUs with `24GB` VRAM or above, you can use sglang's multi-GPU parallel mode to increase throughput: `--dp 2`
> - You can also enable `torch.compile` to accelerate inference speed by approximately 15%: `--enable-torch-compile`
> - If you want to learn more about the usage of `sglang` parameters, please refer to the [official sglang documentation](https://docs.sglang.ai/backend/server_arguments.html#common-launch-commands)
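Taken together, a launch that combines several of the options above might look like the following sketch (flags as listed above; the port number is arbitrary, and which flags you need depends on your GPUs):

```shell
# Launch the sglang server with tensor parallelism across two GPUs,
# a reduced KV-cache fraction for lower-VRAM cards, and torch.compile enabled
mineru-sglang-server --port 30000 \
    --tp 2 \
    --mem-fraction-static 0.7 \
    --enable-torch-compile
```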
2. Use the Client in another terminal:
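With the server running, the client backend points at it. A minimal sketch, assuming the server is local on port 30000 and that the `vlm-sglang-client` backend takes a server URL via `-u` (verify the exact flag with `mineru --help`):

```shell
# Run extraction against the running sglang server
# (-u as the server-URL flag is an assumption; check `mineru --help`)
mineru -p <input_path> -o <output_path> -b vlm-sglang-client -u http://127.0.0.1:30000
```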
......
......@@ -492,7 +492,7 @@ cd MinerU
uv pip install -e .[core] -i https://mirrors.aliyun.com/pypi/simple
```
> [!NOTE]
> After installation, CUDA/MPS acceleration is supported automatically on Linux and macOS. Windows users who need CUDA acceleration
> should visit the [PyTorch website](https://pytorch.org/get-started/locally/) and install PyTorch with the appropriate CUDA version.
......@@ -640,13 +640,13 @@ mineru -p <input_path> -o <output_path>
#### 2.3 Using sglang to Accelerate VLM Model Inference
##### Through the sglang-engine Mode
```bash
mineru -p <input_path> -o <output_path> -b vlm-sglang-engine
```
##### Through the sglang-server/client Mode
1. Start the Server:
......@@ -655,10 +655,12 @@ mineru-sglang-server --port 30000
```
> [!TIP]
> sglang-server supports several commonly used configuration parameters:
> - If you have two GPUs with `12GB` or `16GB` VRAM, you can use the Tensor Parallel (TP) mode: `--tp 2`
> - If you have two GPUs with `11GB` VRAM, you also need to reduce the KV cache size in addition to tensor parallelism: `--tp 2 --mem-fraction-static 0.7`
> - If you have multiple GPUs with `24GB` VRAM or more, you can use sglang's multi-GPU parallel mode to increase throughput: `--dp 2`
> - You can also enable `torch.compile` to speed up inference by roughly 15%: `--enable-torch-compile`
> - For more on `sglang` parameter usage, see the [official sglang documentation](https://docs.sglang.ai/backend/server_arguments.html#common-launch-commands)
2. Use the Client in another terminal:
......