deepseek_ocr_server_8707_20260204_145012.log 29.9 KB
INFO 02-04 14:50:16 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr_server.py:472: DeprecationWarning: 
        on_event is deprecated, use lifespan event handlers instead.

        Read more about it in the
        [FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
        
  @app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 14:50:21 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 14:50:21 [config.py:721] This model supports multiple tasks: {'classify', 'embed', 'score', 'reward', 'generate'}. Defaulting to 'generate'.
INFO 02-04 14:50:21 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False, 
INFO 02-04 14:50:22 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-04 14:50:22 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-04 14:50:22 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-04 14:50:22 [worker_base.py:653] ########## 14555 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
INFO 02-04 14:50:22 [worker_base.py:654] ########## 14555 process(rank0) is running on memnode(s): {0, 1, 2, 3}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0204 14:50:22.719424 14555 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0204 14:50:22.719481 14555 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:50:22.719913 14555 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55e48a312250, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0204 14:50:22.719926 14555 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:50:22.738953 14555 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55e48a312250, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0204 14:50:22.738993 14555 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:50:22.740214 14555 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55e48a312250, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0204 14:50:22.740231 14555 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:50:22.741233 14555 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55e48a312250, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0204 14:50:22.741250 14555 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-04 14:50:22 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 14:50:22 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 14:50:23 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]

Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  7.14it/s]

Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  7.12it/s]

INFO 02-04 14:50:26 [loader.py:460] Loading weights took 1.81 seconds
INFO 02-04 14:50:26 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.499614 seconds
Some kwargs in processor config are unused and will not have any effect: ignore_id, image_token, add_special_token, sft_format, image_mean, image_std, mask_prompt, downsample_ratio, candidate_resolutions, patch_size, pad_token, normalize. 
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
  x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 14:50:38 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 14:50:39 [worker.py:287] Memory profiling takes 12.33 seconds
INFO 02-04 14:50:39 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-04 14:50:39 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 2.00GiB; the rest of the memory reserved for KV Cache is 47.67GiB.
INFO 02-04 14:50:39 [executor_base.py:112] # rocm blocks: 13017, # CPU blocks: 1092
INFO 02-04 14:50:39 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 101.70x
INFO 02-04 14:50:41 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.

Capturing CUDA graph shapes:   0%|          | 0/6 [00:00<?, ?it/s]
Capturing CUDA graph shapes:  17%|█▋        | 1/6 [00:00<00:02,  1.93it/s]
Capturing CUDA graph shapes:  33%|███▎      | 2/6 [00:00<00:01,  2.03it/s]
Capturing CUDA graph shapes:  50%|█████     | 3/6 [00:01<00:01,  2.06it/s]
Capturing CUDA graph shapes:  67%|██████▋   | 4/6 [00:01<00:00,  2.01it/s]
Capturing CUDA graph shapes:  83%|████████▎ | 5/6 [00:02<00:00,  1.99it/s]
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:02<00:00,  2.01it/s]
INFO 02-04 14:50:44 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-04 14:50:44 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 17.52 seconds
[SUCCESS] Model loaded
[INFO] Thread pool configuration:
   - CPU thread pool: 2 threads
   - GPU thread pool: 1 thread

[INFO] Server started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs

INFO:     Started server process [14555]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
Some kwargs in processor config are unused and will not have any effect: ignore_id, image_token, add_special_token, sft_format, image_mean, image_std, mask_prompt, downsample_ratio, candidate_resolutions, patch_size, pad_token, normalize. 
   [1/3] Tokenizing 19 pages...
   [1/3] Tokenization complete
   [2/3] GPU batch inference on 19 pages...

Processed prompts:   0%|          | 0/19 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts:   5%|▌         | 1/19 [00:12<03:45, 12.53s/it, est. speed input: 90.28 toks/s, output: 1.44 toks/s]
Processed prompts:  11%|█         | 2/19 [00:15<01:58,  6.97s/it, est. speed input: 144.95 toks/s, output: 8.71 toks/s]
Processed prompts:  16%|█▌        | 3/19 [00:16<01:06,  4.14s/it, est. speed input: 207.25 toks/s, output: 17.04 toks/s]
Processed prompts:  21%|██        | 4/19 [00:16<00:38,  2.59s/it, est. speed input: 272.56 toks/s, output: 25.91 toks/s]
Processed prompts:  26%|██▋       | 5/19 [00:17<00:25,  1.85s/it, est. speed input: 330.05 toks/s, output: 35.02 toks/s]
Processed prompts:  37%|███▋      | 7/19 [00:17<00:11,  1.01it/s, est. speed input: 454.16 toks/s, output: 54.67 toks/s]
Processed prompts:  53%|█████▎    | 10/19 [00:17<00:04,  1.95it/s, est. speed input: 640.14 toks/s, output: 85.41 toks/s]
Processed prompts:  58%|█████▊    | 11/19 [00:17<00:03,  2.21it/s, est. speed input: 696.02 toks/s, output: 95.61 toks/s]
Processed prompts:  63%|██████▎   | 12/19 [00:18<00:03,  2.15it/s, est. speed input: 738.44 toks/s, output: 105.17 toks/s]
Processed prompts:  68%|██████▊   | 13/19 [00:18<00:02,  2.35it/s, est. speed input: 786.93 toks/s, output: 116.25 toks/s]
Processed prompts:  74%|███████▎  | 14/19 [00:18<00:01,  2.80it/s, est. speed input: 840.49 toks/s, output: 128.40 toks/s]
Processed prompts:  79%|███████▉  | 15/19 [00:18<00:01,  3.30it/s, est. speed input: 893.08 toks/s, output: 140.82 toks/s]
Processed prompts:  89%|████████▉ | 17/19 [00:20<00:00,  2.33it/s, est. speed input: 952.65 toks/s, output: 162.22 toks/s]
Processed prompts:  95%|█████████▍| 18/19 [00:20<00:00,  2.17it/s, est. speed input: 981.18 toks/s, output: 176.78 toks/s]
Processed prompts: 100%|██████████| 19/19 [00:21<00:00,  1.76it/s, est. speed input: 993.47 toks/s, output: 192.92 toks/s]
Processed prompts: 100%|██████████| 19/19 [00:21<00:00,  1.14s/it, est. speed input: 993.47 toks/s, output: 192.92 toks/s]
   [2/3] GPU inference complete
   OCR time: 26.65s
   [3/3] Post-processing...
   [3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
   Total time: 29.10s
   Average: 1.53s/page
============================================================

INFO:     127.0.0.1:55910 - "POST /ocr HTTP/1.1" 200 OK
   [1/3] Tokenizing 19 pages...
   [1/3] Tokenization complete
   [2/3] GPU batch inference on 19 pages...

Processed prompts:   0%|          | 0/19 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts:   5%|▌         | 1/19 [00:12<03:37, 12.10s/it, est. speed input: 93.46 toks/s, output: 1.49 toks/s]
Processed prompts:  11%|█         | 2/19 [00:15<01:55,  6.77s/it, est. speed input: 149.33 toks/s, output: 8.98 toks/s]
Processed prompts:  16%|█▌        | 3/19 [00:15<01:04,  4.04s/it, est. speed input: 213.03 toks/s, output: 17.52 toks/s]
Processed prompts:  21%|██        | 4/19 [00:16<00:38,  2.53s/it, est. speed input: 280.00 toks/s, output: 26.61 toks/s]
Processed prompts:  26%|██▋       | 5/19 [00:16<00:25,  1.82s/it, est. speed input: 338.64 toks/s, output: 35.93 toks/s]
Processed prompts:  37%|███▋      | 7/19 [00:17<00:12,  1.04s/it, est. speed input: 457.99 toks/s, output: 55.13 toks/s]
Processed prompts:  53%|█████▎    | 10/19 [00:17<00:04,  1.85it/s, est. speed input: 645.21 toks/s, output: 86.08 toks/s]
Processed prompts:  58%|█████▊    | 11/19 [00:17<00:03,  2.10it/s, est. speed input: 701.51 toks/s, output: 96.37 toks/s]
Processed prompts:  63%|██████▎   | 12/19 [00:18<00:03,  2.07it/s, est. speed input: 743.95 toks/s, output: 105.96 toks/s]
Processed prompts:  68%|██████▊   | 13/19 [00:18<00:02,  2.27it/s, est. speed input: 792.69 toks/s, output: 117.10 toks/s]
Processed prompts:  74%|███████▎  | 14/19 [00:18<00:01,  2.72it/s, est. speed input: 846.58 toks/s, output: 129.33 toks/s]
Processed prompts:  79%|███████▉  | 15/19 [00:18<00:01,  3.21it/s, est. speed input: 899.49 toks/s, output: 141.83 toks/s]
Processed prompts:  89%|████████▉ | 17/19 [00:20<00:00,  2.31it/s, est. speed input: 959.04 toks/s, output: 163.31 toks/s]
Processed prompts:  95%|█████████▍| 18/19 [00:20<00:00,  2.15it/s, est. speed input: 987.37 toks/s, output: 177.90 toks/s]
Processed prompts: 100%|██████████| 19/19 [00:21<00:00,  1.75it/s, est. speed input: 999.56 toks/s, output: 194.11 toks/s]
Processed prompts: 100%|██████████| 19/19 [00:21<00:00,  1.13s/it, est. speed input: 999.56 toks/s, output: 194.11 toks/s]
   [2/3] GPU inference complete
   OCR time: 24.63s
   [3/3] Post-processing...
   [3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
   Total time: 27.07s
   Average: 1.42s/page
============================================================

INFO:     127.0.0.1:43926 - "POST /ocr HTTP/1.1" 200 OK
   [1/3] Tokenizing 22 pages...
   [1/3] Tokenization complete
   [2/3] GPU batch inference on 22 pages...

Processed prompts:   0%|          | 0/22 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts:   5%|▍         | 1/22 [00:15<05:17, 15.10s/it, est. speed input: 74.91 toks/s, output: 6.56 toks/s]
Processed prompts:   9%|▉         | 2/22 [00:15<02:11,  6.59s/it, est. speed input: 143.81 toks/s, output: 12.65 toks/s]
Processed prompts:  14%|█▎        | 3/22 [00:18<01:33,  4.90s/it, est. speed input: 182.30 toks/s, output: 19.61 toks/s]
Processed prompts:  18%|█▊        | 4/22 [00:20<01:07,  3.73s/it, est. speed input: 220.17 toks/s, output: 28.67 toks/s]
Processed prompts:  23%|██▎       | 5/22 [00:21<00:48,  2.85s/it, est. speed input: 258.91 toks/s, output: 39.01 toks/s]
Processed prompts:  27%|██▋       | 6/22 [00:22<00:35,  2.25s/it, est. speed input: 296.16 toks/s, output: 50.10 toks/s]
Processed prompts:  32%|███▏      | 7/22 [00:29<00:54,  3.65s/it, est. speed input: 268.68 toks/s, output: 56.47 toks/s]
Processed prompts:  36%|███▋      | 8/22 [00:31<00:45,  3.25s/it, est. speed input: 284.10 toks/s, output: 70.96 toks/s]
Processed prompts:  41%|████      | 9/22 [00:32<00:32,  2.53s/it, est. speed input: 310.48 toks/s, output: 88.09 toks/s]
Processed prompts:  45%|████▌     | 10/22 [00:35<00:30,  2.56s/it, est. speed input: 319.23 toks/s, output: 101.84 toks/s]
Processed prompts:  50%|█████     | 11/22 [00:37<00:27,  2.47s/it, est. speed input: 330.13 toks/s, output: 117.02 toks/s]
Processed prompts:  55%|█████▍    | 12/22 [00:37<00:17,  1.77s/it, est. speed input: 358.37 toks/s, output: 137.81 toks/s]
Processed prompts:  59%|█████▉    | 13/22 [00:38<00:12,  1.40s/it, est. speed input: 382.90 toks/s, output: 157.53 toks/s]
Processed prompts:  64%|██████▎   | 14/22 [00:39<00:09,  1.16s/it, est. speed input: 405.78 toks/s, output: 176.96 toks/s]
Processed prompts:  68%|██████▊   | 15/22 [00:42<00:13,  1.90s/it, est. speed input: 398.03 toks/s, output: 185.77 toks/s]
Processed prompts:  73%|███████▎  | 16/22 [00:44<00:10,  1.78s/it, est. speed input: 410.00 toks/s, output: 201.69 toks/s]
Processed prompts:  77%|███████▋  | 17/22 [00:44<00:07,  1.47s/it, est. speed input: 428.40 toks/s, output: 223.28 toks/s]
Processed prompts:  82%|████████▏ | 18/22 [00:45<00:04,  1.18s/it, est. speed input: 448.48 toks/s, output: 246.00 toks/s]
Processed prompts:  86%|████████▋ | 19/22 [00:45<00:02,  1.06it/s, est. speed input: 469.47 toks/s, output: 269.53 toks/s]
Processed prompts:  91%|█████████ | 20/22 [00:48<00:02,  1.48s/it, est. speed input: 466.38 toks/s, output: 282.20 toks/s]
Processed prompts:  95%|█████████▌| 21/22 [00:48<00:01,  1.07s/it, est. speed input: 488.51 toks/s, output: 307.47 toks/s]
Processed prompts: 100%|██████████| 22/22 [00:53<00:00,  2.18s/it, est. speed input: 466.14 toks/s, output: 316.51 toks/s]
Processed prompts: 100%|██████████| 22/22 [00:53<00:00,  2.43s/it, est. speed input: 466.14 toks/s, output: 316.51 toks/s]
   [2/3] GPU inference complete
   OCR time: 58.12s
   [3/3] Post-processing...
   [3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
   Total time: 69.56s
   Average: 3.16s/page
============================================================

INFO:     127.0.0.1:55008 - "POST /ocr HTTP/1.1" 200 OK
   [1/3] Tokenizing 22 pages...
   [1/3] Tokenization complete
   [2/3] GPU batch inference on 22 pages...

Processed prompts:   0%|          | 0/22 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts:   5%|▍         | 1/22 [00:15<05:21, 15.31s/it, est. speed input: 73.86 toks/s, output: 6.47 toks/s]
Processed prompts:   9%|▉         | 2/22 [00:15<02:13,  6.67s/it, est. speed input: 141.92 toks/s, output: 12.49 toks/s]
Processed prompts:  14%|█▎        | 3/22 [00:18<01:33,  4.94s/it, est. speed input: 180.25 toks/s, output: 19.39 toks/s]
Processed prompts:  18%|█▊        | 4/22 [00:20<01:07,  3.75s/it, est. speed input: 217.99 toks/s, output: 28.38 toks/s]
Processed prompts:  23%|██▎       | 5/22 [00:22<00:48,  2.87s/it, est. speed input: 256.49 toks/s, output: 38.64 toks/s]
Processed prompts:  27%|██▋       | 6/22 [00:23<00:36,  2.26s/it, est. speed input: 293.53 toks/s, output: 49.66 toks/s]
Processed prompts:  32%|███▏      | 7/22 [00:29<00:55,  3.68s/it, est. speed input: 266.42 toks/s, output: 56.00 toks/s]
Processed prompts:  36%|███▋      | 8/22 [00:32<00:45,  3.27s/it, est. speed input: 281.74 toks/s, output: 70.37 toks/s]
Processed prompts:  41%|████      | 9/22 [00:33<00:33,  2.54s/it, est. speed input: 307.92 toks/s, output: 87.36 toks/s]
Processed prompts:  45%|████▌     | 10/22 [00:35<00:30,  2.58s/it, est. speed input: 316.71 toks/s, output: 101.03 toks/s]
Processed prompts:  50%|█████     | 11/22 [00:38<00:27,  2.49s/it, est. speed input: 327.38 toks/s, output: 116.05 toks/s]
Processed prompts:  55%|█████▍    | 12/22 [00:38<00:17,  1.79s/it, est. speed input: 355.40 toks/s, output: 136.67 toks/s]
Processed prompts:  59%|█████▉    | 13/22 [00:38<00:12,  1.41s/it, est. speed input: 379.68 toks/s, output: 156.21 toks/s]
Processed prompts:  64%|██████▎   | 14/22 [00:39<00:09,  1.17s/it, est. speed input: 402.40 toks/s, output: 175.48 toks/s]
Processed prompts:  68%|██████▊   | 15/22 [00:43<00:13,  1.92s/it, est. speed input: 394.48 toks/s, output: 184.12 toks/s]
Processed prompts:  73%|███████▎  | 16/22 [00:44<00:10,  1.81s/it, est. speed input: 406.20 toks/s, output: 199.82 toks/s]
Processed prompts:  77%|███████▋  | 17/22 [00:45<00:07,  1.49s/it, est. speed input: 424.39 toks/s, output: 221.19 toks/s]
Processed prompts:  82%|████████▏ | 18/22 [00:45<00:04,  1.20s/it, est. speed input: 444.25 toks/s, output: 243.69 toks/s]
Processed prompts:  86%|████████▋ | 19/22 [00:46<00:02,  1.05it/s, est. speed input: 465.03 toks/s, output: 266.98 toks/s]
Processed prompts:  91%|█████████ | 20/22 [00:48<00:02,  1.50s/it, est. speed input: 461.96 toks/s, output: 279.52 toks/s]
Processed prompts:  95%|█████████▌| 21/22 [00:49<00:01,  1.08s/it, est. speed input: 483.88 toks/s, output: 304.56 toks/s]
Processed prompts: 100%|██████████| 22/22 [00:53<00:00,  2.22s/it, est. speed input: 461.15 toks/s, output: 313.13 toks/s]
Processed prompts: 100%|██████████| 22/22 [00:53<00:00,  2.45s/it, est. speed input: 461.15 toks/s, output: 313.13 toks/s]
   [2/3] GPU inference complete
   OCR time: 58.66s
   [3/3] Post-processing...
   [3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
   Total time: 70.07s
   Average: 3.19s/page
============================================================

INFO:     127.0.0.1:46898 - "POST /ocr HTTP/1.1" 200 OK
   [1/3] Tokenizing 22 pages...
   [1/3] Tokenization complete
   [2/3] GPU batch inference on 22 pages...

Processed prompts:   0%|          | 0/22 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts:   5%|▍         | 1/22 [00:15<05:25, 15.49s/it, est. speed input: 73.04 toks/s, output: 6.39 toks/s]
Processed prompts:   9%|▉         | 2/22 [00:16<02:14,  6.75s/it, est. speed input: 140.38 toks/s, output: 12.35 toks/s]
Processed prompts:  14%|█▎        | 3/22 [00:19<01:34,  4.99s/it, est. speed input: 178.38 toks/s, output: 19.19 toks/s]
Processed prompts:  18%|█▊        | 4/22 [00:20<01:08,  3.79s/it, est. speed input: 215.82 toks/s, output: 28.10 toks/s]
Processed prompts:  23%|██▎       | 5/22 [00:22<00:49,  2.89s/it, est. speed input: 254.00 toks/s, output: 38.27 toks/s]
Processed prompts:  27%|██▋       | 6/22 [00:23<00:36,  2.27s/it, est. speed input: 290.78 toks/s, output: 49.19 toks/s]
Processed prompts:  32%|███▏      | 7/22 [00:29<00:55,  3.70s/it, est. speed input: 264.06 toks/s, output: 55.50 toks/s]
Processed prompts:  36%|███▋      | 8/22 [00:32<00:46,  3.29s/it, est. speed input: 279.27 toks/s, output: 69.76 toks/s]
Processed prompts:  41%|████      | 9/22 [00:33<00:33,  2.56s/it, est. speed input: 305.18 toks/s, output: 86.59 toks/s]
Processed prompts:  45%|████▌     | 10/22 [00:36<00:31,  2.60s/it, est. speed input: 313.88 toks/s, output: 100.13 toks/s]
Processed prompts:  50%|█████     | 11/22 [00:38<00:27,  2.50s/it, est. speed input: 324.68 toks/s, output: 115.09 toks/s]
Processed prompts:  55%|█████▍    | 12/22 [00:38<00:17,  1.80s/it, est. speed input: 352.47 toks/s, output: 135.54 toks/s]
Processed prompts:  59%|█████▉    | 13/22 [00:39<00:12,  1.42s/it, est. speed input: 376.59 toks/s, output: 154.93 toks/s]
Processed prompts:  64%|██████▎   | 14/22 [00:39<00:09,  1.18s/it, est. speed input: 399.19 toks/s, output: 174.08 toks/s]
Processed prompts:  68%|██████▊   | 15/22 [00:43<00:13,  1.92s/it, est. speed input: 391.79 toks/s, output: 182.86 toks/s]
Processed prompts:  73%|███████▎  | 16/22 [00:44<00:10,  1.80s/it, est. speed input: 403.70 toks/s, output: 198.59 toks/s]
Processed prompts:  77%|███████▋  | 17/22 [00:45<00:07,  1.48s/it, est. speed input: 421.91 toks/s, output: 219.90 toks/s]
Processed prompts:  82%|████████▏ | 18/22 [00:46<00:04,  1.19s/it, est. speed input: 441.72 toks/s, output: 242.30 toks/s]
Processed prompts:  86%|████████▋ | 19/22 [00:46<00:02,  1.05it/s, est. speed input: 462.44 toks/s, output: 265.49 toks/s]
Processed prompts:  91%|█████████ | 20/22 [00:49<00:02,  1.48s/it, est. speed input: 459.78 toks/s, output: 278.20 toks/s]
Processed prompts:  95%|█████████▌| 21/22 [00:49<00:01,  1.07s/it, est. speed input: 481.62 toks/s, output: 303.14 toks/s]
Processed prompts: 100%|██████████| 22/22 [00:54<00:00,  2.17s/it, est. speed input: 460.32 toks/s, output: 312.56 toks/s]
Processed prompts: 100%|██████████| 22/22 [00:54<00:00,  2.46s/it, est. speed input: 460.32 toks/s, output: 312.56 toks/s]
   [2/3] GPU inference complete
   OCR time: 58.74s
   [3/3] Post-processing...
   [3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
   Total time: 70.17s
   Average: 3.19s/page
============================================================

INFO:     127.0.0.1:45882 - "POST /ocr HTTP/1.1" 200 OK
   [1/3] Tokenizing 22 pages...
   [1/3] Tokenization complete
   [2/3] GPU batch inference on 22 pages...

Processed prompts:   0%|          | 0/22 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts:   5%|▍         | 1/22 [00:15<05:17, 15.11s/it, est. speed input: 74.86 toks/s, output: 6.55 toks/s]
Processed prompts:   9%|▉         | 2/22 [00:15<02:11,  6.59s/it, est. speed input: 143.79 toks/s, output: 12.65 toks/s]
Processed prompts:  14%|█▎        | 3/22 [00:18<01:33,  4.91s/it, est. speed input: 182.07 toks/s, output: 19.59 toks/s]
Processed prompts:  18%|█▊        | 4/22 [00:20<01:07,  3.74s/it, est. speed input: 219.78 toks/s, output: 28.61 toks/s]
Processed prompts:  23%|██▎       | 5/22 [00:21<00:48,  2.86s/it, est. speed input: 258.36 toks/s, output: 38.92 toks/s]
Processed prompts:  27%|██▋       | 6/22 [00:22<00:36,  2.26s/it, est. speed input: 295.43 toks/s, output: 49.98 toks/s]
Processed prompts:  32%|███▏      | 7/22 [00:29<00:55,  3.68s/it, est. speed input: 267.59 toks/s, output: 56.24 toks/s]
Processed prompts:  36%|███▋      | 8/22 [00:31<00:45,  3.27s/it, est. speed input: 282.89 toks/s, output: 70.66 toks/s]
Processed prompts:  41%|████      | 9/22 [00:32<00:33,  2.55s/it, est. speed input: 309.03 toks/s, output: 87.68 toks/s]
Processed prompts:  45%|████▌     | 10/22 [00:35<00:31,  2.59s/it, est. speed input: 317.58 toks/s, output: 101.31 toks/s]
Processed prompts:  50%|█████     | 11/22 [00:37<00:27,  2.49s/it, est. speed input: 328.31 toks/s, output: 116.38 toks/s]
Processed prompts:  55%|█████▍    | 12/22 [00:38<00:17,  1.79s/it, est. speed input: 356.40 toks/s, output: 137.05 toks/s]
Processed prompts:  59%|█████▉    | 13/22 [00:38<00:12,  1.41s/it, est. speed input: 380.78 toks/s, output: 156.66 toks/s]
Processed prompts:  64%|██████▎   | 14/22 [00:39<00:09,  1.17s/it, est. speed input: 403.61 toks/s, output: 176.01 toks/s]
Processed prompts:  68%|██████▊   | 15/22 [00:42<00:13,  1.92s/it, est. speed input: 395.55 toks/s, output: 184.61 toks/s]
Processed prompts:  73%|███████▎  | 16/22 [00:44<00:10,  1.80s/it, est. speed input: 407.36 toks/s, output: 200.39 toks/s]
Processed prompts:  77%|███████▋  | 17/22 [00:45<00:07,  1.49s/it, est. speed input: 425.64 toks/s, output: 221.84 toks/s]
Processed prompts:  82%|████████▏ | 18/22 [00:45<00:04,  1.20s/it, est. speed input: 445.59 toks/s, output: 244.42 toks/s]
Processed prompts:  86%|████████▋ | 19/22 [00:46<00:02,  1.05it/s, est. speed input: 466.45 toks/s, output: 267.79 toks/s]
Processed prompts:  91%|█████████ | 20/22 [00:48<00:02,  1.48s/it, est. speed input: 463.57 toks/s, output: 280.50 toks/s]
Processed prompts:  95%|█████████▌| 21/22 [00:48<00:01,  1.07s/it, est. speed input: 485.58 toks/s, output: 305.63 toks/s]
Processed prompts: 100%|██████████| 22/22 [00:53<00:00,  2.17s/it, est. speed input: 463.82 toks/s, output: 314.93 toks/s]
Processed prompts: 100%|██████████| 22/22 [00:53<00:00,  2.44s/it, est. speed input: 463.82 toks/s, output: 314.93 toks/s]
   [2/3] GPU inference complete
   OCR time: 58.24s
   [3/3] Post-processing...
   [3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
   Total time: 69.68s
   Average: 3.17s/page
============================================================

INFO:     127.0.0.1:44226 - "POST /ocr HTTP/1.1" 200 OK