deepseek_ocr_server_8707_20260204_131516.log

INFO 02-04 13:15:20 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR-vllm/deepseek_ocr_server.py:472: DeprecationWarning: 
        on_event is deprecated, use lifespan event handlers instead.

        Read more about it in the
        [FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
        
  @app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr
INFO 02-04 13:15:25 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCRForCausalLM']}
INFO 02-04 13:15:25 [config.py:721] This model supports multiple tasks: {'reward', 'classify', 'generate', 'score', 'embed'}. Defaulting to 'generate'.
INFO 02-04 13:15:25 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr', speculative_config=None, tokenizer='/home/lst/deepseek_ocr', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":128}, use_cached_outputs=False, 
INFO 02-04 13:15:26 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-04 13:15:26 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-04 13:15:26 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-04 13:15:26 [worker_base.py:653] ########## 4593 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
INFO 02-04 13:15:26 [worker_base.py:654] ########## 4593 process(rank0) is running on memnode(s): {0, 1, 2, 3}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0204 13:15:26.794960  4593 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0204 13:15:26.795035  4593 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 13:15:26.795487  4593 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55747cc42d80, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0204 13:15:26.795502  4593 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 13:15:26.814953  4593 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55747cc42d80, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0204 13:15:26.814989  4593 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 13:15:26.816212  4593 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55747cc42d80, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0204 13:15:26.816231  4593 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 13:15:26.817245  4593 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55747cc42d80, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0204 13:15:26.817265  4593 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-04 13:15:26 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 13:15:26 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr...
INFO 02-04 13:15:27 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128] is overridden by config [128, 1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120]

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]

Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.84it/s]

Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.83it/s]

INFO 02-04 13:15:30 [loader.py:460] Loading weights took 2.13 seconds
INFO 02-04 13:15:30 [model_runner.py:1165] Model loading took 6.2319 GiB and 3.562644 seconds
Some kwargs in processor config are unused and will not have any effect: ignore_id, pad_token, image_token, add_special_token, candidate_resolutions, mask_prompt, image_std, sft_format, patch_size, normalize, image_mean, downsample_ratio. 
/home/lst/DeepSeek-OCR-vllm/deepencoder/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
  x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 13:15:44 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 13:15:45 [worker.py:287] Memory profiling takes 14.64 seconds
INFO 02-04 13:15:45 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.75) = 47.99GiB
INFO 02-04 13:15:45 [worker.py:287] model weights take 6.23GiB; non_torch_memory takes 1.55GiB; PyTorch activation peak memory takes 1.60GiB; the rest of the memory reserved for KV Cache is 38.60GiB.
INFO 02-04 13:15:45 [executor_base.py:112] # rocm blocks: 10541, # CPU blocks: 1092
INFO 02-04 13:15:45 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 82.35x
INFO 02-04 13:15:47 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.

Capturing CUDA graph shapes:   0%|          | 0/19 [00:00<?, ?it/s]
Capturing CUDA graph shapes:   5%|▌         | 1/19 [00:00<00:08,  2.02it/s]
Capturing CUDA graph shapes:  11%|█         | 2/19 [00:00<00:08,  2.09it/s]
Capturing CUDA graph shapes:  16%|█▌        | 3/19 [00:01<00:07,  2.11it/s]
Capturing CUDA graph shapes:  21%|██        | 4/19 [00:01<00:07,  2.06it/s]
Capturing CUDA graph shapes:  26%|██▋       | 5/19 [00:02<00:06,  2.06it/s]
Capturing CUDA graph shapes:  32%|███▏      | 6/19 [00:02<00:06,  2.09it/s]
Capturing CUDA graph shapes:  37%|███▋      | 7/19 [00:03<00:05,  2.07it/s]
Capturing CUDA graph shapes:  42%|████▏     | 8/19 [00:03<00:05,  2.05it/s]
Capturing CUDA graph shapes:  47%|████▋     | 9/19 [00:04<00:04,  2.04it/s]
Capturing CUDA graph shapes:  53%|█████▎    | 10/19 [00:04<00:04,  2.07it/s]
Capturing CUDA graph shapes:  58%|█████▊    | 11/19 [00:05<00:03,  2.08it/s]
Capturing CUDA graph shapes:  63%|██████▎   | 12/19 [00:05<00:03,  2.05it/s]
Capturing CUDA graph shapes:  68%|██████▊   | 13/19 [00:06<00:02,  2.06it/s]
Capturing CUDA graph shapes:  74%|███████▎  | 14/19 [00:06<00:02,  2.08it/s]
Capturing CUDA graph shapes:  79%|███████▉  | 15/19 [00:07<00:01,  2.10it/s]
Capturing CUDA graph shapes:  84%|████████▍ | 16/19 [00:07<00:01,  2.11it/s]
Capturing CUDA graph shapes:  89%|████████▉ | 17/19 [00:08<00:00,  2.08it/s]
Capturing CUDA graph shapes:  95%|█████████▍| 18/19 [00:08<00:00,  2.08it/s]
Capturing CUDA graph shapes: 100%|██████████| 19/19 [00:09<00:00,  2.10it/s]
Capturing CUDA graph shapes: 100%|██████████| 19/19 [00:09<00:00,  2.08it/s]
INFO 02-04 13:15:56 [model_runner.py:1752] Graph capturing finished in 9 secs, took 0.16 GiB
INFO 02-04 13:15:56 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 26.33 seconds
[SUCCESS] Model loading complete
[INFO] Thread pool configuration:
   - CPU thread pool: 2 threads
   - GPU thread pool: 1 thread

[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs

INFO:     Started server process [4593]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
Some kwargs in processor config are unused and will not have any effect: ignore_id, pad_token, image_token, add_special_token, candidate_resolutions, mask_prompt, image_std, sft_format, patch_size, normalize, image_mean, downsample_ratio. 
   [1/3] Tokenizing 19 pages...
   [1/3] Tokenization complete
   [2/3] GPU batch inference on 19 pages...

Processed prompts:   0%|          | 0/19 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts:   5%|▌         | 1/19 [00:08<02:28,  8.24s/it, est. speed input: 110.87 toks/s, output: 2.19 toks/s]
Processed prompts:  11%|█         | 2/19 [00:10<01:21,  4.78s/it, est. speed input: 172.44 toks/s, output: 10.86 toks/s]
Processed prompts:  16%|█▌        | 3/19 [00:11<00:46,  2.89s/it, est. speed input: 243.95 toks/s, output: 20.75 toks/s]
Processed prompts:  21%|██        | 4/19 [00:12<00:31,  2.12s/it, est. speed input: 300.33 toks/s, output: 31.58 toks/s]
Processed prompts:  26%|██▋       | 5/19 [00:12<00:20,  1.44s/it, est. speed input: 367.77 toks/s, output: 43.83 toks/s]
Processed prompts:  32%|███▏      | 6/19 [00:12<00:13,  1.05s/it, est. speed input: 430.89 toks/s, output: 56.24 toks/s]
Processed prompts:  37%|███▋      | 7/19 [00:12<00:09,  1.30it/s, est. speed input: 495.35 toks/s, output: 69.21 toks/s]
Processed prompts:  47%|████▋     | 9/19 [00:13<00:04,  2.36it/s, est. speed input: 631.80 toks/s, output: 96.57 toks/s]
Processed prompts:  53%|█████▎    | 10/19 [00:13<00:03,  2.66it/s, est. speed input: 689.56 toks/s, output: 109.36 toks/s]
Processed prompts:  58%|█████▊    | 11/19 [00:14<00:03,  2.06it/s, est. speed input: 716.25 toks/s, output: 119.46 toks/s]
Processed prompts:  63%|██████▎   | 12/19 [00:14<00:03,  2.26it/s, est. speed input: 763.32 toks/s, output: 133.63 toks/s]
Processed prompts:  68%|██████▊   | 13/19 [00:14<00:02,  2.84it/s, est. speed input: 819.68 toks/s, output: 149.65 toks/s]
Processed prompts:  74%|███████▎  | 14/19 [00:14<00:01,  3.15it/s, est. speed input: 868.97 toks/s, output: 165.06 toks/s]
Processed prompts:  84%|████████▍ | 16/19 [00:15<00:01,  2.25it/s, est. speed input: 918.38 toks/s, output: 190.68 toks/s]
Processed prompts:  89%|████████▉ | 17/19 [00:16<00:00,  2.15it/s, est. speed input: 944.32 toks/s, output: 207.47 toks/s]
Processed prompts:  95%|█████████▍| 18/19 [00:33<00:04,  4.81s/it, est. speed input: 487.69 toks/s, output: 161.97 toks/s]
Processed prompts: 100%|██████████| 19/19 [00:33<00:00,  1.77s/it, est. speed input: 514.78 toks/s, output: 222.74 toks/s]
   [2/3] GPU inference complete
   OCR time: 38.37s
   [3/3] Post-processing...
   [3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
   Total time: 40.82s
   Average: 2.15s/page
============================================================

INFO:     127.0.0.1:55486 - "POST /ocr HTTP/1.1" 200 OK
   [1/3] Tokenizing 19 pages...
   [1/3] Tokenization complete
   [2/3] GPU batch inference on 19 pages...

Processed prompts:   0%|          | 0/19 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts:   5%|▌         | 1/19 [00:05<01:40,  5.58s/it, est. speed input: 163.69 toks/s, output: 3.23 toks/s]
Processed prompts:  11%|█         | 2/19 [00:07<01:02,  3.68s/it, est. speed input: 230.47 toks/s, output: 14.51 toks/s]
Processed prompts:  16%|█▌        | 3/19 [00:08<00:36,  2.29s/it, est. speed input: 319.52 toks/s, output: 27.18 toks/s]
Processed prompts:  21%|██        | 4/19 [00:09<00:26,  1.76s/it, est. speed input: 384.30 toks/s, output: 40.41 toks/s]
Processed prompts:  26%|██▋       | 5/19 [00:09<00:17,  1.21s/it, est. speed input: 467.80 toks/s, output: 55.75 toks/s]
Processed prompts:  32%|███▏      | 6/19 [00:10<00:11,  1.11it/s, est. speed input: 544.48 toks/s, output: 71.07 toks/s]
Processed prompts:  37%|███▋      | 7/19 [00:10<00:08,  1.49it/s, est. speed input: 623.53 toks/s, output: 87.12 toks/s]
Processed prompts:  47%|████▋     | 9/19 [00:10<00:03,  2.69it/s, est. speed input: 793.73 toks/s, output: 121.32 toks/s]
Processed prompts:  53%|█████▎    | 10/19 [00:10<00:03,  2.97it/s, est. speed input: 862.33 toks/s, output: 136.76 toks/s]
Processed prompts:  58%|█████▊    | 11/19 [00:11<00:03,  2.19it/s, est. speed input: 883.22 toks/s, output: 147.31 toks/s]
Processed prompts:  63%|██████▎   | 12/19 [00:11<00:02,  2.37it/s, est. speed input: 935.98 toks/s, output: 163.86 toks/s]
Processed prompts:  68%|██████▊   | 13/19 [00:11<00:02,  2.95it/s, est. speed input: 1003.15 toks/s, output: 183.15 toks/s]
Processed prompts:  74%|███████▎  | 14/19 [00:12<00:01,  3.26it/s, est. speed input: 1059.75 toks/s, output: 201.30 toks/s]
Processed prompts:  84%|████████▍ | 16/19 [00:13<00:01,  2.29it/s, est. speed input: 1102.22 toks/s, output: 228.85 toks/s]
Processed prompts:  89%|████████▉ | 17/19 [00:13<00:00,  2.18it/s, est. speed input: 1126.22 toks/s, output: 247.43 toks/s]
Processed prompts:  95%|█████████▍| 18/19 [00:31<00:04,  4.81s/it, est. speed input: 528.62 toks/s, output: 175.56 toks/s] 
Processed prompts: 100%|██████████| 19/19 [00:31<00:00,  1.64s/it, est. speed input: 557.98 toks/s, output: 241.44 toks/s]
   [2/3] GPU inference complete
   OCR time: 33.96s
   [3/3] Post-processing...
   [3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
   Total time: 36.41s
   Average: 1.92s/page
============================================================

INFO:     127.0.0.1:35584 - "POST /ocr HTTP/1.1" 200 OK