Commit 6ad287f7 authored by liuxu3

added DeepSeek OCR API by liushengtong

parent 80c11a03
INFO 02-04 18:32:47 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py:476: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 18:32:52 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 18:32:52 [config.py:721] This model supports multiple tasks: {'generate', 'classify', 'embed', 'reward', 'score'}. Defaulting to 'generate'.
INFO 02-04 18:32:52 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-04 18:32:53 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-04 18:32:53 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-04 18:32:53 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-04 18:32:53 [worker_base.py:653] ########## 49838 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
INFO 02-04 18:32:53 [worker_base.py:654] ########## 49838 process(rank0) is running on memnode(s): {0, 1, 2, 3}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0204 18:32:53.341442 49838 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0204 18:32:53.341506 49838 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 18:32:53.341955 49838 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5635f2a9db40, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0204 18:32:53.341969 49838 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 18:32:53.361923 49838 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5635f2a9db40, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0204 18:32:53.361964 49838 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 18:32:53.363456 49838 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5635f2a9db40, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0204 18:32:53.363479 49838 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 18:32:53.364511 49838 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5635f2a9db40, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0204 18:32:53.364529 49838 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-04 18:32:53 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 18:32:53 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 18:32:54 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 6.89it/s]
INFO 02-04 18:32:56 [loader.py:460] Loading weights took 1.84 seconds
INFO 02-04 18:32:57 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.543877 seconds
Some kwargs in processor config are unused and will not have any effect: pad_token, add_special_token, sft_format, candidate_resolutions, patch_size, normalize, downsample_ratio, image_std, image_token, ignore_id, mask_prompt, image_mean.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 513, in <module>
[rank0]: main()
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 504, in main
[rank0]: initialize_model(args.model_path, args.gpu_id)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 272, in initialize_model
[rank0]: llm = LLM(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 1182, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 255, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 520, in from_engine_args
[rank0]: return engine_cls.from_vllm_config(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 496, in from_vllm_config
[rank0]: return cls(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 286, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 432, in _initialize_kv_caches
[rank0]: self.model_executor.determine_num_available_blocks())
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 103, in determine_num_available_blocks
[rank0]: results = self.collective_rpc("determine_num_available_blocks")
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]: answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2624, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 249, in determine_num_available_blocks
[rank0]: self.model_runner.profile_run()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1262, in profile_run
[rank0]: self._dummy_run(max_num_batched_tokens, max_num_seqs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1388, in _dummy_run
[rank0]: self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1948, in execute_model
[rank0]: hidden_or_intermediate_states = model_executable(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2.py", line 541, in forward
[rank0]: vision_embeddings = self.get_multimodal_embeddings(**kwargs)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2.py", line 501, in get_multimodal_embeddings
[rank0]: vision_embeddings = self._process_image_input(image_input)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2.py", line 487, in _process_image_input
[rank0]: vision_features = self._pixel_values_to_embedding(
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2.py", line 401, in _pixel_values_to_embedding
[rank0]: local_features_2 = self.qwen2_model(local_features_1)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/qwen2_d2e.py", line 264, in forward
[rank0]: batch_query_imgs = param_img.unsqueeze(0).expand(
[rank0]: UnboundLocalError: local variable 'param_img' referenced before assignment
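The `UnboundLocalError` that kills this run is the classic pattern of a local name bound only inside a branch (or a loop body that may execute zero times) and then used unconditionally afterwards. A minimal reproduction and the usual fix (names are illustrative, not the actual qwen2_d2e.py code):

```python
def broken(images):
    # 'param_img' is bound only inside the loop; with an empty input the
    # name is never assigned, so the final use raises UnboundLocalError,
    # just like qwen2_d2e.py line 264 during the profile_run dummy batch.
    for img in images:
        param_img = img
    return param_img


def fixed(images, default=None):
    param_img = default  # bind unconditionally before the branch/loop
    for img in images:
        param_img = img
    return param_img
```

The crash happens during vLLM's `profile_run`, which feeds the model a dummy batch, so the branch that normally assigns `param_img` is evidently skipped for that input shape.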
INFO 02-04 18:33:41 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py:476: DeprecationWarning: on_event is deprecated, use lifespan event handlers instead. (see above)
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 18:33:45 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 18:33:45 [config.py:721] This model supports multiple tasks: {'generate', 'score', 'reward', 'classify', 'embed'}. Defaulting to 'generate'.
INFO 02-04 18:33:45 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-04 18:33:46 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-04 18:33:46 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-04 18:33:46 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-04 18:33:46 [worker_base.py:653] ########## 50743 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
INFO 02-04 18:33:46 [worker_base.py:654] ########## 50743 process(rank0) is running on memnode(s): {0, 1, 2, 3}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0204 18:33:46.885210 50743 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0204 18:33:46.885288 50743 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 18:33:46.885742 50743 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55b223a8f570, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0204 18:33:46.885754 50743 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 18:33:46.904980 50743 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55b223a8f570, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0204 18:33:46.905019 50743 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 18:33:46.906283 50743 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55b223a8f570, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0204 18:33:46.906301 50743 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 18:33:46.907312 50743 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55b223a8f570, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0204 18:33:46.907331 50743 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-04 18:33:46 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 18:33:46 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 18:33:47 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 7.16it/s]
INFO 02-04 18:33:50 [loader.py:460] Loading weights took 1.73 seconds
INFO 02-04 18:33:50 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.422877 seconds
Some kwargs in processor config are unused and will not have any effect: mask_prompt, ignore_id, add_special_token, pad_token, image_token, normalize, downsample_ratio, candidate_resolutions, sft_format, patch_size, image_mean, image_std.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 18:34:02 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 18:34:02 [worker.py:287] Memory profiling takes 12.06 seconds
INFO 02-04 18:34:02 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-04 18:34:02 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 2.00GiB; the rest of the memory reserved for KV Cache is 47.67GiB.
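The memory accounting in these worker.py lines can be reproduced from the reported figures; a small sketch (values copied from the log, with small rounding differences against the printed 57.59/47.67 because vLLM rounds after computing at full precision):

```python
total_gpu_memory = 63.98          # GiB, reported by the worker
gpu_memory_utilization = 0.90

budget = total_gpu_memory * gpu_memory_utilization  # usable budget

model_weights = 6.33              # GiB
non_torch_memory = 1.58           # GiB
activation_peak = 2.00            # GiB, PyTorch activation peak

kv_cache = budget - model_weights - non_torch_memory - activation_peak

print(f"budget={budget:.2f} GiB, kv_cache={kv_cache:.2f} GiB")
```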
INFO 02-04 18:34:03 [executor_base.py:112] # rocm blocks: 13017, # CPU blocks: 0
INFO 02-04 18:34:03 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 101.70x
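The 101.70x concurrency figure follows from the KV-cache block count: blocks × tokens-per-block ÷ max tokens per request. The block size is not printed in this log; 64 is inferred as the only value consistent with the reported numbers:

```python
num_gpu_blocks = 13017   # "# rocm blocks" from the log
max_seq_len = 8192       # tokens per request
block_size = 64          # inferred from 101.70x, not printed in the log

max_concurrency = num_gpu_blocks * block_size / max_seq_len
print(f"{max_concurrency:.2f}x")
```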
INFO 02-04 18:34:03 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:02<00:00, 2.01it/s]
INFO 02-04 18:34:06 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-04 18:34:06 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 15.67 seconds
[SUCCESS] Model loading complete
[INFO] Thread pool configuration:
    - CPU thread pool: 2 threads
    - GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [50743]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
INFO 02-04 18:34:51 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py:476: DeprecationWarning: on_event is deprecated, use lifespan event handlers instead. (see above)
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 18:34:56 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 18:34:56 [config.py:721] This model supports multiple tasks: {'generate', 'embed', 'classify', 'reward', 'score'}. Defaulting to 'generate'.
INFO 02-04 18:34:56 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-04 18:34:56 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-04 18:34:56 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-04 18:34:56 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-04 18:34:56 [worker_base.py:653] ########## 51630 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
INFO 02-04 18:34:56 [worker_base.py:654] ########## 51630 process(rank0) is running on memnode(s): {0, 1, 2, 3}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0204 18:34:57.060443 51630 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0204 18:34:57.060510 51630 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 18:34:57.060959 51630 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55b96dd7c0d0, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0204 18:34:57.060974 51630 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 18:34:57.080931 51630 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55b96dd7c0d0, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0204 18:34:57.080973 51630 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 18:34:57.082218 51630 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55b96dd7c0d0, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0204 18:34:57.082239 51630 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 18:34:57.083209 51630 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55b96dd7c0d0, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0204 18:34:57.083231 51630 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-04 18:34:57 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 18:34:57 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 18:34:58 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 6.95it/s]
INFO 02-04 18:35:00 [loader.py:460] Loading weights took 1.82 seconds
INFO 02-04 18:35:00 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.480003 seconds
Some kwargs in processor config are unused and will not have any effect: image_std, pad_token, normalize, ignore_id, mask_prompt, patch_size, downsample_ratio, add_special_token, sft_format, image_token, image_mean, candidate_resolutions.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 513, in <module>
[rank0]: main()
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 504, in main
[rank0]: initialize_model(args.model_path, args.gpu_id)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 272, in initialize_model
[rank0]: llm = LLM(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 1182, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 255, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 520, in from_engine_args
[rank0]: return engine_cls.from_vllm_config(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 496, in from_vllm_config
[rank0]: return cls(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 286, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 432, in _initialize_kv_caches
[rank0]: self.model_executor.determine_num_available_blocks())
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 103, in determine_num_available_blocks
[rank0]: results = self.collective_rpc("determine_num_available_blocks")
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]: answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2624, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 249, in determine_num_available_blocks
[rank0]: self.model_runner.profile_run()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1262, in profile_run
[rank0]: self._dummy_run(max_num_batched_tokens, max_num_seqs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1388, in _dummy_run
[rank0]: self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1948, in execute_model
[rank0]: hidden_or_intermediate_states = model_executable(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2.py", line 541, in forward
[rank0]: vision_embeddings = self.get_multimodal_embeddings(**kwargs)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2.py", line 501, in get_multimodal_embeddings
[rank0]: vision_embeddings = self._process_image_input(image_input)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2.py", line 487, in _process_image_input
[rank0]: vision_features = self._pixel_values_to_embedding(
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2.py", line 401, in _pixel_values_to_embedding
[rank0]: local_features_2 = self.qwen2_model(local_features_1)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/qwen2_d2e.py", line 264, in forward
[rank0]: batch_query_imgs = param_img.unsqueeze(0).expand(
[rank0]: UnboundLocalError: local variable 'param_img' referenced before assignment
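The traceback above ends in the classic Python failure pattern where a variable is bound only inside a conditional branch but used unconditionally afterward. The sketch below reproduces the pattern with plain lists standing in for tensors; `param_img` comes from the log, but the branch condition and shapes are hypothetical, since the actual logic in `qwen2_d2e.py` is not shown here.

```python
def forward_sketch(batch_size: int, use_query_imgs: bool):
    """Minimal reproduction of the qwen2_d2e.py crash pattern:
    param_img is bound only inside one branch, so the later
    unconditional reference raises UnboundLocalError."""
    if use_query_imgs:
        param_img = [[0.0] * 8 for _ in range(4)]  # stand-in for a tensor
    # BUG: when use_query_imgs is False, param_img was never bound,
    # so the next line raises UnboundLocalError, as in the traceback.
    # Fix sketch: bind a default (or raise a clear error) before the
    # branch so every path defines param_img.
    batch = [param_img] * batch_size
    return batch
```

The fix is to make every code path define the variable, or to fail fast with a descriptive error when the input that would populate it is absent.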
INFO 02-05 10:03:22 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py:476: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-05 10:03:27 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-05 10:03:27 [config.py:721] This model supports multiple tasks: {'score', 'generate', 'reward', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 02-05 10:03:27 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-05 10:03:28 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-05 10:03:28 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-05 10:03:28 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-05 10:03:28 [worker_base.py:653] ########## 58452 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
INFO 02-05 10:03:28 [worker_base.py:654] ########## 58452 process(rank0) is running on memnode(s): {0, 1, 2, 3}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0205 10:03:28.481029 58452 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0205 10:03:28.481109 58452 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0205 10:03:28.481564 58452 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55d4311a9f80, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0205 10:03:28.481581 58452 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0205 10:03:28.500953 58452 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55d4311a9f80, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0205 10:03:28.500998 58452 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0205 10:03:28.502223 58452 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55d4311a9f80, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0205 10:03:28.502246 58452 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0205 10:03:28.503187 58452 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55d4311a9f80, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0205 10:03:28.503209 58452 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-05 10:03:28 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-05 10:03:28 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-05 10:03:29 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 7.13it/s]
INFO 02-05 10:03:31 [loader.py:460] Loading weights took 1.79 seconds
INFO 02-05 10:03:32 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.454019 seconds
Some kwargs in processor config are unused and will not have any effect: image_token, sft_format, pad_token, image_std, add_special_token, ignore_id, patch_size, downsample_ratio, normalize, candidate_resolutions, image_mean, mask_prompt.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-05 10:03:43 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-05 10:03:44 [worker.py:287] Memory profiling takes 11.98 seconds
INFO 02-05 10:03:44 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-05 10:03:44 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 2.00GiB; the rest of the memory reserved for KV Cache is 47.67GiB.
INFO 02-05 10:03:44 [executor_base.py:112] # rocm blocks: 13017, # CPU blocks: 0
INFO 02-05 10:03:44 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 101.70x
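The block count and concurrency figures above are consistent with a 64-token KV-cache block size. That block size is an inference from the numbers, not something the log prints, so treat it as an assumption:

```python
# Sketch of the arithmetic behind the two executor_base.py lines above.
num_gpu_blocks = 13017    # "# rocm blocks" from the log
block_size = 64           # assumed tokens per KV-cache block (not logged)
max_seq_len = 8192        # max_seq_len from the engine config

total_kv_tokens = num_gpu_blocks * block_size
max_concurrency = total_kv_tokens / max_seq_len
print(f"{max_concurrency:.2f}x")  # matches the reported 101.70x
```

In other words, the cache can hold roughly 101 full-length (8192-token) sequences at once; shorter requests allow proportionally more concurrency.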
INFO 02-05 10:03:44 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:03<00:00,  2.00it/s]
INFO 02-05 10:03:47 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-05 10:03:47 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 15.60 seconds
[SUCCESS] Model loaded
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Server listening at: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [58452]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
Some kwargs in processor config are unused and will not have any effect: image_token, sft_format, pad_token, image_std, add_special_token, ignore_id, patch_size, downsample_ratio, normalize, candidate_resolutions, image_mean, mask_prompt.
[1/3] Tokenizing 19 pages...
[1/3] Tokenization complete
[2/3] Running GPU batch inference on 19 pages...
Processed prompts: 100%|██████████| 19/19 [00:21<00:00,  1.16s/it, est. speed input: 977.88 toks/s, output: 193.90 toks/s]
[2/3] GPU inference complete
OCR time: 26.94s
[3/3] Post-processing...
[3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
Total time: 29.37s
Average: 1.55s/page
============================================================
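The per-page average in the summary follows directly from the totals it reports, and the gap between total time and OCR time is the tokenize/post-process/transfer overhead:

```python
total_seconds = 29.37   # total wall time from the summary
pages = 19              # pages in this request
ocr_seconds = 26.94     # GPU inference portion

print(f"average: {total_seconds / pages:.2f}s/page")        # 1.55, as reported
print(f"non-OCR overhead: {total_seconds - ocr_seconds:.2f}s")  # ~2.43s
```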
INFO: 127.0.0.1:36956 - "POST /ocr HTTP/1.1" 200 OK
INFO 02-27 11:29:07 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py:476: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-27 11:29:12 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-27 11:29:12 [config.py:721] This model supports multiple tasks: {'generate', 'score', 'embed', 'reward', 'classify'}. Defaulting to 'generate'.
INFO 02-27 11:29:12 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-27 11:29:13 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-27 11:29:13 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-27 11:29:13 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-27 11:29:13 [worker_base.py:653] ########## 2019 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63}
INFO 02-27 11:29:13 [worker_base.py:654] ########## 2019 process(rank0) is running on memnode(s): {0, 1, 2, 3, 4, 5, 6, 7}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0227 11:29:14.044931 2019 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0227 11:29:14.044994 2019 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:29:14.045455 2019 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5639914424b0, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0227 11:29:14.045471 2019 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:29:14.065977 2019 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5639914424b0, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0227 11:29:14.066015 2019 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:29:14.067288 2019 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5639914424b0, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0227 11:29:14.067306 2019 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:29:14.068274 2019 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5639914424b0, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0227 11:29:14.068290 2019 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-27 11:29:14 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-27 11:29:14 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-27 11:29:15 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 513, in <module>
[rank0]: main()
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 504, in main
[rank0]: initialize_model(args.model_path, args.gpu_id)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 272, in initialize_model
[rank0]: llm = LLM(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 1182, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 255, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 520, in from_engine_args
[rank0]: return engine_cls.from_vllm_config(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 496, in from_vllm_config
[rank0]: return cls(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 283, in __init__
[rank0]: self.model_executor = executor_class(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
[rank0]: self._init_executor()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
[rank0]: self.collective_rpc("load_model")
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]: answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2624, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 203, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1136, in load_model
[rank0]: self.model = get_model(vllm_config=self.vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
[rank0]: return loader.load_model(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 454, in load_model
[rank0]: model = _initialize_model(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 133, in _initialize_model
[rank0]: return model_class(vllm_config=vllm_config, prefix=prefix)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2.py", line 325, in __init__
[rank0]: self.language_model = init_vllm_registered_model(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 286, in init_vllm_registered_model
[rank0]: return _initialize_model(vllm_config=vllm_config, prefix=prefix)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 133, in _initialize_model
[rank0]: return model_class(vllm_config=vllm_config, prefix=prefix)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 457, in __init__
[rank0]: self.model = DeepseekModel(vllm_config=vllm_config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 358, in __init__
[rank0]: self.start_layer, self.end_layer, self.layers = make_layers(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 609, in make_layers
[rank0]: [PPMissingLayer() for _ in range(start_layer)] + [
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 610, in <listcomp>
[rank0]: maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 360, in <lambda>
[rank0]: lambda prefix: DeepseekDecoderLayer(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 300, in __init__
[rank0]: self.mlp = DeepseekMoE(config=config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 120, in __init__
[rank0]: self.pack_params()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 162, in pack_params
[rank0]: self.w2 = self.w2.permute(0, 2, 1).contiguous()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 79, in __torch_function__
[rank0]: return func(*args, **kwargs)
[rank0]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 140.00 MiB. GPU 0 has a total capacity of 63.98 GiB of which 0 bytes is free. Of the allocated memory 3.24 GiB is allocated by PyTorch, and 235.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
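Note the numbers in this OOM: the GPU has 63.98 GiB total, PyTorch itself had allocated only ~3.24 GiB, yet 0 bytes were free, which points to another process (plausibly a previous server instance that was not fully torn down) still holding GPU 0 rather than fragmentation. The error message also suggests an allocator hint; a minimal sketch of applying it before torch initializes HIP, assuming you launch via a Python entry point:

```python
import os

# Sketch only: apply the allocator hint suggested by the error message.
# This must be set before torch initializes HIP to take effect. It helps
# with fragmentation, but cannot recover memory held by another process,
# which is what the "0 bytes is free" figure above indicates.
os.environ.setdefault("PYTORCH_HIP_ALLOC_CONF", "expandable_segments:True")
```

Checking `rocm-smi` for stale processes (and killing them) before restarting the server is the more likely fix for this particular failure.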
INFO 02-27 11:31:39 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py:476: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-27 11:31:45 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-27 11:31:45 [config.py:721] This model supports multiple tasks: {'embed', 'score', 'classify', 'generate', 'reward'}. Defaulting to 'generate'.
INFO 02-27 11:31:45 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-27 11:31:46 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-27 11:31:46 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-27 11:31:46 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-27 11:31:46 [worker_base.py:653] ########## 2386 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63}
INFO 02-27 11:31:46 [worker_base.py:654] ########## 2386 process(rank0) is running on memnode(s): {0, 1, 2, 3, 4, 5, 6, 7}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0227 11:31:46.413403 2386 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0227 11:31:46.413467 2386 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:31:46.413935 2386 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5609f68a77b0, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0227 11:31:46.413949 2386 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:31:46.434001 2386 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5609f68a77b0, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0227 11:31:46.434037 2386 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:31:46.435284 2386 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5609f68a77b0, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0227 11:31:46.435304 2386 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:31:46.436266 2386 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5609f68a77b0, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0227 11:31:46.436283 2386 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-27 11:31:46 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-27 11:31:46 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-27 11:31:47 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 513, in <module>
[rank0]: main()
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 504, in main
[rank0]: initialize_model(args.model_path, args.gpu_id)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 272, in initialize_model
[rank0]: llm = LLM(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 1182, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 255, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 520, in from_engine_args
[rank0]: return engine_cls.from_vllm_config(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 496, in from_vllm_config
[rank0]: return cls(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 283, in __init__
[rank0]: self.model_executor = executor_class(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
[rank0]: self._init_executor()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
[rank0]: self.collective_rpc("load_model")
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]: answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2624, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 203, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1136, in load_model
[rank0]: self.model = get_model(vllm_config=self.vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
[rank0]: return loader.load_model(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 454, in load_model
[rank0]: model = _initialize_model(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 133, in _initialize_model
[rank0]: return model_class(vllm_config=vllm_config, prefix=prefix)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2.py", line 325, in __init__
[rank0]: self.language_model = init_vllm_registered_model(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 286, in init_vllm_registered_model
[rank0]: return _initialize_model(vllm_config=vllm_config, prefix=prefix)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 133, in _initialize_model
[rank0]: return model_class(vllm_config=vllm_config, prefix=prefix)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 457, in __init__
[rank0]: self.model = DeepseekModel(vllm_config=vllm_config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 358, in __init__
[rank0]: self.start_layer, self.end_layer, self.layers = make_layers(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 609, in make_layers
[rank0]: [PPMissingLayer() for _ in range(start_layer)] + [
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 610, in <listcomp>
[rank0]: maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 360, in <lambda>
[rank0]: lambda prefix: DeepseekDecoderLayer(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 300, in __init__
[rank0]: self.mlp = DeepseekMoE(config=config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 120, in __init__
[rank0]: self.pack_params()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 162, in pack_params
[rank0]: self.w2 = self.w2.permute(0, 2, 1).contiguous()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 79, in __torch_function__
[rank0]: return func(*args, **kwargs)
[rank0]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 140.00 MiB. GPU 0 has a total capacity of 63.98 GiB of which 0 bytes is free. Of the allocated memory 3.24 GiB is allocated by PyTorch, and 235.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
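The allocator's own hint from the error above can be applied before retrying. A minimal sketch, assuming the server is launched as a fresh Python process: the variable must be set before the first `import torch`, because the HIP caching allocator reads it once at initialization.

```python
# Sketch of the mitigation suggested by the OOM message above: enable
# expandable segments in PyTorch's HIP caching allocator to reduce
# fragmentation. Must be set before torch initializes, i.e. before the
# first `import torch` in the process (or exported in the launching shell).
import os

os.environ.setdefault("PYTORCH_HIP_ALLOC_CONF", "expandable_segments:True")

# import torch  # torch picks up the setting on first import
```

Equivalently, `export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True` in the shell before starting `deepseek_ocr2_server.py` has the same effect without touching the code.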
INFO 02-27 11:35:25 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py:476: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
Traceback (most recent call last):
File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 513, in <module>
main()
File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 503, in main
os.environ["CUDA_VISIBLE_DEVICES"] = gpu_id
NameError: name 'gpu_id' is not defined
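This NameError comes from referencing a bare `gpu_id` inside `main()` instead of the parsed argument (the later runs call `initialize_model(args.model_path, args.gpu_id)`, so the attribute name is known from the traceback). A minimal sketch of the likely fix; the argument names and defaults are assumptions, not the actual server code:

```python
import argparse
import os

def main(argv=None):
    parser = argparse.ArgumentParser()
    # Hypothetical flags mirroring the attributes seen in the traceback
    # (args.model_path, args.gpu_id); actual names in the server may differ.
    parser.add_argument("--model-path", dest="model_path",
                        default="/home/lst/deepseek_ocr2")
    parser.add_argument("--gpu-id", dest="gpu_id", default="0")
    args = parser.parse_args(argv)
    # Fix: use the parsed attribute, not the undefined bare name `gpu_id`.
    os.environ["CUDA_VISIBLE_DEVICES"] = args.gpu_id
    return args

args = main(["--gpu-id", "0"])
```

Consistent with this, the run at 11:36 below gets past this line and fails later during model loading instead.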
INFO 02-27 11:36:20 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py:476: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-27 11:36:25 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-27 11:36:25 [config.py:721] This model supports multiple tasks: {'generate', 'reward', 'score', 'classify', 'embed'}. Defaulting to 'generate'.
INFO 02-27 11:36:25 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-27 11:36:26 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-27 11:36:26 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-27 11:36:26 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-27 11:36:26 [worker_base.py:653] ########## 2840 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63}
INFO 02-27 11:36:26 [worker_base.py:654] ########## 2840 process(rank0) is running on memnode(s): {0, 1, 2, 3, 4, 5, 6, 7}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0227 11:36:26.828375 2840 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0227 11:36:26.828431 2840 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:36:26.828945 2840 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55d962bc1680, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0227 11:36:26.828958 2840 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:36:26.848886 2840 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55d962bc1680, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0227 11:36:26.848920 2840 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:36:26.850196 2840 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55d962bc1680, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0227 11:36:26.850215 2840 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:36:26.851189 2840 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55d962bc1680, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0227 11:36:26.851207 2840 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-27 11:36:26 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-27 11:36:26 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-27 11:36:28 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 513, in <module>
[rank0]: main()
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 504, in main
[rank0]: initialize_model(args.model_path, args.gpu_id)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 272, in initialize_model
[rank0]: llm = LLM(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 1182, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 255, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 520, in from_engine_args
[rank0]: return engine_cls.from_vllm_config(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 496, in from_vllm_config
[rank0]: return cls(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 283, in __init__
[rank0]: self.model_executor = executor_class(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
[rank0]: self._init_executor()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
[rank0]: self.collective_rpc("load_model")
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]: answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2624, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 203, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1136, in load_model
[rank0]: self.model = get_model(vllm_config=self.vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
[rank0]: return loader.load_model(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 454, in load_model
[rank0]: model = _initialize_model(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 133, in _initialize_model
[rank0]: return model_class(vllm_config=vllm_config, prefix=prefix)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2.py", line 325, in __init__
[rank0]: self.language_model = init_vllm_registered_model(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 286, in init_vllm_registered_model
[rank0]: return _initialize_model(vllm_config=vllm_config, prefix=prefix)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 133, in _initialize_model
[rank0]: return model_class(vllm_config=vllm_config, prefix=prefix)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 457, in __init__
[rank0]: self.model = DeepseekModel(vllm_config=vllm_config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 358, in __init__
[rank0]: self.start_layer, self.end_layer, self.layers = make_layers(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 609, in make_layers
[rank0]: [PPMissingLayer() for _ in range(start_layer)] + [
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 610, in <listcomp>
[rank0]: maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 360, in <lambda>
[rank0]: lambda prefix: DeepseekDecoderLayer(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 300, in __init__
[rank0]: self.mlp = DeepseekMoE(config=config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 120, in __init__
[rank0]: self.pack_params()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 162, in pack_params
[rank0]: self.w2 = self.w2.permute(0, 2, 1).contiguous()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 79, in __torch_function__
[rank0]: return func(*args, **kwargs)
[rank0]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 140.00 MiB. GPU 0 has a total capacity of 63.98 GiB of which 0 bytes is free. Of the allocated memory 3.24 GiB is allocated by PyTorch, and 235.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
INFO 02-27 11:41:38 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py:476: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-27 11:41:43 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-27 11:41:43 [config.py:721] This model supports multiple tasks: {'score', 'classify', 'generate', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 02-27 11:41:43 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-27 11:41:44 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-27 11:41:44 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-27 11:41:44 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-27 11:41:44 [worker_base.py:653] ########## 3122 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63}
INFO 02-27 11:41:44 [worker_base.py:654] ########## 3122 process(rank0) is running on memnode(s): {0, 1, 2, 3, 4, 5, 6, 7}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0227 11:41:44.968997 3122 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0227 11:41:44.969059 3122 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:41:44.969534 3122 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55cf07c030c0, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0227 11:41:44.969554 3122 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:41:44.989876 3122 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55cf07c030c0, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0227 11:41:44.989917 3122 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:41:44.991149 3122 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55cf07c030c0, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0227 11:41:44.991173 3122 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:41:44.992142 3122 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55cf07c030c0, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0227 11:41:44.992164 3122 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-27 11:41:44 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-27 11:41:44 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-27 11:41:46 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 513, in <module>
[rank0]: main()
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 504, in main
[rank0]: initialize_model(args.model_path, args.gpu_id)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 272, in initialize_model
[rank0]: llm = LLM(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 1182, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 255, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 520, in from_engine_args
[rank0]: return engine_cls.from_vllm_config(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 496, in from_vllm_config
[rank0]: return cls(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 283, in __init__
[rank0]: self.model_executor = executor_class(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
[rank0]: self._init_executor()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
[rank0]: self.collective_rpc("load_model")
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]: answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2624, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 203, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1136, in load_model
[rank0]: self.model = get_model(vllm_config=self.vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
[rank0]: return loader.load_model(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 454, in load_model
[rank0]: model = _initialize_model(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 133, in _initialize_model
[rank0]: return model_class(vllm_config=vllm_config, prefix=prefix)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2.py", line 325, in __init__
[rank0]: self.language_model = init_vllm_registered_model(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 286, in init_vllm_registered_model
[rank0]: return _initialize_model(vllm_config=vllm_config, prefix=prefix)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 133, in _initialize_model
[rank0]: return model_class(vllm_config=vllm_config, prefix=prefix)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 457, in __init__
[rank0]: self.model = DeepseekModel(vllm_config=vllm_config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 358, in __init__
[rank0]: self.start_layer, self.end_layer, self.layers = make_layers(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 609, in make_layers
[rank0]: [PPMissingLayer() for _ in range(start_layer)] + [
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 610, in <listcomp>
[rank0]: maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 360, in <lambda>
[rank0]: lambda prefix: DeepseekDecoderLayer(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 300, in __init__
[rank0]: self.mlp = DeepseekMoE(config=config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 120, in __init__
[rank0]: self.pack_params()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 162, in pack_params
[rank0]: self.w2 = self.w2.permute(0, 2, 1).contiguous()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 79, in __torch_function__
[rank0]: return func(*args, **kwargs)
[rank0]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 140.00 MiB. GPU 0 has a total capacity of 63.98 GiB of which 0 bytes is free. Of the allocated memory 3.24 GiB is allocated by PyTorch, and 235.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
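The HIP OOM above fires during weight initialization, before KV-cache profiling even starts. A minimal sketch of the two remedies the error message itself suggests — the env var name comes from the message; the `gpu_memory_utilization` value and model path are assumptions, not taken from this log:

```python
import os

# Must be set before torch initializes the HIP allocator; per the OOM
# message, expandable segments can avoid fragmentation on ROCm.
os.environ["PYTORCH_HIP_ALLOC_CONF"] = "expandable_segments:True"

# Hypothetical follow-up: gpu_memory_utilization is a real vllm.LLM keyword,
# but 0.80 here is an illustrative value to leave init headroom.
# from vllm import LLM
# llm = LLM(model="/home/lst/deepseek_ocr2", gpu_memory_utilization=0.80)
```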
INFO 02-27 11:44:38 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py:476: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-27 11:44:43 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-27 11:44:43 [config.py:721] This model supports multiple tasks: {'reward', 'score', 'embed', 'generate', 'classify'}. Defaulting to 'generate'.
INFO 02-27 11:44:43 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-27 11:44:44 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-27 11:44:44 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-27 11:44:44 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-27 11:44:44 [worker_base.py:653] ########## 3409 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63}
INFO 02-27 11:44:44 [worker_base.py:654] ########## 3409 process(rank0) is running on memnode(s): {0, 1, 2, 3, 4, 5, 6, 7}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0227 11:44:44.820218 3409 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0227 11:44:44.820281 3409 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:44:44.820798 3409 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x557d5ac8e9c0, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0227 11:44:44.820814 3409 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:44:44.840890 3409 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x557d5ac8e9c0, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0227 11:44:44.840930 3409 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:44:44.842145 3409 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x557d5ac8e9c0, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0227 11:44:44.842170 3409 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:44:44.843230 3409 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x557d5ac8e9c0, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0227 11:44:44.843253 3409 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-27 11:44:44 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-27 11:44:44 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-27 11:44:46 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:07<00:00, 7.90s/it]
INFO 02-27 11:44:57 [loader.py:460] Loading weights took 10.70 seconds
INFO 02-27 11:44:57 [model_runner.py:1165] Model loading took 6.3336 GiB and 12.624936 seconds
Some kwargs in processor config are unused and will not have any effect: image_std, pad_token, image_token, patch_size, image_mean, ignore_id, add_special_token, downsample_ratio, mask_prompt, candidate_resolutions, normalize, sft_format.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-27 11:45:11 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-27 11:45:12 [worker.py:287] Memory profiling takes 14.13 seconds
INFO 02-27 11:45:12 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-27 11:45:12 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 2.00GiB; the rest of the memory reserved for KV Cache is 47.67GiB.
INFO 02-27 11:45:12 [executor_base.py:112] # rocm blocks: 13017, # CPU blocks: 0
INFO 02-27 11:45:12 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 101.70x
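The "Maximum concurrency ... 101.70x" line follows directly from the KV-cache block count reported just above it. Assumption: 64 tokens per block, inferred from the arithmetic rather than read from the log (vLLM's `block_size` is configurable and not printed here):

```python
# Reproduce vLLM's reported concurrency from the block count in this log.
gpu_blocks = 13017        # "# rocm blocks" above
tokens_per_block = 64     # assumed block size (inferred, not logged)
max_seq_len = 8192        # max tokens per request
concurrency = gpu_blocks * tokens_per_block / max_seq_len
print(f"{concurrency:.2f}x")  # matches the 101.70x vLLM reports
```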
INFO 02-27 11:45:12 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:03<00:00, 1.50it/s]
INFO 02-27 11:45:16 [model_runner.py:1752] Graph capturing finished in 4 secs, took 0.12 GiB
INFO 02-27 11:45:16 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 18.93 seconds
[SUCCESS] Model loaded
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [3409]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
Some kwargs in processor config are unused and will not have any effect: image_std, pad_token, image_token, patch_size, image_mean, ignore_id, add_special_token, downsample_ratio, mask_prompt, candidate_resolutions, normalize, sft_format.
[1/3] Tokenizing 19 pages...
[1/3] Tokenization complete
[2/3] GPU batch inference on 19 pages...
Processed prompts: 100%|██████████| 19/19 [00:22<00:00, 1.19s/it, est. speed input: 951.02 toks/s, output: 188.57 toks/s]
[2/3] GPU inference complete
OCR time: 28.28s
[3/3] Post-processing...
[3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
Total time: 30.77s
Average: 1.62s/page
============================================================
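A quick sanity check of the summary figures in the banner above: 19 pages in 30.77 s of wall time reproduces the reported per-page average.

```python
# Verify the averaged throughput printed in the run summary.
total_seconds = 30.77
pages = 19
per_page = total_seconds / pages
print(f"{per_page:.2f}s/page")  # 1.62s/page, as reported
```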
INFO: 127.0.0.1:56926 - "POST /ocr HTTP/1.1" 200 OK
INFO 02-04 14:25:53 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr_server.py:472: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr
INFO 02-04 14:25:58 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 14:25:58 [config.py:721] This model supports multiple tasks: {'score', 'generate', 'embed', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 02-04 14:25:58 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr', speculative_config=None, tokenizer='/home/lst/deepseek_ocr', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=3281, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-04 14:25:58 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-04 14:25:58 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-04 14:25:58 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-04 14:25:58 [worker_base.py:653] ########## 12366 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
INFO 02-04 14:25:58 [worker_base.py:654] ########## 12366 process(rank0) is running on memnode(s): {0, 1, 2, 3}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0204 14:25:59.137439 12366 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0204 14:25:59.137527 12366 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:25:59.138036 12366 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5636c9281ad0, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0204 14:25:59.138052 12366 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:25:59.157974 12366 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5636c9281ad0, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0204 14:25:59.158027 12366 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:25:59.159323 12366 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5636c9281ad0, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0204 14:25:59.159346 12366 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:25:59.160290 12366 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5636c9281ad0, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0204 14:25:59.160312 12366 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-04 14:25:59 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 14:25:59 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 14:26:00 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 6.84it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 6.81it/s]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr_server.py", line 509, in <module>
[rank0]: main()
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr_server.py", line 500, in main
[rank0]: initialize_model(args.model_path, args.gpu_id)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr_server.py", line 270, in initialize_model
[rank0]: llm = LLM(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 1182, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 255, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 520, in from_engine_args
[rank0]: return engine_cls.from_vllm_config(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 496, in from_vllm_config
[rank0]: return cls(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 283, in __init__
[rank0]: self.model_executor = executor_class(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
[rank0]: self._init_executor()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
[rank0]: self.collective_rpc("load_model")
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]: answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2624, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 203, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1136, in load_model
[rank0]: self.model = get_model(vllm_config=self.vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
[rank0]: return loader.load_model(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 457, in load_model
[rank0]: loaded_weights = model.load_weights(
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2.py", line 576, in load_weights
[rank0]: autoloaded_weights = loader.load_weights(processed_weights, mapper=self.hf_to_vllm_mapper)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 261, in load_weights
[rank0]: autoloaded_weights = set(self._load_module("", self.module, weights))
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 222, in _load_module
[rank0]: yield from self._load_module(prefix,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 195, in _load_module
[rank0]: loaded_params = module_load_weights(weights)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 494, in load_weights
[rank0]: return loader.load_weights(weights)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 261, in load_weights
[rank0]: autoloaded_weights = set(self._load_module("", self.module, weights))
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 222, in _load_module
[rank0]: yield from self._load_module(prefix,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 195, in _load_module
[rank0]: loaded_params = module_load_weights(weights)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 441, in load_weights
[rank0]: param = params_dict[name]
[rank0]: KeyError: 'image_newline'
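The `KeyError: 'image_newline'` above means the checkpoint contains a tensor the inner Deepseek language model never declares (it belongs to the vision wrapper). A minimal sketch of tolerating such extras instead of crashing — the weight names and dict contents here are illustrative, not the actual vLLM loader API:

```python
# Minimal sketch: skip checkpoint tensors absent from the module's
# parameter dict rather than raising KeyError during load_weights.
def filter_known_weights(weights, params_dict):
    """Yield only (name, tensor) pairs whose name exists in params_dict."""
    for name, tensor in weights:
        if name not in params_dict:
            # Vision-side extras like 'image_newline' belong to the wrapper
            # model, not the inner LM; drop them here.
            continue
        yield name, tensor

# Illustrative data standing in for real parameters and checkpoint tensors.
params = {"embed_tokens.weight": None, "lm_head.weight": None}
ckpt = [("embed_tokens.weight", 1), ("image_newline", 2), ("lm_head.weight", 3)]
kept = dict(filter_known_weights(ckpt, params))
```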
INFO 02-04 14:28:00 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr_server.py:472: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 14:28:05 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 14:28:05 [config.py:721] This model supports multiple tasks: {'embed', 'generate', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 02-04 14:28:05 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=3281, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-04 14:28:06 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-04 14:28:06 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-04 14:28:06 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-04 14:28:06 [worker_base.py:653] ########## 12544 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
INFO 02-04 14:28:06 [worker_base.py:654] ########## 12544 process(rank0) is running on memnode(s): {0, 1, 2, 3}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0204 14:28:06.276472 12544 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0204 14:28:06.276561 12544 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:28:06.277021 12544 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x56262096b7b0, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0204 14:28:06.277037 12544 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:28:06.296979 12544 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x56262096b7b0, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0204 14:28:06.297024 12544 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:28:06.298228 12544 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x56262096b7b0, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0204 14:28:06.298249 12544 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:28:06.299176 12544 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x56262096b7b0, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0204 14:28:06.299197 12544 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-04 14:28:06 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 14:28:06 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 14:28:07 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 7.11it/s]
INFO 02-04 14:28:09 [loader.py:460] Loading weights took 1.92 seconds
INFO 02-04 14:28:10 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.579491 seconds
Some kwargs in processor config are unused and will not have any effect: ignore_id, patch_size, image_std, sft_format, normalize, image_token, add_special_token, mask_prompt, image_mean, candidate_resolutions, pad_token, downsample_ratio.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 14:28:19 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 14:28:20 [worker.py:287] Memory profiling takes 10.29 seconds
INFO 02-04 14:28:20 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-04 14:28:20 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 1.86GiB; the rest of the memory reserved for KV Cache is 47.81GiB.
INFO 02-04 14:28:20 [executor_base.py:112] # rocm blocks: 13055, # CPU blocks: 1092
INFO 02-04 14:28:20 [executor_base.py:117] Maximum concurrency for 3281 tokens per request: 254.65x
INFO 02-04 14:28:22 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:02<00:00, 2.00it/s]
INFO 02-04 14:28:25 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-04 14:28:25 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 15.87 seconds
[SUCCESS] Model loaded
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [12544]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
Some kwargs in processor config are unused and will not have any effect: ignore_id, patch_size, image_std, sft_format, normalize, image_token, add_special_token, mask_prompt, image_mean, candidate_resolutions, pad_token, downsample_ratio.
[1/3] Tokenizing 19 pages...
[1/3] Tokenization complete
[2/3] GPU batch inference on 19 pages...
Processed prompts: 100%|██████████| 19/19 [00:21<00:00, 1.12s/it, est. speed input: 1006.67 toks/s, output: 193.52 toks/s]
[2/3] GPU inference complete
OCR time: 26.37s
[3/3] Post-processing...
[3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
Total time: 28.83s
Average: 1.52s/page
============================================================
INFO: 127.0.0.1:51734 - "POST /ocr HTTP/1.1" 200 OK
[1/3] Tokenizing 19 pages...
[1/3] Tokenization complete
[2/3] GPU batch inference on 19 pages...
Processed prompts: 100%|██████████| 19/19 [00:21<00:00, 1.13s/it, est. speed input: 1003.54 toks/s, output: 192.92 toks/s]
[2/3] GPU inference complete
OCR time: 24.64s
[3/3] Post-processing...
[3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
Total time: 27.09s
Average: 1.43s/page
============================================================
INFO: 127.0.0.1:34080 - "POST /ocr HTTP/1.1" 200 OK
[1/3] Tokenizing 19 pages...
[1/3] Tokenization complete
[2/3] GPU batch inference on 19 pages...
Processed prompts: 100%|██████████| 19/19 [00:21<00:00, 1.11s/it, est. speed input: 1016.79 toks/s, output: 195.47 toks/s]
[2/3] GPU inference complete
OCR time: 24.22s
[3/3] Post-processing...
[3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
Total time: 26.65s
Average: 1.40s/page
============================================================
INFO: 127.0.0.1:47994 - "POST /ocr HTTP/1.1" 200 OK
[1/3] Tokenizing 19 pages...
[1/3] Tokenization complete
[2/3] GPU batch inference on 19 pages...
Processed prompts: 100%|██████████| 19/19 [00:21<00:00, 1.11s/it, est. speed input: 1014.71 toks/s, output: 195.06 toks/s]
[2/3] GPU inference complete
OCR time: 24.31s
[3/3] Post-processing...
[3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
Total time: 26.75s
Average: 1.41s/page
============================================================
INFO: 127.0.0.1:49594 - "POST /ocr HTTP/1.1" 200 OK
INFO 02-04 14:39:01 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr_server.py:472: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 14:39:06 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 14:39:06 [config.py:721] This model supports multiple tasks: {'score', 'reward', 'embed', 'classify', 'generate'}. Defaulting to 'generate'.
INFO 02-04 14:39:06 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-04 14:39:06 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-04 14:39:06 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-04 14:39:06 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-04 14:39:06 [worker_base.py:653] ########## 13561 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
INFO 02-04 14:39:06 [worker_base.py:654] ########## 13561 process(rank0) is running on memnode(s): {0, 1, 2, 3}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0204 14:39:07.188728 13561 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0204 14:39:07.188807 13561 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:39:07.189251 13561 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5640a3302a40, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0204 14:39:07.189267 13561 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:39:07.208966 13561 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5640a3302a40, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0204 14:39:07.209004 13561 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:39:07.210227 13561 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5640a3302a40, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0204 14:39:07.210245 13561 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:39:07.211236 13561 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5640a3302a40, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0204 14:39:07.211254 13561 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-04 14:39:07 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 14:39:07 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 14:39:08 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 7.16it/s]
INFO 02-04 14:39:10 [loader.py:460] Loading weights took 1.84 seconds
INFO 02-04 14:39:10 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.529130 seconds
Some kwargs in processor config are unused and will not have any effect: normalize, image_std, candidate_resolutions, sft_format, image_mean, add_special_token, patch_size, pad_token, mask_prompt, downsample_ratio, image_token, ignore_id.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 14:39:22 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 14:39:23 [worker.py:287] Memory profiling takes 11.97 seconds
INFO 02-04 14:39:23 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-04 14:39:23 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 2.00GiB; the rest of the memory reserved for KV Cache is 47.67GiB.
INFO 02-04 14:39:23 [executor_base.py:112] # rocm blocks: 13017, # CPU blocks: 1092
INFO 02-04 14:39:23 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 101.70x
INFO 02-04 14:39:25 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:03<00:00, 1.99it/s]
INFO 02-04 14:39:28 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-04 14:39:28 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 17.42 seconds
[SUCCESS] Model loaded
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [13561]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
Some kwargs in processor config are unused and will not have any effect: normalize, image_std, candidate_resolutions, sft_format, image_mean, add_special_token, patch_size, pad_token, mask_prompt, downsample_ratio, image_token, ignore_id.
[1/3] Tokenizing 19 pages...
[1/3] Tokenization complete
[2/3] GPU batch inference on 19 pages...
Processed prompts: 100%|██████████| 19/19 [00:21<00:00, 1.13s/it, est. speed input: 1002.04 toks/s, output: 194.59 toks/s]
[2/3] GPU inference complete
OCR time: 26.52s
[3/3] Post-processing...
[3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
Total time: 28.98s
Average: 1.53s/page
============================================================
INFO: 127.0.0.1:53340 - "POST /ocr HTTP/1.1" 200 OK
INFO 02-04 14:50:16 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr_server.py:472: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 14:50:21 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 14:50:21 [config.py:721] This model supports multiple tasks: {'classify', 'embed', 'score', 'reward', 'generate'}. Defaulting to 'generate'.
INFO 02-04 14:50:21 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-04 14:50:22 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-04 14:50:22 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-04 14:50:22 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-04 14:50:22 [worker_base.py:653] ########## 14555 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
INFO 02-04 14:50:22 [worker_base.py:654] ########## 14555 process(rank0) is running on memnode(s): {0, 1, 2, 3}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0204 14:50:22.719424 14555 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0204 14:50:22.719481 14555 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:50:22.719913 14555 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55e48a312250, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0204 14:50:22.719926 14555 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:50:22.738953 14555 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55e48a312250, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0204 14:50:22.738993 14555 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:50:22.740214 14555 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55e48a312250, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0204 14:50:22.740231 14555 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:50:22.741233 14555 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55e48a312250, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0204 14:50:22.741250 14555 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-04 14:50:22 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 14:50:22 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 14:50:23 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 7.12it/s]
INFO 02-04 14:50:26 [loader.py:460] Loading weights took 1.81 seconds
INFO 02-04 14:50:26 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.499614 seconds
Some kwargs in processor config are unused and will not have any effect: ignore_id, image_token, add_special_token, sft_format, image_mean, image_std, mask_prompt, downsample_ratio, candidate_resolutions, patch_size, pad_token, normalize.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 14:50:38 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 14:50:39 [worker.py:287] Memory profiling takes 12.33 seconds
INFO 02-04 14:50:39 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-04 14:50:39 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 2.00GiB; the rest of the memory reserved for KV Cache is 47.67GiB.
INFO 02-04 14:50:39 [executor_base.py:112] # rocm blocks: 13017, # CPU blocks: 1092
INFO 02-04 14:50:39 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 101.70x
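The 101.70x concurrency figure can be reproduced from the KV-cache block count two lines above. A minimal sketch, assuming a 64-token KV-cache block size on this ROCm build (an assumption inferred from the numbers; vLLM's usual default is 16, but only 64 matches this log):

```python
# Reproducing "Maximum concurrency for 8192 tokens per request: 101.70x".
# BLOCK_SIZE = 64 is an assumption inferred from the log output.
NUM_GPU_BLOCKS = 13017   # "# rocm blocks: 13017" from the log
BLOCK_SIZE = 64          # tokens per KV-cache block (assumed)
MAX_SEQ_LEN = 8192       # max_seq_len from the engine config

total_kv_tokens = NUM_GPU_BLOCKS * BLOCK_SIZE        # 833,088 cacheable tokens
max_concurrency = total_kv_tokens / MAX_SEQ_LEN      # ≈ 101.70
print(f"Maximum concurrency for {MAX_SEQ_LEN} tokens per request: "
      f"{max_concurrency:.2f}x")
```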
INFO 02-04 14:50:41 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:02<00:00, 2.01it/s]
INFO 02-04 14:50:44 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-04 14:50:44 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 17.52 seconds
[SUCCESS] Model loaded
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [14555]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
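Once Uvicorn is up, the `/ocr` endpoint seen in the access log below (`"POST /ocr HTTP/1.1" 200 OK`) can be called. A hedged client sketch using only the standard library; the request body and field name are assumptions, since the server's actual schema is not shown in this log:

```python
import json
import urllib.request

OCR_URL = "http://0.0.0.0:8707/ocr"  # endpoint seen in the access log

def build_ocr_request(pdf_path: str) -> urllib.request.Request:
    # Assumed JSON body with a file path; the real server may instead
    # expect a multipart upload or base64-encoded pages.
    payload = json.dumps({"file": pdf_path}).encode("utf-8")
    return urllib.request.Request(
        OCR_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_ocr_request("/tmp/sample.pdf")
# urllib.request.urlopen(req) would send it once the server is running.
```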
Some kwargs in processor config are unused and will not have any effect: ignore_id, image_token, add_special_token, sft_format, image_mean, image_std, mask_prompt, downsample_ratio, candidate_resolutions, patch_size, pad_token, normalize.
[1/3] Tokenizing 19 pages...
[1/3] Tokenization complete
[2/3] GPU batch inference on 19 pages...
Processed prompts: 100%|██████████| 19/19 [00:21<00:00,  1.14s/it, est. speed input: 993.47 toks/s, output: 192.92 toks/s]
[2/3] GPU inference complete
OCR time: 26.65s
[3/3] Post-processing...
[3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
Total time: 29.10s
Average: 1.53s/page
============================================================
INFO: 127.0.0.1:55910 - "POST /ocr HTTP/1.1" 200 OK
[1/3] Tokenizing 19 pages...
[1/3] Tokenization complete
[2/3] GPU batch inference on 19 pages...
Processed prompts: 100%|██████████| 19/19 [00:21<00:00,  1.13s/it, est. speed input: 999.56 toks/s, output: 194.11 toks/s]
[2/3] GPU inference complete
OCR time: 24.63s
[3/3] Post-processing...
[3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
Total time: 27.07s
Average: 1.42s/page
============================================================
INFO: 127.0.0.1:43926 - "POST /ocr HTTP/1.1" 200 OK
[1/3] Tokenizing 22 pages...
[1/3] Tokenization complete
[2/3] GPU batch inference on 22 pages...
Processed prompts: 100%|██████████| 22/22 [00:53<00:00,  2.43s/it, est. speed input: 466.14 toks/s, output: 316.51 toks/s]
[2/3] GPU inference complete
OCR time: 58.12s
[3/3] Post-processing...
[3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
Total time: 69.56s
Average: 3.16s/page
============================================================
INFO: 127.0.0.1:55008 - "POST /ocr HTTP/1.1" 200 OK
[1/3] Tokenizing 22 pages...
[1/3] Tokenization complete
[2/3] GPU batch inference on 22 pages...
Processed prompts: 100%|██████████| 22/22 [00:53<00:00,  2.45s/it, est. speed input: 461.15 toks/s, output: 313.13 toks/s]
[2/3] GPU inference complete
OCR time: 58.66s
[3/3] Post-processing...
[3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
Total time: 70.07s
Average: 3.19s/page
============================================================
INFO: 127.0.0.1:46898 - "POST /ocr HTTP/1.1" 200 OK
[1/3] Tokenizing 22 pages...
[1/3] Tokenization complete
[2/3] GPU batch inference on 22 pages...
Processed prompts: 100%|██████████| 22/22 [00:54<00:00,  2.46s/it, est. speed input: 460.32 toks/s, output: 312.56 toks/s]
[2/3] GPU inference complete
OCR time: 58.74s
[3/3] Post-processing...
[3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
Total time: 70.17s
Average: 3.19s/page
============================================================
INFO: 127.0.0.1:45882 - "POST /ocr HTTP/1.1" 200 OK
[1/3] Tokenizing 22 pages...
[1/3] Tokenization complete
[2/3] GPU batch inference on 22 pages...
Processed prompts: 100%|██████████| 22/22 [00:53<00:00,  2.44s/it, est. speed input: 463.82 toks/s, output: 314.93 toks/s]
[2/3] GPU inference complete
OCR time: 58.24s
[3/3] Post-processing...
[3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
Total time: 69.68s
Average: 3.17s/page
============================================================
INFO: 127.0.0.1:44226 - "POST /ocr HTTP/1.1" 200 OK
INFO 02-04 16:50:28 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr_server.py:475: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
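The DeprecationWarning above refers to FastAPI's move from `@app.on_event(...)` hooks to the lifespan pattern. A minimal sketch of that pattern, with hypothetical `load_model`/`release_model` placeholders standing in for whatever the server does at startup/shutdown (FastAPI itself is not required to demonstrate the ordering):

```python
# Sketch of FastAPI's lifespan pattern, replacing @app.on_event("shutdown").
# load_model/release_model are hypothetical placeholders, not the server's code.
import asyncio
from contextlib import asynccontextmanager

events = []

def load_model():
    """Placeholder for model loading done at startup."""
    events.append("startup")

def release_model():
    """Placeholder for the cleanup previously in @app.on_event("shutdown")."""
    events.append("shutdown")

@asynccontextmanager
async def lifespan(app):
    load_model()      # runs before the server starts accepting requests
    yield
    release_model()   # runs on shutdown

# With FastAPI installed this would be wired up as:
#   app = FastAPI(lifespan=lifespan)
# Here we drive the context manager directly to show the ordering:
async def _demo():
    async with lifespan(None):
        events.append("serving")

asyncio.run(_demo())
```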
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 16:50:33 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 16:50:33 [config.py:721] This model supports multiple tasks: {'embed', 'reward', 'score', 'generate', 'classify'}. Defaulting to 'generate'.
INFO 02-04 16:50:33 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-04 16:50:34 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-04 16:50:34 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-04 16:50:34 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-04 16:50:34 [worker_base.py:653] ########## 42007 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
INFO 02-04 16:50:34 [worker_base.py:654] ########## 42007 process(rank0) is running on memnode(s): {0, 1, 2, 3}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0204 16:50:34.705025 42007 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0204 16:50:34.705101 42007 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 16:50:34.705564 42007 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x564c40e84ce0, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0204 16:50:34.705579 42007 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 16:50:34.724964 42007 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x564c40e84ce0, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0204 16:50:34.725004 42007 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 16:50:34.726367 42007 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x564c40e84ce0, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0204 16:50:34.726390 42007 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 16:50:34.727412 42007 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x564c40e84ce0, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0204 16:50:34.727430 42007 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-04 16:50:34 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 16:50:34 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 16:50:35 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 6.91it/s]
INFO 02-04 16:50:38 [loader.py:460] Loading weights took 1.73 seconds
INFO 02-04 16:50:38 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.394852 seconds
Some kwargs in processor config are unused and will not have any effect: normalize, pad_token, downsample_ratio, mask_prompt, patch_size, sft_format, ignore_id, image_token, image_mean, image_std, add_special_token, candidate_resolutions.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 16:50:49 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 16:50:50 [worker.py:287] Memory profiling takes 12.06 seconds
INFO 02-04 16:50:50 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-04 16:50:50 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 2.00GiB; the rest of the memory reserved for KV Cache is 47.67GiB.
INFO 02-04 16:50:50 [executor_base.py:112] # rocm blocks: 13017, # CPU blocks: 0
INFO 02-04 16:50:50 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 101.70x
INFO 02-04 16:50:50 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:02<00:00, 2.01it/s]
INFO 02-04 16:50:53 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-04 16:50:53 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 15.65 seconds
[SUCCESS] Model loaded
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [42007]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
INFO 02-04 16:59:36 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr_server.py:478: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 16:59:41 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 16:59:41 [config.py:721] This model supports multiple tasks: {'embed', 'classify', 'generate', 'score', 'reward'}. Defaulting to 'generate'.
INFO 02-04 16:59:41 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-04 16:59:41 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-04 16:59:41 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-04 16:59:41 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-04 16:59:41 [worker_base.py:653] ########## 42893 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
INFO 02-04 16:59:41 [worker_base.py:654] ########## 42893 process(rank0) is running on memnode(s): {0, 1, 2, 3}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0204 16:59:42.128494 42893 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0204 16:59:42.128578 42893 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 16:59:42.129037 42893 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5571e7a2cd20, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0204 16:59:42.129053 42893 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 16:59:42.149024 42893 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5571e7a2cd20, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0204 16:59:42.149070 42893 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 16:59:42.150310 42893 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5571e7a2cd20, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0204 16:59:42.150334 42893 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 16:59:42.151289 42893 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5571e7a2cd20, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0204 16:59:42.151312 42893 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-04 16:59:42 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 16:59:42 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 16:59:43 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 7.14it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 7.12it/s]
INFO 02-04 16:59:45 [loader.py:460] Loading weights took 1.71 seconds
INFO 02-04 16:59:45 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.386030 seconds
Some kwargs in processor config are unused and will not have any effect: normalize, image_std, mask_prompt, image_token, candidate_resolutions, add_special_token, image_mean, pad_token, downsample_ratio, patch_size, ignore_id, sft_format.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 16:59:57 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 16:59:57 [worker.py:287] Memory profiling takes 12.02 seconds
INFO 02-04 16:59:57 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-04 16:59:57 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 2.00GiB; the rest of the memory reserved for KV Cache is 47.67GiB.
INFO 02-04 16:59:58 [executor_base.py:112] # rocm blocks: 13017, # CPU blocks: 0
INFO 02-04 16:59:58 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 101.70x
INFO 02-04 16:59:58 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:02<00:00, 2.05it/s]
INFO 02-04 17:00:01 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-04 17:00:01 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 15.56 seconds
[SUCCESS] Model loading complete
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Server started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [42893]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
Some kwargs in processor config are unused and will not have any effect: normalize, image_std, mask_prompt, image_token, candidate_resolutions, add_special_token, image_mean, pad_token, downsample_ratio, patch_size, ignore_id, sft_format.
[1/3] Tokenizing 22 pages...
[1/3] Tokenization complete
[2/3] GPU batch inference on 22 pages...
Processed prompts: 100%|██████████| 22/22 [00:53<00:00, 2.43s/it, est. speed input: 465.01 toks/s, output: 315.75 toks/s]
[2/3] GPU inference complete
OCR time: 60.07s
[3/3] Post-processing...
[3/3] Post-processing complete (0.01s)
============================================================
[SUCCESS] All done
Total time: 71.52s
Average: 3.25s/page
============================================================
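The per-page average in the summary above is total wall time over the 22-page batch; the OCR stage alone is cheaper per page. A quick check of the arithmetic:

```python
# Verify the per-page averages from the timing summary in the log above.
total_s, ocr_s, pages = 71.52, 60.07, 22   # figures taken from the log

per_page_total = total_s / pages   # wall time incl. tokenize + post-process
per_page_ocr = ocr_s / pages       # GPU inference stage only

print(f"{per_page_total:.2f}s/page total, {per_page_ocr:.2f}s/page OCR")
# → 3.25s/page total, 2.73s/page OCR
```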
INFO: 127.0.0.1:52722 - "POST /ocr HTTP/1.1" 200 OK
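The log only shows that the service accepts `POST /ocr` on port 8707; the request schema is not visible. A minimal client sketch, assuming a multipart upload whose field is named `file` (both the field name and the JSON response shape are assumptions, not confirmed by the log):

```python
# Hypothetical client for the /ocr endpoint observed in the log above.
# Only the route (POST /ocr, port 8707) is confirmed by the log; the
# multipart field name "file" and the response format are assumptions.
import json
import urllib.request
import uuid


def build_ocr_request(pdf_path: str,
                      url: str = "http://127.0.0.1:8707/ocr") -> urllib.request.Request:
    """Build a multipart/form-data POST uploading one PDF."""
    boundary = uuid.uuid4().hex
    with open(pdf_path, "rb") as f:
        payload = f.read()
    body = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="file"; filename="doc.pdf"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode() + payload + f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_ocr_request("sample.pdf")
    with urllib.request.urlopen(req) as resp:   # requires the server to be running
        print(json.loads(resp.read()))
```

Only the request construction runs offline; the actual call needs the server from the log to be up.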
INFO 02-04 17:17:27 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr_server.py:478: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 17:17:32 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 17:17:32 [config.py:721] This model supports multiple tasks: {'generate', 'embed', 'score', 'reward', 'classify'}. Defaulting to 'generate'.
INFO 02-04 17:17:32 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-04 17:17:33 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-04 17:17:33 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-04 17:17:33 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-04 17:17:33 [worker_base.py:653] ########## 44064 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
INFO 02-04 17:17:33 [worker_base.py:654] ########## 44064 process(rank0) is running on memnode(s): {0, 1, 2, 3}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0204 17:17:33.456287 44064 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0204 17:17:33.456384 44064 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 17:17:33.456838 44064 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x56216603a700, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0204 17:17:33.456851 44064 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 17:17:33.476958 44064 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x56216603a700, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0204 17:17:33.476995 44064 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 17:17:33.478204 44064 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x56216603a700, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0204 17:17:33.478221 44064 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 17:17:33.479249 44064 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x56216603a700, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0204 17:17:33.479267 44064 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-04 17:17:33 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 17:17:33 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 17:17:34 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 7.15it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 7.13it/s]
INFO 02-04 17:17:36 [loader.py:460] Loading weights took 1.72 seconds
INFO 02-04 17:17:37 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.400736 seconds
Some kwargs in processor config are unused and will not have any effect: sft_format, add_special_token, patch_size, image_std, candidate_resolutions, ignore_id, pad_token, downsample_ratio, image_token, image_mean, normalize, mask_prompt.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 17:17:48 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 17:17:49 [worker.py:287] Memory profiling takes 12.09 seconds
INFO 02-04 17:17:49 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-04 17:17:49 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 2.00GiB; the rest of the memory reserved for KV Cache is 47.67GiB.
INFO 02-04 17:17:49 [executor_base.py:112] # rocm blocks: 13017, # CPU blocks: 0
INFO 02-04 17:17:49 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 101.70x
INFO 02-04 17:17:49 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:02<00:00, 2.02it/s]
INFO 02-04 17:17:52 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-04 17:17:52 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 15.69 seconds
[SUCCESS] Model loading complete
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Server started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [44064]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
/usr/bin/python3: can't open file '/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr_server.py': [Errno 2] No such file or directory
/usr/bin/python3: can't open file '/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr_server.py': [Errno 2] No such file or directory
export VLLM_USE_V1=0
export HIP_VISIBLE_DEVICES=3
# image: streaming output
#python run_dpsk_ocr2_image.py
# pdf
python run_dpsk_ocr2_pdf.py