Commit 6ad287f7 authored by liuxu3

added DeepSeek OCR API by liushengtong

parent 80c11a03
INFO 02-04 18:32:47 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py:476: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 18:32:52 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 18:32:52 [config.py:721] This model supports multiple tasks: {'generate', 'classify', 'embed', 'reward', 'score'}. Defaulting to 'generate'.
INFO 02-04 18:32:52 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-04 18:32:53 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-04 18:32:53 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-04 18:32:53 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-04 18:32:53 [worker_base.py:653] ########## 49838 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
INFO 02-04 18:32:53 [worker_base.py:654] ########## 49838 process(rank0) is running on memnode(s): {0, 1, 2, 3}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0204 18:32:53.341442 49838 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0204 18:32:53.341506 49838 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 18:32:53.341955 49838 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5635f2a9db40, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0204 18:32:53.341969 49838 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 18:32:53.361923 49838 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5635f2a9db40, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0204 18:32:53.361964 49838 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 18:32:53.363456 49838 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5635f2a9db40, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0204 18:32:53.363479 49838 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 18:32:53.364511 49838 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5635f2a9db40, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0204 18:32:53.364529 49838 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-04 18:32:53 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 18:32:53 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 18:32:54 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 6.89it/s]
INFO 02-04 18:32:56 [loader.py:460] Loading weights took 1.84 seconds
INFO 02-04 18:32:57 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.543877 seconds
Some kwargs in processor config are unused and will not have any effect: pad_token, add_special_token, sft_format, candidate_resolutions, patch_size, normalize, downsample_ratio, image_std, image_token, ignore_id, mask_prompt, image_mean.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 513, in <module>
[rank0]: main()
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 504, in main
[rank0]: initialize_model(args.model_path, args.gpu_id)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 272, in initialize_model
[rank0]: llm = LLM(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 1182, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 255, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 520, in from_engine_args
[rank0]: return engine_cls.from_vllm_config(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 496, in from_vllm_config
[rank0]: return cls(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 286, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 432, in _initialize_kv_caches
[rank0]: self.model_executor.determine_num_available_blocks())
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 103, in determine_num_available_blocks
[rank0]: results = self.collective_rpc("determine_num_available_blocks")
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]: answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2624, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 249, in determine_num_available_blocks
[rank0]: self.model_runner.profile_run()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1262, in profile_run
[rank0]: self._dummy_run(max_num_batched_tokens, max_num_seqs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1388, in _dummy_run
[rank0]: self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1948, in execute_model
[rank0]: hidden_or_intermediate_states = model_executable(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2.py", line 541, in forward
[rank0]: vision_embeddings = self.get_multimodal_embeddings(**kwargs)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2.py", line 501, in get_multimodal_embeddings
[rank0]: vision_embeddings = self._process_image_input(image_input)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2.py", line 487, in _process_image_input
[rank0]: vision_features = self._pixel_values_to_embedding(
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2.py", line 401, in _pixel_values_to_embedding
[rank0]: local_features_2 = self.qwen2_model(local_features_1)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/qwen2_d2e.py", line 264, in forward
[rank0]: batch_query_imgs = param_img.unsqueeze(0).expand(
[rank0]: UnboundLocalError: local variable 'param_img' referenced before assignment
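The `UnboundLocalError` that kills this run is the classic pattern of a local name bound only inside a branch (or a loop body that may execute zero times) and then used unconditionally afterwards. A minimal reproduction and the usual fix (names are illustrative, not the actual qwen2_d2e.py code):

```python
def broken(images):
    # 'param_img' is bound only inside the loop; with an empty input the
    # name is never assigned, so the final use raises UnboundLocalError,
    # just like qwen2_d2e.py line 264 during the profile_run dummy batch.
    for img in images:
        param_img = img
    return param_img


def fixed(images, default=None):
    param_img = default  # bind unconditionally before the branch/loop
    for img in images:
        param_img = img
    return param_img
```

The crash happens during vLLM's `profile_run`, which feeds the model a dummy batch, so the branch that normally assigns `param_img` is evidently skipped for that input shape.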
INFO 02-04 18:33:41 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py:476: DeprecationWarning: on_event is deprecated, use lifespan event handlers instead. (see above)
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 18:33:45 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 18:33:45 [config.py:721] This model supports multiple tasks: {'generate', 'score', 'reward', 'classify', 'embed'}. Defaulting to 'generate'.
INFO 02-04 18:33:45 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-04 18:33:46 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-04 18:33:46 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-04 18:33:46 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-04 18:33:46 [worker_base.py:653] ########## 50743 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
INFO 02-04 18:33:46 [worker_base.py:654] ########## 50743 process(rank0) is running on memnode(s): {0, 1, 2, 3}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0204 18:33:46.885210 50743 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0204 18:33:46.885288 50743 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 18:33:46.885742 50743 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55b223a8f570, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0204 18:33:46.885754 50743 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 18:33:46.904980 50743 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55b223a8f570, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0204 18:33:46.905019 50743 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 18:33:46.906283 50743 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55b223a8f570, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0204 18:33:46.906301 50743 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 18:33:46.907312 50743 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55b223a8f570, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0204 18:33:46.907331 50743 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-04 18:33:46 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 18:33:46 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 18:33:47 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 7.16it/s]
INFO 02-04 18:33:50 [loader.py:460] Loading weights took 1.73 seconds
INFO 02-04 18:33:50 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.422877 seconds
Some kwargs in processor config are unused and will not have any effect: mask_prompt, ignore_id, add_special_token, pad_token, image_token, normalize, downsample_ratio, candidate_resolutions, sft_format, patch_size, image_mean, image_std.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 18:34:02 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 18:34:02 [worker.py:287] Memory profiling takes 12.06 seconds
INFO 02-04 18:34:02 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-04 18:34:02 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 2.00GiB; the rest of the memory reserved for KV Cache is 47.67GiB.
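The memory accounting in these worker.py lines can be reproduced from the reported figures; a small sketch (values copied from the log, with small rounding differences against the printed 57.59/47.67 because vLLM rounds after computing at full precision):

```python
total_gpu_memory = 63.98          # GiB, reported by the worker
gpu_memory_utilization = 0.90

budget = total_gpu_memory * gpu_memory_utilization  # usable budget

model_weights = 6.33              # GiB
non_torch_memory = 1.58           # GiB
activation_peak = 2.00            # GiB, PyTorch activation peak

kv_cache = budget - model_weights - non_torch_memory - activation_peak

print(f"budget={budget:.2f} GiB, kv_cache={kv_cache:.2f} GiB")
```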
INFO 02-04 18:34:03 [executor_base.py:112] # rocm blocks: 13017, # CPU blocks: 0
INFO 02-04 18:34:03 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 101.70x
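The 101.70x concurrency figure follows from the KV-cache block count: blocks × tokens-per-block ÷ max tokens per request. The block size is not printed in this log; 64 is inferred as the only value consistent with the reported numbers:

```python
num_gpu_blocks = 13017   # "# rocm blocks" from the log
max_seq_len = 8192       # tokens per request
block_size = 64          # inferred from 101.70x, not printed in the log

max_concurrency = num_gpu_blocks * block_size / max_seq_len
print(f"{max_concurrency:.2f}x")
```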
INFO 02-04 18:34:03 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:02<00:00, 2.01it/s]
INFO 02-04 18:34:06 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-04 18:34:06 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 15.67 seconds
[SUCCESS] Model loading complete
[INFO] Thread pool configuration:
    - CPU thread pool: 2 threads
    - GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [50743]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
INFO 02-04 18:34:51 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py:476: DeprecationWarning: on_event is deprecated, use lifespan event handlers instead. (see above)
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 18:34:56 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 18:34:56 [config.py:721] This model supports multiple tasks: {'generate', 'embed', 'classify', 'reward', 'score'}. Defaulting to 'generate'.
INFO 02-04 18:34:56 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-04 18:34:56 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-04 18:34:56 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-04 18:34:56 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-04 18:34:56 [worker_base.py:653] ########## 51630 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
INFO 02-04 18:34:56 [worker_base.py:654] ########## 51630 process(rank0) is running on memnode(s): {0, 1, 2, 3}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0204 18:34:57.060443 51630 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0204 18:34:57.060510 51630 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 18:34:57.060959 51630 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55b96dd7c0d0, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0204 18:34:57.060974 51630 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 18:34:57.080931 51630 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55b96dd7c0d0, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0204 18:34:57.080973 51630 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 18:34:57.082218 51630 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55b96dd7c0d0, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0204 18:34:57.082239 51630 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 18:34:57.083209 51630 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55b96dd7c0d0, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0204 18:34:57.083231 51630 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-04 18:34:57 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 18:34:57 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 18:34:58 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 6.95it/s]
INFO 02-04 18:35:00 [loader.py:460] Loading weights took 1.82 seconds
INFO 02-04 18:35:00 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.480003 seconds
Some kwargs in processor config are unused and will not have any effect: image_std, pad_token, normalize, ignore_id, mask_prompt, patch_size, downsample_ratio, add_special_token, sft_format, image_token, image_mean, candidate_resolutions.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 513, in <module>
[rank0]: main()
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 504, in main
[rank0]: initialize_model(args.model_path, args.gpu_id)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 272, in initialize_model
[rank0]: llm = LLM(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 1182, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 255, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 520, in from_engine_args
[rank0]: return engine_cls.from_vllm_config(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 496, in from_vllm_config
[rank0]: return cls(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 286, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 432, in _initialize_kv_caches
[rank0]: self.model_executor.determine_num_available_blocks())
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 103, in determine_num_available_blocks
[rank0]: results = self.collective_rpc("determine_num_available_blocks")
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]: answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2624, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 249, in determine_num_available_blocks
[rank0]: self.model_runner.profile_run()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1262, in profile_run
[rank0]: self._dummy_run(max_num_batched_tokens, max_num_seqs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1388, in _dummy_run
[rank0]: self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1948, in execute_model
[rank0]: hidden_or_intermediate_states = model_executable(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2.py", line 541, in forward
[rank0]: vision_embeddings = self.get_multimodal_embeddings(**kwargs)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2.py", line 501, in get_multimodal_embeddings
[rank0]: vision_embeddings = self._process_image_input(image_input)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2.py", line 487, in _process_image_input
[rank0]: vision_features = self._pixel_values_to_embedding(
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2.py", line 401, in _pixel_values_to_embedding
[rank0]: local_features_2 = self.qwen2_model(local_features_1)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/qwen2_d2e.py", line 264, in forward
[rank0]: batch_query_imgs = param_img.unsqueeze(0).expand(
[rank0]: UnboundLocalError: local variable 'param_img' referenced before assignment
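The traceback above ends in the classic Python failure pattern where a variable is bound only inside a conditional branch but used unconditionally afterward. The sketch below reproduces the pattern with plain lists standing in for tensors; `param_img` comes from the log, but the branch condition and shapes are hypothetical, since the actual logic in `qwen2_d2e.py` is not shown here.

```python
def forward_sketch(batch_size: int, use_query_imgs: bool):
    """Minimal reproduction of the qwen2_d2e.py crash pattern:
    param_img is bound only inside one branch, so the later
    unconditional reference raises UnboundLocalError."""
    if use_query_imgs:
        param_img = [[0.0] * 8 for _ in range(4)]  # stand-in for a tensor
    # BUG: when use_query_imgs is False, param_img was never bound,
    # so the next line raises UnboundLocalError, as in the traceback.
    # Fix sketch: bind a default (or raise a clear error) before the
    # branch so every path defines param_img.
    batch = [param_img] * batch_size
    return batch
```

The fix is to make every code path define the variable, or to fail fast with a descriptive error when the input that would populate it is absent.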
INFO 02-05 10:03:22 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py:476: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-05 10:03:27 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-05 10:03:27 [config.py:721] This model supports multiple tasks: {'score', 'generate', 'reward', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 02-05 10:03:27 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-05 10:03:28 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-05 10:03:28 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-05 10:03:28 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-05 10:03:28 [worker_base.py:653] ########## 58452 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
INFO 02-05 10:03:28 [worker_base.py:654] ########## 58452 process(rank0) is running on memnode(s): {0, 1, 2, 3}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0205 10:03:28.481029 58452 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0205 10:03:28.481109 58452 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0205 10:03:28.481564 58452 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55d4311a9f80, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0205 10:03:28.481581 58452 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0205 10:03:28.500953 58452 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55d4311a9f80, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0205 10:03:28.500998 58452 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0205 10:03:28.502223 58452 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55d4311a9f80, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0205 10:03:28.502246 58452 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0205 10:03:28.503187 58452 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55d4311a9f80, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0205 10:03:28.503209 58452 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-05 10:03:28 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-05 10:03:28 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-05 10:03:29 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 7.13it/s]
INFO 02-05 10:03:31 [loader.py:460] Loading weights took 1.79 seconds
INFO 02-05 10:03:32 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.454019 seconds
Some kwargs in processor config are unused and will not have any effect: image_token, sft_format, pad_token, image_std, add_special_token, ignore_id, patch_size, downsample_ratio, normalize, candidate_resolutions, image_mean, mask_prompt.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-05 10:03:43 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-05 10:03:44 [worker.py:287] Memory profiling takes 11.98 seconds
INFO 02-05 10:03:44 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-05 10:03:44 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 2.00GiB; the rest of the memory reserved for KV Cache is 47.67GiB.
INFO 02-05 10:03:44 [executor_base.py:112] # rocm blocks: 13017, # CPU blocks: 0
INFO 02-05 10:03:44 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 101.70x
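The block count and concurrency figures above are consistent with a 64-token KV-cache block size. That block size is an inference from the numbers, not something the log prints, so treat it as an assumption:

```python
# Sketch of the arithmetic behind the two executor_base.py lines above.
num_gpu_blocks = 13017    # "# rocm blocks" from the log
block_size = 64           # assumed tokens per KV-cache block (not logged)
max_seq_len = 8192        # max_seq_len from the engine config

total_kv_tokens = num_gpu_blocks * block_size
max_concurrency = total_kv_tokens / max_seq_len
print(f"{max_concurrency:.2f}x")  # matches the reported 101.70x
```

In other words, the cache can hold roughly 101 full-length (8192-token) sequences at once; shorter requests allow proportionally more concurrency.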
INFO 02-05 10:03:44 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:03<00:00,  2.00it/s]
INFO 02-05 10:03:47 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-05 10:03:47 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 15.60 seconds
[SUCCESS] Model loaded
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Server listening at: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [58452]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
Some kwargs in processor config are unused and will not have any effect: image_token, sft_format, pad_token, image_std, add_special_token, ignore_id, patch_size, downsample_ratio, normalize, candidate_resolutions, image_mean, mask_prompt.
[1/3] Tokenizing 19 pages...
[1/3] Tokenization complete
[2/3] Running GPU batch inference on 19 pages...
Processed prompts: 100%|██████████| 19/19 [00:21<00:00,  1.16s/it, est. speed input: 977.88 toks/s, output: 193.90 toks/s]
[2/3] GPU inference complete
OCR time: 26.94s
[3/3] Post-processing...
[3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
Total time: 29.37s
Average: 1.55s/page
============================================================
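The per-page average in the summary follows directly from the totals it reports, and the gap between total time and OCR time is the tokenize/post-process/transfer overhead:

```python
total_seconds = 29.37   # total wall time from the summary
pages = 19              # pages in this request
ocr_seconds = 26.94     # GPU inference portion

print(f"average: {total_seconds / pages:.2f}s/page")        # 1.55, as reported
print(f"non-OCR overhead: {total_seconds - ocr_seconds:.2f}s")  # ~2.43s
```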
INFO: 127.0.0.1:36956 - "POST /ocr HTTP/1.1" 200 OK
INFO 02-27 11:29:07 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py:476: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-27 11:29:12 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-27 11:29:12 [config.py:721] This model supports multiple tasks: {'generate', 'score', 'embed', 'reward', 'classify'}. Defaulting to 'generate'.
INFO 02-27 11:29:12 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-27 11:29:13 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-27 11:29:13 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-27 11:29:13 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-27 11:29:13 [worker_base.py:653] ########## 2019 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63}
INFO 02-27 11:29:13 [worker_base.py:654] ########## 2019 process(rank0) is running on memnode(s): {0, 1, 2, 3, 4, 5, 6, 7}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0227 11:29:14.044931 2019 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0227 11:29:14.044994 2019 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:29:14.045455 2019 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5639914424b0, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0227 11:29:14.045471 2019 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:29:14.065977 2019 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5639914424b0, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0227 11:29:14.066015 2019 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:29:14.067288 2019 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5639914424b0, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0227 11:29:14.067306 2019 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:29:14.068274 2019 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5639914424b0, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0227 11:29:14.068290 2019 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-27 11:29:14 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-27 11:29:14 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-27 11:29:15 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 513, in <module>
[rank0]: main()
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 504, in main
[rank0]: initialize_model(args.model_path, args.gpu_id)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 272, in initialize_model
[rank0]: llm = LLM(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 1182, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 255, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 520, in from_engine_args
[rank0]: return engine_cls.from_vllm_config(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 496, in from_vllm_config
[rank0]: return cls(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 283, in __init__
[rank0]: self.model_executor = executor_class(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
[rank0]: self._init_executor()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
[rank0]: self.collective_rpc("load_model")
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]: answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2624, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 203, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1136, in load_model
[rank0]: self.model = get_model(vllm_config=self.vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
[rank0]: return loader.load_model(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 454, in load_model
[rank0]: model = _initialize_model(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 133, in _initialize_model
[rank0]: return model_class(vllm_config=vllm_config, prefix=prefix)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2.py", line 325, in __init__
[rank0]: self.language_model = init_vllm_registered_model(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 286, in init_vllm_registered_model
[rank0]: return _initialize_model(vllm_config=vllm_config, prefix=prefix)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 133, in _initialize_model
[rank0]: return model_class(vllm_config=vllm_config, prefix=prefix)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 457, in __init__
[rank0]: self.model = DeepseekModel(vllm_config=vllm_config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 358, in __init__
[rank0]: self.start_layer, self.end_layer, self.layers = make_layers(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 609, in make_layers
[rank0]: [PPMissingLayer() for _ in range(start_layer)] + [
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 610, in <listcomp>
[rank0]: maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 360, in <lambda>
[rank0]: lambda prefix: DeepseekDecoderLayer(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 300, in __init__
[rank0]: self.mlp = DeepseekMoE(config=config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 120, in __init__
[rank0]: self.pack_params()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 162, in pack_params
[rank0]: self.w2 = self.w2.permute(0, 2, 1).contiguous()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 79, in __torch_function__
[rank0]: return func(*args, **kwargs)
[rank0]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 140.00 MiB. GPU 0 has a total capacity of 63.98 GiB of which 0 bytes is free. Of the allocated memory 3.24 GiB is allocated by PyTorch, and 235.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
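Note the numbers in this OOM: the GPU has 63.98 GiB total, PyTorch itself had allocated only ~3.24 GiB, yet 0 bytes were free, which points to another process (plausibly a previous server instance that was not fully torn down) still holding GPU 0 rather than fragmentation. The error message also suggests an allocator hint; a minimal sketch of applying it before torch initializes HIP, assuming you launch via a Python entry point:

```python
import os

# Sketch only: apply the allocator hint suggested by the error message.
# This must be set before torch initializes HIP to take effect. It helps
# with fragmentation, but cannot recover memory held by another process,
# which is what the "0 bytes is free" figure above indicates.
os.environ.setdefault("PYTORCH_HIP_ALLOC_CONF", "expandable_segments:True")
```

Checking `rocm-smi` for stale processes (and killing them) before restarting the server is the more likely fix for this particular failure.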
INFO 02-27 11:31:39 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py:476: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-27 11:31:45 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-27 11:31:45 [config.py:721] This model supports multiple tasks: {'embed', 'score', 'classify', 'generate', 'reward'}. Defaulting to 'generate'.
INFO 02-27 11:31:45 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-27 11:31:46 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-27 11:31:46 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-27 11:31:46 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-27 11:31:46 [worker_base.py:653] ########## 2386 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63}
INFO 02-27 11:31:46 [worker_base.py:654] ########## 2386 process(rank0) is running on memnode(s): {0, 1, 2, 3, 4, 5, 6, 7}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0227 11:31:46.413403 2386 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0227 11:31:46.413467 2386 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:31:46.413935 2386 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5609f68a77b0, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0227 11:31:46.413949 2386 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:31:46.434001 2386 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5609f68a77b0, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0227 11:31:46.434037 2386 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:31:46.435284 2386 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5609f68a77b0, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0227 11:31:46.435304 2386 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:31:46.436266 2386 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5609f68a77b0, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0227 11:31:46.436283 2386 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-27 11:31:46 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-27 11:31:46 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-27 11:31:47 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 513, in <module>
[rank0]: main()
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 504, in main
[rank0]: initialize_model(args.model_path, args.gpu_id)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 272, in initialize_model
[rank0]: llm = LLM(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 1182, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 255, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 520, in from_engine_args
[rank0]: return engine_cls.from_vllm_config(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 496, in from_vllm_config
[rank0]: return cls(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 283, in __init__
[rank0]: self.model_executor = executor_class(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
[rank0]: self._init_executor()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
[rank0]: self.collective_rpc("load_model")
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]: answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2624, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 203, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1136, in load_model
[rank0]: self.model = get_model(vllm_config=self.vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
[rank0]: return loader.load_model(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 454, in load_model
[rank0]: model = _initialize_model(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 133, in _initialize_model
[rank0]: return model_class(vllm_config=vllm_config, prefix=prefix)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2.py", line 325, in __init__
[rank0]: self.language_model = init_vllm_registered_model(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 286, in init_vllm_registered_model
[rank0]: return _initialize_model(vllm_config=vllm_config, prefix=prefix)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 133, in _initialize_model
[rank0]: return model_class(vllm_config=vllm_config, prefix=prefix)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 457, in __init__
[rank0]: self.model = DeepseekModel(vllm_config=vllm_config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 358, in __init__
[rank0]: self.start_layer, self.end_layer, self.layers = make_layers(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 609, in make_layers
[rank0]: [PPMissingLayer() for _ in range(start_layer)] + [
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 610, in <listcomp>
[rank0]: maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 360, in <lambda>
[rank0]: lambda prefix: DeepseekDecoderLayer(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 300, in __init__
[rank0]: self.mlp = DeepseekMoE(config=config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 120, in __init__
[rank0]: self.pack_params()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 162, in pack_params
[rank0]: self.w2 = self.w2.permute(0, 2, 1).contiguous()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 79, in __torch_function__
[rank0]: return func(*args, **kwargs)
[rank0]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 140.00 MiB. GPU 0 has a total capacity of 63.98 GiB of which 0 bytes is free. Of the allocated memory 3.24 GiB is allocated by PyTorch, and 235.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
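The allocator's own hint from the error above can be applied before retrying. A minimal sketch, assuming the server is launched as a fresh Python process: the variable must be set before the first `import torch`, because the HIP caching allocator reads it once at initialization.

```python
# Sketch of the mitigation suggested by the OOM message above: enable
# expandable segments in PyTorch's HIP caching allocator to reduce
# fragmentation. Must be set before torch initializes, i.e. before the
# first `import torch` in the process (or exported in the launching shell).
import os

os.environ.setdefault("PYTORCH_HIP_ALLOC_CONF", "expandable_segments:True")

# import torch  # torch picks up the setting on first import
```

Equivalently, `export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True` in the shell before starting `deepseek_ocr2_server.py` has the same effect without touching the code.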
INFO 02-27 11:35:25 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py:476: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
Traceback (most recent call last):
File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 513, in <module>
main()
File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 503, in main
os.environ["CUDA_VISIBLE_DEVICES"] = gpu_id
NameError: name 'gpu_id' is not defined
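This NameError comes from referencing a bare `gpu_id` inside `main()` instead of the parsed argument (the later runs call `initialize_model(args.model_path, args.gpu_id)`, so the attribute name is known from the traceback). A minimal sketch of the likely fix; the argument names and defaults are assumptions, not the actual server code:

```python
import argparse
import os

def main(argv=None):
    parser = argparse.ArgumentParser()
    # Hypothetical flags mirroring the attributes seen in the traceback
    # (args.model_path, args.gpu_id); actual names in the server may differ.
    parser.add_argument("--model-path", dest="model_path",
                        default="/home/lst/deepseek_ocr2")
    parser.add_argument("--gpu-id", dest="gpu_id", default="0")
    args = parser.parse_args(argv)
    # Fix: use the parsed attribute, not the undefined bare name `gpu_id`.
    os.environ["CUDA_VISIBLE_DEVICES"] = args.gpu_id
    return args

args = main(["--gpu-id", "0"])
```

Consistent with this, the run at 11:36 below gets past this line and fails later during model loading instead.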
INFO 02-27 11:36:20 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py:476: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-27 11:36:25 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-27 11:36:25 [config.py:721] This model supports multiple tasks: {'generate', 'reward', 'score', 'classify', 'embed'}. Defaulting to 'generate'.
INFO 02-27 11:36:25 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-27 11:36:26 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-27 11:36:26 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-27 11:36:26 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-27 11:36:26 [worker_base.py:653] ########## 2840 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63}
INFO 02-27 11:36:26 [worker_base.py:654] ########## 2840 process(rank0) is running on memnode(s): {0, 1, 2, 3, 4, 5, 6, 7}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0227 11:36:26.828375 2840 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0227 11:36:26.828431 2840 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:36:26.828945 2840 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55d962bc1680, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0227 11:36:26.828958 2840 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:36:26.848886 2840 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55d962bc1680, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0227 11:36:26.848920 2840 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:36:26.850196 2840 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55d962bc1680, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0227 11:36:26.850215 2840 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:36:26.851189 2840 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55d962bc1680, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0227 11:36:26.851207 2840 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-27 11:36:26 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-27 11:36:26 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-27 11:36:28 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 513, in <module>
[rank0]: main()
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 504, in main
[rank0]: initialize_model(args.model_path, args.gpu_id)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 272, in initialize_model
[rank0]: llm = LLM(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 1182, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 255, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 520, in from_engine_args
[rank0]: return engine_cls.from_vllm_config(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 496, in from_vllm_config
[rank0]: return cls(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 283, in __init__
[rank0]: self.model_executor = executor_class(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
[rank0]: self._init_executor()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
[rank0]: self.collective_rpc("load_model")
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]: answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2624, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 203, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1136, in load_model
[rank0]: self.model = get_model(vllm_config=self.vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
[rank0]: return loader.load_model(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 454, in load_model
[rank0]: model = _initialize_model(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 133, in _initialize_model
[rank0]: return model_class(vllm_config=vllm_config, prefix=prefix)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2.py", line 325, in __init__
[rank0]: self.language_model = init_vllm_registered_model(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 286, in init_vllm_registered_model
[rank0]: return _initialize_model(vllm_config=vllm_config, prefix=prefix)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 133, in _initialize_model
[rank0]: return model_class(vllm_config=vllm_config, prefix=prefix)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 457, in __init__
[rank0]: self.model = DeepseekModel(vllm_config=vllm_config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 358, in __init__
[rank0]: self.start_layer, self.end_layer, self.layers = make_layers(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 609, in make_layers
[rank0]: [PPMissingLayer() for _ in range(start_layer)] + [
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 610, in <listcomp>
[rank0]: maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 360, in <lambda>
[rank0]: lambda prefix: DeepseekDecoderLayer(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 300, in __init__
[rank0]: self.mlp = DeepseekMoE(config=config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 120, in __init__
[rank0]: self.pack_params()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 162, in pack_params
[rank0]: self.w2 = self.w2.permute(0, 2, 1).contiguous()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 79, in __torch_function__
[rank0]: return func(*args, **kwargs)
[rank0]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 140.00 MiB. GPU 0 has a total capacity of 63.98 GiB of which 0 bytes is free. Of the allocated memory 3.24 GiB is allocated by PyTorch, and 235.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
INFO 02-27 11:41:38 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py:476: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-27 11:41:43 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-27 11:41:43 [config.py:721] This model supports multiple tasks: {'score', 'classify', 'generate', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 02-27 11:41:43 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-27 11:41:44 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-27 11:41:44 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-27 11:41:44 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-27 11:41:44 [worker_base.py:653] ########## 3122 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63}
INFO 02-27 11:41:44 [worker_base.py:654] ########## 3122 process(rank0) is running on memnode(s): {0, 1, 2, 3, 4, 5, 6, 7}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0227 11:41:44.968997 3122 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0227 11:41:44.969059 3122 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:41:44.969534 3122 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55cf07c030c0, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0227 11:41:44.969554 3122 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:41:44.989876 3122 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55cf07c030c0, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0227 11:41:44.989917 3122 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:41:44.991149 3122 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55cf07c030c0, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0227 11:41:44.991173 3122 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:41:44.992142 3122 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55cf07c030c0, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0227 11:41:44.992164 3122 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-27 11:41:44 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-27 11:41:44 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-27 11:41:46 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 513, in <module>
[rank0]: main()
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 504, in main
[rank0]: initialize_model(args.model_path, args.gpu_id)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 272, in initialize_model
[rank0]: llm = LLM(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 1182, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 255, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 520, in from_engine_args
[rank0]: return engine_cls.from_vllm_config(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 496, in from_vllm_config
[rank0]: return cls(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 283, in __init__
[rank0]: self.model_executor = executor_class(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
[rank0]: self._init_executor()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
[rank0]: self.collective_rpc("load_model")
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]: answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2624, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 203, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1136, in load_model
[rank0]: self.model = get_model(vllm_config=self.vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
[rank0]: return loader.load_model(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 454, in load_model
[rank0]: model = _initialize_model(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 133, in _initialize_model
[rank0]: return model_class(vllm_config=vllm_config, prefix=prefix)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2.py", line 325, in __init__
[rank0]: self.language_model = init_vllm_registered_model(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 286, in init_vllm_registered_model
[rank0]: return _initialize_model(vllm_config=vllm_config, prefix=prefix)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 133, in _initialize_model
[rank0]: return model_class(vllm_config=vllm_config, prefix=prefix)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 457, in __init__
[rank0]: self.model = DeepseekModel(vllm_config=vllm_config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 358, in __init__
[rank0]: self.start_layer, self.end_layer, self.layers = make_layers(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 609, in make_layers
[rank0]: [PPMissingLayer() for _ in range(start_layer)] + [
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 610, in <listcomp>
[rank0]: maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 360, in <lambda>
[rank0]: lambda prefix: DeepseekDecoderLayer(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 300, in __init__
[rank0]: self.mlp = DeepseekMoE(config=config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 120, in __init__
[rank0]: self.pack_params()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 162, in pack_params
[rank0]: self.w2 = self.w2.permute(0, 2, 1).contiguous()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 79, in __torch_function__
[rank0]: return func(*args, **kwargs)
[rank0]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 140.00 MiB. GPU 0 has a total capacity of 63.98 GiB of which 0 bytes is free. Of the allocated memory 3.24 GiB is allocated by PyTorch, and 235.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
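The HIP OOM above fires during weight initialization, before KV-cache profiling even starts. A minimal sketch of the two remedies the error message itself suggests — the env var name comes from the message; the `gpu_memory_utilization` value and model path are assumptions, not taken from this log:

```python
import os

# Must be set before torch initializes the HIP allocator; per the OOM
# message, expandable segments can avoid fragmentation on ROCm.
os.environ["PYTORCH_HIP_ALLOC_CONF"] = "expandable_segments:True"

# Hypothetical follow-up: gpu_memory_utilization is a real vllm.LLM keyword,
# but 0.80 here is an illustrative value to leave init headroom.
# from vllm import LLM
# llm = LLM(model="/home/lst/deepseek_ocr2", gpu_memory_utilization=0.80)
```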
INFO 02-27 11:44:38 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py:476: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-27 11:44:43 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-27 11:44:43 [config.py:721] This model supports multiple tasks: {'reward', 'score', 'embed', 'generate', 'classify'}. Defaulting to 'generate'.
INFO 02-27 11:44:43 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-27 11:44:44 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-27 11:44:44 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-27 11:44:44 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-27 11:44:44 [worker_base.py:653] ########## 3409 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63}
INFO 02-27 11:44:44 [worker_base.py:654] ########## 3409 process(rank0) is running on memnode(s): {0, 1, 2, 3, 4, 5, 6, 7}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0227 11:44:44.820218 3409 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0227 11:44:44.820281 3409 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:44:44.820798 3409 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x557d5ac8e9c0, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0227 11:44:44.820814 3409 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:44:44.840890 3409 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x557d5ac8e9c0, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0227 11:44:44.840930 3409 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:44:44.842145 3409 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x557d5ac8e9c0, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0227 11:44:44.842170 3409 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:44:44.843230 3409 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x557d5ac8e9c0, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0227 11:44:44.843253 3409 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-27 11:44:44 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-27 11:44:44 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-27 11:44:46 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:07<00:00, 7.90s/it]
INFO 02-27 11:44:57 [loader.py:460] Loading weights took 10.70 seconds
INFO 02-27 11:44:57 [model_runner.py:1165] Model loading took 6.3336 GiB and 12.624936 seconds
Some kwargs in processor config are unused and will not have any effect: image_std, pad_token, image_token, patch_size, image_mean, ignore_id, add_special_token, downsample_ratio, mask_prompt, candidate_resolutions, normalize, sft_format.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-27 11:45:11 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-27 11:45:12 [worker.py:287] Memory profiling takes 14.13 seconds
INFO 02-27 11:45:12 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-27 11:45:12 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 2.00GiB; the rest of the memory reserved for KV Cache is 47.67GiB.
INFO 02-27 11:45:12 [executor_base.py:112] # rocm blocks: 13017, # CPU blocks: 0
INFO 02-27 11:45:12 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 101.70x
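The "Maximum concurrency ... 101.70x" line follows directly from the KV-cache block count reported just above it. Assumption: 64 tokens per block, inferred from the arithmetic rather than read from the log (vLLM's `block_size` is configurable and not printed here):

```python
# Reproduce vLLM's reported concurrency from the block count in this log.
gpu_blocks = 13017        # "# rocm blocks" above
tokens_per_block = 64     # assumed block size (inferred, not logged)
max_seq_len = 8192        # max tokens per request
concurrency = gpu_blocks * tokens_per_block / max_seq_len
print(f"{concurrency:.2f}x")  # matches the 101.70x vLLM reports
```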
INFO 02-27 11:45:12 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:03<00:00, 1.50it/s]
INFO 02-27 11:45:16 [model_runner.py:1752] Graph capturing finished in 4 secs, took 0.12 GiB
INFO 02-27 11:45:16 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 18.93 seconds
[SUCCESS] Model loaded
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [3409]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
Some kwargs in processor config are unused and will not have any effect: image_std, pad_token, image_token, patch_size, image_mean, ignore_id, add_special_token, downsample_ratio, mask_prompt, candidate_resolutions, normalize, sft_format.
[1/3] Tokenizing 19 pages...
[1/3] Tokenization complete
[2/3] GPU batch inference on 19 pages...
Processed prompts: 100%|██████████| 19/19 [00:22<00:00, 1.19s/it, est. speed input: 951.02 toks/s, output: 188.57 toks/s]
[2/3] GPU inference complete
OCR time: 28.28s
[3/3] Post-processing...
[3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
Total time: 30.77s
Average: 1.62s/page
============================================================
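A quick sanity check of the summary figures in the banner above: 19 pages in 30.77 s of wall time reproduces the reported per-page average.

```python
# Verify the averaged throughput printed in the run summary.
total_seconds = 30.77
pages = 19
per_page = total_seconds / pages
print(f"{per_page:.2f}s/page")  # 1.62s/page, as reported
```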
INFO: 127.0.0.1:56926 - "POST /ocr HTTP/1.1" 200 OK
INFO 02-04 14:25:53 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr_server.py:472: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr
INFO 02-04 14:25:58 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 14:25:58 [config.py:721] This model supports multiple tasks: {'score', 'generate', 'embed', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 02-04 14:25:58 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr', speculative_config=None, tokenizer='/home/lst/deepseek_ocr', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=3281, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-04 14:25:58 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-04 14:25:58 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-04 14:25:58 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-04 14:25:58 [worker_base.py:653] ########## 12366 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
INFO 02-04 14:25:58 [worker_base.py:654] ########## 12366 process(rank0) is running on memnode(s): {0, 1, 2, 3}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0204 14:25:59.137439 12366 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0204 14:25:59.137527 12366 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:25:59.138036 12366 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5636c9281ad0, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0204 14:25:59.138052 12366 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:25:59.157974 12366 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5636c9281ad0, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0204 14:25:59.158027 12366 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:25:59.159323 12366 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5636c9281ad0, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0204 14:25:59.159346 12366 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:25:59.160290 12366 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5636c9281ad0, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0204 14:25:59.160312 12366 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-04 14:25:59 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 14:25:59 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 14:26:00 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 6.84it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 6.81it/s]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr_server.py", line 509, in <module>
[rank0]: main()
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr_server.py", line 500, in main
[rank0]: initialize_model(args.model_path, args.gpu_id)
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr_server.py", line 270, in initialize_model
[rank0]: llm = LLM(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 1182, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 255, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 520, in from_engine_args
[rank0]: return engine_cls.from_vllm_config(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 496, in from_vllm_config
[rank0]: return cls(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 283, in __init__
[rank0]: self.model_executor = executor_class(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
[rank0]: self._init_executor()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
[rank0]: self.collective_rpc("load_model")
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]: answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2624, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 203, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1136, in load_model
[rank0]: self.model = get_model(vllm_config=self.vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
[rank0]: return loader.load_model(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 457, in load_model
[rank0]: loaded_weights = model.load_weights(
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2.py", line 576, in load_weights
[rank0]: autoloaded_weights = loader.load_weights(processed_weights, mapper=self.hf_to_vllm_mapper)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 261, in load_weights
[rank0]: autoloaded_weights = set(self._load_module("", self.module, weights))
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 222, in _load_module
[rank0]: yield from self._load_module(prefix,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 195, in _load_module
[rank0]: loaded_params = module_load_weights(weights)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 494, in load_weights
[rank0]: return loader.load_weights(weights)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 261, in load_weights
[rank0]: autoloaded_weights = set(self._load_module("", self.module, weights))
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 222, in _load_module
[rank0]: yield from self._load_module(prefix,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 195, in _load_module
[rank0]: loaded_params = module_load_weights(weights)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 441, in load_weights
[rank0]: param = params_dict[name]
[rank0]: KeyError: 'image_newline'
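The `KeyError: 'image_newline'` above means the checkpoint contains a tensor the inner Deepseek language model never declares (it belongs to the vision wrapper). A minimal sketch of tolerating such extras instead of crashing — the weight names and dict contents here are illustrative, not the actual vLLM loader API:

```python
# Minimal sketch: skip checkpoint tensors absent from the module's
# parameter dict rather than raising KeyError during load_weights.
def filter_known_weights(weights, params_dict):
    """Yield only (name, tensor) pairs whose name exists in params_dict."""
    for name, tensor in weights:
        if name not in params_dict:
            # Vision-side extras like 'image_newline' belong to the wrapper
            # model, not the inner LM; drop them here.
            continue
        yield name, tensor

# Illustrative data standing in for real parameters and checkpoint tensors.
params = {"embed_tokens.weight": None, "lm_head.weight": None}
ckpt = [("embed_tokens.weight", 1), ("image_newline", 2), ("lm_head.weight", 3)]
kept = dict(filter_known_weights(ckpt, params))
```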
INFO 02-04 14:28:00 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr_server.py:472: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 14:28:05 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 14:28:05 [config.py:721] This model supports multiple tasks: {'embed', 'generate', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 02-04 14:28:05 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=3281, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-04 14:28:06 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-04 14:28:06 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-04 14:28:06 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-04 14:28:06 [worker_base.py:653] ########## 12544 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
INFO 02-04 14:28:06 [worker_base.py:654] ########## 12544 process(rank0) is running on memnode(s): {0, 1, 2, 3}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0204 14:28:06.276472 12544 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0204 14:28:06.276561 12544 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:28:06.277021 12544 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x56262096b7b0, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0204 14:28:06.277037 12544 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:28:06.296979 12544 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x56262096b7b0, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0204 14:28:06.297024 12544 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:28:06.298228 12544 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x56262096b7b0, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0204 14:28:06.298249 12544 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:28:06.299176 12544 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x56262096b7b0, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0204 14:28:06.299197 12544 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-04 14:28:06 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 14:28:06 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 14:28:07 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 7.11it/s]
INFO 02-04 14:28:09 [loader.py:460] Loading weights took 1.92 seconds
INFO 02-04 14:28:10 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.579491 seconds
Some kwargs in processor config are unused and will not have any effect: ignore_id, patch_size, image_std, sft_format, normalize, image_token, add_special_token, mask_prompt, image_mean, candidate_resolutions, pad_token, downsample_ratio.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 14:28:19 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 14:28:20 [worker.py:287] Memory profiling takes 10.29 seconds
INFO 02-04 14:28:20 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-04 14:28:20 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 1.86GiB; the rest of the memory reserved for KV Cache is 47.81GiB.
INFO 02-04 14:28:20 [executor_base.py:112] # rocm blocks: 13055, # CPU blocks: 1092
INFO 02-04 14:28:20 [executor_base.py:117] Maximum concurrency for 3281 tokens per request: 254.65x
INFO 02-04 14:28:22 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:02<00:00, 2.00it/s]
INFO 02-04 14:28:25 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-04 14:28:25 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 15.87 seconds
[SUCCESS] Model loaded
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [12544]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
Some kwargs in processor config are unused and will not have any effect: ignore_id, patch_size, image_std, sft_format, normalize, image_token, add_special_token, mask_prompt, image_mean, candidate_resolutions, pad_token, downsample_ratio.
[1/3] Tokenizing 19 pages...
[1/3] Tokenization complete
[2/3] GPU batch inference on 19 pages...
Processed prompts: 100%|██████████| 19/19 [00:21<00:00, 1.12s/it, est. speed input: 1006.67 toks/s, output: 193.52 toks/s]
[2/3] GPU inference complete
OCR time: 26.37s
[3/3] Post-processing...
[3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
Total time: 28.83s
Average: 1.52s/page
============================================================
INFO: 127.0.0.1:51734 - "POST /ocr HTTP/1.1" 200 OK
[1/3] Tokenizing 19 pages...
[1/3] Tokenization complete
[2/3] GPU batch inference on 19 pages...
Processed prompts: 100%|██████████| 19/19 [00:21<00:00, 1.13s/it, est. speed input: 1003.54 toks/s, output: 192.92 toks/s]
[2/3] GPU inference complete
OCR time: 24.64s
[3/3] Post-processing...
[3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
Total time: 27.09s
Average: 1.43s/page
============================================================
INFO: 127.0.0.1:34080 - "POST /ocr HTTP/1.1" 200 OK
[1/3] Tokenizing 19 pages...
[1/3] Tokenization complete
[2/3] GPU batch inference on 19 pages...
Processed prompts: 100%|██████████| 19/19 [00:21<00:00, 1.11s/it, est. speed input: 1016.79 toks/s, output: 195.47 toks/s]
[2/3] GPU inference complete
OCR time: 24.22s
[3/3] Post-processing...
[3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
Total time: 26.65s
Average: 1.40s/page
============================================================
INFO: 127.0.0.1:47994 - "POST /ocr HTTP/1.1" 200 OK
[1/3] Tokenizing 19 pages...
[1/3] Tokenization complete
[2/3] GPU batch inference on 19 pages...
Processed prompts: 100%|██████████| 19/19 [00:21<00:00, 1.11s/it, est. speed input: 1014.71 toks/s, output: 195.06 toks/s]
[2/3] GPU inference complete
OCR time: 24.31s
[3/3] Post-processing...
[3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
Total time: 26.75s
Average: 1.41s/page
============================================================
INFO: 127.0.0.1:49594 - "POST /ocr HTTP/1.1" 200 OK
INFO 02-04 14:39:01 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr_server.py:472: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 14:39:06 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 14:39:06 [config.py:721] This model supports multiple tasks: {'score', 'reward', 'embed', 'classify', 'generate'}. Defaulting to 'generate'.
INFO 02-04 14:39:06 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-04 14:39:06 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-04 14:39:06 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-04 14:39:06 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-04 14:39:06 [worker_base.py:653] ########## 13561 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
INFO 02-04 14:39:06 [worker_base.py:654] ########## 13561 process(rank0) is running on memnode(s): {0, 1, 2, 3}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0204 14:39:07.188728 13561 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0204 14:39:07.188807 13561 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:39:07.189251 13561 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5640a3302a40, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0204 14:39:07.189267 13561 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:39:07.208966 13561 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5640a3302a40, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0204 14:39:07.209004 13561 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:39:07.210227 13561 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5640a3302a40, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0204 14:39:07.210245 13561 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:39:07.211236 13561 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5640a3302a40, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0204 14:39:07.211254 13561 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-04 14:39:07 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 14:39:07 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 14:39:08 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 7.16it/s]
INFO 02-04 14:39:10 [loader.py:460] Loading weights took 1.84 seconds
INFO 02-04 14:39:10 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.529130 seconds
Some kwargs in processor config are unused and will not have any effect: normalize, image_std, candidate_resolutions, sft_format, image_mean, add_special_token, patch_size, pad_token, mask_prompt, downsample_ratio, image_token, ignore_id.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 14:39:22 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 14:39:23 [worker.py:287] Memory profiling takes 11.97 seconds
INFO 02-04 14:39:23 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-04 14:39:23 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 2.00GiB; the rest of the memory reserved for KV Cache is 47.67GiB.
INFO 02-04 14:39:23 [executor_base.py:112] # rocm blocks: 13017, # CPU blocks: 1092
INFO 02-04 14:39:23 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 101.70x
INFO 02-04 14:39:25 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:03<00:00, 1.99it/s]
INFO 02-04 14:39:28 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-04 14:39:28 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 17.42 seconds
[SUCCESS] Model loaded
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [13561]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
Some kwargs in processor config are unused and will not have any effect: normalize, image_std, candidate_resolutions, sft_format, image_mean, add_special_token, patch_size, pad_token, mask_prompt, downsample_ratio, image_token, ignore_id.
[1/3] Tokenizing 19 pages...
[1/3] Tokenization complete
[2/3] GPU batch inference on 19 pages...
Processed prompts: 100%|██████████| 19/19 [00:21<00:00, 1.13s/it, est. speed input: 1002.04 toks/s, output: 194.59 toks/s]
[2/3] GPU inference complete
OCR time: 26.52s
[3/3] Post-processing...
[3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
Total time: 28.98s
Average: 1.53s/page
============================================================
INFO: 127.0.0.1:53340 - "POST /ocr HTTP/1.1" 200 OK
INFO 02-04 14:50:16 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr_server.py:472: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 14:50:21 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 14:50:21 [config.py:721] This model supports multiple tasks: {'classify', 'embed', 'score', 'reward', 'generate'}. Defaulting to 'generate'.
INFO 02-04 14:50:21 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-04 14:50:22 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-04 14:50:22 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-04 14:50:22 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-04 14:50:22 [worker_base.py:653] ########## 14555 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
INFO 02-04 14:50:22 [worker_base.py:654] ########## 14555 process(rank0) is running on memnode(s): {0, 1, 2, 3}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0204 14:50:22.719424 14555 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0204 14:50:22.719481 14555 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:50:22.719913 14555 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55e48a312250, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0204 14:50:22.719926 14555 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:50:22.738953 14555 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55e48a312250, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0204 14:50:22.738993 14555 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:50:22.740214 14555 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55e48a312250, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0204 14:50:22.740231 14555 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 14:50:22.741233 14555 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55e48a312250, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0204 14:50:22.741250 14555 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-04 14:50:22 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 14:50:22 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 14:50:23 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 7.12it/s]
INFO 02-04 14:50:26 [loader.py:460] Loading weights took 1.81 seconds
INFO 02-04 14:50:26 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.499614 seconds
Some kwargs in processor config are unused and will not have any effect: ignore_id, image_token, add_special_token, sft_format, image_mean, image_std, mask_prompt, downsample_ratio, candidate_resolutions, patch_size, pad_token, normalize.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 14:50:38 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 14:50:39 [worker.py:287] Memory profiling takes 12.33 seconds
INFO 02-04 14:50:39 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-04 14:50:39 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 2.00GiB; the rest of the memory reserved for KV Cache is 47.67GiB.
INFO 02-04 14:50:39 [executor_base.py:112] # rocm blocks: 13017, # CPU blocks: 1092
INFO 02-04 14:50:39 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 101.70x
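The 101.70x concurrency figure can be reproduced from the KV-cache block count two lines above. A minimal sketch, assuming a 64-token KV-cache block size on this ROCm build (an assumption inferred from the numbers; vLLM's usual default is 16, but only 64 matches this log):

```python
# Reproducing "Maximum concurrency for 8192 tokens per request: 101.70x".
# BLOCK_SIZE = 64 is an assumption inferred from the log output.
NUM_GPU_BLOCKS = 13017   # "# rocm blocks: 13017" from the log
BLOCK_SIZE = 64          # tokens per KV-cache block (assumed)
MAX_SEQ_LEN = 8192       # max_seq_len from the engine config

total_kv_tokens = NUM_GPU_BLOCKS * BLOCK_SIZE        # 833,088 cacheable tokens
max_concurrency = total_kv_tokens / MAX_SEQ_LEN      # ≈ 101.70
print(f"Maximum concurrency for {MAX_SEQ_LEN} tokens per request: "
      f"{max_concurrency:.2f}x")
```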
INFO 02-04 14:50:41 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:02<00:00, 2.01it/s]
INFO 02-04 14:50:44 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-04 14:50:44 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 17.52 seconds
[SUCCESS] Model loaded
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [14555]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
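Once Uvicorn is up, the `/ocr` endpoint seen in the access log below (`"POST /ocr HTTP/1.1" 200 OK`) can be called. A hedged client sketch using only the standard library; the request body and field name are assumptions, since the server's actual schema is not shown in this log:

```python
import json
import urllib.request

OCR_URL = "http://0.0.0.0:8707/ocr"  # endpoint seen in the access log

def build_ocr_request(pdf_path: str) -> urllib.request.Request:
    # Assumed JSON body with a file path; the real server may instead
    # expect a multipart upload or base64-encoded pages.
    payload = json.dumps({"file": pdf_path}).encode("utf-8")
    return urllib.request.Request(
        OCR_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_ocr_request("/tmp/sample.pdf")
# urllib.request.urlopen(req) would send it once the server is running.
```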
Some kwargs in processor config are unused and will not have any effect: ignore_id, image_token, add_special_token, sft_format, image_mean, image_std, mask_prompt, downsample_ratio, candidate_resolutions, patch_size, pad_token, normalize.
[1/3] Tokenizing 19 pages...
[1/3] Tokenization complete
[2/3] GPU batch inference on 19 pages...
Processed prompts: 100%|██████████| 19/19 [00:21<00:00,  1.14s/it, est. speed input: 993.47 toks/s, output: 192.92 toks/s]
[2/3] GPU inference complete
OCR time: 26.65s
[3/3] Post-processing...
[3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
Total time: 29.10s
Average: 1.53s/page
============================================================
INFO: 127.0.0.1:55910 - "POST /ocr HTTP/1.1" 200 OK
[1/3] Tokenizing 19 pages...
[1/3] Tokenization complete
[2/3] GPU batch inference on 19 pages...
Processed prompts: 100%|██████████| 19/19 [00:21<00:00,  1.13s/it, est. speed input: 999.56 toks/s, output: 194.11 toks/s]
[2/3] GPU inference complete
OCR time: 24.63s
[3/3] Post-processing...
[3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
Total time: 27.07s
Average: 1.42s/page
============================================================
INFO: 127.0.0.1:43926 - "POST /ocr HTTP/1.1" 200 OK
[1/3] Tokenizing 22 pages...
[1/3] Tokenization complete
[2/3] GPU batch inference on 22 pages...
Processed prompts: 100%|██████████| 22/22 [00:53<00:00,  2.43s/it, est. speed input: 466.14 toks/s, output: 316.51 toks/s]
[2/3] GPU inference complete
OCR time: 58.12s
[3/3] Post-processing...
[3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
Total time: 69.56s
Average: 3.16s/page
============================================================
INFO: 127.0.0.1:55008 - "POST /ocr HTTP/1.1" 200 OK
[1/3] Tokenizing 22 pages...
[1/3] Tokenization complete
[2/3] GPU batch inference on 22 pages...
Processed prompts: 100%|██████████| 22/22 [00:53<00:00,  2.45s/it, est. speed input: 461.15 toks/s, output: 313.13 toks/s]
[2/3] GPU inference complete
OCR time: 58.66s
[3/3] Post-processing...
[3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
Total time: 70.07s
Average: 3.19s/page
============================================================
INFO: 127.0.0.1:46898 - "POST /ocr HTTP/1.1" 200 OK
[1/3] Tokenizing 22 pages...
[1/3] Tokenization complete
[2/3] GPU batch inference on 22 pages...
Processed prompts: 100%|██████████| 22/22 [00:54<00:00,  2.46s/it, est. speed input: 460.32 toks/s, output: 312.56 toks/s]
[2/3] GPU inference complete
OCR time: 58.74s
[3/3] Post-processing...
[3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
Total time: 70.17s
Average: 3.19s/page
============================================================
INFO: 127.0.0.1:45882 - "POST /ocr HTTP/1.1" 200 OK
[1/3] Tokenizing 22 pages...
[1/3] Tokenization complete
[2/3] GPU batch inference on 22 pages...
Processed prompts: 100%|██████████| 22/22 [00:53<00:00,  2.44s/it, est. speed input: 463.82 toks/s, output: 314.93 toks/s]
[2/3] GPU inference complete
OCR time: 58.24s
[3/3] Post-processing...
[3/3] Post-processing complete (0.00s)
============================================================
[SUCCESS] All done
Total time: 69.68s
Average: 3.17s/page
============================================================
INFO: 127.0.0.1:44226 - "POST /ocr HTTP/1.1" 200 OK
INFO 02-04 16:50:28 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr_server.py:475: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
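The DeprecationWarning above refers to FastAPI's move from `@app.on_event(...)` hooks to the lifespan pattern. A minimal sketch of that pattern, with hypothetical `load_model`/`release_model` placeholders standing in for whatever the server does at startup/shutdown (FastAPI itself is not required to demonstrate the ordering):

```python
# Sketch of FastAPI's lifespan pattern, replacing @app.on_event("shutdown").
# load_model/release_model are hypothetical placeholders, not the server's code.
import asyncio
from contextlib import asynccontextmanager

events = []

def load_model():
    """Placeholder for model loading done at startup."""
    events.append("startup")

def release_model():
    """Placeholder for the cleanup previously in @app.on_event("shutdown")."""
    events.append("shutdown")

@asynccontextmanager
async def lifespan(app):
    load_model()      # runs before the server starts accepting requests
    yield
    release_model()   # runs on shutdown

# With FastAPI installed this would be wired up as:
#   app = FastAPI(lifespan=lifespan)
# Here we drive the context manager directly to show the ordering:
async def _demo():
    async with lifespan(None):
        events.append("serving")

asyncio.run(_demo())
```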
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 16:50:33 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 16:50:33 [config.py:721] This model supports multiple tasks: {'embed', 'reward', 'score', 'generate', 'classify'}. Defaulting to 'generate'.
INFO 02-04 16:50:33 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-04 16:50:34 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-04 16:50:34 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-04 16:50:34 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-04 16:50:34 [worker_base.py:653] ########## 42007 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
INFO 02-04 16:50:34 [worker_base.py:654] ########## 42007 process(rank0) is running on memnode(s): {0, 1, 2, 3}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0204 16:50:34.705025 42007 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0204 16:50:34.705101 42007 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 16:50:34.705564 42007 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x564c40e84ce0, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0204 16:50:34.705579 42007 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 16:50:34.724964 42007 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x564c40e84ce0, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0204 16:50:34.725004 42007 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 16:50:34.726367 42007 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x564c40e84ce0, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0204 16:50:34.726390 42007 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 16:50:34.727412 42007 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x564c40e84ce0, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0204 16:50:34.727430 42007 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-04 16:50:34 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 16:50:34 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 16:50:35 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 6.91it/s]
INFO 02-04 16:50:38 [loader.py:460] Loading weights took 1.73 seconds
INFO 02-04 16:50:38 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.394852 seconds
Some kwargs in processor config are unused and will not have any effect: normalize, pad_token, downsample_ratio, mask_prompt, patch_size, sft_format, ignore_id, image_token, image_mean, image_std, add_special_token, candidate_resolutions.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 16:50:49 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 16:50:50 [worker.py:287] Memory profiling takes 12.06 seconds
INFO 02-04 16:50:50 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-04 16:50:50 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 2.00GiB; the rest of the memory reserved for KV Cache is 47.67GiB.
INFO 02-04 16:50:50 [executor_base.py:112] # rocm blocks: 13017, # CPU blocks: 0
INFO 02-04 16:50:50 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 101.70x
INFO 02-04 16:50:50 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:02<00:00, 2.01it/s]
INFO 02-04 16:50:53 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-04 16:50:53 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 15.65 seconds
[SUCCESS] Model loaded
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [42007]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
INFO 02-04 16:59:36 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr_server.py:478: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 16:59:41 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 16:59:41 [config.py:721] This model supports multiple tasks: {'embed', 'classify', 'generate', 'score', 'reward'}. Defaulting to 'generate'.
INFO 02-04 16:59:41 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-04 16:59:41 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-04 16:59:41 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-04 16:59:41 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-04 16:59:41 [worker_base.py:653] ########## 42893 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
INFO 02-04 16:59:41 [worker_base.py:654] ########## 42893 process(rank0) is running on memnode(s): {0, 1, 2, 3}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0204 16:59:42.128494 42893 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0204 16:59:42.128578 42893 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 16:59:42.129037 42893 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5571e7a2cd20, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0204 16:59:42.129053 42893 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 16:59:42.149024 42893 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5571e7a2cd20, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0204 16:59:42.149070 42893 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 16:59:42.150310 42893 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5571e7a2cd20, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0204 16:59:42.150334 42893 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 16:59:42.151289 42893 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5571e7a2cd20, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0204 16:59:42.151312 42893 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-04 16:59:42 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 16:59:42 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 16:59:43 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 7.14it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 7.12it/s]
INFO 02-04 16:59:45 [loader.py:460] Loading weights took 1.71 seconds
INFO 02-04 16:59:45 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.386030 seconds
Some kwargs in processor config are unused and will not have any effect: normalize, image_std, mask_prompt, image_token, candidate_resolutions, add_special_token, image_mean, pad_token, downsample_ratio, patch_size, ignore_id, sft_format.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 16:59:57 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 16:59:57 [worker.py:287] Memory profiling takes 12.02 seconds
INFO 02-04 16:59:57 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-04 16:59:57 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 2.00GiB; the rest of the memory reserved for KV Cache is 47.67GiB.
INFO 02-04 16:59:58 [executor_base.py:112] # rocm blocks: 13017, # CPU blocks: 0
INFO 02-04 16:59:58 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 101.70x
INFO 02-04 16:59:58 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:02<00:00, 2.05it/s]
INFO 02-04 17:00:01 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-04 17:00:01 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 15.56 seconds
[SUCCESS] Model loading complete
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Server started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [42893]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
Some kwargs in processor config are unused and will not have any effect: normalize, image_std, mask_prompt, image_token, candidate_resolutions, add_special_token, image_mean, pad_token, downsample_ratio, patch_size, ignore_id, sft_format.
[1/3] Tokenizing 22 pages...
[1/3] Tokenization complete
[2/3] GPU batch inference on 22 pages...
Processed prompts: 100%|██████████| 22/22 [00:53<00:00, 2.43s/it, est. speed input: 465.01 toks/s, output: 315.75 toks/s]
[2/3] GPU inference complete
OCR time: 60.07s
[3/3] Post-processing...
[3/3] Post-processing complete (0.01s)
============================================================
[SUCCESS] All done
Total time: 71.52s
Average: 3.25s/page
============================================================
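The per-page average in the summary above is total wall time over the 22-page batch; the OCR stage alone is cheaper per page. A quick check of the arithmetic:

```python
# Verify the per-page averages from the timing summary in the log above.
total_s, ocr_s, pages = 71.52, 60.07, 22   # figures taken from the log

per_page_total = total_s / pages   # wall time incl. tokenize + post-process
per_page_ocr = ocr_s / pages       # GPU inference stage only

print(f"{per_page_total:.2f}s/page total, {per_page_ocr:.2f}s/page OCR")
# → 3.25s/page total, 2.73s/page OCR
```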
INFO: 127.0.0.1:52722 - "POST /ocr HTTP/1.1" 200 OK
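The log only shows that the service accepts `POST /ocr` on port 8707; the request schema is not visible. A minimal client sketch, assuming a multipart upload whose field is named `file` (both the field name and the JSON response shape are assumptions, not confirmed by the log):

```python
# Hypothetical client for the /ocr endpoint observed in the log above.
# Only the route (POST /ocr, port 8707) is confirmed by the log; the
# multipart field name "file" and the response format are assumptions.
import json
import urllib.request
import uuid


def build_ocr_request(pdf_path: str,
                      url: str = "http://127.0.0.1:8707/ocr") -> urllib.request.Request:
    """Build a multipart/form-data POST uploading one PDF."""
    boundary = uuid.uuid4().hex
    with open(pdf_path, "rb") as f:
        payload = f.read()
    body = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="file"; filename="doc.pdf"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode() + payload + f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_ocr_request("sample.pdf")
    with urllib.request.urlopen(req) as resp:   # requires the server to be running
        print(json.loads(resp.read()))
```

Only the request construction runs offline; the actual call needs the server from the log to be up.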
INFO 02-04 17:17:27 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr_server.py:478: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 17:17:32 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 17:17:32 [config.py:721] This model supports multiple tasks: {'generate', 'embed', 'score', 'reward', 'classify'}. Defaulting to 'generate'.
INFO 02-04 17:17:32 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False,
INFO 02-04 17:17:33 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-04 17:17:33 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-04 17:17:33 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-04 17:17:33 [worker_base.py:653] ########## 44064 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
INFO 02-04 17:17:33 [worker_base.py:654] ########## 44064 process(rank0) is running on memnode(s): {0, 1, 2, 3}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0204 17:17:33.456287 44064 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0204 17:17:33.456384 44064 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 17:17:33.456838 44064 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x56216603a700, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0204 17:17:33.456851 44064 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 17:17:33.476958 44064 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x56216603a700, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0204 17:17:33.476995 44064 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 17:17:33.478204 44064 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x56216603a700, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0204 17:17:33.478221 44064 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0204 17:17:33.479249 44064 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x56216603a700, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0204 17:17:33.479267 44064 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-04 17:17:33 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 17:17:33 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 17:17:34 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 7.15it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 7.13it/s]
INFO 02-04 17:17:36 [loader.py:460] Loading weights took 1.72 seconds
INFO 02-04 17:17:37 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.400736 seconds
Some kwargs in processor config are unused and will not have any effect: sft_format, add_special_token, patch_size, image_std, candidate_resolutions, ignore_id, pad_token, downsample_ratio, image_token, image_mean, normalize, mask_prompt.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 17:17:48 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 17:17:49 [worker.py:287] Memory profiling takes 12.09 seconds
INFO 02-04 17:17:49 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-04 17:17:49 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 2.00GiB; the rest of the memory reserved for KV Cache is 47.67GiB.
INFO 02-04 17:17:49 [executor_base.py:112] # rocm blocks: 13017, # CPU blocks: 0
INFO 02-04 17:17:49 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 101.70x
INFO 02-04 17:17:49 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:02<00:00, 2.02it/s]
INFO 02-04 17:17:52 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-04 17:17:52 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 15.69 seconds
[SUCCESS] Model loading complete
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Server started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [44064]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
/usr/bin/python3: can't open file '/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr_server.py': [Errno 2] No such file or directory
/usr/bin/python3: can't open file '/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr_server.py': [Errno 2] No such file or directory
export VLLM_USE_V1=0
export HIP_VISIBLE_DEVICES=3
# image: streaming output
#python run_dpsk_ocr2_image.py
# pdf
python run_dpsk_ocr2_pdf.py