deepseek_ocr2_server_8707_20260227_113134.log 12.2 KB
INFO 02-27 11:31:39 [__init__.py:240] Automatically detected platform rocm.
/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py:476: DeprecationWarning: 
        on_event is deprecated, use lifespan event handlers instead.

        Read more about it in the
        [FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
        
  @app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-27 11:31:45 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-27 11:31:45 [config.py:721] This model supports multiple tasks: {'embed', 'score', 'classify', 'generate', 'reward'}. Defaulting to 'generate'.
INFO 02-27 11:31:45 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False, 
INFO 02-27 11:31:46 [rocm.py:226] None is not supported in AMD GPUs.
INFO 02-27 11:31:46 [rocm.py:227] Using ROCmFlashAttention backend.
WARNING 02-27 11:31:46 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
INFO 02-27 11:31:46 [worker_base.py:653] ########## 2386 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63}
INFO 02-27 11:31:46 [worker_base.py:654] ########## 2386 process(rank0) is running on memnode(s): {0, 1, 2, 3, 4, 5, 6, 7}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0227 11:31:46.413403  2386 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0227 11:31:46.413467  2386 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:31:46.413935  2386 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5609f68a77b0, SPLIT_COLOR: 3389850942126204093, PG Name: 1
I0227 11:31:46.413949  2386 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:31:46.434001  2386 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5609f68a77b0, SPLIT_COLOR: 3389850942126204093, PG Name: 3
I0227 11:31:46.434037  2386 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:31:46.435284  2386 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5609f68a77b0, SPLIT_COLOR: 3389850942126204093, PG Name: 5
I0227 11:31:46.435304  2386 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0227 11:31:46.436266  2386 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5609f68a77b0, SPLIT_COLOR: 3389850942126204093, PG Name: 7
I0227 11:31:46.436283  2386 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
INFO 02-27 11:31:46 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-27 11:31:46 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-27 11:31:47 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 513, in <module>
[rank0]:     main()
[rank0]:   File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 504, in main
[rank0]:     initialize_model(args.model_path, args.gpu_id)
[rank0]:   File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 272, in initialize_model
[rank0]:     llm = LLM(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 1182, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 255, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 520, in from_engine_args
[rank0]:     return engine_cls.from_vllm_config(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 496, in from_vllm_config
[rank0]:     return cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 283, in __init__
[rank0]:     self.model_executor = executor_class(vllm_config=vllm_config)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
[rank0]:     self.collective_rpc("load_model")
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]:     answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2624, in run_method
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 203, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1136, in load_model
[rank0]:     self.model = get_model(vllm_config=self.vllm_config)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
[rank0]:     return loader.load_model(vllm_config=vllm_config)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 454, in load_model
[rank0]:     model = _initialize_model(vllm_config=vllm_config)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 133, in _initialize_model
[rank0]:     return model_class(vllm_config=vllm_config, prefix=prefix)
[rank0]:   File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2.py", line 325, in __init__
[rank0]:     self.language_model = init_vllm_registered_model(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 286, in init_vllm_registered_model
[rank0]:     return _initialize_model(vllm_config=vllm_config, prefix=prefix)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 133, in _initialize_model
[rank0]:     return model_class(vllm_config=vllm_config, prefix=prefix)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 457, in __init__
[rank0]:     self.model = DeepseekModel(vllm_config=vllm_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 358, in __init__
[rank0]:     self.start_layer, self.end_layer, self.layers = make_layers(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 609, in make_layers
[rank0]:     [PPMissingLayer() for _ in range(start_layer)] + [
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 610, in <listcomp>
[rank0]:     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 360, in <lambda>
[rank0]:     lambda prefix: DeepseekDecoderLayer(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 300, in __init__
[rank0]:     self.mlp = DeepseekMoE(config=config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 120, in __init__
[rank0]:     self.pack_params()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek.py", line 162, in pack_params
[rank0]:     self.w2 = self.w2.permute(0, 2, 1).contiguous()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 79, in __torch_function__
[rank0]:     return func(*args, **kwargs)
[rank0]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 140.00 MiB. GPU 0 has a total capacity of 63.98 GiB of which 0 bytes is free. Of the allocated memory 3.24 GiB is allocated by PyTorch, and 235.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
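Note what the OOM message actually says: of the 64 GiB on GPU 0, PyTorch in this process holds only ~3.5 GiB, yet 0 bytes are free. That pattern usually means another process already occupies the GPU, rather than this load being too large. A hedged checklist, assuming `rocm-smi` is installed on the node:

```shell
# List processes holding GPU memory, if rocm-smi is present on this node.
# (Another process likely owns most of the 64 GiB, since PyTorch itself
# reports only ~3.5 GiB as its own.)
command -v rocm-smi >/dev/null && rocm-smi --showpids

# Allocator hint quoted by the error message itself, to reduce
# fragmentation when reserved-but-unallocated memory is large:
export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True
```

If the GPU is genuinely shared, vLLM's `gpu_memory_utilization` parameter to `LLM(...)` can also cap this engine's share on relaunch; whether `deepseek_ocr2_server.py` exposes it as a CLI flag is not shown in this log.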