on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
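The FastAPI deprecation notice above repeats at every restart in this log. A minimal sketch of the lifespan replacement it points to, assuming the server only needs startup/shutdown hooks (the load/release bodies below are placeholders, not the actual deepseek_ocr2_server.py code):

from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # startup work goes here (e.g. loading the model)
    yield
    # shutdown work goes here (e.g. releasing the model and thread pools)

app = FastAPI(lifespan=lifespan)  # replaces @app.on_event("startup"/"shutdown")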
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 18:32:52 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 18:32:52 [config.py:721] This model supports multiple tasks: {'generate', 'classify', 'embed', 'reward', 'score'}. Defaulting to 'generate'.
INFO 02-04 18:32:53 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 18:32:53 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 18:32:54 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
INFO 02-04 18:32:56 [loader.py:460] Loading weights took 1.84 seconds
INFO 02-04 18:32:57 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.543877 seconds
Some kwargs in processor config are unused and will not have any effect: pad_token, add_special_token, sft_format, candidate_resolutions, patch_size, normalize, downsample_ratio, image_std, image_token, ignore_id, mask_prompt, image_mean.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 513, in <module>
[rank0]: main()
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 504, in main
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 18:33:45 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 18:33:45 [config.py:721] This model supports multiple tasks: {'generate', 'score', 'reward', 'classify', 'embed'}. Defaulting to 'generate'.
INFO 02-04 18:33:46 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 18:33:46 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 18:33:47 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
INFO 02-04 18:33:50 [loader.py:460] Loading weights took 1.73 seconds
INFO 02-04 18:33:50 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.422877 seconds
Some kwargs in processor config are unused and will not have any effect: mask_prompt, ignore_id, add_special_token, pad_token, image_token, normalize, downsample_ratio, candidate_resolutions, sft_format, patch_size, image_mean, image_std.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 18:34:02 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 18:34:02 [worker.py:287] Memory profiling takes 12.06 seconds
INFO 02-04 18:34:02 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-04 18:34:02 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 2.00GiB; the rest of the memory reserved for KV Cache is 47.67GiB.
INFO 02-04 18:34:03 [executor_base.py:112] # rocm blocks: 13017, # CPU blocks: 0
INFO 02-04 18:34:03 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 101.70x
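As a cross-check, the profiling figures above are internally consistent; a quick sketch of the arithmetic, assuming a KV-cache block size of 64 tokens (inferred from the reported values, not read from any config):

# Memory budget: total_gpu_memory * gpu_memory_utilization
budget = 63.98 * 0.90                    # 57.58 GiB (log rounds to 57.59)
# KV cache = budget - weights - non_torch - activation peak
kv_cache = budget - 6.33 - 1.58 - 2.00   # ~47.67 GiB
# Max concurrency = gpu_blocks * block_size / max_model_len
concurrency = 13017 * 64 / 8192          # ~101.70x

The same formula reproduces the 254.65x figure in the 3281-token run elsewhere in this log (13055 * 64 / 3281).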
INFO 02-04 18:34:03 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:02<00:00, 2.01it/s]
INFO 02-04 18:34:06 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-04 18:34:06 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 15.67 seconds
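If cudagraph capture itself ever runs out of memory, the capture message above already names the relevant knobs; a sketch of where they would go when constructing the engine (only the model path is taken from this log, the remaining arguments are illustrative):

from vllm import LLM

llm = LLM(
    model="/home/lst/deepseek_ocr2",
    enforce_eager=True,            # skip cudagraph capture entirely
    gpu_memory_utilization=0.80,   # lower than the 0.90 used in this log
)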
[SUCCESS] Model loaded
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [50743]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 18:34:56 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 18:34:56 [config.py:721] This model supports multiple tasks: {'generate', 'embed', 'classify', 'reward', 'score'}. Defaulting to 'generate'.
INFO 02-04 18:34:57 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 18:34:57 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 18:34:58 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
INFO 02-04 18:35:00 [loader.py:460] Loading weights took 1.82 seconds
INFO 02-04 18:35:00 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.480003 seconds
Some kwargs in processor config are unused and will not have any effect: image_std, pad_token, normalize, ignore_id, mask_prompt, patch_size, downsample_ratio, add_special_token, sft_format, image_token, image_mean, candidate_resolutions.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 513, in <module>
[rank0]: main()
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 504, in main
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-05 10:03:27 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-05 10:03:27 [config.py:721] This model supports multiple tasks: {'score', 'generate', 'reward', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 02-05 10:03:28 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-05 10:03:28 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-05 10:03:29 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
INFO 02-05 10:03:31 [loader.py:460] Loading weights took 1.79 seconds
INFO 02-05 10:03:32 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.454019 seconds
Some kwargs in processor config are unused and will not have any effect: image_token, sft_format, pad_token, image_std, add_special_token, ignore_id, patch_size, downsample_ratio, normalize, candidate_resolutions, image_mean, mask_prompt.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-05 10:03:43 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-05 10:03:44 [worker.py:287] Memory profiling takes 11.98 seconds
INFO 02-05 10:03:44 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-05 10:03:44 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 2.00GiB; the rest of the memory reserved for KV Cache is 47.67GiB.
INFO 02-05 10:03:44 [executor_base.py:112] # rocm blocks: 13017, # CPU blocks: 0
INFO 02-05 10:03:44 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 101.70x
INFO 02-05 10:03:44 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:03<00:00, 2.00it/s]
INFO 02-05 10:03:47 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-05 10:03:47 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 15.60 seconds
[SUCCESS] Model loaded
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [58452]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
Some kwargs in processor config are unused and will not have any effect: image_token, sft_format, pad_token, image_std, add_special_token, ignore_id, patch_size, downsample_ratio, normalize, candidate_resolutions, image_mean, mask_prompt.
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-27 11:29:12 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-27 11:29:12 [config.py:721] This model supports multiple tasks: {'generate', 'score', 'embed', 'reward', 'classify'}. Defaulting to 'generate'.
INFO 02-27 11:29:14 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-27 11:29:14 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-27 11:29:15 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 513, in <module>
[rank0]: main()
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 504, in main
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 79, in __torch_function__
[rank0]: return func(*args, **kwargs)
[rank0]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 140.00 MiB. GPU 0 has a total capacity of 63.98 GiB of which 0 bytes is free. Of the allocated memory 3.24 GiB is allocated by PyTorch, and 235.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
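Note that the allocator reports 0 bytes free while this process has only ~3.5 GiB allocated, which suggests the GPU was still occupied, most likely by an earlier server instance that had not fully exited; the same failure recurs on the next three restarts (11:31, 11:36, 11:41) before the 11:44 restart succeeds. Independently of that, the allocator hint from the message can be applied before torch initializes HIP; a sketch, assuming it is set at the top of the launch script:

import os
# Must be set before the first torch import, or the allocator ignores it.
os.environ.setdefault("PYTORCH_HIP_ALLOC_CONF", "expandable_segments:True")

import torch  # imported only after the env var is in place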
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-27 11:31:45 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-27 11:31:45 [config.py:721] This model supports multiple tasks: {'embed', 'score', 'classify', 'generate', 'reward'}. Defaulting to 'generate'.
INFO 02-27 11:31:46 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-27 11:31:46 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-27 11:31:47 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 513, in <module>
[rank0]: main()
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 504, in main
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 79, in __torch_function__
[rank0]: return func(*args, **kwargs)
[rank0]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 140.00 MiB. GPU 0 has a total capacity of 63.98 GiB of which 0 bytes is free. Of the allocated memory 3.24 GiB is allocated by PyTorch, and 235.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-27 11:36:25 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-27 11:36:25 [config.py:721] This model supports multiple tasks: {'generate', 'reward', 'score', 'classify', 'embed'}. Defaulting to 'generate'.
INFO 02-27 11:36:26 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-27 11:36:26 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-27 11:36:28 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 513, in <module>
[rank0]: main()
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 504, in main
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 79, in __torch_function__
[rank0]: return func(*args, **kwargs)
[rank0]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 140.00 MiB. GPU 0 has a total capacity of 63.98 GiB of which 0 bytes is free. Of the allocated memory 3.24 GiB is allocated by PyTorch, and 235.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-27 11:41:43 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-27 11:41:43 [config.py:721] This model supports multiple tasks: {'score', 'classify', 'generate', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 02-27 11:41:44 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-27 11:41:44 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-27 11:41:46 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 513, in <module>
[rank0]: main()
[rank0]: File "/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py", line 504, in main
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 79, in __torch_function__
[rank0]: return func(*args, **kwargs)
[rank0]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 140.00 MiB. GPU 0 has a total capacity of 63.98 GiB of which 0 bytes is free. Of the allocated memory 3.24 GiB is allocated by PyTorch, and 235.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-27 11:44:43 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-27 11:44:43 [config.py:721] This model supports multiple tasks: {'reward', 'score', 'embed', 'generate', 'classify'}. Defaulting to 'generate'.
INFO 02-27 11:44:44 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-27 11:44:44 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-27 11:44:46 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
INFO 02-27 11:44:57 [loader.py:460] Loading weights took 10.70 seconds
INFO 02-27 11:44:57 [model_runner.py:1165] Model loading took 6.3336 GiB and 12.624936 seconds
Some kwargs in processor config are unused and will not have any effect: image_std, pad_token, image_token, patch_size, image_mean, ignore_id, add_special_token, downsample_ratio, mask_prompt, candidate_resolutions, normalize, sft_format.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-27 11:45:11 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-27 11:45:12 [worker.py:287] Memory profiling takes 14.13 seconds
INFO 02-27 11:45:12 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-27 11:45:12 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 2.00GiB; the rest of the memory reserved for KV Cache is 47.67GiB.
INFO 02-27 11:45:12 [executor_base.py:112] # rocm blocks: 13017, # CPU blocks: 0
INFO 02-27 11:45:12 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 101.70x
INFO 02-27 11:45:12 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:03<00:00, 1.50it/s]
INFO 02-27 11:45:16 [model_runner.py:1752] Graph capturing finished in 4 secs, took 0.12 GiB
INFO 02-27 11:45:16 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 18.93 seconds
[SUCCESS] Model loaded
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [3409]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
Some kwargs in processor config are unused and will not have any effect: image_std, pad_token, image_token, patch_size, image_mean, ignore_id, add_special_token, downsample_ratio, mask_prompt, candidate_resolutions, normalize, sft_format.
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr
INFO 02-04 14:25:58 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 14:25:58 [config.py:721] This model supports multiple tasks: {'score', 'generate', 'embed', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 02-04 14:25:59 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 14:25:59 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 14:26:00 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 14:28:05 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 14:28:05 [config.py:721] This model supports multiple tasks: {'embed', 'generate', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 02-04 14:28:06 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 14:28:06 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 14:28:07 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
INFO 02-04 14:28:09 [loader.py:460] Loading weights took 1.92 seconds
INFO 02-04 14:28:10 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.579491 seconds
Some kwargs in processor config are unused and will not have any effect: ignore_id, patch_size, image_std, sft_format, normalize, image_token, add_special_token, mask_prompt, image_mean, candidate_resolutions, pad_token, downsample_ratio.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 14:28:19 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 14:28:20 [worker.py:287] Memory profiling takes 10.29 seconds
INFO 02-04 14:28:20 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-04 14:28:20 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 1.86GiB; the rest of the memory reserved for KV Cache is 47.81GiB.
INFO 02-04 14:28:20 [executor_base.py:112] # rocm blocks: 13055, # CPU blocks: 1092
INFO 02-04 14:28:20 [executor_base.py:117] Maximum concurrency for 3281 tokens per request: 254.65x
INFO 02-04 14:28:22 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:02<00:00, 2.00it/s]
INFO 02-04 14:28:25 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-04 14:28:25 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 15.87 seconds
[SUCCESS] Model loaded
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [12544]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
Some kwargs in processor config are unused and will not have any effect: ignore_id, patch_size, image_std, sft_format, normalize, image_token, add_special_token, mask_prompt, image_mean, candidate_resolutions, pad_token, downsample_ratio.
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 14:39:06 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 14:39:06 [config.py:721] This model supports multiple tasks: {'score', 'reward', 'embed', 'classify', 'generate'}. Defaulting to 'generate'.
INFO 02-04 14:39:07 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 14:39:07 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 14:39:08 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
INFO 02-04 14:39:10 [loader.py:460] Loading weights took 1.84 seconds
INFO 02-04 14:39:10 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.529130 seconds
Some kwargs in processor config are unused and will not have any effect: normalize, image_std, candidate_resolutions, sft_format, image_mean, add_special_token, patch_size, pad_token, mask_prompt, downsample_ratio, image_token, ignore_id.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 14:39:22 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 14:39:23 [worker.py:287] Memory profiling takes 11.97 seconds
INFO 02-04 14:39:23 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-04 14:39:23 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 2.00GiB; the rest of the memory reserved for KV Cache is 47.67GiB.
INFO 02-04 14:39:23 [executor_base.py:112] # rocm blocks: 13017, # CPU blocks: 1092
INFO 02-04 14:39:23 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 101.70x
INFO 02-04 14:39:25 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:03<00:00, 1.99it/s]
INFO 02-04 14:39:28 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-04 14:39:28 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 17.42 seconds
[SUCCESS] Model loaded
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [13561]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
Some kwargs in processor config are unused and will not have any effect: normalize, image_std, candidate_resolutions, sft_format, image_mean, add_special_token, patch_size, pad_token, mask_prompt, downsample_ratio, image_token, ignore_id.
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 14:50:21 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 14:50:21 [config.py:721] This model supports multiple tasks: {'classify', 'embed', 'score', 'reward', 'generate'}. Defaulting to 'generate'.
INFO 02-04 14:50:22 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 14:50:22 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 14:50:23 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
INFO 02-04 14:50:26 [loader.py:460] Loading weights took 1.81 seconds
INFO 02-04 14:50:26 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.499614 seconds
Some kwargs in processor config are unused and will not have any effect: ignore_id, image_token, add_special_token, sft_format, image_mean, image_std, mask_prompt, downsample_ratio, candidate_resolutions, patch_size, pad_token, normalize.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 14:50:38 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 14:50:39 [worker.py:287] Memory profiling takes 12.33 seconds
INFO 02-04 14:50:39 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-04 14:50:39 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 2.00GiB; the rest of the memory reserved for KV Cache is 47.67GiB.
INFO 02-04 14:50:39 [executor_base.py:112] # rocm blocks: 13017, # CPU blocks: 1092
INFO 02-04 14:50:39 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 101.70x
INFO 02-04 14:50:41 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:02<00:00, 2.01it/s]
INFO 02-04 14:50:44 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-04 14:50:44 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 17.52 seconds
[SUCCESS] Model loaded
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [14555]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
Some kwargs in processor config are unused and will not have any effect: ignore_id, image_token, add_special_token, sft_format, image_mean, image_std, mask_prompt, downsample_ratio, candidate_resolutions, patch_size, pad_token, normalize.
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 16:50:33 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 16:50:33 [config.py:721] This model supports multiple tasks: {'embed', 'reward', 'score', 'generate', 'classify'}. Defaulting to 'generate'.
INFO 02-04 16:50:34 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 16:50:34 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 16:50:35 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
INFO 02-04 16:50:38 [loader.py:460] Loading weights took 1.73 seconds
INFO 02-04 16:50:38 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.394852 seconds
Some kwargs in processor config are unused and will not have any effect: normalize, pad_token, downsample_ratio, mask_prompt, patch_size, sft_format, ignore_id, image_token, image_mean, image_std, add_special_token, candidate_resolutions.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 16:50:49 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 16:50:50 [worker.py:287] Memory profiling takes 12.06 seconds
INFO 02-04 16:50:50 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-04 16:50:50 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 2.00GiB; the rest of the memory reserved for KV Cache is 47.67GiB.
INFO 02-04 16:50:50 [executor_base.py:112] # rocm blocks: 13017, # CPU blocks: 0
INFO 02-04 16:50:50 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 101.70x
INFO 02-04 16:50:50 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:02<00:00, 2.01it/s]
INFO 02-04 16:50:53 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-04 16:50:53 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 15.65 seconds
[SUCCESS] Model loaded
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [42007]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 16:59:41 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 16:59:41 [config.py:721] This model supports multiple tasks: {'embed', 'classify', 'generate', 'score', 'reward'}. Defaulting to 'generate'.
INFO 02-04 16:59:42 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 16:59:42 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 16:59:43 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
INFO 02-04 16:59:45 [loader.py:460] Loading weights took 1.71 seconds
INFO 02-04 16:59:45 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.386030 seconds
Some kwargs in processor config are unused and will not have any effect: normalize, image_std, mask_prompt, image_token, candidate_resolutions, add_special_token, image_mean, pad_token, downsample_ratio, patch_size, ignore_id, sft_format.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 16:59:57 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 16:59:57 [worker.py:287] Memory profiling takes 12.02 seconds
INFO 02-04 16:59:57 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-04 16:59:57 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 2.00GiB; the rest of the memory reserved for KV Cache is 47.67GiB.
INFO 02-04 16:59:58 [executor_base.py:112] # rocm blocks: 13017, # CPU blocks: 0
INFO 02-04 16:59:58 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 101.70x
INFO 02-04 16:59:58 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:02<00:00, 2.05it/s]
INFO 02-04 17:00:01 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-04 17:00:01 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 15.56 seconds
[SUCCESS] Model loaded
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [42893]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
Some kwargs in processor config are unused and will not have any effect: normalize, image_std, mask_prompt, image_token, candidate_resolutions, add_special_token, image_mean, pad_token, downsample_ratio, patch_size, ignore_id, sft_format.
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr2
INFO 02-04 17:17:32 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
INFO 02-04 17:17:32 [config.py:721] This model supports multiple tasks: {'generate', 'embed', 'score', 'reward', 'classify'}. Defaulting to 'generate'.
INFO 02-04 17:17:33 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-04 17:17:33 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
INFO 02-04 17:17:34 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
INFO 02-04 17:17:36 [loader.py:460] Loading weights took 1.72 seconds
INFO 02-04 17:17:37 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.400736 seconds
Some kwargs in processor config are unused and will not have any effect: sft_format, add_special_token, patch_size, image_std, candidate_resolutions, ignore_id, pad_token, downsample_ratio, image_token, image_mean, normalize, mask_prompt.
/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 17:17:48 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 17:17:49 [worker.py:287] Memory profiling takes 12.09 seconds
INFO 02-04 17:17:49 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-04 17:17:49 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 2.00GiB; the rest of the memory reserved for KV Cache is 47.67GiB.
INFO 02-04 17:17:49 [executor_base.py:112] # rocm blocks: 13017, # CPU blocks: 0
INFO 02-04 17:17:49 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 101.70x
INFO 02-04 17:17:49 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:02<00:00, 2.02it/s]
INFO 02-04 17:17:52 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-04 17:17:52 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 15.69 seconds
[SUCCESS] Model loaded
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [44064]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)