on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
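
The deprecation warning above comes from FastAPI's legacy `@app.on_event` hooks. A minimal migration sketch to the lifespan pattern the warning points at (the handler bodies are placeholders, not this service's actual startup code):

```python
from contextlib import asynccontextmanager

from fastapi import FastAPI


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: load the model here (placeholder for this service's loader).
    yield
    # Shutdown: release resources previously handled by @app.on_event("shutdown").


app = FastAPI(lifespan=lifespan)
```

Code that runs before `yield` replaces `@app.on_event("startup")`; code after it replaces `@app.on_event("shutdown")`.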
[INFO] Loading model: /home/lst/deepseek_ocr
INFO 02-04 11:46:40 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCRForCausalLM']}
INFO 02-04 11:46:40 [config.py:721] This model supports multiple tasks: {'generate', 'embed', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 02-04 11:46:45 [loader.py:460] Loading weights took 2.10 seconds
INFO 02-04 11:46:45 [model_runner.py:1165] Model loading took 6.2319 GiB and 3.525227 seconds
Some kwargs in processor config are unused and will not have any effect: candidate_resolutions, downsample_ratio, patch_size, ignore_id, mask_prompt, normalize, pad_token, add_special_token, image_std, image_mean, sft_format, image_token.
/home/lst/DeepSeek-OCR-vllm/deepencoder/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 11:46:58 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 11:46:58 [worker.py:287] Memory profiling takes 12.97 seconds
INFO 02-04 11:46:58 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.50) = 31.99GiB
INFO 02-04 11:46:58 [worker.py:287] model weights take 6.23GiB; non_torch_memory takes 1.55GiB; PyTorch activation peak memory takes 1.45GiB; the rest of the memory reserved for KV Cache is 22.76GiB.
INFO 02-04 11:46:58 [executor_base.py:112] # rocm blocks: 1553, # CPU blocks: 273
INFO 02-04 11:46:58 [executor_base.py:117] Maximum concurrency for 3281 tokens per request: 121.17x
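
The memory-profiling lines above decompose the GPU budget, and the arithmetic can be checked directly (all figures copied from this run, which used `gpu_memory_utilization=0.50`):

```python
# Figures copied from the worker log lines above (first run).
total_gpu_memory = 63.98       # GiB reported by the worker
gpu_memory_utilization = 0.50
budget = total_gpu_memory * gpu_memory_utilization   # 31.99 GiB usable

model_weights = 6.23           # GiB
non_torch_memory = 1.55        # GiB
activation_peak = 1.45         # GiB
# Whatever is left of the budget is reserved for the KV cache.
kv_cache = budget - model_weights - non_torch_memory - activation_peak

print(f"{budget:.2f} GiB budget, {kv_cache:.2f} GiB left for KV cache")
```

This reproduces the logged 31.99 GiB budget and 22.76 GiB KV-cache reservation.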
INFO 02-04 11:47:01 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:03<00:00, 1.53it/s]
INFO 02-04 11:47:04 [model_runner.py:1752] Graph capturing finished in 4 secs, took 0.12 GiB
INFO 02-04 11:47:04 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 19.41 seconds
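
If cudagraph capture ever fails with an out-of-memory error here, the warning above suggests eager mode; in vLLM's offline API that is a constructor flag. A sketch only (model path taken from this log, other arguments left at their defaults):

```python
from vllm import LLM

# Fall back to eager execution instead of CUDA-graph capture.
llm = LLM(model="/home/lst/deepseek_ocr", enforce_eager=True)
```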
[SUCCESS] Model loading complete
[INFO] Thread pool configuration:
- CPU thread pool: 1 thread
- GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [2639]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
Some kwargs in processor config are unused and will not have any effect: candidate_resolutions, downsample_ratio, patch_size, ignore_id, mask_prompt, normalize, pad_token, add_special_token, image_std, image_mean, sft_format, image_token.
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr
INFO 02-04 11:59:29 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCRForCausalLM']}
INFO 02-04 11:59:29 [config.py:721] This model supports multiple tasks: {'reward', 'classify', 'embed', 'score', 'generate'}. Defaulting to 'generate'.
INFO 02-04 11:59:33 [loader.py:460] Loading weights took 2.07 seconds
INFO 02-04 11:59:33 [model_runner.py:1165] Model loading took 6.2319 GiB and 3.502876 seconds
Some kwargs in processor config are unused and will not have any effect: downsample_ratio, image_token, ignore_id, patch_size, candidate_resolutions, sft_format, pad_token, image_mean, normalize, add_special_token, mask_prompt, image_std.
/home/lst/DeepSeek-OCR-vllm/deepencoder/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 11:59:47 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 11:59:48 [worker.py:287] Memory profiling takes 14.62 seconds
INFO 02-04 11:59:48 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.75) = 47.99GiB
INFO 02-04 11:59:48 [worker.py:287] model weights take 6.23GiB; non_torch_memory takes 1.55GiB; PyTorch activation peak memory takes 1.60GiB; the rest of the memory reserved for KV Cache is 38.60GiB.
INFO 02-04 11:59:48 [executor_base.py:112] # rocm blocks: 10541, # CPU blocks: 1092
INFO 02-04 11:59:48 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 82.35x
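
The "Maximum concurrency" figure in this run is consistent with kv_cache_blocks × block_size ÷ max_model_len. The block size is not printed in this log, so the value of 64 tokens per block below is an inferred assumption that happens to reproduce the reported numbers:

```python
def max_concurrency(num_gpu_blocks: int, max_model_len: int,
                    block_size: int = 64) -> float:
    # KV-cache token capacity divided by the per-request token budget.
    return num_gpu_blocks * block_size / max_model_len

# Values from this run's log: 10541 ROCm blocks, 8192 tokens per request.
print(round(max_concurrency(10541, 8192), 2))  # 82.35, matching the log
```

The same formula also matches the later runs (6214 blocks at 3281 tokens → 121.21x; 13203 blocks at 3281 tokens → 257.54x).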
INFO 02-04 11:59:50 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:02<00:00, 2.04it/s]
INFO 02-04 11:59:53 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-04 11:59:53 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 20.09 seconds
[SUCCESS] Model loading complete
[INFO] Thread pool configuration:
- CPU thread pool: 1 thread
- GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [3602]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
Some kwargs in processor config are unused and will not have any effect: downsample_ratio, image_token, ignore_id, patch_size, candidate_resolutions, sft_format, pad_token, image_mean, normalize, add_special_token, mask_prompt, image_std.
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr
INFO 02-04 13:15:25 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCRForCausalLM']}
INFO 02-04 13:15:25 [config.py:721] This model supports multiple tasks: {'reward', 'classify', 'generate', 'score', 'embed'}. Defaulting to 'generate'.
INFO 02-04 13:15:30 [loader.py:460] Loading weights took 2.13 seconds
INFO 02-04 13:15:30 [model_runner.py:1165] Model loading took 6.2319 GiB and 3.562644 seconds
Some kwargs in processor config are unused and will not have any effect: ignore_id, pad_token, image_token, add_special_token, candidate_resolutions, mask_prompt, image_std, sft_format, patch_size, normalize, image_mean, downsample_ratio.
/home/lst/DeepSeek-OCR-vllm/deepencoder/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 13:15:44 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 13:15:45 [worker.py:287] Memory profiling takes 14.64 seconds
INFO 02-04 13:15:45 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.75) = 47.99GiB
INFO 02-04 13:15:45 [worker.py:287] model weights take 6.23GiB; non_torch_memory takes 1.55GiB; PyTorch activation peak memory takes 1.60GiB; the rest of the memory reserved for KV Cache is 38.60GiB.
INFO 02-04 13:15:45 [executor_base.py:112] # rocm blocks: 10541, # CPU blocks: 1092
INFO 02-04 13:15:45 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 82.35x
INFO 02-04 13:15:47 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 19/19 [00:09<00:00, 2.08it/s]
INFO 02-04 13:15:56 [model_runner.py:1752] Graph capturing finished in 9 secs, took 0.16 GiB
INFO 02-04 13:15:56 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 26.33 seconds
[SUCCESS] Model loading complete
[INFO] Thread pool configuration:
- CPU thread pool: 2 threads
- GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [4593]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
Some kwargs in processor config are unused and will not have any effect: ignore_id, pad_token, image_token, add_special_token, candidate_resolutions, mask_prompt, image_std, sft_format, patch_size, normalize, image_mean, downsample_ratio.
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr
INFO 02-04 13:19:55 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCRForCausalLM']}
INFO 02-04 13:19:55 [config.py:721] This model supports multiple tasks: {'generate', 'embed', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 02-04 13:20:00 [loader.py:460] Loading weights took 2.10 seconds
INFO 02-04 13:20:00 [model_runner.py:1165] Model loading took 6.2319 GiB and 3.520996 seconds
Some kwargs in processor config are unused and will not have any effect: add_special_token, mask_prompt, image_std, image_token, pad_token, image_mean, candidate_resolutions, sft_format, downsample_ratio, ignore_id, patch_size, normalize.
/home/lst/DeepSeek-OCR-vllm/deepencoder/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 13:20:14 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 13:20:15 [worker.py:287] Memory profiling takes 14.27 seconds
INFO 02-04 13:20:15 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.75) = 47.99GiB
INFO 02-04 13:20:15 [worker.py:287] model weights take 6.23GiB; non_torch_memory takes 1.55GiB; PyTorch activation peak memory takes 1.60GiB; the rest of the memory reserved for KV Cache is 38.60GiB.
INFO 02-04 13:20:15 [executor_base.py:112] # rocm blocks: 10541, # CPU blocks: 1092
INFO 02-04 13:20:15 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 82.35x
INFO 02-04 13:20:17 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 19/19 [00:09<00:00, 2.09it/s]
INFO 02-04 13:20:26 [model_runner.py:1752] Graph capturing finished in 9 secs, took 0.16 GiB
INFO 02-04 13:20:26 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 25.91 seconds
[SUCCESS] Model loading complete
[INFO] Thread pool configuration:
- CPU thread pool: 1 thread
- GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [5553]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
Some kwargs in processor config are unused and will not have any effect: add_special_token, mask_prompt, image_std, image_token, pad_token, image_mean, candidate_resolutions, sft_format, downsample_ratio, ignore_id, patch_size, normalize.
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr
INFO 02-04 13:22:34 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCRForCausalLM']}
INFO 02-04 13:22:34 [config.py:721] This model supports multiple tasks: {'reward', 'classify', 'embed', 'generate', 'score'}. Defaulting to 'generate'.
INFO 02-04 13:22:39 [loader.py:460] Loading weights took 1.97 seconds
INFO 02-04 13:22:39 [model_runner.py:1165] Model loading took 6.2319 GiB and 3.414745 seconds
Some kwargs in processor config are unused and will not have any effect: image_mean, sft_format, add_special_token, downsample_ratio, image_token, pad_token, ignore_id, patch_size, candidate_resolutions, mask_prompt, normalize, image_std.
/home/lst/DeepSeek-OCR-vllm/deepencoder/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 13:22:53 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 13:22:54 [worker.py:287] Memory profiling takes 14.41 seconds
INFO 02-04 13:22:54 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.75) = 47.99GiB
INFO 02-04 13:22:54 [worker.py:287] model weights take 6.23GiB; non_torch_memory takes 1.55GiB; PyTorch activation peak memory takes 1.60GiB; the rest of the memory reserved for KV Cache is 38.60GiB.
INFO 02-04 13:22:54 [executor_base.py:112] # rocm blocks: 10541, # CPU blocks: 1092
INFO 02-04 13:22:54 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 82.35x
INFO 02-04 13:22:56 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 19/19 [00:09<00:00, 2.07it/s]
INFO 02-04 13:23:05 [model_runner.py:1752] Graph capturing finished in 9 secs, took 0.16 GiB
INFO 02-04 13:23:05 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 26.07 seconds
[SUCCESS] Model loading complete
[INFO] Thread pool configuration:
- CPU thread pool: 16 threads
- GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [6517]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
Some kwargs in processor config are unused and will not have any effect: image_mean, sft_format, add_special_token, downsample_ratio, image_token, pad_token, ignore_id, patch_size, candidate_resolutions, mask_prompt, normalize, image_std.
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr
INFO 02-04 13:27:19 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCRForCausalLM']}
INFO 02-04 13:27:19 [config.py:721] This model supports multiple tasks: {'classify', 'embed', 'score', 'reward', 'generate'}. Defaulting to 'generate'.
INFO 02-04 13:27:23 [loader.py:460] Loading weights took 2.11 seconds
INFO 02-04 13:27:23 [model_runner.py:1165] Model loading took 6.2319 GiB and 3.541007 seconds
Some kwargs in processor config are unused and will not have any effect: candidate_resolutions, mask_prompt, pad_token, sft_format, image_std, image_mean, add_special_token, normalize, downsample_ratio, patch_size, ignore_id, image_token.
/home/lst/DeepSeek-OCR-vllm/deepencoder/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 13:27:36 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 13:27:36 [worker.py:287] Memory profiling takes 13.04 seconds
INFO 02-04 13:27:36 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.50) = 31.99GiB
INFO 02-04 13:27:36 [worker.py:287] model weights take 6.23GiB; non_torch_memory takes 1.55GiB; PyTorch activation peak memory takes 1.45GiB; the rest of the memory reserved for KV Cache is 22.76GiB.
INFO 02-04 13:27:37 [executor_base.py:112] # rocm blocks: 6214, # CPU blocks: 1092
INFO 02-04 13:27:37 [executor_base.py:117] Maximum concurrency for 3281 tokens per request: 121.21x
INFO 02-04 13:27:39 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:03<00:00, 1.62it/s]
INFO 02-04 13:27:42 [model_runner.py:1752] Graph capturing finished in 4 secs, took 0.12 GiB
INFO 02-04 13:27:42 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 19.26 seconds
[SUCCESS] Model loading complete
[INFO] Thread pool configuration:
- CPU thread pool: 16 threads
- GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [7962]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
Some kwargs in processor config are unused and will not have any effect: candidate_resolutions, mask_prompt, pad_token, sft_format, image_std, image_mean, add_special_token, normalize, downsample_ratio, patch_size, ignore_id, image_token.
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr
INFO 02-04 13:34:54 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCRForCausalLM']}
INFO 02-04 13:34:54 [config.py:721] This model supports multiple tasks: {'embed', 'score', 'classify', 'reward', 'generate'}. Defaulting to 'generate'.
INFO 02-04 13:34:58 [loader.py:460] Loading weights took 1.98 seconds
INFO 02-04 13:34:59 [model_runner.py:1165] Model loading took 6.2319 GiB and 3.419236 seconds
Some kwargs in processor config are unused and will not have any effect: add_special_token, downsample_ratio, candidate_resolutions, mask_prompt, normalize, pad_token, image_std, image_token, sft_format, image_mean, patch_size, ignore_id.
/home/lst/DeepSeek-OCR-vllm/deepencoder/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 13:35:11 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 13:35:12 [worker.py:287] Memory profiling takes 12.98 seconds
INFO 02-04 13:35:12 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-04 13:35:12 [worker.py:287] model weights take 6.23GiB; non_torch_memory takes 1.55GiB; PyTorch activation peak memory takes 1.45GiB; the rest of the memory reserved for KV Cache is 48.35GiB.
INFO 02-04 13:35:12 [executor_base.py:112] # rocm blocks: 13203, # CPU blocks: 1092
INFO 02-04 13:35:12 [executor_base.py:117] Maximum concurrency for 3281 tokens per request: 257.54x
INFO 02-04 13:35:14 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:03<00:00, 2.00it/s]
INFO 02-04 13:35:17 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-04 13:35:17 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 18.56 seconds
[SUCCESS] Model loading complete
[INFO] Thread pool configuration:
- CPU thread pool: 8 threads
- GPU thread pool: 1 thread
[INFO] Service started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [9384]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
Some kwargs in processor config are unused and will not have any effect: add_special_token, downsample_ratio, candidate_resolutions, mask_prompt, normalize, pad_token, image_std, image_token, sft_format, image_mean, patch_size, ignore_id.
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr
INFO 02-04 13:38:10 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCRForCausalLM']}
INFO 02-04 13:38:10 [config.py:721] This model supports multiple tasks: {'reward', 'generate', 'classify', 'score', 'embed'}. Defaulting to 'generate'.
INFO 02-04 13:38:14 [loader.py:460] Loading weights took 1.97 seconds
INFO 02-04 13:38:14 [model_runner.py:1165] Model loading took 6.2319 GiB and 3.406699 seconds
Some kwargs in processor config are unused and will not have any effect: downsample_ratio, mask_prompt, add_special_token, ignore_id, normalize, image_std, image_token, patch_size, image_mean, sft_format, pad_token, candidate_resolutions.
/home/lst/DeepSeek-OCR-vllm/deepencoder/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 13:38:27 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 13:38:28 [worker.py:287] Memory profiling takes 13.04 seconds
INFO 02-04 13:38:28 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-04 13:38:28 [worker.py:287] model weights take 6.23GiB; non_torch_memory takes 1.55GiB; PyTorch activation peak memory takes 1.45GiB; the rest of the memory reserved for KV Cache is 48.35GiB.
INFO 02-04 13:38:28 [executor_base.py:112] # rocm blocks: 13203, # CPU blocks: 1092
INFO 02-04 13:38:28 [executor_base.py:117] Maximum concurrency for 3281 tokens per request: 257.54x
INFO 02-04 13:38:30 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:03<00:00, 1.87it/s]
INFO 02-04 13:38:33 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-04 13:38:33 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 18.84 seconds
[SUCCESS] Model loading complete
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Server started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [11311]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
on_event is deprecated, use lifespan event handlers instead.
Read more about it in the
[FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
@app.on_event("shutdown")
[INFO] Loading model: /home/lst/deepseek_ocr
INFO 02-04 15:40:57 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCRForCausalLM']}
INFO 02-04 15:40:57 [config.py:721] This model supports multiple tasks: {'generate', 'classify', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
INFO 02-04 15:41:02 [loader.py:460] Loading weights took 2.10 seconds
INFO 02-04 15:41:02 [model_runner.py:1165] Model loading took 6.2319 GiB and 3.532530 seconds
Some kwargs in processor config are unused and will not have any effect: mask_prompt, sft_format, normalize, add_special_token, image_mean, candidate_resolutions, image_token, ignore_id, pad_token, patch_size, image_std, downsample_ratio.
/home/lst/DeepSeek-OCR-vllm/deepencoder/sam_vary_sdpa.py:310: UserWarning: Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 15:41:16 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 15:41:17 [worker.py:287] Memory profiling takes 14.39 seconds
INFO 02-04 15:41:17 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-04 15:41:17 [worker.py:287] model weights take 6.23GiB; non_torch_memory takes 1.55GiB; PyTorch activation peak memory takes 1.60GiB; the rest of the memory reserved for KV Cache is 48.20GiB.
INFO 02-04 15:41:17 [executor_base.py:112] # rocm blocks: 13162, # CPU blocks: 1092
INFO 02-04 15:41:17 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 102.83x
INFO 02-04 15:41:19 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:02<00:00, 2.03it/s]
INFO 02-04 15:41:22 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
INFO 02-04 15:41:22 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 19.95 seconds
[SUCCESS] Model loading complete
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Server started: http://0.0.0.0:8707
[INFO] API docs: http://0.0.0.0:8707/docs
INFO: Started server process [41027]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8707 (Press CTRL+C to quit)
[INFO] Loading model: /home/lst/deepseek_ocr
INFO 02-04 17:02:48 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCRForCausalLM']}
INFO 02-04 17:02:48 [config.py:721] This model supports multiple tasks: {'embed', 'classify', 'score', 'generate', 'reward'}. Defaulting to 'generate'.
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 79, in __torch_function__
[rank0]: return func(*args, **kwargs)
[rank0]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 280.00 MiB. GPU 0 has a total capacity of 63.98 GiB of which 0 bytes is free. Of the allocated memory 3.30 GiB is allocated by PyTorch, and 15.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
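The HIP OOM above names its own mitigation. A sketch of applying it before restarting; the environment variable comes from the error message itself, while the server flag name is an assumption about this deployment:

```shell
# Mitigation suggested by the OOM message: let the allocator grow segments
# instead of fragmenting fixed-size ones.
export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True

# Optionally also lower the GPU memory fraction before restarting
# (flag support in deepseek_ocr_server.py is assumed, not confirmed by the log):
# python deepseek_ocr_server.py --gpu-memory-utilization 0.8
```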
[INFO] Loading model: /home/lst/deepseek_ocr
INFO 02-04 17:19:43 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCRForCausalLM']}
INFO 02-04 17:19:43 [config.py:721] This model supports multiple tasks: {'generate', 'score', 'reward', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 02-04 17:19:47 [loader.py:460] Loading weights took 2.11 seconds
INFO 02-04 17:19:47 [model_runner.py:1165] Model loading took 6.2319 GiB and 3.538627 seconds
Some kwargs in processor config are unused and will not have any effect: add_special_token, pad_token, downsample_ratio, ignore_id, mask_prompt, image_std, image_mean, normalize, patch_size, image_token, sft_format, candidate_resolutions.
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/lst/DeepSeek-OCR-vllm/deepseek_ocr_server.py", line 509, in <module>
[rank0]: main()
[rank0]: File "/home/lst/DeepSeek-OCR-vllm/deepseek_ocr_server.py", line 500, in main
[INFO] Loading model: /home/lst/deepseek_ocr
INFO 02-04 17:20:32 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCRForCausalLM']}
INFO 02-04 17:20:32 [config.py:721] This model supports multiple tasks: {'reward', 'embed', 'score', 'classify', 'generate'}. Defaulting to 'generate'.
INFO 02-04 17:20:36 [loader.py:460] Loading weights took 2.12 seconds
INFO 02-04 17:20:37 [model_runner.py:1165] Model loading took 6.2319 GiB and 3.572350 seconds
Some kwargs in processor config are unused and will not have any effect: normalize, sft_format, image_token, mask_prompt, ignore_id, downsample_ratio, patch_size, candidate_resolutions, image_mean, pad_token, add_special_token, image_std.
/home/lst/DeepSeek-OCR-vllm/deepencoder/sam_vary_sdpa.py:310: UserWarning: Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
WARNING 02-04 17:20:51 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
INFO 02-04 17:20:51 [worker.py:287] Memory profiling takes 14.59 seconds
INFO 02-04 17:20:51 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 02-04 17:20:51 [worker.py:287] model weights take 6.23GiB; non_torch_memory takes 1.55GiB; PyTorch activation peak memory takes 1.60GiB; the rest of the memory reserved for KV Cache is 48.20GiB.
INFO 02-04 17:20:52 [executor_base.py:112] # rocm blocks: 13162, # CPU blocks: 1092
INFO 02-04 17:20:52 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 102.83x
INFO 02-04 17:20:54 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:03<00:00, 1.70it/s]
INFO 02-04 17:20:57 [model_runner.py:1752] Graph capturing finished in 4 secs, took 0.12 GiB
INFO 02-04 17:20:57 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 20.50 seconds
[SUCCESS] Model loading complete
[INFO] Thread pool configuration:
  - CPU thread pool: 2 threads
  - GPU thread pool: 1 thread
[INFO] Server started: http://0.0.0.0:8708
[INFO] API docs: http://0.0.0.0:8708/docs
INFO: Started server process [45841]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8708 (Press CTRL+C to quit)
assert len(tokenized_str) == len(images_seq_mask), f"tokenize_with_images func: tokenized_str's length {len(tokenized_str)} is not equal to imags_seq_mask's length {len(images_seq_mask)}"