added DeepSeek OCR API by liushengtong

Commit 6ad287f7 authored Feb 28, 2026 by liuxu3
20 changed files
--- a/DeepSeek-OCR2-vllm/deepencoderv2/__pycache__/qwen2_d2e.cpython-310.pyc
+++ b/DeepSeek-OCR2-vllm/deepencoderv2/__pycache__/qwen2_d2e.cpython-310.pyc
--- a/DeepSeek-OCR2-vllm/deepencoderv2/__pycache__/qwen2_d2e.cpython-312.pyc
+++ b/DeepSeek-OCR2-vllm/deepencoderv2/__pycache__/qwen2_d2e.cpython-312.pyc
--- a/DeepSeek-OCR2-vllm/deepencoderv2/__pycache__/sam_vary_sdpa.cpython-310.pyc
+++ b/DeepSeek-OCR2-vllm/deepencoderv2/__pycache__/sam_vary_sdpa.cpython-310.pyc
--- a/DeepSeek-OCR2-vllm/deepencoderv2/__pycache__/sam_vary_sdpa.cpython-312.pyc
+++ b/DeepSeek-OCR2-vllm/deepencoderv2/__pycache__/sam_vary_sdpa.cpython-312.pyc
--- a/DeepSeek-OCR2-vllm/deepencoderv2/build_linear.py
+++ b/DeepSeek-OCR2-vllm/deepencoderv2/build_linear.py
--- a/DeepSeek-OCR2-vllm/deepencoderv2/qwen2_d2e.py
+++ b/DeepSeek-OCR2-vllm/deepencoderv2/qwen2_d2e.py
--- a/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py
+++ b/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py
--- a/DeepSeek-OCR2-vllm/deepseek_ocr2.py
+++ b/DeepSeek-OCR2-vllm/deepseek_ocr2.py
--- a/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py
+++ b/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py
--- a/DeepSeek-OCR2-vllm/doc/DeepSeek_OCR_paper_layouts.md
+++ b/DeepSeek-OCR2-vllm/doc/DeepSeek_OCR_paper_layouts.md
--- a/DeepSeek-OCR2-vllm/doc/DeepSeek_OCR_paper_layouts.pdf
+++ b/DeepSeek-OCR2-vllm/doc/DeepSeek_OCR_paper_layouts.pdf
--- a/DeepSeek-OCR2-vllm/doc/model.png
+++ b/DeepSeek-OCR2-vllm/doc/model.png
--- a/DeepSeek-OCR2-vllm/doc/result_with_boxes_hf.jpg
+++ b/DeepSeek-OCR2-vllm/doc/result_with_boxes_hf.jpg
--- a/DeepSeek-OCR2-vllm/doc/result_with_boxes_vllm.jpg
+++ b/DeepSeek-OCR2-vllm/doc/result_with_boxes_vllm.jpg
--- a/DeepSeek-OCR2-vllm/doc/test.png
+++ b/DeepSeek-OCR2-vllm/doc/test.png
--- a/DeepSeek-OCR2-vllm/logs/deepseek_ocr2_server_8000_20260227_115143.log
+++ b/DeepSeek-OCR2-vllm/logs/deepseek_ocr2_server_8000_20260227_115143.log
+INFO 02-27 11:51:47 [__init__.py:240] Automatically detected platform rocm.
+/home/lst/DeepSeek-OCR2-vllm/deepseek_ocr2_server.py:476: DeprecationWarning: 
+        on_event is deprecated, use lifespan event handlers instead.
+
+        Read more about it in the
+        [FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
+        
+  @app.on_event("shutdown")
+[INFO] Loading model: /home/lst/deepseek_ocr2
+INFO 02-27 11:51:53 [config.py:460] Overriding HF config with {'architectures': ['DeepseekOCR2ForCausalLM']}
+INFO 02-27 11:51:53 [config.py:721] This model supports multiple tasks: {'score', 'reward', 'classify', 'generate', 'embed'}. Defaulting to 'generate'.
+INFO 02-27 11:51:53 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/home/lst/deepseek_ocr2', speculative_config=None, tokenizer='/home/lst/deepseek_ocr2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/lst/deepseek_ocr2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[24,16,8,4,2,1],"max_capture_size":24}, use_cached_outputs=False, 
+INFO 02-27 11:51:53 [rocm.py:226] None is not supported in AMD GPUs.
+INFO 02-27 11:51:53 [rocm.py:227] Using ROCmFlashAttention backend.
+WARNING 02-27 11:51:53 [worker_base.py:41] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa! VLLM_RANK0_NUMA = -1
+INFO 02-27 11:51:53 [worker_base.py:653] ########## 4675 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63}
+INFO 02-27 11:51:53 [worker_base.py:654] ########## 4675 process(rank0) is running on memnode(s): {0, 1, 2, 3, 4, 5, 6, 7}
+WARNING: Logging before InitGoogleLogging() is written to STDERR
+I0227 11:51:54.129271  4675 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
+I0227 11:51:54.129348  4675 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
+I0227 11:51:54.129817  4675 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x563cdcfcebb0, SPLIT_COLOR: 3389850942126204093, PG Name: 1
+I0227 11:51:54.129829  4675 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
+I0227 11:51:54.150034  4675 ProcessGroupNCCL.cpp:881] [PG 3 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x563cdcfcebb0, SPLIT_COLOR: 3389850942126204093, PG Name: 3
+I0227 11:51:54.150072  4675 ProcessGroupNCCL.cpp:890] [PG 3 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
+I0227 11:51:54.151399  4675 ProcessGroupNCCL.cpp:881] [PG 5 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x563cdcfcebb0, SPLIT_COLOR: 3389850942126204093, PG Name: 5
+I0227 11:51:54.151417  4675 ProcessGroupNCCL.cpp:890] [PG 5 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
+I0227 11:51:54.152455  4675 ProcessGroupNCCL.cpp:881] [PG 7 Rank 0] ProcessGroupNCCL initialization options: size: 1, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x563cdcfcebb0, SPLIT_COLOR: 3389850942126204093, PG Name: 7
+I0227 11:51:54.152474  4675 ProcessGroupNCCL.cpp:890] [PG 7 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 3, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
+INFO 02-27 11:51:54 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
+INFO 02-27 11:51:54 [model_runner.py:1133] Starting to load model /home/lst/deepseek_ocr2...
+Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
+INFO 02-27 11:51:55 [config.py:3627] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24] is overridden by config [1, 2, 4, 8, 16, 24]
+Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
+Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.78it/s]
+INFO 02-27 11:51:58 [loader.py:460] Loading weights took 2.26 seconds
+INFO 02-27 11:51:58 [model_runner.py:1165] Model loading took 6.3336 GiB and 3.979427 seconds
+Some kwargs in processor config are unused and will not have any effect: add_special_token, ignore_id, candidate_resolutions, image_mean, image_token, sft_format, normalize, mask_prompt, image_std, downsample_ratio, patch_size, pad_token. 
+/home/lst/DeepSeek-OCR2-vllm/deepencoderv2/sam_vary_sdpa.py:310: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
+  x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
+WARNING 02-27 11:52:11 [fused_moe.py:882] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=K100_AI.json
+INFO 02-27 11:52:12 [worker.py:287] Memory profiling takes 13.74 seconds
+INFO 02-27 11:52:12 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
+INFO 02-27 11:52:12 [worker.py:287] model weights take 6.33GiB; non_torch_memory takes 1.58GiB; PyTorch activation peak memory takes 2.00GiB; the rest of the memory reserved for KV Cache is 47.67GiB.
+INFO 02-27 11:52:12 [executor_base.py:112] # rocm blocks: 13017, # CPU blocks: 0
+INFO 02-27 11:52:12 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 101.70x
+INFO 02-27 11:52:12 [model_runner.py:1523] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
+Capturing CUDA graph shapes:   0%|          | 0/6 [00:00<?, ?it/s]
+Capturing CUDA graph shapes:  17%|█▋        | 1/6 [00:00<00:02,  1.84it/s]
+Capturing CUDA graph shapes:  33%|███▎      | 2/6 [00:01<00:02,  1.95it/s]
+Capturing CUDA graph shapes:  50%|█████     | 3/6 [00:01<00:01,  1.99it/s]
+Capturing CUDA graph shapes:  67%|██████▋   | 4/6 [00:02<00:01,  1.97it/s]
+Capturing CUDA graph shapes:  83%|████████▎ | 5/6 [00:02<00:00,  1.97it/s]
+Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:03<00:00,  1.97it/s]
+Capturing CUDA graph shapes: 100%|██████████| 6/6 [00:03<00:00,  1.96it/s]
+INFO 02-27 11:52:15 [model_runner.py:1752] Graph capturing finished in 3 secs, took 0.12 GiB
+INFO 02-27 11:52:15 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 17.43 seconds
+[SUCCESS] Model loading complete
+
+[INFO] Service started: http://0.0.0.0:8000
+[INFO] API docs: http://0.0.0.0:8000/docs
+
+INFO:     Started server process [4675]
+INFO:     Waiting for application startup.
+INFO:     Application startup complete.
+ERROR:    [Errno 98] error while attempting to bind on address ('0.0.0.0', 8000): address already in use
+INFO:     Waiting for application shutdown.
+INFO:     Application shutdown complete.
+[INFO] Shutting down thread pool...
+[SUCCESS] Thread pool shut down
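
Note on the log above: the model took roughly 20 seconds to load before the server failed with `[Errno 98] ... address already in use` on port 8000. A minimal sketch of a pre-flight check that could run before the model is loaded is shown here. The helper name `port_is_free` is hypothetical and does not appear in `deepseek_ocr2_server.py`. The probe is best-effort only, since another process can still take the port between the check and the real bind.

```python
import errno
import socket


def port_is_free(host: str, port: int) -> bool:
    """Return True if a TCP socket can currently be bound to (host, port).

    Hypothetical helper for illustration; not part of the committed server.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            # Errno 98 on Linux is EADDRINUSE, the error seen in the log.
            if exc.errno == errno.EADDRINUSE:
                return False
            raise  # Any other failure (e.g. permission denied) is surfaced.
        return True


if __name__ == "__main__":
    # Host and port match the values printed in the log above.
    if not port_is_free("0.0.0.0", 8000):
        raise SystemExit("port 8000 is already in use; not loading the model")
    print("port 8000 looks free; safe to start loading the model")
```

Failing fast this way avoids paying the model load and CUDA graph capture cost when the bind is going to fail anyway.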
--- a/DeepSeek-OCR2-vllm/logs/deepseek_ocr2_server_8707_20260204_175309.log
+++ b/DeepSeek-OCR2-vllm/logs/deepseek_ocr2_server_8707_20260204_175309.log
--- a/DeepSeek-OCR2-vllm/logs/deepseek_ocr2_server_8707_20260204_175510.log
+++ b/DeepSeek-OCR2-vllm/logs/deepseek_ocr2_server_8707_20260204_175510.log
--- a/DeepSeek-OCR2-vllm/logs/deepseek_ocr2_server_8707_20260204_182801.log
+++ b/DeepSeek-OCR2-vllm/logs/deepseek_ocr2_server_8707_20260204_182801.log
--- a/DeepSeek-OCR2-vllm/logs/deepseek_ocr2_server_8707_20260204_182953.log
+++ b/DeepSeek-OCR2-vllm/logs/deepseek_ocr2_server_8707_20260204_182953.log
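
Note on the memory figures in `deepseek_ocr2_server_8000_20260227_115143.log`: the KV cache budget and the reported "maximum concurrency" can be reproduced from the numbers the log prints. The sketch below is a cross-check, not vLLM's actual code. The block size of 64 tokens is an assumption inferred from the logged values (13017 blocks, 8192 tokens per request, 101.70x); it is not stated in the log. The inputs are the rounded values from the log, so results can differ from the logged ones in the last digit (the log prints 57.59 GiB for the usable budget).

```python
# Values copied from the worker.py and executor_base.py log lines (GiB).
TOTAL_GPU_MEMORY = 63.98
GPU_MEMORY_UTILIZATION = 0.90
MODEL_WEIGHTS = 6.33
NON_TORCH_MEMORY = 1.58
ACTIVATION_PEAK = 2.00

NUM_GPU_BLOCKS = 13017
MAX_SEQ_LEN = 8192
BLOCK_SIZE = 64  # Assumption: inferred from the log, not printed in it.


def kv_cache_budget_gib() -> float:
    """Memory left for the KV cache after weights and overheads."""
    usable = TOTAL_GPU_MEMORY * GPU_MEMORY_UTILIZATION
    return usable - MODEL_WEIGHTS - NON_TORCH_MEMORY - ACTIVATION_PEAK


def max_concurrency() -> float:
    """How many full-length requests fit in the KV cache at once."""
    return NUM_GPU_BLOCKS * BLOCK_SIZE / MAX_SEQ_LEN


if __name__ == "__main__":
    print(f"KV cache budget: {kv_cache_budget_gib():.2f} GiB")
    print(f"Maximum concurrency: {max_concurrency():.2f}x")
```

Under these assumptions the KV cache holds 13017 × 64 = 833,088 tokens, which is where the roughly 101.7 concurrent 8192-token requests come from.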