nohup: ignoring input
INFO 12-15 14:08:06 [__init__.py:245] Automatically detected platform rocm.
INFO 12-15 14:08:09 [api_server.py:1395] vLLM API server version 0.9.2
INFO 12-15 14:08:09 [cli_args.py:325] non-default args: {'model': '../OctoMed/OctoMed-7B/', 'trust_remote_code': True, 'dtype': 'bfloat16', 'max_model_len': 32768, 'max_seq_len_to_capture': 32768}
INFO 12-15 14:08:16 [config.py:850] This model supports multiple tasks: {'reward', 'classify', 'generate', 'embed'}. Defaulting to 'generate'.
`torch_dtype` is deprecated! Use `dtype` instead!
INFO 12-15 14:08:16 [config.py:1488] Using max model len 32768
INFO 12-15 14:08:16 [config.py:2301] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 12-15 14:08:20 [__init__.py:245] Automatically detected platform rocm.
INFO 12-15 14:08:22 [core.py:529] Waiting for init message from front-end.
INFO 12-15 14:08:22 [core.py:71] Initializing a V1 LLM engine (v0.9.2) with config: model='../OctoMed/OctoMed-7B/', speculative_config=None, tokenizer='../OctoMed/OctoMed-7B/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=../OctoMed/OctoMed-7B/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 12-15 14:08:22 [worker_base.py:42] VLLM_RANK0_NUMA is unset or set incorrectly, vllm will not bind to numa!
VLLM_RANK0_NUMA = -1
INFO 12-15 14:08:22 [worker_base.py:654] ########## 488 process(rank0) is running on CPU(s): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175}
INFO 12-15 14:08:22 [worker_base.py:655] ########## 488 process(rank0) is running on memnode(s): {0, 1}
INFO 12-15 14:08:32 [parallel_state.py:1077] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 12-15 14:08:33 [gpu_model_runner.py:1819] Starting to load model ../OctoMed/OctoMed-7B/...
INFO 12-15 14:08:33 [gpu_model_runner.py:1824] Loading model from scratch...
INFO 12-15 14:08:33 [rocm.py:288] Using Flash Attention backend on V1 engine. (only supports block size 64)
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
[...]
Received request chatcmpl-46c2e53655184f288c755e4c36c0d5a6: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nDescribe this image in one sentence.<|vision_start|><|image_pad|><|vision_end|><|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.01, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32739, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 12-15 14:21:47 [async_llm.py:270] Added request chatcmpl-46c2e53655184f288c755e4c36c0d5a6.
INFO 12-15 14:21:54 [loggers.py:118] Engine 000: Avg prompt throughput: 37.3 tokens/s, Avg generation throughput: 21.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
INFO:     127.0.0.1:39298 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 12-15 14:22:04 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 14.9 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 12-15 14:22:14 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
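
The startup lines above pin down a small, reproducible engine configuration: the only non-default args are {'model': '../OctoMed/OctoMed-7B/', 'trust_remote_code': True, 'dtype': 'bfloat16', 'max_model_len': 32768, 'max_seq_len_to_capture': 32768}. The exact launch command is not in the log, but as a minimal sketch the same configuration can be recreated with vLLM's offline Python API (assuming vLLM 0.9.2 and the same relative checkpoint path):

    from vllm import LLM, SamplingParams

    # Recreate the engine configuration from the log above (vLLM 0.9.2).
    llm = LLM(
        model="../OctoMed/OctoMed-7B/",
        trust_remote_code=True,        # checkpoint ships custom model code
        dtype="bfloat16",
        max_model_len=32768,           # matches "Using max model len 32768"
        max_seq_len_to_capture=32768,  # capture graphs up to the full context length
    )

    # Text-only smoke test. The prompt is illustrative only and bypasses the
    # chat template that the OpenAI server applied in the logged request.
    out = llm.generate(
        "Describe the typical symptoms of influenza in one sentence.",
        SamplingParams(temperature=0.01, repetition_penalty=1.05, max_tokens=128),
    )
    print(out[0].outputs[0].text)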
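
The "POST /v1/chat/completions HTTP/1.1" 200 OK entry is an ordinary OpenAI-compatible chat completion carrying one image, as the <|vision_start|><|image_pad|><|vision_end|> tokens in the logged prompt show. A sketch of a matching client call follows, assuming the server listens on the default port 8000; the image URL is a placeholder, since the actual image in the logged request is not recoverable from the log:

    from openai import OpenAI

    # Point the standard OpenAI client at the local vLLM server.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    resp = client.chat.completions.create(
        model="../OctoMed/OctoMed-7B/",  # served_model_name from the engine config
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                # Placeholder URL; any image reachable by the server works.
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.png"}},
            ]},
        ],
        temperature=0.01,                         # matches the logged SamplingParams
        extra_body={"repetition_penalty": 1.05},  # vLLM-specific sampling extension
    )
    print(resp.choices[0].message.content)

Sending this against the running server should produce another "Received request"/"Added request" pair and a 200 OK access-log line like the ones above.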