Unverified Commit 1bd007f2 authored by co63oc, committed by GitHub

fix some typos (#24071)


Signed-off-by: co63oc <co63oc@users.noreply.github.com>
parent 136d853e
@@ -57,7 +57,7 @@ def invoke_main() -> None:
 "--num-iteration",
 type=int,
 default=1000,
-help="Number of iterations to run to stablize final data readings",
+help="Number of iterations to run to stabilize final data readings",
 )
 parser.add_argument(
 "--allocate-blocks",
......
@@ -77,7 +77,7 @@ def invoke_main() -> None:
 "--num-iteration",
 type=int,
 default=100,
-help="Number of iterations to run to stablize final data readings",
+help="Number of iterations to run to stabilize final data readings",
 )
 parser.add_argument(
 "--num-req", type=int, default=128, help="Number of requests in the batch"
......
@@ -181,7 +181,7 @@ struct W4A8GemmKernel {
 auto A_ptr = static_cast<MmaType const*>(A.const_data_ptr());
 auto B_ptr = static_cast<QuantType const*>(B.const_data_ptr());
 auto D_ptr = static_cast<ElementD*>(D.data_ptr());
-// can we avoid harcode the 8 here
+// can we avoid hardcode the 8 here
 auto S_ptr =
 static_cast<cutlass::Array<ElementScale, ScalePackSize> const*>(
 group_scales.const_data_ptr());
......
@@ -210,7 +210,7 @@ vllm serve Qwen/Qwen2.5-VL-3B-Instruct --api-server-count 4 -dp 2
 !!! note
 API server scale-out disables [multi-modal IPC caching](#ipc-caching)
-because it requires a one-to-one correspondance between API and engine core processes.
+because it requires a one-to-one correspondence between API and engine core processes.
 This does not impact [multi-modal processor caching](#processor-caching).
@@ -227,7 +227,7 @@ to avoid repeatedly processing the same multi-modal inputs in `BaseMultiModalPro
 ### IPC Caching
 Multi-modal IPC caching is automatically enabled when
-there is a one-to-one correspondance between API (`P0`) and engine core (`P1`) processes,
+there is a one-to-one correspondence between API (`P0`) and engine core (`P1`) processes,
 to avoid repeatedly transferring the same multi-modal inputs between them.
 ### Configuration
......
@@ -2,7 +2,7 @@
 IO Processor plugins are a feature that allows pre and post processing of the model input and output for pooling models. The idea is that users are allowed to pass a custom input to vLLM that is converted into one or more model prompts and fed to the model `encode` method. One potential use-case of such plugins is that of using vLLM for generating multi-modal data. Say users feed an image to vLLM and get an image in output.
-When performing an inference with IO Processor plugins, the prompt type is defined by the plugin and the same is valid for the final request output. vLLM does not perform any validation of input/output data, and it is up to the plugin to ensure the correct data is being fed to the model and returned to the user. As of now these plugins support only pooling models and can be triggerd via the `encode` method in `LLM` and `AsyncLLM`, or in online serving mode via the `/pooling` endpoint.
+When performing an inference with IO Processor plugins, the prompt type is defined by the plugin and the same is valid for the final request output. vLLM does not perform any validation of input/output data, and it is up to the plugin to ensure the correct data is being fed to the model and returned to the user. As of now these plugins support only pooling models and can be triggered via the `encode` method in `LLM` and `AsyncLLM`, or in online serving mode via the `/pooling` endpoint.
 ## Writing an IO Processor Plugin
......
@@ -12,7 +12,7 @@ from vllm.pooling_params import PoolingParams
 # multimodal data. In this specific case this example will take a geotiff
 # image as input, process it using the multimodal data processor, and
 # perform inference.
-# Reuirement - install plugin at:
+# Requirement - install plugin at:
 # https://github.com/christian-pinto/prithvi_io_processor_plugin
......
@@ -10,7 +10,7 @@ import requests
 # multimodal data. In this specific case this example will take a geotiff
 # image as input, process it using the multimodal data processor, and
 # perform inference.
-# Reuirements :
+# Requirements :
 # - install plugin at:
 # https://github.com/christian-pinto/prithvi_io_processor_plugin
 # - start vllm in serving mode with the below args
......
@@ -134,7 +134,7 @@ class SimpleModelWithTwoGraphs(ParentModel):
 # Test will fail without set_model_tag here with error:
 # "ValueError: too many values to unpack (expected 3)"
 # This is because CompiledAttention and CompiledAttentionTwo
-# have different implmentations but the same torch.compile
+# have different implementations but the same torch.compile
 # cache dir will be used as default prefix is 'model_tag'
 with set_model_tag("attn_one"):
 self.attn_one = CompiledAttention(
......
@@ -224,7 +224,7 @@ def tg_mxfp4_moe(
 assert (w2_bias.dim() == 2 and w2_bias.shape[0] == num_experts
 and w2_bias.shape[1] == hidden_size)
-# Swap w1 and w3 as the defenition of
+# Swap w1 and w3 as the definition of
 # swiglu is different in the trtllm-gen
 w13_weight_scale_ = w13_weight_scale.clone()
 w13_weight_ = w13_weight.clone()
......
@@ -52,7 +52,7 @@ def test_profiling(model_id: str, max_model_len: int):
 chunks_per_image = prod(mm_data["patches_per_image"])
 total_num_patches = chunks_per_image * tokens_per_patch
 num_tiles = mm_data["aspect_ratios"][0][0] * mm_data["aspect_ratios"][0][
-1] # x-y seperator tokens
+1] # x-y separator tokens
 total_tokens = total_num_patches.item() + num_tiles.item(
 ) + 3 # image start, image, image end
......
@@ -27,7 +27,7 @@ def use_v0_only(monkeypatch):
 reason="ModelOpt FP8 is not supported on this GPU type.")
 def test_modelopt_fp8_checkpoint_setup(vllm_runner):
 """Test ModelOpt FP8 checkpoint loading and structure validation."""
-# TODO: provide a small publically available test checkpoint
+# TODO: provide a small publicly available test checkpoint
 model_path = ("/home/scratch.omniml_data_1/zhiyu/ckpts/test_ckpts/"
 "TinyLlama-1.1B-Chat-v1.0-fp8-0710")
......
@@ -82,7 +82,7 @@ def test_beam_search_with_concurrency_limit(
 beam_width: int,
 ) -> None:
 # example_prompts[1]&[3]&[7] fails due to unknown reason even without
-# concurency limit. skip them for now.
+# concurrency limit. skip them for now.
 example_prompts = (example_prompts[:8])
 concurrency_limit = 2
 assert len(example_prompts) > concurrency_limit
......
@@ -160,7 +160,7 @@ def test_local_attention_virtual_batches(test_data: LocalAttentionTestData):
 # Use torch.arange instead of torch.randint so we can assert on
 # block table tensor values. The block table will have shape
 # (num_batches, cdiv(max_seq_len, block_size)) and the values will be
-# aranged from 0 to cdiv(max_seq_len, block_size)-1
+# arranged from 0 to cdiv(max_seq_len, block_size)-1
 arange_block_indices=True,
 )
......
@@ -33,7 +33,7 @@ def _check_path_len(path):
 def _list_path(path):
-"""Return the list of foldername (hashes generatd) under the path"""
+"""Return the list of foldername (hashes generated) under the path"""
 return list(path.iterdir())
@@ -41,7 +41,7 @@ def run_test(tmp_path, processor, llm: LLM, question: str,
 image_urls: list[Image], expected_len: int, info: str):
 """
 One individual test to process the prompt and output base on 1 set of input
-Then check if the length in the strorage path matches the expected length
+Then check if the length in the storage path matches the expected length
 `info` introduces details or purpose of the individual test
 """
 print(f"***info: {info}***")
@@ -115,7 +115,7 @@ def test_shared_storage_connector_hashes(tmp_path):
 """
 Tests that SharedStorageConnector saves KV to the storage locations
 with proper hashes; that are unique for inputs with identical text but
-differnt images (same size), or same multiple images but different orders.
+different images (same size), or same multiple images but different orders.
 """
 # Using tmp_path as the storage path to store KV
 print(f"KV storage path at: {str(tmp_path)}")
@@ -171,12 +171,12 @@ def test_shared_storage_connector_hashes(tmp_path):
 img=[image_1],
 expected_len=2,
 info=("image_1 single input the 2nd time. "
-"It should not form aother new hash.")),
+"It should not form another new hash.")),
 InputCase(text=TEXT_PROMPTS[0],
 img=[image_2],
 expected_len=2,
 info=("image_2 single input the 2nd time. "
-"It should not form aother new hash.")),
+"It should not form another new hash.")),
 InputCase(text=TEXT_PROMPTS[0],
 img=[image_1, image_2],
 expected_len=3,
@@ -189,12 +189,12 @@ def test_shared_storage_connector_hashes(tmp_path):
 img=[image_1, image_2],
 expected_len=4,
 info=("[image_1, image_2] input the 2nd time. "
-"It should not form aother new hash.")),
+"It should not form another new hash.")),
 InputCase(text=TEXT_PROMPTS[0],
 img=[image_2, image_1],
 expected_len=4,
 info=("[image_2, image_1] input the 2nd time. "
-"It should not form aother new hash.")),
+"It should not form another new hash.")),
 InputCase(text=TEXT_PROMPTS[0],
 img=[],
 expected_len=5,
......
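Reviewer note: the property this test checks (identical text with different same-size images, or the same images in a different order, must yield distinct hashes, while a repeated input must not create a new one) can be illustrated with a small standalone sketch. The `request_key` helper and the byte-string "images" below are hypothetical and are not the hashing scheme used by `SharedStorageConnector`; they only show why an order-sensitive hash over text plus images gives the unique-entry counts the test cases expect.

```python
import hashlib


def request_key(text: str, images: list[bytes]) -> str:
    """Hypothetical key: hash the prompt text, then each image in order."""
    h = hashlib.sha256()
    for part in [text.encode("utf-8"), *images]:
        # Length-prefix every part so boundaries between parts are unambiguous.
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()


image_1 = b"\x00" * 16
image_2 = b"\x01" * 16  # same size as image_1, different content
prompt = "What is in the image?"

# Same sequence of inputs as the test cases: singles, repeats, both orders, no image.
inputs = [
    [image_1], [image_2], [image_1], [image_2],
    [image_1, image_2], [image_2, image_1],
    [image_1, image_2], [image_2, image_1],
    [],
]

seen: set[str] = set()
unique_counts = []
for imgs in inputs:
    seen.add(request_key(prompt, imgs))
    unique_counts.append(len(seen))

print(unique_counts)
```

Repeated inputs leave the count unchanged, and swapping the image order adds a new entry, which mirrors the `expected_len` progression in the hunks above.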
@@ -81,7 +81,7 @@ def _run_test(kwargs: dict, logitproc_loaded: bool) -> None:
 target_token = params.extra_args[DUMMY_LOGITPROC_ARG]
 if not all(x == target_token for x in lp_toks):
 raise AssertionError(
-f"Request {bdx} generated {lp_toks}, shoud all be "
+f"Request {bdx} generated {lp_toks}, should all be "
 f"{target_token}")
 else:
 # This request does not exercise custom logitproc (or custom
......
@@ -189,7 +189,7 @@ async def get_request(
 # NOTE: If we simply accumulate the random delta values
 # from the gamma distribution, their sum would have 1-2% gap
 # from target_total_delay_s. The purpose of the following logic is to
-# close the gap for stabilizing the throughput data
+# close the gap for stabilizing the throughput data
 # from different random seeds.
 target_total_delay_s = total_requests / request_rate
 normalize_factor = target_total_delay_s / delay_ts[-1]
......
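Reviewer note: the normalization described in the comment above can be sketched in isolation. This is a minimal sketch, not the benchmark's actual code: the function name `normalized_delays`, the `burstiness` parameter, and the gamma parameterization (shape `burstiness`, scale `1 / (request_rate * burstiness)`) are assumptions made for illustration. Only `target_total_delay_s`, `normalize_factor`, and `delay_ts` come from the hunk.

```python
import random
from itertools import accumulate


def normalized_delays(total_requests: int, request_rate: float,
                      burstiness: float = 1.0, seed: int = 0) -> list[float]:
    """Cumulative arrival times rescaled so the last one hits the target exactly."""
    rng = random.Random(seed)
    # Assumed parameterization: mean inter-arrival time is 1 / request_rate.
    theta = 1.0 / (request_rate * burstiness)
    deltas = [rng.gammavariate(burstiness, theta) for _ in range(total_requests)]

    # Raw cumulative delays; their final value drifts from the target by chance.
    delay_ts = list(accumulate(deltas))

    # Rescale every timestamp by the same factor to close that gap.
    target_total_delay_s = total_requests / request_rate
    normalize_factor = target_total_delay_s / delay_ts[-1]
    return [t * normalize_factor for t in delay_ts]


delays = normalized_delays(total_requests=100, request_rate=10.0, seed=1)
print(f"last arrival: {delays[-1]:.6f} s")
```

Because every timestamp is multiplied by the same positive factor, the relative spacing of requests is preserved while the total duration no longer depends on the random seed.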
@@ -234,7 +234,7 @@ class CompilationConfig:
 - FULL_AND_PIECEWISE.
 PIECEWISE mode build piecewise cudagraph only, keeping the cudagraph
-incompatiable ops (i.e. some attention ops) outside the cudagraph
+incompatible ops (i.e. some attention ops) outside the cudagraph
 for general flexibility.
 This is the default mode.
......
@@ -87,7 +87,7 @@ class ParallelConfig:
 data_parallel_external_lb: bool = False
 """Whether to use "external" DP LB mode. Applies only to online serving
 and when data_parallel_size > 0. This is useful for a "one-pod-per-rank"
-wide-EP setup in Kuberentes. Set implicitly when --data-parallel-rank
+wide-EP setup in Kubernetes. Set implicitly when --data-parallel-rank
 is provided explicitly to vllm serve."""
 data_parallel_hybrid_lb: bool = False
 """Whether to use "hybrid" DP LB mode. Applies only to online serving
......
@@ -787,7 +787,7 @@ class NixlConnectorWorker:
 self.src_xfer_side_handle = self.nixl_wrapper.prep_xfer_dlist(
 "NIXL_INIT_AGENT", descs)
-# TODO(mgoin): Hybrid memory allocator is currently diabled for
+# TODO(mgoin): Hybrid memory allocator is currently disabled for
 # models with local attention (Llama 4). Can remove this once enabled.
 if self.vllm_config.model_config.hf_config.model_type == "llama4":
 from transformers import Llama4TextConfig
......
@@ -717,7 +717,7 @@ class OpenAIServingResponses(OpenAIServing):
 prev_msgs.append(msg)
 messages.extend(prev_msgs)
 # Append the new input.
-# Reponses API supports simple text inputs without chat format.
+# Responses API supports simple text inputs without chat format.
 if isinstance(request.input, str):
 messages.append(get_user_message(request.input))
 else:
......