Commit 4e8c3f1c, authored by wang.yuqi, committed by GitHub

[Frontend][last/5] Improve pooling entrypoints | clean up. (#39675)


Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
parent 5e5afafa
......@@ -59,6 +59,16 @@ please refer to [IO Processor Plugins](../../design/io_processor_plugins.md).
Within classification tasks, there is a specialized subcategory: Cross-encoder (aka reranker) models. These models
are a subset of classification models that accept two prompts as input and produce a single output (`num_labels` equal to 1).
### Pooling Types
| Pooling Tasks | Granularity | Description |
|----------------|---------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `CLS` pooling | Sequence-wise | For BERT‑like (bidirectional self‑attention) models, CLS pooling is used by default. This means the last_hidden_states corresponding to the first token (the [CLS] token) is taken as the output. |
| `LAST` pooling | Sequence-wise | For GPT‑like (causal self‑attention) models, LAST pooling is used by default. This means the last_hidden_states corresponding to the last token is taken as the output. |
| `MEAN` pooling | Sequence-wise | Many studies have shown that averaging the last_hidden_states over all input tokens performs better on certain downstream tasks. Therefore, more and more models are using MEAN pooling. |
| `ALL` pooling | Token-wise | Outputs the last_hidden_states for all input tokens. |
| `STEP` pooling | Token-wise | Filters and outputs the last_hidden_states corresponding to the token IDs returned by returned_token_ids. |
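
As a rough illustration of the sequence-wise pooling types in the table above, the sketch below applies CLS, LAST, and MEAN pooling to a toy `last_hidden_states` matrix. The helper names (`cls_pool`, `last_pool`, `mean_pool`) and the plain-list representation are assumptions made for illustration; they are not vLLM APIs.

```python
# Illustrative sketch only: these helpers are hypothetical, not vLLM APIs.
# last_hidden_states holds one row per input token, one column per hidden dim.
HiddenStates = list[list[float]]


def cls_pool(last_hidden_states: HiddenStates) -> list[float]:
    """CLS pooling: take the hidden state of the first token."""
    return last_hidden_states[0]


def last_pool(last_hidden_states: HiddenStates) -> list[float]:
    """LAST pooling: take the hidden state of the last token."""
    return last_hidden_states[-1]


def mean_pool(last_hidden_states: HiddenStates) -> list[float]:
    """MEAN pooling: average each hidden dimension over all tokens."""
    num_tokens = len(last_hidden_states)
    return [sum(column) / num_tokens for column in zip(*last_hidden_states)]


# Three tokens with a hidden size of two.
toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
cls_vector = cls_pool(toy)
last_vector = last_pool(toy)
mean_vector = mean_pool(toy)
```

Token-wise pooling (`ALL`) would instead return the whole matrix, one vector per token, rather than reducing it to a single vector.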
### Score Types
Scoring models are designed to compute similarity scores between two input prompts. Three model types are supported
......
......@@ -160,6 +160,8 @@ The following Score API parameters are supported:
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:classify-extra-params"
--8<-- "vllm/entrypoints/pooling/scoring/protocol.py:scoring-common-params"
--8<-- "vllm/entrypoints/pooling/scoring/protocol.py:score-request-params"
```
#### Examples
......@@ -370,6 +372,8 @@ The following rerank api parameters are supported:
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:classify-extra-params"
--8<-- "vllm/entrypoints/pooling/scoring/protocol.py:scoring-common-params"
--8<-- "vllm/entrypoints/pooling/scoring/protocol.py:rerank-request-params"
```
#### Examples
......
......@@ -68,7 +68,7 @@ If your model is not in the above list, we will try to automatically convert the
Forced alignment usage requires `--hf-overrides '{"architectures": ["Qwen3ASRForcedAlignerForTokenClassification"]}'`.
Please refer to [examples/pooling/token_classify/forced_alignment_offline.py](../../../examples/pooling/token_classify/forced_alignment_offline.py).
### As Reward Models
### Reward Models
Token classification models can be used as reward models. For details on reward models, see [Reward Models](reward.md).
......
......@@ -467,28 +467,11 @@ It consists of two endpoints:
- `/tokenize` corresponds to calling `tokenizer.encode()`.
- `/detokenize` corresponds to calling `tokenizer.decode()`.
### Score API
#### Score Template
Some scoring models require a specific prompt format to work correctly. You can specify a custom score template using the `--chat-template` parameter (see [Chat Template](#chat-template)).
Score templates are supported for **cross-encoder** models only. If you are using an **embedding** model for scoring, vLLM does not apply a score template.
Like chat templates, the score template receives a `messages` list. For scoring, each message has a `role` attribute—either `"query"` or `"document"`. For the usual kind of point-wise cross-encoder, you can expect exactly two messages: one query and one document. To access the query and document content, use Jinja's `selectattr` filter:
- **Query**: `{{ (messages | selectattr("role", "eq", "query") | first).content }}`
- **Document**: `{{ (messages | selectattr("role", "eq", "document") | first).content }}`
This approach is more robust than index-based access (`messages[0]`, `messages[1]`) because it selects messages by their semantic role. It also avoids assumptions about message ordering if additional message types are added to `messages` in the future.
Example template file: [examples/pooling/score/template/nemotron-rerank.jinja](../../examples/pooling/score/template/nemotron-rerank.jinja)
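
The role-based lookup described above can be mirrored in plain Python. This sketch is only an analogy for what the Jinja `selectattr` expressions do; `content_for_role` is a hypothetical helper, and the extra `"instruction"` role is invented purely to show why index-based access is fragile.

```python
def content_for_role(messages: list[dict[str, str]], role: str) -> str:
    """Return the content of the first message with the given role."""
    return next(m["content"] for m in messages if m["role"] == role)


messages = [
    # Hypothetical extra message type, not something the source defines.
    {"role": "instruction", "content": "Rank by relevance."},
    {"role": "query", "content": "What is the capital of France?"},
    {"role": "document", "content": "Paris is the capital of France."},
]

query = content_for_role(messages, "query")
document = content_for_role(messages, "document")

# Index-based access (messages[0]) would now return the instruction,
# while the role-based lookup still finds the query.
index_based = messages[0]["content"]
```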
### Generative Scoring API
The `/generative_scoring` endpoint uses a CausalLM model (e.g., Llama, Qwen, Mistral) to compute the probability of specified token IDs appearing as the next token. Each item (document) is concatenated with the query to form a prompt, and the model predicts how likely each label token is as the next token after that prompt. This lets you score items against a query — for example, asking "Is this the capital of France?" and scoring each city by how likely the model is to answer "Yes".
This endpoint is automatically available when the server is started with a generative model (task `"generate"`). It is separate from the pooling-based [Score API](#score-api), which uses cross-encoder, bi-encoder, or late-interaction models.
This endpoint is automatically available when the server is started with a generative model (task `"generate"`). It is separate from the pooling-based [Score API](../models/pooling_models/scoring.md#score-api), which uses cross-encoder, bi-encoder, or late-interaction models.
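
The next-token scoring idea described above can be sketched with toy numbers. This is a minimal illustration, not the endpoint's implementation: the function name, the toy logits, and the token IDs are all made up, and it assumes probabilities come from a softmax over the full vocabulary (the text does not say whether the endpoint renormalizes over the label tokens only).

```python
import math


def next_token_label_probs(
    logits: list[float], label_token_ids: list[int]
) -> dict[int, float]:
    """Softmax over the whole vocabulary, then read off the label tokens."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return {token_id: exps[token_id] / total for token_id in label_token_ids}


# Toy next-token logits for a vocabulary of four tokens (assumed values).
toy_logits = [0.0, 2.0, 1.0, -1.0]
YES_ID, NO_ID = 1, 2  # hypothetical token IDs for "Yes" and "No"

probs = next_token_label_probs(toy_logits, [YES_ID, NO_ID])
all_probs = next_token_label_probs(toy_logits, [0, 1, 2, 3])
```

With these toy logits the "Yes" token receives a higher probability than "No", so the item would score as a better match for the query.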
**Requirements:**
......
......@@ -48,7 +48,7 @@ from vllm.entrypoints.chat_utils import (
ChatTemplateContentFormatOption,
load_chat_template,
)
from vllm.entrypoints.pooling.io_processor_factories import init_pooling_io_processors
from vllm.entrypoints.pooling.factories import init_pooling_io_processors
from vllm.entrypoints.pooling.scoring.io_processor import ScoringIOProcessor
from vllm.entrypoints.pooling.scoring.typing import ScoreInput
from vllm.entrypoints.pooling.typing import OfflineInputsContext, OfflineOutputsContext
......
......@@ -220,6 +220,12 @@ def build_app(
elastic_ep_attach_router(app)
from vllm.entrypoints.openai.generative_scoring.api_router import (
register_generative_scoring_api_router,
)
register_generative_scoring_api_router(app)
if "generate" in supported_tasks or "render" in supported_tasks:
from vllm.entrypoints.serve.render.api_router import (
attach_router as attach_render_router,
......@@ -242,17 +248,10 @@ def build_app(
register_realtime_api_router(app)
if any(task in POOLING_TASKS for task in supported_tasks):
from vllm.entrypoints.pooling import register_pooling_api_routers
from vllm.entrypoints.pooling.factories import register_pooling_api_routers
register_pooling_api_routers(app, supported_tasks, model_config)
if "generate" in supported_tasks:
from vllm.entrypoints.openai.generative_scoring.api_router import (
register_generative_scoring_api_router,
)
register_generative_scoring_api_router(app)
app.root_path = args.root_path
app.add_middleware(
CORSMiddleware,
......@@ -401,6 +400,12 @@ async def init_app_state(
engine_client, state, args, request_logger, supported_tasks
)
from vllm.entrypoints.openai.generative_scoring.api_router import (
init_generative_scoring_state,
)
await init_generative_scoring_state(engine_client, state, args, request_logger)
if "transcription" in supported_tasks:
from vllm.entrypoints.openai.speech_to_text.api_router import (
init_transcription_state,
......@@ -416,17 +421,10 @@ async def init_app_state(
init_realtime_state(engine_client, state, args, request_logger, supported_tasks)
if any(task in POOLING_TASKS for task in supported_tasks):
from vllm.entrypoints.pooling import init_pooling_state
from vllm.entrypoints.pooling.factories import init_pooling_state
init_pooling_state(engine_client, state, args, request_logger, supported_tasks)
if "generate" in supported_tasks:
from vllm.entrypoints.openai.generative_scoring.api_router import (
init_generative_scoring_state,
)
await init_generative_scoring_state(engine_client, state, args, request_logger)
state.enable_server_load_tracking = args.enable_server_load_tracking
state.server_load_metrics = 0
......
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

from typing import TYPE_CHECKING

from vllm.config import ModelConfig
from vllm.tasks import SupportedTask

if TYPE_CHECKING:
    from vllm.entrypoints.sagemaker.api_router import (
        EndpointFn,
        GetHandlerFn,
        RequestType,
    )


def get_generate_invocation_types(
    supported_tasks: tuple["SupportedTask", ...],
    model_config: ModelConfig | None = None,
):
    # NOTE: Items defined earlier take higher priority
    invocation_types: list[tuple[RequestType, tuple[GetHandlerFn, EndpointFn]]] = []

    if "generate" in supported_tasks:
        from vllm.entrypoints.openai.chat_completion.api_router import (
            chat,
            create_chat_completion,
        )
        from vllm.entrypoints.openai.chat_completion.protocol import (
            ChatCompletionRequest,
        )
        from vllm.entrypoints.openai.completion.api_router import (
            completion,
            create_completion,
        )
        from vllm.entrypoints.openai.completion.protocol import CompletionRequest

        invocation_types += [
            (ChatCompletionRequest, (chat, create_chat_completion)),
            (CompletionRequest, (completion, create_completion)),
        ]

    return invocation_types
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from typing import TYPE_CHECKING
from fastapi import FastAPI
from vllm.config import ModelConfig
from vllm.entrypoints.pooling.utils import enable_scoring_api
from vllm.logger import init_logger
if TYPE_CHECKING:
from argparse import Namespace
from starlette.datastructures import State
from vllm.engine.protocol import EngineClient
from vllm.entrypoints.logger import RequestLogger
from vllm.tasks import SupportedTask
else:
RequestLogger = object
SupportedTask = object
logger = init_logger(__name__)
def register_pooling_api_routers(
app: FastAPI,
supported_tasks: tuple["SupportedTask", ...],
model_config: ModelConfig | None = None,
):
if model_config is None:
return
pooling_task = model_config.get_pooling_task(supported_tasks)
if pooling_task is not None:
from vllm.entrypoints.pooling.pooling.api_router import router as pooling_router
app.include_router(pooling_router)
if "classify" in supported_tasks:
from vllm.entrypoints.pooling.classify.api_router import (
router as classify_router,
)
app.include_router(classify_router)
if "embed" in supported_tasks:
from vllm.entrypoints.pooling.embed.api_router import router as embed_router
app.include_router(embed_router)
if enable_scoring_api(supported_tasks, model_config):
from vllm.entrypoints.pooling.scoring.api_router import router as score_router
app.include_router(score_router)
def init_pooling_state(
engine_client: "EngineClient",
state: "State",
args: "Namespace",
request_logger: RequestLogger | None,
supported_tasks: tuple["SupportedTask", ...],
):
from vllm.entrypoints.chat_utils import load_chat_template
from vllm.entrypoints.pooling.classify.serving import ServingClassification
from vllm.entrypoints.pooling.embed.serving import ServingEmbedding
from vllm.entrypoints.pooling.pooling.serving import ServingPooling
from vllm.entrypoints.pooling.scoring.serving import ServingScores
from vllm.tasks import POOLING_TASKS
model_config = engine_client.model_config
resolved_chat_template = load_chat_template(args.chat_template)
state.serving_pooling = (
(
ServingPooling(
engine_client,
state.openai_serving_models,
supported_tasks=supported_tasks,
request_logger=request_logger,
chat_template=resolved_chat_template,
chat_template_content_format=args.chat_template_content_format,
trust_request_chat_template=args.trust_request_chat_template,
)
)
if any(t in supported_tasks for t in POOLING_TASKS)
else None
)
state.serving_embedding = (
ServingEmbedding(
engine_client,
state.openai_serving_models,
request_logger=request_logger,
chat_template=resolved_chat_template,
chat_template_content_format=args.chat_template_content_format,
trust_request_chat_template=args.trust_request_chat_template,
)
if "embed" in supported_tasks
else None
)
state.serving_classification = (
ServingClassification(
engine_client,
state.openai_serving_models,
request_logger=request_logger,
chat_template=resolved_chat_template,
chat_template_content_format=args.chat_template_content_format,
trust_request_chat_template=args.trust_request_chat_template,
)
if "classify" in supported_tasks
else None
)
state.serving_scores = (
ServingScores(
engine_client,
state.openai_serving_models,
request_logger=request_logger,
chat_template=resolved_chat_template,
chat_template_content_format=args.chat_template_content_format,
trust_request_chat_template=args.trust_request_chat_template,
enable_flash_late_interaction=getattr(
args, "enable_flash_late_interaction", True
),
)
if enable_scoring_api(supported_tasks, model_config)
else None
)
......@@ -13,22 +13,29 @@ from vllm.entrypoints.chat_utils import (
ConversationMessage,
)
from vllm.entrypoints.openai.engine.serving import RendererChatRequest, RendererRequest
from vllm.entrypoints.pooling.scoring.typing import ScoringData
from vllm.entrypoints.pooling.typing import (
from vllm.inputs import EngineInput, SingletonPrompt
from vllm.renderers import BaseRenderer, TokenizeParams, merge_kwargs
from vllm.renderers.inputs.preprocess import parse_model_prompt, prompt_to_seq
from vllm.tool_parsers import ToolParser
from vllm.utils.mistral import is_mistral_tokenizer
from ..scoring.typing import ScoringData
from ..typing import (
OfflineInputsContext,
OfflineOutputsContext,
PoolingChatLikeRequest,
PoolingCompletionLikeRequest,
PoolingServeContext,
)
from vllm.inputs import EngineInput, SingletonPrompt
from vllm.renderers import BaseRenderer, TokenizeParams, merge_kwargs
from vllm.renderers.inputs.preprocess import parse_model_prompt, prompt_to_seq
from vllm.tool_parsers import ToolParser
from vllm.utils.mistral import is_mistral_tokenizer
class PoolingIOProcessor:
"""Processor for handling preprocessing & postprocessing ops for pooling requests.
This class manages both online (serving) and offline (batch) processing of pooling
requests, handling chat and completion formats.
"""
name: str
def __init__(
......@@ -58,7 +65,7 @@ class PoolingIOProcessor:
def pre_process_online(self, ctx: PoolingServeContext):
request = ctx.request
if isinstance(ctx.request, PoolingChatLikeRequest):
if isinstance(request, PoolingChatLikeRequest):
self._validate_chat_template(
request_chat_template=request.chat_template,
chat_template_kwargs=request.chat_template_kwargs,
......
......@@ -22,7 +22,6 @@ from vllm.entrypoints.chat_utils import (
from vllm.entrypoints.logger import RequestLogger
from vllm.entrypoints.openai.engine.protocol import ErrorResponse
from vllm.entrypoints.openai.models.serving import OpenAIServingModels
from vllm.entrypoints.pooling.typing import AnyPoolingRequest, PoolingServeContext
from vllm.exceptions import VLLMNotFoundError
from vllm.inputs import EngineInput
from vllm.lora.request import LoRARequest
......@@ -36,6 +35,7 @@ from vllm.tracing import (
from vllm.utils import random_uuid
from vllm.utils.async_utils import make_async, merge_async_iterators
from ..typing import AnyPoolingRequest, PoolingServeContext
from .io_processor import PoolingIOProcessor
......
......@@ -5,13 +5,14 @@ from fastapi import APIRouter, Depends, Request
from fastapi.responses import Response
from vllm.entrypoints.openai.utils import validate_json_request
from vllm.entrypoints.pooling.classify.protocol import ClassificationRequest
from vllm.entrypoints.pooling.classify.serving import ServingClassification
from vllm.entrypoints.utils import (
load_aware_call,
with_cancellation,
)
from .protocol import ClassificationRequest
from .serving import ServingClassification
router = APIRouter()
......
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from vllm.entrypoints.pooling.base.io_processor import PoolingIOProcessor
from ..base.io_processor import PoolingIOProcessor
class ClassifyIOProcessor(PoolingIOProcessor):
......
......@@ -9,15 +9,16 @@ from pydantic import Field
from vllm import PoolingParams
from vllm.config import ModelConfig
from vllm.entrypoints.openai.engine.protocol import OpenAIBaseModel, UsageInfo
from vllm.entrypoints.pooling.base.protocol import (
from vllm.logger import init_logger
from vllm.renderers import TokenizeParams
from vllm.utils import random_uuid
from ..base.protocol import (
ChatRequestMixin,
ClassifyRequestMixin,
CompletionRequestMixin,
PoolingBasicRequestMixin,
)
from vllm.logger import init_logger
from vllm.renderers import TokenizeParams
from vllm.utils import random_uuid
logger = init_logger(__name__)
......
......@@ -7,11 +7,11 @@ import numpy as np
from fastapi.responses import JSONResponse
from vllm.entrypoints.openai.engine.protocol import UsageInfo
from vllm.entrypoints.pooling.base.serving import PoolingServing
from vllm.entrypoints.pooling.typing import PoolingServeContext
from vllm.logger import init_logger
from vllm.outputs import ClassificationOutput
from ..base.serving import PoolingServing
from ..typing import PoolingServeContext
from .io_processor import ClassifyIOProcessor
from .protocol import (
ClassificationData,
......
......@@ -7,13 +7,11 @@ from fastapi import APIRouter, Depends, Request
from vllm.entrypoints.openai.engine.protocol import ErrorResponse
from vllm.entrypoints.openai.utils import validate_json_request
from vllm.entrypoints.pooling.embed.protocol import (
CohereEmbedRequest,
EmbeddingRequest,
)
from vllm.entrypoints.pooling.embed.serving import ServingEmbedding
from vllm.entrypoints.utils import load_aware_call, with_cancellation
from .protocol import CohereEmbedRequest, EmbeddingRequest
from .serving import ServingEmbedding
router = APIRouter()
......
......@@ -16,21 +16,6 @@ from vllm.entrypoints.chat_utils import (
ChatCompletionMessageParam,
CustomChatCompletionMessageParam,
)
from vllm.entrypoints.pooling.base.io_processor import PoolingIOProcessor
from vllm.entrypoints.pooling.embed.protocol import (
CohereEmbedContent,
CohereEmbedInput,
CohereEmbedRequest,
EmbeddingChatRequest,
EmbeddingCompletionRequest,
)
from vllm.entrypoints.pooling.scoring.io_processor import JinaRankingIOProcessorMixin
from vllm.entrypoints.pooling.typing import (
OfflineInputsContext,
PoolingChatLikeRequest,
PoolingCompletionLikeRequest,
PoolingServeContext,
)
from vllm.inputs import EngineInput, tokens_input
from vllm.logger import init_logger
from vllm.outputs import PoolingOutput, PoolingRequestOutput
......@@ -39,6 +24,22 @@ from vllm.renderers.hf import resolve_chat_template
from vllm.utils.collection_utils import chunk_list
from vllm.utils.mistral import is_mistral_tokenizer
from ..base.io_processor import PoolingIOProcessor
from ..scoring.io_processor import JinaRankingIOProcessorMixin
from ..typing import (
OfflineInputsContext,
PoolingChatLikeRequest,
PoolingCompletionLikeRequest,
PoolingServeContext,
)
from .protocol import (
CohereEmbedContent,
CohereEmbedInput,
CohereEmbedRequest,
EmbeddingChatRequest,
EmbeddingCompletionRequest,
)
logger = init_logger(__name__)
......@@ -94,7 +95,7 @@ class EmbedIOProcessor(PoolingIOProcessor):
if ctx.engine_inputs is None:
raise ValueError("Engine prompts not available")
ctx.intermediates = ctx.engine_inputs
ctx.original_engine_inputs = ctx.engine_inputs
request_id = ctx.request_id
max_model_len = self.model_config.max_model_len
chunked_engine_inputs: list[EngineInput] = []
......@@ -189,10 +190,10 @@ class EmbedIOProcessor(PoolingIOProcessor):
aggregator["total_weight"] += weight
aggregator["chunk_count"] += 1
if ctx.intermediates is None:
raise ValueError("Original prompts inputs not available")
if ctx.original_engine_inputs is None:
raise ValueError("Original engine inputs not available")
original_engine_inputs = cast(list[EngineInput], ctx.intermediates)
original_engine_inputs = ctx.original_engine_inputs
num_prompts = len(original_engine_inputs)
# Finalize aggregated results
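
The hunk above accumulates `total_weight` and `chunk_count` while merging chunked inputs. One plausible reading, assumed here rather than taken from the source, is a weighted mean of per-chunk embeddings in which each chunk's weight is its token count. The function name and the `weighted_sum` key are hypothetical.

```python
def aggregate_chunks(chunks: list[tuple[list[float], int]]) -> list[float]:
    """Combine (embedding, weight) pairs into one weighted-mean embedding."""
    aggregator = {"weighted_sum": None, "total_weight": 0, "chunk_count": 0}

    for embedding, weight in chunks:
        if aggregator["weighted_sum"] is None:
            aggregator["weighted_sum"] = [0.0] * len(embedding)
        # Accumulate each dimension scaled by the chunk's weight.
        aggregator["weighted_sum"] = [
            running + weight * value
            for running, value in zip(aggregator["weighted_sum"], embedding)
        ]
        aggregator["total_weight"] += weight
        aggregator["chunk_count"] += 1

    if aggregator["total_weight"] == 0:
        raise ValueError("No chunks to aggregate")

    # Finalize: divide the accumulated sum by the total weight.
    return [s / aggregator["total_weight"] for s in aggregator["weighted_sum"]]


# Two chunks of a long prompt: 3 tokens and 1 token (toy values).
merged = aggregate_chunks([([1.0, 0.0], 3), ([0.0, 1.0], 1)])
```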
......
......@@ -18,14 +18,15 @@ from pydantic import BaseModel, Field
from vllm import PoolingParams
from vllm.config import ModelConfig
from vllm.entrypoints.openai.engine.protocol import OpenAIBaseModel, UsageInfo
from vllm.entrypoints.pooling.base.protocol import (
from vllm.renderers import TokenizeParams
from vllm.utils import random_uuid
from ..base.protocol import (
ChatRequestMixin,
CompletionRequestMixin,
EmbedRequestMixin,
PoolingBasicRequestMixin,
)
from vllm.renderers import TokenizeParams
from vllm.utils import random_uuid
# ---------------------------------------------------------------------------
# OpenAI /v1/embeddings — request models
......
......@@ -9,9 +9,20 @@ from fastapi.responses import JSONResponse, Response, StreamingResponse
from typing_extensions import assert_never
from vllm.entrypoints.openai.engine.protocol import UsageInfo
from vllm.entrypoints.pooling.base.serving import PoolingServing
from vllm.entrypoints.pooling.embed.io_processor import EmbedIOProcessor
from vllm.entrypoints.pooling.embed.protocol import (
from vllm.logger import init_logger
from vllm.outputs import PoolingRequestOutput
from vllm.utils.serial_utils import EmbedDType, Endianness
from ..base.serving import PoolingServing
from ..typing import PoolingServeContext
from ..utils import (
encode_pooling_bytes,
encode_pooling_output_base64,
encode_pooling_output_float,
get_json_response_cls,
)
from .io_processor import EmbedIOProcessor
from .protocol import (
CohereBilledUnits,
CohereEmbedRequest,
CohereEmbedResponse,
......@@ -22,16 +33,6 @@ from vllm.entrypoints.pooling.embed.protocol import (
EmbeddingResponseData,
build_typed_embeddings,
)
from vllm.entrypoints.pooling.typing import PoolingServeContext
from vllm.entrypoints.pooling.utils import (
encode_pooling_bytes,
encode_pooling_output_base64,
encode_pooling_output_float,
get_json_response_cls,
)
from vllm.logger import init_logger
from vllm.outputs import PoolingRequestOutput
from vllm.utils.serial_utils import EmbedDType, Endianness
logger = init_logger(__name__)
......
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from vllm.config import VllmConfig
from typing import TYPE_CHECKING
from fastapi import FastAPI
from vllm.config import ModelConfig, VllmConfig
from vllm.entrypoints.chat_utils import ChatTemplateConfig
from vllm.logger import init_logger
from vllm.plugins.io_processors import has_io_processor
from vllm.renderers import BaseRenderer
from vllm.tasks import SupportedTask
from vllm.tasks import POOLING_TASKS, SupportedTask
from .base.io_processor import PoolingIOProcessor
from .utils import enable_scoring_api
if TYPE_CHECKING:
from argparse import Namespace
from starlette.datastructures import State
from vllm.engine.protocol import EngineClient
from vllm.entrypoints.logger import RequestLogger
from vllm.entrypoints.sagemaker.api_router import (
EndpointFn,
GetHandlerFn,
RequestType,
)
else:
RequestLogger = object
logger = init_logger(__name__)
def init_pooling_io_processors(
supported_tasks: tuple[SupportedTask, ...],
......@@ -74,3 +98,159 @@ def init_pooling_io_processors(
)
for task, processor_cls in processors.items()
}
def register_pooling_api_routers(
    app: FastAPI,
    supported_tasks: tuple["SupportedTask", ...],
    model_config: ModelConfig | None = None,
):
    if model_config is None:
        return

    pooling_task = model_config.get_pooling_task(supported_tasks)

    if pooling_task is not None:
        from .pooling.api_router import router as pooling_router

        app.include_router(pooling_router)

    if "classify" in supported_tasks:
        from .classify.api_router import (
            router as classify_router,
        )

        app.include_router(classify_router)

    if "embed" in supported_tasks:
        from .embed.api_router import router as embed_router

        app.include_router(embed_router)

    if enable_scoring_api(supported_tasks, model_config):
        from .scoring.api_router import router as score_router

        app.include_router(score_router)
def init_pooling_state(
engine_client: "EngineClient",
state: "State",
args: "Namespace",
request_logger: RequestLogger | None,
supported_tasks: tuple["SupportedTask", ...],
):
from vllm.entrypoints.chat_utils import load_chat_template
from vllm.tasks import POOLING_TASKS
from .classify.serving import ServingClassification
from .embed.serving import ServingEmbedding
from .pooling.serving import ServingPooling
from .scoring.serving import ServingScores
model_config = engine_client.model_config
resolved_chat_template = load_chat_template(args.chat_template)
state.serving_pooling = (
(
ServingPooling(
engine_client,
state.openai_serving_models,
supported_tasks=supported_tasks,
request_logger=request_logger,
chat_template=resolved_chat_template,
chat_template_content_format=args.chat_template_content_format,
trust_request_chat_template=args.trust_request_chat_template,
)
)
if any(t in supported_tasks for t in POOLING_TASKS)
else None
)
state.serving_embedding = (
ServingEmbedding(
engine_client,
state.openai_serving_models,
request_logger=request_logger,
chat_template=resolved_chat_template,
chat_template_content_format=args.chat_template_content_format,
trust_request_chat_template=args.trust_request_chat_template,
)
if "embed" in supported_tasks
else None
)
state.serving_classification = (
ServingClassification(
engine_client,
state.openai_serving_models,
request_logger=request_logger,
chat_template=resolved_chat_template,
chat_template_content_format=args.chat_template_content_format,
trust_request_chat_template=args.trust_request_chat_template,
)
if "classify" in supported_tasks
else None
)
state.serving_scores = (
ServingScores(
engine_client,
state.openai_serving_models,
request_logger=request_logger,
chat_template=resolved_chat_template,
chat_template_content_format=args.chat_template_content_format,
trust_request_chat_template=args.trust_request_chat_template,
enable_flash_late_interaction=getattr(
args, "enable_flash_late_interaction", True
),
)
if enable_scoring_api(supported_tasks, model_config)
else None
)
def get_pooling_invocation_types(
supported_tasks: tuple["SupportedTask", ...],
model_config: ModelConfig | None = None,
):
# NOTE: Items defined earlier take higher priority
invocation_types: list[tuple[RequestType, tuple[GetHandlerFn, EndpointFn]]] = []
if "embed" in supported_tasks:
from .embed.api_router import create_embedding, embedding
from .embed.protocol import EmbeddingRequest
invocation_types += [
(EmbeddingRequest, (embedding, create_embedding)),
]
if "classify" in supported_tasks:
from .classify.api_router import classify, create_classify
from .classify.protocol import ClassificationRequest
invocation_types += [
(ClassificationRequest, (classify, create_classify)),
]
if enable_scoring_api(supported_tasks, model_config):
from .scoring.api_router import do_rerank, rerank
from .scoring.protocol import RerankRequest
invocation_types += [
(RerankRequest, (rerank, do_rerank)),
]
from .scoring.api_router import create_score, score
from .scoring.protocol import ScoreRequest
invocation_types += [
(ScoreRequest, (score, create_score)),
]
if any(task in POOLING_TASKS for task in supported_tasks):
from .pooling.api_router import create_pooling, pooling
from .pooling.protocol import PoolingRequest
invocation_types += [
(PoolingRequest, (pooling, create_pooling)),
]
return invocation_types
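
The note "Items defined earlier take higher priority" suggests that a caller walks the returned list in order and uses the first entry that accepts the request. The sketch below illustrates that first-match idea under stated assumptions: `InvocationType`, `required_keys`, and `resolve` are hypothetical stand-ins, and real request types would be validated as request models rather than by key sets.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InvocationType:
    """Hypothetical stand-in for a (RequestType, handlers) entry."""

    name: str
    required_keys: frozenset


def resolve(payload: dict, invocation_types: list) -> Optional[str]:
    """Return the name of the first entry whose required keys are all present."""
    for invocation_type in invocation_types:
        if invocation_type.required_keys.issubset(payload):
            return invocation_type.name
    return None


# Order matters: the more specific entries are listed before the generic one.
invocation_types = [
    InvocationType("embedding", frozenset({"input"})),
    InvocationType("rerank", frozenset({"query", "documents"})),
    InvocationType("pooling", frozenset({"input"})),
]

first_match = resolve({"input": "hello"}, invocation_types)
rerank_match = resolve({"query": "q", "documents": ["d"]}, invocation_types)
no_match = resolve({"unknown": 1}, invocation_types)
```

Here a payload with only `input` would satisfy both the embedding and the pooling entries, and the earlier one wins.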
......@@ -6,10 +6,11 @@ from fastapi import APIRouter, Depends, Request
from vllm.entrypoints.openai.engine.protocol import ErrorResponse
from vllm.entrypoints.openai.utils import validate_json_request
from vllm.entrypoints.pooling.pooling.protocol import PoolingRequest
from vllm.entrypoints.pooling.pooling.serving import ServingPooling
from vllm.entrypoints.utils import load_aware_call, with_cancellation
from .protocol import PoolingRequest
from .serving import ServingPooling
router = APIRouter()
......