Commit 51679bbd authored by zhuwenwen

resolve merge conflicts

parents 4095d0db 1af090b5
@@ -11,6 +11,14 @@ This guide shows how to use vLLM to:

Be sure to complete the :ref:`installation instructions <installation>` before continuing with this guide.
.. note::
By default, vLLM downloads models from `HuggingFace <https://huggingface.co/>`_. If you would like to use models from `ModelScope <https://www.modelscope.cn>`_ in the following examples, please set the environment variable:
.. code-block:: shell
export VLLM_USE_MODELSCOPE=True
Offline Batched Inference
-------------------------

@@ -40,16 +48,6 @@ Initialize vLLM's engine for offline inference with the ``LLM`` class and the `O

llm = LLM(model="facebook/opt-125m")
Use model from www.modelscope.cn
.. code-block:: shell
export VLLM_USE_MODELSCOPE=True
.. code-block:: python
llm = LLM(model="qwen/Qwen-7B-Chat", revision="v1.1.8", trust_remote_code=True)
Call ``llm.generate`` to generate the outputs. It adds the input prompts to the vLLM engine's waiting queue and executes the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of ``RequestOutput`` objects, which include all the output tokens.

.. code-block:: python

@@ -65,49 +63,11 @@ Call ``llm.generate`` to generate the outputs. It adds the input prompts to vLLM

The code example can also be found in `examples/offline_inference.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference.py>`_.
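For reference, the returned list can be iterated as follows; this is a minimal sketch that assumes ``outputs = llm.generate(prompts, sampling_params)`` has been called as in the example above:

.. code-block:: python

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")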
API Server
----------
vLLM can be deployed as an LLM service. We provide an example `FastAPI <https://fastapi.tiangolo.com/>`_ server. Check `vllm/entrypoints/api_server.py <https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py>`_ for the server implementation. The server uses ``AsyncLLMEngine`` class to support asynchronous processing of incoming requests.
Start the server:
.. code-block:: console
$ python -m vllm.entrypoints.api_server
Use model from www.modelscope.cn
.. code-block:: console
$ VLLM_USE_MODELSCOPE=True python -m vllm.entrypoints.api_server \
$ --model="qwen/Qwen-7B-Chat" \
$ --revision="v1.1.8" \
$ --trust-remote-code
By default, this command starts the server at ``http://localhost:8000`` with the OPT-125M model.
Query the model in shell:
.. code-block:: console
$ curl http://localhost:8000/generate \
$ -d '{
$ "prompt": "San Francisco is a",
$ "use_beam_search": true,
$ "n": 4,
$ "temperature": 0
$ }'
See `examples/api_client.py <https://github.com/vllm-project/vllm/blob/main/examples/api_client.py>`_ for a more detailed client example.
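For illustration, the same query can also be sent from Python. This is a minimal sketch that assumes the demo server above is running on the default host and port and that the ``requests`` package is installed:

.. code-block:: python

    import requests

    # Mirror the curl example above: POST a JSON payload to the /generate endpoint.
    response = requests.post(
        "http://localhost:8000/generate",
        json={
            "prompt": "San Francisco is a",
            "use_beam_search": True,
            "n": 4,
            "temperature": 0,
        },
    )
    print(response.json())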
OpenAI-Compatible Server
------------------------

vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using the OpenAI API.

By default, it starts the server at ``http://localhost:8000``. You can specify the address with the ``--host`` and ``--port`` arguments. The server currently hosts one model at a time (OPT-125M in the command below) and implements `list models <https://platform.openai.com/docs/api-reference/models/list>`_, `create chat completion <https://platform.openai.com/docs/api-reference/chat/completions/create>`_, and `create completion <https://platform.openai.com/docs/api-reference/completions/create>`_ endpoints. We are actively adding support for more endpoints.

Start the server:

@@ -116,13 +76,6 @@ Start the server:

$ python -m vllm.entrypoints.openai.api_server \
$ --model facebook/opt-125m
Use model from www.modelscope.cn
.. code-block:: console
$ VLLM_USE_MODELSCOPE=True python -m vllm.entrypoints.openai.api_server \
$ --model="qwen/Qwen-7B-Chat" --revision="v1.1.8" --trust-remote-code
By default, the server uses a predefined chat template stored in the tokenizer. You can override this template by using the ``--chat-template`` argument:

.. code-block:: console

@@ -137,6 +90,8 @@ This server can be queried in the same format as OpenAI API. For example, list t

$ curl http://localhost:8000/v1/models
You can pass in the argument ``--api-key`` or the environment variable ``VLLM_API_KEY`` to enable the server to check for an API key in the header.
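For example, if the server was started with ``--api-key token-abc123`` (the key value here is purely illustrative), a client passes the same key when connecting. A minimal sketch using the ``openai`` Python client:

.. code-block:: python

    from openai import OpenAI

    # The API key must match the one the server was started with.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")
    print(client.models.list())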
Using OpenAI Completions API with vLLM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
......
@@ -31,7 +31,7 @@ vLLM is fast with:

* Efficient management of attention key and value memory with **PagedAttention**
* Continuous batching of incoming requests
* Fast model execution with CUDA/HIP graph
* Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, `SqueezeLLM <https://arxiv.org/abs/2306.07629>`_, FP8 KV Cache
* Optimized CUDA kernels

vLLM is flexible and easy to use with:

@@ -42,6 +42,8 @@ vLLM is flexible and easy to use with:

* Streaming outputs
* OpenAI-compatible API server
* Support NVIDIA GPUs and AMD GPUs
* (Experimental) Prefix caching support
* (Experimental) Multi-lora support
For more information, check out the following:

@@ -85,4 +87,16 @@ Documentation

:maxdepth: 1
:caption: Quantization

quantization/auto_awq
\ No newline at end of file
.. toctree::
:maxdepth: 2
:caption: Developer Documentation
dev/engine/engine_index
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
@@ -68,6 +68,12 @@ Alongside each architecture, we include some popular models that use it.

* - :code:`QWenLMHeadModel`
- Qwen
- :code:`Qwen/Qwen-7B`, :code:`Qwen/Qwen-7B-Chat`, etc.
* - :code:`Qwen2ForCausalLM`
- Qwen2
- :code:`Qwen/Qwen2-beta-7B`, :code:`Qwen/Qwen2-beta-7B-Chat`, etc.
* - :code:`StableLMEpochForCausalLM`
- StableLM
- :code:`stabilityai/stablelm-3b-4e1t`, :code:`stabilityai/stablelm-base-alpha-7b-v2`, etc.
* - :code:`YiForCausalLM`
- Yi
- :code:`01-ai/Yi-6B`, :code:`01-ai/Yi-34B`, etc.
......
.. _fp8_e5m2_kv_cache:
FP8 E5M2 KV Cache
==================
The int8/int4 quantization schemes require additional GPU memory to store the scales, which reduces the expected GPU memory benefits.
The FP8 data format retains 2-3 mantissa bits, and values can be converted between float/fp16/bfloat16 and fp8.
Here is an example of how to enable this feature:
.. code-block:: python
from vllm import LLM, SamplingParams
# Sample prompts.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="facebook/opt-125m", kv_cache_dtype="fp8_e5m2")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
import argparse
from openai import OpenAI
import gradio as gr
# Argument parser setup
parser = argparse.ArgumentParser(
description='Chatbot Interface with Customizable Parameters')
parser.add_argument('--model-url',
type=str,
default='http://localhost:8000/v1',
help='Model URL')
parser.add_argument('-m',
'--model',
type=str,
required=True,
help='Model name for the chatbot')
parser.add_argument('--temp',
type=float,
default=0.8,
help='Temperature for text generation')
parser.add_argument('--stop-token-ids',
type=str,
default='',
help='Comma-separated stop token IDs')
parser.add_argument("--host", type=str, default=None)
parser.add_argument("--port", type=int, default=8001)
# Parse the arguments
args = parser.parse_args()
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = args.model_url
# Create an OpenAI client to interact with the API server
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
def predict(message, history):
# Convert chat history to OpenAI format
history_openai_format = [{
"role": "system",
"content": "You are a great ai assistant."
}]
for human, assistant in history:
history_openai_format.append({"role": "user", "content": human})
history_openai_format.append({
"role": "assistant",
"content": assistant
})
history_openai_format.append({"role": "user", "content": message})
# Create a chat completion request and send it to the API server
stream = client.chat.completions.create(
model=args.model, # Model name to use
messages=history_openai_format, # Chat history
temperature=args.temp, # Temperature for text generation
stream=True, # Stream response
extra_body={
'repetition_penalty':
1,
'stop_token_ids': [
int(id.strip()) for id in args.stop_token_ids.split(',')
if id.strip()
] if args.stop_token_ids else []
})
# Read and return generated text from response stream
partial_message = ""
for chunk in stream:
partial_message += (chunk.choices[0].delta.content or "")
yield partial_message
# Create and launch a chat interface with Gradio
gr.ChatInterface(predict).queue().launch(server_name=args.host,
server_port=args.port,
share=True)
"""
This example shows how to use the multi-LoRA functionality for offline inference.
Requires HuggingFace credentials for access to Llama2.
"""
from typing import Optional, List, Tuple
from huggingface_hub import snapshot_download
from vllm import EngineArgs, LLMEngine, SamplingParams, RequestOutput
from vllm.lora.request import LoRARequest
def create_test_prompts(lora_path: str) -> List[Tuple[str, SamplingParams]]:
"""Create a list of test prompts with their sampling parameters.
2 requests for base model, 4 requests for the LoRA. We define 2
different LoRA adapters (using the same model for demo purposes).
Since we also set `max_loras=1`, the expectation is that the requests
with the second LoRA adapter will be run after all requests with the
first adapter have finished.
"""
return [
("A robot may not injure a human being",
SamplingParams(temperature=0.0,
logprobs=1,
prompt_logprobs=1,
max_tokens=128), None),
("To be or not to be,",
SamplingParams(temperature=0.8,
top_k=5,
presence_penalty=0.2,
max_tokens=128), None),
("[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]",
SamplingParams(temperature=0.0,
logprobs=1,
prompt_logprobs=1,
max_tokens=128,
stop_token_ids=[32003]),
LoRARequest("sql-lora", 1, lora_path)),
("[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_11 (nationality VARCHAR, elector VARCHAR)\n\n question: When Anchero Pantaleone was the elector what is under nationality? [/user] [assistant]",
SamplingParams(n=3,
best_of=3,
use_beam_search=True,
temperature=0,
max_tokens=128,
stop_token_ids=[32003]),
LoRARequest("sql-lora", 1, lora_path)),
("[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]",
SamplingParams(temperature=0.0,
logprobs=1,
prompt_logprobs=1,
max_tokens=128,
stop_token_ids=[32003]),
LoRARequest("sql-lora2", 2, lora_path)),
("[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_11 (nationality VARCHAR, elector VARCHAR)\n\n question: When Anchero Pantaleone was the elector what is under nationality? [/user] [assistant]",
SamplingParams(n=3,
best_of=3,
use_beam_search=True,
temperature=0,
max_tokens=128,
stop_token_ids=[32003]),
LoRARequest("sql-lora", 1, lora_path)),
]
def process_requests(engine: LLMEngine,
test_prompts: List[Tuple[str, SamplingParams,
Optional[LoRARequest]]]):
"""Continuously process a list of prompts and handle the outputs."""
request_id = 0
while test_prompts or engine.has_unfinished_requests():
if test_prompts:
prompt, sampling_params, lora_request = test_prompts.pop(0)
engine.add_request(str(request_id),
prompt,
sampling_params,
lora_request=lora_request)
request_id += 1
request_outputs: List[RequestOutput] = engine.step()
for request_output in request_outputs:
if request_output.finished:
print(request_output)
def initialize_engine() -> LLMEngine:
"""Initialize the LLMEngine."""
# max_loras: controls the number of LoRAs that can be used in the same
# batch. Larger numbers will cause higher memory usage, as each LoRA
# slot requires its own preallocated tensor.
# max_lora_rank: controls the maximum supported rank of all LoRAs. Larger
# numbers will cause higher memory usage. If you know that all LoRAs will
# use the same rank, it is recommended to set this as low as possible.
# max_cpu_loras: controls the size of the CPU LoRA cache.
engine_args = EngineArgs(model="meta-llama/Llama-2-7b-hf",
enable_lora=True,
max_loras=1,
max_lora_rank=8,
max_cpu_loras=2,
max_num_seqs=256)
return LLMEngine.from_engine_args(engine_args)
def main():
"""Main function that sets up and runs the prompt processing."""
engine = initialize_engine()
lora_path = snapshot_download(repo_id="yard1/llama-2-7b-sql-lora-test")
test_prompts = create_test_prompts(lora_path)
process_requests(engine, test_prompts)
if __name__ == '__main__':
main()
from vllm import LLM, SamplingParams
prefix = (
"You are an expert school principal, skilled in effectively managing "
"faculty and staff. Draft 10-15 questions for a potential first grade "
"Head Teacher for my K-12, all-girls', independent school that emphasizes "
"community, joyful discovery, and life-long learning. The candidate is "
"coming in for a first-round panel interview for a 8th grade Math "
"teaching role. They have 5 years of previous teaching experience "
"as an assistant teacher at a co-ed, public school with experience "
"in middle school math teaching. Based on these information, fulfill "
"the following paragraph: ")
# Sample prompts.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.0)
# Create an LLM.
llm = LLM(model="facebook/opt-125m")
generating_prompts = [prefix + prompt for prompt in prompts]
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(generating_prompts, sampling_params)
# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
print("-" * 80)
# -1 since the last token can change when concatenating prompts.
prefix_pos = len(llm.llm_engine.tokenizer.encode(prefix)) - 1
# The llm.generate call will batch all prompts and send the batch at once if resources allow.
# The prefix will only be cached after the first batch is processed, so we need to call generate once
# to calculate the prefix and cache it.
outputs = llm.generate(generating_prompts[0],
sampling_params,
prefix_pos=[prefix_pos])
# Subsequent batches can leverage the cached prefix
outputs = llm.generate(generating_prompts,
sampling_params,
prefix_pos=[prefix_pos] * len(generating_prompts))
# Print the outputs. You should see the same outputs as before
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
@@ -32,6 +32,5 @@ chat_completion = client.chat.completions.create(
model=model,
)

print("Chat completion results:")
print(chat_completion)
@@ -21,8 +21,7 @@ completion = client.completions.create(
echo=False,
n=2,
stream=stream,
logprobs=3)

print("Completion results:")
if stream:
......
{{ (messages|selectattr('role', 'equalto', 'system')|list|last).content|trim if (messages|selectattr('role', 'equalto', 'system')|list) else '' }}
{% for message in messages %}
{% if message['role'] == 'user' %}
<reserved_106>
{{ message['content']|trim -}}
{% if not loop.last %}
{% endif %}
{% elif message['role'] == 'assistant' %}
<reserved_107>
{{ message['content']|trim -}}
{% if not loop.last %}
{% endif %}
{% endif %}
{% endfor %}
{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}
<reserved_107>
{% endif %}
\ No newline at end of file
@@ -71,7 +71,7 @@ format_changed() {

# Format all files
format_all() {
yapf --in-place "${YAPF_FLAGS[@]}" "${YAPF_EXCLUDES[@]}" .
}

## This flag formats individual files. --files *must* be the first command line
......
@@ -13,4 +13,9 @@ types-setuptools
pytest
pytest-forked
pytest-asyncio
httpx
einops # required for MPT
flash_attn # required for HuggingFace's llama implementation
openai
requests
ray
\ No newline at end of file
sentencepiece # Required for LLaMA tokenizer.
numpy
transformers-neuronx >= 0.9.0
torch-neuronx >= 2.1.0
neuronx-cc
fastapi
uvicorn[standard]
pydantic >= 2.0 # Required for OpenAI server.
aioprometheus[starlette]
@@ -2,12 +2,12 @@ ninja # For faster builds.
typing-extensions>=4.8.0
starlette
psutil
ray >= 2.9
sentencepiece # Required for LLaMA tokenizer.
numpy
tokenizers>=0.15.0
transformers >= 4.37.0 # Required for Mixtral.
fastapi
uvicorn[standard]
pydantic >= 2.0 # Required for OpenAI server.
aioprometheus[starlette]
ninja # For faster builds.
psutil
ray >= 2.9
sentencepiece # Required for LLaMA tokenizer.
numpy
torch == 2.1.2
transformers >= 4.37.0 # Required for Qwen2
xformers == 0.0.23.post1 # Required for CUDA 12.1.
fastapi
uvicorn[standard]
pydantic >= 2.0 # Required for OpenAI server.
aioprometheus[starlette]
pynvml == 11.5.0
import contextlib
import io
import os
import re
import subprocess
from typing import List, Set
import warnings
from pathlib import Path
from typing import List, Set
from packaging.version import parse, Version
import setuptools
import torch
import torch.utils.cpp_extension as torch_cpp_ext
from torch.utils.cpp_extension import BuildExtension, CUDAExtension, CUDA_HOME, ROCM_HOME
from typing import Optional, Union
@@ -20,7 +23,7 @@ MAIN_CUDA_VERSION = "12.1"

# Supported NVIDIA GPU architectures.
NVIDIA_SUPPORTED_ARCHS = {"7.0", "7.5", "8.0", "8.6", "8.9", "9.0"}
ROCM_SUPPORTED_ARCHS = {"gfx90a", "gfx908", "gfx906", "gfx926", "gfx1030", "gfx1100"}
# SUPPORTED_ARCHS = NVIDIA_SUPPORTED_ARCHS.union(ROCM_SUPPORTED_ARCHS)

@@ -28,8 +31,17 @@ def _is_hip() -> bool:
return torch.version.hip is not None
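# Detect an AWS Neuron environment by probing for the `neuron-ls` CLI.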
def _is_neuron() -> bool:
torch_neuronx_installed = True
try:
subprocess.run(["neuron-ls"], capture_output=True, check=True)
except FileNotFoundError:
torch_neuronx_installed = False
return torch_neuronx_installed
def _is_cuda() -> bool:
return (torch.version.cuda is not None) and not _is_neuron()

# Compiler flags.

@@ -43,6 +55,8 @@ if _is_hip():
"Cannot find ROCM_HOME. ROCm must be available to build the package."
)
NVCC_FLAGS += ["-DUSE_ROCM"]
NVCC_FLAGS += ["-U__HIP_NO_HALF_CONVERSIONS__"]
NVCC_FLAGS += ["-U__HIP_NO_HALF_OPERATORS__"]
if _is_cuda() and CUDA_HOME is None:
raise RuntimeError(

@@ -91,6 +105,30 @@ def get_hipcc_rocm_version():
return None
def glob(pattern: str):
root = Path(__name__).parent
return [str(p) for p in root.glob(pattern)]
def get_neuronxcc_version():
import sysconfig
site_dir = sysconfig.get_paths()["purelib"]
version_file = os.path.join(site_dir, "neuronxcc", "version",
"__init__.py")
# Check if the command was executed successfully
with open(version_file, "rt") as fp:
content = fp.read()
# Extract the version using a regular expression
match = re.search(r"__version__ = '(\S+)'", content)
if match:
# Return the version string
return match.group(1)
else:
raise RuntimeError("Could not find HIP version in the output")
def get_nvcc_cuda_version(cuda_dir: str) -> Version:
"""Get the CUDA version from nvcc.

@@ -155,6 +193,8 @@ if _is_cuda() and not compute_capabilities:
"GPUs with compute capability below 7.0 are not supported.")
compute_capabilities.add(f"{major}.{minor}")
ext_modules = []
if _is_cuda():
nvcc_cuda_version = get_nvcc_cuda_version(CUDA_HOME)
if not compute_capabilities:

@@ -192,6 +232,8 @@ if _is_cuda():
raise RuntimeError(
"CUDA 11.8 or higher is required for compute capability 9.0.")
NVCC_FLAGS_PUNICA = NVCC_FLAGS.copy()
# Add target compute capabilities to NVCC flags.
for capability in compute_capabilities:
num = capability[0] + capability[2]

@@ -200,6 +242,14 @@ if _is_cuda():
NVCC_FLAGS += [
"-gencode", f"arch=compute_{num},code=compute_{num}"
]
if int(capability[0]) >= 8:
NVCC_FLAGS_PUNICA += [
"-gencode", f"arch=compute_{num},code=sm_{num}"
]
if capability.endswith("+PTX"):
NVCC_FLAGS_PUNICA += [
"-gencode", f"arch=compute_{num},code=compute_{num}"
]
# Use NVCC threads to parallelize the build.
if nvcc_cuda_version >= Version("11.2"):

@@ -207,14 +257,52 @@ if _is_cuda():
num_threads = min(os.cpu_count(), nvcc_threads)
NVCC_FLAGS += ["--threads", str(num_threads)]
if nvcc_cuda_version >= Version("11.8"):
NVCC_FLAGS += ["-DENABLE_FP8_E5M2"]
# changes for punica kernels
NVCC_FLAGS += torch_cpp_ext.COMMON_NVCC_FLAGS
REMOVE_NVCC_FLAGS = [
'-D__CUDA_NO_HALF_OPERATORS__',
'-D__CUDA_NO_HALF_CONVERSIONS__',
'-D__CUDA_NO_BFLOAT16_CONVERSIONS__',
'-D__CUDA_NO_HALF2_OPERATORS__',
]
for flag in REMOVE_NVCC_FLAGS:
with contextlib.suppress(ValueError):
torch_cpp_ext.COMMON_NVCC_FLAGS.remove(flag)
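# The Punica (multi-LoRA) kernels are opt-in via VLLM_INSTALL_PUNICA_KERNELS and
# are skipped automatically if any visible GPU has compute capability below 8.0.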
install_punica = bool(int(os.getenv("VLLM_INSTALL_PUNICA_KERNELS", "0")))
device_count = torch.cuda.device_count()
for i in range(device_count):
major, minor = torch.cuda.get_device_capability(i)
if major < 8:
install_punica = False
break
if install_punica:
ext_modules.append(
CUDAExtension(
name="vllm._punica_C",
sources=["csrc/punica/punica_ops.cc"] +
glob("csrc/punica/bgmv/*.cu"),
extra_compile_args={
"cxx": CXX_FLAGS,
"nvcc": NVCC_FLAGS_PUNICA,
},
))
# elif _is_hip():
# amd_archs = os.getenv("GPU_ARCHS")
# if amd_archs is None:
# amd_archs = get_amdgpu_offload_arch()
# for arch in amd_archs.split(";"):
# if arch not in ROCM_SUPPORTED_ARCHS:
# raise RuntimeError(
# f"Only the following arch is supported: {ROCM_SUPPORTED_ARCHS}"
# f"amdgpu_arch_found: {arch}")
# NVCC_FLAGS += [f"--offload-arch={arch}"]
elif _is_neuron():
neuronxcc_version = get_neuronxcc_version()
vllm_extension_sources = [
"csrc/cache_kernels.cu",

@@ -225,21 +313,25 @@ vllm_extension_sources = [
"csrc/quantization/squeezellm/quant_cuda_kernel.cu",
"csrc/quantization/gptq/q_gemm.cu",
"csrc/cuda_utils_kernels.cu",
"csrc/moe_align_block_size_kernels.cu",
"csrc/pybind.cpp", "csrc/pybind.cpp",
] ]
if _is_cuda(): if _is_cuda():
vllm_extension_sources.append("csrc/quantization/awq/gemm_kernels.cu") vllm_extension_sources.append("csrc/quantization/awq/gemm_kernels.cu")
vllm_extension_sources.append("csrc/custom_all_reduce.cu")
if not _is_neuron():
vllm_extension = CUDAExtension(
name="vllm._C",
sources=vllm_extension_sources,
extra_compile_args={
"cxx": CXX_FLAGS,
"nvcc": NVCC_FLAGS,
},
libraries=["cuda"] if _is_cuda() else [],
)
ext_modules.append(vllm_extension)
def get_path(*filepath) -> str:

@@ -300,8 +392,8 @@ def get_version_add(sha: Optional[str] = None) -> str:
version += ".torch" + torch.__version__[:3]
with open(add_version_path, encoding="utf-8",mode="w") as file:
file.write("__version__='0.3.0'\n")
file.write("__dcu_version__='0.3.0+{}'\n".format(version))
file.close()

@@ -323,6 +415,12 @@ def get_vllm_version() -> str:
# rocm_version_str = hipcc_version.replace(".", "")[:3]
# version += f"+rocm{rocm_version_str}"
version = get_version()
elif _is_neuron():
# Get the Neuron version
neuron_version = str(neuronxcc_version)
if neuron_version != MAIN_CUDA_VERSION:
neuron_version_str = neuron_version.replace(".", "")[:3]
version += f"+neuron{neuron_version_str}"
else:
cuda_version = str(nvcc_cuda_version)
if cuda_version != MAIN_CUDA_VERSION:

@@ -346,12 +444,20 @@ def get_requirements() -> List[str]:
if _is_hip():
with open(get_path("requirements-rocm.txt")) as f:
requirements = f.read().strip().split("\n")
elif _is_neuron():
with open(get_path("requirements-neuron.txt")) as f:
requirements = f.read().strip().split("\n")
else:
with open(get_path("requirements.txt")) as f:
requirements = f.read().strip().split("\n")
return requirements
package_data = {"vllm": ["py.typed"]}
if os.environ.get("VLLM_USE_PRECOMPILED"):
ext_modules = []
package_data["vllm"].append("*.so")
setuptools.setup(
name="vllm",
version=get_vllm_version(),

@@ -379,6 +485,6 @@ setuptools.setup(
python_requires=">=3.8",
install_requires=get_requirements(),
ext_modules=ext_modules,
cmdclass={"build_ext": BuildExtension} if not _is_neuron() else {},
package_data=package_data,
)
@@ -29,8 +29,13 @@ def api_server():
script_path = Path(__file__).parent.joinpath(
"api_server_async_engine.py").absolute()
uvicorn_process = subprocess.Popen([
sys.executable,
"-u",
str(script_path),
"--model",
"facebook/opt-125m",
"--host",
"127.0.0.1",
])
yield
uvicorn_process.terminate()
@@ -81,6 +86,9 @@ def test_api_server(api_server):
pool.join()

# check cancellation stats
# give it some time to update the stats
time.sleep(1)
num_aborted_requests = requests.get(
"http://localhost:8000/stats").json()["num_aborted_requests"]
assert num_aborted_requests > 0
......
@@ -25,6 +25,13 @@ class MockEngine:
return [RequestOutput(
request_id=self.request_id)] if self.request_id else []
async def encode_request_async(
self,
*args,
**kwargs,
):
return [1]
def generate(self, request_id):
self.request_id = request_id

@@ -35,6 +42,10 @@ class MockEngine:
del kwargs # Unused
self.add_request_calls += 1
async def add_request_async(self, **kwargs):
del kwargs # Unused
self.add_request_calls += 1
def abort_request(self, request_id):
del request_id # Unused
self.abort_request_calls += 1
......
from argparse import Namespace
from dataclasses import dataclass
import os
import pathlib
import pytest
from fastapi.testclient import TestClient
from vllm.transformers_utils.tokenizer import get_tokenizer
from vllm.entrypoints.openai.serving_chat import OpenAIServingChat
from vllm.entrypoints.openai.protocol import ChatCompletionRequest
chatml_jinja_path = pathlib.Path(os.path.dirname(os.path.abspath(
__file__))).parent.parent / "examples/template_chatml.jinja"
assert chatml_jinja_path.exists()
# Define models, templates, and their corresponding expected outputs
MODEL_TEMPLATE_GENERATON_OUTPUT = [
@@ -12,8 +18,7 @@ MODEL_TEMPLATE_GENERATON_OUTPUT = [
"Hello</s>Hi there!</s>What is the capital of</s>"),
("facebook/opt-125m", None, False,
"Hello</s>Hi there!</s>What is the capital of</s>"),
("facebook/opt-125m", chatml_jinja_path, True, """<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there!<|im_end|>

@@ -21,8 +26,7 @@ Hi there!<|im_end|>
What is the capital of<|im_end|>
<|im_start|>assistant
"""),
("facebook/opt-125m", chatml_jinja_path, False, """<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there!<|im_end|>

@@ -44,7 +48,6 @@ TEST_MESSAGES = [
'content': 'What is the capital of'
},
]
client = TestClient(app)
@dataclass

@@ -52,14 +55,17 @@ class MockTokenizer:
chat_template = None
@dataclass
class MockServingChat:
tokenizer: MockTokenizer
def test_load_chat_template():
# Testing chatml template
template = "../../examples/template_chatml.jinja"
mock_args = Namespace(chat_template=template)
tokenizer = MockTokenizer()
mock_serving_chat = MockServingChat(tokenizer)
OpenAIServingChat._load_chat_template(mock_serving_chat,
chat_template=chatml_jinja_path)
template_content = tokenizer.chat_template

@@ -73,11 +79,11 @@ def test_load_chat_template():
def test_no_load_chat_template():
# Testing chatml template
template = "../../examples/does_not_exist"
mock_args = Namespace(chat_template=template)
tokenizer = MockTokenizer()
mock_serving_chat = MockServingChat(tokenizer)
OpenAIServingChat._load_chat_template(mock_serving_chat,
chat_template=template)
template_content = tokenizer.chat_template

# Test assertions
@@ -94,9 +100,9 @@ async def test_get_gen_prompt(model, template, add_generation_prompt,
expected_output):
# Initialize the tokenizer
tokenizer = get_tokenizer(tokenizer_name=model)
mock_serving_chat = MockServingChat(tokenizer)
OpenAIServingChat._load_chat_template(mock_serving_chat,
chat_template=template)

# Create a mock request object using keyword arguments
mock_request = ChatCompletionRequest(

@@ -112,8 +118,3 @@ async def test_get_gen_prompt(model, template, add_generation_prompt,
# Test assertion
assert result == expected_output, f"The generated prompt does not match the expected output for model {model} and template {template}"
def test_health_endpoint():
response = client.get("/health")
assert response.status_code == 200
@@ -2,32 +2,20 @@
Run `pytest tests/distributed/test_comm_ops.py --forked`.
"""
from multiprocessing import Process, set_start_method

import pytest
import torch
import ray
from vllm.config import ParallelConfig
from vllm.utils import get_open_port
from vllm.model_executor.parallel_utils.communication_op import (
tensor_model_parallel_all_reduce,
tensor_model_parallel_all_gather,
broadcast_tensor_dict,
)
from vllm.test_utils import (init_test_distributed_environment,
multi_process_tensor_parallel)
def init_test_distributed_environment(pipeline_parallel_size: int,
tensor_parallel_size: int, rank: int,
distributed_init_port: str):
parallel_config = ParallelConfig(pipeline_parallel_size,
tensor_parallel_size,
worker_use_ray=True)
distributed_init_method = f"tcp://localhost:{distributed_init_port}"
torch.cuda.set_device(rank)
_init_distributed_environment(parallel_config, rank,
distributed_init_method)
@ray.remote(num_gpus=1, max_calls=1)
def all_reduce_test_worker(tensor_parallel_size: int, rank: int,
distributed_init_port: str):
init_test_distributed_environment(1, tensor_parallel_size, rank,

@@ -43,6 +31,7 @@ def all_reduce_test_worker(tensor_parallel_size: int, rank: int,
assert torch.allclose(t, expected)
@ray.remote(num_gpus=1, max_calls=1)
def all_gather_test_worker(tensor_parallel_size: int, rank: int,
distributed_init_port: str):
init_test_distributed_environment(1, tensor_parallel_size, rank,

@@ -64,20 +53,40 @@ def all_gather_test_worker(tensor_parallel_size: int, rank: int,
assert torch.allclose(t, expected)
@ray.remote(num_gpus=1, max_calls=1)
def broadcast_tensor_dict_test_worker(tensor_parallel_size: int, rank: int,
distributed_init_port: str):
init_test_distributed_environment(1, tensor_parallel_size, rank,
distributed_init_port)
test_dict = {
"a": torch.arange(8, dtype=torch.float32, device="cuda"),
"b": torch.arange(16, dtype=torch.int8, device="cuda"),
"c": "test",
"d": [1, 2, 3],
"e": {
"a": 1,
"b": 2
},
}
if rank == 0:
broadcast_tensor_dict(test_dict, src=0)
else:
recv_dict = broadcast_tensor_dict(src=0)
assert len(recv_dict) == len(test_dict)
assert torch.allclose(recv_dict["a"], test_dict["a"])
assert torch.allclose(recv_dict["b"], test_dict["b"])
assert recv_dict["c"] == test_dict["c"]
assert recv_dict["d"] == test_dict["d"]
assert recv_dict["e"] == test_dict["e"]
@pytest.mark.skipif(torch.cuda.device_count() < 2,
reason="Need at least 2 GPUs to run the test.")
@pytest.mark.parametrize("tensor_parallel_size", [2])
@pytest.mark.parametrize("test_target", [
all_reduce_test_worker, all_gather_test_worker,
broadcast_tensor_dict_test_worker
])
def test_multi_process_tensor_parallel(tensor_parallel_size, test_target):
multi_process_tensor_parallel(tensor_parallel_size, test_target)
distributed_init_port = get_open_port()
processes = []
for rank in range(tensor_parallel_size):
p = Process(target=test_target,
args=(tensor_parallel_size, rank, distributed_init_port))
p.start()
processes.append(p)
for p in processes:
p.join()
assert all(p.exitcode == 0 for p in processes)