Install vLLM with pip or [from source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):

```bash
pip install vllm
```
...
## Getting Started

Visit our [documentation](https://vllm.readthedocs.io/en/latest/) to get started.
...

Import ``LLM`` and ``SamplingParams`` from vLLM. The ``LLM`` class is the main c...
from vllm import LLM, SamplingParams
Define the list of input prompts and the sampling parameters for generation. The sampling temperature is set to 0.8 and the nucleus sampling probability is set to 0.95. For more information about the sampling parameters, refer to the `class definition <https://github.com/vllm-project/vllm/blob/main/vllm/sampling_params.py>`_.
.. code-block:: python
...
Call ``llm.generate`` to generate the outputs. It adds the input prompts to vLLM...

...
The code example can also be found in `examples/offline_inference.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference.py>`_.
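For orientation, here is a minimal self-contained sketch of the flow described above. The prompts and the model name (``facebook/opt-125m``) are only illustrative placeholders; `examples/offline_inference.py` remains the authoritative example.

.. code-block:: python

    from vllm import LLM, SamplingParams

    # Input prompts (placeholders) and the sampling parameters described above.
    prompts = ["Hello, my name is", "The capital of France is"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # The model name is an illustrative assumption; any supported
    # HuggingFace model identifier can be used here.
    llm = LLM(model="facebook/opt-125m")

    # llm.generate adds the prompts to the engine's queue and returns
    # one RequestOutput per prompt once generation completes.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        print(output.prompt, output.outputs[0].text)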
API Server
----------
vLLM can be deployed as an LLM service. We provide an example `FastAPI <https://fastapi.tiangolo.com/>`_ server. Check `vllm/entrypoints/api_server.py <https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py>`_ for the server implementation. The server uses the ``AsyncLLMEngine`` class to support asynchronous processing of incoming requests.
Start the server:
...

Query the model in shell:
$ "temperature": 0
$ "temperature": 0
$ }'
$ }'
See `examples/api_client.py <https://github.com/vllm-project/vllm/blob/main/examples/api_client.py>`_ for a more detailed client example.
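The same request can be sent from Python. This is only a rough sketch: the ``/generate`` route, the port, and the payload fields are assumptions based on the snippet above, so check `api_server.py` for the exact request schema.

.. code-block:: python

    # Minimal sketch of querying the example API server from Python.
    # The /generate route, the port, and the payload fields are assumptions;
    # see vllm/entrypoints/api_server.py for the actual interface.
    import requests

    response = requests.post(
        "http://localhost:8000/generate",
        json={
            "prompt": "San Francisco is a",
            "temperature": 0,
        },
    )
    print(response.json())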
OpenAI-Compatible Server
------------------------
...

Since this server is compatible with OpenAI API, you can use it as a drop-in rep...
prompt="San Francisco is a")
prompt="San Francisco is a")
print("Completion result:", completion)
print("Completion result:", completion)
For a more detailed client example, refer to `examples/openai_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_client.py>`_.
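Putting the fragments above together, a minimal sketch with the (pre-1.0) ``openai`` Python package might look as follows. The base URL, the placeholder API key, and the model name are assumptions; substitute the values you started the server with.

.. code-block:: python

    # Sketch of using the OpenAI client as a drop-in replacement against a
    # local vLLM server. Base URL, API key, and model name are assumptions.
    import openai

    openai.api_key = "EMPTY"
    openai.api_base = "http://localhost:8000/v1"

    completion = openai.Completion.create(
        model="facebook/opt-125m",
        prompt="San Francisco is a")
    print("Completion result:", completion)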
...

This document provides a high-level guide on integrating a `HuggingFace Transfor...
However, for models that include new operators (e.g., a new attention mechanism), the process can be a bit more complex.
.. tip::
    If you are encountering issues while integrating your model into vLLM, feel free to open an issue on our `GitHub <https://github.com/vllm-project/vllm/issues>`_ repository.
    We will be happy to help you out!
0. Fork the vLLM repository
--------------------------------
Start by forking our `GitHub <https://github.com/vllm-project/vllm>`_ repository and then :ref:`build it from source <build_from_source>`.
This gives you the ability to modify the codebase and test your model.
1. Bring your model code
------------------------
Clone the PyTorch model code from the HuggingFace Transformers repository and put it into the `vllm/model_executor/models <https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models>`_ directory.
For instance, vLLM's `OPT model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/opt.py>`_ was adapted from HuggingFace's `modeling_opt.py <https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py>`_ file.
.. warning::
    When copying the model code, make sure to review and adhere to the code's copyright and licensing terms.
...

While the process is straightforward for most layers, the tensor-parallel layers...
5. Register your model
----------------------
Finally, include your :code:`*ForCausalLM` class in `vllm/model_executor/models/__init__.py <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/__init__.py>`_ and register it in the :code:`_MODEL_REGISTRY` in `vllm/model_executor/model_loader.py <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/model_loader.py>`_.
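A loose sketch of those two edits is shown below. The module and class names are hypothetical, and the assumption that :code:`_MODEL_REGISTRY` is a dictionary keyed by the HuggingFace architecture name should be checked against `model_loader.py` in your checkout.

.. code-block:: python

    # In vllm/model_executor/models/__init__.py (hypothetical model name):
    from vllm.model_executor.models.my_model import MyModelForCausalLM

    __all__ = [
        # ... existing model classes ...
        "MyModelForCausalLM",
    ]

    # In vllm/model_executor/model_loader.py, assuming a dict-style registry
    # keyed by the architecture name from the HuggingFace config:
    _MODEL_REGISTRY = {
        # ... existing entries ...
        "MyModelForCausalLM": MyModelForCausalLM,
    }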