# -rw-rw-r-- 1 mgoin mgoin 205M Jun 9 18:03 flashinfer_python-0.2.6.post1-cp39-abi3-linux_x86_64.whl
# $ # upload the wheel to a public location, e.g. https://wheels.vllm.ai/flashinfer/v0.2.6.post1/flashinfer_python-0.2.6.post1-cp39-abi3-linux_x86_64.whl
@@ -30,8 +30,8 @@ Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at
...
@@ -30,8 +30,8 @@ Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at
Where to get started with vLLM depends on the type of user. If you are looking to:
Where to get started with vLLM depends on the type of user. If you are looking to:
- Run open-source models on vLLM, we recommend starting with the [Quickstart Guide](./getting_started/quickstart.md)
- Run open-source models on vLLM, we recommend starting with the [Quickstart Guide](./getting_started/quickstart.md)
- Build applications with vLLM, we recommend starting with the [User Guide](./usage)
- Build applications with vLLM, we recommend starting with the [User Guide](./usage/README.md)
- Build vLLM, we recommend starting with [Developer Guide](./contributing)
- Build vLLM, we recommend starting with [Developer Guide](./contributing/README.md)
For information about the development of vLLM, see:
For information about the development of vLLM, see:
...
@@ -56,7 +56,7 @@ vLLM is flexible and easy to use with:
...
@@ -56,7 +56,7 @@ vLLM is flexible and easy to use with:
- Tensor, pipeline, data and expert parallelism support for distributed inference
- Tensor, pipeline, data and expert parallelism support for distributed inference
- Streaming outputs
- Streaming outputs
- OpenAI-compatible API server
- OpenAI-compatible API server
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, and TPU. Additionally, support for diverse hardware plugins such as Intel Gaudi, IBM Spyre and Huawei Ascend.
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPU. Additionally, support for diverse hardware plugins such as Intel Gaudi, IBM Spyre and Huawei Ascend.
@@ -20,8 +20,6 @@ API documentation for vLLM's configuration classes.
...
@@ -20,8 +20,6 @@ API documentation for vLLM's configuration classes.
-[vllm.config.CompilationConfig][]
-[vllm.config.CompilationConfig][]
-[vllm.config.VllmConfig][]
-[vllm.config.VllmConfig][]
[](){ #offline-inference-api }
## Offline Inference
## Offline Inference
LLM Class.
LLM Class.
...
@@ -45,18 +43,14 @@ Engine classes for offline and online inference.
...
@@ -45,18 +43,14 @@ Engine classes for offline and online inference.
Inference parameters for vLLM APIs.
Inference parameters for vLLM APIs.
[](){ #sampling-params }
-[vllm.SamplingParams][]
-[vllm.SamplingParams][]
-[vllm.PoolingParams][]
-[vllm.PoolingParams][]
[](){ #multi-modality }
## Multi-Modality
## Multi-Modality
vLLM provides experimental support for multi-modal models through the [vllm.multimodal][] package.
vLLM provides experimental support for multi-modal models through the [vllm.multimodal][] package.
Multi-modal inputs can be passed alongside text and token prompts to [supported models][supported-mm-models]
Multi-modal inputs can be passed alongside text and token prompts to [supported models](../models/supported_models.md#list-of-multimodal-language-models)
via the `multi_modal_data` field in [vllm.inputs.PromptType][].
via the `multi_modal_data` field in [vllm.inputs.PromptType][].
Looking to add your own multi-modal model? Please follow the instructions listed [here](../contributing/model/multimodal.md).
Looking to add your own multi-modal model? Please follow the instructions listed [here](../contributing/model/multimodal.md).