@@ -11,6 +11,14 @@ This guide shows how to use vLLM to:
Be sure to complete the :ref:`installation instructions <installation>` before continuing with this guide.
.. note::

    By default, vLLM downloads models from `HuggingFace <https://huggingface.co/>`_. If you would like to use models from `ModelScope <https://www.modelscope.cn>`_ in the following examples, please set the environment variable:

    .. code-block:: shell

        export VLLM_USE_MODELSCOPE=True
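    Alternatively, the variable can be set from Python before vLLM is imported. The snippet below is a minimal sketch; it assumes ``VLLM_USE_MODELSCOPE`` only needs to be present in the process environment before the engine is created:

    .. code-block:: python

        import os

        # Must be set before importing vllm so the engine sees it.
        os.environ["VLLM_USE_MODELSCOPE"] = "True"

        from vllm import LLM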
Offline Batched Inference
-------------------------
...
...
@@ -40,16 +48,6 @@ Initialize vLLM's engine for offline inference with the ``LLM`` class and the `O
Call ``llm.generate`` to generate the outputs. It adds the input prompts to the vLLM engine's waiting queue and executes the engine to generate the outputs with high throughput. The outputs are returned as a list of ``RequestOutput`` objects, which include all of the output tokens.
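For example, the snippet below is a minimal end-to-end sketch; the prompts, sampling parameters, and the ``facebook/opt-125m`` model are illustrative choices, not requirements:

.. code-block:: python

    from vllm import LLM, SamplingParams

    # Example prompts and sampling settings; adjust to your use case.
    prompts = [
        "Hello, my name is",
        "The capital of France is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Load a small model for illustration; any supported model name works here.
    llm = LLM(model="facebook/opt-125m")

    # generate() enqueues the prompts and runs the engine until all outputs are ready.
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")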
By default, the server uses a predefined chat template stored in the tokenizer. You can override this template with the ``--chat-template`` argument when starting the server.
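For example, the following is a sketch that assumes the OpenAI-compatible server entrypoint; the model name and template path are placeholders:

.. code-block:: shell

    python -m vllm.entrypoints.openai.api_server \
        --model facebook/opt-125m \
        --chat-template ./path/to/chat_template.jinja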