@@ -11,6 +11,14 @@ This guide shows how to use vLLM to:
Be sure to complete the :ref:`installation instructions <installation>` before continuing with this guide.
.. note::

    By default, vLLM downloads models from `HuggingFace <https://huggingface.co/>`_. If you would like to use models from `ModelScope <https://www.modelscope.cn>`_ in the following examples, please set the environment variable:

    .. code-block:: shell

        export VLLM_USE_MODELSCOPE=True
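    Alternatively, the variable can be set from Python before vLLM is imported. The snippet below is a minimal sketch; it assumes ``VLLM_USE_MODELSCOPE`` only needs to be present in the process environment before the engine is created:

    .. code-block:: python

        import os

        # Must be set before importing vllm so the engine sees it.
        os.environ["VLLM_USE_MODELSCOPE"] = "True"

        from vllm import LLM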
Offline Batched Inference
-------------------------
...
...
@@ -40,16 +48,6 @@ Initialize vLLM's engine for offline inference with the ``LLM`` class and the `O
Call ``llm.generate`` to generate the outputs. It adds the input prompts to the vLLM engine's waiting queue and executes the engine to generate the outputs with high throughput. The outputs are returned as a list of ``RequestOutput`` objects, which include all of the output tokens.
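For example, the snippet below is a minimal end-to-end sketch; the prompts, sampling parameters, and the ``facebook/opt-125m`` model are illustrative choices, not requirements:

.. code-block:: python

    from vllm import LLM, SamplingParams

    # Example prompts and sampling settings; adjust to your use case.
    prompts = [
        "Hello, my name is",
        "The capital of France is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Load a small model for illustration; any supported model name works here.
    llm = LLM(model="facebook/opt-125m")

    # generate() enqueues the prompts and runs the engine until all outputs are ready.
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")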
By default, the server uses a predefined chat template stored in the tokenizer. You can override this template with the ``--chat-template`` argument when starting the server.
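For example, the following is a sketch that assumes the OpenAI-compatible server entrypoint; the model name and template path are placeholders:

.. code-block:: shell

    python -m vllm.entrypoints.openai.api_server \
        --model facebook/opt-125m \
        --chat-template ./path/to/chat_template.jinja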