Add docs on serving with Llama Stack (#10183)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> Co-authored-by: Russell Bryant <rbryant@redhat.com>

Add docs on serving with Llama Stack (#10183)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> Co-authored-by: Russell Bryant <rbryant@redhat.com>
4800339c · Yuan Tang · GitHub · fe15729a · 4800339c · 4800339c
Unverified Commit 4800339c authored Nov 11, 2024 by Yuan Tang Committed by GitHub Nov 11, 2024
Showing with 43 additions and 0 deletions

docs/source/serving/integrations.rst docs/source/serving/integrations.rst +1 -0

docs/source/serving/serving_with_llamastack.rst docs/source/serving/serving_with_llamastack.rst +42 -0

No files found.
--- a/docs/source/serving/integrations.rst
+++ b/docs/source/serving/integrations.rst
@@ -13,3 +13,4 @@ Integrations
   deploying_with_dstack
   serving_with_langchain
   serving_with_llamaindex
+   serving_with_llamastack
--- a/docs/source/serving/serving_with_llamastack.rst
+++ b/docs/source/serving/serving_with_llamastack.rst
+.. _run_on_llamastack:
+Serving with Llama Stack
+============================
+vLLM is also available via `Llama Stack <https://github.com/meta-llama/llama-stack>`_ .
+To install Llama Stack, run
+.. code-block:: console
+    $ pip install llama-stack -q
+Inference using OpenAI Compatible API
+-------------------------------------
+Then start Llama Stack server pointing to your vLLM server with the following configuration:
+.. code-block:: yaml
+    inference:
+      - provider_id: vllm0
+        provider_type: remote::vllm
+        config:
+          url: http://127.0.0.1:8000
+Please refer to `this guide <https://github.com/meta-llama/llama-stack/blob/main/docs/source/getting_started/distributions/self_hosted_distro/remote_vllm.md>`_ for more details on this remote vLLM provider.
+Inference via Embedded vLLM
+---------------------------
+An `inline vLLM provider
+<https://github.com/meta-llama/llama-stack/tree/main/llama_stack/providers/inline/inference/vllm>`_
+is also available. This is a sample of configuration using that method:
+.. code-block:: yaml
+    inference
+      - provider_type: vllm
+        config:
+          model: Llama3.1-8B-Instruct
+          tensor_parallel_size: 4