Commit 4800339c authored by Yuan Tang, committed by GitHub

Add docs on serving with Llama Stack (#10183)


Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
parent fe15729a
@@ -13,3 +13,4 @@ Integrations
    deploying_with_dstack
    serving_with_langchain
    serving_with_llamaindex
+   serving_with_llamastack

.. _run_on_llamastack:

Serving with Llama Stack
========================

vLLM is also available via `Llama Stack <https://github.com/meta-llama/llama-stack>`_.

To install Llama Stack, run:

.. code-block:: console

    $ pip install llama-stack -q

Inference using OpenAI Compatible API
-------------------------------------

Then start the Llama Stack server, pointing it to your vLLM server with the following configuration:

.. code-block:: yaml

    inference:
      - provider_id: vllm0
        provider_type: remote::vllm
        config:
          url: http://127.0.0.1:8000

Please refer to `this guide <https://github.com/meta-llama/llama-stack/blob/main/docs/source/getting_started/distributions/self_hosted_distro/remote_vllm.md>`_ for more details on this remote vLLM provider.
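
Before starting Llama Stack, it can help to confirm that the vLLM server is actually reachable at the configured ``url``. The following is a minimal sketch, not part of Llama Stack or vLLM: it assumes the vLLM OpenAI-compatible server is running at ``http://127.0.0.1:8000`` and exposes the standard ``/v1/models`` listing endpoint. The helper names ``models_url``, ``parse_model_ids``, and ``list_served_models`` are hypothetical and chosen only for illustration.

```python
import json
import urllib.error
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL for an OpenAI-compatible server."""
    return base_url.rstrip("/") + "/v1/models"


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style model-list response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Return the model IDs the server reports, or [] if it cannot be reached."""
    try:
        with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Server is down, the URL is wrong, or the response was not valid JSON.
        return []
    return parse_model_ids(payload)
```

Calling ``list_served_models("http://127.0.0.1:8000")`` should return a non-empty list when the vLLM server is up; an empty list suggests the ``url`` in the configuration needs checking before Llama Stack is started.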

Inference via Embedded vLLM
---------------------------

An `inline vLLM provider
<https://github.com/meta-llama/llama-stack/tree/main/llama_stack/providers/inline/inference/vllm>`_
is also available. This is a sample configuration using that method:

.. code-block:: yaml

    inference:
      - provider_type: vllm
        config:
          model: Llama3.1-8B-Instruct
          tensor_parallel_size: 4