.. _run_on_llamastack:

Serving with Llama Stack
============================

vLLM is also available via `Llama Stack <https://github.com/meta-llama/llama-stack>`_.

To install Llama Stack, run:

.. code-block:: console

    $ pip install llama-stack -q
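
To confirm that the installation succeeded, a small check along the following lines can help. This is only a sketch and not part of Llama Stack: the helper name ``installed_version`` is hypothetical, and it relies solely on the standard-library ``importlib.metadata`` module.

.. code-block:: python

    from importlib.metadata import PackageNotFoundError, version


    def installed_version(dist_name):
        """Return the installed version of a distribution, or None if absent."""
        try:
            return version(dist_name)
        except PackageNotFoundError:
            return None


    if __name__ == "__main__":
        # "llama-stack" is the distribution name used in the pip command above.
        print(installed_version("llama-stack"))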

Inference using OpenAI-Compatible API
-------------------------------------

Then start the Llama Stack server, pointing it at your vLLM server with the following configuration:

.. code-block:: yaml

    inference:
      - provider_id: vllm0
        provider_type: remote::vllm
        config:
          url: http://127.0.0.1:8000

Please refer to `this guide <https://llama-stack.readthedocs.io/en/latest/distributions/self_hosted_distro/remote-vllm.html>`_ for more details on the remote vLLM provider.
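
Before starting Llama Stack, it can be useful to check that the vLLM server behind ``url`` is actually reachable. The sketch below assumes the server is vLLM's OpenAI-compatible server listening on ``http://127.0.0.1:8000`` (the value from the configuration above) and queries its ``/v1/models`` route. The helper names ``models_endpoint`` and ``list_served_models`` are hypothetical and exist only for this illustration.

.. code-block:: python

    import json
    import urllib.request

    # Assumption: same value as ``url`` in the remote::vllm configuration.
    VLLM_URL = "http://127.0.0.1:8000"


    def models_endpoint(base_url):
        """Build the URL of the OpenAI-compatible model listing route."""
        return base_url.rstrip("/") + "/v1/models"


    def list_served_models(base_url, timeout=5.0):
        """Return the model IDs reported by the server at ``base_url``."""
        with urllib.request.urlopen(models_endpoint(base_url), timeout=timeout) as resp:
            payload = json.load(resp)
        # The OpenAI-style response wraps the models in a "data" list.
        return [entry["id"] for entry in payload.get("data", [])]


    if __name__ == "__main__":
        # Requires a running server; raises URLError if nothing is listening.
        print(list_served_models(VLLM_URL))

If the call raises a connection error, the vLLM server is probably not running or is listening on a different address than the one given in the configuration.
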

Inference via Embedded vLLM
---------------------------

An `inline vLLM provider
<https://github.com/meta-llama/llama-stack/tree/main/llama_stack/providers/inline/inference/vllm>`_
is also available, which embeds vLLM in Llama Stack rather than connecting to a separate vLLM server. A sample configuration using that method:

.. code-block:: yaml

    inference:
      - provider_type: vllm
        config:
          model: Llama3.1-8B-Instruct
          tensor_parallel_size: 4
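
The ``tensor_parallel_size`` value has to fit the hardware and the model. As a rough sanity check, the sketch below assumes a single machine and two common constraints: the value should not exceed the number of available GPUs, and the model's attention-head count should be divisible by it. The function ``check_tensor_parallel`` is hypothetical, and the GPU and head counts in the usage lines are illustrative assumptions rather than values read from Llama Stack or vLLM.

.. code-block:: python

    def check_tensor_parallel(tp_size, gpu_count, num_attention_heads):
        """Return a list of problems with a tensor-parallel setting (empty if none)."""
        problems = []
        if tp_size < 1:
            problems.append("tensor_parallel_size must be at least 1")
            return problems
        if tp_size > gpu_count:
            problems.append(
                f"tensor_parallel_size={tp_size} exceeds {gpu_count} available GPU(s)"
            )
        if num_attention_heads % tp_size != 0:
            problems.append(
                f"{num_attention_heads} attention heads are not divisible by "
                f"tensor_parallel_size={tp_size}"
            )
        return problems


    if __name__ == "__main__":
        # Assumed for illustration: 4 GPUs and a model with 32 attention heads.
        for problem in check_tensor_parallel(4, gpu_count=4, num_attention_heads=32):
            print(problem)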