.. _deploying_with_dstack:

Deploying with dstack
============================

.. raw:: html

    <p align="center">
        <img src="https://i.ibb.co/71kx6hW/vllm-dstack.png" alt="vLLM_plus_dstack"/>
    </p>

vLLM can be run on a cloud-based GPU machine with `dstack <https://dstack.ai/>`__, an open-source framework for running LLMs on any cloud. This tutorial assumes that you have already configured credentials, a gateway, and GPU quotas in your cloud environment.

To install the dstack client and start the dstack server, run:

.. code-block:: console

    $ pip install "dstack[all]"
    $ dstack server

Next, to configure your dstack project, run:
    
.. code-block:: console

    $ mkdir -p vllm-dstack
    $ cd vllm-dstack
    $ dstack init

Next, to provision a VM instance running the LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` in this example), create the following `serve.dstack.yml` file for the dstack `Service`:
    
.. code-block:: yaml

    type: service
    
    python: "3.11"
    env:
        - MODEL=NousResearch/Llama-2-7b-chat-hf
    port: 8000
    resources:
        gpu: 24GB
    commands:
        - pip install vllm
        - vllm serve $MODEL --port 8000
    model:
        format: openai
        type: chat
        name: NousResearch/Llama-2-7b-chat-hf
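
If you want to try several models, the same configuration can be generated from a script. The sketch below is only an illustration: the `render_service_config` helper and its parameters are hypothetical names, not part of dstack or vLLM, and the template simply mirrors the keys of the `serve.dstack.yml` file shown above.

.. code-block:: python

    # Hypothetical helper: renders the service configuration from this
    # tutorial as text. It is not a dstack or vLLM API.
    TEMPLATE = """\
    type: service

    python: "3.11"
    env:
        - MODEL={model}
    port: {port}
    resources:
        gpu: {gpu}
    commands:
        - pip install vllm
        - vllm serve $MODEL --port {port}
    model:
        format: openai
        type: chat
        name: {model}
    """


    def render_service_config(model: str, gpu: str = "24GB", port: int = 8000) -> str:
        """Return the YAML text for a dstack service that serves `model`."""
        # The model name appears twice: once as the MODEL environment
        # variable and once in the `model` mapping at the end.
        return TEMPLATE.format(model=model, gpu=gpu, port=port)


    if __name__ == "__main__":
        # Print the configuration for the model used in this tutorial.
        print(render_service_config("NousResearch/Llama-2-7b-chat-hf"))

Redirecting the printed output to `serve.dstack.yml` should reproduce the file above. Adjust the `gpu` value if the model you choose needs more memory.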

Then, run the following CLI command to provision the service:

.. code-block:: console

    $ dstack run . -f serve.dstack.yml
    
    ⠸ Getting run plan...
     Configuration  serve.dstack.yml             
     Project        deep-diver-main              
     User           deep-diver                   
     Min resources  2..xCPU, 8GB.., 1xGPU (24GB) 
     Max price      -                            
     Max duration   -                            
     Spot policy    auto                         
     Retry policy   no                           
    
     #  BACKEND  REGION       INSTANCE       RESOURCES                               SPOT  PRICE       
     1  gcp   us-central1  g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804   
     2  gcp   us-east1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804   
     3  gcp   us-west1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804   
        ...                                                                                            
     Shown 3 of 193 offers, $5.876 max
    
    Continue? [y/n]: y
    ⠙ Submitting run...
    ⠏ Launching spicy-treefrog-1 (pulling)
    spicy-treefrog-1 provisioning completed (running)
    Service is published at ...

Once provisioning has completed, you can interact with the model using the OpenAI SDK:

.. code-block:: python

    from openai import OpenAI
    
    client = OpenAI(
        base_url="https://gateway.<gateway domain>",
        api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>"
    )
    
    completion = client.chat.completions.create(
        model="NousResearch/Llama-2-7b-chat-hf",
        messages=[
            {
                "role": "user",
                "content": "Compose a poem that explains the concept of recursion in programming.",
            }
        ]
    )

    print(completion.choices[0].message.content)
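
To see roughly what the SDK call above sends, here is a minimal sketch that builds the equivalent HTTP request with the Python standard library only. It assumes the gateway serves the OpenAI-style `/chat/completions` path under the same base URL, which is where the OpenAI SDK sends chat requests. The `build_chat_request` and `extract_reply` helpers, the `gateway.example.com` domain, and the token placeholder are hypothetical names used for illustration.

.. code-block:: python

    import json
    import urllib.request


    def build_chat_request(base_url: str, token: str, model: str, prompt: str) -> urllib.request.Request:
        """Build (but do not send) an OpenAI-style chat completion request."""
        payload = {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }
        return urllib.request.Request(
            # Assumption: chat completions live at <base_url>/chat/completions.
            url=base_url.rstrip("/") + "/chat/completions",
            data=json.dumps(payload).encode("utf-8"),
            headers={
                "Content-Type": "application/json",
                # The dstack server access token is passed as a bearer token.
                "Authorization": f"Bearer {token}",
            },
            method="POST",
        )


    def extract_reply(response_body: bytes) -> str:
        """Return the assistant message text from a chat completion response body."""
        return json.loads(response_body)["choices"][0]["message"]["content"]


    if __name__ == "__main__":
        request = build_chat_request(
            "https://gateway.example.com",  # hypothetical gateway domain
            "<YOUR-DSTACK-SERVER-ACCESS-TOKEN>",
            "NousResearch/Llama-2-7b-chat-hf",
            "Compose a poem that explains the concept of recursion in programming.",
        )
        print(request.get_method(), request.full_url)
        # To actually send the request against a live service:
        # with urllib.request.urlopen(request) as response:
        #     print(extract_reply(response.read()))

The request is only constructed here, so the script runs without a deployed service. Uncomment the last lines once your gateway is reachable.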

.. note::

    dstack automatically handles authentication on the gateway using dstack's tokens. If you don't want to configure a gateway, you can provision a dstack `Task` instead of a `Service`; note that a `Task` is intended for development purposes only. For more hands-on material on serving vLLM with dstack, check out `this repository <https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm>`__.