SkyPilot
========

.. attention:: 
    To be updated for Qwen3.

What is SkyPilot
----------------

SkyPilot is a framework for running LLMs, AI, and batch jobs on any
cloud, offering maximum cost savings, the highest GPU availability, and
managed execution. Its features include:

-  Get the best GPU availability by utilizing multiple resource pools
   across multiple regions and clouds.
-  Pay the absolute minimum: SkyPilot picks the cheapest resources
   across regions and clouds. No managed-solution markups.
-  Scale up to multiple replicas across different locations and
   accelerators, all served behind a single endpoint.
-  Everything stays in your cloud account (your VMs and buckets).
-  Completely private: no one else sees your chat history.

Install SkyPilot
----------------

We recommend following the official
`installation instructions <https://skypilot.readthedocs.io/en/latest/getting-started/installation.html>`__
to install SkyPilot. A simple ``pip``-based installation looks like
this:

.. code:: bash

   # You can use any of the following clouds that you have access to:
   # aws, gcp, azure, oci, lambda, runpod, fluidstack, paperspace,
   # cudo, ibm, scp, vsphere, kubernetes
   pip install "skypilot-nightly[aws,gcp]"

After that, you need to verify cloud access with a command like:

.. code:: bash

   sky check

For more details on verifying that your cloud accounts are set up
correctly, see the `official documentation <https://skypilot.readthedocs.io/en/latest/getting-started/installation.html>`__.

Alternatively, you can use the official Docker image, which comes with
the SkyPilot master branch already cloned:

.. code:: bash

   # NOTE: '--platform linux/amd64' is needed for Apple Silicon Macs
   docker run --platform linux/amd64 \
     -td --rm --name sky \
     -v "$HOME/.sky:/root/.sky:rw" \
     -v "$HOME/.aws:/root/.aws:rw" \
     -v "$HOME/.config/gcloud:/root/.config/gcloud:rw" \
     berkeleyskypilot/skypilot-nightly

   docker exec -it sky /bin/bash

Running Qwen2.5-72B-Instruct with SkyPilot
------------------------------------------

1. Start serving Qwen2.5-72B-Instruct on a single instance with any
   available GPU in the list specified in
   `serve-72b.yaml <https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/serve-72b.yaml>`__
   with a vLLM-powered OpenAI-compatible endpoint:
   
   .. code:: bash

      sky launch -c qwen serve-72b.yaml

   **Before launching, make sure you have changed Qwen/Qwen2-72B-Instruct to Qwen/Qwen2.5-72B-Instruct in the YAML file.**

2. Send a request to the endpoint for completion:

   .. code:: bash

      IP=$(sky status --ip qwen)

      curl -L http://$IP:8000/v1/completions \
         -H "Content-Type: application/json" \
         -d '{
            "model": "Qwen/Qwen2.5-72B-Instruct",
            "prompt": "My favorite food is",
            "max_tokens": 512
      }' | jq -r '.choices[0].text'

3. Send a request for chat completion:

   .. code:: bash

      curl -L http://$IP:8000/v1/chat/completions \
         -H "Content-Type: application/json" \
         -d '{
            "model": "Qwen/Qwen2.5-72B-Instruct",
            "messages": [
            {
               "role": "system",
               "content": "You are Qwen, created by Alibaba Cloud. You are a helpful and honest chat expert."
            },
            {
               "role": "user",
               "content": "What is the best food?"
            }
            ],
            "max_tokens": 512
      }' | jq -r '.choices[0].message.content'
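The two ``curl`` requests above can also be issued from Python. The
sketch below only builds the same OpenAI-compatible request body; the
helper name ``build_chat_payload`` is illustrative, not part of SkyPilot
or vLLM, and actually posting it (e.g. with
``requests.post(f"http://{ip}:8000/v1/chat/completions", json=payload)``)
requires a live cluster.

.. code:: python

   def build_chat_payload(model: str, system_msg: str, user_msg: str,
                          max_tokens: int = 512) -> dict:
       """Build an OpenAI-compatible /v1/chat/completions request body."""
       return {
           "model": model,
           "messages": [
               {"role": "system", "content": system_msg},
               {"role": "user", "content": user_msg},
           ],
           "max_tokens": max_tokens,
       }

   # Same request body as the chat-completion curl example above.
   payload = build_chat_payload(
       "Qwen/Qwen2.5-72B-Instruct",
       "You are Qwen, created by Alibaba Cloud. You are a helpful and honest chat expert.",
       "What is the best food?",
   )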

Scale up the service with SkyPilot Serve
----------------------------------------

1. With `SkyPilot
   Serve <https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html>`__,
   a serving library built on top of SkyPilot, scaling up the Qwen
   service is as simple as running:

   .. code:: bash

      sky serve up -n qwen ./serve-72b.yaml

   **Before launching, make sure you have changed Qwen/Qwen2-72B-Instruct to Qwen/Qwen2.5-72B-Instruct in the YAML file.**

   This will start the service with multiple replicas on the cheapest
   available locations and accelerators. SkyServe will automatically manage
   the replicas, monitor their health, autoscale based on load, and restart
   them when needed.

   A single endpoint will be returned and any request sent to the endpoint
   will be routed to the ready replicas.

2. To check the status of the service, run:

   .. code:: bash

      sky serve status qwen

   After a while, you will see the following output:

   ::

      Services
      NAME  VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
      Qwen  1        -       READY   2/2       3.85.107.228:30002

      Service Replicas
      SERVICE_NAME  ID  VERSION  IP  LAUNCHED    RESOURCES                   STATUS  REGION
      Qwen          1   1        -   2 mins ago  1x Azure({'A100-80GB': 8})  READY   eastus
      Qwen          2   1        -   2 mins ago  1x GCP({'L4': 8})           READY   us-east4-a

   As shown, the service is now backed by two replicas, one on Azure and
   one on GCP, with the accelerator type chosen to be **the cheapest
   available one** on each cloud. This maximizes the availability of the
   service while minimizing the cost.

3. To access the model, we use a ``curl -L`` command (``-L`` to follow
   redirects) to send the request to the endpoint:

   .. code:: bash

      ENDPOINT=$(sky serve status --endpoint qwen)

      curl -L http://$ENDPOINT/v1/chat/completions \
         -H "Content-Type: application/json" \
         -d '{
            "model": "Qwen/Qwen2.5-72B-Instruct",
            "messages": [
            {
               "role": "system",
               "content": "You are Qwen, created by Alibaba Cloud. You are a helpful and honest code assistant expert in Python."
            },
            {
               "role": "user",
               "content": "Show me the python code for quick sorting a list of integers."
            }
            ],
            "max_tokens": 512
      }' | jq -r '.choices[0].message.content'
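The ``jq -r '.choices[0].message.content'`` filter at the end of the
pipeline can be mirrored in Python when you consume the endpoint
programmatically. A minimal sketch; the ``sample`` response below is
fabricated for illustration and only mimics the OpenAI-compatible shape:

.. code:: python

   def extract_reply(response: dict) -> str:
       """Equivalent of `jq -r '.choices[0].message.content'`."""
       return response["choices"][0]["message"]["content"]

   # Fabricated response in the OpenAI-compatible shape, for illustration only.
   sample = {
       "choices": [
           {"message": {"role": "assistant",
                        "content": "Here is a quicksort in Python: ..."}}
       ]
   }
   reply = extract_reply(sample)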

Accessing Qwen2.5 with a Chat GUI
---------------------------------

It is also possible to access the Qwen2.5 service through a GUI by
connecting a `FastChat GUI server <https://github.com/lm-sys/FastChat>`__ to the endpoint launched
above (see `gui.yaml <https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/gui.yaml>`__).

1. Start the Chat Web UI:

   .. code:: bash

      sky launch -c qwen-gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint qwen)

   **Before launching, make sure you have changed Qwen/Qwen1.5-72B-Chat to Qwen/Qwen2.5-72B-Instruct in the YAML file.**

2. Then, we can access the GUI at the returned Gradio link:

   ::

      | INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live

   Note that you may get better results by using different
   ``temperature`` and ``top_p`` values.
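The note on sampling applies to direct API calls as well:
``temperature`` and ``top_p`` are standard OpenAI-compatible request
parameters that can be set per request. The helper below is hypothetical
(not part of FastChat or SkyPilot), and the default values are
illustrative only:

.. code:: python

   def with_sampling(payload: dict, temperature: float = 0.7,
                     top_p: float = 0.8) -> dict:
       """Return a copy of an OpenAI-compatible request body with
       sampling parameters attached (defaults are illustrative)."""
       out = dict(payload)
       out["temperature"] = temperature
       out["top_p"] = top_p
       return out

   payload = with_sampling(
       {"model": "Qwen/Qwen2.5-72B-Instruct",
        "messages": [{"role": "user", "content": "Hello"}]},
       temperature=0.6, top_p=0.9,
   )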

Summary
-------

With SkyPilot, deploying Qwen2.5 on any cloud is straightforward. For
more usage details and the latest updates, see the official
`documentation <https://skypilot.readthedocs.io/>`__.