cerebrium.md 3.29 KB
Newer Older
1
2
3
4
---
title: Cerebrium
---
[](){ #deployment-cerebrium }
5
6
7
8
9
10
11
12
13
14

<p align="center">
    <img src="https://i.ibb.co/hHcScTT/Screenshot-2024-06-13-at-10-14-54.png" alt="vLLM_plus_cerebrium"/>
</p>

vLLM can be run on a cloud based GPU machine with [Cerebrium](https://www.cerebrium.ai/), a serverless AI infrastructure platform that makes it easier for companies to build and deploy AI based applications.

To install the Cerebrium client, run:

```console
15
16
pip install cerebrium
cerebrium login
17
18
19
20
21
```

Next, create your Cerebrium project, run:

```console
22
cerebrium init vllm-project
23
24
25
26
27
28
29
30
31
32
33
34
```

Next, to install the required packages, add the following to your cerebrium.toml:

```toml
[cerebrium.deployment]
docker_base_image_url = "nvidia/cuda:12.1.1-runtime-ubuntu22.04"

[cerebrium.dependencies.pip]
vllm = "latest"
```

35
Next, let us add our code to handle inference for the LLM of your choice (`mistralai/Mistral-7B-Instruct-v0.1` for this example), add the following code to your `main.py`:
36

37
??? Code
38

39
40
    ```python
    from vllm import LLM, SamplingParams
41

42
    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")
43

44
    def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
45

46
47
        sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
        outputs = llm.generate(prompts, sampling_params)
48

49
50
51
52
53
54
55
56
57
        # Print the outputs.
        results = []
        for output in outputs:
            prompt = output.prompt
            generated_text = output.outputs[0].text
            results.append({"prompt": prompt, "generated_text": generated_text})

        return {"results": results}
    ```
58

59
Then, run the following code to deploy it to the cloud:
60
61

```console
62
cerebrium deploy
63
64
```

65
If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case`/run`)
66

67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
??? Command

    ```python
    curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
    -H 'Content-Type: application/json' \
    -H 'Authorization: <JWT TOKEN>' \
    --data '{
    "prompts": [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is"
    ]
    }'
    ```
82
83
84

You should get a response like:

85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
??? Response

    ```python
    {
        "run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
        "result": {
            "result": [
                {
                    "prompt": "Hello, my name is",
                    "generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of"
                },
                {
                    "prompt": "The president of the United States is",
                    "generated_text": " elected every four years. This is a democratic system.\n\n5. What"
                },
                {
                    "prompt": "The capital of France is",
                    "generated_text": " Paris.\n"
                },
                {
                    "prompt": "The future of AI is",
                    "generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective."
                }
            ]
        },
        "run_time_ms": 152.53663063049316
    }
    ```
113
114

You now have an autoscaling endpoint where you only pay for the compute you use!