"...offline_inference/offline_inference_vision_language.py" did not exist on "a60731247fba82fae5e71af7a19ea0df96de1caa"
# LiteLLM

[LiteLLM](https://github.com/BerriAI/litellm) calls all LLM APIs using the OpenAI format (Bedrock, Hugging Face, VertexAI, TogetherAI, Azure, OpenAI, Groq, etc.).

LiteLLM manages:

- Translating inputs to the provider's `completion`, `embedding`, and `image_generation` endpoints
- [Consistent output](https://docs.litellm.ai/docs/completion/output): text responses are always available at `['choices'][0]['message']['content']`
- Retry/fallback logic across multiple deployments (e.g. Azure/OpenAI) via the [Router](https://docs.litellm.ai/docs/routing)
- Budgets and rate limits per project, API key, and model via the [LiteLLM Proxy Server (LLM Gateway)](https://docs.litellm.ai/docs/simple_proxy)
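
As a small illustration of the consistent-output point, the sketch below reads the reply text from an OpenAI-format response. The `sample_response` dict is hand-written for illustration (it is not real server output), and `extract_text` is a hypothetical helper, not part of LiteLLM.

```python
# Hand-written stand-in for an OpenAI-format chat response (illustrative only).
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "I'm doing well, thanks!"},
            "finish_reason": "stop",
        }
    ]
}


def extract_text(response):
    """Return the text of the first choice in an OpenAI-format response."""
    return response["choices"][0]["message"]["content"]


print(extract_text(sample_response))
```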

LiteLLM also supports all models on vLLM.

## Prerequisites

Set up the vLLM and litellm environment:

```bash
pip install vllm litellm
```
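
To confirm that both packages ended up in the active environment, a quick check along these lines may help. It is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name introduced here.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the status of the two packages this page relies on.
for name in ("vllm", "litellm"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```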

## Deploy

### Chat completion

1. Start the vLLM server with a supported chat completion model, e.g.:

    ```bash
    vllm serve qwen/Qwen1.5-0.5B-Chat
    ```

1. Call it with litellm:

??? code

    ```python
    import litellm 

    messages = [{"role": "user", "content": "Hello, how are you?"}]

    # The "hosted_vllm/" prefix is required; the rest is the vLLM model name.
    response = litellm.completion(
        model="hosted_vllm/qwen/Qwen1.5-0.5B-Chat",
        messages=messages,
        api_base="http://{your-vllm-server-host}:{your-vllm-server-port}/v1",
        temperature=0.2,
        max_tokens=80,
    )

    print(response)
    ```
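
The `model` and `api_base` arguments follow a fixed pattern, which can be sketched with two small helpers. Both function names are hypothetical and introduced only for illustration. The default port of 8000 assumes the server was started without `--port`; adjust it if your deployment differs.

```python
def hosted_vllm_model(model_name):
    """Prepend the 'hosted_vllm/' prefix unless it is already present."""
    prefix = "hosted_vllm/"
    return model_name if model_name.startswith(prefix) else prefix + model_name


def vllm_api_base(host, port=8000):
    """Build the OpenAI-compatible base URL for a vLLM server (port 8000 assumed)."""
    return f"http://{host}:{port}/v1"


print(hosted_vllm_model("qwen/Qwen1.5-0.5B-Chat"))
print(vllm_api_base("localhost"))
```

The resulting strings can be passed as `model=` and `api_base=` to `litellm.completion`.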

### Embeddings

1. Start the vLLM server with a supported embedding model, e.g.:

    ```bash
    vllm serve BAAI/bge-base-en-v1.5
    ```

1. Call it with litellm:

```python
import os

from litellm import embedding

os.environ["HOSTED_VLLM_API_BASE"] = "http://{your-vllm-server-host}:{your-vllm-server-port}/v1"

# The "hosted_vllm/" prefix is required; the rest is the vLLM model name.
# The result is stored under a new name so the imported `embedding` function is not shadowed.
response = embedding(model="hosted_vllm/BAAI/bge-base-en-v1.5", input=["Hello world"])

print(response)
```
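
The embedding call returns an OpenAI-format payload, so the vectors can be pulled out as sketched below. The `sample_response` dict is hand-written for illustration, with a shortened three-element vector rather than real model output, and `extract_vectors` is a hypothetical helper, not part of LiteLLM.

```python
# Hand-written stand-in for an OpenAI-format embedding response (illustrative only).
sample_response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [0.1, -0.2, 0.3]},
    ],
    "model": "BAAI/bge-base-en-v1.5",
}


def extract_vectors(response):
    """Return the embedding vectors ordered by their input index."""
    ordered = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in ordered]


vectors = extract_vectors(sample_response)
print(len(vectors), len(vectors[0]))
```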

For details, see the tutorial [Using vLLM in LiteLLM](https://docs.litellm.ai/docs/providers/vllm).