# Consuming Text Generation Inference

There are many ways to consume the Text Generation Inference (TGI) server in your applications. After launching the server, you can use the [Messages API](https://huggingface.co/docs/text-generation-inference/en/messages_api) `/v1/chat/completions` route and make a `POST` request to get results from the server. You can also pass `"stream": true` to the call if you want TGI to return a stream of tokens.

For more information on the API, consult the OpenAPI documentation of `text-generation-inference` available [here](https://huggingface.github.io/text-generation-inference).

You can make the requests using any tool of your preference, such as curl, Python, or TypeScript. For an end-to-end experience, we've open-sourced [ChatUI](https://github.com/huggingface/chat-ui), a chat interface for open-access models.

## curl

After a successful server launch, you can query the model using the `/v1/chat/completions` route to get responses compliant with the OpenAI Chat Completion spec:

```bash
curl localhost:8080/v1/chat/completions \
    -X POST \
    -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is deep learning?"
    }
  ],
  "stream": true,
  "max_tokens": 20
}' \
    -H 'Content-Type: application/json'
```

For non-chat use cases, you can also use the `/generate` and `/generate_stream` routes.

```bash
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{
  "inputs":"What is Deep Learning?",
  "parameters":{
    "max_new_tokens":20
  }
}' \
    -H 'Content-Type: application/json'
```
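
The `/generate_stream` route accepts the same payload and streams the tokens back as server-sent events. A minimal sketch of the same request in streaming mode (curl's `-N` flag disables output buffering so events are printed as they arrive):

```bash
curl -N 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{
  "inputs":"What is Deep Learning?",
  "parameters":{
    "max_new_tokens":20
  }
}' \
    -H 'Content-Type: application/json'
```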

## Python

### Inference Client

[`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/main/en/index) is a Python library to interact with the Hugging Face Hub, including its endpoints. It provides a high-level class, [`huggingface_hub.InferenceClient`](https://huggingface.co/docs/huggingface_hub/package_reference/inference_client#huggingface_hub.InferenceClient), which makes it easy to make calls to TGI's Messages API. `InferenceClient` also takes care of parameter validation and provides a simple-to-use interface.

Install the `huggingface_hub` package via pip.

```bash
pip install huggingface_hub
```

You can now use `InferenceClient` exactly the same way you would use the `OpenAI` client in Python:

```python
from huggingface_hub import InferenceClient

client = InferenceClient(
    base_url="http://localhost:8080/v1/",
)

output = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 10"},
    ],
    stream=True,
    max_tokens=1024,
)

for chunk in output:
    print(chunk.choices[0].delta.content)
```

You can check out more details about OpenAI compatibility [here](https://huggingface.co/docs/huggingface_hub/en/guides/inference#openai-compatibility).

There is also an async version of the client, `AsyncInferenceClient`, based on `asyncio` and `aiohttp`. You can find its docs [here](https://huggingface.co/docs/huggingface_hub/package_reference/inference_client#huggingface_hub.AsyncInferenceClient).
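
As a minimal sketch, the same streaming call with the async client could look like this (assuming the same local server on port 8080):

```python
import asyncio
from huggingface_hub import AsyncInferenceClient

async def main():
    client = AsyncInferenceClient(base_url="http://localhost:8080/v1/")

    output = await client.chat.completions.create(
        model="tgi",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Count to 10"},
        ],
        stream=True,
        max_tokens=1024,
    )

    # With stream=True, the result is an async iterator of chunks
    async for chunk in output:
        print(chunk.choices[0].delta.content)

asyncio.run(main())
```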

### OpenAI Client

You can directly use the OpenAI [Python](https://github.com/openai/openai-python) or [JS](https://github.com/openai/openai-node) clients to interact with TGI.

Install the OpenAI Python package via pip.

```bash
pip install openai
```

```python
from openai import OpenAI

# init the client but point it to TGI
client = OpenAI(
    base_url="http://localhost:8080/v1/",
    api_key="-"
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant." },
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=True
)

# iterate and print stream
for message in chat_completion:
    print(message)
```

## UI

### Gradio

Gradio is a Python library that helps you build web applications for your machine learning models with a few lines of code. It has a `ChatInterface` wrapper that helps create neat UIs for chatbots. Let's take a look at how to create a chatbot with streaming mode using TGI and Gradio. First, install Gradio and the Hub Python library.

```bash
pip install huggingface-hub gradio
```

Assuming you are serving your model on port 8080, we will query it through the [InferenceClient](consuming_tgi#inference-client).

```python
import gradio as gr
from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://127.0.0.1:8080")

def inference(message, history):
    # `history` is managed by ChatInterface itself; this minimal demo only forwards the latest message
    partial_message = ""
    output = client.chat.completions.create(
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": message},
        ],
        stream=True,
        max_tokens=1024,
    )

    for chunk in output:
        # The final chunk may arrive without content, so guard before appending
        if chunk.choices[0].delta.content:
            partial_message += chunk.choices[0].delta.content
        yield partial_message

gr.ChatInterface(
    inference,
    chatbot=gr.Chatbot(height=300),
    textbox=gr.Textbox(placeholder="Chat with me!", container=False, scale=7),
    description="This is a demo of a Gradio UI consuming a TGI endpoint.",
    title="Gradio 🤝 TGI",
    examples=["Are tomatoes vegetables?"],
    retry_btn="Retry",
    undo_btn="Undo",
    clear_btn="Clear",
).queue().launch()
```

You can check out the UI and try the demo directly here 👇

<div class="block dark:hidden">
	<iframe
        src="https://merve-gradio-tgi-2.hf.space?__theme=light"
        width="850"
        height="750"
    ></iframe>
</div>
<div class="hidden dark:block">
    <iframe
        src="https://merve-gradio-tgi-2.hf.space?__theme=dark"
        width="850"
        height="750"
    ></iframe>
</div>


You can read more about how to customize a `ChatInterface` [here](https://www.gradio.app/guides/creating-a-chatbot-fast).

### ChatUI

[ChatUI](https://github.com/huggingface/chat-ui) is an open-source interface built for consuming LLMs. It offers many customization options, such as web search with SERP API and more. ChatUI can automatically consume the TGI server and even provides an option to switch between different TGI endpoints. You can try it out at [Hugging Chat](https://huggingface.co/chat/), or use the [ChatUI Docker Space](https://huggingface.co/new-space?template=huggingchat/chat-ui-template) to deploy your own Hugging Chat to Spaces.

To serve both ChatUI and TGI in the same environment, simply add your own endpoints to the `MODELS` variable in the `.env.local` file inside the `chat-ui` repository. Provide endpoints pointing to where TGI is served.

```
{
  // rest of the model config here
  "endpoints": [{"url": "https://HOST:PORT/generate_stream"}]
}
```

![ChatUI](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/chatui_screen.png)