# Consuming Text Generation Inference

There are many ways to consume the Text Generation Inference (TGI) server in your applications. After launching the server, you can use the [Messages API](https://huggingface.co/docs/text-generation-inference/en/messages_api) `/v1/chat/completions` route and make a `POST` request to get results from the server. You can also pass `"stream": true` to the call if you want TGI to return a stream of tokens.

For more information on the API, consult the OpenAPI documentation of `text-generation-inference` available [here](https://huggingface.github.io/text-generation-inference).

You can make the requests using any tool of your preference, such as curl, Python, or TypeScript. For an end-to-end experience, we've open-sourced [ChatUI](https://github.com/huggingface/chat-ui), a chat interface for open-access models.

## curl

After a successful server launch, you can query the model using the `/v1/chat/completions` route to get responses compliant with the OpenAI Chat Completion spec:

```bash
curl localhost:8080/v1/chat/completions \
    -X POST \
    -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is deep learning?"
    }
  ],
  "stream": true,
  "max_tokens": 20
}' \
    -H 'Content-Type: application/json'
```
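
If you omit `"stream": true` (or set it to `false`), the server returns a single JSON response with the full completion instead of a stream of server-sent events.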

For non-chat use-cases, you can also use the `/generate` and `/generate_stream` routes.

```bash
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{
  "inputs":"What is Deep Learning?",
  "parameters":{
    "max_new_tokens":20
  }
}' \
    -H 'Content-Type: application/json'
```
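
The `/generate_stream` route accepts the same payload but streams the tokens back as server-sent events. A minimal sketch (the `-N` flag simply disables curl's output buffering so tokens are printed as they arrive):

```bash
curl -N 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{
  "inputs":"What is Deep Learning?",
  "parameters":{
    "max_new_tokens":20
  }
}' \
    -H 'Content-Type: application/json'
```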

## Python

### Inference Client

[`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/main/en/index) is a Python library to interact with the Hugging Face Hub, including its endpoints. It provides a high-level class, [`huggingface_hub.InferenceClient`](https://huggingface.co/docs/huggingface_hub/package_reference/inference_client#huggingface_hub.InferenceClient), which makes it easy to call TGI's Messages API. `InferenceClient` also takes care of parameter validation and provides a simple-to-use interface.

Install the `huggingface_hub` package via pip.

```bash
pip install huggingface_hub
```

You can now use `InferenceClient` the exact same way you would use the `OpenAI` client in Python:

```python
from huggingface_hub import InferenceClient

client = InferenceClient(
    base_url="http://localhost:8080/v1/",
)

output = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 10"},
    ],
    stream=True,
    max_tokens=1024,
)

for chunk in output:
    print(chunk.choices[0].delta.content)
```

You can check out more details about OpenAI compatibility [here](https://huggingface.co/docs/huggingface_hub/en/guides/inference#openai-compatibility).

There is also an async version of the client, `AsyncInferenceClient`, based on `asyncio` and `aiohttp`. You can find its docs [here](https://huggingface.co/docs/huggingface_hub/package_reference/inference_client#huggingface_hub.AsyncInferenceClient).
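
As a rough sketch (assuming the same local endpoint as above), streaming with the async client looks like this:

```python
import asyncio

from huggingface_hub import AsyncInferenceClient


async def main():
    client = AsyncInferenceClient(base_url="http://localhost:8080/v1/")

    output = await client.chat.completions.create(
        model="tgi",
        messages=[
            {"role": "user", "content": "Count to 10"},
        ],
        stream=True,
        max_tokens=1024,
    )

    # streamed chunks arrive as an async iterator
    async for chunk in output:
        print(chunk.choices[0].delta.content)


asyncio.run(main())
```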

### OpenAI Client

You can directly use the OpenAI [Python](https://github.com/openai/openai-python) or [JS](https://github.com/openai/openai-node) clients to interact with TGI.

Install the OpenAI Python package via pip.

```bash
pip install openai
```

```python
from openai import OpenAI

# init the client but point it to TGI
client = OpenAI(
    base_url="http://localhost:8080/v1/",
    api_key="-"
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant." },
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=True
)

# iterate and print stream
for message in chat_completion:
    print(message)
```
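
If you don't need token streaming, drop `stream=True` and the same client returns the full completion in one call; a minimal non-streaming sketch:

```python
chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"}
    ],
)

# without stream=True, the generated text is on the first choice's message
print(chat_completion.choices[0].message.content)
```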

## UI

### Gradio

Gradio is a Python library that helps you build web applications for your machine learning models with a few lines of code. It has a `ChatInterface` wrapper that helps create neat UIs for chatbots. Let's take a look at how to create a chatbot with streaming mode using TGI and Gradio. First, install Gradio and the Hub Python library:

```bash
pip install huggingface-hub gradio
```

Assuming you are serving your model on port 8080, we will query it through [InferenceClient](consuming_tgi#inference-client).

```python
import gradio as gr
from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://127.0.0.1:8080")

def inference(message, history):
    partial_message = ""
    output = client.chat.completions.create(
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": message},
        ],
        stream=True,
        max_tokens=1024,
    )

    for chunk in output:
        # the final streamed chunk may carry an empty delta, so guard before appending
        if chunk.choices and chunk.choices[0].delta.content:
            partial_message += chunk.choices[0].delta.content
            yield partial_message

gr.ChatInterface(
    inference,
    chatbot=gr.Chatbot(height=300),
    textbox=gr.Textbox(placeholder="Chat with me!", container=False, scale=7),
    description="This is the demo for Gradio UI consuming TGI endpoint.",
    title="Gradio 🤝 TGI",
    examples=["Are tomatoes vegetables?"],
    retry_btn="Retry",
    undo_btn="Undo",
    clear_btn="Clear",
).queue().launch()
```

You can check out the UI and try the demo directly here 👇

<div class="block dark:hidden">
    <iframe
        src="https://merve-gradio-tgi-2.hf.space?__theme=light"
        width="850"
        height="750"
    ></iframe>
</div>
<div class="hidden dark:block">
    <iframe
        src="https://merve-gradio-tgi-2.hf.space?__theme=dark"
        width="850"
        height="750"
    ></iframe>
</div>


You can read more about how to customize a `ChatInterface` [here](https://www.gradio.app/guides/creating-a-chatbot-fast).

### ChatUI

[ChatUI](https://github.com/huggingface/chat-ui) is an open-source interface built for consuming LLMs. It offers many customization options, such as web search with SERP API and more. ChatUI can automatically consume the TGI server and even provides an option to switch between different TGI endpoints. You can try it out at [Hugging Chat](https://huggingface.co/chat/), or use the [ChatUI Docker Space](https://huggingface.co/new-space?template=huggingchat/chat-ui-template) to deploy your own Hugging Chat to Spaces.

To serve both ChatUI and TGI in the same environment, simply add your own endpoints to the `MODELS` variable in the `.env.local` file inside the `chat-ui` repository, pointing them to where TGI is served.

```
{
  // rest of the model config here
  "endpoints": [{"url": "https://HOST:PORT/generate_stream"}]
}
```

![ChatUI](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/chatui_screen.png)