<div align="center">

![image](https://github.com/huggingface/text-generation-inference/assets/3841370/38ba1531-ea0d-4851-b31a-a6d4ddc944b0)

# Text Generation Inference

<a href="https://github.com/huggingface/text-generation-inference">
  <img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/huggingface/text-generation-inference?style=social">
</a>
<a href="https://github.com/huggingface/text-generation-inference/blob/main/LICENSE">
  <img alt="License" src="https://img.shields.io/github/license/huggingface/text-generation-inference">
</a>
<a href="https://huggingface.github.io/text-generation-inference">
  <img alt="Swagger API documentation" src="https://img.shields.io/badge/API-Swagger-informational">
</a>
</div>

A Rust, Python and gRPC server for text generation inference. Used in production at [HuggingFace](https://huggingface.co)
to power LLM api-inference widgets.

## Table of contents

- [Features](#features)
- [Optimized Architectures](#optimized-architectures)
- [Get Started](#get-started)
  - [Docker](#docker)
  - [API Documentation](#api-documentation)
  - [Using a private or gated model](#using-a-private-or-gated-model)
  - [A note on Shared Memory](#a-note-on-shared-memory-shm)
  - [Distributed Tracing](#distributed-tracing)
  - [Local Install](#local-install)
  - [CUDA Kernels](#cuda-kernels)
- [Run Falcon](#run-falcon)
  - [Run](#run)
  - [Quantization](#quantization)
- [Develop](#develop)
- [Testing](#testing)
- [Other supported hardware](#other-supported-hardware)

## Features

- Serve the most popular Large Language Models with a simple launcher
- Tensor Parallelism for faster inference on multiple GPUs
- Token streaming using Server-Sent Events (SSE)
- [Continuous batching of incoming requests](https://github.com/huggingface/text-generation-inference/tree/main/router) for increased total throughput
- Optimized transformers code for inference using [flash-attention](https://github.com/HazyResearch/flash-attention) and [Paged Attention](https://github.com/vllm-project/vllm) on the most popular architectures
- Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) and [GPT-Q](https://arxiv.org/abs/2210.17323)
- [Safetensors](https://github.com/huggingface/safetensors) weight loading
- Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
- Logits warper (temperature scaling, top-p, top-k, repetition penalty; for more details, see [transformers.LogitsProcessor](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.LogitsProcessor))
- Stop sequences
- Log probabilities
- Production ready (distributed tracing with OpenTelemetry, Prometheus metrics)

## Optimized architectures

- [BLOOM](https://huggingface.co/bigscience/bloom)
- [FLAN-T5](https://huggingface.co/google/flan-t5-xxl)
- [Galactica](https://huggingface.co/facebook/galactica-120b)
- [GPT-Neox](https://huggingface.co/EleutherAI/gpt-neox-20b)
- [Llama](https://github.com/facebookresearch/llama)
- [OPT](https://huggingface.co/facebook/opt-66b)
- [SantaCoder](https://huggingface.co/bigcode/santacoder)
- [Starcoder](https://huggingface.co/bigcode/starcoder)
- [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b)
- [Falcon 40B](https://huggingface.co/tiiuae/falcon-40b)
- [MPT](https://huggingface.co/mosaicml/mpt-30b)
- [Llama V2](https://huggingface.co/meta-llama)

Other architectures are supported on a best effort basis using:

`AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")`

or

`AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")`

## Get started

### Docker

The easiest way of getting started is using the official Docker container:

```shell
model=tiiuae/falcon-7b-instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.9.4 --model-id $model
```
**Note:** To use GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). We also recommend using NVIDIA drivers with CUDA version 11.8 or higher.

To see all options to serve your models, check the [code](https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs) or the CLI:
```
text-generation-launcher --help
```

You can then query the model using either the `/generate` or `/generate_stream` routes:

```shell
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```

```shell
curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```

or from Python:

```shell
pip install text-generation
```

```python
from text_generation import Client

client = Client("http://127.0.0.1:8080")

# Single blocking call: returns once the full completion is available
print(client.generate("What is Deep Learning?", max_new_tokens=20).generated_text)

# Streaming call: tokens arrive one by one as they are generated
text = ""
for response in client.generate_stream("What is Deep Learning?", max_new_tokens=20):
    # Skip special tokens so only regular text is accumulated
    if not response.token.special:
        text += response.token.text
print(text)
```
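
If you would rather not install the client package, the same two routes can be reached with the Python standard library alone. The sketch below mirrors the `curl` payload shown earlier. The helper names `build_request` and `collect_stream_text` are hypothetical, and the stream parser assumes that each Server-Sent Event arrives as a single `data:{...}` line carrying a `token` object with `text` and `special` fields. Check the `/docs` route of your server before relying on that shape.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # same address as the curl examples


def build_request(route, prompt, max_new_tokens=20):
    """Build (but do not send) a POST request for /generate or /generate_stream."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        BASE_URL + route,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def collect_stream_text(lines):
    """Join the text of non-special tokens from an iterable of SSE lines.

    Assumption: every event is one line of the form `data:{...json...}`.
    """
    text = ""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separators and any other SSE fields
        event = json.loads(line[len("data:"):])
        token = event["token"]
        if not token["special"]:
            text += token["text"]
    return text


# Usage against a running server (not executed here):
# req = build_request("/generate_stream", "What is Deep Learning?")
# with urllib.request.urlopen(req) as resp:
#     print(collect_stream_text(resp))
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call is made.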

### API documentation

You can consult the OpenAPI documentation of the `text-generation-inference` REST API using the `/docs` route.
The Swagger UI is also available at: [https://huggingface.github.io/text-generation-inference](https://huggingface.github.io/text-generation-inference).

### Using a private or gated model

You can use the `HUGGING_FACE_HUB_TOKEN` environment variable to configure the token used by
`text-generation-inference`. This gives the server access to protected resources.

For example, if you want to serve the gated Llama V2 model variants:

1. Go to https://huggingface.co/settings/tokens
2. Copy your cli READ token
3. Export `HUGGING_FACE_HUB_TOKEN=<your cli READ token>`

or with Docker:

```shell
model=meta-llama/Llama-2-7b-chat-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your cli READ token>

docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.9.3 --model-id $model
```
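
When the container is started from a script, one option is to read the token from the environment instead of writing it into the command. The following is a minimal sketch of that idea. The helper `docker_command` is hypothetical, and it only reassembles the flags already shown in the Docker command above.

```python
import os
import shlex


def docker_command(model, volume, image_tag, token=None):
    """Return the `docker run` argument list used above, adding the token if given."""
    cmd = ["docker", "run", "--gpus", "all", "--shm-size", "1g"]
    if token:
        # Pass the Hub token into the container as an environment variable.
        cmd += ["-e", f"HUGGING_FACE_HUB_TOKEN={token}"]
    cmd += [
        "-p", "8080:80",
        "-v", f"{volume}:/data",
        f"ghcr.io/huggingface/text-generation-inference:{image_tag}",
        "--model-id", model,
    ]
    return cmd


cmd = docker_command(
    model="meta-llama/Llama-2-7b-chat-hf",
    volume=os.path.join(os.getcwd(), "data"),
    image_tag="0.9.3",
    token=os.environ.get("HUGGING_FACE_HUB_TOKEN"),  # None if not exported
)

# Print the command with the token value masked.
masked = [
    "HUGGING_FACE_HUB_TOKEN=***" if a.startswith("HUGGING_FACE_HUB_TOKEN=") else a
    for a in cmd
]
print(shlex.join(masked))
```

The resulting list can be handed to `subprocess.run(cmd)`, which avoids shell quoting issues and keeps the token out of shell history.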

### A note on Shared Memory (shm)

[`NCCL`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html) is a communication framework used by
`PyTorch` to do distributed training/inference. `text-generation-inference` makes
use of `NCCL` to enable Tensor Parallelism to dramatically speed up inference for large language models.

To share data between the different devices of an `NCCL` group, `NCCL` might fall back to using the host memory if
peer-to-peer communication using NVLink or PCI is not possible.

To allow the container to use 1G of Shared Memory and support SHM sharing, we add `--shm-size 1g` to the command above.

If you are running `text-generation-inference` inside `Kubernetes`, you can also add Shared Memory to the container by
creating a volume with:

```yaml
- name: shm
  emptyDir:
   medium: Memory
   sizeLimit: 1Gi
```

and mounting it to `/dev/shm`.
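
To show how the volume and the mount fit together, here is a rough sketch that builds both as a Python dictionary and prints the result as JSON, which Kubernetes accepts alongside YAML. The container name and image tag are placeholders chosen for illustration; only the `shm` volume definition and the `/dev/shm` mount path come from the text above.

```python
import json

# Volume definition from the YAML above.
shm_volume = {"name": "shm", "emptyDir": {"medium": "Memory", "sizeLimit": "1Gi"}}

# Mount that volume at /dev/shm inside the container.
shm_mount = {"name": shm_volume["name"], "mountPath": "/dev/shm"}

# Hypothetical pod spec fragment: container name and image tag are placeholders.
pod_spec = {
    "containers": [
        {
            "name": "text-generation-inference",
            "image": "ghcr.io/huggingface/text-generation-inference:0.9.4",
            "volumeMounts": [shm_mount],
        }
    ],
    "volumes": [shm_volume],
}

print(json.dumps(pod_spec, indent=2))
```

The `name` in `volumeMounts` must match the `name` of the volume, which is why the mount reuses `shm_volume["name"]` rather than repeating the string.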

Finally, you can also disable SHM sharing by using the `NCCL_SHM_DISABLE=1` environment variable. However, note that
this will impact performance.

### Distributed Tracing

`text-generation-inference` is instrumented with distributed tracing using OpenTelemetry. You can use this feature
by setting the address to an OTLP collector with the `--otlp-endpoint` argument.

### Local install

You can also opt to install `text-generation-inference` locally.

First [install Rust](https://rustup.rs/) and create a Python virtual environment with at least
Python 3.9, e.g. using `conda`:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

conda create -n text-generation-inference python=3.9
conda activate text-generation-inference
```

You may also need to install Protoc.

On Linux:

```shell
PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP
```

On macOS, using Homebrew:

```shell
brew install protobuf
```

Then run:

```shell
BUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels
make run-falcon-7b-instruct
```

**Note:** on some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:

```shell
sudo apt-get install libssl-dev gcc -y
```

### CUDA Kernels

The custom CUDA kernels are only tested on NVIDIA A100s. If you have any installation or runtime issues, you can remove
the kernels by using the `DISABLE_CUSTOM_KERNELS=True` environment variable.

Be aware that the official Docker image has them enabled by default.

## Run Falcon

### Run

```shell
make run-falcon-7b-instruct
```

### Quantization

You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:

```shell
make run-falcon-7b-instruct-quantize
```

## Develop

```shell
make server-dev
make router-dev
```

## Testing

```shell
# python
make python-server-tests
make python-client-tests
# or both server and client tests
make python-tests
# rust cargo tests
make rust-tests
# integration tests
make integration-tests
```


## Other supported hardware

TGI is also supported on the following AI hardware accelerators:
- *Habana first-gen Gaudi and Gaudi2:* check out [here](https://github.com/huggingface/optimum-habana/tree/main/text-generation-inference) how to serve models with TGI on Gaudi and Gaudi2 with [Optimum Habana](https://huggingface.co/docs/optimum/habana/index)