<div align="center">

![image](https://github.com/huggingface/text-generation-inference/assets/3841370/38ba1531-ea0d-4851-b31a-a6d4ddc944b0)

# Text Generation Inference

<a href="https://github.com/huggingface/text-generation-inference">
  <img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/huggingface/text-generation-inference?style=social">
</a>
<a href="https://huggingface.github.io/text-generation-inference">
  <img alt="Swagger API documentation" src="https://img.shields.io/badge/API-Swagger-informational">
</a>

A Rust, Python and gRPC server for text generation inference. Used in production at [Hugging Face](https://huggingface.co)
to power Hugging Chat, the Inference API and Inference Endpoints.

</div>

## Table of contents

- [Features](#features)
- [Optimized Architectures](#optimized-architectures)
- [Get Started](#get-started)
  - [Docker](#docker)
  - [API Documentation](#api-documentation)
  - [Using a private or gated model](#using-a-private-or-gated-model)
  - [A note on Shared Memory](#a-note-on-shared-memory-shm)
  - [Distributed Tracing](#distributed-tracing)
  - [Local Install](#local-install)
  - [CUDA Kernels](#cuda-kernels)
- [Run Falcon](#run-falcon)
  - [Run](#run)
  - [Quantization](#quantization)
- [Develop](#develop)
- [Testing](#testing)
- [Other supported hardware](#other-supported-hardware)

## Features

- Serve the most popular Large Language Models with a simple launcher
- Tensor Parallelism for faster inference on multiple GPUs
- Token streaming using Server-Sent Events (SSE)
- [Continuous batching of incoming requests](https://github.com/huggingface/text-generation-inference/tree/main/router) for increased total throughput
- Optimized transformers code for inference using [flash-attention](https://github.com/HazyResearch/flash-attention) and [Paged Attention](https://github.com/vllm-project/vllm) on the most popular architectures
- Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) and [GPTQ](https://arxiv.org/abs/2210.17323)
- [Safetensors](https://github.com/huggingface/safetensors) weight loading
- Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
- Logits warper (temperature scaling, top-p, top-k, repetition penalty; for more details, see [transformers.LogitsProcessor](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.LogitsProcessor))
- Stop sequences
- Log probabilities
- Production ready (distributed tracing with OpenTelemetry, Prometheus metrics)
- Custom Prompt Generation: Easily generate text by providing custom prompts to guide the model's output.
- Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance.

## Optimized architectures

- [BLOOM](https://huggingface.co/bigscience/bloom)
- [FLAN-T5](https://huggingface.co/google/flan-t5-xxl)
- [Galactica](https://huggingface.co/facebook/galactica-120b)
- [GPT-NeoX](https://huggingface.co/EleutherAI/gpt-neox-20b)
- [Llama](https://github.com/facebookresearch/llama)
- [OPT](https://huggingface.co/facebook/opt-66b)
- [SantaCoder](https://huggingface.co/bigcode/santacoder)
- [StarCoder](https://huggingface.co/bigcode/starcoder)
- [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b)
- [Falcon 40B](https://huggingface.co/tiiuae/falcon-40b)
- [MPT](https://huggingface.co/mosaicml/mpt-30b)
- [Llama V2](https://huggingface.co/meta-llama)
- [Code Llama](https://huggingface.co/codellama)
- [Mistral](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1)

Other architectures are supported on a best-effort basis using:

`AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")`

or

`AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")`

## Get started

### Docker

The easiest way of getting started is using the official Docker container:

```shell
model=tiiuae/falcon-7b-instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.1.0 --model-id $model
```
**Note:** To use GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). We also recommend using NVIDIA drivers with CUDA version 11.8 or higher. To run the Docker container on a machine with no GPUs or CUDA support, remove the `--gpus all` flag and add `--disable-custom-kernels`. Please note that CPU is not the intended platform for this project, so performance might be subpar.

To see all the options for serving your models, check the [code](https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs) or the CLI help:

```shell
text-generation-launcher --help
```

You can then query the model using either the `/generate` or `/generate_stream` routes:

```shell
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```
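
The same request can be assembled without any third-party dependency. The following is a minimal sketch using only the Python standard library; the helper names `build_generate_request` and `send` are hypothetical, and the base URL assumes the port mapping from the Docker command above.

```python
import json
import urllib.request


def build_generate_request(base_url, prompt, max_new_tokens=20, stream=False):
    """Build (but do not send) a POST request for /generate or /generate_stream."""
    route = "/generate_stream" if stream else "/generate"
    # Same JSON body as the curl example: the prompt plus generation parameters.
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + route,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

With a server listening locally, `send(build_generate_request("http://127.0.0.1:8080", "What is Deep Learning?"))` should return the decoded JSON body of the `/generate` response.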

```shell
curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```
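
Because `/generate_stream` uses Server-Sent Events, a client has to pick out the `data:` lines and decode each one. The sketch below shows one way to do that; the function names and the sample payloads are hypothetical, and the payload shape (a `token` object with `text` and `special` fields) is an assumption that mirrors the attributes used by the Python client further down.

```python
import json


def iter_sse_events(lines):
    """Yield the decoded JSON payload of each `data:` line in an SSE stream."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        yield json.loads(line[len("data:"):].strip())


def collect_text(events):
    """Concatenate the text of all non-special tokens."""
    return "".join(
        event["token"]["text"]
        for event in events
        if not event["token"]["special"]
    )


# Hypothetical sample stream, for illustration only.
sample = [
    b'data:{"token":{"id":1,"text":"Deep","special":false}}\n',
    b"\n",
    b'data:{"token":{"id":2,"text":" learning","special":false}}\n',
    b"\n",
    b'data:{"token":{"id":3,"text":"</s>","special":true}}\n',
]
print(collect_text(iter_sse_events(sample)))  # prints: Deep learning
```

In a real client, `lines` would be the line iterator of the HTTP response body rather than a hard-coded list.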

or from Python:

```shell
pip install text-generation
```

```python
from text_generation import Client

client = Client("http://127.0.0.1:8080")

# Blocking call: returns once the full completion is available
print(client.generate("What is Deep Learning?", max_new_tokens=20).generated_text)

# Streaming call: tokens arrive one at a time
text = ""
for response in client.generate_stream("What is Deep Learning?", max_new_tokens=20):
    # Skip special tokens so that only regular text is accumulated
    if not response.token.special:
        text += response.token.text
print(text)
```

### API documentation

You can consult the OpenAPI documentation of the `text-generation-inference` REST API using the `/docs` route.
The Swagger UI is also available at: [https://huggingface.github.io/text-generation-inference](https://huggingface.github.io/text-generation-inference).

### Using a private or gated model

You can set the `HUGGING_FACE_HUB_TOKEN` environment variable to configure the token used by
`text-generation-inference`. This gives the server access to protected resources.

For example, if you want to serve the gated Llama V2 model variants:

1. Go to https://huggingface.co/settings/tokens
2. Copy your CLI READ token
3. Export `HUGGING_FACE_HUB_TOKEN=<your cli READ token>`

or with Docker:

```shell
model=meta-llama/Llama-2-7b-chat-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your cli READ token>

docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.3 --model-id $model
```
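
If you launch the container from a script, the command above can be assembled programmatically. This is a minimal sketch, not part of the project: `build_docker_command` is a hypothetical helper, and it assumes you prefer to pass the variable by name (`-e HUGGING_FACE_HUB_TOKEN`), which tells Docker to copy the value from the calling environment so the token itself does not appear in the command line.

```python
import os
import shlex


def build_docker_command(model, volume, image_tag="1.0.3", env=None):
    """Return the `docker run` argument list for serving a gated model."""
    env = os.environ if env is None else env
    if not env.get("HUGGING_FACE_HUB_TOKEN"):
        raise RuntimeError("HUGGING_FACE_HUB_TOKEN is not set")
    return [
        "docker", "run", "--gpus", "all", "--shm-size", "1g",
        # Name only: Docker forwards the value from the caller's environment.
        "-e", "HUGGING_FACE_HUB_TOKEN",
        "-p", "8080:80",
        "-v", f"{volume}:/data",
        f"ghcr.io/huggingface/text-generation-inference:{image_tag}",
        "--model-id", model,
    ]


# Illustration with a placeholder token; a real run would read os.environ.
command = build_docker_command(
    "meta-llama/Llama-2-7b-chat-hf",
    "/tmp/data",
    env={"HUGGING_FACE_HUB_TOKEN": "placeholder-token"},
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from a process that has the variable exported.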

### A note on Shared Memory (shm)

[`NCCL`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html) is a communication framework used by
`PyTorch` to do distributed training/inference. `text-generation-inference` makes
use of `NCCL` to enable Tensor Parallelism to dramatically speed up inference for large language models.

To share data between the different devices of an `NCCL` group, `NCCL` might fall back to using the host memory if
peer-to-peer communication over NVLink or PCI is not possible.

To allow the container to use 1 GB of shared memory and support SHM sharing, we add `--shm-size 1g` to the command above.

If you are running `text-generation-inference` inside `Kubernetes`, you can also add shared memory to the container by
creating a volume with:

```yaml
- name: shm
  emptyDir:
    medium: Memory
    sizeLimit: 1Gi
```

and mounting it to `/dev/shm`.

Finally, you can also disable SHM sharing by using the `NCCL_SHM_DISABLE=1` environment variable. However, note that
this will impact performance.
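
To confirm from inside the container that the shared memory setting took effect, you can compare the size of `/dev/shm` with the 1 GiB used above. This is only a sketch: the helper names are hypothetical, and the threshold simply mirrors `--shm-size 1g` and `sizeLimit: 1Gi` rather than any documented minimum.

```python
import shutil

REQUIRED_SHM_BYTES = 1 << 30  # 1 GiB, mirroring `--shm-size 1g` / `sizeLimit: 1Gi`


def shm_is_sufficient(total_bytes, required_bytes=REQUIRED_SHM_BYTES):
    """Return True when the shared memory mount is at least as large as required."""
    return total_bytes >= required_bytes


def check_shm(path="/dev/shm"):
    """Measure the shared memory mount and report whether it is large enough."""
    total = shutil.disk_usage(path).total
    return shm_is_sufficient(total), total
```

Calling `check_shm()` in a container started without `--shm-size` would typically report Docker's 64 MB default and return `False` as the first element.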

### Distributed Tracing

`text-generation-inference` is instrumented with distributed tracing using OpenTelemetry. You can use this feature
by setting the address to an OTLP collector with the `--otlp-endpoint` argument.
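
As an illustration, the sketch below assembles a launcher invocation with tracing switched on only when an endpoint is supplied. The helper `build_launcher_args` is hypothetical, and the endpoint shown is an assumed local collector address (4317 is the conventional OTLP gRPC port), not a value taken from this project.

```python
def build_launcher_args(model_id, otlp_endpoint=None):
    """Return launcher arguments, adding --otlp-endpoint only when one is given."""
    args = ["text-generation-launcher", "--model-id", model_id]
    if otlp_endpoint:
        # Assumed collector address; replace with your own OTLP endpoint.
        args += ["--otlp-endpoint", otlp_endpoint]
    return args


print(build_launcher_args("tiiuae/falcon-7b-instruct", "http://localhost:4317"))
```

Leaving `otlp_endpoint` unset yields the plain launcher command, so the same helper covers both traced and untraced deployments.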

### Local install

You can also opt to install `text-generation-inference` locally.

First [install Rust](https://rustup.rs/) and create a Python virtual environment with at least
Python 3.9, e.g. using `conda`:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

conda create -n text-generation-inference python=3.9
conda activate text-generation-inference
```

You may also need to install Protoc.

On Linux:

```shell
PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP
```

On macOS, using Homebrew:

```shell
brew install protobuf
```

Then run:

```shell
BUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels
make run-falcon-7b-instruct
```

**Note:** on some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:

```shell
sudo apt-get install libssl-dev gcc -y
```

### CUDA Kernels

The custom CUDA kernels are only tested on NVIDIA A100s. If you have any installation or runtime issues, you can disable
the kernels by setting the `DISABLE_CUSTOM_KERNELS=True` environment variable.

Be aware that the official Docker image has them enabled by default.

## Run Falcon

### Run

```shell
make run-falcon-7b-instruct
```

### Quantization

You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:

```shell
make run-falcon-7b-instruct-quantize
```

4-bit quantization is available using the [NF4 and FP4 data types from bitsandbytes](https://arxiv.org/pdf/2305.14314.pdf). It can be enabled by providing `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` as a command line argument to `text-generation-launcher`.
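
To get a feel for how much VRAM quantization saves, here is a back-of-the-envelope estimate of the memory taken by the weights alone. It is a rough sketch under stated assumptions: a "7B" model is taken to have exactly 7e9 parameters, plain bitsandbytes is assumed to store 8 bits per weight, and the overhead of quantization constants, activations and the key/value cache is ignored.

```python
def weight_memory_gib(num_parameters, bits_per_parameter):
    """Approximate memory needed to hold the weights, in GiB."""
    return num_parameters * bits_per_parameter / 8 / 2**30


# Assumed parameter count for a "7B" model; real checkpoints differ slightly.
NUM_PARAMETERS = 7e9

for label, bits in [("float16", 16), ("8-bit", 8), ("4-bit (NF4/FP4)", 4)]:
    print(f"{label:>16}: {weight_memory_gib(NUM_PARAMETERS, bits):.1f} GiB")
```

Under these assumptions the weights shrink from about 13.0 GiB in float16 to about 6.5 GiB at 8 bits and about 3.3 GiB at 4 bits; actual usage will be higher once runtime buffers are included.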

## Develop

```shell
make server-dev
make router-dev
```

## Testing

```shell
# python
make python-server-tests
make python-client-tests
# or both server and client tests
make python-tests
# rust cargo tests
make rust-tests
# integration tests
make integration-tests
```


## Other supported hardware

TGI is also supported on the following AI hardware accelerators:
- *Habana first-gen Gaudi and Gaudi2:* see [this guide](https://github.com/huggingface/optimum-habana/tree/main/text-generation-inference) on how to serve models with TGI on Gaudi and Gaudi2 using [Optimum Habana](https://huggingface.co/docs/optimum/habana/index)