<div align="center">

<a href="https://www.youtube.com/watch?v=jlMAX2Oaht0">
  <img width=560 height=315 alt="Making TGI deployment optimal" src="https://huggingface.co/datasets/Narsil/tgi_assets/resolve/main/thumbnail.png">
</a>

# Text Generation Inference

<a href="https://github.com/huggingface/text-generation-inference">
  <img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/huggingface/text-generation-inference?style=social">
</a>
<a href="https://huggingface.github.io/text-generation-inference">
  <img alt="Swagger API documentation" src="https://img.shields.io/badge/API-Swagger-informational">
</a>

A Rust, Python and gRPC server for text generation inference. Used in production at [Hugging Face](https://huggingface.co)
to power Hugging Chat, the Inference API and Inference Endpoints.

</div>

## Table of contents

- [Get Started](#get-started)
  - [API Documentation](#api-documentation)
  - [Using a private or gated model](#using-a-private-or-gated-model)
  - [A note on Shared Memory](#a-note-on-shared-memory-shm)
  - [Distributed Tracing](#distributed-tracing)
  - [Architecture](#architecture)
  - [Local Install](#local-install)
- [Optimized architectures](#optimized-architectures)
- [Run locally](#run-locally)
  - [Run](#run)
  - [Quantization](#quantization)
- [Develop](#develop)
- [Testing](#testing)

Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and [more](https://huggingface.co/docs/text-generation-inference/supported_models). TGI implements many features, such as:

- Simple launcher to serve most popular LLMs
- Production ready (distributed tracing with OpenTelemetry, Prometheus metrics)
- Tensor Parallelism for faster inference on multiple GPUs
- Token streaming using Server-Sent Events (SSE)
- Continuous batching of incoming requests for increased total throughput
- Optimized transformers code for inference using [Flash Attention](https://github.com/HazyResearch/flash-attention) and [Paged Attention](https://github.com/vllm-project/vllm) on the most popular architectures
- Quantization with:
  - [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
  - [GPTQ](https://arxiv.org/abs/2210.17323)
  - [EETQ](https://github.com/NetEase-FuXi/EETQ)
  - [AWQ](https://github.com/casper-hansen/AutoAWQ)
- [Safetensors](https://github.com/huggingface/safetensors) weight loading
- Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
- Logits warper (temperature scaling, top-p, top-k, repetition penalty; for more details, see [transformers.LogitsProcessor](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.LogitsProcessor))
- Stop sequences
- Log probabilities
- [Speculation](https://huggingface.co/docs/text-generation-inference/conceptual/speculation) for roughly 2x lower latency
- [Guidance/JSON](https://huggingface.co/docs/text-generation-inference/conceptual/guidance): specify an output format to speed up inference and ensure the output is valid according to a given spec
- Custom Prompt Generation: Easily generate text by providing custom prompts to guide the model's output
- Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance

### Hardware support

- [Nvidia](https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference)
- [AMD](https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference) (-rocm)
- [Inferentia](https://github.com/huggingface/optimum-neuron/tree/main/text-generation-inference)
- [Intel GPU](https://github.com/huggingface/text-generation-inference/pull/1475)
- [Gaudi](https://github.com/huggingface/tgi-gaudi)
- [Google TPU](https://huggingface.co/docs/optimum-tpu/howto/serving)

## Get Started

### Docker

For a detailed starting guide, please see the [Quick Tour](https://huggingface.co/docs/text-generation-inference/quicktour). The easiest way of getting started is using the official Docker container:

```shell
model=HuggingFaceH4/zephyr-7b-beta
# share a volume with the Docker container to avoid downloading weights every run
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.1.1 --model-id $model
```

You can then make requests like this:

```bash
curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```
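
The `/generate_stream` route answers with Server-Sent Events. As a minimal sketch of consuming that stream from Python, the helper below extracts the token text from each event. It assumes every event is a `data:` line carrying a JSON object with the token text under `token.text`; the function name and the canned sample are illustrative only, so check the OpenAPI documentation for the exact schema.

```python
import json
from typing import Iterable, Iterator


def iter_stream_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the text of each token found in an SSE response body.

    Assumption: each event is a line of the form `data:{...json...}` and the
    token text lives under ["token"]["text"].
    """
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        event = json.loads(line[len("data:"):])
        yield event["token"]["text"]


# Canned sample standing in for a real response body (illustrative only).
sample = [
    b'data:{"token":{"text":"Deep"}}\n',
    b"\n",
    b'data:{"token":{"text":" learning"}}\n',
]
print("".join(iter_stream_tokens(sample)))
```

Against a live server, the same helper could be fed the response object returned by `urllib.request.urlopen`, which iterates over the body line by line.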

**Note:** To use NVIDIA GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). We also recommend using NVIDIA drivers with CUDA version 12.2 or higher. To run the Docker container on a machine with no GPUs or CUDA support, remove the `--gpus all` flag and add `--disable-custom-kernels`. Note that CPU is not the intended platform for this project, so performance might be subpar.

**Note:** TGI supports AMD Instinct MI210 and MI250 GPUs. Details can be found in the [Supported Hardware documentation](https://huggingface.co/docs/text-generation-inference/supported_models#supported-hardware). To use AMD GPUs, please use `docker run --device /dev/kfd --device /dev/dri --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.1.1-rocm --model-id $model` instead of the command above.

To see all options to serve your models (in the [code](https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs) or in the CLI):
```
text-generation-launcher --help
```

### API documentation

You can consult the OpenAPI documentation of the `text-generation-inference` REST API using the `/docs` route.
The Swagger UI is also available at: [https://huggingface.github.io/text-generation-inference](https://huggingface.github.io/text-generation-inference).
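
For a request without streaming, a minimal sketch using only the Python standard library could look like the following. It assumes the server from the Docker example is listening on `127.0.0.1:8080` and that the `/generate` route returns a JSON body with a `generated_text` field; `build_generate_request` is a hypothetical helper, not part of TGI.

```python
import json
import urllib.request


def build_generate_request(
    base_url: str, prompt: str, max_new_tokens: int = 20
) -> urllib.request.Request:
    """Build a POST request for the (assumed) non-streaming /generate route."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request("http://127.0.0.1:8080", "What is Deep Learning?")

# Sending the request needs a running server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["generated_text"])
```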

### Using a private or gated model

Set the `HF_TOKEN` environment variable to configure the token used by
`text-generation-inference`. This gives it access to protected resources.

For example, if you want to serve the gated Llama 2 model variants:

1. Go to https://huggingface.co/settings/tokens
2. Copy your CLI READ token
3. Export `HF_TOKEN=<your CLI READ token>`

or with Docker:

```shell
model=meta-llama/Llama-2-7b-chat-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your CLI READ token>

docker run --gpus all --shm-size 1g -e HF_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.1.1 --model-id $model
```
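
To keep the token out of scripts, the Docker invocation above can also be assembled programmatically. The sketch below is one way to do that in Python; `docker_command` is a hypothetical helper, and it only adds `-e HF_TOKEN=...` when a token is actually available.

```python
import os
from typing import List, Optional

# Image tag taken from the Docker quick start earlier in this README.
IMAGE = "ghcr.io/huggingface/text-generation-inference:2.1.1"


def docker_command(model: str, volume: str, token: Optional[str] = None) -> List[str]:
    """Build the argv for `docker run`, forwarding HF_TOKEN only when set."""
    command = ["docker", "run", "--gpus", "all", "--shm-size", "1g"]
    if token:
        command += ["-e", f"HF_TOKEN={token}"]
    command += ["-p", "8080:80", "-v", f"{volume}:/data", IMAGE, "--model-id", model]
    return command


command = docker_command(
    model="meta-llama/Llama-2-7b-chat-hf",
    volume=os.path.join(os.getcwd(), "data"),
    token=os.environ.get("HF_TOKEN"),  # read from the environment, never hard-coded
)

# Launching needs Docker and a GPU, so it is left commented out:
# import subprocess; subprocess.run(command, check=True)
```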

### A note on Shared Memory (shm)

[`NCCL`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html) is a communication framework used by
`PyTorch` to do distributed training/inference. `text-generation-inference` makes
use of `NCCL` to enable Tensor Parallelism to dramatically speed up inference for large language models.

In order to share data between the different devices of an `NCCL` group, `NCCL` might fall back to using the host memory if
peer-to-peer communication using NVLink or PCI is not possible.

To allow the container to use 1G of Shared Memory and support SHM sharing, we add `--shm-size 1g` to the command above.

If you are running `text-generation-inference` inside `Kubernetes`, you can also add Shared Memory to the container by
creating a volume with:

```yaml
- name: shm
  emptyDir:
    medium: Memory
    sizeLimit: 1Gi
```

and mounting it to `/dev/shm`.
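
Putting the volume and its mount together, a pod manifest could be generated as follows. This is only a sketch: the pod and container names are hypothetical, the image tag and model are the ones from the Docker example, GPU resource requests are omitted, and the manifest is emitted as JSON, which `kubectl apply -f` accepts alongside YAML.

```python
import json

# Memory-backed volume, as described above.
shm_volume = {"name": "shm", "emptyDir": {"medium": "Memory", "sizeLimit": "1Gi"}}

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "tgi"},  # hypothetical name
    "spec": {
        "containers": [
            {
                "name": "tgi",  # hypothetical name
                "image": "ghcr.io/huggingface/text-generation-inference:2.1.1",
                "args": ["--model-id", "HuggingFaceH4/zephyr-7b-beta"],
                # Mount the volume at /dev/shm so NCCL can use it for SHM sharing.
                "volumeMounts": [{"name": shm_volume["name"], "mountPath": "/dev/shm"}],
            }
        ],
        "volumes": [shm_volume],
    },
}

print(json.dumps(pod, indent=2))
```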

Finally, you can also disable SHM sharing by using the `NCCL_SHM_DISABLE=1` environment variable. However, note that
this will impact performance.

### Distributed Tracing

`text-generation-inference` is instrumented with distributed tracing using OpenTelemetry. You can use this feature
by setting the address to an OTLP collector with the `--otlp-endpoint` argument. The default service name can be
overridden with the `--otlp-service-name` argument.
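
As an illustration, the two flags can be appended to the launcher command like this. The collector address `http://localhost:4317` (the conventional OTLP gRPC port) and the service name `tgi-demo` are placeholder assumptions, and `launcher_command` is a hypothetical helper.

```python
import shlex
from typing import List, Optional


def launcher_command(
    model_id: str,
    otlp_endpoint: Optional[str] = None,
    otlp_service_name: Optional[str] = None,
) -> List[str]:
    """Build the launcher argv, adding the tracing flags only when given."""
    command = ["text-generation-launcher", "--model-id", model_id]
    if otlp_endpoint:
        command += ["--otlp-endpoint", otlp_endpoint]
    if otlp_service_name:
        command += ["--otlp-service-name", otlp_service_name]
    return command


command = launcher_command(
    "mistralai/Mistral-7B-Instruct-v0.2",
    otlp_endpoint="http://localhost:4317",  # placeholder collector address
    otlp_service_name="tgi-demo",  # placeholder service name
)
print(shlex.join(command))
```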

### Architecture

![TGI architecture](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/TGI.png)

### Local install

You can also opt to install `text-generation-inference` locally.

First [install Rust](https://rustup.rs/) and create a Python virtual environment with at least
Python 3.9, e.g. using `conda`:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

conda create -n text-generation-inference python=3.11
conda activate text-generation-inference
```

You may also need to install Protoc.

On Linux:

```shell
PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP
```

On macOS, using Homebrew:

```shell
brew install protobuf
```

Then run:

```shell
BUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2
```

**Note:** On some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:

```shell
sudo apt-get install libssl-dev gcc -y
```

## Optimized architectures

TGI works out of the box, serving optimized implementations of all modern models. They can be found in [this list](https://huggingface.co/docs/text-generation-inference/supported_models).

Other architectures are supported on a best-effort basis using:

`AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")`

or

`AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")`



## Run locally

### Run

```shell
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2
```

### Quantization

You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:

```shell
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize bitsandbytes
```

4-bit quantization is available using the [NF4 and FP4 data types from bitsandbytes](https://arxiv.org/pdf/2305.14314.pdf). It can be enabled by providing `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` as a command line argument to `text-generation-launcher`.
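
To get a feel for the savings, the memory needed for the weights alone is roughly the parameter count times the bytes per weight. The sketch below works this out for a hypothetical model with exactly 7 billion parameters; it ignores the KV cache, activations and quantization metadata, so real usage is higher. It also assumes the plain `bitsandbytes` mode stores weights in 8 bits.

```python
def weight_memory_gb(num_parameters: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1e9


params = 7e9  # hypothetical round figure for a "7B" model

for label, bits in [
    ("float16 (no quantization)", 16),
    ("8-bit (assumed for bitsandbytes)", 8),
    ("4-bit (bitsandbytes-nf4 / bitsandbytes-fp4)", 4),
]:
    print(f"{label}: ~{weight_memory_gb(params, bits):.1f} GB")
```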

## Develop

```shell
make server-dev
make router-dev
```

## Testing

```shell
# python
make python-server-tests
make python-client-tests
# or both server and client tests
make python-tests
# rust cargo tests
make rust-tests
# integration tests
make integration-tests
```