# Text Generation Inference

<div align="center">

![architecture](assets/architecture.jpg)

</div>

A Rust and gRPC server for text generation inference. Used in production at [HuggingFace](https://huggingface.co)
to power the BLOOM, BLOOMZ and MT0-XXL api-inference widgets.

## Features

- [Dynamic batching of incoming requests](https://github.com/huggingface/text-generation-inference/blob/main/router/src/batcher.rs#L88) for increased total throughput (a toy sketch of the idea follows this list)
- Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
- [Safetensors](https://github.com/huggingface/safetensors) weight loading
- 45ms per token generation for BLOOM with 8xA100 80GB
- Logits warpers (temperature scaling, top-k, repetition penalty, ...)
- Stop sequences
- Log probabilities
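
The production batching logic is the Rust code linked in the first item above; purely as an illustration of the idea (and not the actual implementation), here is a minimal Python/asyncio sketch in which `fake_generate` stands in for a batched model forward pass:

```python
# Illustration only: the real batcher is the Rust code linked in the Features list.
# Requests enqueue their prompts; one loop drains whatever is waiting and runs a
# single batched "forward pass" (faked here) for all of them.
import asyncio


def fake_generate(prompts):
    # Stand-in for a batched model forward pass.
    return [prompt + " ...generated" for prompt in prompts]


async def handle_request(queue, prompt):
    # Each incoming request enqueues its prompt and waits for its own result.
    future = asyncio.get_running_loop().create_future()
    await queue.put((prompt, future))
    return await future


async def batching_loop(queue):
    while True:
        batch = [await queue.get()]      # wait for at least one request
        while not queue.empty():         # drain everything that arrived meanwhile
            batch.append(queue.get_nowait())
        outputs = fake_generate([prompt for prompt, _ in batch])
        for (_, future), output in zip(batch, outputs):
            future.set_result(output)


async def main():
    queue = asyncio.Queue()
    batcher = asyncio.create_task(batching_loop(queue))
    print(await asyncio.gather(*(handle_request(queue, f"prompt {i}") for i in range(4))))
    batcher.cancel()


asyncio.run(main())
```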

## Officially supported models

- [BLOOM](https://huggingface.co/bigscience/bloom)
- [BLOOMZ](https://huggingface.co/bigscience/bloomz)
- [MT0-XXL](https://huggingface.co/bigscience/mt0-xxl)
- ~~[Galactica](https://huggingface.co/facebook/galactica-120b)~~ (deactivated)
- [SantaCoder](https://huggingface.co/bigcode/santacoder)
- [GPT-Neox 20B](https://huggingface.co/EleutherAI/gpt-neox-20b): use `--revision pr/13`

Other models are supported on a best-effort basis using:

`AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")`

or

`AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")`
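
As a sketch of that fallback path (assuming `transformers` and `accelerate` are installed; the model id below is only a placeholder for a model outside the list above):

```python
# Best-effort path: any model that loads this way can in principle be served.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder: any causal LM repo id on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Testing API", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```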

## Load Tests for BLOOM

See `k6/load_test.js`

|                                                              | avg       | min          | med       | max        | p(90)     | p(95)     | RPS      |
|--------------------------------------------------------------|-----------|--------------|-----------|------------|-----------|-----------|----------|
| [Original code](https://github.com/huggingface/transformers_bloom_parallel) | 8.9s      | 1s           | 9.12s     | 16.69s     | 13.7s     | 14.26s    | 5.9      |
| New batching logic                                           | **5.44s** | **959.53ms** | **5.28s** | **13.12s** | **7.78s** | **8.92s** | **9.08** |

## Install

```shell
make install
```

## Run 

### BLOOM 560-m

```shell
make run-bloom-560m
```

### BLOOM

First you need to download the weights:

```shell
make download-bloom
```

```shell
make run-bloom # Requires 8xA100 80GB
```

You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:

```shell
make run-bloom-quantize # Requires 8xA100 40GB
```

## Test

```shell
curl 127.0.0.1:3000/generate \
    -v \
    -X POST \
    -d '{"inputs":"Testing API","parameters":{"max_new_tokens":9}}' \
    -H 'Content-Type: application/json'
```
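
The same request from Python (assuming the `requests` package is installed and the server is listening on `127.0.0.1:3000` as above):

```python
# Mirrors the curl call above.
import requests

response = requests.post(
    "http://127.0.0.1:3000/generate",
    json={"inputs": "Testing API", "parameters": {"max_new_tokens": 9}},
)
response.raise_for_status()
print(response.json())
```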

## Develop

```shell
make server-dev
make router-dev
```