# LLM Text Generation Inference

<div align="center">

![architecture](assets/architecture.jpg)

</div>

A Rust and gRPC server for text generation inference with large language models.

## Load Tests for BLOOM

See `k6/load_test.js`.
We send the default examples with a 1-second delay between requests.

Stages: 
- Ramp up to 50 virtual users (VUs) in 1 min
- Ramp up from 50 to 100 VUs in 2 min
- Ramp down to 0 VUs in 1 min
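
The stages above describe a ramp profile over 4 minutes in total. As a rough illustration of how the target load evolves, here is a minimal Python sketch that linearly interpolates the target number of virtual users between stage boundaries. It is not taken from `k6/load_test.js`; the names `STAGES` and `target_vus` are hypothetical, and the starting value of 0 virtual users is an assumption.

```python
# Stages from the list above as (duration_seconds, target_vus) pairs.
STAGES = [(60, 50), (120, 100), (60, 0)]


def target_vus(t, stages=STAGES, start_vus=0):
    """Return the linearly interpolated VU target at t seconds into the test.

    start_vus=0 is an assumption; the real script may start from another value.
    """
    previous = start_vus
    elapsed = 0
    for duration, target in stages:
        if t <= elapsed + duration:
            # Fraction of the current stage that has already elapsed.
            fraction = (t - elapsed) / duration
            return previous + (target - previous) * fraction
        elapsed += duration
        previous = target
    # Past the last stage: hold the final target.
    return previous


total_duration = sum(duration for duration, _ in STAGES)
```

For example, `target_vus(120)` falls halfway through the second stage, between 50 and 100 virtual users.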


| Implementation                                               | avg       | min          | med       | max        | p(90)     | p(95)     | RPS      |
|--------------------------------------------------------------|-----------|--------------|-----------|------------|-----------|-----------|----------|
| [Original code](https://github.com/huggingface/transformers_bloom_parallel) | 8.9s      | 1s           | 9.12s     | 16.69s     | 13.7s     | 14.26s    | 5.9      |
| ISO with original code                                       | 8.88s     | **959.53ms** | 8.89s     | 17.08s     | 13.34s    | 14.12s    | 5.94     |
| New batching logic                                           | **5.44s** | 1.27s        | **5.28s** | **13.12s** | **7.78s** | **8.92s** | **9.08** |
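
To read the table as relative changes, the following minimal Python sketch compares the "New batching logic" row against the "Original code" row. The values are copied from the table; the names `RESULTS` and `relative_change` are hypothetical helpers introduced only for this illustration.

```python
# Values copied from the table above (latencies in seconds, RPS in requests per second).
RESULTS = {
    "Original code": {"avg": 8.9, "med": 9.12, "p(90)": 13.7, "p(95)": 14.26, "RPS": 5.9},
    "New batching logic": {"avg": 5.44, "med": 5.28, "p(90)": 7.78, "p(95)": 8.92, "RPS": 9.08},
}


def relative_change(baseline, new):
    """Signed relative change of `new` with respect to `baseline`."""
    return (new - baseline) / baseline


baseline = RESULTS["Original code"]
batched = RESULTS["New batching logic"]

# Negative values mean lower latency; a positive RPS value means higher throughput.
changes = {
    metric: relative_change(baseline[metric], batched[metric]) for metric in baseline
}

for metric, change in changes.items():
    print(f"{metric}: {change:+.1%}")
```

With these numbers, the new batching logic shows roughly 39% lower average latency and roughly 54% higher throughput than the original code.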

## Install

```shell
make install
```

## Run 

```shell
make run-bloom-560m
```

## Test

```shell
curl 127.0.0.1:3000/generate \
    -v \
    -X POST \
    -d '{"inputs":"Testing API","parameters":{"max_new_tokens":9}}' \
    -H 'Content-Type: application/json'
```
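
The same request can be sent from a script. Below is a minimal Python sketch using only the standard library; it mirrors the URL, JSON body, and header of the `curl` command above. The helper names `build_generate_request` and `generate` are hypothetical, and since the response schema is not documented here, the sketch returns the raw response body as text rather than assuming specific fields.

```python
import json
import urllib.request

# Address used by the curl example above; adjust if the server listens elsewhere.
GENERATE_URL = "http://127.0.0.1:3000/generate"


def build_generate_request(inputs, max_new_tokens=9, url=GENERATE_URL):
    """Build (but do not send) a POST request equivalent to the curl example."""
    payload = {"inputs": inputs, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(inputs, max_new_tokens=9, timeout=60):
    """Send the request and return the raw response body as text.

    The response format is not assumed here; parse it as needed.
    """
    request = build_generate_request(inputs, max_new_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example (requires the server started in the "Run" section):
# print(generate("Testing API"))
```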

## Develop

```shell
make server-dev
make router-dev
```

## TODO:

- [ ] Add tests for the `server/model` logic
- [ ] Backport custom CUDA kernels to Transformers
- [ ] Install safetensors with pip