## LLaMa Inference For TencentPretrain
This project provides LLaMa inference and microservice deployment based on [TencentPretrain](https://github.com/Tencent/TencentPretrain).
### Features
- __Int8 Inference__ Supports int8 inference via the bitsandbytes library, and adds batch inference, which the LM inference script in TencentPretrain lacks.
- __Optimized Inference__ Caches the keys and values in multi-head attention (a KV cache), so each decoding step only needs the newly generated token as input; see the sketch after this list.
- __LLM Multi-GPU Inference__ Supports tensor-parallel multi-GPU inference.
- __Microservices__ Supports a simple Flask microservice and a Gradio-based online demo.
- __LoRA Model Inference__ To be continued.

Note: CUDA is required.
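The KV cache mentioned above can be pictured with a minimal single-head sketch (illustrative only; the project's actual multi-head implementation lives in the TencentPretrain model code):

```python
import torch

class CachedAttention:
    """Single-head attention with a KV cache (illustrative sketch)."""

    def __init__(self, hidden_size):
        self.wq = torch.nn.Linear(hidden_size, hidden_size, bias=False)
        self.wk = torch.nn.Linear(hidden_size, hidden_size, bias=False)
        self.wv = torch.nn.Linear(hidden_size, hidden_size, bias=False)
        self.k_cache = None  # [batch, tokens_seen, hidden]
        self.v_cache = None

    def step(self, x):
        # x: [batch, 1, hidden] -- only the newly generated token.
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        if self.k_cache is None:
            self.k_cache, self.v_cache = k, v
        else:
            # Append the new key/value instead of recomputing the
            # whole prefix at every step.
            self.k_cache = torch.cat([self.k_cache, k], dim=1)
            self.v_cache = torch.cat([self.v_cache, v], dim=1)
        scores = q @ self.k_cache.transpose(1, 2) / (x.size(-1) ** 0.5)
        return torch.softmax(scores, dim=-1) @ self.v_cache
```

The prompt fills the cache once; every later step feeds in a single token and attends over the cached keys and values.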
### Requirements
* Python >= 3.7
* torch >= 1.9
* bitsandbytes
* argparse
### Input Parameters
* __--load_model_path__ (Required) Path to the pretrained model weights; fp16 by default.
* __--test_path__ (Required) Path to the input prompts, one prompt per line.
* __--prediction_path__ (Required) Path where the results are saved.
* __--config_path__ (Required) Path to the config file with the model hyper-parameters.
* __--spm_model_path__ (Required) Path to the model tokenizer.
* __--batch_size__ (Optional) Defaults to 1. Suggestion: keep it consistent with the number of input prompts.
* __--seq_length__ (Optional) Defaults to 128. Total sequence length, i.e., the length of the input prompt plus the generated text.
* __--world_size__ (Optional) Defaults to 1. The number of GPUs for tensor-parallel inference.
* __--use_int8__ (Optional) Defaults to False. Whether to run inference in int8.
* __--top_k__ (Optional) Defaults to 40.
* __--top_p__ (Optional) Defaults to 0.95.
* __--temperature__ (Optional) Defaults to 0.8.
* __--repetition_penalty_range__ (Optional) Defaults to 1024.
* __--repetition_penalty_slope__ (Optional) Defaults to 0.
* __--repetition_penalty__ (Optional) Defaults to 1.15. A sketch of how the repetition-penalty flags interact follows this list.
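The three repetition-penalty flags follow the common ranged-penalty scheme: tokens that appeared within the last `repetition_penalty_range` generated tokens are penalized, and `repetition_penalty_slope` ramps the penalty from weak (oldest) to full strength (newest). A minimal sketch of the flat case (slope = 0), assuming the usual CTRL-style formula; the project's exact implementation may differ:

```python
import torch

def apply_repetition_penalty(logits, generated_ids,
                             penalty=1.15, penalty_range=1024):
    # Sketch only: penalize tokens seen in the last `penalty_range`
    # generated tokens. With a non-zero slope the penalty would ramp
    # up toward the most recent tokens.
    recent = generated_ids[-penalty_range:]
    for token_id in set(recent.tolist()):
        score = logits[token_id]
        # CTRL-style: divide positive logits, multiply negative ones,
        # so the repeated token always becomes less likely.
        logits[token_id] = score / penalty if score > 0 else score * penalty
    return logits
```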
### Quick Start
#### FP16/Int8 Inference
fp16 inference:
```commandline
python llama_infer.py --test_path ./prompts.txt --prediction_path ./result.txt \
--load_model_path xxx.bin \
--config_path ./config/llama_7b_config.json \
--spm_model_path ./tokenizer.model
```
int8 inference:
```commandline
python llama_infer.py --test_path ./prompts.txt --prediction_path ./result.txt \
--load_model_path xxx.bin --use_int8 \
--config_path ./config/llama_7b_config.json \
--spm_model_path ./tokenizer.model
```
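Prompts are read from `--test_path` as plain text, one prompt per line, and the generated continuations are written to `--prediction_path` in the same order. A hypothetical `prompts.txt`:

```
The capital of France is
Write a short poem about the sea.
```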
#### Multi-round chat
Optional parameter: `--keep_length_ratio`, the ratio of conversation context to keep when the history grows too long (see the sketch after the command below).
Entering 'clear' starts a new round of chat, and 'exit' quits the chat.
```commandline
python llama_dialogue.py --load_model_path xxxx.bin \
--config_path config.json \
--spm_model_path tokenizer.model \
--world_size 2
```
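One plausible reading of `keep_length_ratio`, sketched under the assumption that it trims the oldest part of the dialogue history once the context outgrows `seq_length` (the actual `llama_dialogue.py` logic may differ):

```python
def truncate_history(token_ids, seq_length, keep_length_ratio=0.5):
    # Assumed behavior: once the accumulated dialogue exceeds the
    # model's seq_length, keep only the most recent fraction of it.
    if len(token_ids) > seq_length:
        keep = int(seq_length * keep_length_ratio)
        token_ids = token_ids[-keep:]
    return token_ids
```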
#### Gradio server
Gradio must be installed first:
```commandline
pip install gradio
python llama_gradio.py --load_model_path xxxx.bin \
--config_path config.json \
--spm_model_path tokenizer.model
```
Then open http://127.0.0.1:7860/ in your browser.
#### Microservices deployment
Flask must be installed first:
```commandline
pip install flask
python llama_server.py --load_model_path xxxx.bin \
--config_path config.json \
--spm_model_path tokenizer.model
```
curl command:
```commandline
curl -H 'Content-Type: application/json' http://127.0.0.1:8888/chat -d '{"question": "xxx"}'
```
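The same endpoint can also be called from Python. A minimal client, assuming only the request shape shown in the curl example (the response schema is not documented here, so the raw JSON is printed):

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8888/chat",
    json={"question": "What is the capital of France?"},
)
print(resp.json())
```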
#### Multi-GPU Inference
The `tensor_parallel` library must be installed first.
Set `--world_size` to the number of GPUs (GPU IDs start from 0):
```commandline
pip install tensor_parallel
python llama_infer.py --test_path ./prompts.txt --prediction_path ./result.txt \
--load_model_path xxxx.bin \
--config_path config.json \
--spm_model_path tokenizer.model \
--world_size 2
```
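Under the hood, the `tensor_parallel` library shards a module's weights across GPUs. A hedged sketch of its API, with a toy model standing in for LLaMa (llama_infer.py does the equivalent internally when `--world_size` > 1):

```python
import torch
import tensor_parallel as tp

# Toy stand-in for the LLaMa model; requires two visible GPUs to run.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 2048),
    torch.nn.ReLU(),
    torch.nn.Linear(2048, 512),
)
# Shard the weights across two devices, mirroring --world_size 2.
model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])
out = model(torch.randn(1, 512, device="cuda:0"))
```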