## LLaMa Inference for TencentPretrain

This project supports LLaMa inference and microservice deployment based on [TencentPretrain](https://github.com/Tencent/TencentPretrain).
### Features

- __Int8 Inference__ Supports int8 inference via the bitsandbytes library, and adds batch inference on top of the LM inference script in TencentPretrain.
- __Optimized Inference__ Adds a key/value cache to multi-head attention, so each decoding step only needs the newly generated token as input (see the sketch after this list).
- __LLM Multi-GPU Inference__ Supports tensor-parallel multi-GPU inference.
- __Microservices__ Supports a simple Flask microservice and a Gradio-based online demo.
- __LoRA Model Inference__ To be continued.

Tip: CUDA is required.
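For illustration, a minimal sketch of the key/value caching idea (the names and shapes below are illustrative, not the actual TencentPretrain code):

```python
import torch

def attend_with_cache(q, k_new, v_new, cache=None):
    """One decoding step: append the new token's key/value to the cache,
    then attend over the whole cached sequence.
    q, k_new, v_new: (batch, heads, 1, head_dim)."""
    if cache is None:
        k, v = k_new, v_new
    else:
        # Reuse keys/values computed in earlier steps instead of
        # re-running the projections over the full prefix.
        k = torch.cat([cache[0], k_new], dim=2)
        v = torch.cat([cache[1], v_new], dim=2)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    out = torch.softmax(scores, dim=-1) @ v
    return out, (k, v)  # updated cache for the next step
```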
### Requirements

* Python >= 3.7
* torch >= 1.9
* bitsandbytes
* argparse
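For example (argparse ships with the Python standard library, so only the other packages need installing):

```commandline
pip install "torch>=1.9" bitsandbytes
```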
### Input Parameters

* __--load_model_path__ (Required) Path to the pretrained model, fp16 by default.
* __--test_path__ (Required) Input prompts, one prompt per line (see the example after this list).
* __--prediction_path__ (Required) Path where results are saved.
* __--config_path__ (Required) File of model hyper-parameters (JSON config).
* __--spm_model_path__ (Required) Path to the model tokenizer.
* __--batch_size__ (Optional) Defaults to 1. Suggestion: keep it consistent with the input.
* __--seq_length__ (Optional) Defaults to 128. Total length of generated content, i.e. the length of the input plus the generated sentence.
* __--world_size__ (Optional) Defaults to 1. Number of GPUs for tensor-parallel inference.
* __--use_int8__ (Optional) Defaults to False. Whether to use int8 for inference.
* __--top_k__ (Optional) Defaults to 40.
* __--top_p__ (Optional) Defaults to 0.95.
* __--temperature__ (Optional) Defaults to 0.8.
* __--repetition_penalty_range__ (Optional) Defaults to 1024.
* __--repetition_penalty_slope__ (Optional) Defaults to 0.
* __--repetition_penalty__ (Optional) Defaults to 1.15.
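For reference, `--test_path` expects a plain-text file with one prompt per line, for example (the prompts below are made up):

```
Tell me about the history of the Great Wall.
Write a short poem about the sea.
```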
### Quick Start

#### FP16/Int8 Inference

FP16 inference:

```commandline
python llama_infer.py --test_path ./prompts.txt --prediction_path ./result.txt \
                      --load_model_path xxx.bin \
                      --config_path ./config/llama_7b_config.json \
                      --spm_model_path ./tokenizer.model
```

Int8 inference:

```commandline
python llama_infer.py --test_path ./prompts.txt --prediction_path ./result.txt \
                      --load_model_path xxx.bin --use_int8 \
                      --config_path ./config/llama_7b_config.json \
                      --spm_model_path ./tokenizer.model
```
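As a rough sketch of what `--use_int8` typically involves (the function below follows the common bitsandbytes pattern and is illustrative, not the actual llama_infer.py code):

```python
import torch.nn as nn
import bitsandbytes as bnb

def quantize_linear_layers(model):
    """Replace fp16 nn.Linear layers with bitsandbytes 8-bit layers
    (illustrative sketch of the usual bitsandbytes recipe)."""
    for name, module in model.named_children():
        if isinstance(module, nn.Linear):
            int8_layer = bnb.nn.Linear8bitLt(
                module.in_features, module.out_features,
                bias=module.bias is not None,
                has_fp16_weights=False,  # store weights as int8
            )
            int8_layer.weight = bnb.nn.Int8Params(
                module.weight.data, requires_grad=False
            )
            if module.bias is not None:
                int8_layer.bias = module.bias
            setattr(model, name, int8_layer)
        else:
            quantize_linear_layers(module)  # recurse into submodules
    return model
```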
#### Multi-round Chat

Optional parameter: `keep_length_ratio`, the fraction of the dialogue context to keep. Entering 'clear' starts a new round of chat; 'exit' quits the chat.

```commandline
python llama_dialogue.py --load_model_path xxxx.bin \
                         --config_path config.json \
                         --spm_model_path tokenizer.model \
                         --world_size 2
```
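One plausible reading of `keep_length_ratio` (an illustrative sketch under that assumption, not the actual llama_dialogue.py logic):

```python
def truncate_context(token_ids, seq_length, keep_length_ratio):
    """When the chat history outgrows the sequence budget, keep only the
    most recent fraction of it (illustrative ratio-based truncation)."""
    budget = int(seq_length * keep_length_ratio)
    if len(token_ids) > budget:
        token_ids = token_ids[-budget:]  # drop the oldest context first
    return token_ids
```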
#### Gradio Server

Gradio needs to be installed first:

```commandline
pip install gradio
python llama_gradio.py --load_model_path xxxx.bin \
                       --config_path config.json \
                       --spm_model_path tokenizer.model
```

Then open http://127.0.0.1:7860/ in a browser.
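The script presumably follows the standard Gradio pattern, roughly as below (the `generate` function is a stand-in for the model call):

```python
import gradio as gr

def generate(prompt):
    # Stand-in: the real script would run the loaded LLaMa model here.
    return "model output for: " + prompt

# Bind a text box to the generation function and serve it on port 7860.
demo = gr.Interface(fn=generate, inputs="text", outputs="text")
demo.launch(server_name="127.0.0.1", server_port=7860)
```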
#### Microservice Deployment

Flask needs to be installed first:

```commandline
pip install flask
python llama_server.py --load_model_path xxxx.bin \
                       --config_path config.json \
                       --spm_model_path tokenizer.model
```

Example curl command:

```commandline
curl -H 'Content-Type: application/json' http://127.0.0.1:8888/chat -d '{"question": "xxx"}'
```
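The same request can be sent from Python (assuming the endpoint returns a JSON body; the exact response schema depends on llama_server.py):

```python
import requests

# Equivalent to the curl command above.
resp = requests.post(
    "http://127.0.0.1:8888/chat",
    json={"question": "xxx"},  # "xxx" stands for the user's question
)
print(resp.json())
```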
#### Multi-GPU Inference

The tensor_parallel package needs to be installed. Set `--world_size` to the number of GPUs (GPU ids start from 0).

```commandline
pip install tensor_parallel
python llama_infer.py --test_path ./prompts.txt --prediction_path ./result.txt \
                      --load_model_path xxxx.bin \
                      --config_path config.json \
                      --spm_model_path tokenizer.model \
                      --world_size 2
```
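For context, the tensor_parallel package shards a module's weights across devices with a one-line wrapper, roughly as below (a toy model stands in for LLaMa; the actual wiring lives in llama_infer.py):

```python
import torch.nn as nn
import tensor_parallel as tp

# Toy stand-in for the LLaMa model.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

# Shard the weights across two GPUs; forward passes then run the shards
# in parallel and gather the outputs automatically.
model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])
```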