Commit b856c4ec authored by zhaoying1

added llama_inference_pytorch
# LLaMa Inference Based on TencentPretrain
## Model Introduction
LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. The models are trained on trillions of tokens and show that state-of-the-art models can be trained using publicly available datasets exclusively, without relying on proprietary and inaccessible datasets.
## Model Architecture
The LLaMA network is based on the Transformer architecture, incorporating various improvements that were proposed for and used in other models such as PaLM. The main differences from the original architecture are:
- Pre-normalization: to improve training stability, the input of each transformer sub-layer is normalized instead of the output, using the RMSNorm normalization function.
- SwiGLU activation function [PaLM]: the ReLU non-linearity is replaced with the SwiGLU activation function to improve performance, using a dimension of 2/3 * 4d instead of the 4d used in PaLM.
- Rotary embeddings: absolute position embeddings are removed and rotary position embeddings (RoPE) are added at every layer of the network.

The main network configuration parameters of llama-7B are as follows:
```
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_position_embeddings": 2048,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"rms_norm_eps": 1e-06,
"vocab_size": 32000
```
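As a quick sanity check of the 2/3 * 4d feed-forward rule mentioned above, the snippet below reproduces the intermediate_size value in this config; the rounding to a multiple of 256 is an assumption taken from the reference LLaMA implementation (multiple_of=256):
```
# Rough check of the SwiGLU feed-forward width for llama-7B (hidden_size d = 4096).
# LLaMA uses 2/3 * 4d instead of PaLM's 4d, rounded up to a multiple of 256.
d = 4096
raw = 2 * 4 * d / 3                      # 2/3 * 4d = 10922.67
multiple_of = 256
intermediate = multiple_of * ((int(raw) + multiple_of - 1) // multiple_of)
print(int(raw), intermediate)            # 10922, 11008 -> matches intermediate_size above
```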
# LLaMA Inference
## Environment Setup
Running inside Docker is recommended; a Docker image pulled from [光源](https://www.sourcefind.cn/) is provided:
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/vscode-pytorch:1.10.0-centos7.6-dtk-22.10-py37-latest
```
Install the dependencies that are not included in the Docker image:
```
pip install tensor_parallel==1.2.5 --no-dependencies
pip install transformers==4.28.1 sentencepiece==0.1.99
```
## Model Download Links
[llama chat 7B](https://huggingface.co/Linly-AI/ChatFlow-7B)
[llama chat 13B](https://huggingface.co/Linly-AI/ChatFlow-13B)
## Parameter Description
```
--load_model_path (Required). Path to the pretrained model; weights are fp16 by default (for fp32, change line 41 of llama_infer.py to the corresponding precision).
--test_path (Required). Input prompts, one prompt per line.
--prediction_path (Required). Path where the generated results are saved.
--config_path (Required). Model hyper-parameter configuration file; the files can be kept in the config folder.
--spm_model_path (Required). Path to the model tokenizer.
--batch_size (Optional), default 1. Batch size; set it only as large as needed, because the attention cache allocates tensors of this size in GPU memory (see the memory sketch below).
--seq_length (Optional), default 128. Total length of the generated sequence, i.e. the prompt length plus the length generated by the model.
--world_size (Optional), default 1. Number of GPUs used for tensor-parallel inference.
--use_int8 (Optional), default False. Whether to use int8 inference.
--top_k (Optional), default 40. Sampling is restricted to the top_k candidate tokens, which affects generation diversity.
--top_p (Optional), default 0.95. Sampling is restricted by the cumulative probability top_p, which affects generation diversity.
--temperature (Optional), default 0.8. Rescales the final probabilities, which affects token sampling.
--repetition_penalty_range (Optional), default 1024. Range over which the repetition penalty for repeated tokens is applied.
--repetition_penalty_slope (Optional), default 0. Slope of the repetition penalty.
--repetition_penalty (Optional), default 1.15. Penalty coefficient for repeated tokens.
```
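To get a feel for how --batch_size and --seq_length drive GPU memory use, the sketch below estimates the size of the key/value cache that MultiHeadedAttention in this repository pre-allocates (shapes taken from model/llama.py in this commit; llama-7B settings and fp16 storage are assumed):
```
# cache_k / cache_v each have shape [batch_size, seq_length, heads_num, per_head_size] per layer.
# llama-7B values assumed: 32 layers, 32 heads, head dim 4096 // 32 = 128, fp16 (2 bytes).
def kv_cache_bytes(batch_size, seq_length, layers_num=32, heads_num=32, per_head_size=128, dtype_bytes=2):
    per_layer = 2 * batch_size * seq_length * heads_num * per_head_size * dtype_bytes  # K and V
    return layers_num * per_layer

print(kv_cache_bytes(1, 128) / 2**20)   # ~64 MiB for the default batch_size/seq_length
print(kv_cache_bytes(8, 2048) / 2**30)  # ~8 GiB for batch 8 at full context
```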
## Single-GPU Inference
```
./run.sh
export HIP_VISIBLE_DEVICES=0   # use GPU 0
LOAD_MODEL is the path to the downloaded llama model .bin file
SPM_PATH is the path to the downloaded llama tokenizer
--config_path must match the model being used; for the 13b model, change it to config/llama_13b_config.json
```
## Multi-GPU Parallel Inference
```
./run-tp.sh
export HIP_VISIBLE_DEVICES=0,1,2,3   # use GPUs 0, 1, 2 and 3
LOAD_MODEL is the path to the downloaded llama model .bin file
SPM_PATH is the path to the downloaded llama tokenizer
--config_path must match the model being used; for the 13b model, change it to config/llama_13b_config.json
```
## Multi-round Chat
```
./run-dialogue.sh
# during the chat, enter clear to clear the chat history and exit to quit the program
export HIP_VISIBLE_DEVICES=0,1,2,3   # use GPUs 0, 1, 2 and 3
LOAD_MODEL is the path to the downloaded llama model .bin file
SPM_PATH is the path to the downloaded llama tokenizer
--config_path must match the model being used; for the 13b model, change it to config/llama_13b_config.json
```
## Multi-round Chat Results
![image-llama](./doc/llama-inf.jpg)
## Source Repository and Issue Reporting
https://developer.hpccube.com/codes/hepj/llama_pytorch
## Reference
https://github.com/ProjectD-AI/llama_inference
## LLaMa Inference For TencentPretrain
This project mainly supports LLaMa Inference and Microservice deployment based on [TencentPretrain](https://github.com/Tencent/TencentPretrain).
<br>
### Features
- __Int8 Inference__ Supports int8 inference with the bitsandbytes library, and adds batch inference compared with the LM inference script in tencentpretrain.
- __Optimized Inference__ Adds a cache for keys and values in multi-head attention, so that only the newly generated token needs to be fed in at each inference step.
- __LLM Multi-GPU Inference__ Supports tensor-parallel multi-GPU inference.
- __Microservices__ Supports a simple Flask microservice and a Gradio-based online demo.
- __LoRA model Inference__ To be continued.

Note: CUDA is required.
<br>
### Requirements
* Python >= 3.7
* torch >= 1.9
* bitsandbytes
* argparse
<br>
### Input Parameters
* __--load_model_path__ (Required) path to the pretrained model; weights are fp16 by default.
* __--test_path__ (Required) input prompts, one prompt per line.
* __--prediction_path__ (Required) path where the results are saved.
* __--config_path__ (Required) model hyper-parameter file; the files can be stored in the config folder.
* __--spm_model_path__ (Required) path of the model tokenizer.
* __--batch_size__ (Optional) default 1. Suggestion: keep it consistent with the number of input prompts.
* __--seq_length__ (Optional) default 128. Total length of the generated content, i.e. the length of the input plus the generated sentence.
* __--world_size__ (Optional) default 1. The number of GPUs used for tensor-parallel inference.
* __--use_int8__ (Optional) default False. Whether to use int8 inference.
* __--top_k__ (Optional) default 40.
* __--top_p__ (Optional) default 0.95.
* __--temperature__ (Optional) default 0.8.
* __--repetition_penalty_range__ (Optional) default 1024.
* __--repetition_penalty_slope__ (Optional) default 0.
* __--repetition_penalty__ (Optional) default 1.15.
<br>
### Quick Start
#### FP16/Int8 Inference
fp16 inference:
```commandline
python llama_infer.py --test_path ./prompts.txt --prediction_path ./result.txt \
--load_model_path xxx.bin \
--config_path ./config/llama_7b_config.json \
--spm_model_path ./tokenizer.model
```
int8 inference:
```commandline
python llama_infer.py --test_path ./prompts.txt --prediction_path ./result.txt \
--load_model_path xxx.bin --use_int8 \
--config_path ./config/llama_7b_config.json \
--spm_model_path ./tokenizer.model
```
<br>
#### Multi-round chat
Optional parameter: keep_length_ratio, the ratio of context to keep.
Entering 'clear' starts a new round of chat, and 'exit' exits the chat.
```commandline
python llama_dialogue.py --load_model_path xxxx.bin \
--config_path config.json \
--spm_model_path tokenizer.model \
--world_size 2
```
<br>
#### Gradio server
Gradio needs to be installed:
```commandline
pip install gradio
python llama_gradio.py --load_model_path xxxx.bin \
--config_path config.json \
--spm_model_path tokenizer.model
```
Then open http://127.0.0.1:7860/ in a browser.
<br>
#### Microservices deployment
Flask needs to be installed:
```commandline
pip install flask
python llama_server.py --load_model_path xxxx.bin \
--config_path config.json \
--spm_model_path tokenizer.model
```
curl command:
```commandline
curl -H 'Content-Type: application/json' http://127.0.0.1:8888/chat -d '{"question": "xxx"}'
```
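The same request can also be issued from Python; the sketch below is a minimal client (the requests library, the example question text, and the timeout value are illustrative assumptions; the /chat route, port 8888, and the answer/status response fields follow llama_server.py in this repository):
```python
# Minimal client for the Flask microservice started by llama_server.py.
# Assumes the server is running locally on port 8888.
import requests

resp = requests.post(
    "http://127.0.0.1:8888/chat",
    json={"question": "Hello, who are you?"},
    timeout=300,  # generation can take a while
)
result = resp.json()           # {"answer": [...], "status": "success"} on success
print(result["status"], result["answer"])
```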
<br>
#### Multi-GPU Inference
tensor_parallel needs to be installed.
world_size is the number of GPUs to use (GPU ids start from 0).
```commandline
pip install tensor_parallel
python llama_infer.py --test_path ./prompts.txt --prediction_path ./result.txt \
--load_model_path xxxx.bin \
--config_path config.json \
--spm_model_path tokenizer.model \
--world_size 2
```
<br>
[**中文**](https://github.com/fengyh3/llama_inference/blob/main/README.md) | [**English**](https://github.com/fengyh3/llama_inference/blob/main/README_en.md)
## LLaMa Inference Based on TencentPretrain
This project mainly supports quantized inference of LLaMa models based on [TencentPretrain](https://github.com/Tencent/TencentPretrain), as well as simple microservice deployment. It can also be extended to other models and is under continuous development.
<br>
### Features
- __Int8 inference__ Supports int8 inference with the bitsandbytes library and adds batch inference compared with the LM inference script in tencentpretrain.
- __Optimized inference__ Adds a cache for keys and values in multi-head attention, so that only the newly generated token needs to be fed in at each inference step.
- __Multi-GPU inference for large models__ Supports tensor-parallel multi-GPU inference.
- __Microservice deployment__ Supports simple Flask deployment and an online Gradio demo.
- __LoRA model inference__ Work in progress; support for models trained with LoRA is planned.

Note: the current scripts only support CUDA inference; more quantized deployment and inference features are planned, stay tuned.
<br>
### Requirements
* Python >= 3.7
* torch >= 1.9
* bitsandbytes
* argparse
<br>
### Input Parameter Reference
* __--load_model_path__ (Required) path to the pretrained model; weights are fp16 by default (for fp32, change line 41 of llama_infer.py to the corresponding precision).
* __--test_path__ (Required) input prompts, one prompt per line.
* __--prediction_path__ (Required) path where the results are saved.
* __--config_path__ (Required) model hyper-parameter configuration file; the files can be kept in the config folder.
* __--spm_model_path__ (Required) path to the model tokenizer.
* __--batch_size__ (Optional), default 1. Batch size; set it only as large as needed, because the attention cache allocates tensors of this size in GPU memory.
* __--seq_length__ (Optional), default 128. Total length of the generated sequence, i.e. the prompt length plus the length generated by the model.
* __--world_size__ (Optional), default 1. Number of GPUs used for tensor-parallel inference.
* __--use_int8__ (Optional), default False. Whether to use int8 inference.
* __--top_k__ (Optional), default 40. Sampling is restricted to the top_k candidate tokens, which affects generation diversity.
* __--top_p__ (Optional), default 0.95. Sampling is restricted by the cumulative probability top_p, which affects generation diversity.
* __--temperature__ (Optional), default 0.8. Rescales the final probabilities, which affects token sampling.
* __--repetition_penalty_range__ (Optional), default 1024. Range over which the repetition penalty for repeated tokens is applied.
* __--repetition_penalty_slope__ (Optional), default 0. Slope of the repetition penalty.
* __--repetition_penalty__ (Optional), default 1.15. Penalty coefficient for repeated tokens.
<br>
### Quick Start
#### FP16/Int8 Inference
fp16 inference:
```commandline
python llama_infer.py --test_path ./prompts.txt --prediction_path ./result.txt \
--load_model_path xxx.bin \
--config_path ./config/llama_7b_config.json \
--spm_model_path ./tokenizer.model
```
To use int8 inference, add --use_int8:
```commandline
python llama_infer.py --test_path ./prompts.txt --prediction_path ./result.txt \
--load_model_path xxx.bin --use_int8 \
--config_path ./config/llama_7b_config.json \
--spm_model_path ./tokenizer.model
```
<br>
#### Multi-round Chat
There is an optional parameter keep_length_ratio, which specifies how much of the context to keep. Entering clear starts a new round of chat, and entering exit quits the program.
```commandline
python llama_dialogue.py --load_model_path xxxx.bin \
--config_path config.json \
--spm_model_path tokenizer.model \
--world_size 2
```
<br>
#### Gradio Deployment
Gradio needs to be installed:
```commandline
pip install gradio
python llama_gradio.py --load_model_path xxxx.bin \
--config_path config.json \
--spm_model_path tokenizer.model
```
Then open http://127.0.0.1:7860/ in a browser.
<br>
#### Microservice Deployment
Flask needs to be installed:
```commandline
pip install flask
python llama_server.py --load_model_path xxxx.bin \
--config_path config.json \
--spm_model_path tokenizer.model
```
Query command:
```commandline
curl -H 'Content-Type: application/json' http://127.0.0.1:8888/chat -d '{"question": "xxx"}'
```
<br>
#### Multi-GPU Tensor-Parallel Inference
tensor_parallel needs to be installed.
The world_size parameter is the number of GPUs to use (GPU ids start from 0).
```commandline
pip install tensor_parallel
python llama_infer.py --test_path ./prompts.txt --prediction_path ./result.txt \
--load_model_path xxxx.bin \
--config_path config.json \
--spm_model_path tokenizer.model \
--world_size 2
```
{
"gradient_accumulation_steps": 1,
"train_micro_batch_size_per_gpu":1,
"steps_per_print": 100,
"optimizer": {
"type": "Adam",
"params": {
"lr": 2e-5,
"weight_decay": 1e-2
}
},
"flops_profiler": {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 3,
"detailed": true
},
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"zero_optimization": {
"stage": 2,
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
}
},
"activation_checkpointing": {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false
},
"wall_clock_breakdown": false,
"zero_allow_untested_optimizer": true,
"zero_force_ds_cpu_optimization": false
}
{
"gradient_accumulation_steps": 1,
"train_micro_batch_size_per_gpu":1,
"steps_per_print": 100,
"optimizer": {
"type": "Adam",
"params": {
"lr": 1e-5,
"weight_decay": 1e-2
}
},
"flops_profiler": {
"enabled": true,
"profile_step": 1,
"module_depth": -1,
"top_modules": 3,
"detailed": true
},
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"zero_optimization": {
"stage": 3,
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
}
},
"activation_checkpointing": {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false
},
"wall_clock_breakdown": false,
"zero_allow_untested_optimizer": true
}
{
"emb_size": 5120,
"feedforward_size": 13824,
"hidden_size": 5120,
"hidden_act": "silu",
"heads_num": 40,
"layers_num": 40,
"dropout": 0.1,
"data_processor": "lm",
"max_seq_length": 2048,
"embedding": ["word"],
"remove_transformer_bias": true,
"remove_embedding_layernorm": true,
"rotary_position_embedding": true,
"encoder": "transformer",
"feed_forward": "gated",
"mask": "causal",
"layernorm_positioning": "pre",
"layernorm": "rms",
"target": ["lm"]
}
{
"emb_size": 6656,
"feedforward_size": 17920,
"hidden_size": 6656,
"hidden_act": "silu",
"heads_num": 52,
"layers_num": 60,
"dropout": 0.1,
"data_processor": "lm",
"max_seq_length": 2048,
"embedding": ["word"],
"remove_transformer_bias": true,
"remove_embedding_layernorm": true,
"rotary_position_embedding": true,
"encoder": "transformer",
"feed_forward": "gated",
"mask": "causal",
"layernorm_positioning": "pre",
"layernorm": "rms",
"target": ["lm"]
}
{
"emb_size": 8192,
"feedforward_size": 22016,
"hidden_size": 8192,
"hidden_act": "silu",
"heads_num": 64,
"layers_num": 80,
"dropout": 0.1,
"data_processor": "lm",
"max_seq_length": 2048,
"embedding": ["word"],
"remove_transformer_bias": true,
"remove_embedding_layernorm": true,
"rotary_position_embedding": true,
"encoder": "transformer",
"feed_forward": "gated",
"mask": "causal",
"layernorm_positioning": "pre",
"layernorm": "rms",
"target": ["lm"]
}
{
"emb_size": 4096,
"feedforward_size": 11008,
"hidden_size": 4096,
"hidden_act": "silu",
"heads_num": 32,
"layers_num": 32,
"dropout": 0.1,
"data_processor": "lm",
"max_seq_length": 2048,
"embedding": ["word"],
"remove_transformer_bias": true,
"remove_embedding_layernorm": true,
"rotary_position_embedding": true,
"encoder": "transformer",
"feed_forward": "gated",
"mask": "causal",
"layernorm_positioning": "pre",
"layernorm": "rms",
"target": ["lm"]
}
import torch
import torch.nn.functional as F
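# Sampling utilities: temperature scaling, top-k / top-p filtering and a sloped
# repetition penalty are applied to the raw logits before multinomial sampling.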
def apply_temperature(scores, tempt):
if tempt > 0:
scores = scores / tempt
return scores
def apply_top_p(scores, top_p, filter_value=-float("Inf"), min_tokens_to_keep=1):
if top_p > 0 and top_p < 1:
sorted_logits, sorted_indices = torch.sort(scores, descending=False)
cumulative_probs = sorted_logits.softmax(dim=-1).cumsum(dim=-1)
        # Remove the low-probability tail whose cumulative mass is at most (1 - top_p), keeping the top-p nucleus
sorted_indices_to_remove = cumulative_probs <= (1 - top_p)
if min_tokens_to_keep > 1:
# Keep at least min_tokens_to_keep
sorted_indices_to_remove[..., -min_tokens_to_keep:] = 0
# scatter sorted tensors to original indexing
indices_to_remove = sorted_indices_to_remove.scatter(
1, sorted_indices, sorted_indices_to_remove
)
scores = scores.masked_fill(indices_to_remove, filter_value)
return scores
def apply_top_k(logits, top_k):
top_k = min(top_k, logits.size(-1)) # Safety check
if top_k > 0:
# Remove all tokens with a probability less than the last token of the top-k
indices_to_remove = logits < torch.topk(logits.float(), top_k)[0][..., -1, None]
logits[indices_to_remove] = -float("Inf")
return logits
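# Penalize tokens that already appear in the last `penalty_range` positions of the
# context; a non-zero slope makes the penalty ramp up towards more recent tokens.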
def apply_advanced_repetition_penalty(
input_ids, scores, penalty_range, penalty_slope, penalty
):
penalty_range = int(penalty_range)
clipped_penalty_range = min(input_ids.shape[-1], penalty_range)
if penalty != 1.0:
if penalty_range > 0:
if clipped_penalty_range < input_ids.shape[1]:
input_ids = input_ids[..., -clipped_penalty_range:]
if penalty_slope != 0:
_penalty = (
torch.arange(
penalty_range, dtype=scores.dtype, device=scores.device
)
/ (penalty_range - 1)
) * 2.0 - 1
_penalty = (penalty_slope * _penalty) / (
1 + torch.abs(_penalty) * (penalty_slope - 1)
)
_penalty = 1 + ((_penalty + 1) / 2).unsqueeze(0) * (penalty - 1)
penalty = _penalty[..., -clipped_penalty_range:]
score = torch.gather(scores, 1, input_ids)
score = torch.where(score <= 0, score * penalty, score / penalty)
scores.scatter_(1, input_ids, score)
return scores
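# Batched autoregressive generation: prompt tokens are kept in place via `mask`
# instead of being overwritten by samples, finished sequences (EOS reached or the
# cut_off string produced) are dropped from `continue_exsample`, and the model's
# per-layer key/value cache is indexed by the surviving example ids.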
class LmGeneration:
def __init__(self, model, tokenizer):
self.model = model
self.tokenizer = tokenizer
def generate(self, args, prompts, cut_off=None, cut_off_times=1):
if cut_off is not None:
cut_off_times = [cut_off_times for i in range(len(prompts))]
batch = len(prompts)
assert batch <= args.batch_size
prompt_tokens = [args.tokenizer.encode(x, bos=True, eos=False) for x in prompts]
min_prompt_len = min([len(x) for x in prompt_tokens])
# max_prompt_len = max([len(x) for x in prompt_tokens])
total_len = args.seq_length
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokens = torch.full((batch, total_len), self.tokenizer.pad_id).to(device).long()
for idx, t in enumerate(prompt_tokens):
tokens[idx, : len(t)] = torch.tensor(t).long()
mask = tokens != self.tokenizer.pad_id
start_pos = min_prompt_len
prev_pos = 0
continue_exsample = [i for i in range(batch)]
with torch.no_grad():
for cur_pos in range(start_pos, total_len):
logits = self.model.forward(tokens[continue_exsample, prev_pos:cur_pos], prev_pos, continue_exsample).float()
next_token_scores = apply_top_k(logits, top_k=args.top_k)
next_token_scores = apply_top_p(next_token_scores, args.top_p)
next_token_scores = apply_temperature(next_token_scores, args.temperature)
next_token_scores = apply_advanced_repetition_penalty(
tokens[continue_exsample, :cur_pos],
next_token_scores,
args.repetition_penalty_range,
args.repetition_penalty_slope,
args.repetition_penalty
)
scores = F.softmax(next_token_scores, dim=-1)
next_token = torch.multinomial(scores, num_samples=1).squeeze(1)
next_token = next_token.reshape(-1)
next_token = torch.where(
mask[continue_exsample, cur_pos], tokens[continue_exsample, cur_pos], next_token
)
tokens[continue_exsample, cur_pos] = next_token
prev_pos = cur_pos
# remove eos examples.
continue_exsample = []
for i, t in enumerate(tokens.tolist()):
try:
t.index(self.tokenizer.eos_id)
except ValueError:
if cut_off is not None:
if cut_off == self.tokenizer.decode(t[:cur_pos + 1])[-len(cut_off):]:
if cut_off_times[i] == 1:
continue
else:
cut_off_times[i] -= 1
continue_exsample.append(i)
if len(continue_exsample) == 0:
break
decoder = []
for i, t in enumerate(tokens.tolist()):
t = t[: args.seq_length]
try:
t = t[: t.index(self.tokenizer.pad_id)]
t = t[: t.index(self.tokenizer.eos_id)]
except ValueError:
pass
decoder.append(self.tokenizer.decode(t))
return decoder
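# Test variant of LmGeneration: takes already tokenized prompts and returns the
# raw token tensor instead of decoded strings.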
class LmGeneration_test:
def __init__(self, model, tokenizer):
self.model = model
self.tokenizer = tokenizer
def generate(self, args, prompt_tokens, cut_off=None, cut_off_times=1):
if cut_off is not None:
cut_off_times = [cut_off_times for i in range(len(prompt_tokens))]
batch = len(prompt_tokens)
assert batch <= args.batch_size
# prompt_tokens = [args.tokenizer.encode(x, bos=True, eos=False) for x in prompts]
min_prompt_len = min([len(x) for x in prompt_tokens])
# max_prompt_len = max([len(x) for x in prompt_tokens])
total_len = args.seq_length
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokens = torch.full((batch, total_len), self.tokenizer.pad_id).to(device).long()
for idx, t in enumerate(prompt_tokens):
tokens[idx, : len(t)] = torch.tensor(t).long()
mask = tokens != self.tokenizer.pad_id
start_pos = min_prompt_len
prev_pos = 0
continue_exsample = [i for i in range(batch)]
with torch.no_grad():
for cur_pos in range(start_pos, total_len):
logits = self.model.forward(tokens[continue_exsample, prev_pos:cur_pos], prev_pos, continue_exsample).float()
next_token_scores = apply_top_k(logits, top_k=args.top_k)
next_token_scores = apply_top_p(next_token_scores, args.top_p)
next_token_scores = apply_temperature(next_token_scores, args.temperature)
next_token_scores = apply_advanced_repetition_penalty(
tokens[continue_exsample, :cur_pos],
next_token_scores,
args.repetition_penalty_range,
args.repetition_penalty_slope,
args.repetition_penalty
)
scores = F.softmax(next_token_scores, dim=-1)
next_token = torch.multinomial(scores, num_samples=1).squeeze(1)
next_token = next_token.reshape(-1)
next_token = torch.where(
mask[continue_exsample, cur_pos], tokens[continue_exsample, cur_pos], next_token
)
tokens[continue_exsample, cur_pos] = next_token
prev_pos = cur_pos
# remove eos examples.
continue_exsample = []
for i, t in enumerate(tokens.tolist()):
try:
t.index(self.tokenizer.eos_id)
except ValueError:
if cut_off is not None:
if cut_off == self.tokenizer.decode(t[:cur_pos + 1])[-len(cut_off):]:
if cut_off_times[i] == 1:
continue
else:
cut_off_times[i] -= 1
continue_exsample.append(i)
if len(continue_exsample) == 0:
break
return tokens
import argparse
from utils import load_hyperparam, convert_normal_parameter_to_int8, load_model
from model.tokenize import Tokenizer
from model.llama import *
from generate import LmGeneration
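# Multi-round chat loop: the running "User: ... Bot: ..." history is concatenated
# into a single prompt and truncated (by characters) to keep_length_ratio * seq_length
# before each generation call; 'User:' is used as the generation cut-off string.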
def multi_round_chat(args, lm_generation, keep_length_ratio=0.5):
users = []
answers = []
while True:
user_input = input("User: ")
if user_input == 'clear':
users = []
answers = []
print("开启新的一轮聊天/Start a new round of chat:")
continue
if user_input == 'exit':
break
input_str = ''
for user, ans in zip(users, answers):
input_str += 'User: ' + user + '\nBot: ' + ans + '\n'
input_str += 'User: ' + user_input + '\nBot: '
if len(input_str) >= int(keep_length_ratio * args.seq_length):
input_str = input_str[len(input_str) - int(keep_length_ratio * args.seq_length):]
answer = lm_generation.generate(args, [input_str], cut_off='User:', cut_off_times=1)[0]
answer = answer[len(input_str):]
print("ChatLLaMa: " + answer.replace('User:', ''))
users.append(user_input.rstrip(' ').rstrip('\n'))
answers.append(answer.replace('User:', '').rstrip(' ').rstrip('\n'))
if __name__ == '__main__':
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument("--load_model_path", default=None, type=str,
help="Path of the input model.")
parser.add_argument("--prediction_path", type=str, default=None,
help="Path of the prediction file.")
parser.add_argument("--config_path", type=str, required=True,
help="Path of the config file.")
parser.add_argument("--seq_length", type=int, default=2048,
help="Sequence length.")
parser.add_argument("--world_size", type=int, default=1,
help="the number of gpus.")
parser.add_argument("--keep_length_ratio", type=float, default=0.5)
parser.add_argument("--use_int8", action="store_true")
parser.add_argument("--top_k", type=int, default=10)
parser.add_argument("--top_p", type=float, default=1)
parser.add_argument("--temperature", type=float, default=0.85)
parser.add_argument("--repetition_penalty_range", type=int, default=1024)
parser.add_argument("--repetition_penalty_slope", type=float, default=0)
parser.add_argument("--repetition_penalty", type=float, default=1.15)
parser.add_argument("--spm_model_path", default=None, type=str,
help="Path of the sentence piece model.")
args = parser.parse_args()
args = load_hyperparam(args)
args.batch_size = 1
args.tokenizer = Tokenizer(model_path=args.spm_model_path)
args.vocab_size = args.tokenizer.sp_model.vocab_size()
torch.set_default_tensor_type(torch.HalfTensor)
model = LLaMa(args)
torch.set_default_tensor_type(torch.FloatTensor)
model = load_model(model, args.load_model_path)
model.eval()
# use multi-gpu tensor parallel
if args.world_size > 1:
import tensor_parallel as tp
gpus = ["cuda:" + str(i) for i in range(args.world_size)]
if args.use_int8:
model = tp.tensor_parallel(model, gpus, delay_init=True)
model = convert_normal_parameter_to_int8(model)
else:
model = tp.tensor_parallel(model, gpus)
else:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
lm_generation = LmGeneration(model, args.tokenizer)
multi_round_chat(args, lm_generation, args.keep_length_ratio)
import gradio as gr
import argparse
from utils import load_hyperparam, load_model, convert_normal_parameter_to_int8
from model.tokenize import Tokenizer
from model.llama import *
from generate import LmGeneration
args = None
lm_generation = None
def init_args():
global args
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument("--load_model_path", default=None, type=str,
help="Path of the input model.")
parser.add_argument("--config_path", type=str, required=True,
help="Path of the config file.")
parser.add_argument("--batch_size", type=int, default=1,
help="Batch size.")
parser.add_argument("--seq_length", type=int, default=128,
help="Sequence length.")
parser.add_argument("--world_size", type=int, default=1,
help="the number of gpus.")
parser.add_argument("--use_int8", action="store_true")
parser.add_argument("--top_k", type=int, default=10)
parser.add_argument("--top_p", type=float, default=1)
parser.add_argument("--temperature", type=float, default=0.85)
parser.add_argument("--repetition_penalty_range", type=int, default=1024)
parser.add_argument("--repetition_penalty_slope", type=float, default=0)
parser.add_argument("--repetition_penalty", type=float, default=1.15)
parser.add_argument("--spm_model_path", default=None, type=str,
help="Path of the sentence piece model.")
args = parser.parse_args()
args = load_hyperparam(args)
args.tokenizer = Tokenizer(model_path=args.spm_model_path)
args.vocab_size = args.tokenizer.sp_model.vocab_size()
def init_model():
global lm_generation
torch.set_default_tensor_type(torch.HalfTensor)
model = LLaMa(args)
torch.set_default_tensor_type(torch.FloatTensor)
model = load_model(model, args.load_model_path)
model.eval()
# use multi-gpu tensor parallel
if args.world_size > 1:
import tensor_parallel as tp
gpus = ["cuda:" + str(i) for i in range(args.world_size)]
if args.use_int8:
model = tp.tensor_parallel(model, gpus, delay_init=True)
model = convert_normal_parameter_to_int8(model)
else:
model = tp.tensor_parallel(model, gpus)
else:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
lm_generation = LmGeneration(model, args.tokenizer)
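# Gradio callback: top_k and temperature come from the UI sliders; returns the
# decoded generation for a single prompt.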
def chat(prompt, top_k, temperature):
args.top_k = int(top_k)
args.temperature = temperature
response = lm_generation.generate(args, [prompt])
return response[0]
if __name__ == '__main__':
init_args()
init_model()
demo = gr.Interface(
fn=chat,
inputs=["text", gr.Slider(1, 60, value=40, step=1), gr.Slider(0.1, 2.0, value=1.2, step=0.1)],
outputs="text",
)
demo.launch()
import argparse
from utils import load_hyperparam, convert_normal_parameter_to_int8, load_model
from model.tokenize import Tokenizer
from model.llama import *
from generate import LmGeneration
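# Batch inference entry point: reads one prompt per line from --test_path and
# writes one generation per prompt (blank-line separated) to --prediction_path.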
if __name__ == '__main__':
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument("--load_model_path", default=None, type=str,
help="Path of the input model.")
parser.add_argument("--test_path", type=str, required=True,
help="Path of the testset.")
parser.add_argument("--prediction_path", type=str, required=True,
help="Path of the prediction file.")
parser.add_argument("--config_path", type=str, required=True,
help="Path of the config file.")
parser.add_argument("--batch_size", type=int, default=1,
help="Batch size.")
parser.add_argument("--world_size", type=int, default=1,
help="the number of gpus.")
parser.add_argument("--seq_length", type=int, default=128,
help="Sequence length.")
parser.add_argument("--use_int8", action="store_true")
parser.add_argument("--top_k", type=int, default=10)
parser.add_argument("--top_p", type=float, default=1)
parser.add_argument("--temperature", type=float, default=0.85)
parser.add_argument("--repetition_penalty_range", type=int, default=1024)
parser.add_argument("--repetition_penalty_slope", type=float, default=0)
parser.add_argument("--repetition_penalty", type=float, default=1.15)
parser.add_argument("--spm_model_path", default=None, type=str,
help="Path of the sentence piece model.")
args = parser.parse_args()
args = load_hyperparam(args)
args.tokenizer = Tokenizer(model_path=args.spm_model_path)
args.vocab_size = args.tokenizer.sp_model.vocab_size()
torch.set_default_tensor_type(torch.HalfTensor)
model = LLaMa(args)
torch.set_default_tensor_type(torch.FloatTensor)
model = load_model(model, args.load_model_path)
model.eval()
# use multi-gpu tensor parallel
if args.world_size > 1:
import tensor_parallel as tp
gpus = ["cuda:" + str(i) for i in range(args.world_size)]
if args.use_int8:
model = tp.tensor_parallel(model, gpus, delay_init=True)
model = convert_normal_parameter_to_int8(model)
else:
model = tp.tensor_parallel(model, gpus)
else:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
lm_generation = LmGeneration(model, args.tokenizer)
prompts = []
with open(args.test_path, 'r', encoding='utf-8') as f:
for line in f:
prompts.append(line)
with torch.no_grad():
result = lm_generation.generate(args, prompts)
with open(args.prediction_path, 'w', encoding='utf-8') as f:
for res in result:
f.write(res + '\n')
f.write('\n')
import argparse
import torch
from utils import load_hyperparam, convert_normal_parameter_to_int8, load_model
from model.tokenize import Tokenizer
from model.llama import *
from generate import LmGeneration
from flask import Flask, request
import json
app = Flask(__name__)
args = None
lm_generation = None
def init_model():
global args
global lm_generation
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument("--load_model_path", default=None, type=str,
help="Path of the input model.")
parser.add_argument("--config_path", type=str, required=True,
help="Path of the config file.")
parser.add_argument("--batch_size", type=int, default=1,
help="Batch size.")
parser.add_argument("--seq_length", type=int, default=128,
help="Sequence length.")
parser.add_argument("--world_size", type=int, default=1,
help="the number of gpus.")
parser.add_argument("--use_int8", action="store_true")
parser.add_argument("--top_k", type=int, default=10)
parser.add_argument("--top_p", type=float, default=1)
parser.add_argument("--temperature", type=float, default=0.85)
parser.add_argument("--repetition_penalty_range", type=int, default=1024)
parser.add_argument("--repetition_penalty_slope", type=float, default=0)
parser.add_argument("--repetition_penalty", type=float, default=1.15)
parser.add_argument("--spm_model_path", default=None, type=str,
help="Path of the sentence piece model.")
args = parser.parse_args()
args = load_hyperparam(args)
args.tokenizer = Tokenizer(model_path=args.spm_model_path)
args.vocab_size = args.tokenizer.sp_model.vocab_size()
torch.set_default_tensor_type(torch.HalfTensor)
model = LLaMa(args)
torch.set_default_tensor_type(torch.FloatTensor)
model = load_model(model, args.load_model_path)
model.eval()
# use multi-gpu tensor parallel
if args.world_size > 1:
import tensor_parallel as tp
gpus = ["cuda:" + str(i) for i in range(args.world_size)]
if args.use_int8:
model = tp.tensor_parallel(model, gpus, delay_init=True)
model = convert_normal_parameter_to_int8(model)
else:
model = tp.tensor_parallel(model, gpus)
else:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
lm_generation = LmGeneration(model, args.tokenizer)
@app.route("/chat", methods=['POST'])
def chat():
question = request.json.get("question")
if isinstance(question, str):
question = [question, ]
try:
with torch.no_grad():
answer = lm_generation.generate(args, question)
status = 'success'
except Exception:
answer = ''
status = 'error'
return json.dumps({'answer': answer, 'status': status}, ensure_ascii=False)
if __name__ == '__main__':
init_model()
# first pass on request to initialize int8.
try:
with torch.no_grad():
answer = lm_generation.generate(args, ['hello world!'])
except Exception:
pass
app.run(host='127.0.0.1', port=8888, debug=False)
# Model name
modelName=LLAMA_pytorch
# Model description
modelDescription=Inference of llama models in tencentpretrain format based on the Pytorch framework
# Application scenarios
apoScenario=inference,nlp,text generation,intelligent chat assistant
# Framework type
frameType=Pytorch,Transformers,Tensor_parallel
import torch
import torch.nn as nn
import torch.nn.functional as F
from model.norm import RMSNorm
from model.rope import precompute_freqs_cis, apply_rotary_emb
# import bitsandbytes as bnb
import math
class NormalLinear(nn.Linear):
def reset_parameters(self) -> None:
pass
# class BnbInt8Linear(bnb.nn.Linear8bitLt):
# def __init__(self, *args, **kwargs):
# super().__init__(has_fp16_weights=False, threshold=6.0, *args, **kwargs)
# def reset_parameters(self) -> None:
# pass
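# bitsandbytes int8 layers are disabled in this port, so NormalLinear is returned
# regardless of use_int8.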
def get_linear_layer(use_int8):
if use_int8:
pass
return NormalLinear
class WordEmbedding(nn.Module):
def __init__(self, args):
super(WordEmbedding, self).__init__()
self.embedding = nn.Embedding(args.vocab_size, args.emb_size)
def forward(self, src):
emb = self.embedding(src)
return emb
class MultiHeadedAttention(nn.Module):
def __init__(self, args, hidden_size, heads_num, attention_head_size, has_bias=True, use_int8=True):
super(MultiHeadedAttention, self).__init__()
self.heads_num = heads_num
self.per_head_size = attention_head_size
self.inner_hidden_size = heads_num * attention_head_size
Linear = get_linear_layer(use_int8)
self.linear_layers = nn.ModuleList(
[Linear(hidden_size, self.inner_hidden_size, bias=has_bias) for _ in range(3)]
)
self.final_linear = Linear(self.inner_hidden_size, hidden_size, bias=has_bias)
        # Cache keys and values so that previously generated tokens are not recomputed on later steps.
self.cache_k = torch.zeros(
(args.batch_size, args.seq_length, self.heads_num, self.per_head_size)
)
self.cache_v = torch.zeros(
(args.batch_size, args.seq_length, self.heads_num, self.per_head_size)
)
def forward(self, key, value, query, start_pos, continue_exsample, mask, freqs_cis):
batch_size, seq_length, _ = query.size()
heads_num = self.heads_num
per_head_size = self.per_head_size
query, key, value = [l(x).view(batch_size, -1, heads_num, per_head_size) \
for l, x in zip(self.linear_layers, (query, key, value))]
query, key = apply_rotary_emb(query, key, freqs_cis=freqs_cis)
if self.cache_k.device != key.device:
self.cache_k = self.cache_k.to(key)
if self.cache_v.device != value.device:
self.cache_v = self.cache_v.to(value)
self.cache_k[continue_exsample, start_pos: start_pos + seq_length] = key
self.cache_v[continue_exsample, start_pos: start_pos + seq_length] = value
key = self.cache_k[continue_exsample, : start_pos + seq_length]
value = self.cache_v[continue_exsample, : start_pos + seq_length]
query, key, value = [x.transpose(1, 2) for x in (query, key, value)]
scores = torch.matmul(query, key.transpose(-2, -1))
scores = scores / math.sqrt(float(per_head_size))
if mask is not None:
scores += mask
# probs = nn.Softmax(dim=-1)(scores)
probs = F.softmax(scores.float(), dim=-1).type_as(query)
output = torch.matmul(probs, value).transpose(1, 2).\
contiguous().view(batch_size, seq_length, -1)
return self.final_linear(output)
class GatedFeedForward(nn.Module):
def __init__(self, hidden_size, feedforward_size, has_bias=True, use_int8=True):
super(GatedFeedForward, self).__init__()
Linear = get_linear_layer(use_int8)
self.linear_gate = Linear(hidden_size, feedforward_size, bias=has_bias)
self.linear_1 = Linear(hidden_size, feedforward_size, bias=has_bias)
self.linear_2 = Linear(feedforward_size, hidden_size, bias=has_bias)
self.act = F.silu
def forward(self, x):
# gate = self.act(self.linear_gate(x))
gate = self.act(self.linear_gate(x)).type_as(x)
inter_linear = self.linear_1(x)
inter = gate * inter_linear
output = self.linear_2(inter)
return output
class TransformerLayer(nn.Module):
def __init__(self, args):
super(TransformerLayer, self).__init__()
if hasattr(args, "attention_head_size"):
attention_head_size = args.attention_head_size
else:
attention_head_size = args.hidden_size // args.heads_num
has_bias = bool(1 - args.remove_transformer_bias)
# Multi-head Attention
self.self_attn = MultiHeadedAttention(
args, args.hidden_size, args.heads_num, attention_head_size, has_bias=has_bias,
use_int8=args.use_int8
)
# FFN
self.feed_forward = GatedFeedForward(
args.hidden_size, args.feedforward_size, has_bias, use_int8=args.use_int8
)
self.layer_norm_1 = RMSNorm(args.hidden_size)
self.layer_norm_2 = RMSNorm(args.hidden_size)
def forward(self, hidden, start_pos, continue_exsample, mask, freqs_cis=None):
inter = self.layer_norm_1(hidden)
inter = self.self_attn(inter, inter, inter, start_pos, continue_exsample, mask, freqs_cis)
hidden = hidden + inter
output = self.layer_norm_2(hidden)
output = self.feed_forward(output) + hidden
return output
class TransformerEncoder(nn.Module):
def __init__(self, args):
super(TransformerEncoder, self).__init__()
self.mask = args.mask
self.layers_num = args.layers_num
self.transformer = nn.ModuleList(
[TransformerLayer(args) for _ in range(self.layers_num)]
)
self.layer_norm = RMSNorm(args.hidden_size)
self.freqs_cis = precompute_freqs_cis(args.hidden_size // args.heads_num, args.max_seq_length * 2)
def forward(self, emb, start_pos, continue_exsample):
batch_size, seq_length, _ = emb.size()
mask = None
if seq_length > 1:
mask = torch.ones(seq_length, seq_length, device=emb.device)
mask = torch.tril(mask)
mask = (1.0 - mask) * -10000
mask = mask.repeat(batch_size, 1, 1, 1)
hidden = emb
freqs_cis = self.freqs_cis[start_pos: start_pos + seq_length].to(hidden.device)
for i in range(self.layers_num):
hidden = self.transformer[i](hidden, start_pos, continue_exsample, mask, freqs_cis=freqs_cis)
return self.layer_norm(hidden)
class LmOutput(nn.Module):
def __init__(self, args):
super(LmOutput, self).__init__()
        # Note: the LM output head always uses a normal (non-int8) linear layer.
Linear = get_linear_layer(False)
self.lm = Linear(args.hidden_size, args.vocab_size, bias=False)
def forward(self, x):
return self.lm(x[:, -1, :])
class LLaMa(nn.Module):
def __init__(self, args):
super(LLaMa, self).__init__()
self.embedding = WordEmbedding(args)
self.encoder = TransformerEncoder(args)
self.target = LmOutput(args)
#@torch.inference_mode()
def forward(self, src, start_pos, continue_exsample):
emb = self.embedding(src)
output = self.encoder(emb, start_pos, continue_exsample)
output = self.target(output)
return output
from torch import nn
import torch
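# Root-mean-square layer normalization (no mean subtraction and no bias), as used in LLaMA.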
class RMSNorm(torch.nn.Module):
def __init__(self, hidden_size, eps=1e-6):
super().__init__()
self.eps = eps
self.weight = nn.Parameter(torch.ones(hidden_size))
def _norm(self, x):
return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
def forward(self, x):
output = self._norm(x.float()).type_as(x)
return output * self.weight
import torch
from typing import Tuple
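# Rotary position embeddings (RoPE): precompute the complex rotation factors
# exp(i * t * theta_j) for each position t and frequency theta_j, then rotate
# the query/key head dimensions pairwise in the complex plane.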
def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
t = torch.arange(end, device=freqs.device) # type: ignore
freqs = torch.outer(t, freqs).float() # type: ignore
freqs_cis = torch.polar(torch.ones_like(freqs), freqs) # complex64
return freqs_cis
def reshape_for_broadcast(freqs_cis: torch.Tensor, x: torch.Tensor):
ndim = x.ndim
assert 0 <= 1 < ndim
assert freqs_cis.shape == (x.shape[1], x.shape[-1])
shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
return freqs_cis.view(*shape)
def apply_rotary_emb(
xq: torch.Tensor,
xk: torch.Tensor,
freqs_cis: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
return xq_out.type_as(xq), xk_out.type_as(xk)