Commit b856c4ec authored by zhaoying1

added llama_inference_pytorch
# LLaMa Inference Based on TencentPretrain
## Model Introduction
LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. The models are trained on trillions of tokens and show that state-of-the-art models can be trained using publicly available datasets exclusively, without relying on proprietary and inaccessible datasets.
## Model Architecture
The LLaMA network is based on the Transformer architecture, incorporating various improvements that were proposed for and used in other models such as PaLM. The main differences from the original architecture are:
- Pre-normalization: to improve training stability, the input of each transformer sub-layer is normalized instead of the output, using the RMSNorm normalization function.
- SwiGLU activation function [PaLM]: the ReLU non-linearity is replaced with the SwiGLU activation function to improve performance, using a dimension of 2/3 * 4d instead of the 4d used in PaLM.
- Rotary embeddings: absolute position embeddings are removed and rotary position embeddings (RoPE) are added at every layer of the network.

The main network configuration parameters of llama-7B are as follows:
```
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_position_embeddings": 2048,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"rms_norm_eps": 1e-06,
"vocab_size": 32000
```
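As a quick sanity check of the 2/3 * 4d feed-forward rule mentioned above, the snippet below reproduces the intermediate_size value in this config; the rounding to a multiple of 256 is an assumption taken from the reference LLaMA implementation (multiple_of=256):
```
# Rough check of the SwiGLU feed-forward width for llama-7B (hidden_size d = 4096).
# LLaMA uses 2/3 * 4d instead of PaLM's 4d, rounded up to a multiple of 256.
d = 4096
raw = 2 * 4 * d / 3                      # 2/3 * 4d = 10922.67
multiple_of = 256
intermediate = multiple_of * ((int(raw) + multiple_of - 1) // multiple_of)
print(int(raw), intermediate)            # 10922, 11008 -> matches intermediate_size above
```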
# LLaMA Inference
## Environment Setup
Running inside Docker is recommended; a Docker image pulled from [光源](https://www.sourcefind.cn/) is provided:
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/vscode-pytorch:1.10.0-centos7.6-dtk-22.10-py37-latest
```
Install the dependencies that are not included in the Docker image:
```
pip install tensor_parallel==1.2.5 --no-dependencies
pip install transformers==4.28.1 sentencepiece==0.1.99
```
## Model Download Links
[llama chat 7B](https://huggingface.co/Linly-AI/ChatFlow-7B)
[llama chat 13B](https://huggingface.co/Linly-AI/ChatFlow-13B)
## Parameter Description
```
--load_model_path (Required). Path to the pretrained model; weights are fp16 by default (for fp32, change line 41 of llama_infer.py to the corresponding precision).
--test_path (Required). Input prompts, one prompt per line.
--prediction_path (Required). Path where the generated results are saved.
--config_path (Required). Model hyper-parameter configuration file; the files can be kept in the config folder.
--spm_model_path (Required). Path to the model tokenizer.
--batch_size (Optional), default 1. Batch size; set it only as large as needed, because the attention cache allocates tensors of this size in GPU memory (see the memory sketch below).
--seq_length (Optional), default 128. Total length of the generated sequence, i.e. the prompt length plus the length generated by the model.
--world_size (Optional), default 1. Number of GPUs used for tensor-parallel inference.
--use_int8 (Optional), default False. Whether to use int8 inference.
--top_k (Optional), default 40. Sampling is restricted to the top_k candidate tokens, which affects generation diversity.
--top_p (Optional), default 0.95. Sampling is restricted by the cumulative probability top_p, which affects generation diversity.
--temperature (Optional), default 0.8. Rescales the final probabilities, which affects token sampling.
--repetition_penalty_range (Optional), default 1024. Range over which the repetition penalty for repeated tokens is applied.
--repetition_penalty_slope (Optional), default 0. Slope of the repetition penalty.
--repetition_penalty (Optional), default 1.15. Penalty coefficient for repeated tokens.
```
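To get a feel for how --batch_size and --seq_length drive GPU memory use, the sketch below estimates the size of the key/value cache that MultiHeadedAttention in this repository pre-allocates (shapes taken from model/llama.py in this commit; llama-7B settings and fp16 storage are assumed):
```
# cache_k / cache_v each have shape [batch_size, seq_length, heads_num, per_head_size] per layer.
# llama-7B values assumed: 32 layers, 32 heads, head dim 4096 // 32 = 128, fp16 (2 bytes).
def kv_cache_bytes(batch_size, seq_length, layers_num=32, heads_num=32, per_head_size=128, dtype_bytes=2):
    per_layer = 2 * batch_size * seq_length * heads_num * per_head_size * dtype_bytes  # K and V
    return layers_num * per_layer

print(kv_cache_bytes(1, 128) / 2**20)   # ~64 MiB for the default batch_size/seq_length
print(kv_cache_bytes(8, 2048) / 2**30)  # ~8 GiB for batch 8 at full context
```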
## Single-GPU Inference
```
./run.sh
export HIP_VISIBLE_DEVICES=0   # use GPU 0
LOAD_MODEL is the path to the downloaded llama model .bin file
SPM_PATH is the path to the downloaded llama tokenizer
--config_path must match the model being used; for the 13b model, change it to config/llama_13b_config.json
```
## Multi-GPU Parallel Inference
```
./run-tp.sh
export HIP_VISIBLE_DEVICES=0,1,2,3   # use GPUs 0, 1, 2 and 3
LOAD_MODEL is the path to the downloaded llama model .bin file
SPM_PATH is the path to the downloaded llama tokenizer
--config_path must match the model being used; for the 13b model, change it to config/llama_13b_config.json
```
## Multi-round Chat
```
./run-dialogue.sh
# during the chat, enter clear to clear the chat history and exit to quit the program
export HIP_VISIBLE_DEVICES=0,1,2,3   # use GPUs 0, 1, 2 and 3
LOAD_MODEL is the path to the downloaded llama model .bin file
SPM_PATH is the path to the downloaded llama tokenizer
--config_path must match the model being used; for the 13b model, change it to config/llama_13b_config.json
```
## Multi-round Chat Results
![image-llama](./doc/llama-inf.jpg)
## Source Repository and Issue Reporting
https://developer.hpccube.com/codes/hepj/llama_pytorch
## Reference
https://github.com/ProjectD-AI/llama_inference
## LLaMa Inference For TencentPretrain
This project mainly supports LLaMa Inference and Microservice deployment based on [TencentPretrain](https://github.com/Tencent/TencentPretrain).
<br>
### Features
- __Int8 Inference__ Supports int8 inference with the bitsandbytes library, and adds batch inference compared with the LM inference script in tencentpretrain.
- __Optimized Inference__ Adds a cache for keys and values in multi-head attention, so that only the newly generated token needs to be fed in at each inference step.
- __LLM Multi-GPU Inference__ Supports tensor-parallel multi-GPU inference.
- __Microservices__ Supports a simple Flask microservice and a Gradio-based online demo.
- __LoRA model Inference__ To be continued.

Note: CUDA is required.
<br>
### Requirements
* Python >= 3.7
* torch >= 1.9
* bitsandbytes
* argparse
<br>
### Input Parameters
* __--load_model_path__ (Required) path to the pretrained model; weights are fp16 by default.
* __--test_path__ (Required) input prompts, one prompt per line.
* __--prediction_path__ (Required) path where the results are saved.
* __--config_path__ (Required) model hyper-parameter file; the files can be stored in the config folder.
* __--spm_model_path__ (Required) path of the model tokenizer.
* __--batch_size__ (Optional) default 1. Suggestion: keep it consistent with the number of input prompts.
* __--seq_length__ (Optional) default 128. Total length of the generated content, i.e. the length of the input plus the generated sentence.
* __--world_size__ (Optional) default 1. The number of GPUs used for tensor-parallel inference.
* __--use_int8__ (Optional) default False. Whether to use int8 inference.
* __--top_k__ (Optional) default 40.
* __--top_p__ (Optional) default 0.95.
* __--temperature__ (Optional) default 0.8.
* __--repetition_penalty_range__ (Optional) default 1024.
* __--repetition_penalty_slope__ (Optional) default 0.
* __--repetition_penalty__ (Optional) default 1.15.
<br>
### Quick Start
#### FP16/Int8 Inference
fp16 inference:
```commandline
python llama_infer.py --test_path ./prompts.txt --prediction_path ./result.txt \
--load_model_path xxx.bin \
--config_path ./config/llama_7b_config.json \
--spm_model_path ./tokenizer.model
```
int8 inference:
```commandline
python llama_infer.py --test_path ./prompts.txt --prediction_path ./result.txt \
--load_model_path xxx.bin --use_int8 \
--config_path ./config/llama_7b_config.json \
--spm_model_path ./tokenizer.model
```
<br>
#### Multi-round chat
Optional parameter: keep_length_ratio, the ratio of context to keep.
Entering 'clear' starts a new round of chat, and 'exit' exits the chat.
```commandline
python llama_dialogue.py --load_model_path xxxx.bin \
--config_path config.json \
--spm_model_path tokenizer.model \
--world_size 2
```
<br>
#### Gradio server
Gradio needs to be installed:
```commandline
pip install gradio
python llama_gradio.py --load_model_path xxxx.bin \
--config_path config.json \
--spm_model_path tokenizer.model
```
Then open http://127.0.0.1:7860/ in a browser.
<br>
#### Microservices deployment
Flask needs to be installed:
```commandline
pip install flask
python llama_server.py --load_model_path xxxx.bin \
--config_path config.json \
--spm_model_path tokenizer.model
```
curl command:
```commandline
curl -H 'Content-Type: application/json' http://127.0.0.1:8888/chat -d '{"question": "xxx"}'
```
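The same request can also be issued from Python; the sketch below is a minimal client (the requests library, the example question text, and the timeout value are illustrative assumptions; the /chat route, port 8888, and the answer/status response fields follow llama_server.py in this repository):
```python
# Minimal client for the Flask microservice started by llama_server.py.
# Assumes the server is running locally on port 8888.
import requests

resp = requests.post(
    "http://127.0.0.1:8888/chat",
    json={"question": "Hello, who are you?"},
    timeout=300,  # generation can take a while
)
result = resp.json()           # {"answer": [...], "status": "success"} on success
print(result["status"], result["answer"])
```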
<br>
#### Multi-GPU Inference
tensor_parallel needs to be installed.
world_size is the number of GPUs to use (GPU ids start from 0).
```commandline
pip install tensor_parallel
python llama_infer.py --test_path ./prompts.txt --prediction_path ./result.txt \
--load_model_path xxxx.bin \
--config_path config.json \
--spm_model_path tokenizer.model \
--world_size 2
```
<br>
[**中文**](https://github.com/fengyh3/llama_inference/blob/main/README.md) | [**English**](https://github.com/fengyh3/llama_inference/blob/main/README_en.md)
## LLaMa Inference Based on TencentPretrain
This project mainly supports quantized inference of LLaMa models based on [TencentPretrain](https://github.com/Tencent/TencentPretrain), as well as simple microservice deployment. It can also be extended to other models and is under continuous development.
<br>
### Features
- __Int8 inference__ Supports int8 inference with the bitsandbytes library and adds batch inference compared with the LM inference script in tencentpretrain.
- __Optimized inference__ Adds a cache for keys and values in multi-head attention, so that only the newly generated token needs to be fed in at each inference step.
- __Multi-GPU inference for large models__ Supports tensor-parallel multi-GPU inference.
- __Microservice deployment__ Supports simple Flask deployment and an online Gradio demo.
- __LoRA model inference__ Work in progress; support for models trained with LoRA is planned.

Note: the current scripts only support CUDA inference; more quantized deployment and inference features are planned, stay tuned.
<br>
### Requirements
* Python >= 3.7
* torch >= 1.9
* bitsandbytes
* argparse
<br>
### Input Parameter Reference
* __--load_model_path__ (Required) path to the pretrained model; weights are fp16 by default (for fp32, change line 41 of llama_infer.py to the corresponding precision).
* __--test_path__ (Required) input prompts, one prompt per line.
* __--prediction_path__ (Required) path where the results are saved.
* __--config_path__ (Required) model hyper-parameter configuration file; the files can be kept in the config folder.
* __--spm_model_path__ (Required) path to the model tokenizer.
* __--batch_size__ (Optional), default 1. Batch size; set it only as large as needed, because the attention cache allocates tensors of this size in GPU memory.
* __--seq_length__ (Optional), default 128. Total length of the generated sequence, i.e. the prompt length plus the length generated by the model.
* __--world_size__ (Optional), default 1. Number of GPUs used for tensor-parallel inference.
* __--use_int8__ (Optional), default False. Whether to use int8 inference.
* __--top_k__ (Optional), default 40. Sampling is restricted to the top_k candidate tokens, which affects generation diversity.
* __--top_p__ (Optional), default 0.95. Sampling is restricted by the cumulative probability top_p, which affects generation diversity.
* __--temperature__ (Optional), default 0.8. Rescales the final probabilities, which affects token sampling.
* __--repetition_penalty_range__ (Optional), default 1024. Range over which the repetition penalty for repeated tokens is applied.
* __--repetition_penalty_slope__ (Optional), default 0. Slope of the repetition penalty.
* __--repetition_penalty__ (Optional), default 1.15. Penalty coefficient for repeated tokens.
<br>
### Quick Start
#### FP16/Int8 Inference
fp16 inference:
```commandline
python llama_infer.py --test_path ./prompts.txt --prediction_path ./result.txt \
--load_model_path xxx.bin \
--config_path ./config/llama_7b_config.json \
--spm_model_path ./tokenizer.model
```
To use int8 inference, add --use_int8:
```commandline
python llama_infer.py --test_path ./prompts.txt --prediction_path ./result.txt \
--load_model_path xxx.bin --use_int8 \
--config_path ./config/llama_7b_config.json \
--spm_model_path ./tokenizer.model
```
<br>
#### Multi-round Chat
There is an optional parameter keep_length_ratio, which specifies how much of the context to keep. Entering clear starts a new round of chat, and entering exit quits the program.
```commandline
python llama_dialogue.py --load_model_path xxxx.bin \
--config_path config.json \
--spm_model_path tokenizer.model \
--world_size 2
```
<br>
#### Gradio Deployment
Gradio needs to be installed:
```commandline
pip install gradio
python llama_gradio.py --load_model_path xxxx.bin \
--config_path config.json \
--spm_model_path tokenizer.model
```
Then open http://127.0.0.1:7860/ in a browser.
<br>
#### Microservice Deployment
Flask needs to be installed:
```commandline
pip install flask
python llama_server.py --load_model_path xxxx.bin \
--config_path config.json \
--spm_model_path tokenizer.model
```
Query command:
```commandline
curl -H 'Content-Type: application/json' http://127.0.0.1:8888/chat -d '{"question": "xxx"}'
```
<br>
#### Multi-GPU Tensor-Parallel Inference
tensor_parallel needs to be installed.
The world_size parameter is the number of GPUs to use (GPU ids start from 0).
```commandline
pip install tensor_parallel
python llama_infer.py --test_path ./prompts.txt --prediction_path ./result.txt \
--load_model_path xxxx.bin \
--config_path config.json \
--spm_model_path tokenizer.model \
--world_size 2
```
{
"gradient_accumulation_steps": 1,
"train_micro_batch_size_per_gpu":1,
"steps_per_print": 100,
"optimizer": {
"type": "Adam",
"params": {
"lr": 2e-5,
"weight_decay": 1e-2
}
},
"flops_profiler": {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 3,
"detailed": true
},
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"zero_optimization": {
"stage": 2,
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
}
},
"activation_checkpointing": {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false
},
"wall_clock_breakdown": false,
"zero_allow_untested_optimizer": true,
"zero_force_ds_cpu_optimization": false
}
{
"gradient_accumulation_steps": 1,
"train_micro_batch_size_per_gpu":1,
"steps_per_print": 100,
"optimizer": {
"type": "Adam",
"params": {
"lr": 1e-5,
"weight_decay": 1e-2
}
},
"flops_profiler": {
"enabled": true,
"profile_step": 1,
"module_depth": -1,
"top_modules": 3,
"detailed": true
},
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"zero_optimization": {
"stage": 3,
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
}
},
"activation_checkpointing": {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false
},
"wall_clock_breakdown": false,
"zero_allow_untested_optimizer": true
}
{
"emb_size": 5120,
"feedforward_size": 13824,
"hidden_size": 5120,
"hidden_act": "silu",
"heads_num": 40,
"layers_num": 40,
"dropout": 0.1,
"data_processor": "lm",
"max_seq_length": 2048,
"embedding": ["word"],
"remove_transformer_bias": true,
"remove_embedding_layernorm": true,
"rotary_position_embedding": true,
"encoder": "transformer",
"feed_forward": "gated",
"mask": "causal",
"layernorm_positioning": "pre",
"layernorm": "rms",
"target": ["lm"]
}
{
"emb_size": 6656,
"feedforward_size": 17920,
"hidden_size": 6656,
"hidden_act": "silu",
"heads_num": 52,
"layers_num": 60,
"dropout": 0.1,
"data_processor": "lm",
"max_seq_length": 2048,
"embedding": ["word"],
"remove_transformer_bias": true,
"remove_embedding_layernorm": true,
"rotary_position_embedding": true,
"encoder": "transformer",
"feed_forward": "gated",
"mask": "causal",
"layernorm_positioning": "pre",
"layernorm": "rms",
"target": ["lm"]
}
{
"emb_size": 8192,
"feedforward_size": 22016,
"hidden_size": 8192,
"hidden_act": "silu",
"heads_num": 64,
"layers_num": 80,
"dropout": 0.1,
"data_processor": "lm",
"max_seq_length": 2048,
"embedding": ["word"],
"remove_transformer_bias": true,
"remove_embedding_layernorm": true,
"rotary_position_embedding": true,
"encoder": "transformer",
"feed_forward": "gated",
"mask": "causal",
"layernorm_positioning": "pre",
"layernorm": "rms",
"target": ["lm"]
}
{
"emb_size": 4096,
"feedforward_size": 11008,
"hidden_size": 4096,
"hidden_act": "silu",
"heads_num": 32,
"layers_num": 32,
"dropout": 0.1,
"data_processor": "lm",
"max_seq_length": 2048,
"embedding": ["word"],
"remove_transformer_bias": true,
"remove_embedding_layernorm": true,
"rotary_position_embedding": true,
"encoder": "transformer",
"feed_forward": "gated",
"mask": "causal",
"layernorm_positioning": "pre",
"layernorm": "rms",
"target": ["lm"]
}
import torch
import torch.nn.functional as F
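# Sampling utilities: temperature scaling, top-k / top-p filtering and a sloped
# repetition penalty are applied to the raw logits before multinomial sampling.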
def apply_temperature(scores, tempt):
if tempt > 0:
scores = scores / tempt
return scores
def apply_top_p(scores, top_p, filter_value=-float("Inf"), min_tokens_to_keep=1):
if top_p > 0 and top_p < 1:
sorted_logits, sorted_indices = torch.sort(scores, descending=False)
cumulative_probs = sorted_logits.softmax(dim=-1).cumsum(dim=-1)
        # Remove the low-probability tail whose cumulative mass is at most (1 - top_p), keeping the top-p nucleus
sorted_indices_to_remove = cumulative_probs <= (1 - top_p)
if min_tokens_to_keep > 1:
# Keep at least min_tokens_to_keep
sorted_indices_to_remove[..., -min_tokens_to_keep:] = 0
# scatter sorted tensors to original indexing
indices_to_remove = sorted_indices_to_remove.scatter(
1, sorted_indices, sorted_indices_to_remove
)
scores = scores.masked_fill(indices_to_remove, filter_value)
return scores
def apply_top_k(logits, top_k):
top_k = min(top_k, logits.size(-1)) # Safety check
if top_k > 0:
# Remove all tokens with a probability less than the last token of the top-k
indices_to_remove = logits < torch.topk(logits.float(), top_k)[0][..., -1, None]
logits[indices_to_remove] = -float("Inf")
return logits
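# Penalize tokens that already appear in the last `penalty_range` positions of the
# context; a non-zero slope makes the penalty ramp up towards more recent tokens.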
def apply_advanced_repetition_penalty(
input_ids, scores, penalty_range, penalty_slope, penalty
):
penalty_range = int(penalty_range)
clipped_penalty_range = min(input_ids.shape[-1], penalty_range)
if penalty != 1.0:
if penalty_range > 0:
if clipped_penalty_range < input_ids.shape[1]:
input_ids = input_ids[..., -clipped_penalty_range:]
if penalty_slope != 0:
_penalty = (
torch.arange(
penalty_range, dtype=scores.dtype, device=scores.device
)
/ (penalty_range - 1)
) * 2.0 - 1
_penalty = (penalty_slope * _penalty) / (
1 + torch.abs(_penalty) * (penalty_slope - 1)
)
_penalty = 1 + ((_penalty + 1) / 2).unsqueeze(0) * (penalty - 1)
penalty = _penalty[..., -clipped_penalty_range:]
score = torch.gather(scores, 1, input_ids)
score = torch.where(score <= 0, score * penalty, score / penalty)
scores.scatter_(1, input_ids, score)
return scores
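# Batched autoregressive generation: prompt tokens are kept in place via `mask`
# instead of being overwritten by samples, finished sequences (EOS reached or the
# cut_off string produced) are dropped from `continue_exsample`, and the model's
# per-layer key/value cache is indexed by the surviving example ids.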
class LmGeneration:
def __init__(self, model, tokenizer):
self.model = model
self.tokenizer = tokenizer
def generate(self, args, prompts, cut_off=None, cut_off_times=1):
if cut_off is not None:
cut_off_times = [cut_off_times for i in range(len(prompts))]
batch = len(prompts)
assert batch <= args.batch_size
prompt_tokens = [args.tokenizer.encode(x, bos=True, eos=False) for x in prompts]
min_prompt_len = min([len(x) for x in prompt_tokens])
# max_prompt_len = max([len(x) for x in prompt_tokens])
total_len = args.seq_length
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokens = torch.full((batch, total_len), self.tokenizer.pad_id).to(device).long()
for idx, t in enumerate(prompt_tokens):
tokens[idx, : len(t)] = torch.tensor(t).long()
mask = tokens != self.tokenizer.pad_id
start_pos = min_prompt_len
prev_pos = 0
continue_exsample = [i for i in range(batch)]
with torch.no_grad():
for cur_pos in range(start_pos, total_len):
logits = self.model.forward(tokens[continue_exsample, prev_pos:cur_pos], prev_pos, continue_exsample).float()
next_token_scores = apply_top_k(logits, top_k=args.top_k)
next_token_scores = apply_top_p(next_token_scores, args.top_p)
next_token_scores = apply_temperature(next_token_scores, args.temperature)
next_token_scores = apply_advanced_repetition_penalty(
tokens[continue_exsample, :cur_pos],
next_token_scores,
args.repetition_penalty_range,
args.repetition_penalty_slope,
args.repetition_penalty
)
scores = F.softmax(next_token_scores, dim=-1)
next_token = torch.multinomial(scores, num_samples=1).squeeze(1)
next_token = next_token.reshape(-1)
next_token = torch.where(
mask[continue_exsample, cur_pos], tokens[continue_exsample, cur_pos], next_token
)
tokens[continue_exsample, cur_pos] = next_token
prev_pos = cur_pos
# remove eos examples.
continue_exsample = []
for i, t in enumerate(tokens.tolist()):
try:
t.index(self.tokenizer.eos_id)
except ValueError:
if cut_off is not None:
if cut_off == self.tokenizer.decode(t[:cur_pos + 1])[-len(cut_off):]:
if cut_off_times[i] == 1:
continue
else:
cut_off_times[i] -= 1
continue_exsample.append(i)
if len(continue_exsample) == 0:
break
decoder = []
for i, t in enumerate(tokens.tolist()):
t = t[: args.seq_length]
try:
t = t[: t.index(self.tokenizer.pad_id)]
t = t[: t.index(self.tokenizer.eos_id)]
except ValueError:
pass
decoder.append(self.tokenizer.decode(t))
return decoder
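# Test variant of LmGeneration: takes already tokenized prompts and returns the
# raw token tensor instead of decoded strings.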
class LmGeneration_test:
def __init__(self, model, tokenizer):
self.model = model
self.tokenizer = tokenizer
def generate(self, args, prompt_tokens, cut_off=None, cut_off_times=1):
if cut_off is not None:
cut_off_times = [cut_off_times for i in range(len(prompt_tokens))]
batch = len(prompt_tokens)
assert batch <= args.batch_size
# prompt_tokens = [args.tokenizer.encode(x, bos=True, eos=False) for x in prompts]
min_prompt_len = min([len(x) for x in prompt_tokens])
# max_prompt_len = max([len(x) for x in prompt_tokens])
total_len = args.seq_length
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokens = torch.full((batch, total_len), self.tokenizer.pad_id).to(device).long()
for idx, t in enumerate(prompt_tokens):
tokens[idx, : len(t)] = torch.tensor(t).long()
mask = tokens != self.tokenizer.pad_id
start_pos = min_prompt_len
prev_pos = 0
continue_exsample = [i for i in range(batch)]
with torch.no_grad():
for cur_pos in range(start_pos, total_len):
logits = self.model.forward(tokens[continue_exsample, prev_pos:cur_pos], prev_pos, continue_exsample).float()
next_token_scores = apply_top_k(logits, top_k=args.top_k)
next_token_scores = apply_top_p(next_token_scores, args.top_p)
next_token_scores = apply_temperature(next_token_scores, args.temperature)
next_token_scores = apply_advanced_repetition_penalty(
tokens[continue_exsample, :cur_pos],
next_token_scores,
args.repetition_penalty_range,
args.repetition_penalty_slope,
args.repetition_penalty
)
scores = F.softmax(next_token_scores, dim=-1)
next_token = torch.multinomial(scores, num_samples=1).squeeze(1)
next_token = next_token.reshape(-1)
next_token = torch.where(
mask[continue_exsample, cur_pos], tokens[continue_exsample, cur_pos], next_token
)
tokens[continue_exsample, cur_pos] = next_token
prev_pos = cur_pos
# remove eos examples.
continue_exsample = []
for i, t in enumerate(tokens.tolist()):
try:
t.index(self.tokenizer.eos_id)
except ValueError:
if cut_off is not None:
if cut_off == self.tokenizer.decode(t[:cur_pos + 1])[-len(cut_off):]:
if cut_off_times[i] == 1:
continue
else:
cut_off_times[i] -= 1
continue_exsample.append(i)
if len(continue_exsample) == 0:
break
return tokens
import argparse
from utils import load_hyperparam, convert_normal_parameter_to_int8, load_model
from model.tokenize import Tokenizer
from model.llama import *
from generate import LmGeneration
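# Multi-round chat loop: the running "User: ... Bot: ..." history is concatenated
# into a single prompt and truncated (by characters) to keep_length_ratio * seq_length
# before each generation call; 'User:' is used as the generation cut-off string.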
def multi_round_chat(args, lm_generation, keep_length_ratio=0.5):
users = []
answers = []
while True:
user_input = input("User: ")
if user_input == 'clear':
users = []
answers = []
print("开启新的一轮聊天/Start a new round of chat:")
continue
if user_input == 'exit':
break
input_str = ''
for user, ans in zip(users, answers):
input_str += 'User: ' + user + '\nBot: ' + ans + '\n'
input_str += 'User: ' + user_input + '\nBot: '
if len(input_str) >= int(keep_length_ratio * args.seq_length):
input_str = input_str[len(input_str) - int(keep_length_ratio * args.seq_length):]
answer = lm_generation.generate(args, [input_str], cut_off='User:', cut_off_times=1)[0]
answer = answer[len(input_str):]
print("ChatLLaMa: " + answer.replace('User:', ''))
users.append(user_input.rstrip(' ').rstrip('\n'))
answers.append(answer.replace('User:', '').rstrip(' ').rstrip('\n'))
if __name__ == '__main__':
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument("--load_model_path", default=None, type=str,
help="Path of the input model.")
parser.add_argument("--prediction_path", type=str, default=None,
help="Path of the prediction file.")
parser.add_argument("--config_path", type=str, required=True,
help="Path of the config file.")
parser.add_argument("--seq_length", type=int, default=2048,
help="Sequence length.")
parser.add_argument("--world_size", type=int, default=1,
help="the number of gpus.")
parser.add_argument("--keep_length_ratio", type=float, default=0.5)
parser.add_argument("--use_int8", action="store_true")
parser.add_argument("--top_k", type=int, default=10)
parser.add_argument("--top_p", type=float, default=1)
parser.add_argument("--temperature", type=float, default=0.85)
parser.add_argument("--repetition_penalty_range", type=int, default=1024)
parser.add_argument("--repetition_penalty_slope", type=float, default=0)
parser.add_argument("--repetition_penalty", type=float, default=1.15)
parser.add_argument("--spm_model_path", default=None, type=str,
help="Path of the sentence piece model.")
args = parser.parse_args()
args = load_hyperparam(args)
args.batch_size = 1
args.tokenizer = Tokenizer(model_path=args.spm_model_path)
args.vocab_size = args.tokenizer.sp_model.vocab_size()
torch.set_default_tensor_type(torch.HalfTensor)
model = LLaMa(args)
torch.set_default_tensor_type(torch.FloatTensor)
model = load_model(model, args.load_model_path)
model.eval()
# use multi-gpu tensor parallel
if args.world_size > 1:
import tensor_parallel as tp
gpus = ["cuda:" + str(i) for i in range(args.world_size)]
if args.use_int8:
model = tp.tensor_parallel(model, gpus, delay_init=True)
model = convert_normal_parameter_to_int8(model)
else:
model = tp.tensor_parallel(model, gpus)
else:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
lm_generation = LmGeneration(model, args.tokenizer)
multi_round_chat(args, lm_generation, args.keep_length_ratio)
import gradio as gr
import argparse
from utils import load_hyperparam, load_model, convert_normal_parameter_to_int8
from model.tokenize import Tokenizer
from model.llama import *
from generate import LmGeneration
args = None
lm_generation = None
def init_args():
global args
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument("--load_model_path", default=None, type=str,
help="Path of the input model.")
parser.add_argument("--config_path", type=str, required=True,
help="Path of the config file.")
parser.add_argument("--batch_size", type=int, default=1,
help="Batch size.")
parser.add_argument("--seq_length", type=int, default=128,
help="Sequence length.")
parser.add_argument("--world_size", type=int, default=1,
help="the number of gpus.")
parser.add_argument("--use_int8", action="store_true")
parser.add_argument("--top_k", type=int, default=10)
parser.add_argument("--top_p", type=float, default=1)
parser.add_argument("--temperature", type=float, default=0.85)
parser.add_argument("--repetition_penalty_range", type=int, default=1024)
parser.add_argument("--repetition_penalty_slope", type=float, default=0)
parser.add_argument("--repetition_penalty", type=float, default=1.15)
parser.add_argument("--spm_model_path", default=None, type=str,
help="Path of the sentence piece model.")
args = parser.parse_args()
args = load_hyperparam(args)
args.tokenizer = Tokenizer(model_path=args.spm_model_path)
args.vocab_size = args.tokenizer.sp_model.vocab_size()
def init_model():
global lm_generation
torch.set_default_tensor_type(torch.HalfTensor)
model = LLaMa(args)
torch.set_default_tensor_type(torch.FloatTensor)
model = load_model(model, args.load_model_path)
model.eval()
# use multi-gpu tensor parallel
if args.world_size > 1:
import tensor_parallel as tp
gpus = ["cuda:" + str(i) for i in range(args.world_size)]
if args.use_int8:
model = tp.tensor_parallel(model, gpus, delay_init=True)
model = convert_normal_parameter_to_int8(model)
else:
model = tp.tensor_parallel(model, gpus)
else:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
lm_generation = LmGeneration(model, args.tokenizer)
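# Gradio callback: top_k and temperature come from the UI sliders; returns the
# decoded generation for a single prompt.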
def chat(prompt, top_k, temperature):
args.top_k = int(top_k)
args.temperature = temperature
response = lm_generation.generate(args, [prompt])
return response[0]
if __name__ == '__main__':
init_args()
init_model()
demo = gr.Interface(
fn=chat,
inputs=["text", gr.Slider(1, 60, value=40, step=1), gr.Slider(0.1, 2.0, value=1.2, step=0.1)],
outputs="text",
)
demo.launch()
import argparse
from utils import load_hyperparam, convert_normal_parameter_to_int8, load_model
from model.tokenize import Tokenizer
from model.llama import *
from generate import LmGeneration
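# Batch inference entry point: reads one prompt per line from --test_path and
# writes one generation per prompt (blank-line separated) to --prediction_path.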
if __name__ == '__main__':
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument("--load_model_path", default=None, type=str,
help="Path of the input model.")
parser.add_argument("--test_path", type=str, required=True,
help="Path of the testset.")
parser.add_argument("--prediction_path", type=str, required=True,
help="Path of the prediction file.")
parser.add_argument("--config_path", type=str, required=True,
help="Path of the config file.")
parser.add_argument("--batch_size", type=int, default=1,
help="Batch size.")
parser.add_argument("--world_size", type=int, default=1,
help="the number of gpus.")
parser.add_argument("--seq_length", type=int, default=128,
help="Sequence length.")
parser.add_argument("--use_int8", action="store_true")
parser.add_argument("--top_k", type=int, default=10)
parser.add_argument("--top_p", type=float, default=1)
parser.add_argument("--temperature", type=float, default=0.85)
parser.add_argument("--repetition_penalty_range", type=int, default=1024)
parser.add_argument("--repetition_penalty_slope", type=float, default=0)
parser.add_argument("--repetition_penalty", type=float, default=1.15)
parser.add_argument("--spm_model_path", default=None, type=str,
help="Path of the sentence piece model.")
args = parser.parse_args()
args = load_hyperparam(args)
args.tokenizer = Tokenizer(model_path=args.spm_model_path)
args.vocab_size = args.tokenizer.sp_model.vocab_size()
torch.set_default_tensor_type(torch.HalfTensor)
model = LLaMa(args)
torch.set_default_tensor_type(torch.FloatTensor)
model = load_model(model, args.load_model_path)
model.eval()
# use multi-gpu tensor parallel
if args.world_size > 1:
import tensor_parallel as tp
gpus = ["cuda:" + str(i) for i in range(args.world_size)]
if args.use_int8:
model = tp.tensor_parallel(model, gpus, delay_init=True)
model = convert_normal_parameter_to_int8(model)
else:
model = tp.tensor_parallel(model, gpus)
else:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
lm_generation = LmGeneration(model, args.tokenizer)
prompts = []
with open(args.test_path, 'r', encoding='utf-8') as f:
for line in f:
prompts.append(line)
with torch.no_grad():
result = lm_generation.generate(args, prompts)
with open(args.prediction_path, 'w', encoding='utf-8') as f:
for res in result:
f.write(res + '\n')
f.write('\n')
import argparse
import torch
from utils import load_hyperparam, convert_normal_parameter_to_int8, load_model
from model.tokenize import Tokenizer
from model.llama import *
from generate import LmGeneration
from flask import Flask, request
import json
app = Flask(__name__)
args = None
lm_generation = None
def init_model():
global args
global lm_generation
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument("--load_model_path", default=None, type=str,
help="Path of the input model.")
parser.add_argument("--config_path", type=str, required=True,
help="Path of the config file.")
parser.add_argument("--batch_size", type=int, default=1,
help="Batch size.")
parser.add_argument("--seq_length", type=int, default=128,
help="Sequence length.")
parser.add_argument("--world_size", type=int, default=1,
help="the number of gpus.")
parser.add_argument("--use_int8", action="store_true")
parser.add_argument("--top_k", type=int, default=10)
parser.add_argument("--top_p", type=float, default=1)
parser.add_argument("--temperature", type=float, default=0.85)
parser.add_argument("--repetition_penalty_range", type=int, default=1024)
parser.add_argument("--repetition_penalty_slope", type=float, default=0)
parser.add_argument("--repetition_penalty", type=float, default=1.15)
parser.add_argument("--spm_model_path", default=None, type=str,
help="Path of the sentence piece model.")
args = parser.parse_args()
args = load_hyperparam(args)
args.tokenizer = Tokenizer(model_path=args.spm_model_path)
args.vocab_size = args.tokenizer.sp_model.vocab_size()
torch.set_default_tensor_type(torch.HalfTensor)
model = LLaMa(args)
torch.set_default_tensor_type(torch.FloatTensor)
model = load_model(model, args.load_model_path)
model.eval()
# use multi-gpu tensor parallel
if args.world_size > 1:
import tensor_parallel as tp
gpus = ["cuda:" + str(i) for i in range(args.world_size)]
if args.use_int8:
model = tp.tensor_parallel(model, gpus, delay_init=True)
model = convert_normal_parameter_to_int8(model)
else:
model = tp.tensor_parallel(model, gpus)
else:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
lm_generation = LmGeneration(model, args.tokenizer)
@app.route("/chat", methods=['POST'])
def chat():
question = request.json.get("question")
if isinstance(question, str):
question = [question, ]
try:
with torch.no_grad():
answer = lm_generation.generate(args, question)
status = 'success'
except Exception:
answer = ''
status = 'error'
return json.dumps({'answer': answer, 'status': status}, ensure_ascii=False)
if __name__ == '__main__':
init_model()
# first pass on request to initialize int8.
try:
with torch.no_grad():
answer = lm_generation.generate(args, ['hello world!'])
except Exception:
pass
app.run(host='127.0.0.1', port=8888, debug=False)
# Model name
modelName=LLAMA_pytorch
# Model description
modelDescription=Inference of llama models in tencentpretrain format based on the Pytorch framework
# Application scenarios
apoScenario=inference,nlp,text generation,intelligent chat assistant
# Framework type
frameType=Pytorch,Transformers,Tensor_parallel
import torch
import torch.nn as nn
import torch.nn.functional as F
from model.norm import RMSNorm
from model.rope import precompute_freqs_cis, apply_rotary_emb
# import bitsandbytes as bnb
import math
class NormalLinear(nn.Linear):
def reset_parameters(self) -> None:
pass
# class BnbInt8Linear(bnb.nn.Linear8bitLt):
# def __init__(self, *args, **kwargs):
# super().__init__(has_fp16_weights=False, threshold=6.0, *args, **kwargs)
# def reset_parameters(self) -> None:
# pass
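# bitsandbytes int8 layers are disabled in this port, so NormalLinear is returned
# regardless of use_int8.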
def get_linear_layer(use_int8):
if use_int8:
pass
return NormalLinear
class WordEmbedding(nn.Module):
def __init__(self, args):
super(WordEmbedding, self).__init__()
self.embedding = nn.Embedding(args.vocab_size, args.emb_size)
def forward(self, src):
emb = self.embedding(src)
return emb
class MultiHeadedAttention(nn.Module):
def __init__(self, args, hidden_size, heads_num, attention_head_size, has_bias=True, use_int8=True):
super(MultiHeadedAttention, self).__init__()
self.heads_num = heads_num
self.per_head_size = attention_head_size
self.inner_hidden_size = heads_num * attention_head_size
Linear = get_linear_layer(use_int8)
self.linear_layers = nn.ModuleList(
[Linear(hidden_size, self.inner_hidden_size, bias=has_bias) for _ in range(3)]
)
self.final_linear = Linear(self.inner_hidden_size, hidden_size, bias=has_bias)
        # Cache keys and values so that previously generated tokens are not recomputed on later steps.
self.cache_k = torch.zeros(
(args.batch_size, args.seq_length, self.heads_num, self.per_head_size)
)
self.cache_v = torch.zeros(
(args.batch_size, args.seq_length, self.heads_num, self.per_head_size)
)
def forward(self, key, value, query, start_pos, continue_exsample, mask, freqs_cis):
batch_size, seq_length, _ = query.size()
heads_num = self.heads_num
per_head_size = self.per_head_size
query, key, value = [l(x).view(batch_size, -1, heads_num, per_head_size) \
for l, x in zip(self.linear_layers, (query, key, value))]
query, key = apply_rotary_emb(query, key, freqs_cis=freqs_cis)
if self.cache_k.device != key.device:
self.cache_k = self.cache_k.to(key)
if self.cache_v.device != value.device:
self.cache_v = self.cache_v.to(value)
self.cache_k[continue_exsample, start_pos: start_pos + seq_length] = key
self.cache_v[continue_exsample, start_pos: start_pos + seq_length] = value
key = self.cache_k[continue_exsample, : start_pos + seq_length]
value = self.cache_v[continue_exsample, : start_pos + seq_length]
query, key, value = [x.transpose(1, 2) for x in (query, key, value)]
scores = torch.matmul(query, key.transpose(-2, -1))
scores = scores / math.sqrt(float(per_head_size))
if mask is not None:
scores += mask
# probs = nn.Softmax(dim=-1)(scores)
probs = F.softmax(scores.float(), dim=-1).type_as(query)
output = torch.matmul(probs, value).transpose(1, 2).\
contiguous().view(batch_size, seq_length, -1)
return self.final_linear(output)
class GatedFeedForward(nn.Module):
def __init__(self, hidden_size, feedforward_size, has_bias=True, use_int8=True):
super(GatedFeedForward, self).__init__()
Linear = get_linear_layer(use_int8)
self.linear_gate = Linear(hidden_size, feedforward_size, bias=has_bias)
self.linear_1 = Linear(hidden_size, feedforward_size, bias=has_bias)
self.linear_2 = Linear(feedforward_size, hidden_size, bias=has_bias)
self.act = F.silu
def forward(self, x):
# gate = self.act(self.linear_gate(x))
gate = self.act(self.linear_gate(x)).type_as(x)
inter_linear = self.linear_1(x)
inter = gate * inter_linear
output = self.linear_2(inter)
return output
class TransformerLayer(nn.Module):
def __init__(self, args):
super(TransformerLayer, self).__init__()
if hasattr(args, "attention_head_size"):
attention_head_size = args.attention_head_size
else:
attention_head_size = args.hidden_size // args.heads_num
has_bias = bool(1 - args.remove_transformer_bias)
# Multi-head Attention
self.self_attn = MultiHeadedAttention(
args, args.hidden_size, args.heads_num, attention_head_size, has_bias=has_bias,
use_int8=args.use_int8
)
# FFN
self.feed_forward = GatedFeedForward(
args.hidden_size, args.feedforward_size, has_bias, use_int8=args.use_int8
)
self.layer_norm_1 = RMSNorm(args.hidden_size)
self.layer_norm_2 = RMSNorm(args.hidden_size)
def forward(self, hidden, start_pos, continue_exsample, mask, freqs_cis=None):
inter = self.layer_norm_1(hidden)
inter = self.self_attn(inter, inter, inter, start_pos, continue_exsample, mask, freqs_cis)
hidden = hidden + inter
output = self.layer_norm_2(hidden)
output = self.feed_forward(output) + hidden
return output
class TransformerEncoder(nn.Module):
def __init__(self, args):
super(TransformerEncoder, self).__init__()
self.mask = args.mask
self.layers_num = args.layers_num
self.transformer = nn.ModuleList(
[TransformerLayer(args) for _ in range(self.layers_num)]
)
self.layer_norm = RMSNorm(args.hidden_size)
self.freqs_cis = precompute_freqs_cis(args.hidden_size // args.heads_num, args.max_seq_length * 2)
def forward(self, emb, start_pos, continue_exsample):
batch_size, seq_length, _ = emb.size()
mask = None
if seq_length > 1:
mask = torch.ones(seq_length, seq_length, device=emb.device)
mask = torch.tril(mask)
mask = (1.0 - mask) * -10000
mask = mask.repeat(batch_size, 1, 1, 1)
hidden = emb
freqs_cis = self.freqs_cis[start_pos: start_pos + seq_length].to(hidden.device)
for i in range(self.layers_num):
hidden = self.transformer[i](hidden, start_pos, continue_exsample, mask, freqs_cis=freqs_cis)
return self.layer_norm(hidden)
class LmOutput(nn.Module):
def __init__(self, args):
super(LmOutput, self).__init__()
        # Note: the LM output head always uses a normal (non-int8) linear layer.
Linear = get_linear_layer(False)
self.lm = Linear(args.hidden_size, args.vocab_size, bias=False)
def forward(self, x):
return self.lm(x[:, -1, :])
class LLaMa(nn.Module):
def __init__(self, args):
super(LLaMa, self).__init__()
self.embedding = WordEmbedding(args)
self.encoder = TransformerEncoder(args)
self.target = LmOutput(args)
#@torch.inference_mode()
def forward(self, src, start_pos, continue_exsample):
emb = self.embedding(src)
output = self.encoder(emb, start_pos, continue_exsample)
output = self.target(output)
return output
from torch import nn
import torch
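# Root-mean-square layer normalization (no mean subtraction and no bias), as used in LLaMA.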
class RMSNorm(torch.nn.Module):
def __init__(self, hidden_size, eps=1e-6):
super().__init__()
self.eps = eps
self.weight = nn.Parameter(torch.ones(hidden_size))
def _norm(self, x):
return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
def forward(self, x):
output = self._norm(x.float()).type_as(x)
return output * self.weight
import torch
from typing import Tuple
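# Rotary position embeddings (RoPE): precompute the complex rotation factors
# exp(i * t * theta_j) for each position t and frequency theta_j, then rotate
# the query/key head dimensions pairwise in the complex plane.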
def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
t = torch.arange(end, device=freqs.device) # type: ignore
freqs = torch.outer(t, freqs).float() # type: ignore
freqs_cis = torch.polar(torch.ones_like(freqs), freqs) # complex64
return freqs_cis
def reshape_for_broadcast(freqs_cis: torch.Tensor, x: torch.Tensor):
ndim = x.ndim
assert 0 <= 1 < ndim
assert freqs_cis.shape == (x.shape[1], x.shape[-1])
shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
return freqs_cis.view(*shape)
def apply_rotary_emb(
xq: torch.Tensor,
xk: torch.Tensor,
freqs_cis: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
return xq_out.type_as(xq), xk_out.type_as(xk)