README.md

<!--
 * @Author: zhuww
 * @email: zhuww@sugon.com
 * @Date: 2024-04-25 10:38:07
 * @LastEditTime: 2024-04-25 17:47:01
-->
# LLAMA

## 论文
- [https://arxiv.org/pdf/2302.13971.pdf](https://arxiv.org/pdf/2302.13971.pdf)

## 模型结构
LLAMA网络基于 Transformer 架构。提出了各种改进，并用于不同的模型，例如 PaLM。以下是与原始架构的主要区别：
预归一化。为了提高训练稳定性，对每个transformer 子层的输入进行归一化，而不是对输出进行归一化。使用 RMSNorm 归一化函数。
SwiGLU 激活函数 [PaLM]。使用 SwiGLU 激活函数替换 ReLU 非线性以提高性能。使用 2 /3 4d 的维度而不是 PaLM 中的 4d。
旋转嵌入。移除了绝对位置嵌入，而是添加了旋转位置嵌入 (RoPE)，在网络的每一层。

![img](./docs/llama_str.png)

## 算法原理
LLama是一个基础语言模型的集合,参数范围从7B到65B。在数万亿的tokens上训练出的模型，并表明可以专门使用公开可用的数据集来训练最先进的模型，而不依赖于专有的和不可访问的数据集。

![img](./docs/llama_pri.png)

## 环境配置

提供[光源](https://www.sourcefind.cn/#/image/dcu/custom)拉取推理的docker镜像：
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.3.3-dtk23.10-py38
# <Image ID>用上面拉取docker镜像的ID替换
# <Host Path>主机端路径
# <Container Path>容器映射路径
docker run -it --name llama --privileged --shm-size=64G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal -v <Host Path>:<Container Path> <Image ID> /bin/bash

# 更新ray版本
pip install ray==2.9.1
```

镜像版本依赖：
* DTK驱动：dtk23.10
* Pytorch: 2.1.0
* vllm: 0.3.3
* xformers: 0.0.23
* flash_attn: 2.0.4
* python: python3.8

## 数据集
无

## 推理

### 模型下载

| 基座模型                                                                        | chat模型                                                                                | GPTQ模型                                                                                          |
| ------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------- |
| [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)   | [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)    | [Llama-2-7B-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GPTQ/tree/gptq-4bit-128g-actorder_True)   |
| [Llama-2-13b-hf](https://huggingface.co/meta-llama/Llama-2-13b-hf) | [Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) | [Llama-2-13B-GPTQ](https://huggingface.co/TheBloke/Llama-2-13B-GPTQ/tree/gptq-4bit-128g-actorder_True) |
| [Llama-2-70b-hf](https://huggingface.co/meta-llama/Llama-2-70b-hf) | [Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) | [Llama-2-70B-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-70B-Chat-GPTQ/tree/gptq-4bit-128g-actorder_True) |


### 离线批量推理
```bash
python offline_inference.py
```
其中，`prompts`为提示词；`temperature`为控制采样随机性的值，值越小模型生成越确定，值变高模型生成更随机，0表示贪婪采样，默认为1；`max_tokens=16`为生成长度，默认为1；
`model`为模型路径；`tensor_parallel_size=1`为使用卡数，默认为1；`dtype="float16"`为推理数据类型，如果模型权重是bfloat16,需要修改为float16推理,`quantization="gptq"`为使用gptq量化进行推理,需下载以上GPTQ模型。


### 离线批量推理性能测试
1、指定输入输出
```bash
python benchmark_throughput.py -num-prompts 1 --input-len 32 --output-len 128 --model meta-llama/Llama-2-7b-chat-hf -tp 1 --trust-remote-code --enforce-eager --dtype float16
```
其中`--num-prompts`是batch数，`--input-len`是输入seqlen，`--output-len`是输出token长度，`--model`为模型路径，`-tp`为使用卡数，`dtype="float16"`为推理数据类型，如果模型权重是bfloat16,需要修改为float16推理。若指定`--output-len  1`即为首字延迟。

2、使用数据集
下载数据集：
```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

```bash
python benchmark_throughput.py -num-prompts 1 --model meta-llama/Llama-2-7b-chat-hf --dataset ShareGPT_V3_unfiltered_cleaned_split.json -tp 1 --trust-remote-code --enforce-eager --dtype float16
```
其中`--num-prompts`是batch数，`--model`为模型路径，`--dataset`为使用的数据集，`-tp`为使用卡数，`dtype="float16"`为推理数据类型，如果模型权重是bfloat16,需要修改为float16推理。


### api服务推理性能测试
1、启动服务端：
```bash
python -m vllm.entrypoints.api_server  --model meta-llama/Llama-2-7b-chat-hf  --dtype float16 --enforce-eager -tp 1 
```

2、启动客户端：
```bash
python vllm/benchmarks/benchmark_serving.py --model meta-llama/Llama-2-7b-chat-hf --dataset ShareGPT_V3_unfiltered_cleaned_split.json  --num-prompts 1 --trust-remote-code
```
参数同使用数据集，离线批量推理性能测试，具体参考[vllm/benchmarks/benchmark_serving.py]


### OpenAI兼容服务
启动服务：
```bash
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-chat-hf --enforce-eager --dtype float16 --trust-remote-code
```
这里`--model`为加载模型路径，`--dtype`为数据类型：float16，默认情况使用tokenizer中的预定义聊天模板，`--chat-template`可以添加新模板覆盖默认模板

列出模型型号：
```bash
curl http://localhost:8000/v1/models
```

### OpenAI Completions API和vllm结合使用
```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "I believe the meaning of life is",
        "max_tokens": 7,
        "temperature": 0
    }'
```
或者使用[vllm/examples/openai_completion_client.py](https://developer.hpccube.com/codes/OpenDAS/vllm/-/tree/3e147e194e5a3b0fc25a61dd91fdc8a682cbba9d/examples/openai_completion_client.py)


### OpenAI Chat API和vllm结合使用
```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [
            {"role": "system", "content": "I believe the meaning of life is"},
            {"role": "user", "content": "I believe the meaning of life is"}
        ]
    }'
```
或者使用[vllm/examples/openai_chatcompletion_client.py](https://developer.hpccube.com/codes/OpenDAS/vllm/-/tree/3e147e194e5a3b0fc25a61dd91fdc8a682cbba9d/examples/openai_chatcompletion_client.py)


## result
使用的加速卡:1张 DCU-K100-64G
```
Prompt: 'I believe the meaning of life is', Generated text: ' to find purpose, happiness, and fulfillment. Here are some reasons why:\n\n1. Purpose: Having a sense of purpose gives life meaning and direction. It helps individuals set goals and work towards achieving them, which can lead to a sense of accomplishment and fulfillment.\n2. Happiness: Happiness is a fundamental aspect of life that brings joy and satisfaction.
```

## 精度
无

## 应用场景

### 算法类别
对话问答

### 热点应用行业
金融,科研,教育

## 源码仓库及问题反馈
* [https://developer.hpccube.com/codes/modelzoo/llama_vllm](https://developer.hpccube.com/codes/modelzoo/llama_vllm)

## 参考资料
* [https://github.com/vllm-project/vllm](https://github.com/vllm-project/vllm)