init llama

Commit b8b237bf authored Apr 25, 2024 by zhuwenwen
Showing with 155 additions and 0 deletions

README.md +126 -0
docs/llama_pri.png +0 -0
docs/llama_str.png +0 -0
model.properties +10 -0
offline_inference.py +19 -0

--- a/README.md
+++ b/README.md
+<!--
+ * @Author: zhuww
+ * @email: zhuww@sugon.com
+ * @Date: 2024-04-25 10:38:07
+ * @LastEditTime: 2024-04-25 17:47:01
+-->
+# LLAMA
+
+## Paper
+- [https://arxiv.org/pdf/2302.13971.pdf](https://arxiv.org/pdf/2302.13971.pdf)
+
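+## Model Architecture
+LLaMA is based on the Transformer architecture and incorporates several improvements that were proposed later and adopted by other models such as PaLM. The main differences from the original architecture are:
+- Pre-normalization. To improve training stability, the input of each Transformer sub-layer is normalized instead of the output, using the RMSNorm normalization function.
+- SwiGLU activation function [PaLM]. The ReLU non-linearity is replaced with the SwiGLU activation function to improve performance, with a hidden dimension of 2/3 · 4d instead of the 4d used in PaLM.
+- Rotary embeddings. Absolute positional embeddings are removed, and rotary positional embeddings (RoPE) are added at each layer of the network instead.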
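The first two changes can be sketched in plain Python. This is a minimal illustration, not code from this repository: the names `rms_norm`, `swiglu_hidden_dim`, `silu`, and `swiglu` are hypothetical, the `eps` value is an assumption, and rounding the hidden size up to a multiple of 256 is an assumption borrowed from the publicly released LLaMA reference implementation, not something stated in the text above.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then apply a per-element gain."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def swiglu_hidden_dim(d_model, multiple_of=256):
    """Feed-forward width of 2/3 * 4d, rounded up to a multiple of `multiple_of`.

    The rounding step is an assumption; the text only gives 2/3 * 4d.
    """
    hidden = int(2 * (4 * d_model) / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)


def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu(gate, up):
    """Element-wise SwiGLU gating: silu(gate) * up."""
    return [silu(g) * u for g, u in zip(gate, up)]
```

Under these assumptions, `swiglu_hidden_dim(4096)` gives 11008 rather than the 16384 that a plain 4d width would give.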
+
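Rotary position embeddings can be illustrated with a short sketch as well. It assumes adjacent dimensions are paired and uses the commonly cited base of 10000; real implementations may pair dimensions differently, and the helper names `rope` and `dot` are hypothetical. The property worth noticing is that the dot product of a rotated query and key depends only on the distance between their positions.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle position * base**(-i/d).

    `i` runs over even indices, so the exponent equals -2k/d for pair k.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))
```

For example, `dot(rope(q, 5), rope(k, 3))` and `dot(rope(q, 12), rope(k, 10))` agree up to floating-point error, because both pairs of positions are two steps apart.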
+![img](./docs/llama_str.png)
+
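+## Algorithm
+LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. The models are trained on trillions of tokens and show that state-of-the-art models can be trained using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets.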
+
+![img](./docs/llama_pri.png)
+
+## Environment Setup
+
+The Docker image for inference can be pulled from [SourceFind (光源)](https://www.sourcefind.cn/#/image/dcu/custom):
+```
+docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.3.3-dtk23.10-py38
+# <Image ID>: replace with the ID of the Docker image pulled above
+# <Host Path>: path on the host
+# <Container Path>: path the host directory is mapped to inside the container
+docker run -it --name llama --privileged --shm-size=64G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal -v <Host Path>:<Container Path> <Image ID> /bin/bash
+```
+
+Versions provided by the image:
+* DTK driver: dtk23.10
+* PyTorch: 2.1.0
+* vllm: 0.3.3
+* xformers: 0.0.23
+* flash_attn: 2.0.4
+* Python: 3.8
+
+## Dataset
+None
+
+## Inference
+
+### Model Download
+
+[Llama-2-7B](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
+
+[Llama-2-13B](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf)
+
+[Llama-2-70B](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf)
+
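+### Offline Batch Inference
+```bash
+python offline_inference.py
+```
+Here, `prompts` is the list of prompts. `temperature` controls sampling randomness: lower values make generation more deterministic, higher values make it more random, 0 means greedy sampling, and the default is 1. `max_tokens=16` is the maximum number of generated tokens (16 is also the vLLM default). `model` is the model path. `tensor_parallel_size=1` is the number of accelerator cards to use, with a default of 1. `dtype="float16"` is the inference data type.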
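The effect of `temperature` can be illustrated with a small standalone sketch. This is not vLLM's sampler, which also applies settings such as `top_p`; the function names `token_probabilities` and `sample_token` are hypothetical, and the sketch only shows how dividing logits by the temperature sharpens or flattens the distribution, with 0 treated as greedy selection.

```python
import math
import random


def token_probabilities(logits, temperature):
    """Softmax over logits / temperature; temperature 0 selects the argmax."""
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng=random):
    """Draw one token index according to the temperature-scaled distribution."""
    probs = token_probabilities(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

With logits `[1.0, 3.0, 2.0]`, a temperature of 0 always returns index 1, while larger temperatures spread probability more evenly across all three indices.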
+
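+### OpenAI-Compatible Server
+Start the server:
+```bash
+python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-chat-hf --enforce-eager
+```
+Here `--model` is the path of the model to load. By default the server uses the chat template predefined in the tokenizer; `--chat-template` can be used to supply a new template that overrides the default.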
+
+List the available models:
+```bash
+curl http://localhost:8000/v1/models
+```
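The same query can be issued from Python using only the standard library. This is a sketch that assumes the server started above is listening on `localhost:8000` and returns the usual OpenAI-style list with a `data` array of objects carrying an `id`; the helper names `extract_model_ids` and `list_models` are hypothetical.

```python
import json
from urllib.request import urlopen


def extract_model_ids(payload):
    """Return the `id` of every entry in an OpenAI-style model list."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url="http://localhost:8000"):
    """Fetch /v1/models from a running server and return the model ids."""
    with urlopen(f"{base_url}/v1/models") as response:
        return extract_model_ids(json.load(response))
```

Calling `list_models()` requires the server to be running; `extract_model_ids` can be exercised on its own with a saved response.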
+
+### Using the OpenAI Completions API with vllm
+```bash
+curl http://localhost:8000/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "meta-llama/Llama-2-7b-chat-hf",
+        "prompt": "I believe the meaning of life is",
+        "max_tokens": 7,
+        "temperature": 0
+    }'
+```
+Alternatively, use [vllm/examples/openai_completion_client.py](vllm/examples/openai_completion_client.py)
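For a dependency-free variant of the curl call, the request can be built with `urllib.request`. This is a minimal sketch, not the referenced example script: the helper names `build_completion_request` and `complete` are hypothetical, and the response is assumed to follow the OpenAI completions shape in which the generated text sits at `choices[0].text`.

```python
import json
from urllib.request import Request, urlopen


def build_completion_request(prompt, model="meta-llama/Llama-2-7b-chat-hf",
                             base_url="http://localhost:8000",
                             max_tokens=7, temperature=0):
    """Build a POST request equivalent to the curl command for /v1/completions."""
    body = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return Request(
        f"{base_url}/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request to a running server and return the generated text."""
    with urlopen(build_completion_request(prompt, **kwargs)) as response:
        return json.load(response)["choices"][0]["text"]
```

Separating request construction from sending makes it possible to inspect the payload without a running server.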
+
+
+### Using the OpenAI Chat API with vllm
+```bash
+curl http://localhost:8000/v1/chat/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "meta-llama/Llama-2-7b-chat-hf",
+        "messages": [
+            {"role": "system", "content": "I believe the meaning of life is"},
+            {"role": "user", "content": "I believe the meaning of life is"}
+        ]
+    }'
+```
+Alternatively, use [vllm/examples/openai_chatcompletion_client.py](vllm/examples/openai_chatcompletion_client.py)
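The chat endpoint takes a list of role-tagged messages instead of a single prompt. The sketch below shows one way to assemble that payload and read the reply; the helper names `build_chat_payload` and `extract_reply` are hypothetical, and the reply is assumed to sit at `choices[0].message.content` as in the OpenAI chat format.

```python
import json


def build_chat_payload(user_message, system_message=None,
                       model="meta-llama/Llama-2-7b-chat-hf"):
    """Assemble the JSON body for /v1/chat/completions."""
    messages = []
    if system_message is not None:
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": user_message})
    return {"model": model, "messages": messages}


def extract_reply(response_body):
    """Pull the assistant text out of a chat completion response (str or bytes)."""
    return json.loads(response_body)["choices"][0]["message"]["content"]
```

The resulting dictionary can be serialized with `json.dumps` and posted to `/v1/chat/completions` with any HTTP client.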
+
+
+## Result
+Accelerator used: 1 × DCU-K100-64G
+```
+Prompt: 'I believe the meaning of life is', Generated text: ' to find purpose, happiness, and fulfillment. Here are some reasons why:\n\n1. Purpose: Having a sense of purpose gives life meaning and direction. It helps individuals set goals and work towards achieving them, which can lead to a sense of accomplishment and fulfillment.\n2. Happiness: Happiness is a fundamental aspect of life that brings joy and satisfaction.
+```
+
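+## Accuracy
+None
+
+## Application Scenarios
+
+### Algorithm Category
+Conversational question answering
+
+### Key Application Industries
+Finance, scientific research, education
+
+## Source Repository and Issue Reporting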
+* [https://developer.hpccube.com/codes/modelzoo/llama_vllm](https://developer.hpccube.com/codes/modelzoo/llama_vllm)
+
+## References
+* [https://github.com/vllm-project/vllm](https://github.com/vllm-project/vllm)
--- a/docs/llama_pri.png
+++ b/docs/llama_pri.png
--- a/docs/llama_str.png
+++ b/docs/llama_str.png
--- a/model.properties
+++ b/model.properties
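+# Unique model identifier
+modelCode=601
+# Model name
+modelName=llama_vllm
+# Model description
+modelDescription=LLaMA is a collection of foundation language models ranging from 7B to 70B parameters. The models are trained on trillions of tokens and show that state-of-the-art models can be trained using publicly available datasets exclusively.
+# Application scenarios
+appScenario=inference,conversational question answering,finance,scientific research,education
+# Framework type
+frameType=llama_vllm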
--- a/offline_inference.py
+++ b/offline_inference.py
+from vllm import LLM, SamplingParams
+
+# Sample prompts.
+prompts = [
+    "I believe the meaning of life is",
+]
+# Create a sampling params object.
+sampling_params = SamplingParams(temperature=0, top_p=0.95, max_tokens=16)
+
+# Create an LLM.
+llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", tensor_parallel_size=1, trust_remote_code=True, dtype="float16", enforce_eager=True)
+# Generate texts from the prompts. The output is a list of RequestOutput objects
+# that contain the prompt, generated text, and other information.
+outputs = llm.generate(prompts, sampling_params)
+# Print the outputs.
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")