add llama inference by tgi tutorial

Commit 7f4f25e3 authored May 31, 2024 by huangwb
Showing with 128 additions and 0 deletions

.gitmodules +3 -0
README.md +124 -0
docs/llama_pri.png +0 -0
docs/llama_str.png +0 -0
text-generation-inference +1 -0

--- a/.gitmodules
+++ b/.gitmodules
+[submodule "text-generation-inference"]
+	path = text-generation-inference
+	url = http://developer.hpccube.com/codes/OpenDAS/text-generation-inference.git
--- a/README.md
+++ b/README.md
+# LLAMA
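+## Paper
+- [https://arxiv.org/pdf/2302.13971.pdf](https://arxiv.org/pdf/2302.13971.pdf)
+## Model Architecture
+LLaMA is built on the Transformer architecture and adopts several improvements that were proposed after the original design and used in other models such as PaLM. The main differences from the original architecture are:
+- Pre-normalization. To improve training stability, the input of each transformer sub-layer is normalized instead of the output. The normalization function is RMSNorm.
+- SwiGLU activation function [PaLM]. The ReLU non-linearity is replaced with SwiGLU to improve performance. The hidden dimension is 2/3 · 4d instead of the 4d used in PaLM.
+- Rotary embeddings. Absolute positional embeddings are removed, and rotary positional embeddings (RoPE) are applied at every layer of the network instead.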
+![img](./docs/llama_str.png)
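The normalization and feed-forward sizing described in this section can be sketched in a few lines of plain Python. This is an illustrative sketch, not code from this repository: the names `rms_norm` and `swiglu_hidden_dim`, the `eps` default, and the rounding of the hidden width up to a multiple of 256 are assumptions (the rounding follows Meta's reference implementation and is not stated in this README).

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned gain."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def swiglu_hidden_dim(d_model, multiple_of=256):
    """Feed-forward width of 2/3 * 4d, rounded up to a multiple of `multiple_of`."""
    hidden = int(2 * (4 * d_model) / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)


if __name__ == "__main__":
    # Hypothetical two-element activation vector with unit gains.
    print(rms_norm([3.0, 4.0], weight=[1.0, 1.0]))
    print(swiglu_hidden_dim(4096))
```

With `d_model = 4096` the second function returns 11008. Unlike LayerNorm, RMSNorm does not subtract the mean, so it needs only one reduction over the vector.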
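+## Algorithm Principle
+LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. The models are trained on trillions of tokens and show that state-of-the-art models can be trained using publicly available datasets exclusively, without relying on proprietary and inaccessible datasets.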
+![img](./docs/llama_pri.png)
+## Environment Setup
+### Docker (Option 1)
+TODO
+### Dockerfile (Option 2)
+```
+cd ./text-generation-inference
+docker build -f Dockerfile_dcu -t tgi:latest --ulimit nofile=2048:2048 .
+# <Host Path>: path on the host
+# <Container Path>: mount path inside the container
+docker run -it --name llama_tgi --privileged --shm-size=64G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal:ro -v <Host Path>:<Container Path> tgi:latest /bin/bash
+```
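+## Dataset
+None
+## Inference
+### Build and Install from Source
+See the build-from-source section of the [README](./text-generation-inference/README.md) in the source tree.
+The toolkits and deep learning libraries needed to build this project from source can be downloaded from the [光合 (Guanghe)](https://developer.hpccube.com/tool/) developer community.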
+- [DTK 24.04](https://cancon.hpccube.com:65024/1/main/DTK-24.04)
+- [Pytorch 2.1.0](https://cancon.hpccube.com:65024/4/main/pytorch/DAS1.0)
+- [Flash_attn 2.0.4](https://cancon.hpccube.com:65024/4/main/flash_attn/DAS1.0)
+- [Triton 2.1.0](https://cancon.hpccube.com:65024/4/main/triton/DAS1.0)
+### Model Download
+| Base model | Chat model | GPTQ model |
+| --- | --- | --- |
+| [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)   | [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)    | [Llama-2-7B-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GPTQ/tree/gptq-4bit-128g-actorder_True)   |
+| [Llama-2-13b-hf](https://huggingface.co/meta-llama/Llama-2-13b-hf) | [Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) | [Llama-2-13B-GPTQ](https://huggingface.co/TheBloke/Llama-2-13B-GPTQ/tree/gptq-4bit-128g-actorder_True) |
+| [Llama-2-70b-hf](https://huggingface.co/meta-llama/Llama-2-70b-hf) | [Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) | [Llama-2-70B-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-70B-Chat-GPTQ/tree/gptq-4bit-128g-actorder_True) |
+| [Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) | [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | |
+| [Meta-Llama-3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B) | [Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) | |
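One way to fetch these checkpoints is through the `huggingface_hub` package, sketched below. The helper names `parse_hf_url` and `download_model` are hypothetical, and the sketch assumes `huggingface_hub` is installed; the README itself does not prescribe a download method. The GPTQ links in the table point at a branch (`/tree/<branch>`), which is passed as `revision`.

```python
from urllib.parse import urlparse


def parse_hf_url(url):
    """Split a Hugging Face model URL into (repo_id, revision).

    revision is None when the URL has no /tree/<branch> suffix.
    """
    parts = urlparse(url).path.strip("/").split("/")
    repo_id = "/".join(parts[:2])
    revision = parts[3] if len(parts) >= 4 and parts[2] == "tree" else None
    return repo_id, revision


def download_model(url, local_dir):
    """Download every file of the repository behind `url` into `local_dir`."""
    # Imported lazily so the URL helper above works without the package.
    from huggingface_hub import snapshot_download

    repo_id, revision = parse_hf_url(url)
    return snapshot_download(repo_id=repo_id, revision=revision, local_dir=local_dir)


if __name__ == "__main__":
    print(parse_hf_url("https://huggingface.co/meta-llama/Llama-2-7b-chat-hf"))
    print(parse_hf_url(
        "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GPTQ/tree/gptq-4bit-128g-actorder_True"
    ))
```

The `meta-llama` repositories are gated, so `download_model` will typically need an authenticated Hugging Face login before it succeeds.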
+### Deploy TGI
+1. Start the TGI server
+```
+HIP_VISIBLE_DEVICES=3 text-generation-launcher --dtype=float16 --model-id /path/to/Llama-2-7b-chat-hf --port 3001
+```
+To list the remaining launcher options, run:
+```
+text-generation-launcher --help
+```
+2. Verify the service
+Using curl:
+```
+curl 127.0.0.1:3001/generate \
+    -X POST \
+    -d '{"inputs":"What is deep learning?","parameters":{"max_new_tokens":100,"temperature":0.7}}' \
+    -H 'Content-Type: application/json'
+```
+Calling from Python:
+```
+import requests
+headers = {
+    "Content-Type": "application/json",
+}
+data = {
+    'inputs': 'What is Deep Learning?',
+    'parameters': {
+        'max_new_tokens': 20,
+    },
+}
+response = requests.post('http://127.0.0.1:3001/generate', headers=headers, json=data)
+print(response.json())
+# {'generated_text': '\n\nDeep Learning is a subset of Machine Learning that is concerned with the development of algorithms that can'}
+```
+For the full API, see [https://huggingface.github.io/text-generation-inference](https://huggingface.github.io/text-generation-inference)
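Loading the model can take a while after the launcher starts, so a request sent too early may fail. The sketch below polls the server's `/health` route until it answers with HTTP 200. The function name `wait_for_tgi` and the timeout values are assumptions for illustration; the port matches the launcher command used in this README.

```python
import time
import urllib.error
import urllib.request


def wait_for_tgi(base_url, timeout_s=300.0, interval_s=2.0):
    """Poll GET <base_url>/health until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server not reachable or not ready yet.
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    ready = wait_for_tgi("http://127.0.0.1:3001", timeout_s=5.0)
    print("TGI ready:", ready)
```

Only the standard library is used here, so the check can run in the container before `requests` or any other client package is installed.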
+### TGI benchmark test
+Example:
+```
+text-generation-benchmark -s 32 -d 128 --runs 10 --tokenizer-name /path/to/Llama-2-7b-chat-hf
+```
+To list the remaining benchmark options, run:
+```
+text-generation-benchmark --help
+```
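To compare several input lengths, the benchmark command can be generated from a small script. The sketch below uses only the flags shown in the example (`-s` for the input sequence length, `-d` for the number of decoded tokens, `--runs`, `--tokenizer-name`). The helper name `benchmark_command` and the sweep values 128 and 512 are hypothetical.

```python
import shlex


def benchmark_command(tokenizer_name, sequence_length, decode_length, runs=10):
    """Return the argv list for one text-generation-benchmark invocation."""
    return [
        "text-generation-benchmark",
        "-s", str(sequence_length),
        "-d", str(decode_length),
        "--runs", str(runs),
        "--tokenizer-name", tokenizer_name,
    ]


if __name__ == "__main__":
    tokenizer = "/path/to/Llama-2-7b-chat-hf"
    # Print one command per input length; the lengths are illustrative only.
    for seq_len in (32, 128, 512):
        print(shlex.join(benchmark_command(tokenizer, seq_len, 128)))
```

Each returned list can also be passed directly to `subprocess.run` instead of being printed.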
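+### Accuracy
+None
+## Application Scenarios
+### Algorithm Category
+Conversational question answering
+### Key Application Industries
+Finance, scientific research, education
+## Source Repository and Issue Feedback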
+* [https://developer.hpccube.com/codes/modelzoo/llama_tgi](https://developer.hpccube.com/codes/modelzoo/llama_tgi)
+## References
+* [https://github.com/huggingface/text-generation-inference](https://github.com/huggingface/text-generation-inference)
--- a/docs/llama_pri.png
+++ b/docs/llama_pri.png
--- a/docs/llama_str.png
+++ b/docs/llama_str.png
--- a/text-generation-inference @ 6e6d3c1a
+++ b/text-generation-inference @ 6e6d3c1a
+Subproject commit 6e6d3c1afe567bf03a33e2ee9653a40322c9f385