Delete README.md

3bc31e0f · wanglch · 20f80172 · 20f80172
Commit 3bc31e0f authored May 22, 2024 by wanglch

Showing with 0 additions and 207 deletions

README.md README.md +0 -207

--- a/README.md
+++ b/README.md
-# UMT5
-**Note: pretraining is required before running downstream tasks; see train_model.py for the training code.**
-<div align="center">
-    <img align="center" src=docs/T5_task.png>
-</div>
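-## Paper
- [Paper link] [UniMax: Fairer and more Effective Language Sampling for Large-Scale Multilingual Pretraining](https://arxiv.org/abs/2304.09151)
-## Model Architecture
-umT5 is a multilingual version of T5 that retains most of T5's versatility. It is pretrained on mC4, a multilingual Common Crawl corpus covering 101 languages. It uses an encoder-decoder architecture with 12 encoder layers and 12 decoder layers, for 220M parameters in total, roughly twice the size of bert-base.
-### mT5 Model Architecture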
-<div align="center">
-    <img align="center" src=docs/T5_structure.png>
-</div>
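-## Algorithm
-Overall, mT5 follows directly from T5 and is largely identical to it. In terms of model architecture, however, mT5 uses the T5.1.1 recipe, which is briefly introduced here.
-Its main change comes from the paper [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202), which borrows the **GLU** (Gated Linear Unit) from [Language Modeling with Gated Convolutional Networks](https://arxiv.org/abs/1612.08083) to strengthen the FFN block. Specifically, the original T5 FFN (T5 uses no bias) is: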
-<div align="center">
-    <img align="center" src=docs/equation1.png>
-</div>
-and it is changed to:
-<div align="center">
-    <img align="center" src=docs/euqation2.png>
-</div>
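For readers who cannot see the two equation images, the change can be written out in the notation of the GLU Variants paper cited above. The weight names $W_1$, $W_2$, $W$ and $V$ follow that paper and are an assumption here; the images in this README may label them differently.

```latex
% Original T5 feed-forward block (no bias terms):
\mathrm{FFN}_{\mathrm{ReLU}}(x) = \max(x W_1,\, 0)\, W_2

% T5.1.1 / mT5 feed-forward block with a GELU-gated linear unit (GEGLU);
% \otimes denotes element-wise multiplication:
\mathrm{FFN}_{\mathrm{GEGLU}}(x) = \bigl(\mathrm{GELU}(x W) \otimes x V\bigr)\, W_2
```

The gated variant replaces the single input projection with two projections of the same input, one passed through GELU and used as a gate on the other.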
-### T5 Transformer
-<div align="center">
-    <img align="center" src=docs/t5transformer.png>
-</div>
-## Environment Setup
-### Docker (Method 1)
-```
-docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu22.04-dtk23.10.1-py310
-docker run -it -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro --shm-size=64G --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name umt5 <your imageID> bash
-cd /path/your_code_data/umt5
-pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
-```
-### Dockerfile (Method 2)
-```
-cd /path/your_code_data/umt5/docker
-docker build --no-cache -t umt5:latest .
-docker run --shm-size=64G --name umt5 -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v /path/your_code_data/:/path/your_code_data/ -it umt5 bash
-```
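-### Anaconda (Method 3)
-```
-DTK driver: dtk23.10
-python: python3.10
-torch: 2.1.0
-torchvision: 0.16.0
-deepspeed: 0.12.3
-```
-`Tips: the versions of the DTK driver, python, torch, and the other DCU-related tools above must match each other exactly`
-The special deep learning libraries that this project requires for DCU cards can be downloaded and installed from the [Guanghe (光合)](https://developer.hpccube.com/tool/) developer community.
-```
-conda create -n umt5 python=3.10
-conda activate umt5
-cd /path/your_code_data/umt5
-pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
-```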
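-## Dataset
-We use the large-scale Chinese short text summarization corpus [LCSTS](http://icrc.hitsz.edu.cn/Article/show/139.html) as the dataset. It is built from short news posts on Sina Weibo and contains more than 2 million samples.
-### Data Processing Code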
-```
-import torch
-from torch.utils.data import Dataset
-
-# max_dataset_size, max_input_length, max_target_length, tokenizer and model
-# are expected to be defined by the calling training/test script.
-
-class LCSTS(Dataset):
-    def __init__(self, data_file):
-        self.data = self.load_data(data_file)
-
-    def load_data(self, data_file):
-        # Each line holds one sample in the form "<title>!=!<content>".
-        Data = {}
-        with open(data_file, 'rt', encoding='utf-8') as f:
-            for idx, line in enumerate(f):
-                if idx >= max_dataset_size:
-                    break
-                items = line.strip().split('!=!')
-                assert len(items) == 2
-                Data[idx] = {
-                    'title': items[0],
-                    'content': items[1]
-                }
-        return Data
-
-    def __len__(self):
-        return len(self.data)
-
-    def __getitem__(self, idx):
-        return self.data[idx]
-
-def collate_fn(batch_samples):
-    # The article body is the model input; the title is the target summary.
-    batch_inputs, batch_targets = [], []
-    for sample in batch_samples:
-        batch_inputs.append(sample['content'])
-        batch_targets.append(sample['title'])
-    batch_data = tokenizer(
-        batch_inputs,
-        padding=True,
-        max_length=max_input_length,
-        truncation=True,
-        return_tensors="pt"
-    )
-    with tokenizer.as_target_tokenizer():
-        labels = tokenizer(
-            batch_targets,
-            padding=True,
-            max_length=max_target_length,
-            truncation=True,
-            return_tensors="pt"
-        )["input_ids"]
-        # model.module assumes the model is wrapped (e.g. DataParallel/DDP).
-        batch_data['decoder_input_ids'] = model.module.prepare_decoder_input_ids_from_labels(labels)
-        # Set everything after each row's EOS token (the padding) to -100 so the
-        # loss ignores it. This assumes exactly one EOS token per label row.
-        end_token_index = torch.where(labels == tokenizer.eos_token_id)[1]
-        for idx, end_idx in enumerate(end_token_index):
-            labels[idx][end_idx+1:] = -100
-        batch_data['labels'] = labels
-    return batch_data
-
-train_data = LCSTS('/public/home/wanglch/project/umt5/data/lcsts_tsv/data1.tsv')
-valid_data = LCSTS('/public/home/wanglch/project/umt5/data/lcsts_tsv/data2.tsv')
-```
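The two data conventions used in the code above, the `!=!` separator between title and content and the `-100` masking of label positions after the EOS token, can be illustrated without a tokenizer or model. The following is a minimal sketch: the helper names `parse_line` and `mask_after_eos` are hypothetical and do not exist in the repository, and the token ids are toy values chosen only for illustration.

```python
# -100 is the default ignore_index of PyTorch's CrossEntropyLoss, so positions
# carrying this value do not contribute to the training loss.
IGNORE_INDEX = -100


def parse_line(line, sep="!=!"):
    """Split one dataset line of the form '<title>!=!<content>' (hypothetical helper)."""
    items = line.strip().split(sep)
    if len(items) != 2:
        raise ValueError(f"expected 2 fields separated by {sep!r}, got {len(items)}")
    return {"title": items[0], "content": items[1]}


def mask_after_eos(label_ids, eos_token_id):
    """Return a copy of label_ids with every position after the first EOS masked."""
    masked = list(label_ids)
    if eos_token_id in masked:
        end = masked.index(eos_token_id)
        for i in range(end + 1, len(masked)):
            masked[i] = IGNORE_INDEX
    return masked


if __name__ == "__main__":
    print(parse_line("short headline!=!a longer news body\n"))
    # Toy ids: 1 plays the role of EOS and 0 the role of padding.
    print(mask_after_eos([259, 1042, 1, 0, 0], eos_token_id=1))
```

The EOS token itself is kept as a training target; only the padding that follows it is masked.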
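-**Data processing is already included in model_train.py and model_test.py, so the data processing code does not need to be run separately.**
-The project provides a mini [dataset](https://pan.baidu.com/s/10zbcluvILlL8J-KnX56Fgw?pwd=xszb) for trial training. The training data directory is structured as follows; prepare the full dataset for regular training with the same structure: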
-```
-dataset
-└── lcsts_tsv
-    ├── data1.tsv
-    ├── data2.tsv
-    └── data3.tsv
-```
-## Training
-### Single Node, Multiple Cards
-```
-python multi_dcu_train.py
-```
-## Inference
-Pretraining must be completed before running inference.
-### Single Node, Multiple Cards
-```
-python multi_dcu_test.py
-```
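-### Summarization Task
-The summarization task requires a trained model. Download the umt5-base model from hf-mirror or huggingface, train it with **multi_dcu_train.py**, save the trained weights, and then load those weights to generate summaries. The same steps apply to reading comprehension and translation tasks.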
-```
-python umt5_summary.py
-```
-## Results
-### Chinese Text Summarization
-<div align="center">
-    <img align="center" src=docs/result.png>
-</div>
-### Accuracy
-Test data: [LCSTS](http://icrc.hitsz.edu.cn/Article/show/139.html); accelerators used: V100S/K100.
-Test results:
-| Device | ROUGE-1 | ROUGE-2 | ROUGE-L |
-| :------: | :------: | :------: | :------: |
-| V100S | 26.12 | 14.81 | 23.62 |
-| K100 | 26.94 | 15.38 | 24.24 |
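As a reminder of what the ROUGE-1 and ROUGE-2 columns measure, the sketch below computes n-gram overlap between a generated summary and a reference. It is an illustration only: the README does not say which ROUGE implementation, tokenization, or score variant (precision, recall, or F1) produced the numbers above, and the function name `rouge_n` and the character-level tokenization are assumptions.

```python
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) of n-gram overlap between two token lists."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Character-level tokens, a common (assumed) choice for Chinese text.
    generated = list("abcd")
    reference = list("abce")
    print(rouge_n(generated, reference, n=1))
    print(rouge_n(generated, reference, n=2))
```

ROUGE-L, the third column, is based on the longest common subsequence instead of fixed-length n-grams and is not covered by this sketch.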
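-## Application Scenarios
-### Algorithm Category
-`Text summarization`
-### Key Application Industries
-Finance, education, government, scientific research, manufacturing, energy, broadcast media
-## Pretrained Weights
- [hf-mirror pretrained model download](https://hf-mirror.com/google/umt5-base/tree/main)
- [hf-mirror umt5 pretrained model collection](https://hf-mirror.com/collections/google/mt5-release-65005f1a520f8d7b4d039509)
-## Source Repository and Issue Feedback
- http://developer.hpccube.com/codes/modelzoo/umt5.git
-## References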
- [google-research/multilingual-t5](https://github.com/google-research/multilingual-t5)
- [UniMax: Fairer and more Effective Language Sampling for Large-Scale Multilingual Pretraining](https://arxiv.org/abs/2304.09151)