When asking a question, please provide as much of the following information as possible:
### Basic information
- Your **operating system**:
- Your **Python** version:
- Your **PyTorch** version:
- Your **bert4torch** version:
- The **pretrained model** you loaded:
### Core code
```python
# Paste your core code here
```
### Output
```shell
# Paste your debug output here
```
### Self-debugging
Describe here what you have already tried yourself
__pycache__
datasets/
*.pt
*.onnx
*.csv
*.json
*.log
bert4torch_test.ipynb
summary/
.idea
.vscode/launch.json
.pypirc
bert4torch.egg-info/
build/
dist/
.DS_Store
bert4torch_test.py
MIT License
Copyright (c) 2022 Bo仔很忙
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# bert4torch
**A concise training framework that re-implements bert4keras in pytorch**
[![licence](https://img.shields.io/github/license/Tongjilibo/bert4torch.svg?maxAge=3600)](https://github.com/Tongjilibo/bert4torch/blob/master/LICENSE)
[![GitHub release](https://img.shields.io/github/release/Tongjilibo/bert4torch.svg?maxAge=3600)](https://github.com/Tongjilibo/bert4torch/releases)
[![PyPI](https://img.shields.io/pypi/v/bert4torch?label=pypi%20package)](https://pypi.org/project/bert4torch/)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/bert4torch)](https://pypistats.org/packages/bert4torch)
[![GitHub stars](https://img.shields.io/github/stars/Tongjilibo/bert4torch?style=social)](https://github.com/Tongjilibo/bert4torch)
[![GitHub Issues](https://img.shields.io/github/issues/Tongjilibo/bert4torch.svg)](https://github.com/Tongjilibo/bert4torch/issues)
[![contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat)](https://github.com/Tongjilibo/bert4torch/issues)
## 1. Installation
Install the stable release
```shell
pip install bert4torch
```
Install the latest version
```shell
pip install git+https://www.github.com/Tongjilibo/bert4torch.git
```
- **Notes**: PyPI releases lag behind the development version on git; after `git clone`, **mind the import path**, and check whether the weights need conversion
- **Running the examples**: `git clone https://github.com/Tongjilibo/bert4torch`, then edit the pretrained-model and data paths in the scripts under `examples` to launch them (a minimal usage sketch follows this list)
- **Training on your own data**: adapt the corresponding data-processing blocks to your data
- **Development environment**: developed against `torch==1.10`; if other versions turn out to be incompatible, feedback is welcome
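A minimal usage sketch for orientation (the paths are placeholders, and the `build_transformer_model` / `Tokenizer` entry points are assumed here to mirror bert4keras; see the quick-start tutorial in section 3 for the canonical API):
```python
import torch
from bert4torch.models import build_transformer_model
from bert4torch.tokenizers import Tokenizer

# Placeholder paths to a downloaded pretrained checkpoint
config_path = 'pretrained/config.json'
checkpoint_path = 'pretrained/pytorch_model.bin'
dict_path = 'pretrained/vocab.txt'

tokenizer = Tokenizer(dict_path, do_lower_case=True)
model = build_transformer_model(config_path, checkpoint_path)

token_ids, segment_ids = tokenizer.encode('语言模型')
hidden_states = model([torch.tensor([token_ids]), torch.tensor([segment_ids])])
```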
## 2. Features
- **Core**: load pretrained weights for bert, roberta, albert, xlnet, nezha, bart, RoFormer, RoFormer_V2, ELECTRA, GPT, GPT2, T5, GAU-alpha, ERNIE, etc. for finetuning, and flexibly define your own model on top of bert
- **Rich examples**: ready-made solutions for [pretrain](https://github.com/Tongjilibo/bert4torch/blob/master/examples/pretrain), [sentence_classfication](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_classfication), [sentence_embedding](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sequence_embedding), [sequence_labeling](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sequence_labeling), [relation_extraction](https://github.com/Tongjilibo/bert4torch/blob/master/examples/relation_extraction), [seq2seq](https://github.com/Tongjilibo/bert4torch/blob/master/examples/seq2seq), [serving](https://github.com/Tongjilibo/bert4torch/blob/master/examples/serving/) and more
- **Experimentally validated**: [verified](https://github.com/Tongjilibo/bert4torch/blob/master/examples/Performance.md) on public datasets, using the following [example datasets](https://github.com/Tongjilibo/bert4torch/blob/master/examples/README.md)
- **Plug-and-play tricks**: common [tricks](https://github.com/Tongjilibo/bert4torch/blob/master/examples/training_trick) integrated and ready to use
- **Other features**: [load models from the transformers library](https://github.com/Tongjilibo/bert4torch/blob/master/examples/others/task_load_transformers_model.py) and use them together; concise and efficient calling conventions; a live training progress bar; works with torchinfo to print parameter counts; default Logger and Tensorboard for easy training logs; a customizable fit loop for advanced needs
## 3. Quick start
- [Quick-start tutorial](https://github.com/Tongjilibo/bert4torch/blob/master/examples/tutorials/Tutorials.md), [tutorial examples](https://github.com/Tongjilibo/bert4torch/blob/master/examples/tutorials), [hands-on examples](https://github.com/Tongjilibo/bert4torch/blob/master/examples)
- [Introducing bert4torch (Zhihu)](https://zhuanlan.zhihu.com/p/486329434), [Getting started with bert4torch (Zhihu)](https://zhuanlan.zhihu.com/p/508890807), [bert4torch updated yet again (Zhihu)](https://zhuanlan.zhihu.com/p/560885427?)
- Background: a pytorch re-implementation of Su Jianlin's [bert4keras](https://github.com/bojone/bert4keras); the initial version drew on [bert4pytorch](https://github.com/MuQiuJun-AI/bert4pytorch)
## 4. Release notes
- **v0.2.1**: compatibility with torch<=1.7.1, whose torch.div lacks rounding_mode; added custom metrics; support for resuming training from checkpoints; added default Logger and Tensorboard logging
- **v0.2.0**: compatibility with torch<1.9.0, which lacks take_along_dim; fixed the 514-position-embedding issue in bart; fixed Sptokenizer not converting symbols; print a timestamp when each Epoch starts; added parallel_apply
- **v0.1.9**: added the mixup/manifold_mixup/temporal_ensembling strategies; fixed empty param.grad in the pgd strategy; changed the tokenizer to support batches
- **v0.1.8**: fixed the loss spike during CRF training; fixed the high GPU-memory usage of xlnet's token_type_ids input
- **v0.1.7**: added EarlyStop; CRF now converts to bool internally
- **v0.1.6**: added the transformer_xl, xlnet and t5_pegasus models; prompt and pretraining examples; support for extra embedding inputs; the EMA strategy; fixed tokenizer and sinusoid bugs
- **v0.1.5**: added GAU-alpha, mixed precision, gradient clipping, single-machine multi-GPU (DP, DDP)
- **v0.1.4**: added VAT; fixed a problem with apply_embedding's return value on linux
- **v0.1.3**: initial release
## 5. Updates
- **Sep 20, 2022**: added a TensorRT example; support for multiple schedules at once (e.g. ema + warmup); sanic + onnx deployment
- **Sep 10, 2022**: added default Logger and Tensorboard logging; ONNX inference; added the ERNIE model; fixed t5's norm_mode issue; allow hidden_size not divisible by num_attention_heads
- **Aug 28, 2022**: added an nl2sql example; custom metrics; support for resuming training from checkpoints
- **Aug 21, 2022**: added W2NER and DiffCSE examples; print a timestamp when each Epoch starts; added parallel_apply; compatibility with torch<=1.7.1, whose torch.div lacks rounding_mode
- **Aug 14, 2022**: added metrics for supervised sentence embeddings, relation extraction and text generation; compatibility with torch<1.9.0, which lacks take_along_dim; fixed the 514-position-embedding issue in bart; fixed Sptokenizer not converting symbols
- **Jul 27, 2022**: added the mixup/manifold_mixup/temporal_ensembling strategies; fixed empty param.grad in the pgd strategy; changed the tokenizer to support batches; added a uie example
- **Jul 16, 2022**: fixed the loss spike during CRF training; fixed the high GPU-memory usage of xlnet's token_type_ids input
- **Jul 10, 2022**: added a Chinese financial FAQ example and the Tianchi news-classification top-1 solution; added EarlyStop; CRF now converts to bool internally
- **Jun 29, 2022**: added NER experiments testing the effect of different CRF initializations; a Chinese bert-whitening experiment
- **Jun 13, 2022**: added seq2seq + prefix trie; Chinese experiments with SimCSE/ESimCSE/PromptBert and other unsupervised semantic-similarity methods
- **Jun 05, 2022**: added PromptBert, PET and P-tuning examples; fixed the tokenizer mis-splitting special_tokens; added t5_pegasus
- **May 29, 2022**: transformer_xl and xlnet models; fixed the bug where the sinusoid position embedding was overwritten by init_weight; EMA; a Sohu sentiment-classification example
- **May 17, 2022**: added pretraining code; support for extra embedding inputs (e.g. part-of-speech or word-granularity embeddings)
- **May 01, 2022**: added mixed precision, gradient clipping, single-machine multi-GPU training (DP, DDP)
- **Apr 25, 2022**: added VAT and GAU-alpha examples; gradient accumulation; a custom fit() example
- **Apr 15, 2022**: added ner_mrc, ner_span, roformer_v2 and roformer-sim examples
- **Apr 05, 2022**: added GPLinker, TPlinker and SimBERT examples
- **Mar 29, 2022**: added CoSENT, R-Drop and UDA examples
- **Mar 22, 2022**: added the GPT, GPT2 and T5 models
- **Mar 12, 2022**: initial commit
## 6. Pretrained weights
- Some weights require a modified [config.json](https://github.com/Tongjilibo/bert4torch/blob/master/examples/convert_script/PLM_config.md) to load
| Model | Source | Links | Notes (if any) |
| ---- | ---- | ---- | ---- |
| bert | Google's original bert (i.e. bert-base-chinese) | [tf](https://github.com/google-research/bert), [pytorch](https://huggingface.co/bert-base-chinese) | [tf-to-pytorch command](https://huggingface.co/docs/transformers/converting_tensorflow_models), [conversion script](https://github.com/Tongjilibo/bert4torch/blob/master/examples/convert_script/convert_bert-base-chinese.py) |
| bert | HFL chinese-bert-wwm-ext | [tf/pytorch](https://github.com/ymcui/Chinese-BERT-wwm), [pytorch](https://huggingface.co/hfl/chinese-bert-wwm-ext) | |
| roberta | HFL chinese-roberta-wwm-ext | [tf/pytorch](https://github.com/ymcui/Chinese-BERT-wwm), [pytorch](https://huggingface.co/hfl/chinese-roberta-wwm-ext) | |
| xlnet | HFL xlnet | [tf/pytorch](https://github.com/ymcui/Chinese-XLNet) | [config](https://github.com/Tongjilibo/bert4torch/blob/master/examples/convert_script/PLM_config.md) |
| electra | HFL electra | [tf](https://github.com/ymcui/Chinese-ELECTRA), [pytorch](https://huggingface.co/hfl/chinese-electra-base-discriminator) | |
| macbert | HFL macbert | [tf](https://github.com/ymcui/MacBERT), [pytorch](https://huggingface.co/hfl/chinese-macbert-base) | |
| albert | brightmart | [tf](https://github.com/brightmart/albert_zh), [pytorch](https://github.com/lonePatient/albert_pytorch) | |
| ernie | Baidu ERNIE | [paddle](https://github.com/PaddlePaddle/ERNIE), [pytorch](https://huggingface.co/nghuyong) | |
| roformer | Zhuiyi Technology | [tf](https://github.com/ZhuiyiTechnology/roformer), [pytorch](https://huggingface.co/junnyu/roformer_chinese_base) | |
| roformer_v2 | Zhuiyi Technology | [tf](https://github.com/ZhuiyiTechnology/roformer-v2), [pytorch](https://huggingface.co/junnyu/roformer_v2_chinese_char_base) | |
| simbert | Zhuiyi Technology | [tf](https://github.com/ZhuiyiTechnology/simbert), [pytorch](https://huggingface.co/peterchou/simbert-chinese-base/tree/main) | |
| roformer-sim | Zhuiyi Technology | [tf](https://github.com/ZhuiyiTechnology/roformer-sim), [pytorch](https://huggingface.co/junnyu/roformer_chinese_sim_char_base) | |
| gau-alpha | Zhuiyi Technology | [tf](https://github.com/ZhuiyiTechnology/GAU-alpha) | [conversion script](https://github.com/Tongjilibo/bert4torch/blob/master/examples/convert_script/convert_GAU_alpha.py) |
| nezha | Huawei | [tf](https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/NEZHA-TensorFlow), [pytorch](https://github.com/lonePatient/NeZha_Chinese_PyTorch) | |
| gpt | CDial-GPT | [pytorch](https://github.com/thu-coai/CDial-GPT) | [conversion script](https://github.com/Tongjilibo/bert4torch/blob/master/examples/convert_script/convert_gpt__CDial-GPT-LCCC.py) |
| gpt2 | Tsinghua 2.6B cmp_lm | [pytorch](https://github.com/TsinghuaAI/CPM-1-Generate) | [conversion script](https://github.com/Tongjilibo/bert4torch/blob/master/examples/convert_script/convert_gpt2__cmp_lm_2.6b.py) |
| gpt2 | Chinese GPT2_ML model | [tf](https://github.com/imcaspar/gpt2-ml), [pytorch](https://github.com/ghosthamlet/gpt2-ml-torch) | [conversion script](https://github.com/Tongjilibo/bert4torch/blob/master/examples/convert_script/convert_gpt2__gpt2-ml.py) |
| t5 | UER | [pytorch](https://huggingface.co/uer/t5-base-chinese-cluecorpussmall) | [config](https://github.com/Tongjilibo/bert4torch/blob/master/examples/convert_script/PLM_config.md) |
| mt5 | Google | [pytorch](https://huggingface.co/google/mt5-base) | [config](https://github.com/Tongjilibo/bert4torch/blob/master/examples/convert_script/PLM_config.md) |
| t5_pegasus | Zhuiyi Technology | [tf](https://github.com/ZhuiyiTechnology/t5-pegasus) | [conversion script](https://github.com/Tongjilibo/bert4torch/blob/master/examples/convert_script/convert_t5_pegasus.py) |
| bart | Fudan | [pytorch](https://github.com/fastnlp/CPT) | [conversion script](https://github.com/Tongjilibo/bert4torch/blob/master/examples/convert_script/convert_bart_fudanNLP.py) |
#! -*- coding: utf-8 -*-
__version__ = '0.2.2'
# Activations ported from transformers; the original bert4keras has no such module
import math
import torch
from packaging import version
from torch import nn
def _gelu_python(x):
"""
Original Implementation of the GELU activation function in Google BERT repo when initially created. For
information: OpenAI GPT's GELU is slightly different (and gives slightly different results): 0.5 * x * (1 +
torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3)))) This is now written in C in nn.functional
Also see the Gaussian Error Linear Units paper: https://arxiv.org/abs/1606.08415
"""
return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
def gelu_new(x):
"""
Implementation of the GELU activation function currently in Google BERT repo (identical to OpenAI GPT). Also see
the Gaussian Error Linear Units paper: https://arxiv.org/abs/1606.08415
"""
return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))
if version.parse(torch.__version__) < version.parse("1.4"):
gelu = _gelu_python
else:
gelu = nn.functional.gelu
def gelu_fast(x):
return 0.5 * x * (1.0 + torch.tanh(x * 0.7978845608 * (1.0 + 0.044715 * x * x)))
def quick_gelu(x):
return x * torch.sigmoid(1.702 * x)
def _silu_python(x):
"""
See Gaussian Error Linear Units (Hendrycks et al., https://arxiv.org/abs/1606.08415) where the SiLU (Sigmoid Linear
Unit) was originally introduced and coined, and see Sigmoid-Weighted Linear Units for Neural Network Function
Approximation in Reinforcement Learning (Elfwing et al., https://arxiv.org/abs/1702.03118) and Swish: a Self-Gated
Activation Function (Ramachandran et al., https://arxiv.org/abs/1710.05941v1) where the SiLU was experimented with
later.
"""
return x * torch.sigmoid(x)
if version.parse(torch.__version__) < version.parse("1.7"):
silu = _silu_python
else:
silu = nn.functional.silu
def _mish_python(x):
"""
See Mish: A Self-Regularized Non-Monotonic Activation Function (Misra., https://arxiv.org/abs/1908.08681). Also
visit the official repository for the paper: https://github.com/digantamisra98/Mish
"""
return x * torch.tanh(nn.functional.softplus(x))
if version.parse(torch.__version__) < version.parse("1.9"):
mish = _mish_python
else:
mish = nn.functional.mish
def linear_act(x):
return x
ACT2FN = {
"relu": nn.functional.relu,
"silu": silu,
"swish": silu,
"gelu": gelu,
"tanh": torch.tanh,
"gelu_new": gelu_new,
"gelu_fast": gelu_fast,
"quick_gelu": quick_gelu,
"mish": mish,
"linear": linear_act,
"sigmoid": torch.sigmoid,
"softmax": nn.Softmax(dim=-1)
}
def get_activation(activation_string):
if activation_string in ACT2FN:
return ACT2FN[activation_string]
else:
raise KeyError(f"function {activation_string} not found in ACT2FN mapping {list(ACT2FN.keys())}")
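# Usage sketch (illustrative, not part of the original file): look up an activation
# by name via get_activation and apply it to a tensor.
def _activation_usage_sketch():
    act = get_activation('gelu_new')
    x = torch.randn(2, 3)
    return act(x)  # same shape as the input: [2, 3]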
import torch
from torch.functional import Tensor
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import math
from bert4torch.snippets import get_sinusoid_encoding_table, take_along_dim
from bert4torch.activations import get_activation
from typing import List, Optional
import random
import warnings
class LayerNorm(nn.Module):
def __init__(self, hidden_size, eps=1e-12, conditional_size=False, weight=True, bias=True, norm_mode='normal', **kwargs):
"""layernorm 层,这里自行实现,目的是为了兼容 conditianal layernorm,使得可以做条件文本生成、条件分类等任务
条件layernorm来自于苏剑林的想法,详情:https://spaces.ac.cn/archives/7124
"""
super(LayerNorm, self).__init__()
# roformer_v2 compatibility: no weight
if weight:
self.weight = nn.Parameter(torch.ones(hidden_size))
# t5 compatibility: no bias term (t5 uses RMSnorm)
if bias:
self.bias = nn.Parameter(torch.zeros(hidden_size))
self.norm_mode = norm_mode
self.eps = eps
self.conditional_size = conditional_size
if conditional_size:
# conditional layernorm, used for conditional text generation;
# zero-initialized so that at the start it does not disturb the pretrained weights
self.dense1 = nn.Linear(conditional_size, hidden_size, bias=False)
self.dense1.weight.data.uniform_(0, 0)
self.dense2 = nn.Linear(conditional_size, hidden_size, bias=False)
self.dense2.weight.data.uniform_(0, 0)
def forward(self, x):
inputs = x[0]
if self.norm_mode == 'rmsnorm':
# t5 uses RMSnorm
variance = inputs.to(torch.float32).pow(2).mean(-1, keepdim=True)
o = inputs * torch.rsqrt(variance + self.eps)
else:
u = inputs.mean(-1, keepdim=True)
s = (inputs - u).pow(2).mean(-1, keepdim=True)
o = (inputs - u) / torch.sqrt(s + self.eps)
if not hasattr(self, 'weight'):
self.weight = 1
if not hasattr(self, 'bias'):
self.bias = 0
if self.conditional_size:
cond = x[1]
for _ in range(len(inputs.shape) - len(cond.shape)):
cond = cond.unsqueeze(dim=1)
return (self.weight + self.dense1(cond)) * o + (self.bias + self.dense2(cond))
else:
return self.weight * o + self.bias
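# Usage sketch (illustrative, not part of the original file): this LayerNorm always takes
# a tuple input so the conditional branch can consume an optional condition vector.
def _layernorm_usage_sketch():
    ln = LayerNorm(hidden_size=768)
    y = ln((torch.randn(2, 10, 768), None))  # plain layernorm, cond slot unused
    cln = LayerNorm(hidden_size=768, conditional_size=128)
    y_cond = cln((torch.randn(2, 10, 768), torch.randn(2, 128)))  # conditioned on a [btz, 128] vector
    return y.shape, y_cond.shape  # both [2, 10, 768]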
class MultiHeadAttentionLayer(nn.Module):
def __init__(self, hidden_size, num_attention_heads, attention_probs_dropout_prob, attention_scale=True,
return_attention_scores=False, bias=True, **kwargs):
super(MultiHeadAttentionLayer, self).__init__()
self.hidden_size = hidden_size
self.num_attention_heads = num_attention_heads
# assert hidden_size % num_attention_heads == 0 # old logic; does not hold for t5_pegasus_small
# t5_pegasus_small compatibility
if kwargs.get('attention_head_size'):
self.attention_head_size = kwargs.get('attention_head_size')
else:
self.attention_head_size = int(hidden_size / num_attention_heads)
self.inner_dim = self.num_attention_heads * self.attention_head_size # new logic
self.attention_scale = attention_scale
self.return_attention_scores = return_attention_scores
self.bias = bias
self.q = nn.Linear(hidden_size, self.inner_dim, bias=bias)
self.k = nn.Linear(hidden_size, self.inner_dim, bias=bias)
self.v = nn.Linear(hidden_size, self.inner_dim, bias=bias)
self.o = nn.Linear(self.inner_dim, hidden_size, bias=bias)
self.dropout = nn.Dropout(attention_probs_dropout_prob)
self.a_bias, self.p_bias = kwargs.get('a_bias'), kwargs.get('p_bias')
if self.p_bias == 'typical_relative': # nezha
self.relative_positions_encoding = RelativePositionsEncoding(qlen=kwargs.get('max_position'),
klen=kwargs.get('max_position'),
embedding_size=self.attention_head_size,
max_relative_position=kwargs.get('max_relative_position'))
elif self.p_bias == 'rotary': # roformer
self.relative_positions_encoding = RoPEPositionEncoding(max_position=kwargs.get('max_position'), embedding_size=self.attention_head_size)
elif self.p_bias == 't5_relative': # t5
self.relative_positions = RelativePositionsEncodingT5(qlen=kwargs.get('max_position'),
klen=kwargs.get('max_position'),
relative_attention_num_buckets=kwargs.get('relative_attention_num_buckets'),
is_decoder=kwargs.get('is_decoder'))
self.relative_positions_encoding = nn.Embedding(kwargs.get('relative_attention_num_buckets'), self.num_attention_heads)
def transpose_for_scores(self, x):
new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
x = x.view(*new_x_shape)
return x.permute(0, 2, 1, 3)
def forward(self, hidden_states, attention_mask=None, encoder_hidden_states=None, encoder_attention_mask=None):
# hidden_states shape: [batch_size, seq_q, hidden_size]
# attention_mask shape: [batch_size, 1, 1, seq_q] or [batch_size, 1, seq_q, seq_q]
# encoder_hidden_states shape: [batch_size, seq_k, hidden_size]
# encoder_attention_mask shape: [batch_size, 1, 1, seq_k]
mixed_query_layer = self.q(hidden_states)
if encoder_hidden_states is not None:
mixed_key_layer = self.k(encoder_hidden_states)
mixed_value_layer = self.v(encoder_hidden_states)
attention_mask = encoder_attention_mask
else:
mixed_key_layer = self.k(hidden_states)
mixed_value_layer = self.v(hidden_states)
# mixed_query_layer shape: [batch_size, query_len, hidden_size]
# mixed_key_layer shape: [batch_size, key_len, hidden_size]
# mixed_value_layer shape: [batch_size, value_len, hidden_size]
query_layer = self.transpose_for_scores(mixed_query_layer)
key_layer = self.transpose_for_scores(mixed_key_layer)
value_layer = self.transpose_for_scores(mixed_value_layer)
# query_layer shape: [batch_size, num_attention_heads, query_len, attention_head_size]
# key_layer shape: [batch_size, num_attention_heads, key_len, attention_head_size]
# value_layer shape: [batch_size, num_attention_heads, value_len, attention_head_size]
if self.p_bias == 'rotary':
query_layer = self.relative_positions_encoding(query_layer)
key_layer = self.relative_positions_encoding(key_layer)
# swap the last two dims of k, then take the dot product of q and k to get the attention scores
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
# attention_scores shape: [batch_size, num_attention_heads, query_len, key_len]
if (self.p_bias == 'typical_relative') and hasattr(self, 'relative_positions_encoding'):
relations_keys = self.relative_positions_encoding(attention_scores.shape[-1], attention_scores.shape[-1]) # [to_seq_len, to_seq_len, d_hid]
# old implementation, kept to help readers follow the dimension transforms
# query_layer_t = query_layer.permute(2, 0, 1, 3)
# query_layer_r = query_layer_t.contiguous().view(from_seq_length, batch_size * num_attention_heads, self.attention_head_size)
# key_position_scores = torch.matmul(query_layer_r, relations_keys.permute(0, 2, 1))
# key_position_scores_r = key_position_scores.view(from_seq_length, batch_size, num_attention_heads, from_seq_length)
# key_position_scores_r_t = key_position_scores_r.permute(1, 2, 0, 3)
# new implementation
key_position_scores_r_t = torch.einsum('bnih,ijh->bnij', query_layer, relations_keys)
attention_scores = attention_scores + key_position_scores_r_t
elif (self.p_bias == 't5_relative') and hasattr(self, 'relative_positions_encoding'):
relations_keys = self.relative_positions(attention_scores.shape[-1], attention_scores.shape[-1])
key_position_scores_r_t = self.relative_positions_encoding(relations_keys).permute([2, 0, 1]).unsqueeze(0)
attention_scores = attention_scores + key_position_scores_r_t
# whether to scale the attention scores
if self.attention_scale:
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
# apply the attention mask: positions where the mask is 0 receive a value of -1e10,
# so after softmax their attention_probs are nearly 0 and nothing attends to the mask-0 positions
if attention_mask is not None:
# attention_scores = attention_scores.masked_fill(attention_mask == 0, -1e10)
attention_mask = (1.0 - attention_mask) * -10000.0 # hence the mask passed in has 1 for non-padding and 0 for padding
attention_scores = attention_scores + attention_mask
# normalize the attention scores to 0-1
attention_probs = F.softmax(attention_scores, dim=-1)
attention_probs = self.dropout(attention_probs)
context_layer = torch.matmul(attention_probs, value_layer) # [batch_size, num_attention_heads, query_len, attention_head_size]
if (self.p_bias == 'typical_relative') and hasattr(self, 'relative_positions_encoding'):
relations_values = self.relative_positions_encoding(attention_scores.shape[-1], attention_scores.shape[-1])
# old implementation, kept to help readers follow the dimension transforms
# attention_probs_t = attention_probs.permute(2, 0, 1, 3)
# attentions_probs_r = attention_probs_t.contiguous().view(from_seq_length, batch_size * num_attention_heads, to_seq_length)
# value_position_scores = torch.matmul(attentions_probs_r, relations_values)
# value_position_scores_r = value_position_scores.view(from_seq_length, batch_size, num_attention_heads, self.attention_head_size)
# value_position_scores_r_t = value_position_scores_r.permute(1, 2, 0, 3)
# new implementation
value_position_scores_r_t = torch.einsum('bnij,ijh->bnih', attention_probs, relations_values)
context_layer = context_layer + value_position_scores_r_t
# context_layer shape: [batch_size, query_len, num_attention_heads, attention_head_size]
# after dimension transforms such as transpose/permute, the tensor is no longer stored contiguously in memory, while view requires contiguity,
# so contiguous is called before view to obtain a contiguous copy;
context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
new_context_layer_shape = context_layer.size()[:-2] + (self.inner_dim,)
context_layer = context_layer.view(*new_context_layer_shape)
# whether to return the attention scores
if self.return_attention_scores:
# the attention_scores returned here have not gone through softmax and can be normalized externally
return self.o(context_layer), attention_scores
else:
return self.o(context_layer)
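# Usage sketch (illustrative, not part of the original file): self-attention over a
# [btz, seq_len, hidden] input with a broadcastable [btz, 1, 1, seq_len] mask (1 = keep, 0 = padding).
def _mha_usage_sketch():
    mha = MultiHeadAttentionLayer(hidden_size=768, num_attention_heads=12,
                                  attention_probs_dropout_prob=0.1)
    hidden = torch.randn(2, 10, 768)
    mask = torch.ones(2, 1, 1, 10)
    return mha(hidden, mask).shape  # [2, 10, 768]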
class PositionWiseFeedForward(nn.Module):
def __init__(self, hidden_size, intermediate_size, dropout_rate=0.5, hidden_act='gelu', is_dropout=False, bias=True, **kwargs):
# The original tf version of bert has no dropout after the activation, but google-ai's bert-pytorch project adds one;
# pytorch's official TransformerEncoderLayer also has one, as in: self.linear2(self.dropout(self.activation(self.linear1(src))));
# why the implementations differ is unclear, though the difference is probably small either way;
# is_dropout and dropout_rate control this: for the original transformer use the defaults; for bert set is_dropout=False, in which case dropout_rate is unused.
super(PositionWiseFeedForward, self).__init__()
self.is_dropout = is_dropout
self.intermediate_act_fn = get_activation(hidden_act)
self.intermediateDense = nn.Linear(hidden_size, intermediate_size, bias=bias)
self.outputDense = nn.Linear(intermediate_size, hidden_size, bias=bias)
if self.is_dropout:
self.dropout = nn.Dropout(dropout_rate)
def forward(self, x):
# x shape: (batch size, seq len, hidden_size)
if self.is_dropout:
x = self.dropout(self.intermediate_act_fn(self.intermediateDense(x)))
else:
x = self.intermediate_act_fn(self.intermediateDense(x))
# x shape: (batch size, seq len, intermediate_size)
x = self.outputDense(x)
# x shape: (batch size, seq len, hidden_size)
return x
class GatedAttentionUnit(nn.Module):
'''Gated attention unit
paper: https://arxiv.org/abs/2202.10447
introduction: https://kexue.fm/archives/8934
note: the additive relative position encoding is not included
reference pytorch project: https://github.com/lucidrains/FLASH-pytorch
'''
def __init__(self, hidden_size, attention_key_size, intermediate_size, attention_probs_dropout_prob, hidden_act,
is_dropout=False, attention_scale=True, bias=True, normalization='softmax_plus', **kwargs):
super().__init__()
self.intermediate_size = intermediate_size
self.attention_head_size = attention_key_size
self.attention_scale = attention_scale
self.is_dropout = is_dropout
self.normalization = normalization
self.hidden_fn = get_activation(hidden_act)
self.dropout = nn.Dropout(attention_probs_dropout_prob)
self.i_dense = nn.Linear(hidden_size, self.intermediate_size*2+attention_key_size, bias=bias)
self.offsetscale = OffsetScale(attention_key_size, heads=2, bias=bias)  # OffsetScale is defined below
self.o_dense = nn.Linear(self.intermediate_size, hidden_size, bias=bias)
self.a_bias, self.p_bias = kwargs.get('a_bias'), kwargs.get('p_bias')
if self.p_bias == 'rotary': # RoPE
self.relative_positions_encoding = RoPEPositionEncoding(max_position=kwargs.get('max_position'), embedding_size=self.attention_head_size)
def forward(self, hidden_states, attention_mask):
# projection transform
hidden_states = self.hidden_fn(self.i_dense(hidden_states))
u, v, qk = hidden_states.split([self.intermediate_size, self.intermediate_size, self.attention_head_size], dim=-1)
q, k = self.offsetscale(qk) # affine transform
# apply RoPE
if self.p_bias == 'rotary':
q = self.relative_positions_encoding(q)
k = self.relative_positions_encoding(k)
# Attention
attention_scores = torch.einsum('b i d, b j d -> b i j', q, k) # [btz, seq_len, seq_len]
if self.attention_scale:
# seq_len = hidden_states.shape[1]
# attention_scores = F.relu(attention_scores/seq_len) ** 2
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
if attention_mask is not None:
attention_mask = (1.0 - attention_mask) * -1e12
attention_scores = attention_scores + attention_mask.squeeze(1)
# normalization
attention_scores = self.attention_normalize(attention_scores, -1, self.normalization)
if self.is_dropout:
attention_scores = self.dropout(attention_scores)
# compute the output
out = self.o_dense(u * torch.einsum('b i j, b j d -> b i d', attention_scores, v))
return out
def attention_normalize(self, a, dim=-1, method='softmax'):
"""不同的注意力归一化方案
softmax:常规/标准的指数归一化;
squared_relu:来自 https://arxiv.org/abs/2202.10447 ;
softmax_plus:来自 https://kexue.fm/archives/8823 。
"""
if method == 'softmax':
return F.softmax(a, dim=dim)
else:
mask = (a > -1e11).float()
l = torch.maximum(torch.sum(mask, dim=dim, keepdims=True), torch.tensor(1).to(mask))
if method == 'squared_relu':
return F.relu(a)**2 / l
elif method == 'softmax_plus':
return F.softmax(a * torch.log(l) / torch.log(torch.tensor(512)).to(mask), dim=dim)
return a
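# Usage sketch (illustrative, not part of the original file): run the GAU block over a
# [btz, seq_len, hidden] input; the hyper-parameters below are arbitrary example values.
def _gau_usage_sketch():
    gau = GatedAttentionUnit(hidden_size=768, attention_key_size=128, intermediate_size=1536,
                             attention_probs_dropout_prob=0.1, hidden_act='swish')
    hidden = torch.randn(2, 10, 768)
    mask = torch.ones(2, 1, 1, 10)
    return gau(hidden, mask).shape  # [2, 10, 768]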
class OffsetScale(nn.Module):
'''Affine transformation
'''
def __init__(self, head_size, heads=1, bias=True):
super().__init__()
self.gamma = nn.Parameter(torch.ones(heads, head_size))
self.bias = bias
if bias:
self.beta = nn.Parameter(torch.zeros(heads, head_size))
nn.init.normal_(self.gamma, std = 0.02)
def forward(self, x):
out = torch.einsum('... d, h d -> ... h d', x, self.gamma)
if self.bias:
out = out + self.beta
return out.unbind(dim = -2)
class BertEmbeddings(nn.Module):
"""
Embeddings layer:
constructs the word, position and token_type embeddings.
"""
def __init__(self, vocab_size, embedding_size, hidden_size, max_position, segment_vocab_size, shared_segment_embeddings, drop_rate, conditional_size=False, **kwargs):
super(BertEmbeddings, self).__init__()
self.shared_segment_embeddings = shared_segment_embeddings
self.word_embeddings = nn.Embedding(vocab_size, embedding_size, padding_idx=0)
# position encoding
if kwargs.get('p_bias') == 'sinusoid':
self.position_embeddings = SinusoidalPositionEncoding(max_position, embedding_size)
elif kwargs.get('p_bias') in {'rotary', 'typical_relative', 't5_relative', 'other_relative'}:
# with relative position encodings, no PositionEmbeddings is declared
pass
elif max_position > 0:
self.position_embeddings = nn.Embedding(max_position, embedding_size)
# segment encoding
if (segment_vocab_size > 0) and (not shared_segment_embeddings):
self.segment_embeddings = nn.Embedding(segment_vocab_size, embedding_size)
# emb_scale
self.emb_scale = kwargs.get('emb_scale', 1) # specific to transformer_xl and xlnet
# LayerNorm
self.layerNorm = LayerNorm(embedding_size, eps=1e-12, conditional_size=conditional_size, **kwargs)
self.dropout = nn.Dropout(drop_rate)
# if embedding_size != hidden_size, add a linear mapping (for albert's factorized embedding)
if embedding_size != hidden_size:
self.embedding_hidden_mapping_in = nn.Linear(embedding_size, hidden_size)
def forward(self, token_ids, segment_ids=None, conditional_emb=None, additional_embs=None):
if (not token_ids.requires_grad) and (token_ids.dtype in {torch.long, torch.int}):
words_embeddings = self.word_embeddings(token_ids)
else:
words_embeddings = token_ids # custom word_embedding, currently only used by VAT
if hasattr(self, 'segment_embeddings'):
segment_ids = torch.zeros_like(token_ids) if segment_ids is None else segment_ids
segment_embeddings = self.segment_embeddings(segment_ids)
embeddings = words_embeddings + segment_embeddings
elif self.shared_segment_embeddings: # segment shares weights with word_embedding
segment_ids = torch.zeros_like(token_ids) if segment_ids is None else segment_ids
segment_embeddings = self.word_embeddings(segment_ids)
embeddings = words_embeddings + segment_embeddings
else:
embeddings = words_embeddings
# extra embeddings, e.g. part-of-speech
if additional_embs is not None:
for emb in additional_embs:
embeddings += emb
if hasattr(self, 'position_embeddings'):
seq_length = token_ids.size(1)
position_ids = torch.arange(seq_length, dtype=torch.long, device=token_ids.device)
position_ids = position_ids.unsqueeze(0).repeat(token_ids.shape[0], 1)
position_embeddings = self.position_embeddings(position_ids)
embeddings += position_embeddings
if self.emb_scale != 1:
embeddings = embeddings * self.emb_scale # specific to transformer_xl and xlnet
if hasattr(self, 'layerNorm'):
embeddings = self.layerNorm((embeddings, conditional_emb))
embeddings = self.dropout(embeddings)
if hasattr(self, 'embedding_hidden_mapping_in'):
embeddings = self.embedding_hidden_mapping_in(embeddings)
return embeddings
class BertLayer(nn.Module):
"""
Transformer layer:
order: Attention --> Add --> LayerNorm --> Feed Forward --> Add --> LayerNorm
Notes: 1. dropout layers are omitted above, which does not mean there is none; each sublayer uses dropout slightly differently, so take care
2. the Feed Forward block in the original Transformer encoder has two linear layers;
config.intermediate_size is both the output size of the first linear and the input size of the second
"""
def __init__(self, hidden_size, num_attention_heads, dropout_rate, attention_probs_dropout_prob, intermediate_size, hidden_act,
is_dropout=False, conditional_size=False, **kwargs):
super(BertLayer, self).__init__()
self.multiHeadAttention = MultiHeadAttentionLayer(hidden_size, num_attention_heads, attention_probs_dropout_prob, **kwargs)
self.dropout1 = nn.Dropout(dropout_rate)
self.layerNorm1 = LayerNorm(hidden_size, eps=1e-12, conditional_size=conditional_size, **kwargs)
self.feedForward = PositionWiseFeedForward(hidden_size, intermediate_size, dropout_rate, hidden_act, is_dropout=is_dropout, **kwargs)
self.dropout2 = nn.Dropout(dropout_rate)
self.layerNorm2 = LayerNorm(hidden_size, eps=1e-12, conditional_size=conditional_size, **kwargs)
self.is_decoder = kwargs.get('is_decoder')
if self.is_decoder:
self.crossAttention = MultiHeadAttentionLayer(hidden_size, num_attention_heads, attention_probs_dropout_prob, **kwargs)
self.dropout3 = nn.Dropout(dropout_rate)
self.layerNorm3 = LayerNorm(hidden_size, eps=1e-12, conditional_size=conditional_size, **kwargs)
def forward(self, hidden_states, attention_mask, conditional_emb=None, encoder_hidden_states=None, encoder_attention_mask=None):
self_attn_output = self.multiHeadAttention(hidden_states, attention_mask) # when self.is_decoder is true, attention_mask here is triangular
hidden_states = hidden_states + self.dropout1(self_attn_output)
hidden_states = self.layerNorm1((hidden_states, conditional_emb))
# cross attention
if self.is_decoder and encoder_hidden_states is not None:
cross_attn_output = self.crossAttention(hidden_states, None, encoder_hidden_states, encoder_attention_mask)
hidden_states = hidden_states + self.dropout3(cross_attn_output)
hidden_states = self.layerNorm3((hidden_states, conditional_emb))
self_attn_output2 = self.feedForward(hidden_states)
hidden_states = hidden_states + self.dropout2(self_attn_output2)
hidden_states = self.layerNorm2((hidden_states, conditional_emb))
return hidden_states
class T5Layer(BertLayer):
"""T5的Encoder的主体是基于Self-Attention的模块
顺序:LN --> Att --> Add --> LN --> FFN --> Add
"""
def __init__(self, *args, version='t5.1.0', **kwargs):
super().__init__(*args, **kwargs)
# for the t5.1.1 structure, the FFN layer has to change
if version.endswith('t5.1.1'):
kwargs['dropout_rate'] = args[2]
kwargs['hidden_act'] = args[5]
self.feedForward = T5PositionWiseFeedForward(hidden_size=args[0], intermediate_size=args[4], **kwargs)  # defined below
# the decoder has a crossAttention block in between
if self.is_decoder and hasattr(self.crossAttention, 'relative_positions_encoding'):
del self.crossAttention.relative_positions_encoding
del self.crossAttention.relative_positions
def forward(self, hidden_states, attention_mask, conditional_emb=None, encoder_hidden_states=None, encoder_attention_mask=None):
# bert applies layernorm after attn/ffn, Openai-gpt2 before
x = self.layerNorm1((hidden_states, conditional_emb))
self_attn_output = self.multiHeadAttention(x, attention_mask)
hidden_states = hidden_states + self.dropout1(self_attn_output)
# cross attention
if self.is_decoder and encoder_hidden_states is not None:
x = self.layerNorm3((hidden_states, conditional_emb))
cross_attn_output = self.crossAttention(x, None, encoder_hidden_states, encoder_attention_mask)
hidden_states = hidden_states + self.dropout3(cross_attn_output)
x = self.layerNorm2((hidden_states, conditional_emb))
ffn_output = self.feedForward(x)
hidden_states = hidden_states + self.dropout2(ffn_output)
return hidden_states
class T5PositionWiseFeedForward(PositionWiseFeedForward):
'''See the transformers package: https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py
'''
def __init__(self, hidden_size, intermediate_size, **kwargs):
super().__init__(hidden_size, intermediate_size, **kwargs)
self.intermediateDense = nn.Linear(hidden_size, intermediate_size, bias=False)
self.intermediateDense1 = nn.Linear(hidden_size, intermediate_size, bias=False)
self.outputDense = nn.Linear(intermediate_size, hidden_size, bias=False)
def forward(self, x):
# x shape: (batch size, seq len, hidden_size)
x_gelu = self.intermediate_act_fn(self.intermediateDense(x))
x_linear = self.intermediateDense1(x)
x = x_gelu * x_linear
if self.is_dropout:
x = self.dropout(x)
# x shape: (batch size, seq len, intermediate_size)
x = self.outputDense(x)
# x shape: (batch size, seq len, hidden_size)
return x
class XlnetLayer(BertLayer):
'''Transformer_XL layer
order: Attention --> Add --> LayerNorm --> Feed Forward --> Add --> LayerNorm
'''
def __init__(self, hidden_size, num_attention_heads, dropout_rate, attention_probs_dropout_prob, intermediate_size, hidden_act, **kwargs):
super().__init__(hidden_size, num_attention_heads, dropout_rate, attention_probs_dropout_prob, intermediate_size, hidden_act, **kwargs)
self.pre_lnorm = kwargs.get('pre_lnorm')
# the multi-head attention layer has no bias
self.multiHeadAttention = RelPartialLearnableMultiHeadAttn(hidden_size, num_attention_heads, attention_probs_dropout_prob, bias=False, **kwargs)  # defined below
def forward(self, hidden_states, segment_ids, pos_emb, attention_mask, mems_i, conditional_emb=None):
# concatenate mems and the query; mems_i: [btz, m_len, hdsz], w: [btz, q_len, hdsz] -> cat: [btz, k_len, hdsz]
hidden_states_cat = torch.cat([mems_i, hidden_states], 1) if mems_i is not None else hidden_states
# Attn
if self.pre_lnorm:
hidden_states_cat = self.layerNorm1((hidden_states_cat, conditional_emb))
self_attn_output = self.multiHeadAttention(hidden_states, hidden_states_cat, pos_emb, attention_mask, segment_ids)
hidden_states = hidden_states + self.dropout1(self_attn_output)
if not self.pre_lnorm: # post_lnorm
hidden_states = self.layerNorm1((hidden_states, conditional_emb))
# FFN
x = self.layerNorm2((hidden_states, conditional_emb)) if self.pre_lnorm else hidden_states
self_attn_output2 = self.feedForward(x)
hidden_states = hidden_states + self.dropout2(self_attn_output2)
if not self.pre_lnorm: # post_lnorm
hidden_states = self.layerNorm2((hidden_states, conditional_emb))
return hidden_states
class RelPartialLearnableMultiHeadAttn(MultiHeadAttentionLayer):
'''Transformer_XL-style relative position encoding, rewritten here into MultiHeadAttentionLayer's batch_first code format
'''
def __init__(self, *args, r_w_bias=None, r_r_bias=None, r_s_bias=None, **kwargs):
super().__init__(*args, **kwargs)
segment_vocab_size = kwargs.get('segment_vocab_size')
if r_r_bias is None or r_w_bias is None: # Biases are not shared
self.r_r_bias = nn.Parameter(torch.FloatTensor(self.num_attention_heads, self.attention_head_size)) # global content bias
self.r_w_bias = nn.Parameter(torch.FloatTensor(self.num_attention_heads, self.attention_head_size)) # global position bias
if segment_vocab_size > 0:
self.r_s_bias = nn.Parameter(torch.FloatTensor(self.num_attention_heads, self.attention_head_size)) # global segment bias
else: # all layers share one set
self.r_r_bias = r_r_bias
self.r_w_bias = r_w_bias
self.r_s_bias = r_s_bias
if segment_vocab_size > 0:
# self.seg_embed = nn.Embedding(segment_vocab_size, self.hidden_size)
self.seg_embed = nn.Parameter(torch.FloatTensor(segment_vocab_size, self.num_attention_heads, self.attention_head_size))
self.r = nn.Linear(self.hidden_size, self.hidden_size, bias=self.bias)
self.rel_shift_opt = kwargs.get('rel_shift_opt')
@staticmethod
def rel_shift(x, zero_triu=False):
'''used by transformer_xl: shift left so the upper-right triangle is 0 and each diagonal holds a single value; x: [btz, n_head, q_len, k_len]
'''
q_len, k_len = x.size(2), x.size(-1)
zero_pad = torch.zeros((*x.size()[:2], q_len, 1), device=x.device, dtype=x.dtype)
x_padded = torch.cat([zero_pad, x], dim=-1)
x_padded = x_padded.view(*x.size()[:2], k_len + 1, q_len)
x = x_padded[:,:,1:,:].view_as(x)
if zero_triu:
ones = torch.ones((q_len, k_len), device=x.device)
x = x * torch.tril(ones, k_len - q_len)[None,None,:,:]
return x
@staticmethod
def rel_shift_bnij(x, klen=-1):
''' used by xlnet
'''
x_size = x.shape
x = x.reshape(x_size[0], x_size[1], x_size[3], x_size[2])
x = x[:, :, 1:, :]
x = x.reshape(x_size[0], x_size[1], x_size[2], x_size[3] - 1)
x = torch.index_select(x, 3, torch.arange(klen, device=x.device, dtype=torch.long))
# x = x[:, :, :, :klen]
return x
def forward(self, w, cat, r, attention_mask=None, seg_mat=None):
# w: token embeddings [btz, q_len, hdsz]; cat: w concatenated with mem_i [btz, k_len, hdsz]; r: relative position embeddings [r_len, hdsz]
qlen, rlen, bsz = w.size(1), r.size(0), w.size(0)
mixed_query_layer = self.q(cat)[:, -qlen:, :] # take only the query part, not the mem part
mixed_key_layer = self.k(cat)
mixed_value_layer = self.v(cat)
w_head_q = self.transpose_for_scores(mixed_query_layer) # [btz, n_head, q_len, d_head]
w_head_k = self.transpose_for_scores(mixed_key_layer) # [btz, n_head, k_len, d_head]
w_head_v = self.transpose_for_scores(mixed_value_layer) # [btz, n_head, k_len, d_head]
r_head_k = self.r(r) # [r_len, n_head*d_head]
r_head_k = r_head_k.view(rlen, self.num_attention_heads, self.attention_head_size) # rlen x n_head x d_head
#### compute attention score
rw_head_q = w_head_q + self.r_w_bias.unsqueeze(1) # [btz, n_head, q_len, d_head]
AC = torch.einsum('bnid,bnjd->bnij', (rw_head_q, w_head_k)) # [btz, n_head, q_len, k_len]
rr_head_q = w_head_q + self.r_r_bias.unsqueeze(1) # [btz, n_head, q_len, d_head]
BD = torch.einsum('bnid,jnd->bnij', (rr_head_q, r_head_k)) # [btz, n_head, q_len, k_len]
BD = self.rel_shift_bnij(BD, klen=AC.shape[3]) if self.rel_shift_opt == 'xlnet' else self.rel_shift(BD)
if hasattr(self, 'seg_embed') and (self.r_r_bias is not None):
# # the previous approach needed an Embedding plus load_variable/variable_mapping, and easily blew up GPU memory
# w_head_s = self.seg_embed(seg_mat) # [btz, q_len, klen, hdsz]
# w_head_s = w_head_s.reshape(*w_head_s.shape[:3], self.num_attention_heads, self.attention_head_size)
# rs_head_q = w_head_q + self.r_s_bias.unsqueeze(1)
# EF = torch.einsum('bnid,bijnd->bnij', (rs_head_q, w_head_s)) # [btz, n_head, q_len, k_len]
seg_mat = F.one_hot(seg_mat, 2).float()
EF = torch.einsum("bnid,snd->ibns", w_head_q + self.r_s_bias.unsqueeze(1), self.seg_embed)
EF = torch.einsum("bijs,ibns->bnij", seg_mat, EF)
else:
EF = 0
# # [btz, n_head, q_len, k_len]
attention_scores = AC + BD + EF
if self.attention_scale:
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
#### compute attention probability
if attention_mask is not None and attention_mask.any().item():
# attention_mask = (1.0 - attention_mask) * -10000.0
# attention_scores = attention_scores + attention_mask # changed here: the original -10000 is not close enough to -inf
attention_mask = (1.0 - attention_mask)
attention_scores = attention_scores.float().masked_fill(attention_mask.bool(), -1e30).type_as(attention_mask)
# [btz, n_head, q_len, k_len]
attention_probs = F.softmax(attention_scores, dim=-1)
attention_probs = self.dropout(attention_probs)
context_layer = torch.matmul(attention_probs, w_head_v) # [batch_size, num_attention_heads, query_len, attention_head_size]
context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
new_context_layer_shape = context_layer.size()[:-2] + (self.hidden_size,)
context_layer = context_layer.view(*new_context_layer_shape)
# whether to return the attention scores
if self.return_attention_scores:
# the attention_scores returned here have not gone through softmax and can be normalized externally
return self.o(context_layer), attention_scores
else:
return self.o(context_layer)
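# Numeric sketch (illustrative, not part of the original file): on a [1, 1, 3, 3] score
# matrix, rel_shift shifts row i left by (k_len - 1 - i); the wrapped-in entries are
# meaningless and get masked downstream.
def _rel_shift_sketch():
    x = torch.arange(9.0).view(1, 1, 3, 3)
    # [[0,1,2],[3,4,5],[6,7,8]] -> [[2,0,3],[4,5,0],[6,7,8]]
    return RelPartialLearnableMultiHeadAttn.rel_shift(x)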
class AdaptiveEmbedding(nn.Module):
'''Transformer_XL's adaptive embedding: different dimensionalities for different frequency bands,
e.g. high-frequency words can use 1024 or 512 dims and low-frequency words 256 or 64, with a Linear layer projecting all to a common size
'''
def __init__(self, vocab_size, embedding_size, hidden_size, cutoffs, div_val=1, sample_softmax=False, **kwargs):
super().__init__()
self.vocab_size = vocab_size
self.embedding_size = embedding_size
self.cutoffs = cutoffs + [vocab_size]
self.div_val = div_val
self.hidden_size = hidden_size
self.emb_scale = hidden_size ** 0.5
self.cutoff_ends = [0] + self.cutoffs
self.emb_layers = nn.ModuleList()
self.emb_projs = nn.ParameterList()
if div_val == 1:
self.emb_layers.append(nn.Embedding(vocab_size, embedding_size, sparse=sample_softmax > 0))
if hidden_size != embedding_size:
self.emb_projs.append(nn.Parameter(torch.FloatTensor(hidden_size, embedding_size)))
else:
for i in range(len(self.cutoffs)):
l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]
d_emb_i = embedding_size // (div_val ** i)
self.emb_layers.append(nn.Embedding(r_idx - l_idx, d_emb_i))
self.emb_projs.append(nn.Parameter(torch.FloatTensor(hidden_size, d_emb_i)))
def forward(self, token_ids):
if self.div_val == 1: # only a single embedding
embed = self.emb_layers[0](token_ids) # [btz, seq_len, embedding_size]
if self.hidden_size != self.embedding_size:
embed = nn.functional.linear(embed, self.emb_projs[0])
else:
param = next(self.parameters())
inp_flat = token_ids.view(-1)
emb_flat = torch.zeros([inp_flat.size(0), self.hidden_size], dtype=param.dtype, device=param.device)
for i in range(len(self.cutoffs)):
l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]
mask_i = (inp_flat >= l_idx) & (inp_flat < r_idx)
indices_i = mask_i.nonzero().squeeze()
if indices_i.numel() == 0:
continue
inp_i = inp_flat.index_select(0, indices_i) - l_idx
emb_i = self.emb_layers[i](inp_i)
emb_i = nn.functional.linear(emb_i, self.emb_projs[i])
emb_flat.index_copy_(0, indices_i, emb_i)
embed_shape = token_ids.size() + (self.hidden_size,)
embed = emb_flat.view(embed_shape)
embed.mul_(self.emb_scale)
return embed
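# Usage sketch (illustrative, not part of the original file): ids below the cutoff (0..999)
# use full-size embeddings; the rest use embedding_size // div_val dims, all projected to
# hidden_size. The projection weights are left uninitialized (as in __init__), so this only
# demonstrates shapes.
def _adaptive_embedding_sketch():
    emb = AdaptiveEmbedding(vocab_size=10000, embedding_size=512, hidden_size=512,
                            cutoffs=[1000], div_val=2)
    token_ids = torch.tensor([[1, 2, 999, 5000], [7, 1500, 8000, 3]])
    return emb(token_ids).shape  # [2, 4, 512]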
class Identity(nn.Module):
def __init__(self, *args, **kwargs):
super(Identity, self).__init__()
def forward(self, *args):
return args[0]
class XlnetPositionsEncoding(nn.Module):
'''Relative position encoding used by xlnet and transformer_xl
differs from SinusoidalPositionEncoding in that one interleaves sin/cos while the other places them back to back
'''
def __init__(self, embedding_size):
super().__init__()
self.demb = embedding_size
inv_freq = 1 / (10000 ** (torch.arange(0.0, embedding_size, 2.0) / embedding_size))
self.register_buffer("inv_freq", inv_freq)
def forward(self, pos_seq):
sinusoid_inp = torch.ger(pos_seq, self.inv_freq)
pos_emb = torch.cat([sinusoid_inp.sin(), sinusoid_inp.cos()], dim=-1)
return pos_emb
class RelativePositionsEncoding(nn.Module):
"""nezha用的google相对位置编码
来自论文:https://arxiv.org/abs/1803.02155
"""
def __init__(self, qlen, klen, embedding_size, max_relative_position=127):
super(RelativePositionsEncoding, self).__init__()
# build the relative position matrix
vocab_size = max_relative_position * 2 + 1
distance_mat = torch.arange(klen)[None, :] - torch.arange(qlen)[:, None] # column index minus row index, [query_len, key_len]
distance_mat_clipped = torch.clamp(distance_mat, -max_relative_position, max_relative_position)
final_mat = distance_mat_clipped + max_relative_position
# position table built with sinusoid_encoding
embeddings_table = get_sinusoid_encoding_table(vocab_size, embedding_size)
# implementation 1
# flat_relative_positions_matrix = final_mat.view(-1)
# one_hot_relative_positions_matrix = torch.nn.functional.one_hot(flat_relative_positions_matrix, num_classes=vocab_size).float()
# position_embeddings = torch.matmul(one_hot_relative_positions_matrix, embeddings_table)
# my_shape = list(final_mat.size())
# my_shape.append(embedding_size)
# position_embeddings = position_embeddings.view(my_shape)
# implementation 2
# position_embeddings = take_along_dim(embeddings_table, final_mat.flatten().unsqueeze(1), dim=0)
# position_embeddings = position_embeddings.reshape(*final_mat.shape, embeddings_table.shape[-1]) # [seq_len, seq_len, hdsz]
# self.register_buffer('position_embeddings', position_embeddings)
# implementation 3
position_embeddings = nn.Embedding.from_pretrained(embeddings_table, freeze=True)(final_mat)
self.register_buffer('position_embeddings', position_embeddings)
def forward(self, qlen, klen):
return self.position_embeddings[:qlen, :klen, :]
class RelativePositionsEncodingT5(nn.Module):
"""Google T5的相对位置编码
来自论文:https://arxiv.org/abs/1910.10683
"""
def __init__(self, qlen, klen, relative_attention_num_buckets, is_decoder=False):
super(RelativePositionsEncodingT5, self).__init__()
# build the relative position matrix
context_position = torch.arange(qlen, dtype=torch.long)[:, None]
memory_position = torch.arange(klen, dtype=torch.long)[None, :]
relative_position = memory_position - context_position # shape (qlen, klen)
relative_position = self._relative_position_bucket(
relative_position, # shape (qlen, klen)
bidirectional=not is_decoder,
num_buckets=relative_attention_num_buckets,
)
self.register_buffer('relative_position', relative_position)
def forward(self, qlen, klen):
return self.relative_position[:qlen, :klen]
@staticmethod
def _relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):
'''taken directly from the transformers implementation
'''
ret = 0
n = -relative_position
if bidirectional:
num_buckets //= 2
ret += (n < 0).to(torch.long) * num_buckets # mtf.to_int32(mtf.less(n, 0)) * num_buckets
n = torch.abs(n)
else:
n = torch.max(n, torch.zeros_like(n))
# now n is in the range [0, inf)
# half of the buckets are for exact increments in positions
max_exact = num_buckets // 2
is_small = n < max_exact
# The other half of the buckets are for logarithmically bigger bins in positions up to max_distance
val_if_large = max_exact + (
torch.log(n.float() / max_exact) / math.log(max_distance / max_exact) * (num_buckets - max_exact)
).to(torch.long)
val_if_large = torch.min(val_if_large, torch.full_like(val_if_large, num_buckets - 1))
ret += torch.where(is_small, n, val_if_large)
return ret
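# Numeric sketch (illustrative, not part of the original file): small relative distances
# map to individual buckets, larger ones to logarithmically sized bins.
def _t5_bucket_sketch():
    rel = torch.arange(-4, 5)[None, :]  # relative positions -4 .. 4
    return RelativePositionsEncodingT5._relative_position_bucket(rel, bidirectional=True)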
class SinusoidalPositionEncoding(nn.Module):
"""定义Sin-Cos位置Embedding
"""
def __init__(self, max_position, embedding_size):
super(SinusoidalPositionEncoding, self).__init__()
self.position_embeddings = nn.Embedding.from_pretrained(get_sinusoid_encoding_table(max_position, embedding_size), freeze=True)
def forward(self, position_ids):
return self.position_embeddings(position_ids)
class RoPEPositionEncoding(nn.Module):
"""旋转式位置编码: https://kexue.fm/archives/8265
"""
def __init__(self, max_position, embedding_size):
super(RoPEPositionEncoding, self).__init__()
position_embeddings = get_sinusoid_encoding_table(max_position, embedding_size) # [seq_len, hdsz]
cos_position = position_embeddings[:, 1::2].repeat_interleave(2, dim=-1)
sin_position = position_embeddings[:, ::2].repeat_interleave(2, dim=-1)
# register_buffer so that a top-level model.to(device) covers these; no need to set the device internally
self.register_buffer('cos_position', cos_position)
self.register_buffer('sin_position', sin_position)
def forward(self, qw, seq_dim=-2):
# the last two dims default to [seq_len, hdsz]
seq_len = qw.shape[seq_dim]
qw2 = torch.stack([-qw[..., 1::2], qw[..., ::2]], dim=-1).reshape_as(qw)
return qw * self.cos_position[:seq_len] + qw2 * self.sin_position[:seq_len]
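# Usage sketch (illustrative, not part of the original file): apply rotary encoding to a
# [btz, n_head, seq_len, head_size] query; the shape is unchanged.
def _rope_usage_sketch():
    rope = RoPEPositionEncoding(max_position=512, embedding_size=64)
    q = torch.randn(2, 8, 10, 64)
    return rope(q).shape  # [2, 8, 10, 64]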
class CRF(nn.Module):
'''Conditional random field: https://github.com/lonePatient/BERT-NER-Pytorch/blob/master/models/layers/crf.py
'''
def __init__(self, num_tags: int, init_transitions: Optional[List[np.ndarray]] = None, freeze=False) -> None:
if num_tags <= 0:
raise ValueError(f'invalid number of tags: {num_tags}')
super().__init__()
self.num_tags = num_tags
if (init_transitions is None) and (not freeze):
self.start_transitions = nn.Parameter(torch.empty(num_tags))
self.end_transitions = nn.Parameter(torch.empty(num_tags))
self.transitions = nn.Parameter(torch.empty(num_tags, num_tags))
nn.init.uniform_(self.start_transitions, -0.1, 0.1)
nn.init.uniform_(self.end_transitions, -0.1, 0.1)
nn.init.uniform_(self.transitions, -0.1, 0.1)
elif init_transitions is not None:
transitions = torch.tensor(init_transitions[0], dtype=torch.float)
start_transitions = torch.tensor(init_transitions[1], dtype=torch.float)
end_transitions = torch.tensor(init_transitions[2], dtype=torch.float)
if not freeze:
self.transitions = nn.Parameter(transitions)
self.start_transitions = nn.Parameter(start_transitions)
self.end_transitions = nn.Parameter(end_transitions)
else:
self.register_buffer('transitions', transitions)
self.register_buffer('start_transitions', start_transitions)
self.register_buffer('end_transitions', end_transitions)
def __repr__(self) -> str:
return f'{self.__class__.__name__}(num_tags={self.num_tags})'
def forward(self, emissions: torch.Tensor, mask: torch.ByteTensor,
tags: torch.LongTensor, reduction: str = 'mean') -> torch.Tensor:
"""Compute the conditional log likelihood of a sequence of tags given emission scores.
emissions: [btz, seq_len, num_tags]
mask: [btz, seq_len]
tags: [btz, seq_len]
"""
if reduction not in ('none', 'sum', 'mean', 'token_mean'):
raise ValueError(f'invalid reduction: {reduction}')
if mask.dtype != torch.uint8:
mask = mask.byte()
self._validate(emissions, tags=tags, mask=mask)
# shape: (batch_size,)
numerator = self._compute_score(emissions, tags, mask)
# shape: (batch_size,)
denominator = self._compute_normalizer(emissions, mask)
# shape: (batch_size,)
llh = denominator - numerator
if reduction == 'none':
return llh
if reduction == 'sum':
return llh.sum()
if reduction == 'mean':
return llh.mean()
return llh.sum() / mask.float().sum()
def decode(self, emissions: torch.Tensor, mask: Optional[torch.ByteTensor] = None,
nbest: Optional[int] = None, pad_tag: Optional[int] = None) -> List[List[List[int]]]:
"""Find the most likely tag sequence using Viterbi algorithm.
"""
if nbest is None:
nbest = 1
if mask is None:
mask = torch.ones(emissions.shape[:2], dtype=torch.uint8, device=emissions.device)
if mask.dtype != torch.uint8:
mask = mask.byte()
self._validate(emissions, mask=mask)
best_path = self._viterbi_decode_nbest(emissions, mask, nbest, pad_tag)
return best_path[0] if nbest == 1 else best_path
def _validate(self, emissions: torch.Tensor, tags: Optional[torch.LongTensor] = None,
mask: Optional[torch.ByteTensor] = None) -> None:
if emissions.dim() != 3:
raise ValueError(f'emissions must have dimension of 3, got {emissions.dim()}')
if emissions.size(2) != self.num_tags:
raise ValueError(f'expected last dimension of emissions is {self.num_tags}, '
f'got {emissions.size(2)}')
if tags is not None:
if emissions.shape[:2] != tags.shape:
raise ValueError('the first two dimensions of emissions and tags must match, '
f'got {tuple(emissions.shape[:2])} and {tuple(tags.shape)}')
if mask is not None:
if emissions.shape[:2] != mask.shape:
raise ValueError('the first two dimensions of emissions and mask must match, '
f'got {tuple(emissions.shape[:2])} and {tuple(mask.shape)}')
no_empty_seq_bf = mask[:, 0].all()
if not no_empty_seq_bf:
raise ValueError('mask of the first timestep must all be on')
def _compute_score(self, emissions: torch.Tensor, tags: torch.LongTensor, mask: torch.ByteTensor) -> torch.Tensor:
# emissions: (batch_size, seq_length, num_tags)
# tags: (batch_size, seq_length)
# mask: (batch_size, seq_length)
batch_size, seq_length = tags.shape
mask = mask.float()
# Start transition score and first emission
# shape: (batch_size,)
score = self.start_transitions[tags[:, 0]]
score += emissions[torch.arange(batch_size), 0, tags[:, 0]]
for i in range(1, seq_length):
# Transition score to next tag, only added if next timestep is valid (mask == 1)
# shape: (batch_size,)
score += self.transitions[tags[:, i - 1], tags[:, i]] * mask[:, i]
# Emission score for next tag, only added if next timestep is valid (mask == 1)
# shape: (batch_size,)
score += emissions[torch.arange(batch_size), i, tags[:, i]] * mask[:, i]
# End transition score
# shape: (batch_size,)
seq_ends = mask.long().sum(dim=1) - 1
# shape: (batch_size,)
last_tags = tags[torch.arange(batch_size), seq_ends]
# shape: (batch_size,)
score += self.end_transitions[last_tags]
return score
def _compute_normalizer(self, emissions: torch.Tensor, mask: torch.ByteTensor) -> torch.Tensor:
# emissions: (batch_size, seq_length, num_tags)
# mask: (batch_size, seq_length)
seq_length = emissions.size(1)
# Start transition score and first emission; score has size of
# (batch_size, num_tags) where for each batch, the j-th column stores
# the score that the first timestep has tag j
# shape: (batch_size, num_tags)
score = self.start_transitions + emissions[:, 0]
for i in range(1, seq_length):
# Broadcast score for every possible next tag
# shape: (batch_size, num_tags, 1)
broadcast_score = score.unsqueeze(2)
# Broadcast emission score for every possible current tag
# shape: (batch_size, 1, num_tags)
broadcast_emissions = emissions[:, i].unsqueeze(1)
# Compute the score tensor of size (batch_size, num_tags, num_tags) where
# for each sample, entry at row i and column j stores the sum of scores of all
# possible tag sequences so far that end with transitioning from tag i to tag j
# and emitting
# shape: (batch_size, num_tags, num_tags)
next_score = broadcast_score + self.transitions + broadcast_emissions
# Sum over all possible current tags, but we're in score space, so a sum
# becomes a log-sum-exp: for each sample, entry i stores the sum of scores of
# all possible tag sequences so far, that end in tag i
# shape: (batch_size, num_tags)
next_score = torch.logsumexp(next_score, dim=1)
# Set score to the next score if this timestep is valid (mask == 1)
# shape: (batch_size, num_tags)
score = torch.where(mask[:, i].unsqueeze(1).bool(), next_score, score)
# End transition score
# shape: (batch_size, num_tags)
score += self.end_transitions
# Sum (log-sum-exp) over all possible tags
# shape: (batch_size,)
return torch.logsumexp(score, dim=1)
def _viterbi_decode_nbest(self, emissions: torch.FloatTensor, mask: torch.ByteTensor,
nbest: int, pad_tag: Optional[int] = None) -> List[List[List[int]]]:
# emissions: (batch_size, seq_length, num_tags)
# mask: (batch_size, seq_length)
# return: (nbest, batch_size, seq_length)
if pad_tag is None:
pad_tag = 0
device = emissions.device
batch_size, seq_length = mask.shape
# Start transition and first emission
# shape: (batch_size, num_tags)
score = self.start_transitions + emissions[:, 0]
history_idx = torch.zeros((batch_size, seq_length, self.num_tags, nbest), dtype=torch.long, device=device)
oor_idx = torch.zeros((batch_size, self.num_tags, nbest), dtype=torch.long, device=device)
oor_tag = torch.full((batch_size, seq_length, nbest), pad_tag, dtype=torch.long, device=device)
# - score is a tensor of size (batch_size, num_tags) where for every batch,
# value at column j stores the score of the best tag sequence so far that ends
# with tag j
# - history_idx saves where the best tags candidate transitioned from; this is used
# when we trace back the best tag sequence
# - oor_idx saves the best tags candidate transitioned from at the positions
# where mask is 0, i.e. out of range (oor)
# Viterbi algorithm recursive case: we compute the score of the best tag sequence
# for every possible next tag
for i in range(1, seq_length):
if i == 1:
broadcast_score = score.unsqueeze(-1)
broadcast_emission = emissions[:, i].unsqueeze(1)
# shape: (batch_size, num_tags, num_tags)
next_score = broadcast_score + self.transitions + broadcast_emission
else:
broadcast_score = score.unsqueeze(-1)
broadcast_emission = emissions[:, i].unsqueeze(1).unsqueeze(2)
# shape: (batch_size, num_tags, nbest, num_tags)
next_score = broadcast_score + self.transitions.unsqueeze(1) + broadcast_emission
# Find the top `nbest` maximum score over all possible current tag
# shape: (batch_size, nbest, num_tags)
next_score, indices = next_score.view(batch_size, -1, self.num_tags).topk(nbest, dim=1)
if i == 1:
score = score.unsqueeze(-1).expand(-1, -1, nbest)
indices = indices * nbest
# convert to shape: (batch_size, num_tags, nbest)
next_score = next_score.transpose(2, 1)
indices = indices.transpose(2, 1)
# Set score to the next score if this timestep is valid (mask == 1)
# and save the index that produces the next score
# shape: (batch_size, num_tags, nbest)
score = torch.where(mask[:, i].unsqueeze(-1).unsqueeze(-1).bool(), next_score, score)
indices = torch.where(mask[:, i].unsqueeze(-1).unsqueeze(-1).bool(), indices, oor_idx)
history_idx[:, i - 1] = indices
# End transition score shape: (batch_size, num_tags, nbest)
end_score = score + self.end_transitions.unsqueeze(-1)
_, end_tag = end_score.view(batch_size, -1).topk(nbest, dim=1)
# shape: (batch_size,)
seq_ends = mask.long().sum(dim=1) - 1
# insert the best tag at each sequence end (last position with mask == 1)
history_idx.scatter_(1, seq_ends.view(-1, 1, 1, 1).expand(-1, 1, self.num_tags, nbest),
end_tag.view(-1, 1, 1, nbest).expand(-1, 1, self.num_tags, nbest))
# The most probable path for each sequence
best_tags_arr = torch.zeros((batch_size, seq_length, nbest), dtype=torch.long, device=device)
best_tags = torch.arange(nbest, dtype=torch.long, device=device).view(1, -1).expand(batch_size, -1)
for idx in range(seq_length - 1, -1, -1):
best_tags = torch.gather(history_idx[:, idx].view(batch_size, -1), 1, best_tags)
best_tags_arr[:, idx] = torch.div(best_tags.data.view(batch_size, -1), nbest, rounding_mode='floor')
return torch.where(mask.unsqueeze(-1).bool(), best_tags_arr, oor_tag).permute(2, 0, 1)
class BERT_WHITENING():
def __init__(self):
self.kernel = None
self.bias = None
    def compute_kernel_bias(self, sentence_vec):
        '''Torch implementation of BERT-whitening
        '''
        vecs = torch.cat(sentence_vec, dim=0)
        self.bias = -vecs.mean(dim=0, keepdim=True)
        cov = torch.cov(vecs.T)  # covariance matrix
        u, s, vh = torch.linalg.svd(cov)
        W = torch.matmul(u, torch.diag(s**0.5))
        self.kernel = torch.linalg.inv(W.T)
    def save_whiten(self, path):
        whiten = {'kernel': self.kernel, 'bias': self.bias}
        torch.save(whiten, path)
def load_whiten(self, path):
whiten = torch.load(path)
self.kernel = whiten['kernel']
self.bias = whiten['bias']
    def transform_and_normalize(self, vecs):
        """Apply the whitening transform, then L2-normalize
        """
        if not (self.kernel is None or self.bias is None):
            vecs = (vecs + self.bias).mm(self.kernel)
        return vecs / (vecs**2).sum(dim=1, keepdim=True)**0.5
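# Usage sketch (illustrative only; the tensor sizes below are made-up assumptions):
# fit the whitening transform on batches of sentence embeddings, then whiten new vectors.
# >>> bw = BERT_WHITENING()
# >>> sentence_vecs = [torch.randn(8, 768) for _ in range(10)]  # list of [btz, hdsz] batches
# >>> bw.compute_kernel_bias(sentence_vecs)
# >>> whitened = bw.transform_and_normalize(torch.randn(4, 768))  # -> [4, 768], L2-normalized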
class GlobalPointer(nn.Module):
"""全局指针模块
将序列的每个(start, end)作为整体来进行判断
参考:https://kexue.fm/archives/8373
"""
    def __init__(self, hidden_size, heads, head_size, RoPE=True, max_len=512, use_bias=True, tril_mask=True):
        super().__init__()
        self.heads = heads
        self.head_size = head_size
        self.RoPE = RoPE
        self.tril_mask = tril_mask
        self.dense = nn.Linear(hidden_size, heads * head_size * 2, bias=use_bias)
        if self.RoPE:
            self.position_embedding = RoPEPositionEncoding(max_len, head_size)
    def forward(self, inputs, mask=None):
        ''' inputs: [..., hdsz]
            mask: [btz, seq_len], 0 for padding positions
        '''
        sequence_output = self.dense(inputs)  # [..., heads*head_size*2]
        sequence_output = torch.stack(torch.chunk(sequence_output, self.heads, dim=-1), dim=-2)  # [..., heads, head_size*2]
        qw, kw = sequence_output[..., :self.head_size], sequence_output[..., self.head_size:]  # [..., heads, head_size]
        # RoPE position encoding
        if self.RoPE:
            qw = self.position_embedding(qw)
            kw = self.position_embedding(kw)
        # inner product between every (start, end) pair
        logits = torch.einsum('bmhd,bnhd->bhmn', qw, kw)  # [btz, heads, seq_len, seq_len]
        # mask out padding
        if mask is not None:
            attention_mask1 = 1 - mask.unsqueeze(1).unsqueeze(3)  # [btz, 1, seq_len, 1]
            attention_mask2 = 1 - mask.unsqueeze(1).unsqueeze(2)  # [btz, 1, 1, seq_len]
            logits = logits.masked_fill(attention_mask1.bool(), value=-float('inf'))
            logits = logits.masked_fill(attention_mask2.bool(), value=-float('inf'))
        # mask out the lower triangle (spans with start > end)
        if self.tril_mask:
            logits = logits - torch.tril(torch.ones_like(logits), -1) * 1e12
        # scale and return
        return logits / self.head_size**0.5
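# Usage sketch (illustrative only; the sizes are made-up assumptions):
# >>> gp = GlobalPointer(hidden_size=768, heads=2, head_size=64)
# >>> logits = gp(torch.randn(2, 16, 768), mask=torch.ones(2, 16))
# >>> logits.shape  # torch.Size([2, 2, 16, 16]): one score per (head, start, end)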
class EfficientGlobalPointer(nn.Module):
"""更加参数高效的GlobalPointer
参考:https://kexue.fm/archives/8877
"""
    def __init__(self, hidden_size, heads, head_size, RoPE=True, max_len=512, use_bias=True, tril_mask=True):
        super().__init__()
        self.heads = heads
        self.head_size = head_size
        self.RoPE = RoPE
        self.tril_mask = tril_mask
        self.p_dense = nn.Linear(hidden_size, head_size * 2, bias=use_bias)
        self.q_dense = nn.Linear(head_size * 2, heads * 2, bias=use_bias)
        if self.RoPE:
            self.position_embedding = RoPEPositionEncoding(max_len, head_size)
    def forward(self, inputs, mask=None):
        ''' inputs: [..., hdsz]
            mask: [btz, seq_len], 0 for padding positions
        '''
        sequence_output = self.p_dense(inputs)  # [..., head_size*2]
        qw, kw = sequence_output[..., :self.head_size], sequence_output[..., self.head_size:]  # [..., head_size]
        # RoPE position encoding
        if self.RoPE:
            qw = self.position_embedding(qw)
            kw = self.position_embedding(kw)
        # inner product: score for whether the span is an entity
        logits = torch.einsum('bmd,bnd->bmn', qw, kw) / self.head_size**0.5  # [btz, seq_len, seq_len]
        bias_input = self.q_dense(sequence_output)  # [..., heads*2]
        bias = torch.stack(torch.chunk(bias_input, self.heads, dim=-1), dim=-2).transpose(1, 2)  # [btz, heads, seq_len, 2]
        logits = logits.unsqueeze(1) + bias[..., :1] + bias[..., 1:].transpose(2, 3)  # [btz, heads, seq_len, seq_len]
        # mask out padding
        if mask is not None:
            attention_mask1 = 1 - mask.unsqueeze(1).unsqueeze(3)  # [btz, 1, seq_len, 1]
            attention_mask2 = 1 - mask.unsqueeze(1).unsqueeze(2)  # [btz, 1, 1, seq_len]
            logits = logits.masked_fill(attention_mask1.bool(), value=-float('inf'))
            logits = logits.masked_fill(attention_mask2.bool(), value=-float('inf'))
        # mask out the lower triangle (spans with start > end)
        if self.tril_mask:
            logits = logits - torch.tril(torch.ones_like(logits), -1) * 1e12
        return logits
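# Same interface as GlobalPointer, but the per-head parameters shrink from
# heads*head_size*2 down to heads*2 plus one shared head_size*2 projection.
# Usage sketch (sizes are made-up assumptions):
# >>> egp = EfficientGlobalPointer(hidden_size=768, heads=2, head_size=64)
# >>> egp(torch.randn(2, 16, 768), mask=torch.ones(2, 16)).shape  # torch.Size([2, 2, 16, 16])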
class TplinkerHandshakingKernel(nn.Module):
    '''Implementation of TPLinker's HandshakingKernel
'''
def __init__(self, hidden_size, shaking_type, inner_enc_type=''):
super().__init__()
self.shaking_type = shaking_type
if shaking_type == "cat":
self.combine_fc = nn.Linear(hidden_size * 2, hidden_size)
elif shaking_type == "cat_plus":
self.combine_fc = nn.Linear(hidden_size * 3, hidden_size)
elif shaking_type == "cln":
self.tp_cln = LayerNorm(hidden_size, conditional_size=hidden_size)
elif shaking_type == "cln_plus":
self.tp_cln = LayerNorm(hidden_size, conditional_size=hidden_size)
self.inner_context_cln = LayerNorm(hidden_size, conditional_size=hidden_size)
self.inner_enc_type = inner_enc_type
if inner_enc_type == "mix_pooling":
self.lamtha = nn.Parameter(torch.rand(hidden_size))
elif inner_enc_type == "lstm":
self.inner_context_lstm = nn.LSTM(hidden_size, hidden_size, num_layers=1, bidirectional=False, batch_first=True)
        # a self-written torch.gather-based implementation to avoid the loop; only the 'cat' mode is implemented so far
        # tag_ids = [(i, j) for i in range(maxlen) for j in range(maxlen) if j >= i]
        # gather_idx = torch.tensor(tag_ids, dtype=torch.long).flatten()[None, :, None]
        # self.register_buffer('gather_idx', gather_idx)
    def enc_inner_hiddens(self, seq_hiddens, inner_enc_type="lstm"):
        # seq_hiddens: (batch_size, seq_len, hidden_size)
        def pool(sequence, pooling_type):
            if pooling_type == "mean_pooling":
                pooling = torch.mean(sequence, dim=-2)
            elif pooling_type == "max_pooling":
                pooling, _ = torch.max(sequence, dim=-2)
            elif pooling_type == "mix_pooling":
                pooling = self.lamtha * torch.mean(sequence, dim=-2) + (1 - self.lamtha) * torch.max(sequence, dim=-2)[0]
            return pooling
if "pooling" in inner_enc_type:
inner_context = torch.stack([pool(seq_hiddens[:, :i+1, :], inner_enc_type) for i in range(seq_hiddens.size()[1])], dim = 1)
elif inner_enc_type == "lstm":
inner_context, _ = self.inner_context_lstm(seq_hiddens)
return inner_context
    def forward(self, seq_hiddens):
        '''
        seq_hiddens: (batch_size, seq_len, hidden_size)
        return:
            shaking_hiddens: (batch_size, (1 + seq_len) * seq_len / 2, hidden_size), e.g. (32, 5+4+3+2+1, 5)
        '''
seq_len = seq_hiddens.size()[-2]
shaking_hiddens_list = []
for ind in range(seq_len):
hidden_each_step = seq_hiddens[:, ind, :]
visible_hiddens = seq_hiddens[:, ind:, :] # ind: only look back
repeat_hiddens = hidden_each_step[:, None, :].repeat(1, seq_len - ind, 1)
if self.shaking_type == "cat":
shaking_hiddens = torch.cat([repeat_hiddens, visible_hiddens], dim = -1)
shaking_hiddens = torch.tanh(self.combine_fc(shaking_hiddens))
elif self.shaking_type == "cat_plus":
inner_context = self.enc_inner_hiddens(visible_hiddens, self.inner_enc_type)
shaking_hiddens = torch.cat([repeat_hiddens, visible_hiddens, inner_context], dim = -1)
shaking_hiddens = torch.tanh(self.combine_fc(shaking_hiddens))
elif self.shaking_type == "cln":
shaking_hiddens = self.tp_cln([visible_hiddens, repeat_hiddens])
elif self.shaking_type == "cln_plus":
inner_context = self.enc_inner_hiddens(visible_hiddens, self.inner_enc_type)
shaking_hiddens = self.tp_cln([visible_hiddens, repeat_hiddens])
shaking_hiddens = self.inner_context_cln([shaking_hiddens, inner_context])
shaking_hiddens_list.append(shaking_hiddens)
long_shaking_hiddens = torch.cat(shaking_hiddens_list, dim = 1)
return long_shaking_hiddens
# def handshaking_kernel(self, last_hidden_state):
    #     '''Get the sequence ids corresponding to the pairs (0,0),(0,1),...,(99,99)
# '''
# btz, _, hdsz = last_hidden_state.shape
# gather_idx = self.gather_idx.repeat(btz, 1, hdsz)
# concat_hidden_states = torch.gather(last_hidden_state, dim=1, index=gather_idx) # [btz, pair_len*2, hdsz]
# concat_hidden_states = concat_hidden_states.reshape(btz, -1, 2, hdsz) # concat方式 [btz, pair_len, 2, hdsz]
# shaking_hiddens = torch.cat(torch.chunk(concat_hidden_states, chunks=2, dim=-2), dim=-1).squeeze(-2) # [btz, pair_len, hdsz*2]
# return shaking_hiddens
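# Usage sketch (illustrative; assumes shaking_type='cat'): the output flattens the
# upper-triangular (i <= j) token pairs, so seq_len=5 yields 5+4+3+2+1 = 15 pairs.
# >>> kernel = TplinkerHandshakingKernel(hidden_size=768, shaking_type='cat')
# >>> kernel(torch.randn(2, 5, 768)).shape  # torch.Size([2, 15, 768])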
class MixUp(nn.Module):
    '''Implementation of mixup
    method: 'embed' and 'encoder' apply mixup at the embedding / encoder level respectively;
            None means the mixing is handled downstream; 'hidden' applies mixup to a hidden layer
    '''
    def __init__(self, method='encoder', alpha=1.0, layer_mix=None):
        super().__init__()
        assert method in {'embed', 'encoder', 'hidden', None}
        self.method = method
        self.alpha = alpha
        self.perm_index = None
        self.lam = 0
        self.layer_mix = layer_mix  # index of the hidden layer to mix
def get_perm(self, inputs):
if isinstance(inputs, torch.Tensor):
return inputs[self.perm_index]
elif isinstance(inputs, (list, tuple)):
return [inp[self.perm_index] if isinstance(inp, torch.Tensor) else inp for inp in inputs]
def mix_up(self, output, output1):
if isinstance(output, torch.Tensor):
return self.lam * output + (1.0-self.lam) * output1
elif isinstance(output, (list, tuple)):
output_final = []
for i in range(len(output)):
if output[i] is None: # conditional_emb=None
output_final.append(output[i])
                elif (not output[i].requires_grad) and (output[i].dtype in {torch.long, torch.int}):
                    # integer tensors (e.g. token ids), not embedding-like: take the element-wise max instead of interpolating
                    output_final.append(torch.max(output[i], output1[i]))
else:
output_final.append(self.lam * output[i] + (1.0-self.lam) * output1[i])
return output_final
else:
raise ValueError('Illegal model output')
def encode(self, model, inputs):
batch_size = inputs[0].shape[0]
device = inputs[0].device
self.lam = np.random.beta(self.alpha, self.alpha)
self.perm_index = torch.randperm(batch_size).to(device)
if self.method is None:
output = model(inputs)
output1 = self.get_perm(output)
return [output, output1]
elif self.method == 'encoder':
output = model(inputs)
output1 = self.get_perm(output)
output_final = self.mix_up(output, output1)
elif self.method == 'embed':
output = model.apply_embeddings(inputs)
output1 = self.get_perm(output)
output_final = self.mix_up(output, output1)
# Main
output_final = model.apply_main_layers(output_final)
# Final
output_final = model.apply_final_layers(output_final)
        elif self.method == 'hidden':
            if self.layer_mix is None:
                # for now only encoderLayer is considered, not decoderLayer or seq2seq architectures
                try:
                    layer_mix = random.randint(0, len(model.encoderLayer))
                except:
                    warnings.warn('LayerMix random failed')
                    layer_mix = 0
else:
layer_mix = self.layer_mix
def apply_on_layer_end(l_i, output):
if l_i == layer_mix:
output1 = self.get_perm(output)
return self.mix_up(output, output1)
else:
return output
model.apply_on_layer_end = apply_on_layer_end
output_final = model(inputs)
return output_final
    def forward(self, criterion, y_pred, y_true):
        '''Compute the mixup loss
        '''
y_true1 = y_true[self.perm_index]
return self.lam * criterion(y_pred, y_true) + (1 - self.lam) * criterion(y_pred, y_true1)
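# Usage sketch (illustrative; `net` and the shapes are made-up stand-ins for a bert4torch model):
# >>> mixup = MixUp(method='encoder', alpha=1.0)
# >>> net = nn.Linear(16, 3)
# >>> model = lambda inputs: net(inputs[0])  # adapter: bert4torch models take a list of inputs
# >>> y_pred = mixup.encode(model, [torch.randn(8, 16)])
# >>> loss = mixup(nn.CrossEntropyLoss(), y_pred, torch.randint(0, 3, (8,)))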
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import numpy as np
class FocalLoss(nn.Module):
'''Multi-class Focal loss implementation'''
    def __init__(self, gamma=2, weight=None, ignore_index=-100):
        super(FocalLoss, self).__init__()
        self.gamma = gamma
        self.weight = weight
        self.ignore_index = ignore_index
def forward(self, input, target):
"""
input: [N, C]
target: [N, ]
"""
logpt = F.log_softmax(input, dim=1)
pt = torch.exp(logpt)
logpt = (1-pt)**self.gamma * logpt
        loss = F.nll_loss(logpt, target, self.weight, ignore_index=self.ignore_index)
return loss
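# Usage sketch (illustrative sizes): raw logits in, scalar loss out.
# >>> focal = FocalLoss(gamma=2)
# >>> loss = focal(torch.randn(8, 5), torch.randint(0, 5, (8,)))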
class LabelSmoothingCrossEntropy(nn.Module):
    def __init__(self, eps=0.1, reduction='mean', ignore_index=-100):
super(LabelSmoothingCrossEntropy, self).__init__()
self.eps = eps
self.reduction = reduction
self.ignore_index = ignore_index
def forward(self, output, target):
c = output.size()[-1]
log_preds = F.log_softmax(output, dim=-1)
if self.reduction=='sum':
loss = -log_preds.sum()
else:
loss = -log_preds.sum(dim=-1)
if self.reduction=='mean':
loss = loss.mean()
return loss*self.eps/c + (1-self.eps) * F.nll_loss(log_preds, target, reduction=self.reduction,
ignore_index=self.ignore_index)
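# Usage sketch (illustrative sizes): eps mass of a uniform distribution is mixed into the target.
# >>> ls = LabelSmoothingCrossEntropy(eps=0.1)
# >>> loss = ls(torch.randn(8, 5), torch.randint(0, 5, (8,)))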
class MultilabelCategoricalCrossentropy(nn.Module):
"""多标签分类的交叉熵
说明:y_true和y_pred的shape一致,y_true的元素非0即1, 1表示对应的类为目标类,0表示对应的类为非目标类。
警告:请保证y_pred的值域是全体实数,换言之一般情况下y_pred不用加激活函数,尤其是不能加sigmoid或者softmax!预测
阶段则输出y_pred大于0的类。如有疑问,请仔细阅读并理解本文。
参考:https://kexue.fm/archives/7359
"""
def __init__(self, **kwargs):
super().__init__(**kwargs)
def forward(self, y_pred, y_true):
""" y_true ([Tensor]): [..., num_classes]
y_pred ([Tensor]): [..., num_classes]
"""
y_pred = (1-2*y_true) * y_pred
y_pred_pos = y_pred - (1-y_true) * 1e12
y_pred_neg = y_pred - y_true * 1e12
y_pred_pos = torch.cat([y_pred_pos, torch.zeros_like(y_pred_pos[..., :1])], dim=-1)
y_pred_neg = torch.cat([y_pred_neg, torch.zeros_like(y_pred_neg[..., :1])], dim=-1)
pos_loss = torch.logsumexp(y_pred_pos, dim=-1)
neg_loss = torch.logsumexp(y_pred_neg, dim=-1)
return (pos_loss + neg_loss).mean()
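# Usage sketch (illustrative sizes): y_pred are raw logits, y_true is multi-hot;
# at inference time, predict the classes with y_pred > 0.
# >>> mcc = MultilabelCategoricalCrossentropy()
# >>> loss = mcc(torch.randn(4, 10), torch.randint(0, 2, (4, 10)).float())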
class SparseMultilabelCategoricalCrossentropy(nn.Module):
"""稀疏版多标签分类的交叉熵
说明:
1. y_true.shape=[..., num_positive],
y_pred.shape=[..., num_classes];
2. 请保证y_pred的值域是全体实数,换言之一般情况下y_pred不用加激活函数,尤其是不能加sigmoid或者softmax;
3. 预测阶段则输出y_pred大于0的类;
4. 详情请看:https://kexue.fm/archives/7359 。
"""
def __init__(self, mask_zero=False, epsilon=1e-7, **kwargs):
super().__init__(**kwargs)
self.mask_zero = mask_zero
self.epsilon = epsilon
def forward(self, y_pred, y_true):
zeros = torch.zeros_like(y_pred[..., :1])
y_pred = torch.cat([y_pred, zeros], dim=-1)
if self.mask_zero:
infs = zeros + float('inf')
y_pred = torch.cat([infs, y_pred[..., 1:]], dim=-1)
y_pos_2 = torch.gather(y_pred, dim=-1, index=y_true)
y_pos_1 = torch.cat([y_pos_2, zeros], dim=-1)
if self.mask_zero:
y_pred = torch.cat([-infs, y_pred[..., 1:]], dim=-1)
y_pos_2 = torch.gather(y_pred, dim=-1, index=y_true)
pos_loss = torch.logsumexp(-y_pos_1, dim=-1)
all_loss = torch.logsumexp(y_pred, dim=-1) # a
aux_loss = torch.logsumexp(y_pos_2, dim=-1) - all_loss # b-a
aux_loss = torch.clamp(1 - torch.exp(aux_loss), self.epsilon, 1) # 1-exp(b-a)
neg_loss = all_loss + torch.log(aux_loss) # a + log[1-exp(b-a)]
return pos_loss + neg_loss
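# Usage sketch (illustrative sizes): y_true holds the indices of the positive classes,
# padded with 0 when mask_zero=True; the returned loss has one value per sample.
# >>> smcc = SparseMultilabelCategoricalCrossentropy(mask_zero=True)
# >>> y_pred, y_true = torch.randn(4, 10), torch.tensor([[1, 2], [3, 0], [5, 6], [0, 0]])
# >>> loss = smcc(y_pred, y_true).mean()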
class ContrastiveLoss(nn.Module):
"""对比损失:减小正例之间的距离,增大正例和反例之间的距离
公式:labels * distance_matrix.pow(2) + (1-labels)*F.relu(margin-distance_matrix).pow(2)
https://www.sbert.net/docs/package_reference/losses.html
"""
def __init__(self, margin=0.5, size_average=True, online=False):
super(ContrastiveLoss, self).__init__()
self.margin = margin
self.size_average = size_average
self.online = online
def forward(self, distances, labels, pos_id=1, neg_id=0):
if not self.online:
losses = 0.5 * (labels.float() * distances.pow(2) + (1 - labels).float() * F.relu(self.margin - distances).pow(2))
return losses.mean() if self.size_average else losses.sum()
else:
negs = distances[labels == neg_id]
poss = distances[labels == pos_id]
# select hard positive and hard negative pairs
negative_pairs = negs[negs < (poss.max() if len(poss) > 1 else negs.mean())]
positive_pairs = poss[poss > (negs.min() if len(negs) > 1 else poss.mean())]
positive_loss = positive_pairs.pow(2).sum()
negative_loss = F.relu(self.margin - negative_pairs).pow(2).sum()
return positive_loss + negative_loss
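# Usage sketch (illustrative sizes): distances are pairwise embedding distances,
# labels mark positive (1) vs negative (0) pairs.
# >>> cl = ContrastiveLoss(margin=0.5)
# >>> loss = cl(torch.rand(8), torch.randint(0, 2, (8,)))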
class RDropLoss(nn.Module):
    '''Loss implementation of R-Drop; official repo: https://github.com/dropreg/R-Drop
'''
def __init__(self, alpha=4, rank='adjacent'):
super().__init__()
self.alpha = alpha
        # two layouts are supported: 'adjacent' (odd/even interleaved) and 'updown' (stacked halves)
assert rank in {'adjacent', 'updown'}, "rank kwarg only support 'adjacent' and 'updown' "
self.rank = rank
self.loss_sup = nn.CrossEntropyLoss()
self.loss_rdrop = nn.KLDivLoss(reduction='none')
def forward(self, *args):
        '''Supports two call signatures: (y_pred, y_true) or (y_pred1, y_pred2, y_true)
        '''
        assert len(args) in {2, 3}, 'RDropLoss only support 2 or 3 input args'
        # y_pred is a single Tensor containing both forward passes
        if len(args) == 2:
            y_pred, y_true = args
            loss_sup = self.loss_sup(y_pred, y_true)  # computed over both passes
if self.rank == 'adjacent':
y_pred1 = y_pred[1::2]
y_pred2 = y_pred[::2]
elif self.rank == 'updown':
half_btz = y_true.shape[0] // 2
y_pred1 = y_pred[:half_btz]
y_pred2 = y_pred[half_btz:]
        # y_pred passed as two tensors
else:
y_pred1, y_pred2, y_true = args
loss_sup = self.loss_sup(y_pred1, y_true)
loss_rdrop1 = self.loss_rdrop(F.log_softmax(y_pred1, dim=-1), F.softmax(y_pred2, dim=-1))
loss_rdrop2 = self.loss_rdrop(F.log_softmax(y_pred2, dim=-1), F.softmax(y_pred1, dim=-1))
return loss_sup + torch.mean(loss_rdrop1 + loss_rdrop2) / 4 * self.alpha
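# Usage sketch (illustrative sizes, rank='adjacent'): each sample is fed twice with
# dropout on, so rows 2i and 2i+1 of the logits come from the same input.
# >>> rdrop = RDropLoss(alpha=4, rank='adjacent')
# >>> logits = torch.randn(8, 3)  # interleaved as (x1, x1, x2, x2, ...)
# >>> loss = rdrop(logits, torch.randint(0, 3, (8,)))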
class UDALoss(nn.Module):
    '''UDALoss; subclass it when using, since forward needs global_step and total_steps
https://arxiv.org/abs/1904.12848
'''
def __init__(self, tsa_schedule=None, total_steps=None, start_p=0, end_p=1, return_all_loss=True):
super().__init__()
self.loss_sup = nn.CrossEntropyLoss()
self.loss_unsup = nn.KLDivLoss(reduction='batchmean')
self.tsa_schedule = tsa_schedule
self.start = start_p
self.end = end_p
if self.tsa_schedule:
assert self.tsa_schedule in {'linear_schedule', 'exp_schedule', 'log_schedule'}, 'tsa_schedule config illegal'
self.return_all_loss = return_all_loss
def forward(self, y_pred, y_true_sup, global_step, total_steps):
sup_size = y_true_sup.size(0)
unsup_size = (y_pred.size(0) - sup_size) // 2
        # supervised part: cross-entropy loss
y_pred_sup = y_pred[:sup_size]
if self.tsa_schedule is None:
loss_sup = self.loss_sup(y_pred_sup, y_true_sup)
        else:  # use TSA to drop supervised samples whose predicted probability is already high
            threshold = self.get_tsa_threshold(self.tsa_schedule, global_step, total_steps, self.start, self.end)
            true_prob = torch.gather(F.softmax(y_pred_sup, dim=-1), dim=1, index=y_true_sup[:, None])
            sel_rows = true_prob.lt(threshold).sum(dim=-1).gt(0)  # keep only the samples below the threshold
            loss_sup = self.loss_sup(y_pred_sup[sel_rows], y_true_sup[sel_rows]) if sel_rows.sum() > 0 else 0
        # unsupervised part: KL divergence here, though cross entropy also works
y_true_unsup = y_pred[sup_size:sup_size+unsup_size]
y_true_unsup = F.softmax(y_true_unsup.detach(), dim=-1)
y_pred_unsup = F.log_softmax(y_pred[sup_size+unsup_size:], dim=-1)
loss_unsup = self.loss_unsup(y_pred_unsup, y_true_unsup)
if self.return_all_loss:
return loss_sup + loss_unsup, loss_sup, loss_unsup
else:
return loss_sup + loss_unsup
    @staticmethod
def get_tsa_threshold(schedule, global_step, num_train_steps, start, end):
training_progress = global_step / num_train_steps
if schedule == "linear_schedule":
threshold = training_progress
elif schedule == "exp_schedule":
scale = 5
threshold = math.exp((training_progress - 1) * scale)
elif schedule == "log_schedule":
scale = 5
threshold = 1 - math.exp((-training_progress) * scale)
return threshold * (end - start) + start
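# Worked example of the TSA threshold: with the linear schedule at 30% of training
# and start=0, end=1, supervised samples already predicted above 0.3 are dropped:
# >>> UDALoss.get_tsa_threshold('linear_schedule', global_step=30, num_train_steps=100, start=0, end=1)
# 0.3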
class TemporalEnsemblingLoss(nn.Module):
    '''Implementation of temporal ensembling: on top of the supervised loss, add an MSE consistency loss
    Official repo: https://github.com/s-laine/tempens
    Third-party pytorch implementation: https://github.com/ferretj/temporal-ensembling
    When using it, the train_dataloader must have shuffle=False
'''
def __init__(self, epochs, max_val=10.0, ramp_up_mult=-5.0, alpha=0.5, max_batch_num=100, hist_device='cpu'):
super().__init__()
self.loss_sup = nn.CrossEntropyLoss()
self.max_epochs = epochs
self.max_val = max_val
self.ramp_up_mult = ramp_up_mult
self.alpha = alpha
        self.max_batch_num = max_batch_num  # None keeps the full history for all batches; memory-heavy on large datasets
        self.hist_unsup = []  # history of unsupervised logits
        self.hist_sup = []  # history of supervised logits
        self.hist_device = hist_device
        self.hist_input_y = []  # history of supervised labels y
        assert (self.alpha >= 0) & (self.alpha < 1)  # alpha == 1 would make the update denominator zero
def forward(self, y_pred_sup, y_pred_unsup, y_true_sup, epoch, bti):
self.same_batch_check(y_pred_sup, y_pred_unsup, y_true_sup, bti)
if (self.max_batch_num is None) or (bti < self.max_batch_num):
            self.init_hist(bti, y_pred_sup, y_pred_unsup)  # initialize the history
            sup_ratio = float(len(y_pred_sup)) / (len(y_pred_sup) + len(y_pred_unsup))  # proportion of supervised samples
            w = self.weight_schedule(epoch, sup_ratio)
            sup_loss, unsup_loss = self.temporal_loss(y_pred_sup, y_pred_unsup, y_true_sup, bti)
            # update the history
            self.hist_unsup[bti] = self.update(self.hist_unsup[bti], y_pred_unsup.detach(), epoch)
            self.hist_sup[bti] = self.update(self.hist_sup[bti], y_pred_sup.detach(), epoch)
            # if bti == 0:  # check that the data order stays the same across epochs
            #     print(w, sup_loss.item(), w * unsup_loss.item())
            #     print(y_true_sup)
return sup_loss + w * unsup_loss, sup_loss, w * unsup_loss
else:
return self.loss_sup(y_pred_sup, y_true_sup)
    def same_batch_check(self, y_pred_sup, y_pred_unsup, y_true_sup, bti):
        '''Check that the first few batches (hard-coded to 10) stay identical across epochs
        '''
if bti >= 10:
return
if bti >= len(self.hist_input_y):
self.hist_input_y.append(y_true_sup.to(self.hist_device))
        else:  # verify against the recorded labels
err_msg = 'TemporalEnsemblingLoss requests the same sort dataloader, you may need to set train_dataloader shuffle=False'
assert self.hist_input_y[bti].equal(y_true_sup.to(self.hist_device)), err_msg
    def update(self, hist, y_pred, epoch):
        '''Update the historical logits, with alpha gating the mixing ratio
        '''
        Z = self.alpha * hist.to(y_pred) + (1. - self.alpha) * y_pred
        output = Z * (1. / (1. - self.alpha ** (epoch + 1)))
        return output.to(self.hist_device)
def weight_schedule(self, epoch, sup_ratio):
max_val = self.max_val * sup_ratio
if epoch == 0:
return 0.
elif epoch >= self.max_epochs:
return max_val
return max_val * np.exp(self.ramp_up_mult * (1. - float(epoch) / self.max_epochs) ** 2)
def temporal_loss(self, y_pred_sup, y_pred_unsup, y_true_sup, bti):
# MSE between current and temporal outputs
def mse_loss(out1, out2):
quad_diff = torch.sum((F.softmax(out1, dim=1) - F.softmax(out2, dim=1)) ** 2)
return quad_diff / out1.data.nelement()
sup_loss = self.loss_sup(y_pred_sup, y_true_sup)
        # the original implementation treats sup and unsup as one tensor and computes them jointly; here they are two separate tensors, so they are computed separately
unsup_loss = mse_loss(y_pred_unsup, self.hist_unsup[bti].to(y_pred_unsup))
unsup_loss += mse_loss(y_pred_sup, self.hist_sup[bti].to(y_pred_sup))
return sup_loss, unsup_loss
def init_hist(self, bti, y_pred_sup, y_pred_unsup):
if bti >= len(self.hist_sup):
self.hist_sup.append(torch.zeros_like(y_pred_sup).to(self.hist_device))
self.hist_unsup.append(torch.zeros_like(y_pred_unsup).to(self.hist_device))
import os
import time
from inspect import isfunction
import torch
import torch.nn as nn
import copy
import json
import re
from bert4torch.layers import LayerNorm, BertEmbeddings, BertLayer, Identity, T5Layer, GatedAttentionUnit, XlnetLayer
from bert4torch.layers import AdaptiveEmbedding, XlnetPositionsEncoding
from bert4torch.snippets import metric_mapping, search_layer, insert_arguments, delete_arguments, get_kw
from bert4torch.snippets import ProgbarLogger, EarlyStopping, FGM, PGD, VAT, IterDataset, take_along_dim
from bert4torch.activations import get_activation
from collections import OrderedDict
import warnings
from apex import amp
from apex.amp import _amp_state
from torch.cuda.amp import GradScaler
class BaseModel(nn.Module):
def __init__(self):
super(BaseModel, self).__init__()
        # exposed mainly so that external code can read these
self.global_step, self.local_step, self.total_steps, self.epoch, self.steps_per_epoch, self.train_dataloader = 0, 0, 0, 0, None, None
self.resume_step, self.resume_epoch = 0, 0
self.callbacks = []
    def save_steps_params(self, save_path):
        '''Save the training-progress parameters
        '''
step_params = {'resume_step': (self.local_step+1) % self.steps_per_epoch,
'resume_epoch': self.epoch + (self.local_step+1) // self.steps_per_epoch}
torch.save(step_params, save_path)
    def load_steps_params(self, save_path):
        '''Load the training-progress parameters
        '''
step_params = torch.load(save_path)
self.resume_step = step_params['resume_step']
self.resume_epoch = step_params['resume_epoch']
return step_params
def compile(self, loss, optimizer, scheduler=None, clip_grad_norm=None, use_amp=False, use_apex=False, metrics=None, adversarial_train={'name': ''}):
        '''Define the loss, optimizer, metrics, etc.
        loss: loss function
        optimizer: optimizer
        scheduler: scheduler
        clip_grad_norm: gradient clipping threshold; disabled by default
        use_amp: whether to use mixed precision; disabled by default
        use_apex: whether to use apex mixed precision; disabled by default
        metrics: metrics printed during training; loss-related metrics are always printed; 'accuracy' is built in, and custom metrics are supported in the form {key: func}
        '''
self.criterion = loss
self.optimizer = optimizer
self.scheduler = scheduler
self.clip_grad_norm = clip_grad_norm
self.use_amp = use_amp
self.use_apex = use_apex
if use_amp:
assert adversarial_train['name'] not in {'vat', 'gradient_penalty'}, 'Amp and adversarial_train both run is not supported in current version'
from torch.cuda.amp import autocast
self.autocast = autocast
self.scaler = torch.cuda.amp.GradScaler()
        # metrics tracked during training
self.metrics = OrderedDict({'loss': None})
if metrics is None:
metrics = []
elif isinstance(metrics, (str, dict)) or isfunction(metrics):
metrics = [metrics]
        for metric in metrics:
            # string: currently only 'accuracy' is built in
            if isinstance(metric, str) and metric != 'loss':
                self.metrics[metric] = None
            # dict form {metric: func}
            elif isinstance(metric, dict):
                self.metrics.update(metric)
            # function form: the function serves as both key and value
            elif isfunction(metric):
                self.metrics.update({metric: metric})
            else:
                raise ValueError('Args metrics only support "String, Dict, Callable, List[String, Dict, Callable]" format')
        # adversarial training
self.adversarial = adversarial_train
self.adversarial_initialize()
    def adversarial_initialize(self):
        '''Initialize adversarial training
        '''
assert self.adversarial['name'] in {'', 'fgm', 'pgd', 'vat', 'gradient_penalty'}, 'adversarial_train support fgm, pgd, vat and gradient_penalty mode'
self.adversarial['epsilon'] = self.adversarial.get('epsilon', 1.0)
self.adversarial['emb_name'] = self.adversarial.get('emb_name', 'word_embeddings')
if self.adversarial['name'] == 'fgm':
self.ad_train = FGM(self)
        elif self.adversarial['name'] == 'pgd':
            self.adversarial['K'] = self.adversarial.get('K', 3)  # number of attack steps
            self.adversarial['alpha'] = self.adversarial.get('alpha', 0.3)  # step size
            self.ad_train = PGD(self)
        elif self.adversarial['name'] == 'gradient_penalty':
            pass
        elif self.adversarial['name'] == 'vat':
            self.adversarial['K'] = self.adversarial.get('K', 3)
            self.adversarial['noise_var'] = self.adversarial.get('noise_var', 1e-5)  # noise variance
            self.adversarial['noise_gamma'] = self.adversarial.get('noise_gamma', 1e-6)  # eps
            self.adversarial['adv_step_size'] = self.adversarial.get('adv_step_size', 1e-3)  # step size
            self.adversarial['adv_alpha'] = self.adversarial.get('adv_alpha', 1)  # weight of the adversarial loss
            self.adversarial['norm_type'] = self.adversarial.get('norm_type', 'l2')  # normalization type
            self.ad_train = VAT(self, **self.adversarial)
    def adversarial_training(self, train_X, train_y, output, loss, loss_detail, grad_accumulation_steps):
        '''Adversarial training step
        '''
        if self.adversarial['name'] == 'fgm':
            self.ad_train.attack(**self.adversarial)  # the embedding is perturbed here
            output, loss, loss_detail = self.train_step(train_X, train_y, grad_accumulation_steps)
            loss.backward()  # backprop: accumulate the adversarial gradients on top of the normal ones
            # restore the embedding parameters: updates must happen on the clean embedding, not the adversarially perturbed one
            self.ad_train.restore(**self.adversarial)
        elif self.adversarial['name'] == 'pgd':
            self.ad_train.backup_grad()  # back up the gradients
            for t in range(self.adversarial['K']):
                # add an adversarial perturbation to the embedding; back up param.data on the first attack
                self.ad_train.attack(**self.adversarial, is_first_attack=(t==0))
                if t != self.adversarial['K']-1:
                    self.optimizer.zero_grad()  # accumulate the perturbation rather than the gradient
                else:
                    self.ad_train.restore_grad()  # restore the normal gradients
                output, loss, loss_detail = self.train_step(train_X, train_y, grad_accumulation_steps)
                loss.backward()  # backprop: accumulate the adversarial gradients on top of the normal ones
            self.ad_train.restore(**self.adversarial)  # restore the embedding parameters
        # gradient penalty
        elif self.adversarial['name'] == 'gradient_penalty':
            para = search_layer(self, self.adversarial['emb_name'], retrun_first=True)
            gp = (para.grad ** 2).sum()
            loss += 0.5 * gp * self.adversarial['epsilon']
            loss.backward()
        # virtual adversarial training
        elif self.adversarial['name'] == 'vat':
            logit = output[0] if isinstance(output, (list, tuple)) else output
            adv_loss = self.ad_train.virtual_adversarial_training(train_X, logit)
            loss_detail.update({'loss_sup': loss.item(), 'loss_unsup': adv_loss})
            loss += (adv_loss if adv_loss else 0)
            loss.backward()
        return loss, loss_detail
    def train_step(self, train_X, train_y, grad_accumulation_steps):
        '''Run the forward pass and return the loss
        '''
        def args_segmentate(train_X):
            '''Decide whether to unpack the arguments
            '''
            if isinstance(train_X, torch.Tensor):  # a bare tensor is never unpacked
                pass
            elif isinstance(self, (BaseModelDP, BaseModelDDP)):
                if self.module.forward.__code__.co_argcount >= 3:
                    return True
            elif self.forward.__code__.co_argcount >= 3:
                return True
            return False
if self.use_amp:
with self.autocast():
output = self.forward(*train_X) if args_segmentate(train_X) else self.forward(train_X)
loss_detail = self.criterion(output, train_y)
else:
output = self.forward(*train_X) if args_segmentate(train_X) else self.forward(train_X)
loss_detail = self.criterion(output, train_y)
if isinstance(loss_detail, torch.Tensor):
loss = loss_detail
loss_detail = {}
        elif isinstance(loss_detail, dict):
            loss = loss_detail['loss']  # the other entries are kept only for logging
            del loss_detail['loss']
elif isinstance(loss_detail, (tuple, list)):
loss = loss_detail[0]
loss_detail = {f'loss{i}':v for i, v in enumerate(loss_detail[1:], start=1)}
else:
raise ValueError('Return loss only support Tensor/dict/tuple/list format')
        # gradient accumulation
        loss = loss / grad_accumulation_steps if grad_accumulation_steps > 1 else loss
return output, loss, loss_detail
    def callback_fun(self, mode, logs={}):
        '''Dispatch the callbacks in one place, which makes conditional triggers easier
        '''
        # under distributed DDP training, only the master_rank runs callbacks
        if isinstance(self, BaseModelDDP) and self.master_rank != torch.distributed.get_rank():
return
if mode == 'train_begin':
for callback in self.callbacks:
callback.on_train_begin()
elif mode == 'epoch_begin':
for callback in self.callbacks:
callback.on_epoch_begin(self.global_step, self.epoch, logs)
elif mode == 'batch_begin':
for callback in self.callbacks:
callback.on_batch_begin(self.global_step, self.local_step, logs)
elif mode == 'batch_end':
for callback in self.callbacks:
callback.on_batch_end(self.global_step, self.local_step, logs)
elif mode == 'epoch_end':
for callback in self.callbacks:
callback.on_epoch_end(self.global_step, self.epoch, logs)
elif mode == 'train_end':
for callback in self.callbacks:
callback.on_train_end()
elif mode == 'dataloader_end':
for callback in self.callbacks:
callback.on_dataloader_end()
def fit(self, train_dataloader, steps_per_epoch=None, epochs=1, grad_accumulation_steps=1, callbacks=None):
if not hasattr(train_dataloader, '__len__'):
assert steps_per_epoch is not None, 'Either train_dataloader has attr "__len__" or steps_per_epoch is not None'
self.steps_per_epoch = len(train_dataloader) if steps_per_epoch is None else steps_per_epoch
self.total_steps = self.steps_per_epoch * epochs
self.train_dataloader = train_dataloader # 设置为成员变量,可由外部的callbacks进行修改
        train_dataloader_iter = iter(self.train_dataloader)  # not regenerated when looping over epochs
callbacks = [] if callbacks is None else callbacks
callbacks = callbacks if isinstance(callbacks, (list, tuple)) else [callbacks]
self.callbacks = [ProgbarLogger(epochs, self.steps_per_epoch, [i for i in self.metrics.keys() if isinstance(i, str)])] + callbacks
self.callback_fun('train_begin')
        # epoch: current epoch
        # global_step: current global training step
        # local_step: training step within the current epoch; the batch behind the same local_step may differ across epochs (it is the same when steps_per_epoch=None)
        # bti: index into the dataloader; the batch behind the same bti is generally the same across epochs, unless the dataloader is regenerated
        self.bti = 0
        # path of the fps log file
        fps_dir = 'log/'
        os.makedirs(fps_dir, exist_ok=True)
        fw = open(os.path.join(fps_dir, 'fps.txt'), 'a', encoding='utf-8')
        # total training time
        total_training_time = 0
        # total number of training samples
        total_samples = 0
        # total evaluation time
        total_eval_time = 0
        # total wall time
        total_time = 0
for epoch in range(self.resume_epoch, epochs):
self.epoch = epoch
            # number of samples in this epoch
            epoch_samples = 0
            # epoch start time
            epoch_start = time.time()
            # resume_step: the starting point of local_step and of the progress bar
            resume_step = self.resume_step if epoch == self.resume_epoch else 0
            self.callback_fun('epoch_begin')
            self.callbacks[0].seen = resume_step
for local_step in range(resume_step, self.steps_per_epoch):
self.local_step = local_step
self.global_step = self.epoch * self.steps_per_epoch + self.local_step
                # loop over the dataloader; avoid itertools.cycle, which has shown problems with variables not being released
try:
batch = next(train_dataloader_iter)
except StopIteration:
                    self.callback_fun('dataloader_end')  # for large datasets read dynamically from file, where the dataloader is regenerated (e.g. pretraining)
                    train_dataloader_iter = iter(self.train_dataloader)  # with shuffle=True the order is regenerated here as well
self.bti = 0
batch = next(train_dataloader_iter)
train_X, train_y = batch
                # count the training samples
total_samples += train_X.shape[0]
epoch_samples += train_X.shape[0]
                # get btz; at most two levels of nesting are allowed, i.e. ((token_ids1, mask1), (token_ids2, mask2))
# if isinstance(train_X, (list, tuple)):
# if isinstance(train_X[0], (list, tuple)):
# btz = train_X[0][0].size(0)
# else:
# btz = train_X[0].size(0)
# elif isinstance(train_X, torch.Tensor):
# btz = train_X.size(0)
# else:
# raise ValueError('Input only support [list, tuple, tensor]')
# logs = {'batch': self.local_step, 'size': btz}
logs = OrderedDict()
self.callback_fun('batch_begin', logs)
                self.train()  # switch to train mode
                # argument-count check: >=3 means multiple inputs, ==2 means a single packed input
output, loss, loss_detail = self.train_step(train_X, train_y, grad_accumulation_steps)
retain_graph = True if self.adversarial['name'] in {'gradient_penalty', 'vat'} else False
                if self.use_amp:  # mixed precision
scale_before_step = self.scaler.get_scale()
self.scaler.scale(loss).backward(retain_graph=retain_graph)
elif self.use_apex:
with amp.scale_loss(loss,self.optimizer) as scaled_loss:
scaled_loss.backward(retain_graph=retain_graph)
else:
loss.backward(retain_graph=retain_graph)
                # adversarial training
                loss, loss_detail = self.adversarial_training(train_X, train_y, output, loss, loss_detail, grad_accumulation_steps)
                # parameter update; the actual number of updates is divided by grad_accumulation_steps, so adjust the total training steps accordingly
if (self.global_step+1) % grad_accumulation_steps == 0:
skip_scheduler = False
                    # mixed precision
                    if self.use_amp:
                        self.scaler.unscale_(self.optimizer)
                        if self.clip_grad_norm is not None:  # gradient clipping
                            torch.nn.utils.clip_grad_norm_(self.parameters(), self.clip_grad_norm)
                        self.scaler.step(self.optimizer)
                        self.scaler.update()
                        skip_scheduler = self.scaler.get_scale() != scale_before_step
                    else:
                        if self.clip_grad_norm is not None:  # gradient clipping
                            torch.nn.utils.clip_grad_norm_(self.parameters(), self.clip_grad_norm)
                        self.optimizer.step()
                    self.optimizer.zero_grad()  # clear the gradients
if (self.scheduler is not None) and not skip_scheduler:
if isinstance(self.scheduler, (tuple, list)):
for scheduler in self.scheduler:
scheduler.step()
else:
self.scheduler.step()
                # add the loss to the printed logs
logs.update({'loss': loss.item()})
logs_loss_detail = {k: v.item() if isinstance(v, torch.Tensor) else v for k, v in loss_detail.items()}
logs.update(logs_loss_detail)
if self.global_step == resume_step:
self.callbacks[0].add_metrics(list(logs_loss_detail.keys()), add_position=1)
                # add the metrics to the printed logs
                for metric, func in self.metrics.items():
                    perf = metric_mapping(metric, func, output, train_y)  # built-in accuracy-style metrics
                    if perf is not None:
                        if isfunction(metric):  # a callable passed in directly (no key)
                            if self.global_step == resume_step:
                                self.callbacks[0].add_metrics(list(perf.keys()))
                            logs.update(perf)
                        elif isinstance(metric, str):  # a metric referenced by its string key
                            logs[metric] = perf
logs[metric] = perf
self.callback_fun('batch_end', logs)
self.bti += 1
            # end of training for this epoch
            epoch_train_end = time.time()
            # training time of this epoch
            cur_epoch_train_time = epoch_train_end - epoch_start
            # accumulate the total training time
            total_training_time += cur_epoch_train_time
            logs['f1'] = 0
            # start of evaluation for this epoch
            eval_start = time.time()
            self.callback_fun('epoch_end', logs)
            # end time of this epoch
            epoch_end = time.time()
            # evaluation time of this epoch
            cur_epoch_eval_time = time.time() - eval_start
            # accumulate the total evaluation time
            total_eval_time += cur_epoch_eval_time
            # accumulate the total wall time
            total_time += (epoch_end - epoch_start)
            # write to the log file 'log/fps.txt'
if fw is not None:
fw.write('Epoch: {}\tCurrent Start time: {:.6f}\n'.format(epoch, epoch_start))
fw.write('Epoch: {}\tCurrent Train Samples: {}\n'.format(epoch, epoch_samples))
fw.write('Epoch: {}\tCurrent Train Time: {}\n'.format(epoch, cur_epoch_train_time))
fw.write('Epoch: {}\tCurrent Eval Time: {}\n'.format(epoch, cur_epoch_eval_time))
fw.write('Epoch: {}\tCurrent Epoch FPS: {:.3f}\n'.format(epoch, epoch_samples/cur_epoch_train_time))
fw.write('Epoch: {}\tCurrent End time: {:.6f}\n'.format(epoch, epoch_end))
fw.write('Epoch: {}\tCurrent F1 score: {:.5f}\n'.format(epoch, logs['f1']))
fw.write('Epoch: {}\tAll Train Samples: {}\n'.format(epoch, total_samples))
fw.write('Epoch: {}\tAll Train Time: {}\n'.format(epoch, total_training_time))
fw.write('Epoch: {}\tAll Eval Time: {}\n'.format(epoch, total_eval_time))
fw.write('Epoch: {}\tAll Time: {}\n'.format(epoch, total_time))
fw.flush()
            # early-stopping strategy
callback_tmp = [callback_tmp for callback_tmp in self.callbacks if isinstance(callback_tmp, EarlyStopping)]
if callback_tmp and callback_tmp[0].stopped_epoch > 0:
break
self.callback_fun('train_end', logs)
@torch.no_grad()
def predict(self, input_tensor_list, return_all=None):
self.eval()
if self.forward.__code__.co_argcount >= 3:
output = self.forward(*input_tensor_list)
else:
output = self.forward(input_tensor_list)
if return_all is None:
return output
elif isinstance(output, (tuple, list)) and isinstance(return_all, int) and return_all < len(output):
return output[return_all]
else:
raise ValueError('Return format error')
    def load_weights(self, load_path, strict=True, prefix=None):
        '''Load the model weights
        load_path: path to load the weights from
        prefix: None loads by the current keys; a string loads by the original keys from variable_mapping()
        '''
state_dict = torch.load(load_path, map_location='cpu')
if prefix is None:
self.load_state_dict(state_dict, strict=strict)
else:
            # load by the original keys from variable_mapping()
eval_str = 'self.variable_mapping()' if prefix == '' else f'self.{prefix}.variable_mapping()'
mapping = {v:k for k, v in eval(eval_str).items()}
mapping = mapping if prefix == '' else {k:f'{prefix}.{v}' for k,v in mapping.items()}
state_dict_raw = {}
for k, v in state_dict.items():
k = mapping.get(k, k)
state_dict_raw[k] = v
self.load_state_dict(state_dict_raw, strict=strict)
    def save_weights(self, save_path, prefix=None):
        '''Save the model weights
        save_path: path to save the weights to
        prefix: None saves by the current keys; a string saves by the original keys from variable_mapping()
        '''
if prefix is None:
torch.save(self.state_dict(), save_path)
else:
            # save by the original keys from variable_mapping(), so that other official code can load the model
eval_str = 'self.variable_mapping()' if prefix == '' else f'self.{prefix}.variable_mapping()'
mapping = eval(eval_str)
mapping = mapping if prefix == '' else {f'{prefix}.{k}':v for k,v in mapping.items()}
state_dict_raw = {}
for k, v in self.state_dict().items():
k = mapping.get(k, k)
state_dict_raw[k] = v
torch.save(state_dict_raw, save_path)
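# Typical training wiring (sketch; assumes `model` subclasses BaseModel and
# `train_dataloader` yields (train_X, train_y) batches):
# >>> model.compile(loss=nn.CrossEntropyLoss(),
# ...               optimizer=torch.optim.Adam(model.parameters(), lr=2e-5),
# ...               metrics=['accuracy'], clip_grad_norm=1.0)
# >>> model.fit(train_dataloader, epochs=3, steps_per_epoch=None, grad_accumulation_steps=1)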
class BaseModelDP(nn.DataParallel, BaseModel):
    '''Multi-GPU training via DataParallel; reversing the order of the base classes breaks things
    '''
def __init__(self, *args, **kwargs):
BaseModel.__init__(self)
nn.DataParallel.__init__(self, *args, **kwargs)
class BaseModelDDP(nn.parallel.DistributedDataParallel, BaseModel):
    '''Multi-GPU training via DistributedDataParallel; reversing the order of the base classes breaks things
    '''
def __init__(self, *args, master_rank=0, **kwargs):
        self.master_rank = master_rank  # rank that owns the progress bar
BaseModel.__init__(self)
nn.parallel.DistributedDataParallel.__init__(self, *args, **kwargs)
class BERT_BASE(BaseModel):
"""模型基类
"""
def __init__(
self,
vocab_size, # 词表大小
hidden_size, # 编码维度
num_hidden_layers, # Transformer总层数
num_attention_heads, # Attention的头数
intermediate_size, # FeedForward的隐层维度
hidden_act, # FeedForward隐层的激活函数
dropout_rate=None, # Dropout比例
attention_probs_dropout_prob=None, # Attention矩阵的Dropout比例
embedding_size=None, # 指定embedding_size, 不指定则使用config文件的参数
attention_head_size=None, # Attention中V的head_size
attention_key_size=None, # Attention中Q,K的head_size
initializer_range=0.02, # 权重初始化方差
sequence_length=None, # 是否固定序列长度
keep_tokens=None, # 要保留的词ID列表
compound_tokens=None, # 扩展Embedding
residual_attention_scores=False, # Attention矩阵加残差
ignore_invalid_weights=False, # 允许跳过不存在的权重
keep_hidden_layers=None, # 保留的hidden_layer层的id
hierarchical_position=None, # 是否层次分解位置编码
**kwargs
):
super(BERT_BASE, self).__init__()
if keep_tokens is not None:
vocab_size = len(keep_tokens)
if compound_tokens is not None:
vocab_size += len(compound_tokens)
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.attention_head_size = attention_head_size or self.hidden_size // self.num_attention_heads
self.attention_key_size = attention_key_size or self.attention_head_size
self.intermediate_size = intermediate_size
self.dropout_rate = dropout_rate or 0
self.attention_probs_dropout_prob = attention_probs_dropout_prob or 0
self.hidden_act = hidden_act
self.embedding_size = embedding_size or hidden_size
self.initializer_range = initializer_range
self.sequence_length = sequence_length
self.keep_tokens = keep_tokens
self.compound_tokens = compound_tokens
self.attention_bias = None
self.position_bias = None
self.attention_scores = None
self.residual_attention_scores = residual_attention_scores
self.ignore_invalid_weights = ignore_invalid_weights
self.keep_hidden_layers = set(range(num_hidden_layers)) if keep_hidden_layers is None else set(keep_hidden_layers)
self.hierarchical_position = hierarchical_position
def build(
self,
attention_caches=None,
layer_norm_cond=None,
layer_norm_cond_hidden_size=None,
layer_norm_cond_hidden_act=None,
additional_input_layers=None,
**kwargs
):
"""模型构建函数
attention_caches: 为Attention的K,V的缓存序列字典,格式为{Attention层名: [K缓存, V缓存]};
layer_norm_*系列参数: 实现Conditional Layer Normalization时使用,用来实现以“固定长度向量”为条件的条件Bert。
"""
# additional_input
# if additional_input_layers is not None:
# if not isinstance(additional_input_layers, list):
# self.additional_input_layers = [additional_input_layers]
# else:
# self.additional_input_layers = additional_input_layers
# Other
self.attention_caches = attention_caches or {}
# self.layer_norm_conds = [
# layer_norm_cond,
# layer_norm_cond_hidden_size,
# layer_norm_cond_hidden_act or 'linear',
# ]
self.output_all_encoded_layers = kwargs.get('output_all_encoded_layers', False)
def forward(self, inputs):
"""定义模型的执行流程
"""
# Embedding
outputs = self.apply_embeddings(inputs)
# Main
outputs = self.apply_main_layers(outputs)
# Final
outputs = self.apply_final_layers(outputs)
return outputs
    def init_model_weights(self, module):
        """Initialize the weights
        """
        if isinstance(module, (nn.Linear, nn.Embedding)) and (module.weight.requires_grad):
            # bert initialization; the tf version uses a truncated normal for Linear and Embedding layers,
            # which pytorch does not implement. This makes no difference when finetuning from pretrained weights,
            # cf https://github.com/pytorch/pytorch/pull/5617
            # fixed relative position encodings such as Sinusoidal need no initialization
            module.weight.data.normal_(mean=0.0, std=self.initializer_range)
elif isinstance(module, LayerNorm):
            if hasattr(module, 'bias') and module.bias.requires_grad:  # models such as T5 use RMSNorm (no bias)
module.bias.data.zero_()
if hasattr(module, 'weight') and module.weight.requires_grad:
module.weight.data.fill_(1.0)
if isinstance(module, nn.Linear) and (module.bias is not None) and (module.bias.requires_grad):
module.bias.data.zero_()
def variable_mapping(self):
"""构建pytorch层与checkpoint的变量名之间的映射表
"""
return {}
def load_variable(self):
raise NotImplementedError
def load_embeddings(self, embeddings):
"""根据keep_tokens和compound_tokens对embedding进行修改
"""
if self.keep_tokens is not None:
embeddings = embeddings[self.keep_tokens]
if self.compound_tokens is not None:
ext_embeddings = []
for item in self.compound_tokens:
try:
ext_embeddings.append(torch.mean(embeddings[item], 0) * torch.ones_like(embeddings[item]))
except IndexError:
ext_embeddings.append(torch.mean(embeddings, 0, keepdim=True))
warnings.warn(f'Initialize ext_embeddings from compound_tokens not in embedding index')
embeddings = torch.cat([embeddings] + ext_embeddings, 0)
return embeddings
def load_pos_embeddings(self, embeddings):
"""根据hierarchical_position对pos_embedding进行修改
"""
if self.hierarchical_position is not None:
alpha = 0.4 if self.hierarchical_position is True else self.hierarchical_position
embeddings = embeddings - alpha * embeddings[:1]
embeddings = embeddings / (1 - alpha)
position_index = torch.arange(self.max_position)[:, None]
            # shim for older pytorch versions that lack take_along_dim
embeddings_x = take_along_dim(embeddings, torch.div(position_index, embeddings.size(0), rounding_mode='trunc'), dim=0)
embeddings_y = take_along_dim(embeddings, position_index % embeddings.size(0), dim=0)
embeddings = alpha * embeddings_x + (1 - alpha) * embeddings_y
return embeddings
    def load_weights_from_pytorch_checkpoint(self, checkpoint, mapping=None):
        """Load weights from a checkpoint according to the mapping
        """
        file_state_dict = torch.load(checkpoint, map_location='cpu')  # load the checkpoint file
        mapping = mapping or self.variable_mapping()
        parameters_set = set([i[0] for i in self.named_parameters()])  # trainable variables
        # if a key exists both in the checkpoint and in the model but not in the preset mapping, add it to the mapping;
        # mainly so that extra layers added by external subclasses of BERT are also loaded from the checkpoint automatically
        for layer_name in parameters_set:
            if (layer_name in file_state_dict) and (layer_name not in mapping):
                mapping.update({layer_name: layer_name})
        state_dict_new = {}
        for new_key, old_key in mapping.items():
            if new_key not in self.state_dict():
                continue
            elif old_key in file_state_dict:  # present in the mapping and in the checkpoint
                state_dict_new[new_key] = self.load_variable(file_state_dict, old_key)
            elif (old_key not in file_state_dict) and (not self.ignore_invalid_weights):
                # present in the mapping but missing from the checkpoint file
                print(f'[WARNING] {old_key} not found in pretrain models')
            if new_key in parameters_set:
                parameters_set.remove(new_key)
        # parameters that could not be loaded from the pretrained weights
        if not self.ignore_invalid_weights:
            for key in parameters_set:
                print(f'[WARNING] Parameter {key} not loaded from pretrain models')
        del file_state_dict
        # load the checkpoint weights into the model structure
        self.load_state_dict(state_dict_new, strict=False)
# def get_inputs(self):
# pass
# def set_inputs(self, inputs, additional_input_layers=None):
# """设置input和inputs属性
# """
# pass
def apply_embeddings(self, inputs):
raise NotImplementedError
def apply_main_layers(self, inputs):
raise NotImplementedError
def apply_final_layers(self, inputs):
raise NotImplementedError
def apply_on_layer_begin(self, l_i, inputs):
        '''Hook for operating on the inputs of a layer block
        '''
return inputs
def apply_on_layer_end(self, l_i, inputs):
        '''Hook for operating on the outputs of a layer block
        '''
return inputs
def compute_attention_bias(self, inputs=None):
"""定义每一层的Attention Bias
"""
return self.attention_bias
def compute_position_bias(self, inputs=None):
"""定义每一层的Position Bias(一般相对位置编码用)
"""
return self.position_bias
def set_outputs(self, outputs):
"""设置output和oututs属性
"""
if not isinstance(outputs, list):
outputs = [outputs]
outputs = outputs[:]
self.outputs = outputs
if len(outputs) > 1:
self.output = outputs
else:
self.output = outputs[0]
class LM_Mask(object):
"""定义下三角Attention Mask(语言模型用)
"""
def compute_attention_bias(self, inputs=None):
"""通过idxs序列的比较来得到对应的mask
"""
seq_len = inputs[0].shape[1]
attention_bias = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.long, device=inputs[0].device), diagonal=0)
self.attention_bias = attention_bias.unsqueeze(0).unsqueeze(1)
return self.attention_bias
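# For seq_len=3 the bias is the lower triangle broadcast to [1, 1, seq_len, seq_len]:
# [[[[1, 0, 0],
#    [1, 1, 0],
#    [1, 1, 1]]]]
# i.e. each position may only attend to itself and earlier positions.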
def extend_with_language_model(InputModel):
"""添加下三角的Attention Mask(语言模型用)
"""
class LanguageModel(LM_Mask, InputModel):
"""带下三角Attention Mask的派生模型
"""
def __init__(self, *args, **kwargs):
kwargs['with_mlm'] = kwargs.get('with_mlm') or True
super(LanguageModel, self).__init__(*args, **kwargs)
return LanguageModel
class UniLM_Mask(object):
"""定义UniLM的Attention Mask(Seq2Seq模型用)
其中source和target的分区,由segment_ids来表示。
UniLM: https://arxiv.org/abs/1905.03197
"""
def compute_attention_bias(self, inputs=None):
"""通过idxs序列的比较来得到对应的mask
"""
segment_ids = inputs[1]
attention_bias = torch.cumsum(segment_ids, dim=1)
attention_bias = (attention_bias.unsqueeze(1)) <= (attention_bias.unsqueeze(2))
self.attention_bias = attention_bias.unsqueeze(1).long()
return self.attention_bias
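# For segment_ids = [0, 0, 1, 1] the cumulative sums are [0, 0, 1, 2], giving the mask
# [[1, 1, 0, 0],
#  [1, 1, 0, 0],
#  [1, 1, 1, 0],
#  [1, 1, 1, 1]]
# i.e. source tokens attend bidirectionally within the source, while target tokens
# attend to the source plus the previously generated target tokens.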
def extend_with_unified_language_model(InputModel):
"""添加UniLM的Attention Mask(Seq2Seq模型用)
"""
class UnifiedLanguageModel(UniLM_Mask, InputModel):
"""带UniLM的Attention Mask的派生模型
UniLM: https://arxiv.org/abs/1905.03197
"""
def __init__(self, *args, **kwargs):
kwargs['with_mlm'] = kwargs.get('with_mlm') or True
super(UnifiedLanguageModel, self).__init__(*args, **kwargs)
return UnifiedLanguageModel
class BERT(BERT_BASE):
"""构建BERT模型
"""
def __init__(
self,
max_position, # 序列最大长度
segment_vocab_size=2, # segment总数目
with_pool=False, # 是否包含Pool部分
with_nsp=False, # 是否包含NSP部分
with_mlm=False, # 是否包含MLM部分
custom_position_ids=False, # 是否自行传入位置id
custom_attention_mask=False, # 是否自行传入attention_mask
shared_segment_embeddings=False, # 若True,则segment跟token共用embedding
layer_norm_cond=None, # conditional layer_norm
layer_add_embs=None, # addtional_embeddng, 比如加入词性,音调,word粒度的自定义embedding
is_dropout=False,
token_pad_ids=0, # 默认0是padding ids, 但是注意google的mt5padding不是0
**kwargs # 其余参数
):
super(BERT, self).__init__(**kwargs)
self.max_position = max_position
self.segment_vocab_size = segment_vocab_size
self.with_pool = with_pool
self.with_nsp = with_nsp
self.with_mlm = with_mlm
self.custom_position_ids = custom_position_ids
self.custom_attention_mask = custom_attention_mask
self.shared_segment_embeddings = shared_segment_embeddings
self.is_dropout = is_dropout
self.token_pad_ids = token_pad_ids
if self.with_nsp and not self.with_pool:
self.with_pool = True
self.layer_norm_conds = layer_norm_cond
self.layer_add_embs = layer_add_embs
self.conditional_size = layer_norm_cond.weight.size(1) if layer_norm_cond is not None else None
self.embeddings = BertEmbeddings(self.vocab_size, self.embedding_size, self.hidden_size, self.max_position, self.segment_vocab_size, self.shared_segment_embeddings,
self.dropout_rate, self.conditional_size, **get_kw(BertEmbeddings, kwargs))
        kwargs['max_position'] = self.max_position  # needed by relative position encodings
layer = BertLayer(self.hidden_size, self.num_attention_heads, self.dropout_rate, self.attention_probs_dropout_prob, self.intermediate_size, self.hidden_act,
is_dropout=self.is_dropout, conditional_size=self.conditional_size, **get_kw(BertLayer, kwargs))
self.encoderLayer = nn.ModuleList([copy.deepcopy(layer) if layer_id in self.keep_hidden_layers else Identity() for layer_id in range(self.num_hidden_layers)])
        if self.with_pool:
            # Pooler (extracts the CLS vector)
            self.pooler = nn.Linear(self.hidden_size, self.hidden_size)
            self.pooler_activation = nn.Tanh() if self.with_pool is True else get_activation(self.with_pool)
            if self.with_nsp:
                # Next Sentence Prediction part
                # nsp consumes pooled_output, so with_pool=True is a prerequisite for using nsp
                self.nsp = nn.Linear(self.hidden_size, 2)
else:
self.pooler = None
self.pooler_activation = None
if self.with_mlm:
self.mlmDense = nn.Linear(self.hidden_size, self.hidden_size)
self.transform_act_fn = get_activation(self.hidden_act)
self.mlmLayerNorm = LayerNorm(self.hidden_size, eps=1e-12, conditional_size=self.conditional_size)
self.mlmDecoder = nn.Linear(self.hidden_size, self.vocab_size, bias=False)
if kwargs.get('tie_emb_prj_weight') is True:
self.mlmDecoder.weight = self.embeddings.word_embeddings.weight
self.mlmBias = nn.Parameter(torch.zeros(self.vocab_size))
self.mlmDecoder.bias = self.mlmBias
    # subclasses of BERT declare parameters of their own, which cannot all be initialized uniformly here
    def apply_embeddings(self, inputs):
        """BERT's embedding is the sum of the token, position and segment embeddings
        Default input order: token_ids, segment_ids (if any), position_ids (if any), custom_attention_mask (if any), conditional_input (if any)
        """
        assert isinstance(inputs, (list, tuple)), f'Inputs only support list,tuple format but passed {type(inputs)}'
        token_ids = inputs[0]
        index_ = 1
        if self.segment_vocab_size > 0:
            segment_ids = inputs[index_]
            index_ += 1
        else:
            segment_ids = None
        if self.custom_position_ids:  # not used yet; kept for now
            position_ids = inputs[index_]
            index_ += 1
        else:
            position_ids = None
        # build a 3D attention mask from token_ids with shape [batch_size, 1, 1, to_seq_length],
        # so that it broadcasts to [batch_size, num_heads, from_seq_length, to_seq_length] for multi-head attention
        if self.custom_attention_mask:
            attention_mask = inputs[index_].long().unsqueeze(1).unsqueeze(2)
            index_ += 1
        elif (not token_ids.requires_grad) and (token_ids.dtype in {torch.long, torch.int}):  # ordinary token_ids
            attention_mask = (token_ids != self.token_pad_ids).long().unsqueeze(1).unsqueeze(2)  # 0 is the mask value by default
            if self.token_pad_ids < 0:
                token_ids = token_ids * attention_mask[:,0,0,:]
        else:  # custom word_embedding, currently only used by VAT
            attention_mask = self.attention_mask_cache
        self.attention_mask_cache = attention_mask  # cache the last attention_mask used
        self.compute_attention_bias([token_ids, segment_ids])  # adjust the mask as required by lm or unilm
        if self.attention_bias is not None:
            attention_mask = attention_mask * self.attention_bias  # padding is not attendable
            # attention_mask = self.attention_bias  # padding would be attendable
        # pytorch >= 1.5 can raise a StopIteration error here
        # https://github.com/huggingface/transformers/issues/3936
        # https://github.com/huggingface/transformers/issues/4189
        try:
            attention_mask = attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
        except StopIteration:
            attention_mask = attention_mask.to(dtype=torch.float32)
        # zeros in the mask would be turned into large negative numbers, so masked positions get ~0 attention after softmax
        # attention_mask = (1.0 - attention_mask) * -10000.0
        # conditional layer_norm
        if self.layer_norm_conds is None:
            conditional_emb = None
        else:
            conditional_emb = self.layer_norm_conds(inputs[index_])
            index_ += 1
        # additional embeddings, e.g. part-of-speech, tone, or word-level custom embeddings
        if isinstance(self.layer_add_embs, nn.Module):  # single
            additional_embs = [self.layer_add_embs(inputs[index_])]
            index_ += 1
        elif isinstance(self.layer_add_embs, (tuple, list)):  # multiple
            additional_embs = []
            for layer in self.layer_add_embs:
                assert isinstance(layer, nn.Module), 'Layer_add_embs element should be nn.Module'
                additional_embs.append(layer(inputs[index_]))
                index_ += 1
        else:
            additional_embs = None
        # run the embedding layer
        hidden_states = self.embeddings(token_ids, segment_ids, conditional_emb, additional_embs)
        return [hidden_states, attention_mask, conditional_emb] + inputs[index_:]
def apply_main_layers(self, inputs):
"""BERT的主体是基于Self-Attention的模块
顺序:Att --> Add --> LN --> FFN --> Add --> LN
默认第一个是hidden_states, 第二个是attention_mask, 第三个是conditional_emb
"""
hidden_states, attention_mask, conditional_emb = inputs[:3]
if len(inputs[3:]) >= 2:
encoder_hidden_state, encoder_attention_mask = inputs[3], inputs[4]
else:
encoder_hidden_state, encoder_attention_mask = None, None
        encoded_layers = [hidden_states]  # include the embedding output
layer_inputs = [hidden_states, attention_mask, conditional_emb, encoder_hidden_state, encoder_attention_mask]
for l_i, layer_module in enumerate(self.encoderLayer):
layer_inputs = self.apply_on_layer_begin(l_i, layer_inputs)
hidden_states = layer_module(*layer_inputs)
layer_inputs[0] = hidden_states
layer_inputs = self.apply_on_layer_end(l_i, layer_inputs)
if self.output_all_encoded_layers:
encoded_layers.append(hidden_states)
if not self.output_all_encoded_layers:
encoded_layers.append(hidden_states)
return [encoded_layers, conditional_emb]
    def apply_final_layers(self, inputs):
        """Decide the outputs based on the remaining flags
        """
        # output of the last hidden layer
        encoded_layers, conditional_emb = inputs
        sequence_output = encoded_layers[-1]
        # whether to keep only the last layer's output
        if not self.output_all_encoded_layers:
            encoded_layers = encoded_layers[-1]
        # optional pooler
        if self.with_pool:
            pooled_output = self.pooler_activation(self.pooler(sequence_output[:, 0]))
        else:
            pooled_output = None
        # optional nsp
        if self.with_pool and self.with_nsp:
            nsp_scores = self.nsp(pooled_output)
        else:
            nsp_scores = None
        # optional mlm
        if self.with_mlm:
mlm_hidden_state = self.mlmDense(sequence_output)
mlm_hidden_state = self.transform_act_fn(mlm_hidden_state)
mlm_hidden_state = self.mlmLayerNorm((mlm_hidden_state, conditional_emb))
mlm_scores = self.mlmDecoder(mlm_hidden_state)
mlm_activation = get_activation('linear' if self.with_mlm is True else self.with_mlm)
mlm_scores = mlm_activation(mlm_scores)
else:
mlm_scores = None
outputs = [value for value in [encoded_layers, pooled_output, mlm_scores, nsp_scores] if value is not None]
return outputs if len(outputs) > 1 else outputs[0]
def load_variable(self, state_dict, name, prefix='bert'):
"""加载单个变量的函数
"""
variable = state_dict[name]
if name in {
f'{prefix}.embeddings.word_embeddings.weight',
'cls.predictions.bias',
'cls.predictions.decoder.weight',
'cls.predictions.decoder.bias'
}:
return self.load_embeddings(variable)
elif name == f'{prefix}.embeddings.position_embeddings.weight':
return self.load_pos_embeddings(variable)
elif name == 'cls.seq_relationship.weight':
return variable.T
else:
return variable
def variable_mapping(self, prefix='bert'):
mapping = {
'embeddings.word_embeddings.weight': f'{prefix}.embeddings.word_embeddings.weight',
'embeddings.position_embeddings.weight': f'{prefix}.embeddings.position_embeddings.weight',
'embeddings.segment_embeddings.weight': f'{prefix}.embeddings.token_type_embeddings.weight',
'embeddings.layerNorm.weight': f'{prefix}.embeddings.LayerNorm.weight',
'embeddings.layerNorm.bias': f'{prefix}.embeddings.LayerNorm.bias',
'pooler.weight': f'{prefix}.pooler.dense.weight',
'pooler.bias': f'{prefix}.pooler.dense.bias',
'nsp.weight': 'cls.seq_relationship.weight',
'nsp.bias': 'cls.seq_relationship.bias',
'mlmDense.weight': 'cls.predictions.transform.dense.weight',
'mlmDense.bias': 'cls.predictions.transform.dense.bias',
'mlmLayerNorm.weight': 'cls.predictions.transform.LayerNorm.weight',
'mlmLayerNorm.bias': 'cls.predictions.transform.LayerNorm.bias',
'mlmBias': 'cls.predictions.bias',
'mlmDecoder.weight': 'cls.predictions.decoder.weight',
'mlmDecoder.bias': 'cls.predictions.decoder.bias'
}
for i in range(self.num_hidden_layers):
prefix_i = f'{prefix}.encoder.layer.%d.' % i
mapping.update({f'encoderLayer.{i}.multiHeadAttention.q.weight': prefix_i + 'attention.self.query.weight',
f'encoderLayer.{i}.multiHeadAttention.q.bias': prefix_i + 'attention.self.query.bias',
f'encoderLayer.{i}.multiHeadAttention.k.weight': prefix_i + 'attention.self.key.weight',
f'encoderLayer.{i}.multiHeadAttention.k.bias': prefix_i + 'attention.self.key.bias',
f'encoderLayer.{i}.multiHeadAttention.v.weight': prefix_i + 'attention.self.value.weight',
f'encoderLayer.{i}.multiHeadAttention.v.bias': prefix_i + 'attention.self.value.bias',
f'encoderLayer.{i}.multiHeadAttention.o.weight': prefix_i + 'attention.output.dense.weight',
f'encoderLayer.{i}.multiHeadAttention.o.bias': prefix_i + 'attention.output.dense.bias',
f'encoderLayer.{i}.layerNorm1.weight': prefix_i + 'attention.output.LayerNorm.weight',
f'encoderLayer.{i}.layerNorm1.bias': prefix_i + 'attention.output.LayerNorm.bias',
f'encoderLayer.{i}.feedForward.intermediateDense.weight': prefix_i + 'intermediate.dense.weight',
f'encoderLayer.{i}.feedForward.intermediateDense.bias': prefix_i + 'intermediate.dense.bias',
f'encoderLayer.{i}.feedForward.outputDense.weight': prefix_i + 'output.dense.weight',
f'encoderLayer.{i}.feedForward.outputDense.bias': prefix_i + 'output.dense.bias',
f'encoderLayer.{i}.layerNorm2.weight': prefix_i + 'output.LayerNorm.weight',
f'encoderLayer.{i}.layerNorm2.bias': prefix_i + 'output.LayerNorm.bias'
})
return mapping
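# --- A minimal, hedged sketch (assumed workflow; the library's real entry point is
# load_weights_from_pytorch_checkpoint) of how the mapping above is consumed: keys are
# bert4torch parameter names, values are checkpoint names, so loading reduces to a
# key-renaming pass over the checkpoint state_dict. rename_checkpoint_keys is hypothetical.
def rename_checkpoint_keys(model, ckpt_state_dict):
    """Return a state_dict keyed by bert4torch names; unmatched keys are skipped."""
    renamed = {}
    for b4t_name, ckpt_name in model.variable_mapping().items():
        if ckpt_name in ckpt_state_dict:
            # load_variable applies per-tensor fixes, e.g. transposing cls.seq_relationship.weight
            renamed[b4t_name] = model.load_variable(ckpt_state_dict, ckpt_name)
    return renamed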
class ALBERT(BERT):
def __init__(self, *args, **kwargs):
super(ALBERT, self).__init__(*args, **kwargs)
self.encoderLayer = nn.ModuleList([self.encoderLayer[0]]) # keep only the first layer above: ALBERT shares one layer's weights
def apply_main_layers(self, inputs):
"""BERT的主体是基于Self-Attention的模块
顺序:Att --> Add --> LN --> FFN --> Add --> LN
"""
hidden_states, attention_mask, conditional_emb = inputs[:3]
if len(inputs[3:]) >= 2:
encoder_hidden_state, encoder_attention_mask = inputs[3], inputs[4]
else:
encoder_hidden_state, encoder_attention_mask = None, None
encoded_layers = [hidden_states] # 添加embedding的输出
layer_inputs = [hidden_states, attention_mask, conditional_emb, encoder_hidden_state, encoder_attention_mask]
for l_i in range(self.num_hidden_layers):
layer_inputs = self.apply_on_layer_begin(l_i, layer_inputs)
hidden_states = self.encoderLayer[0](*layer_inputs)
layer_inputs[0] = hidden_states
layer_inputs = self.apply_on_layer_end(l_i, layer_inputs)
if self.output_all_encoded_layers:
encoded_layers.append(hidden_states)
if not self.output_all_encoded_layers:
encoded_layers.append(hidden_states)
return [encoded_layers, conditional_emb]
def variable_mapping(self, prefix='albert'):
mapping = {
'embeddings.word_embeddings.weight': f'{prefix}.embeddings.word_embeddings.weight',
'embeddings.position_embeddings.weight': f'{prefix}.embeddings.position_embeddings.weight',
'embeddings.segment_embeddings.weight': f'{prefix}.embeddings.token_type_embeddings.weight',
'embeddings.layerNorm.weight': f'{prefix}.embeddings.LayerNorm.weight',
'embeddings.layerNorm.bias': f'{prefix}.embeddings.LayerNorm.bias',
'embeddings.embedding_hidden_mapping_in.weight': f'{prefix}.encoder.embedding_hidden_mapping_in.weight',
'embeddings.embedding_hidden_mapping_in.bias': f'{prefix}.encoder.embedding_hidden_mapping_in.bias',
'pooler.weight': f'{prefix}.pooler.weight',
'pooler.bias': f'{prefix}.pooler.bias',
'nsp.weight': 'sop_classifier.classifier.weight', # reuse the name nsp for ALBERT's sop head
'nsp.bias': 'sop_classifier.classifier.bias',
'mlmDense.weight': 'predictions.dense.weight',
'mlmDense.bias': 'predictions.dense.bias',
'mlmLayerNorm.weight': 'predictions.LayerNorm.weight',
'mlmLayerNorm.bias': 'predictions.LayerNorm.bias',
'mlmBias': 'predictions.bias',
'mlmDecoder.weight': 'predictions.decoder.weight',
'mlmDecoder.bias': 'predictions.decoder.bias'
}
i = 0
prefix_i = f'{prefix}.encoder.albert_layer_groups.{i}.albert_layers.{i}.'
mapping.update({f'encoderLayer.{i}.multiHeadAttention.q.weight': prefix_i + 'attention.query.weight',
f'encoderLayer.{i}.multiHeadAttention.q.bias': prefix_i + 'attention.query.bias',
f'encoderLayer.{i}.multiHeadAttention.k.weight': prefix_i + 'attention.key.weight',
f'encoderLayer.{i}.multiHeadAttention.k.bias': prefix_i + 'attention.key.bias',
f'encoderLayer.{i}.multiHeadAttention.v.weight': prefix_i + 'attention.value.weight',
f'encoderLayer.{i}.multiHeadAttention.v.bias': prefix_i + 'attention.value.bias',
f'encoderLayer.{i}.multiHeadAttention.o.weight': prefix_i + 'attention.dense.weight',
f'encoderLayer.{i}.multiHeadAttention.o.bias': prefix_i + 'attention.dense.bias',
f'encoderLayer.{i}.layerNorm1.weight': prefix_i + 'attention.LayerNorm.weight',
f'encoderLayer.{i}.layerNorm1.bias': prefix_i + 'attention.LayerNorm.bias',
f'encoderLayer.{i}.feedForward.intermediateDense.weight': prefix_i + 'ffn.weight',
f'encoderLayer.{i}.feedForward.intermediateDense.bias': prefix_i + 'ffn.bias',
f'encoderLayer.{i}.feedForward.outputDense.weight': prefix_i + 'ffn_output.weight',
f'encoderLayer.{i}.feedForward.outputDense.bias': prefix_i + 'ffn_output.bias',
f'encoderLayer.{i}.layerNorm2.weight': prefix_i + 'full_layer_layer_norm.weight',
f'encoderLayer.{i}.layerNorm2.bias': prefix_i + 'full_layer_layer_norm.bias'
})
return mapping
def load_variable(self, state_dict, name):
"""加载单个变量的函数
"""
variable = state_dict[name]
if name in {
'albert.embeddings.word_embeddings.weight',
'predictions.bias',
'predictions.decoder.weight',
'predictions.decoder.bias'
}:
return self.load_embeddings(variable)
elif name == 'albert.embeddings.position_embeddings.weight':
return self.load_pos_embeddings(variable)
elif name == 'sop_classifier.classifier.weight':
return variable.T
else:
return variable
class ALBERT_Unshared(ALBERT):
def __init__(self, *args, **kwargs):
super(ALBERT_Unshared, self).__init__(*args, **kwargs)
self.encoderLayer = nn.ModuleList([copy.deepcopy(self.encoderLayer[0]) for _ in range(self.num_hidden_layers)])
def apply_main_layers(self, inputs):
"""BERT的主体是基于Self-Attention的模块
顺序:Att --> Add --> LN --> FFN --> Add --> LN
"""
hidden_states, attention_mask, conditional_emb = inputs[:3]
if len(inputs[3:]) >= 2:
encoder_hidden_state, encoder_attention_mask = inputs[3], inputs[4]
else:
encoder_hidden_state, encoder_attention_mask = None, None
encoded_layers = [hidden_states] # 添加embedding的输出
layer_inputs = [hidden_states, attention_mask, conditional_emb, encoder_hidden_state, encoder_attention_mask]
for i in range(self.num_hidden_layers):
layer_inputs = self.apply_on_layer_begin(i, layer_inputs)
hidden_states = self.encoderLayer[i](*layer_inputs)
layer_inputs[0] = hidden_states
layer_inputs = self.apply_on_layer_end(i, layer_inputs)
if self.output_all_encoded_layers:
encoded_layers.append(hidden_states)
if not self.output_all_encoded_layers:
encoded_layers.append(hidden_states)
return [encoded_layers, conditional_emb]
class NEZHA(BERT):
"""华为推出的NAZHA模型
链接:https://arxiv.org/abs/1909.00204
"""
def __init__(self, *args, **kwargs):
# p_bias controls skipping position embeddings in the embedding stage; max_relative_position defaults to 64
kwargs.update({'p_bias': 'typical_relative', 'max_relative_position': kwargs.get('max_relative_position', 64)})
super(NEZHA, self).__init__(*args, **kwargs)
class RoFormer(BERT):
"""旋转式位置编码的BERT模型
链接:https://kexue.fm/archives/8265
"""
def __init__(self, *args, **kwargs):
kwargs.update({'p_bias': 'rotary'})
super(RoFormer, self).__init__(*args, **kwargs)
def load_variable(self, state_dict, name, prefix='roformer'):
return super().load_variable(state_dict, name, prefix)
def variable_mapping(self, prefix='roformer'):
mapping = super().variable_mapping(prefix)
del mapping['embeddings.position_embeddings.weight'] # no absolute position embeddings
return mapping
class RoFormerV2(RoFormer):
"""RoFormerV2
Changes: biases removed, Norm simplified (RMSNorm), initialization improved. For now initialization still follows BERT's, which does not affect finetuning
"""
@delete_arguments('with_pool', 'with_nsp')
def __init__(self, *args, **kwargs):
kwargs.update({'p_bias': 'rotary', 'weight': False, 'bias': False, 'norm_mode': 'rmsnorm'})
super(RoFormerV2, self).__init__(*args, **kwargs)
if self.with_mlm:
del self.mlmLayerNorm
del self.mlmBias
del self.mlmDense
self.mlmDecoder.register_parameter('bias', None)
def variable_mapping(self, prefix='roformer'):
mapping = super().variable_mapping(prefix)
mapping_new = {}
for k, v in mapping.items():
if (not re.search('bias|layernorm', k.lower())) and (not re.search('bias|layernorm', v.lower())):
mapping_new[k] = v
return mapping_new
def apply_final_layers(self, inputs):
"""根据剩余参数决定输出
"""
# take the hidden states of the last layer
encoded_layers, conditional_emb = inputs
sequence_output = encoded_layers[-1]
# whether to return only the last layer's output
if not self.output_all_encoded_layers:
encoded_layers = encoded_layers[-1]
# whether to add the MLM head
if self.with_mlm:
mlm_scores = self.mlmDecoder(sequence_output)
else:
mlm_scores = None
outputs = [value for value in [encoded_layers, mlm_scores] if value is not None]
return outputs if len(outputs) > 1 else outputs[0]
class GAU_alpha(RoFormerV2):
def __init__(self, *args, **kwargs):
kwargs.update({'p_bias': 'rotary', 'weight': False, 'bias': False, 'norm_mode': 'rmsnorm', 'normalization': 'softmax_plus'})
super().__init__(*args, **kwargs)
layer = self.GAU_Layer(**kwargs)
self.encoderLayer = nn.ModuleList([copy.deepcopy(layer) if layer_id in self.keep_hidden_layers else Identity() for layer_id in range(self.num_hidden_layers)])
def load_variable(self, state_dict, name, prefix=''):
variable = state_dict[name]
return self.load_embeddings(variable) if name in {'embeddings.word_embeddings.weight', 'mlmDecoder.weight'} else variable
def variable_mapping(self, prefix=''):
'''The convert script has already renamed checkpoint keys to the bert4torch scheme
'''
return {k: k for k, _ in self.named_parameters()}
class GAU_Layer(nn.Module):
def __init__(self, *args, **kwargs):
super().__init__()
self.gau = GatedAttentionUnit(**kwargs)
self.dropout1 = nn.Dropout(kwargs.get('dropout_rate'))
self.layerNorm1 = LayerNorm(**kwargs)
def forward(self, hidden_states, attention_mask, conditional_emb=None, encoder_hidden_states=None, encoder_attention_mask=None):
gau_hidden_states = self.gau(hidden_states, attention_mask)
hidden_states = hidden_states + self.dropout1(gau_hidden_states)
hidden_states = self.layerNorm1((hidden_states, conditional_emb))
return hidden_states
class ELECTRA(BERT):
"""Google推出的ELECTRA模型
链接:https://arxiv.org/abs/2003.10555
"""
@insert_arguments(with_discriminator=False)
@delete_arguments('with_pool', 'with_mlm', 'with_nsp')
def __init__(self, max_position, **kwargs):
super(ELECTRA, self).__init__(max_position, **kwargs)
if self.with_discriminator:
self.dense = nn.Linear(self.hidden_size, self.hidden_size)
self.dense_act = get_activation(self.hidden_act)
self.dense_prediction = nn.Linear(self.hidden_size, 1)
self.dense_prediction_act = get_activation('sigmoid') if self.with_discriminator is True else get_activation(self.with_discriminator)
def apply_final_layers(self, inputs):
hidden_states = super().apply_final_layers(inputs) # 仅有hidden_state一项输出
if self.with_discriminator:
logit = self.dense_act(self.dense(hidden_states))
return [hidden_states, self.dense_prediction_act(self.dense_prediction(logit))]
else:
return hidden_states
def load_variable(self, state_dict, name):
"""加载单个变量的函数
"""
return super().load_variable(state_dict, name, prefix='electra')
def variable_mapping(self):
mapping = super(ELECTRA, self).variable_mapping(prefix='electra')
mapping.update({'dense.weight': 'discriminator_predictions.dense.weight',
'dense.bias': 'discriminator_predictions.dense.bias',
'dense_prediction.weight': 'discriminator_predictions.dense_prediction.weight',
'dense_prediction.bias': 'discriminator_predictions.dense_prediction.bias'}
)
for del_key in ['pooler.weight', 'pooler.bias', 'nsp.weight', 'nsp.bias', 'mlmDense.weight', 'mlmDense.bias',
'mlmLayerNorm.weight', 'mlmLayerNorm.bias', 'mlmBias', 'mlmDecoder.weight', 'mlmDecoder.bias']:
del mapping[del_key]
return mapping
class ERNIE(BERT):
"""百度文心 https://github.com/PaddlePaddle/ERNIE
"""
def __init__(self, *args, **kwargs):
super(ERNIE, self).__init__(*args, **kwargs)
def variable_mapping(self):
mapping = super(ERNIE, self).variable_mapping(prefix='ernie')
mapping.update({'mlmDecoder.weight': 'ernie.embeddings.word_embeddings.weight',
'mlmDecoder.bias': 'cls.predictions.bias'})
for k, v in mapping.items():
if ('LayerNorm.weight' in v) or ('LayerNorm.bias' in v):
v1 = v.replace('.weight', '.gamma').replace('.bias', '.beta')
mapping[k] = v1
for del_key in ['nsp.weight', 'nsp.bias']:
del mapping[del_key]
return mapping
def load_variable(self, state_dict, name, prefix='ernie'):
return super().load_variable(state_dict, name, prefix=prefix)
class Encoder(BERT):
def __init__(self, *args, **kwargs):
kwargs['vocab_size'] = kwargs.get('src_vocab_size', kwargs['vocab_size'])
super().__init__(*args, **kwargs)
# the encoder also needs to return encoder_attention_mask
self.encoder_attention_mask = None
def forward(self, inputs):
"""因为encoder需要返回encoder_attention_mask,因此这里从新定义一下,多返回一个参数
"""
# Embedding
outputs = self.apply_embeddings(inputs)
encoder_attention_mask = [outputs[1]]
# Main
outputs = self.apply_main_layers(outputs)
# Final
outputs = self.apply_final_layers(outputs)
return ([outputs] if isinstance(outputs, torch.Tensor) else outputs) + encoder_attention_mask
class Decoder(LM_Mask, BERT):
@delete_arguments('with_pool', 'with_mlm', 'with_nsp')
def __init__(self, *args, with_lm=True, tie_emb_prj_weight=False, logit_scale=True, **kwargs):
kwargs['vocab_size'] = kwargs.get('tgt_vocab_size', kwargs['vocab_size'])
kwargs['is_decoder'] = True # 标记是decoder
super().__init__(*args, **kwargs)
self.decoderLayer = self.encoderLayer
del self.encoderLayer
self.with_lm = with_lm
# project hidden_states to logits
if self.with_lm:
self.final_dense = nn.Linear(self.hidden_size, self.vocab_size, bias=False)
# tie the decoder's bottom embedding with the top projection layer
# [True]: fudan_bart and uer_t5's t5; [False]: mt5 and t5_pegasus
if tie_emb_prj_weight:
self.final_dense.weight = self.embeddings.word_embeddings.weight
if logit_scale: # T5 has logit_scale by default while BART does not, so BART should pass False
self.x_logit_scale = (self.hidden_size ** -0.5)
else:
self.x_logit_scale = 1.
def apply_main_layers(self, inputs):
"""Dencoder主体是基于Self-Attention、Cross-Attention的模块
顺序:Att1 --> Add --> LN --> Att2 --> Add --> LN --> FFN --> Add --> LN
"""
hidden_states, attention_mask, conditional_emb, encoder_hidden_state, encoder_attention_mask = inputs[:5]
decoded_layers = [hidden_states] # 添加embedding的输出
layer_inputs = [hidden_states, attention_mask, conditional_emb, encoder_hidden_state, encoder_attention_mask]
for i, layer_module in enumerate(self.decoderLayer):
layer_inputs = self.apply_on_layer_begin(i, layer_inputs)
hidden_states = layer_module(*layer_inputs)
layer_inputs[0] = hidden_states
layer_inputs = self.apply_on_layer_end(i, layer_inputs)
if self.output_all_encoded_layers:
decoded_layers.append(hidden_states)
if not self.output_all_encoded_layers:
decoded_layers.append(hidden_states)
return [decoded_layers, conditional_emb]
def apply_final_layers(self, inputs):
outputs = []
hidden_states = super().apply_final_layers(inputs) # hidden_states of the decoder's top layer, [btz, seq_len, hdsz]
outputs.append(hidden_states)
if self.with_lm:
logits = self.final_dense(hidden_states) * self.x_logit_scale # logits of shape [btz, seq_len, vocab_size]
activation = get_activation('linear' if self.with_lm is True else self.with_lm) # apply an activation, usually linear or softmax
logits = activation(logits)
outputs.append(logits)
return outputs
def variable_mapping(self, prefix='bert'):
raw_mapping = super().variable_mapping(prefix)
mapping = {}
for k, v in raw_mapping.items():
mapping[k.replace('encoderLayer', 'decoderLayer')] = v
# for i in range(self.num_hidden_layers):
# prefix_i = f'{prefix}.encoder.layer.%d.' % i
# mapping.update({
# f'decoderLayer.{i}.crossAttention.q.weight': prefix_i + 'crossattention.self.query.weight',
# f'decoderLayer.{i}.crossAttention.q.bias': prefix_i + 'crossattention.self.query.bias',
# f'decoderLayer.{i}.crossAttention.k.weight': prefix_i + 'crossattention.self.key.weight',
# f'decoderLayer.{i}.crossAttention.k.bias': prefix_i + 'crossattention.self.key.bias',
# f'decoderLayer.{i}.crossAttention.v.weight': prefix_i + 'crossattention.self.value.weight',
# f'decoderLayer.{i}.crossAttention.v.bias': prefix_i + 'crossattention.self.value.bias',
# f'decoderLayer.{i}.crossAttention.o.weight': prefix_i + 'crossattention.output.dense.weight',
# f'decoderLayer.{i}.crossAttention.o.bias': prefix_i + 'crossattention.output.dense.bias',
# f'decoderLayer.{i}.layerNorm3.weight': prefix_i + 'crossattention.output.LayerNorm.weight',
# f'decoderLayer.{i}.layerNorm3.bias': prefix_i + 'crossattention.output.LayerNorm.bias'
# })
return mapping
class Transformer(BERT_BASE):
'''Encoder-decoder architecture
'''
@delete_arguments('with_pool', 'with_mlm', 'with_nsp')
def __init__(self, *args, tie_emb_src_tgt_weight=False, **kwargs):
super(Transformer, self).__init__(*args, **kwargs)
# encoder
self.encoder = Encoder(*args, **kwargs)
self.encoder.build(**kwargs)
# decoder
self.decoder = Decoder(*args, **kwargs)
self.decoder.build(**kwargs)
if tie_emb_src_tgt_weight:
# share embedding weights between encoder and decoder
assert self.encoder.vocab_size == self.decoder.vocab_size, "To share word embedding, the vocab size of src/tgt shall be the same."
self.encoder.embeddings.word_embeddings.weight = self.decoder.embeddings.word_embeddings.weight
def forward(self, inputs):
"""定义模型的执行流程
"""
encoder_input, decoder_input = inputs[:2]
# encoder
# encoder_emb = self.encoder.apply_embeddings(encoder_input)
# encode_outputs = self.encoder.apply_main_layers(encoder_emb)
# encoder_hidden_state = self.encoder.apply_final_layers(encode_outputs)
# encoder_attention_mask = encoder_emb[1]
encoder_hidden_state, encoder_attention_mask = self.encoder(encoder_input)
# decoder
# decoder_emb = self.decoder.apply_embeddings(decoder_input)
# decoder_outputs = self.decoder.apply_main_layers([*decoder_emb, encoder_hidden_state, encoder_attention_mask])
# decoder_outputs = self.decoder.apply_final_layers(decoder_outputs) # [hidden_states, logits]
decoder_outputs = self.decoder(decoder_input + [encoder_hidden_state, encoder_attention_mask])
return [encoder_hidden_state] + decoder_outputs # return encoder_hidden_state along with the decoder outputs, to support multi-task setups
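# --- A hedged usage sketch of the flow above (config path and sizes are illustrative,
# assuming a hypothetical config json with segment_vocab_size=0; real configs ship with
# the pretrained checkpoints):
# import torch
# model = build_transformer_model(config_path='transformer_config.json', model='transformer')
# src = torch.randint(0, 100, (2, 16))   # source token_ids
# tgt = torch.randint(0, 100, (2, 12))   # target token_ids
# encoder_hidden, decoder_hidden, logits = model([[src], [tgt]])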
class BART(Transformer):
'''Encoder-decoder architecture
'''
def __init__(self, *args, tie_emb_src_tgt_weight=True, **kwargs):
kwargs['logit_scale'] = kwargs.get('logit_scale', False)
kwargs['tie_emb_prj_weight'] = kwargs.get('tie_emb_prj_weight', True)
super(BART, self).__init__(*args, tie_emb_src_tgt_weight=tie_emb_src_tgt_weight, **kwargs)
self.tie_emb_src_tgt_weight = tie_emb_src_tgt_weight
def load_variable(self, state_dict, name, prefix=''):
"""加载单个变量的函数
"""
variable = state_dict[name]
if name in {
'shared.weight',
'encoder.embed_tokens.weight',
'decoder.embed_tokens.weight',
}:
return self.load_embeddings(variable)
elif name in {'encoder.embed_positions.weight', 'decoder.embed_positions.weight'}:
return self.load_pos_embeddings(variable)
else:
return variable
def variable_mapping(self, prefix=''):
# inspecting the checkpoint shows the shared embedding stored as 'shared.weight'
mapping = {
'encoder.embeddings.word_embeddings.weight': 'shared.weight' if self.tie_emb_src_tgt_weight else 'encoder.embed_tokens.weight',
'encoder.embeddings.position_embeddings.weight': 'encoder.embed_positions.weight',
'encoder.embeddings.layerNorm.weight': 'encoder.layernorm_embedding.weight',
'encoder.embeddings.layerNorm.bias': 'encoder.layernorm_embedding.bias',
'decoder.embeddings.word_embeddings.weight': 'shared.weight' if self.tie_emb_src_tgt_weight else 'decoder.embed_tokens.weight',
'decoder.embeddings.position_embeddings.weight': 'decoder.embed_positions.weight',
'decoder.embeddings.layerNorm.weight': 'decoder.layernorm_embedding.weight',
'decoder.embeddings.layerNorm.bias': 'decoder.layernorm_embedding.bias',
}
for i in range(self.num_hidden_layers):
mapping.update(
{
f'encoder.encoderLayer.{i}.multiHeadAttention.q.weight': f'encoder.layers.{i}.self_attn.q_proj.weight',
f'encoder.encoderLayer.{i}.multiHeadAttention.q.bias': f'encoder.layers.{i}.self_attn.q_proj.bias',
f'encoder.encoderLayer.{i}.multiHeadAttention.k.weight': f'encoder.layers.{i}.self_attn.k_proj.weight',
f'encoder.encoderLayer.{i}.multiHeadAttention.k.bias': f'encoder.layers.{i}.self_attn.k_proj.bias',
f'encoder.encoderLayer.{i}.multiHeadAttention.v.weight': f'encoder.layers.{i}.self_attn.v_proj.weight',
f'encoder.encoderLayer.{i}.multiHeadAttention.v.bias': f'encoder.layers.{i}.self_attn.v_proj.bias',
f'encoder.encoderLayer.{i}.multiHeadAttention.o.weight': f'encoder.layers.{i}.self_attn.out_proj.weight',
f'encoder.encoderLayer.{i}.multiHeadAttention.o.bias': f'encoder.layers.{i}.self_attn.out_proj.bias',
f'encoder.encoderLayer.{i}.layerNorm1.weight': f'encoder.layers.{i}.self_attn_layer_norm.weight',
f'encoder.encoderLayer.{i}.layerNorm1.bias': f'encoder.layers.{i}.self_attn_layer_norm.bias',
f'encoder.encoderLayer.{i}.feedForward.intermediateDense.weight': f'encoder.layers.{i}.fc1.weight',
f'encoder.encoderLayer.{i}.feedForward.intermediateDense.bias': f'encoder.layers.{i}.fc1.bias',
f'encoder.encoderLayer.{i}.feedForward.outputDense.weight': f'encoder.layers.{i}.fc2.weight',
f'encoder.encoderLayer.{i}.feedForward.outputDense.bias': f'encoder.layers.{i}.fc2.bias',
f'encoder.encoderLayer.{i}.layerNorm2.weight': f'encoder.layers.{i}.final_layer_norm.weight',
f'encoder.encoderLayer.{i}.layerNorm2.bias': f'encoder.layers.{i}.final_layer_norm.bias',
f'decoder.decoderLayer.{i}.multiHeadAttention.q.weight': f'decoder.layers.{i}.self_attn.q_proj.weight',
f'decoder.decoderLayer.{i}.multiHeadAttention.q.bias': f'decoder.layers.{i}.self_attn.q_proj.bias',
f'decoder.decoderLayer.{i}.multiHeadAttention.k.weight': f'decoder.layers.{i}.self_attn.k_proj.weight',
f'decoder.decoderLayer.{i}.multiHeadAttention.k.bias': f'decoder.layers.{i}.self_attn.k_proj.bias',
f'decoder.decoderLayer.{i}.multiHeadAttention.v.weight': f'decoder.layers.{i}.self_attn.v_proj.weight',
f'decoder.decoderLayer.{i}.multiHeadAttention.v.bias': f'decoder.layers.{i}.self_attn.v_proj.bias',
f'decoder.decoderLayer.{i}.multiHeadAttention.o.weight': f'decoder.layers.{i}.self_attn.out_proj.weight',
f'decoder.decoderLayer.{i}.multiHeadAttention.o.bias': f'decoder.layers.{i}.self_attn.out_proj.bias',
f'decoder.decoderLayer.{i}.layerNorm1.weight': f'decoder.layers.{i}.self_attn_layer_norm.weight',
f'decoder.decoderLayer.{i}.layerNorm1.bias': f'decoder.layers.{i}.self_attn_layer_norm.bias',
f'decoder.decoderLayer.{i}.crossAttention.q.weight': f'decoder.layers.{i}.encoder_attn.q_proj.weight',
f'decoder.decoderLayer.{i}.crossAttention.q.bias': f'decoder.layers.{i}.encoder_attn.q_proj.bias',
f'decoder.decoderLayer.{i}.crossAttention.k.weight': f'decoder.layers.{i}.encoder_attn.k_proj.weight',
f'decoder.decoderLayer.{i}.crossAttention.k.bias': f'decoder.layers.{i}.encoder_attn.k_proj.bias',
f'decoder.decoderLayer.{i}.crossAttention.v.weight': f'decoder.layers.{i}.encoder_attn.v_proj.weight',
f'decoder.decoderLayer.{i}.crossAttention.v.bias': f'decoder.layers.{i}.encoder_attn.v_proj.bias',
f'decoder.decoderLayer.{i}.crossAttention.o.weight': f'decoder.layers.{i}.encoder_attn.out_proj.weight',
f'decoder.decoderLayer.{i}.crossAttention.o.bias': f'decoder.layers.{i}.encoder_attn.out_proj.bias',
f'decoder.decoderLayer.{i}.layerNorm3.weight': f'decoder.layers.{i}.encoder_attn_layer_norm.weight',
f'decoder.decoderLayer.{i}.layerNorm3.bias': f'decoder.layers.{i}.encoder_attn_layer_norm.bias',
f'decoder.decoderLayer.{i}.feedForward.intermediateDense.weight': f'decoder.layers.{i}.fc1.weight',
f'decoder.decoderLayer.{i}.feedForward.intermediateDense.bias': f'decoder.layers.{i}.fc1.bias',
f'decoder.decoderLayer.{i}.feedForward.outputDense.weight': f'decoder.layers.{i}.fc2.weight',
f'decoder.decoderLayer.{i}.feedForward.outputDense.bias': f'decoder.layers.{i}.fc2.bias',
f'decoder.decoderLayer.{i}.layerNorm2.weight': f'decoder.layers.{i}.final_layer_norm.weight',
f'decoder.decoderLayer.{i}.layerNorm2.bias': f'decoder.layers.{i}.final_layer_norm.bias'
})
return mapping
class T5_Encoder(Encoder):
@insert_arguments(version='t5.1.0')
def __init__(self, *args, **kwargs):
kwargs.update({'p_bias': 't5_relative', 'relative_attention_num_buckets': kwargs.get('relative_attention_num_buckets'), 'version': self.version,
'bias': False, 'norm_mode': 'rmsnorm'}) # p_bias skips position embeddings in the embedding stage; T5 uses no bias and RMSNorm
super().__init__(*args, **kwargs)
del self.embeddings.layerNorm
# T5 applies layernorm before each sublayer (pre-LN), so the layer is redefined here
layer = T5Layer(self.hidden_size, self.num_attention_heads, self.dropout_rate, self.attention_probs_dropout_prob, self.intermediate_size, self.hidden_act, is_dropout=self.is_dropout,
conditional_size=self.conditional_size, **get_kw(BertLayer, kwargs))
self.encoderLayer = nn.ModuleList([copy.deepcopy(layer) for _ in range(self.num_hidden_layers)])
# tie the relative position encoding weights of layers 2+ to layer 1, effectively computing them only in layer 1
for i in range(1, self.num_hidden_layers):
self.encoderLayer[i].multiHeadAttention.relative_positions_encoding.weight = self.encoderLayer[0].multiHeadAttention.relative_positions_encoding.weight
self.final_layer_norm = LayerNorm(self.hidden_size, eps=1e-12, conditional_size=self.conditional_size, bias=False, norm_mode='rmsnorm')
self.dropout = nn.Dropout(self.dropout_rate)
def apply_final_layers(self, inputs):
hidden_states = super().apply_final_layers(inputs)
return self.dropout(self.final_layer_norm([hidden_states]))
def load_variable(self, state_dict, name, prefix=''):
"""加载单个变量的函数
"""
variable = state_dict[name]
if name in {'encoder.embed_tokens.weight', 'shared.weight'}:
return self.load_embeddings(variable)
else:
return variable
def variable_mapping(self, prefix=''):
# inspecting the checkpoint shows the shared embedding stored as 'shared.weight'
mapping = {f'{prefix}embeddings.word_embeddings.weight': 'encoder.embed_tokens.weight',
f'{prefix}encoderLayer.0.multiHeadAttention.relative_positions_encoding.weight': 'encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight',
f'{prefix}final_layer_norm.weight': 'encoder.final_layer_norm.weight'}
for i in range(self.num_hidden_layers):
mapping.update(
{
f'{prefix}encoderLayer.{i}.multiHeadAttention.q.weight': f'encoder.block.{i}.layer.0.SelfAttention.q.weight',
f'{prefix}encoderLayer.{i}.multiHeadAttention.k.weight': f'encoder.block.{i}.layer.0.SelfAttention.k.weight',
f'{prefix}encoderLayer.{i}.multiHeadAttention.v.weight': f'encoder.block.{i}.layer.0.SelfAttention.v.weight',
f'{prefix}encoderLayer.{i}.multiHeadAttention.o.weight': f'encoder.block.{i}.layer.0.SelfAttention.o.weight',
f'{prefix}encoderLayer.{i}.layerNorm1.weight': f'encoder.block.{i}.layer.0.layer_norm.weight',
f'{prefix}encoderLayer.{i}.feedForward.outputDense.weight': f'encoder.block.{i}.layer.1.DenseReluDense.wo.weight',
f'{prefix}encoderLayer.{i}.layerNorm2.weight': f'encoder.block.{i}.layer.1.layer_norm.weight',
})
if self.version.endswith('t5.1.0'):
mapping.update({f'{prefix}encoderLayer.{i}.feedForward.intermediateDense.weight': f'encoder.block.{i}.layer.1.DenseReluDense.wi.weight'})
elif self.version.endswith('t5.1.1'):
mapping.update({f'{prefix}encoderLayer.{i}.feedForward.intermediateDense.weight': f'encoder.block.{i}.layer.1.DenseReluDense.wi_0.weight',
f'{prefix}encoderLayer.{i}.feedForward.intermediateDense1.weight': f'encoder.block.{i}.layer.1.DenseReluDense.wi_1.weight'})
return mapping
class T5_Decoder(Decoder):
@insert_arguments(version='t5.1.0')
def __init__(self, *args, **kwargs):
kwargs.update({'p_bias': 't5_relative', 'relative_attention_num_buckets': kwargs.get('relative_attention_num_buckets'), 'version': self.version,
'bias': False, 'norm_mode': 'rmsnorm'}) # p_bias skips position embeddings in the embedding stage; T5 uses no bias and RMSNorm
super().__init__(*args, **kwargs)
del self.embeddings.layerNorm
# T5 applies layernorm before each sublayer (pre-LN), so the layer is redefined here
layer = T5Layer(self.hidden_size, self.num_attention_heads, self.dropout_rate, self.attention_probs_dropout_prob, self.intermediate_size, self.hidden_act, is_dropout=self.is_dropout,
conditional_size=self.conditional_size, is_decoder=True, **get_kw(BertLayer, kwargs))
self.decoderLayer = nn.ModuleList([copy.deepcopy(layer) for _ in range(self.num_hidden_layers)])
# tie the relative position encoding weights of layers 2+ to layer 1, effectively computing them only in layer 1
for i in range(1, self.num_hidden_layers):
self.decoderLayer[i].multiHeadAttention.relative_positions_encoding.weight = self.decoderLayer[0].multiHeadAttention.relative_positions_encoding.weight
self.final_layer_norm = LayerNorm(self.hidden_size, eps=1e-12, conditional_size=self.conditional_size, bias=False, norm_mode='rmsnorm')
self.dropout = nn.Dropout(self.dropout_rate)
def apply_final_layers(self, inputs):
inputs[0][1] = self.dropout(self.final_layer_norm([inputs[0][1]])) # apply layernorm to the last layer's hidden_states before converting to logits
return super().apply_final_layers(inputs)
def load_variable(self, state_dict, name, prefix=''):
"""加载单个变量的函数
"""
variable = state_dict[name]
if name in {f'decoder.embed_tokens.weight', 'lm_head.weight', 'shared.weight'}:
return self.load_embeddings(variable)
else:
return variable
def variable_mapping(self, prefix=''):
# inspecting the checkpoint shows the shared embedding stored as 'shared.weight'
mapping = {f'{prefix}embeddings.word_embeddings.weight': 'decoder.embed_tokens.weight',
f'{prefix}decoderLayer.0.multiHeadAttention.relative_positions_encoding.weight': 'decoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight',
f'{prefix}final_layer_norm.weight': 'decoder.final_layer_norm.weight',
f'{prefix}final_dense.weight': 'lm_head.weight'}
for i in range(self.num_hidden_layers):
mapping.update(
{
f'{prefix}decoderLayer.{i}.multiHeadAttention.q.weight': f'decoder.block.{i}.layer.0.SelfAttention.q.weight',
f'{prefix}decoderLayer.{i}.multiHeadAttention.k.weight': f'decoder.block.{i}.layer.0.SelfAttention.k.weight',
f'{prefix}decoderLayer.{i}.multiHeadAttention.v.weight': f'decoder.block.{i}.layer.0.SelfAttention.v.weight',
f'{prefix}decoderLayer.{i}.multiHeadAttention.o.weight': f'decoder.block.{i}.layer.0.SelfAttention.o.weight',
f'{prefix}decoderLayer.{i}.layerNorm1.weight': f'decoder.block.{i}.layer.0.layer_norm.weight',
f'{prefix}decoderLayer.{i}.crossAttention.q.weight': f'decoder.block.{i}.layer.1.EncDecAttention.q.weight',
f'{prefix}decoderLayer.{i}.crossAttention.k.weight': f'decoder.block.{i}.layer.1.EncDecAttention.k.weight',
f'{prefix}decoderLayer.{i}.crossAttention.v.weight': f'decoder.block.{i}.layer.1.EncDecAttention.v.weight',
f'{prefix}decoderLayer.{i}.crossAttention.o.weight': f'decoder.block.{i}.layer.1.EncDecAttention.o.weight',
f'{prefix}decoderLayer.{i}.layerNorm3.weight': f'decoder.block.{i}.layer.1.layer_norm.weight',
f'{prefix}decoderLayer.{i}.feedForward.outputDense.weight': f'decoder.block.{i}.layer.2.DenseReluDense.wo.weight',
f'{prefix}decoderLayer.{i}.layerNorm2.weight': f'decoder.block.{i}.layer.2.layer_norm.weight',
})
if self.version.endswith('t5.1.0'):
mapping.update({f'{prefix}decoderLayer.{i}.feedForward.intermediateDense.weight': f'decoder.block.{i}.layer.2.DenseReluDense.wi.weight'})
elif self.version.endswith('t5.1.1'):
mapping.update({f'{prefix}decoderLayer.{i}.feedForward.intermediateDense.weight': f'decoder.block.{i}.layer.2.DenseReluDense.wi_0.weight',
f'{prefix}decoderLayer.{i}.feedForward.intermediateDense1.weight': f'decoder.block.{i}.layer.2.DenseReluDense.wi_1.weight'})
return mapping
class T5(Transformer):
"""Google的T5模型(Encoder-Decoder)
"""
@delete_arguments('with_pool', 'with_mlm', 'with_nsp')
def __init__(self, *args, tie_emb_src_tgt_weight=True, **kwargs):
super(T5, self).__init__(*args, **kwargs)
self.tie_emb_src_tgt_weight = tie_emb_src_tgt_weight
# encoder
self.encoder = T5_Encoder(*args, **kwargs)
self.encoder.build(**kwargs)
# decoder
self.decoder = T5_Decoder(*args, **kwargs)
self.decoder.build(**kwargs)
def load_variable(self, state_dict, name, prefix=''):
"""加载单个变量的函数
"""
variable = state_dict[name]
if name in {'shared.weight', 'encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight'}:
return self.load_embeddings(variable)
else:
return variable
def variable_mapping(self, prefix=''):
mapping = self.encoder.variable_mapping(prefix='encoder.')
mapping.update(self.decoder.variable_mapping(prefix='decoder.'))
if self.tie_emb_src_tgt_weight:
mapping.update({'encoder.embeddings.word_embeddings.weight': 'shared.weight',
'decoder.embeddings.word_embeddings.weight': 'shared.weight'})
return mapping
class GPT(LM_Mask, BERT):
"""构建GPT模型
链接:https://github.com/openai/finetune-transformer-lm
"""
@insert_arguments(final_activation='softmax')
@delete_arguments('with_pool', 'with_mlm', 'with_nsp')
def __init__(self, max_position, **kwargs):
"""GPT的embedding是token、position、segment三者embedding之和,跟BERT的主要区别是三者相加之后没有加LayerNormalization层。
使用LM_Mask实现预训练ckpt中的bias参数,最后的全连接层由于和embedding层权重一致,因此直接从word_embedding取
"""
super(GPT, self).__init__(max_position, **kwargs)
del self.embeddings.layerNorm
self.dense = nn.Linear(self.hidden_size, self.vocab_size, bias=False)
self.dense.weight = self.embeddings.word_embeddings.weight
self.final_activation = get_activation(self.final_activation)
def apply_final_layers(self, inputs):
hidden_state = super().apply_final_layers(inputs)
logit = self.dense(hidden_state)
return self.final_activation(logit)
def load_variable(self, state_dict, name):
return super(GPT, self).load_variable(state_dict, name, prefix='gpt')
def variable_mapping(self):
"""映射到GPT权重格式
"""
mapping = super(GPT, self).variable_mapping(prefix='gpt')
return mapping
class GPT2(LM_Mask, BERT):
"""构建GPT模型
链接:https://github.com/openai/finetune-transformer-lm
"""
@insert_arguments(final_activation='softmax')
@delete_arguments('with_pool', 'with_mlm', 'with_nsp')
def __init__(self, max_position, **kwargs):
"""GPT2的embedding是token、position两者embedding之和
1、跟BERT的主要区别是三者相加之后没有加LayerNormalization层。
2、bert的layernorm是在attn/ffc之后,OpenAi-gpt2是在之前。
使用LM_Mask实现预训练ckpt中的bias参数,最后的全连接层由于和embedding层权重一致,因此直接从word_embedding取
"""
super(GPT2, self).__init__(max_position, **kwargs)
del self.embeddings.layerNorm
layer = self.Gpt2Layer(self.hidden_size, self.num_attention_heads, self.dropout_rate, self.attention_probs_dropout_prob, self.intermediate_size, self.hidden_act, is_dropout=self.is_dropout, conditional_size=self.conditional_size)
self.encoderLayer = nn.ModuleList([copy.deepcopy(layer) if layer_id in self.keep_hidden_layers else Identity() for layer_id in range(self.num_hidden_layers)])
self.LayerNormFinal = LayerNorm(self.hidden_size, eps=1e-12, conditional_size=self.conditional_size)
self.dense = nn.Linear(self.hidden_size, self.vocab_size, bias=False)
self.dense.weight = self.embeddings.word_embeddings.weight
self.final_activation = get_activation(self.final_activation)
def apply_final_layers(self, inputs):
hidden_state = super().apply_final_layers(inputs)
logit = self.dense(self.LayerNormFinal([hidden_state]))
return self.final_activation(logit)
def load_variable(self, state_dict, name):
return super(GPT2, self).load_variable(state_dict, name, prefix='gpt2')
def variable_mapping(self):
"""映射到GPT权重格式
"""
mapping = super(GPT2, self).variable_mapping(prefix='gpt2')
mapping.update({'LayerNormFinal.weight': 'gpt2.LayerNormFinal.weight',
'LayerNormFinal.bias': 'gpt2.LayerNormFinal.bias'})
return mapping
class Gpt2Layer(BertLayer):
'''Not defined in layer.py because this layer is specific to the gpt2 model and not reusable
Order: LN --> Att --> Add --> LN --> FFN --> Add
'''
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
def forward(self, hidden_states, attention_mask, conditional_emb=None, encoder_hidden_states=None, encoder_attention_mask=None):
# BERT applies layernorm after attn/ffn; OpenAI GPT2 applies it before
x = self.layerNorm1((hidden_states, conditional_emb))
self_attn_output = self.multiHeadAttention(x, attention_mask)
hidden_states = hidden_states + self.dropout1(self_attn_output)
x = self.layerNorm2((hidden_states, conditional_emb))
ffn_output = self.feedForward(x)
hidden_states = hidden_states + self.dropout2(ffn_output)
return hidden_states
class GPT2_ML(LM_Mask, BERT):
"""构建GPT2_ML模型
链接: https://github.com/imcaspar/gpt2-ml
注意:GPT2_ML虽然号称GPT2,但是它的结构其实更接近GPT,它自称GPT2的原因大概是因为它开源的版本参数量达到了GPT2的15亿参数。
看完ckpt中的key,和GPT的区别是embedding后也有layernorm,和bert的区别是第一个跳跃链接是在layernorm前,bert是在之后
"""
@insert_arguments(final_activation='softmax')
@delete_arguments('with_pool', 'with_mlm', 'with_nsp')
def __init__(self, max_position, **kwargs):
super().__init__(max_position, **kwargs)
layer = self.Gpt2MlLayer(self.hidden_size, self.num_attention_heads, self.dropout_rate, self.attention_probs_dropout_prob, self.intermediate_size, self.hidden_act, is_dropout=self.is_dropout, conditional_size=self.conditional_size)
self.encoderLayer = nn.ModuleList([copy.deepcopy(layer) if layer_id in self.keep_hidden_layers else Identity() for layer_id in range(self.num_hidden_layers)])
self.dense = nn.Linear(self.hidden_size, self.vocab_size, bias=False)
self.dense.weight = self.embeddings.word_embeddings.weight
self.final_activation = get_activation(self.final_activation)
def apply_final_layers(self, inputs):
hidden_state = super().apply_final_layers(inputs)
logit = self.dense(hidden_state)
return self.final_activation(logit)
def load_variable(self, state_dict, name):
return super(GPT2_ML, self).load_variable(state_dict, name, prefix='gpt2_ml')
def variable_mapping(self):
"""映射到GPT2权重格式
"""
mapping = super(GPT2_ML, self).variable_mapping(prefix='gpt2_ml')
return mapping
class Gpt2MlLayer(BertLayer):
'''Not defined in layer.py because this layer is specific to the gpt2_ml model and not reusable
Order: Att --> Add --> LN --> FFN --> Add --> LN
'''
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
def forward(self, hidden_states, attention_mask, conditional_emb=None, encoder_hidden_states=None, encoder_attention_mask=None):
self_attn_output = self.multiHeadAttention(hidden_states, attention_mask)
hidden_states = hidden_states + self.dropout1(self_attn_output)
x = self.layerNorm1((hidden_states, conditional_emb))
# BERT takes the skip connection after layerNorm; gpt2_ml takes it before
ffn_output = self.feedForward(x)
hidden_states = hidden_states + self.dropout2(ffn_output)
hidden_states = self.layerNorm2((hidden_states, conditional_emb))
return hidden_states
class Transformer_XL(BERT):
'''Transformer-XL model (checkpoint loading verified)
Project: https://github.com/kimiyoung/transformer-xl
Differences from the original:
1) the original AdaptiveEmbedding is simplified (optional) and ProjectedAdaptiveLogSoftmax is not used; last_hidden_state is returned directly
2) mems are initialized as zero tensors and include the last layer; the original project initializes them as empty tensors
3) SinusoidalPositionEncoding usually interleaves sin and cos; here all sin terms come first, then all cos terms
4) in multi_attn the attention_mask uses 1e30 instead of the original 1000
'''
@delete_arguments('with_pool', 'with_nsp', 'with_mlm')
@insert_arguments(with_lm=False)
def __init__(self, *args, mem_len=0, same_length=False, clamp_len=-1, **kwargs):
# p_bias controls skipping position embeddings in the embedding stage
kwargs.update({'p_bias': 'other_relative'})
super().__init__(*args, **kwargs)
self.mem_len, self.same_length, self.clamp_len = mem_len, same_length, clamp_len
self.attn_type = kwargs.get('attn_type', 0)
# embedding
if kwargs.get('adaptive_embedding'):
cutoffs, div_val, sample_softmax = kwargs.get('cutoffs', []), kwargs.get('div_val', 1), kwargs.get('sample_softmax', False)
self.embeddings = AdaptiveEmbedding(self.vocab_size, self.embedding_size, self.hidden_size, cutoffs, div_val, sample_softmax, **get_kw(AdaptiveEmbedding, kwargs))
else:
self.embeddings = nn.Embedding(self.vocab_size, self.embedding_size)
self.pos_embeddings = XlnetPositionsEncoding(self.embedding_size)
self.dropout = nn.Dropout(self.dropout_rate)
# whether each layer has its own r_w_bias/r_r_bias, or they are shared globally
if not kwargs.get('untie_r'):
self.r_w_bias = nn.Parameter(torch.FloatTensor(self.num_attention_heads, self.attention_head_size)) # global content bias
self.r_r_bias = nn.Parameter(torch.FloatTensor(self.num_attention_heads, self.attention_head_size)) # global position bias
if self.segment_vocab_size > 0:
self.r_s_bias = nn.Parameter(torch.FloatTensor(self.num_attention_heads, self.attention_head_size)) # global segment bias
else:
self.r_w_bias, self.r_r_bias = None, None
self.r_s_bias = None
# transformer block
layer = XlnetLayer(self.hidden_size, self.num_attention_heads, self.dropout_rate, self.attention_probs_dropout_prob, self.intermediate_size,
self.hidden_act, is_dropout=self.is_dropout, conditional_size=self.conditional_size, r_w_bias=self.r_w_bias, r_r_bias=self.r_r_bias,
r_s_bias=None, **get_kw(BertLayer, kwargs))
self.encoderLayer = nn.ModuleList([copy.deepcopy(layer) if layer_id in self.keep_hidden_layers else Identity() for layer_id in range(self.num_hidden_layers)])
# 映射
if self.with_lm:
self.dense = nn.Linear(self.hidden_size, self.vocab_size, bias=True)
def init_mems(self, bsz):
'''Initialize mems, which cache each layer's hidden states over the last mlen steps
'''
if isinstance(self.mem_len, (int, float)) and (self.mem_len > 0):
mems = []
param = next(self.parameters())
for _ in range(self.num_hidden_layers+1):
empty = torch.zeros(bsz, self.mem_len, self.hidden_size, dtype=param.dtype, device=param.device)
mems.append(empty)
return mems
else:
return None
def _update_mems(self, hids, mlen, qlen):
'''Update mems
'''
# does not deal with None
if self.mems is None:
return None
# mems is not None
assert len(hids) == len(self.mems), "len(hids) != len(mems)"
# There are `mlen + qlen` steps that can be cached into mems
with torch.no_grad():
new_mems = []
end_idx = mlen + max(0, qlen)
beg_idx = max(0, end_idx - self.mem_len)
for i in range(len(hids)):
cat = torch.cat([self.mems[i], hids[i]], dim=1)
new_mems.append(cat[:, beg_idx:end_idx].detach())
self.mems = new_mems
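# --- A hedged numeric example (not library code) of the caching window above: with
# mem_len=4, an existing memory of mlen=4 steps and a new segment of qlen=3 steps,
# the concatenated stream holds mlen+qlen=7 steps, so end_idx=7, beg_idx=3 and
# cat[:, 3:7] (the most recent mem_len steps) is detached and cached for the next segment.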
def relative_positional_encoding(self, qlen, klen, device):
# build pos_emb; the sincos position encoding is used here, with arguments kept consistent with xlnet
pos_seq = torch.arange(klen-1, -1, -1.0, device=device, dtype=torch.long)
if self.clamp_len > 0:
pos_seq.clamp_(max=self.clamp_len)
pos_emb = self.dropout(self.pos_embeddings(pos_seq)) # 用word_emb的dropout
return pos_emb
def create_mask(self, word_emb, qlen, klen, mlen):
# build the attention_mask: all mlen cached positions are visible, while the qlen query positions can only attend to steps <= t; similar to UniLM's mask, except UniLM controls it via segment_ids
if self.same_length: # can only attend to a fixed-length window
all_ones = word_emb.new_ones(qlen, klen)
mask_len = klen - self.mem_len
mask_shift_len = qlen - mask_len if mask_len > 0 else qlen
attention_mask = 1-(torch.triu(all_ones, 1+mlen) + torch.tril(all_ones, -mask_shift_len)).byte() # -1
else:
attention_mask = torch.tril(word_emb.new_ones(qlen, klen), diagonal=mlen).byte() # [q_len, k_len], lower triangle (shifted by mlen) set to 1
attention_mask = attention_mask[None, None, :, :]
return attention_mask
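# --- A hedged standalone illustration (hypothetical sizes, not library code) of the
# default branch above: with qlen=3 new steps and mlen=2 cached steps, position t may
# attend to the whole memory plus the first t+1 query positions:
# >>> torch.tril(torch.ones(3, 5), diagonal=2).byte()
# tensor([[1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]], dtype=torch.uint8)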
def apply_embeddings(self, inputs):
'''Accepted inputs: [token_ids, segment_ids]; conditional LayerNorm inputs are not yet supported
'''
assert isinstance(inputs, (list, tuple)), f'Inputs only support list/tuple format, but got {type(inputs)}'
self.mems = self.init_mems(inputs[0].size(0)) # 生成mems
# after simplification, embeddings only compute the word embedding
word_emb = self.dropout(self.embeddings(inputs[0]))
index_ = 1
btz, qlen = inputs[0].shape[:2] # query长度
mlen = self.mems[0].size(1) if self.mems is not None else 0
klen = mlen + qlen
# 相对位置编码
pos_emb = self.relative_positional_encoding(qlen, klen, word_emb.device)
# segment embedding
if self.segment_vocab_size > 0:
segment_ids = inputs[index_]
if mlen > 0:
mem_pad = torch.zeros([btz, mlen], dtype=torch.long, device=word_emb.device)
cat_ids = torch.cat([mem_pad, segment_ids], dim=1)
else:
cat_ids = segment_ids
# `1` indicates not in the same segment [qlen x klen x bsz]
segment_ids = (segment_ids[:, :, None] != cat_ids[:, None]).long()
index_ += 1
else:
segment_ids = None
if self.attn_type in {'uni', 0}: # compatible with transformer_xl's setting: 0
attention_mask = self.create_mask(word_emb, qlen, klen, mlen)
elif self.attn_type == 'bi':
attention_mask = (inputs[0] != self.token_pad_ids).long().unsqueeze(1).unsqueeze(2)
non_tgt_mask = torch.eye(qlen).to(attention_mask)[None, None, :, :]
non_tgt_mask = ((1 - attention_mask - non_tgt_mask) <= 0).long()
return [word_emb, segment_ids, pos_emb, non_tgt_mask, None]
def apply_main_layers(self, inputs):
hidden_states, segment_ids, pos_emb, attention_mask, conditional_emb = inputs[:5]
encoded_layers = [hidden_states] # 添加embedding的输出
layer_inputs = [hidden_states, segment_ids, pos_emb, attention_mask, None, conditional_emb]
for i, layer_module in enumerate(self.encoderLayer):
mems_i = None if self.mems is None else self.mems[i]
layer_inputs[-2] = mems_i
layer_inputs = self.apply_on_layer_begin(i, layer_inputs)
hidden_states = layer_module(*layer_inputs)
layer_inputs[0] = hidden_states
layer_inputs = self.apply_on_layer_end(i, layer_inputs)
encoded_layers.append(hidden_states)
# in the original implementation word_emb, pos_emb and core_out (hidden_states) share the same dropout
hidden_states = self.dropout(hidden_states)
qlen = inputs[0].size(1) # query长度
mlen = self.mems[0].size(1) if self.mems is not None else 0 # mems are [btz, mem_len, hidden_size], so mem_len is dim 1
self._update_mems(encoded_layers, mlen, qlen)
if not self.output_all_encoded_layers:
# not returning all layers: keep the embedding output plus the top layer
encoded_layers = encoded_layers[:1] + [hidden_states]
return [encoded_layers, conditional_emb]
def load_variable(self, state_dict, name, prefix=''):
# not yet supported here because the pretrained model uses AdaptiveEmbedding
if (self.keep_tokens is not None) or (self.compound_tokens is not None):
raise ValueError('Custom keep_tokens and compound_tokens are not yet supported in Transformer_XL')
return state_dict[name]
def variable_mapping(self, prefix=''):
return {k:k for k, v in self.named_parameters()}
class XLNET(Transformer_XL):
'''XLNet model, simplified here for finetuning only, i.e. without perm_mask or target_mapping inputs
Accepted inputs: [token_ids, segment_ids]
'''
def __init__(self, *args, bi_data=False, **kwargs):
self.attn_type = kwargs.get('attn_type', 'bi')
self.bi_data = bi_data
kwargs['rel_shift_opt'] = 'xlnet'
super().__init__(*args, **kwargs)
def relative_positional_encoding(self, qlen, klen, device):
# build pos_emb with the sincos position encoding; transformer_xl's version uses a -1 offset
if self.attn_type == 'bi':
beg, end = klen, -qlen
elif self.attn_type == "uni":
beg, end = klen, -1
else:
raise ValueError(f"Unknown `attn_type` {self.attn_type}.")
# 前向的emb
pos_seq = torch.arange(beg, end, -1.0, device=device, dtype=torch.long)
if self.clamp_len > 0:
pos_seq.clamp_(max=self.clamp_len)
fwd_pos_emb = self.pos_embeddings(pos_seq)
# 双向数据
if self.bi_data:
pos_seq = torch.arange(-beg, -end, -1.0, device=device, dtype=torch.long)
if self.clamp_len > 0:
pos_seq.clamp_(max=self.clamp_len)
bwd_pos_emb = self.pos_embeddings(pos_seq)
pos_emb = torch.cat([fwd_pos_emb, bwd_pos_emb], dim=0)
else:
pos_emb = fwd_pos_emb
pos_emb = self.dropout(pos_emb) # 用word_emb的dropout
return pos_emb
def apply_final_layers(self, inputs):
hidden_state = super().apply_final_layers(inputs)
if self.with_lm:
return [hidden_state, self.dense(hidden_state)]
else:
return hidden_state
def load_variable(self, state_dict, name, prefix='transformer'):
"""加载单个变量的函数
"""
variable = state_dict[name]
if name in {f'{prefix}.word_embedding.weight', 'lm_loss.weight', 'lm_loss.bias'}:
return self.load_embeddings(variable)
elif re.search('rel_attn\.(q|k|v|r)$', name):
return variable.reshape(variable.shape[0], -1).T
# elif re.search('rel_attn\.(o|seg_embed)$', name):
elif re.search('rel_attn\.(o)$', name):
return variable.reshape(variable.shape[0], -1)
else:
return variable
def variable_mapping(self, prefix='transformer'):
mapping = {
'embeddings.weight': f'{prefix}.word_embedding.weight',
'dense.weight': 'lm_loss.weight',
'dense.bias': 'lm_loss.bias',
}
for i in range(self.num_hidden_layers):
prefix_i = f'{prefix}.layer.%d.' % i
mapping.update({f'encoderLayer.{i}.multiHeadAttention.q.weight': prefix_i + 'rel_attn.q',
f'encoderLayer.{i}.multiHeadAttention.k.weight': prefix_i + 'rel_attn.k',
f'encoderLayer.{i}.multiHeadAttention.v.weight': prefix_i + 'rel_attn.v',
f'encoderLayer.{i}.multiHeadAttention.o.weight': prefix_i + 'rel_attn.o',
f'encoderLayer.{i}.multiHeadAttention.r.weight': prefix_i + 'rel_attn.r',
f'encoderLayer.{i}.multiHeadAttention.r_r_bias': prefix_i + 'rel_attn.r_r_bias',
f'encoderLayer.{i}.multiHeadAttention.r_s_bias': prefix_i + 'rel_attn.r_s_bias',
f'encoderLayer.{i}.multiHeadAttention.r_w_bias': prefix_i + 'rel_attn.r_w_bias',
# f'encoderLayer.{i}.multiHeadAttention.seg_embed.weight': prefix_i + 'rel_attn.seg_embed',
f'encoderLayer.{i}.multiHeadAttention.seg_embed': prefix_i + 'rel_attn.seg_embed',
f'encoderLayer.{i}.layerNorm1.weight': prefix_i + 'rel_attn.layer_norm.weight',
f'encoderLayer.{i}.layerNorm1.bias': prefix_i + 'rel_attn.layer_norm.bias',
f'encoderLayer.{i}.feedForward.intermediateDense.weight': prefix_i + 'ff.layer_1.weight',
f'encoderLayer.{i}.feedForward.intermediateDense.bias': prefix_i + 'ff.layer_1.bias',
f'encoderLayer.{i}.feedForward.outputDense.weight': prefix_i + 'ff.layer_2.weight',
f'encoderLayer.{i}.feedForward.outputDense.bias': prefix_i + 'ff.layer_2.bias',
f'encoderLayer.{i}.layerNorm2.weight': prefix_i + 'ff.layer_norm.weight',
f'encoderLayer.{i}.layerNorm2.bias': prefix_i + 'ff.layer_norm.bias'
})
return mapping
def build_transformer_model(
config_path=None,
checkpoint_path=None,
model='bert',
application='encoder',
**kwargs
):
"""根据配置文件构建模型,可选加载checkpoint权重
"""
configs = {}
if config_path is not None:
configs.update(json.load(open(config_path)))
configs.update(kwargs)
if 'max_position' not in configs:
configs['max_position'] = configs.get('max_position_embeddings', 512)
if 'dropout_rate' not in configs:
configs['dropout_rate'] = configs.get('hidden_dropout_prob')
if 'segment_vocab_size' not in configs:
configs['segment_vocab_size'] = configs.get('type_vocab_size', 2)
models = {
'bert': BERT,
'roberta': BERT,
'albert': ALBERT,
'albert_unshared': ALBERT_Unshared,
'nezha': NEZHA,
'roformer': RoFormer,
'roformer_v2': RoFormerV2,
'gau_alpha': GAU_alpha,
'electra': ELECTRA,
'ernie': ERNIE,
'encoder': Encoder,
'decoder': Decoder,
'transformer': Transformer,
'bart': BART,
'gpt': GPT,
'gpt2': GPT2,
'gpt2_ml': GPT2_ML,
't5': T5,
't5_encoder': T5_Encoder,
't5_decoder': T5_Decoder,
't5.1.0': T5,
't5.1.0_encoder': T5_Encoder,
't5.1.0_decoder': T5_Decoder,
't5.1.1': T5,
't5.1.1_encoder': T5_Encoder,
't5.1.1_decoder': T5_Decoder,
'mt5.1.1': T5,
'mt5.1.1_encoder': T5_Encoder,
'mt5.1.1_decoder': T5_Decoder,
'transformer_xl': Transformer_XL,
'xlnet': XLNET,
}
if isinstance(model, str): # a string selects one of the built-in models
MODEL = models[model.lower()]
if model.endswith('t5.1.1'):
configs['version'] = model
elif isinstance(model, type) and issubclass(model, BERT_BASE): # a BERT_BASE subclass selects a user-defined model
MODEL = model
else:
raise ValueError('"model" args type should be string or nn.Module')
application = application.lower()
if application in ['lm', 'unilm'] and model in ['electra', 't5', ]:
raise ValueError(f'"{model}" model can not be used as "{application}" application.\n')
if application == 'lm':
MODEL = extend_with_language_model(MODEL)
elif application == 'unilm':
MODEL = extend_with_unified_language_model(MODEL)
transformer = MODEL(**configs)
transformer.build(**configs)
transformer.apply(transformer.init_model_weights) # 初始化权重
if checkpoint_path is not None:
transformer.load_weights_from_pytorch_checkpoint(checkpoint_path)
transformer.configs = configs
return transformer
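# --- A hedged usage sketch of the builder above (paths are placeholders for a locally
# downloaded pretrained model; with_pool is one of the flags consumed in apply_final_layers):
# model = build_transformer_model(
#     config_path='pretrained/bert-base-chinese/config.json',
#     checkpoint_path='pretrained/bert-base-chinese/pytorch_model.bin',
#     model='bert',
#     with_pool=True,
# )
# last_hidden_state, pooled_output = model([token_ids, segment_ids])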
import os
import time
from inspect import isfunction
import torch
import torch.nn as nn
import copy
import json
import re
from bert4torch.layers import LayerNorm, BertEmbeddings, BertLayer, Identity, T5Layer, GatedAttentionUnit, XlnetLayer
from bert4torch.layers import AdaptiveEmbedding, XlnetPositionsEncoding
from bert4torch.snippets import metric_mapping, search_layer, insert_arguments, delete_arguments, get_kw
from bert4torch.snippets import ProgbarLogger, EarlyStopping, FGM, PGD, VAT, IterDataset, take_along_dim
from bert4torch.activations import get_activation
from collections import OrderedDict
import warnings
class BaseModel(nn.Module):
def __init__(self):
super(BaseModel, self).__init__()
# exposed mainly so that external callers (e.g. callbacks) can read them
self.global_step, self.local_step, self.total_steps, self.epoch, self.steps_per_epoch, self.train_dataloader = 0, 0, 0, 0, None, None
self.resume_step, self.resume_epoch = 0, 0
self.callbacks = []
def save_steps_params(self, save_path):
'''Save training-progress parameters
'''
step_params = {'resume_step': (self.local_step+1) % self.steps_per_epoch,
'resume_epoch': self.epoch + (self.local_step+1) // self.steps_per_epoch}
torch.save(step_params, save_path)
def load_steps_params(self, save_path):
'''Load training-progress parameters
'''
step_params = torch.load(save_path)
self.resume_step = step_params['resume_step']
self.resume_epoch = step_params['resume_epoch']
return step_params
def compile(self, loss, optimizer, scheduler=None, clip_grad_norm=None, use_amp=False, metrics=None, adversarial_train={'name': ''}):
'''Configure loss, optimizer, scheduler and metrics
loss: loss function
optimizer: optimizer
scheduler: learning-rate scheduler
clip_grad_norm: gradient-clipping threshold, disabled by default
use_amp: whether to use mixed precision, disabled by default
metrics: metrics printed during training; loss-related metrics are printed by default. Supports 'accuracy' as a string, and custom metrics of the form {key: func}
adversarial_train: adversarial training config, e.g. {'name': 'fgm'}; supported names are fgm, pgd, vat and gradient_penalty
'''
self.criterion = loss
self.optimizer = optimizer
self.scheduler = scheduler
self.clip_grad_norm = clip_grad_norm
self.use_amp = use_amp
if use_amp:
assert adversarial_train['name'] not in {'vat', 'gradient_penalty'}, 'Running amp together with vat/gradient_penalty adversarial training is not supported in the current version'
from torch.cuda.amp import autocast
self.autocast = autocast
self.scaler = torch.cuda.amp.GradScaler()
# metrics tracked during training
self.metrics = OrderedDict({'loss': None})
if metrics is None:
metrics = []
elif isinstance(metrics, (str, dict)) or isfunction(metrics):
metrics = [metrics]
for metric in metrics:
# string form: currently only 'accuracy' is supported
if isinstance(metric, str) and metric != 'loss':
self.metrics[metric] = None
# dict form: {metric: func}
elif isinstance(metric, dict):
self.metrics.update(metric)
# function form: the metric is used as both key and value
elif isfunction(metric):
self.metrics.update({metric: metric})
else:
raise ValueError('Arg metrics only supports String, Dict, Callable, or a List of these')
# 对抗训练
self.adversarial = adversarial_train
self.adversarial_initialize()
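# --- A hedged configuration sketch for compile() (optimizer and lr are illustrative):
# from torch.optim import Adam
# model.compile(
#     loss=nn.CrossEntropyLoss(),
#     optimizer=Adam(model.parameters(), lr=2e-5),
#     clip_grad_norm=1.0,       # enable gradient clipping with this max norm
#     metrics=['accuracy'],
# )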
def adversarial_initialize(self):
'''Initialize adversarial training
'''
assert self.adversarial['name'] in {'', 'fgm', 'pgd', 'vat', 'gradient_penalty'}, 'adversarial_train support fgm, pgd, vat and gradient_penalty mode'
self.adversarial['epsilon'] = self.adversarial.get('epsilon', 1.0)
self.adversarial['emb_name'] = self.adversarial.get('emb_name', 'word_embeddings')
if self.adversarial['name'] == 'fgm':
self.ad_train = FGM(self)
elif self.adversarial['name'] == 'pgd':
self.adversarial['K'] = self.adversarial.get('K', 3) # number of attack steps
self.adversarial['alpha'] = self.adversarial.get('alpha', 0.3) # attack step size
self.ad_train = PGD(self)
elif self.adversarial['name'] == 'gradient_penalty':
pass
elif self.adversarial['name'] == 'vat':
self.adversarial['K'] = self.adversarial.get('K', 3)
self.adversarial['noise_var'] = self.adversarial.get('noise_var', 1e-5) # noise variance
self.adversarial['noise_gamma'] = self.adversarial.get('noise_gamma', 1e-6) # eps
self.adversarial['adv_step_size'] = self.adversarial.get('adv_step_size', 1e-3) # step size
self.adversarial['adv_alpha'] = self.adversarial.get('adv_alpha', 1) # weight of the adversarial loss
self.adversarial['norm_type'] = self.adversarial.get('norm_type', 'l2') # normalization type
self.ad_train = VAT(self, **self.adversarial)
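# --- A hedged example of enabling adversarial training through compile(); the values
# shown are just the defaults filled in above:
# model.compile(loss=..., optimizer=..., adversarial_train={'name': 'fgm', 'epsilon': 1.0, 'emb_name': 'word_embeddings'})
# model.compile(loss=..., optimizer=..., adversarial_train={'name': 'pgd', 'K': 3, 'alpha': 0.3})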
def adversarial_training(self, train_X, train_y, output, loss, loss_detail, grad_accumulation_steps):
'''Run one adversarial training step
'''
if self.adversarial['name'] == 'fgm':
self.ad_train.attack(**self.adversarial) # the embedding has been perturbed
output, loss, loss_detail = self.train_step(train_X, train_y, grad_accumulation_steps)
loss.backward() # backprop: accumulate the adversarial gradients on top of the normal ones
# restore the embedding parameters: updates must apply to the clean embedding, not the perturbed one
self.ad_train.restore(**self.adversarial)
elif self.adversarial['name'] == 'pgd':
self.ad_train.backup_grad() # back up the gradients
for t in range(self.adversarial['K']):
# add an adversarial perturbation on the embedding; back up param.data on the first attack
self.ad_train.attack(**self.adversarial, is_first_attack=(t==0))
if t != self.adversarial['K']-1:
self.optimizer.zero_grad() # accumulate the perturbation rather than the gradients
else:
self.ad_train.restore_grad() # restore the normal gradients
output, loss, loss_detail = self.train_step(train_X, train_y, grad_accumulation_steps)
loss.backward() # backprop: accumulate the adversarial gradients on top of the normal ones
self.ad_train.restore(**self.adversarial) # restore the embedding parameters
# gradient penalty
elif self.adversarial['name'] == 'gradient_penalty':
para = search_layer(self, self.adversarial['emb_name'], retrun_first=True)
gp = (para.grad ** 2).sum()
loss += 0.5 * gp * self.adversarial['epsilon']
loss.backward()
# virtual adversarial training (VAT)
elif self.adversarial['name'] == 'vat':
logit = output[0] if isinstance(output, (list, tuple)) else output
adv_loss = self.ad_train.virtual_adversarial_training(train_X, logit)
loss_detail.update({'loss_sup': loss.item(), 'loss_unsup': adv_loss})
loss += (adv_loss if adv_loss else 0)
loss.backward()
return loss, loss_detail
def train_step(self, train_X, train_y, grad_accumulation_steps):
'''Run forward and return the loss
'''
def args_segmentate(train_X):
'''Whether the arguments should be unpacked
'''
if isinstance(train_X, torch.Tensor): # a bare tensor is not unpacked
pass
elif isinstance(self, (BaseModelDP, BaseModelDDP)):
if self.module.forward.__code__.co_argcount >= 3:
return True
elif self.forward.__code__.co_argcount >= 3:
return True
return False
if self.use_amp:
with self.autocast():
output = self.forward(*train_X) if args_segmentate(train_X) else self.forward(train_X)
loss_detail = self.criterion(output, train_y)
else:
output = self.forward(*train_X) if args_segmentate(train_X) else self.forward(train_X)
loss_detail = self.criterion(output, train_y)
if isinstance(loss_detail, torch.Tensor):
loss = loss_detail
loss_detail = {}
elif isinstance(loss_detail, dict):
loss = loss_detail['loss']  # other losses may also be present; they are only logged
del loss_detail['loss']
elif isinstance(loss_detail, (tuple, list)):
loss = loss_detail[0]
loss_detail = {f'loss{i}':v for i, v in enumerate(loss_detail[1:], start=1)}
else:
raise ValueError('Return loss only support Tensor/dict/tuple/list format')
# gradient accumulation
loss = loss / grad_accumulation_steps if grad_accumulation_steps > 1 else loss
return output, loss, loss_detail
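# Illustrative sketch (assumption): the criterion may return any of the formats handled above, e.g.
#   def criterion(output, train_y):
#       loss = F.cross_entropy(output, train_y)
#       loss_aux = output.pow(2).mean()  # hypothetical auxiliary term
#       return {'loss': loss + loss_aux, 'loss_aux': loss_aux}  # 'loss' is backpropagated, the rest is only logged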
def callback_fun(self, mode, logs={}):
'''Dispatch all callbacks in one place, making trigger conditions easier to manage
'''
# for distributed DDP training, only master_rank runs the callbacks
if isinstance(self, BaseModelDDP) and self.master_rank!=torch.distributed.get_rank():
return
if mode == 'train_begin':
for callback in self.callbacks:
callback.on_train_begin()
elif mode == 'epoch_begin':
for callback in self.callbacks:
callback.on_epoch_begin(self.global_step, self.epoch, logs)
elif mode == 'batch_begin':
for callback in self.callbacks:
callback.on_batch_begin(self.global_step, self.local_step, logs)
elif mode == 'batch_end':
for callback in self.callbacks:
callback.on_batch_end(self.global_step, self.local_step, logs)
elif mode == 'epoch_end':
for callback in self.callbacks:
callback.on_epoch_end(self.global_step, self.epoch, logs)
elif mode == 'train_end':
for callback in self.callbacks:
callback.on_train_end()
elif mode == 'dataloader_end':
for callback in self.callbacks:
callback.on_dataloader_end()
def fit(self, train_dataloader, steps_per_epoch=None, epochs=1, grad_accumulation_steps=1, callbacks=None):
if not hasattr(train_dataloader, '__len__'):
assert steps_per_epoch is not None, 'Either train_dataloader has attr "__len__" or steps_per_epoch is not None'
self.steps_per_epoch = len(train_dataloader) if steps_per_epoch is None else steps_per_epoch
self.total_steps = self.steps_per_epoch * epochs
self.train_dataloader = train_dataloader  # stored as a member variable so external callbacks can modify it
train_dataloader_iter = iter(self.train_dataloader)  # not regenerated when looping over epochs
callbacks = [] if callbacks is None else callbacks
callbacks = callbacks if isinstance(callbacks, (list, tuple)) else [callbacks]
self.callbacks = [ProgbarLogger(epochs, self.steps_per_epoch, [i for i in self.metrics.keys() if isinstance(i, str)])] + callbacks
self.callback_fun('train_begin')
# epoch: current epoch
# global_step: current global training step
# local_step: current step within the epoch; the batch behind the same local_step may differ across epochs, except when steps_per_epoch=None
# bti: index into the dataloader; the batch behind the same bti is usually identical across epochs, unless the dataloader is regenerated
self.bti = 0
import time  # local import, used only by the fps logging below
# path for the fps log
fps_dir = 'log/'
os.makedirs(fps_dir, exist_ok=True)
fw = open(os.path.join(fps_dir, 'fps.txt'), 'a', encoding='utf-8')
# total training time
total_training_time = 0
# total number of training samples
total_samples = 0
# total evaluation time
total_eval_time = 0
# total elapsed time
total_time = 0
for epoch in range(self.resume_epoch, epochs):
self.epoch = epoch
# number of samples in this epoch
epoch_samples = 0
# start time of this epoch
epoch_start = time.time()
# resume_step: the starting point of local_step and of the progress bar
resume_step = self.resume_step if epoch==self.resume_epoch else 0
self.callback_fun('epoch_begin')
self.callbacks[0].seen = resume_step
for local_step in range(resume_step, self.steps_per_epoch):
self.local_step = local_step
self.global_step = self.epoch * self.steps_per_epoch + self.local_step
# loop over the dataloader; do not use itertools.cycle, which has caused variables not to be released
try:
batch = next(train_dataloader_iter)
except StopIteration:
self.callback_fun('dataloader_end')  # for large datasets that read files dynamically and regenerate the dataloader, e.g. pretraining
train_dataloader_iter = iter(self.train_dataloader)  # with shuffle=True the order is regenerated as well
self.bti = 0
batch = next(train_dataloader_iter)
train_X, train_y = batch
# count training samples (assumes train_X is a tensor whose first dimension is the batch size)
total_samples += train_X.shape[0]
epoch_samples += train_X.shape[0]
# get the batch size; at most two levels of nesting are allowed, i.e. ((token_ids1, mask1), (token_ids2, mask2))
# if isinstance(train_X, (list, tuple)):
# if isinstance(train_X[0], (list, tuple)):
# btz = train_X[0][0].size(0)
# else:
# btz = train_X[0].size(0)
# elif isinstance(train_X, torch.Tensor):
# btz = train_X.size(0)
# else:
# raise ValueError('Input only support [list, tuple, tensor]')
# logs = {'batch': self.local_step, 'size': btz}
logs = OrderedDict()
self.callback_fun('batch_begin', logs)
self.train()  # switch to train mode
# argument-count check: >=3 arguments means multiple inputs, ==2 means a single input
output, loss, loss_detail = self.train_step(train_X, train_y, grad_accumulation_steps)
retain_graph = self.adversarial['name'] in {'gradient_penalty', 'vat'}
if self.use_amp:  # mixed precision
scale_before_step = self.scaler.get_scale()
self.scaler.scale(loss).backward(retain_graph=retain_graph)
else:
loss.backward(retain_graph=retain_graph)
# adversarial training
loss, loss_detail = self.adversarial_training(train_X, train_y, output, loss, loss_detail, grad_accumulation_steps)
# parameter update; the real number of updates is the step count divided by grad_accumulation_steps, so adjust the total training steps accordingly
if (self.global_step+1) % grad_accumulation_steps == 0:
skip_scheduler = False
# mixed precision
if self.use_amp:
self.scaler.unscale_(self.optimizer)
if self.clip_grad_norm is not None:  # gradient clipping
torch.nn.utils.clip_grad_norm_(self.parameters(), self.clip_grad_norm)
self.scaler.step(self.optimizer)
self.scaler.update()
skip_scheduler = self.scaler.get_scale() != scale_before_step
else:
if self.clip_grad_norm is not None:  # gradient clipping
torch.nn.utils.clip_grad_norm_(self.parameters(), self.clip_grad_norm)
self.optimizer.step()
self.optimizer.zero_grad()  # clear the gradients
if (self.scheduler is not None) and not skip_scheduler:
if isinstance(self.scheduler, (tuple, list)):
for scheduler in self.scheduler:
scheduler.step()
else:
self.scheduler.step()
# add the loss to the logs
logs.update({'loss': loss.item()})
logs_loss_detail = {k: v.item() if isinstance(v, torch.Tensor) else v for k, v in loss_detail.items()}
logs.update(logs_loss_detail)
if self.global_step == resume_step:
self.callbacks[0].add_metrics(list(logs_loss_detail.keys()), add_position=1)
# add the metrics to the logs
for metric, func in self.metrics.items():
perf = metric_mapping(metric, func, output, train_y)  # built-in accuracy-style metrics
if perf is not None:
if isfunction(metric):  # a callable passed directly (no key)
if self.global_step == resume_step:
self.callbacks[0].add_metrics(list(perf.keys()))
logs.update(perf)
elif isinstance(metric, str):  # a string metric, logged under this key
logs[metric] = perf
self.callback_fun('batch_end', logs)
self.bti += 1
# end of training for this epoch
epoch_train_end = time.time()
# training time of this epoch
cur_epoch_train_time = epoch_train_end - epoch_start
# accumulate the total training time
total_training_time += cur_epoch_train_time
# start of evaluation for this epoch
eval_start = time.time()
self.callback_fun('epoch_end', logs)
# end time of this epoch
epoch_end = time.time()
# evaluation time of this epoch
cur_epoch_eval_time = time.time() - eval_start
# accumulate the total evaluation time
total_eval_time += cur_epoch_eval_time
# accumulate the total elapsed time
total_time += (epoch_end - epoch_start)
# write to the log file 'log/fps.txt'
if fw is not None:
fw.write('Epoch: {}\tCurrent Start time: {:.6f}\n'.format(epoch, epoch_start))
fw.write('Epoch: {}\tCurrent Train Samples: {}\n'.format(epoch, epoch_samples))
fw.write('Epoch: {}\tCurrent Train Time: {}\n'.format(epoch, cur_epoch_train_time))
fw.write('Epoch: {}\tCurrent Eval Time: {}\n'.format(epoch, cur_epoch_eval_time))
fw.write('Epoch: {}\tCurrent Epoch FPS: {:.3f}\n'.format(epoch, epoch_samples/cur_epoch_train_time))
fw.write('Epoch: {}\tCurrent End time: {:.6f}\n'.format(epoch, epoch_end))
fw.write('Epoch: {}\tCurrent F1 score: {:.5f}\n'.format(epoch, logs.get('f1', 0.0)))  # avoid a KeyError when no f1 metric is logged
fw.write('Epoch: {}\tAll Train Samples: {}\n'.format(epoch, total_samples))
fw.write('Epoch: {}\tAll Train Time: {}\n'.format(epoch, total_training_time))
fw.write('Epoch: {}\tAll Eval Time: {}\n'.format(epoch, total_eval_time))
fw.write('Epoch: {}\tAll Time: {}\n'.format(epoch, total_time))
fw.flush()
# early-stopping strategy
callback_tmp = [callback_tmp for callback_tmp in self.callbacks if isinstance(callback_tmp, EarlyStopping)]
if callback_tmp and callback_tmp[0].stopped_epoch > 0:
break
self.callback_fun('train_end', logs)
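# Illustrative usage sketch (assumption: a compiled BaseModel subclass and a pytorch DataLoader;
# Evaluator is a hypothetical user-defined Callback):
#   model.compile(loss=nn.CrossEntropyLoss(), optimizer=optim.Adam(model.parameters()), metrics=['accuracy'])
#   model.fit(train_dataloader, epochs=3, grad_accumulation_steps=1, callbacks=[Evaluator()])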
@torch.no_grad()
def predict(self, input_tensor_list, return_all=None):
self.eval()
if self.forward.__code__.co_argcount >= 3:
output = self.forward(*input_tensor_list)
else:
output = self.forward(input_tensor_list)
if return_all is None:
return output
elif isinstance(output, (tuple, list)) and isinstance(return_all, int) and return_all < len(output):
return output[return_all]
else:
raise ValueError('Return format error')
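# Illustrative sketch (assumption): return_all selects one element of a multi-output forward, e.g.
#   outputs = model.predict([token_ids, segment_ids])               # all outputs
#   pooled = model.predict([token_ids, segment_ids], return_all=1)  # only the element at index 1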
def load_weights(self, load_path, strict=True, prefix=None):
'''Load model weights
load_path: path to load the weights from
prefix: None loads with the current keys; a string loads with the original keys from variable_mapping()
'''
state_dict = torch.load(load_path, map_location='cpu')
if prefix is None:
self.load_state_dict(state_dict, strict=strict)
else:
# load with the original keys from variable_mapping()
eval_str = 'self.variable_mapping()' if prefix == '' else f'self.{prefix}.variable_mapping()'
mapping = {v:k for k, v in eval(eval_str).items()}
mapping = mapping if prefix == '' else {k:f'{prefix}.{v}' for k,v in mapping.items()}
state_dict_raw = {}
for k, v in state_dict.items():
k = mapping.get(k, k)
state_dict_raw[k] = v
self.load_state_dict(state_dict_raw, strict=strict)
def save_weights(self, save_path, prefix=None):
'''Save model weights
save_path: path to save the weights to
prefix: None saves with the current keys; a string saves with the original keys from variable_mapping()
'''
if prefix is None:
torch.save(self.state_dict(), save_path)
else:
# save with the original keys from variable_mapping(), so official implementations can load the model
eval_str = 'self.variable_mapping()' if prefix == '' else f'self.{prefix}.variable_mapping()'
mapping = eval(eval_str)
mapping = mapping if prefix == '' else {f'{prefix}.{k}':v for k,v in mapping.items()}
state_dict_raw = {}
for k, v in self.state_dict().items():
k = mapping.get(k, k)
state_dict_raw[k] = v
torch.save(state_dict_raw, save_path)
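# Illustrative sketch (assumption): prefix='' round-trips the keys through variable_mapping(),
# so the saved checkpoint can be loaded by other official implementations, e.g.
#   model.save_weights('model.pt')                 # keys as in this model
#   model.save_weights('model_raw.pt', prefix='')  # keys renamed via variable_mapping()
#   model.load_weights('model_raw.pt', prefix='')  # inverse renaming on load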
class BaseModelDP(BaseModel, nn.DataParallel):
'''Multi-GPU training via DataParallel
'''
def __init__(self, *args, **kwargs):
nn.DataParallel.__init__(self, *args, **kwargs)
class BaseModelDDP(BaseModel, nn.parallel.DistributedDataParallel):
'''Multi-GPU training via DistributedDataParallel
'''
def __init__(self, *args, master_rank=0, **kwargs):
self.master_rank = master_rank  # rank that owns the progress bar
nn.parallel.DistributedDataParallel.__init__(self, *args, **kwargs)
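# Illustrative sketch (assumption: torch.distributed is initialized and local_rank is known):
#   torch.distributed.init_process_group(backend='nccl')
#   model = BaseModelDDP(raw_model, master_rank=0, device_ids=[local_rank], output_device=local_rank)
# Only master_rank drives the progress bar and the callbacks (see callback_fun above).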
class BERT_BASE(BaseModel):
"""模型基类
"""
def __init__(
self,
vocab_size,  # vocabulary size
hidden_size,  # hidden (encoding) dimension
num_hidden_layers,  # total number of Transformer layers
num_attention_heads,  # number of attention heads
intermediate_size,  # hidden dimension of the FeedForward layer
hidden_act,  # activation function of the FeedForward hidden layer
dropout_rate=None,  # dropout ratio
attention_probs_dropout_prob=None,  # dropout ratio of the attention matrix
embedding_size=None,  # embedding size; if unspecified, the value from the config file is used
attention_head_size=None,  # head_size of V in attention
attention_key_size=None,  # head_size of Q and K in attention
initializer_range=0.02,  # std of the weight initialization
sequence_length=None,  # fixed sequence length, if any
keep_tokens=None,  # list of token IDs to keep
compound_tokens=None,  # extended embeddings
residual_attention_scores=False,  # add a residual to the attention matrix
ignore_invalid_weights=False,  # allow skipping weights missing from the checkpoint
keep_hidden_layers=None,  # ids of the hidden layers to keep
hierarchical_position=None,  # hierarchical decomposition of the position embeddings
**kwargs
):
super(BERT_BASE, self).__init__()
if keep_tokens is not None:
vocab_size = len(keep_tokens)
if compound_tokens is not None:
vocab_size += len(compound_tokens)
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.attention_head_size = attention_head_size or self.hidden_size // self.num_attention_heads
self.attention_key_size = attention_key_size or self.attention_head_size
self.intermediate_size = intermediate_size
self.dropout_rate = dropout_rate or 0
self.attention_probs_dropout_prob = attention_probs_dropout_prob or 0
self.hidden_act = hidden_act
self.embedding_size = embedding_size or hidden_size
self.initializer_range = initializer_range
self.sequence_length = sequence_length
self.keep_tokens = keep_tokens
self.compound_tokens = compound_tokens
self.attention_bias = None
self.position_bias = None
self.attention_scores = None
self.residual_attention_scores = residual_attention_scores
self.ignore_invalid_weights = ignore_invalid_weights
self.keep_hidden_layers = set(range(num_hidden_layers)) if keep_hidden_layers is None else set(keep_hidden_layers)
self.hierarchical_position = hierarchical_position
def build(
self,
attention_caches=None,
layer_norm_cond=None,
layer_norm_cond_hidden_size=None,
layer_norm_cond_hidden_act=None,
additional_input_layers=None,
**kwargs
):
"""模型构建函数
attention_caches: 为Attention的K,V的缓存序列字典,格式为{Attention层名: [K缓存, V缓存]};
layer_norm_*系列参数: 实现Conditional Layer Normalization时使用,用来实现以“固定长度向量”为条件的条件Bert。
"""
# additional_input
# if additional_input_layers is not None:
# if not isinstance(additional_input_layers, list):
# self.additional_input_layers = [additional_input_layers]
# else:
# self.additional_input_layers = additional_input_layers
# Other
self.attention_caches = attention_caches or {}
# self.layer_norm_conds = [
# layer_norm_cond,
# layer_norm_cond_hidden_size,
# layer_norm_cond_hidden_act or 'linear',
# ]
self.output_all_encoded_layers = kwargs.get('output_all_encoded_layers', False)
def forward(self, inputs):
"""定义模型的执行流程
"""
# Embedding
outputs = self.apply_embeddings(inputs)
# Main
outputs = self.apply_main_layers(outputs)
# Final
outputs = self.apply_final_layers(outputs)
return outputs
def init_model_weights(self, module):
""" 初始化权重
"""
if isinstance(module, (nn.Linear, nn.Embedding)) and (module.weight.requires_grad):
# bert参数初始化, tf版本在linear和Embedding层使用的是截断正太分布, pytorch没有实现该函数,
# 此种初始化对于加载预训练模型后进行finetune没有任何影响,
# cf https://github.com/pytorch/pytorch/pull/5617
# 固定的相对位置编码如Sinusoidal无需初始化
module.weight.data.normal_(mean=0.0, std=self.initializer_range)
elif isinstance(module, LayerNorm):
if hasattr(module, 'bias') and module.bias.requires_grad:  # models such as T5 use rmsnorm
module.bias.data.zero_()
if hasattr(module, 'weight') and module.weight.requires_grad:
module.weight.data.fill_(1.0)
if isinstance(module, nn.Linear) and (module.bias is not None) and (module.bias.requires_grad):
module.bias.data.zero_()
def variable_mapping(self):
"""构建pytorch层与checkpoint的变量名之间的映射表
"""
return {}
def load_variable(self):
raise NotImplementedError
def load_embeddings(self, embeddings):
"""根据keep_tokens和compound_tokens对embedding进行修改
"""
if self.keep_tokens is not None:
embeddings = embeddings[self.keep_tokens]
if self.compound_tokens is not None:
ext_embeddings = []
for item in self.compound_tokens:
try:
ext_embeddings.append(torch.mean(embeddings[item], 0) * torch.ones_like(embeddings[item]))
except IndexError:
ext_embeddings.append(torch.mean(embeddings, 0, keepdim=True))
warnings.warn('Initialize ext_embeddings from compound_tokens not in embedding index')
embeddings = torch.cat([embeddings] + ext_embeddings, 0)
return embeddings
def load_pos_embeddings(self, embeddings):
"""根据hierarchical_position对pos_embedding进行修改
"""
if self.hierarchical_position is not None:
alpha = 0.4 if self.hierarchical_position is True else self.hierarchical_position
embeddings = embeddings - alpha * embeddings[:1]
embeddings = embeddings / (1 - alpha)
position_index = torch.arange(self.max_position)[:, None]
# compatibility with older pytorch versions that lack take_along_dim
embeddings_x = take_along_dim(embeddings, torch.div(position_index, embeddings.size(0), rounding_mode='trunc'), dim=0)
embeddings_y = take_along_dim(embeddings, position_index % embeddings.size(0), dim=0)
embeddings = alpha * embeddings_x + (1 - alpha) * embeddings_y
return embeddings
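# Worked sketch of the hierarchical decomposition above, derived from the code: a position p beyond the
# trained length n = embeddings.size(0) is split as p = x * n + y (x = p // n, y = p % n), and its embedding
# is alpha * u[x] + (1 - alpha) * u[y], where u = (embeddings - alpha * embeddings[:1]) / (1 - alpha)
# and alpha defaults to 0.4.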
def load_weights_from_pytorch_checkpoint(self, checkpoint, mapping=None):
"""根据mapping从checkpoint加载权重
"""
file_state_dict = torch.load(checkpoint, map_location='cpu') # 加载模型文件
mapping = mapping or self.variable_mapping()
parameters_set = set([i[0] for i in self.named_parameters()]) # 可更新的变量
# 如果模型文件和模型结构中同时存在,且不在预设的mapping中,则更新mapping
# 主要是如为了在外部继承BERT后有其他layer,也能自动从checkpoint中加载进来
for layer_name in parameters_set:
if (layer_name in file_state_dict) and (layer_name not in mapping):
mapping.update({layer_name: layer_name})
state_dict_new ={}
for new_key, old_key in mapping.items():
if new_key not in self.state_dict():
continue
elif old_key in file_state_dict:  # in the mapping and present in the checkpoint
state_dict_new[new_key] = self.load_variable(file_state_dict, old_key)
elif (old_key not in file_state_dict) and (not self.ignore_invalid_weights):
# in the mapping, but missing from the model file
print(f'[WARNING] {old_key} not found in pretrain models')
if new_key in parameters_set:
parameters_set.remove(new_key)
# parameters that did not receive pretrained weights
if not self.ignore_invalid_weights:
for key in parameters_set:
print(f'[WARNING] Parameter {key} not loaded from pretrain models')
del file_state_dict
# load the checkpoint weights into the model
self.load_state_dict(state_dict_new, strict=False)
# def get_inputs(self):
# pass
# def set_inputs(self, inputs, additional_input_layers=None):
# """设置input和inputs属性
# """
# pass
def apply_embeddings(self, inputs):
raise NotImplementedError
def apply_main_layers(self, inputs):
raise NotImplementedError
def apply_final_layers(self, inputs):
raise NotImplementedError
def apply_on_layer_begin(self, l_i, inputs):
'''Hook applied to the inputs of each layer block
'''
return inputs
def apply_on_layer_end(self, l_i, inputs):
'''Hook applied to the outputs of each layer block
'''
return inputs
def compute_attention_bias(self, inputs=None):
"""定义每一层的Attention Bias
"""
return self.attention_bias
def compute_position_bias(self, inputs=None):
"""定义每一层的Position Bias(一般相对位置编码用)
"""
return self.position_bias
def set_outputs(self, outputs):
"""设置output和oututs属性
"""
if not isinstance(outputs, list):
outputs = [outputs]
outputs = outputs[:]
self.outputs = outputs
if len(outputs) > 1:
self.output = outputs
else:
self.output = outputs[0]
class LM_Mask(object):
"""定义下三角Attention Mask(语言模型用)
"""
def compute_attention_bias(self, inputs=None):
"""通过idxs序列的比较来得到对应的mask
"""
seq_len = inputs[0].shape[1]
attention_bias = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.long, device=inputs[0].device), diagonal=0)
self.attention_bias = attention_bias.unsqueeze(0).unsqueeze(1)
return self.attention_bias
def extend_with_language_model(InputModel):
"""添加下三角的Attention Mask(语言模型用)
"""
class LanguageModel(LM_Mask, InputModel):
"""带下三角Attention Mask的派生模型
"""
def __init__(self, *args, **kwargs):
kwargs['with_mlm'] = kwargs.get('with_mlm') or True
super(LanguageModel, self).__init__(*args, **kwargs)
return LanguageModel
class UniLM_Mask(object):
"""定义UniLM的Attention Mask(Seq2Seq模型用)
其中source和target的分区,由segment_ids来表示。
UniLM: https://arxiv.org/abs/1905.03197
"""
def compute_attention_bias(self, inputs=None):
"""通过idxs序列的比较来得到对应的mask
"""
segment_ids = inputs[1]
attention_bias = torch.cumsum(segment_ids, dim=1)
attention_bias = (attention_bias.unsqueeze(1)) <= (attention_bias.unsqueeze(2))
self.attention_bias = attention_bias.unsqueeze(1).long()
return self.attention_bias
def extend_with_unified_language_model(InputModel):
"""添加UniLM的Attention Mask(Seq2Seq模型用)
"""
class UnifiedLanguageModel(UniLM_Mask, InputModel):
"""带UniLM的Attention Mask的派生模型
UniLM: https://arxiv.org/abs/1905.03197
"""
def __init__(self, *args, **kwargs):
kwargs['with_mlm'] = kwargs.get('with_mlm') or True
super(UnifiedLanguageModel, self).__init__(*args, **kwargs)
return UnifiedLanguageModel
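# Illustrative sketch (assumption): the two wrappers are applied to a model class before instantiation, e.g.
#   CausalLM = extend_with_language_model(BERT)         # lower-triangular mask, forces with_mlm
#   Seq2Seq = extend_with_unified_language_model(BERT)  # UniLM mask derived from segment_ids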
class BERT(BERT_BASE):
"""构建BERT模型
"""
def __init__(
self,
max_position,  # maximum sequence length
segment_vocab_size=2,  # number of segment types
with_pool=False,  # include the pooler
with_nsp=False,  # include the NSP head
with_mlm=False,  # include the MLM head
custom_position_ids=False,  # pass position ids manually
custom_attention_mask=False,  # pass the attention_mask manually
shared_segment_embeddings=False,  # if True, segments share embeddings with tokens
layer_norm_cond=None,  # conditional layer_norm
layer_add_embs=None,  # additional embeddings, e.g. POS tags, tones, or word-level custom embeddings
is_dropout=False,
token_pad_ids=0,  # 0 is the default padding id, but note that google's mt5 does not pad with 0
**kwargs  # remaining arguments
):
super(BERT, self).__init__(**kwargs)
self.max_position = max_position
self.segment_vocab_size = segment_vocab_size
self.with_pool = with_pool
self.with_nsp = with_nsp
self.with_mlm = with_mlm
self.custom_position_ids = custom_position_ids
self.custom_attention_mask = custom_attention_mask
self.shared_segment_embeddings = shared_segment_embeddings
self.is_dropout = is_dropout
self.token_pad_ids = token_pad_ids
if self.with_nsp and not self.with_pool:
self.with_pool = True
self.layer_norm_conds = layer_norm_cond
self.layer_add_embs = layer_add_embs
self.conditional_size = layer_norm_cond.weight.size(1) if layer_norm_cond is not None else None
self.embeddings = BertEmbeddings(self.vocab_size, self.embedding_size, self.hidden_size, self.max_position, self.segment_vocab_size, self.shared_segment_embeddings,
self.dropout_rate, self.conditional_size, **get_kw(BertEmbeddings, kwargs))
kwargs['max_position'] = self.max_position  # needed by relative position encodings
layer = BertLayer(self.hidden_size, self.num_attention_heads, self.dropout_rate, self.attention_probs_dropout_prob, self.intermediate_size, self.hidden_act,
is_dropout=self.is_dropout, conditional_size=self.conditional_size, **get_kw(BertLayer, kwargs))
self.encoderLayer = nn.ModuleList([copy.deepcopy(layer) if layer_id in self.keep_hidden_layers else Identity() for layer_id in range(self.num_hidden_layers)])
if self.with_pool:
# pooler (extracts the CLS vector)
self.pooler = nn.Linear(self.hidden_size, self.hidden_size)
self.pooler_activation = nn.Tanh() if self.with_pool is True else get_activation(self.with_pool)
if self.with_nsp:
# Next Sentence Prediction head
# nsp takes pooled_output as input, so with_pool=True is a prerequisite for nsp
self.nsp = nn.Linear(self.hidden_size, 2)
else:
self.pooler = None
self.pooler_activation = None
if self.with_mlm:
self.mlmDense = nn.Linear(self.hidden_size, self.hidden_size)
self.transform_act_fn = get_activation(self.hidden_act)
self.mlmLayerNorm = LayerNorm(self.hidden_size, eps=1e-12, conditional_size=self.conditional_size)
self.mlmDecoder = nn.Linear(self.hidden_size, self.vocab_size, bias=False)
if kwargs.get('tie_emb_prj_weight') is True:
self.mlmDecoder.weight = self.embeddings.word_embeddings.weight
self.mlmBias = nn.Parameter(torch.zeros(self.vocab_size))
self.mlmDecoder.bias = self.mlmBias
# subclasses of BERT may declare new parameters, which cannot all be initialized here
def apply_embeddings(self, inputs):
"""BERT的embedding是token、position、segment三者embedding之和
默认顺序是token_ids, segment_ids(若有), position_ids(若有), custom_attention_mask(若有), conditional_input(若有)
"""
assert isinstance(inputs, (list, tuple)), f'Inputs only support list,tuple format but passed {type(inputs)}'
token_ids = inputs[0]
index_ = 1
if self.segment_vocab_size > 0:
segment_ids = inputs[index_]
index_ += 1
else:
segment_ids = None
if self.custom_position_ids:  # not used yet; kept for now
position_ids = inputs[index_]
index_ += 1
else:
position_ids = None
# build a 3D attention mask from token_ids with shape [batch_size, 1, 1, to_seq_length],
# so it can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length] for multi-head attention
if self.custom_attention_mask:
attention_mask = inputs[index_].long().unsqueeze(1).unsqueeze(2)
index_ += 1
elif (not token_ids.requires_grad) and (token_ids.dtype in {torch.long, torch.int}):  # regular token_ids
attention_mask = (token_ids != self.token_pad_ids).long().unsqueeze(1).unsqueeze(2)  # 0 is the default mask value
if self.token_pad_ids < 0:
token_ids = token_ids * attention_mask[:,0,0,:]
else:  # custom word_embedding; currently only used by VAT
attention_mask = self.attention_mask_cache
self.attention_mask_cache = attention_mask  # cache the attention_mask used last time
self.compute_attention_bias([token_ids, segment_ids])  # adjust the mask for lm or unilm as needed
if self.attention_bias is not None:
attention_mask = attention_mask * self.attention_bias  # padding is not attendable
# attention_mask = self.attention_bias  # padding is attendable
# pytorch >= 1.5 may raise a StopIteration error here
# https://github.com/huggingface/transformers/issues/3936
# https://github.com/huggingface/transformers/issues/4189
try:
attention_mask = attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
except StopIteration:
attention_mask = attention_mask.to(dtype=torch.float32)
# turn zeros in the mask into large negative numbers so masked positions get near-zero scores after softmax
# attention_mask = (1.0 - attention_mask) * -10000.0
# conditional layer_norm
if self.layer_norm_conds is None:
conditional_emb = None
else:
conditional_emb = self.layer_norm_conds(inputs[index_])
index_ += 1
# additional embeddings, e.g. POS tags, tones, or word-level custom embeddings
if isinstance(self.layer_add_embs, nn.Module):  # a single module
additional_embs = [self.layer_add_embs(inputs[index_])]
index_ += 1
elif isinstance(self.layer_add_embs, (tuple, list)):  # multiple modules
additional_embs = []
for layer in self.layer_add_embs:
assert isinstance(layer, nn.Module), 'Layer_add_embs element should be nn.Module'
additional_embs.append(layer(inputs[index_]))
index_ += 1
else:
additional_embs = None
# run the embedding layer
hidden_states = self.embeddings(token_ids, segment_ids, conditional_emb, additional_embs)
return [hidden_states, attention_mask, conditional_emb] + inputs[index_:]
def apply_main_layers(self, inputs):
"""BERT的主体是基于Self-Attention的模块
顺序:Att --> Add --> LN --> FFN --> Add --> LN
默认第一个是hidden_states, 第二个是attention_mask, 第三个是conditional_emb
"""
hidden_states, attention_mask, conditional_emb = inputs[:3]
if len(inputs[3:]) >= 2:
encoder_hidden_state, encoder_attention_mask = inputs[3], inputs[4]
else:
encoder_hidden_state, encoder_attention_mask = None, None
encoded_layers = [hidden_states]  # include the embedding output
layer_inputs = [hidden_states, attention_mask, conditional_emb, encoder_hidden_state, encoder_attention_mask]
for l_i, layer_module in enumerate(self.encoderLayer):
layer_inputs = self.apply_on_layer_begin(l_i, layer_inputs)
hidden_states = layer_module(*layer_inputs)
layer_inputs[0] = hidden_states
layer_inputs = self.apply_on_layer_end(l_i, layer_inputs)
if self.output_all_encoded_layers:
encoded_layers.append(hidden_states)
if not self.output_all_encoded_layers:
encoded_layers.append(hidden_states)
return [encoded_layers, conditional_emb]
def apply_final_layers(self, inputs):
"""根据剩余参数决定输出
"""
# 获取最后一层隐藏层的输出
encoded_layers, conditional_emb = inputs
sequence_output = encoded_layers[-1]
# 是否取最后一层输出
if not self.output_all_encoded_layers:
encoded_layers = encoded_layers[-1]
# 是否添加pool层
if self.with_pool:
pooled_output = self.pooler_activation(self.pooler(sequence_output[:, 0]))
else:
pooled_output = None
# whether to add nsp
if self.with_pool and self.with_nsp:
nsp_scores = self.nsp(pooled_output)
else:
nsp_scores = None
# whether to add mlm
if self.with_mlm:
mlm_hidden_state = self.mlmDense(sequence_output)
mlm_hidden_state = self.transform_act_fn(mlm_hidden_state)
mlm_hidden_state = self.mlmLayerNorm((mlm_hidden_state, conditional_emb))
mlm_scores = self.mlmDecoder(mlm_hidden_state)
mlm_activation = get_activation('linear' if self.with_mlm is True else self.with_mlm)
mlm_scores = mlm_activation(mlm_scores)
else:
mlm_scores = None
outputs = [value for value in [encoded_layers, pooled_output, mlm_scores, nsp_scores] if value is not None]
return outputs if len(outputs) > 1 else outputs[0]
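# Sketch of the resulting outputs, derived from the filtering above:
#   with_pool=False, with_mlm=False -> encoded_layers
#   with_pool=True                  -> [encoded_layers, pooled_output]
#   with_pool=True,  with_nsp=True  -> [encoded_layers, pooled_output, nsp_scores]
#   with_mlm=True additionally inserts mlm_scores before nsp_scores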
def load_variable(self, state_dict, name, prefix='bert'):
"""加载单个变量的函数
"""
variable = state_dict[name]
if name in {
f'{prefix}.embeddings.word_embeddings.weight',
'cls.predictions.bias',
'cls.predictions.decoder.weight',
'cls.predictions.decoder.bias'
}:
return self.load_embeddings(variable)
elif name == f'{prefix}.embeddings.position_embeddings.weight':
return self.load_pos_embeddings(variable)
elif name == 'cls.seq_relationship.weight':
return variable.T
else:
return variable
def variable_mapping(self, prefix='bert'):
mapping = {
'embeddings.word_embeddings.weight': f'{prefix}.embeddings.word_embeddings.weight',
'embeddings.position_embeddings.weight': f'{prefix}.embeddings.position_embeddings.weight',
'embeddings.segment_embeddings.weight': f'{prefix}.embeddings.token_type_embeddings.weight',
'embeddings.layerNorm.weight': f'{prefix}.embeddings.LayerNorm.weight',
'embeddings.layerNorm.bias': f'{prefix}.embeddings.LayerNorm.bias',
'pooler.weight': f'{prefix}.pooler.dense.weight',
'pooler.bias': f'{prefix}.pooler.dense.bias',
'nsp.weight': 'cls.seq_relationship.weight',
'nsp.bias': 'cls.seq_relationship.bias',
'mlmDense.weight': 'cls.predictions.transform.dense.weight',
'mlmDense.bias': 'cls.predictions.transform.dense.bias',
'mlmLayerNorm.weight': 'cls.predictions.transform.LayerNorm.weight',
'mlmLayerNorm.bias': 'cls.predictions.transform.LayerNorm.bias',
'mlmBias': 'cls.predictions.bias',
'mlmDecoder.weight': 'cls.predictions.decoder.weight',
'mlmDecoder.bias': 'cls.predictions.decoder.bias'
}
for i in range(self.num_hidden_layers):
prefix_i = f'{prefix}.encoder.layer.{i}.'
mapping.update({f'encoderLayer.{i}.multiHeadAttention.q.weight': prefix_i + 'attention.self.query.weight',
f'encoderLayer.{i}.multiHeadAttention.q.bias': prefix_i + 'attention.self.query.bias',
f'encoderLayer.{i}.multiHeadAttention.k.weight': prefix_i + 'attention.self.key.weight',
f'encoderLayer.{i}.multiHeadAttention.k.bias': prefix_i + 'attention.self.key.bias',
f'encoderLayer.{i}.multiHeadAttention.v.weight': prefix_i + 'attention.self.value.weight',
f'encoderLayer.{i}.multiHeadAttention.v.bias': prefix_i + 'attention.self.value.bias',
f'encoderLayer.{i}.multiHeadAttention.o.weight': prefix_i + 'attention.output.dense.weight',
f'encoderLayer.{i}.multiHeadAttention.o.bias': prefix_i + 'attention.output.dense.bias',
f'encoderLayer.{i}.layerNorm1.weight': prefix_i + 'attention.output.LayerNorm.weight',
f'encoderLayer.{i}.layerNorm1.bias': prefix_i + 'attention.output.LayerNorm.bias',
f'encoderLayer.{i}.feedForward.intermediateDense.weight': prefix_i + 'intermediate.dense.weight',
f'encoderLayer.{i}.feedForward.intermediateDense.bias': prefix_i + 'intermediate.dense.bias',
f'encoderLayer.{i}.feedForward.outputDense.weight': prefix_i + 'output.dense.weight',
f'encoderLayer.{i}.feedForward.outputDense.bias': prefix_i + 'output.dense.bias',
f'encoderLayer.{i}.layerNorm2.weight': prefix_i + 'output.LayerNorm.weight',
f'encoderLayer.{i}.layerNorm2.bias': prefix_i + 'output.LayerNorm.bias'
})
return mapping
class ALBERT(BERT):
def __init__(self, *args, **kwargs):
super(ALBERT, self).__init__(*args, **kwargs)
self.encoderLayer = nn.ModuleList([self.encoderLayer[0]])  # keep only the first layer above; ALBERT shares weights across layers
def apply_main_layers(self, inputs):
"""BERT的主体是基于Self-Attention的模块
顺序:Att --> Add --> LN --> FFN --> Add --> LN
"""
hidden_states, attention_mask, conditional_emb = inputs[:3]
if len(inputs[3:]) >= 2:
encoder_hidden_state, encoder_attention_mask = inputs[3], inputs[4]
else:
encoder_hidden_state, encoder_attention_mask = None, None
encoded_layers = [hidden_states]  # include the embedding output
layer_inputs = [hidden_states, attention_mask, conditional_emb, encoder_hidden_state, encoder_attention_mask]
for l_i in range(self.num_hidden_layers):
layer_inputs = self.apply_on_layer_begin(l_i, layer_inputs)
hidden_states = self.encoderLayer[0](*layer_inputs)
layer_inputs[0] = hidden_states
layer_inputs = self.apply_on_layer_end(l_i, layer_inputs)
if self.output_all_encoded_layers:
encoded_layers.append(hidden_states)
if not self.output_all_encoded_layers:
encoded_layers.append(hidden_states)
return [encoded_layers, conditional_emb]
def variable_mapping(self, prefix='albert'):
mapping = {
'embeddings.word_embeddings.weight': f'{prefix}.embeddings.word_embeddings.weight',
'embeddings.position_embeddings.weight': f'{prefix}.embeddings.position_embeddings.weight',
'embeddings.segment_embeddings.weight': f'{prefix}.embeddings.token_type_embeddings.weight',
'embeddings.layerNorm.weight': f'{prefix}.embeddings.LayerNorm.weight',
'embeddings.layerNorm.bias': f'{prefix}.embeddings.LayerNorm.bias',
'embeddings.embedding_hidden_mapping_in.weight': f'{prefix}.encoder.embedding_hidden_mapping_in.weight',
'embeddings.embedding_hidden_mapping_in.bias': f'{prefix}.encoder.embedding_hidden_mapping_in.bias',
'pooler.weight': f'{prefix}.pooler.weight',
'pooler.bias': f'{prefix}.pooler.bias',
'nsp.weight': 'sop_classifier.classifier.weight',  # the name nsp is reused for sop
'nsp.bias': 'sop_classifier.classifier.bias',
'mlmDense.weight': 'predictions.dense.weight',
'mlmDense.bias': 'predictions.dense.bias',
'mlmLayerNorm.weight': 'predictions.LayerNorm.weight',
'mlmLayerNorm.bias': 'predictions.LayerNorm.bias',
'mlmBias': 'predictions.bias',
'mlmDecoder.weight': 'predictions.decoder.weight',
'mlmDecoder.bias': 'predictions.decoder.bias'
}
i = 0
prefix_i = f'{prefix}.encoder.albert_layer_groups.{i}.albert_layers.{i}.'
mapping.update({f'encoderLayer.{i}.multiHeadAttention.q.weight': prefix_i + 'attention.query.weight',
f'encoderLayer.{i}.multiHeadAttention.q.bias': prefix_i + 'attention.query.bias',
f'encoderLayer.{i}.multiHeadAttention.k.weight': prefix_i + 'attention.key.weight',
f'encoderLayer.{i}.multiHeadAttention.k.bias': prefix_i + 'attention.key.bias',
f'encoderLayer.{i}.multiHeadAttention.v.weight': prefix_i + 'attention.value.weight',
f'encoderLayer.{i}.multiHeadAttention.v.bias': prefix_i + 'attention.value.bias',
f'encoderLayer.{i}.multiHeadAttention.o.weight': prefix_i + 'attention.dense.weight',
f'encoderLayer.{i}.multiHeadAttention.o.bias': prefix_i + 'attention.dense.bias',
f'encoderLayer.{i}.layerNorm1.weight': prefix_i + 'attention.LayerNorm.weight',
f'encoderLayer.{i}.layerNorm1.bias': prefix_i + 'attention.LayerNorm.bias',
f'encoderLayer.{i}.feedForward.intermediateDense.weight': prefix_i + 'ffn.weight',
f'encoderLayer.{i}.feedForward.intermediateDense.bias': prefix_i + 'ffn.bias',
f'encoderLayer.{i}.feedForward.outputDense.weight': prefix_i + 'ffn_output.weight',
f'encoderLayer.{i}.feedForward.outputDense.bias': prefix_i + 'ffn_output.bias',
f'encoderLayer.{i}.layerNorm2.weight': prefix_i + 'full_layer_layer_norm.weight',
f'encoderLayer.{i}.layerNorm2.bias': prefix_i + 'full_layer_layer_norm.bias'
})
return mapping
def load_variable(self, state_dict, name):
"""加载单个变量的函数
"""
variable = state_dict[name]
if name in {
'albert.embeddings.word_embeddings.weight',
'predictions.bias',
'predictions.decoder.weight',
'predictions.decoder.bias'
}:
return self.load_embeddings(variable)
elif name == 'albert.embeddings.position_embeddings.weight':
return self.load_pos_embeddings(variable)
elif name == 'sop_classifier.classifier.weight':
return variable.T
else:
return variable
class ALBERT_Unshared(ALBERT):
def __init__(self, *args, **kwargs):
super(ALBERT_Unshared, self).__init__(*args, **kwargs)
self.encoderLayer = nn.ModuleList([copy.deepcopy(self.encoderLayer[0]) for _ in range(self.num_hidden_layers)])
def apply_main_layers(self, inputs):
"""BERT的主体是基于Self-Attention的模块
顺序:Att --> Add --> LN --> FFN --> Add --> LN
"""
hidden_states, attention_mask, conditional_emb = inputs
if len(inputs[3:]) >= 2:
encoder_hidden_state, encoder_attention_mask = inputs[3], inputs[4]
else:
encoder_hidden_state, encoder_attention_mask = None, None
encoded_layers = [hidden_states]  # include the embedding output
layer_inputs = [hidden_states, attention_mask, conditional_emb, encoder_hidden_state, encoder_attention_mask]
for i in range(self.num_hidden_layers):
layer_inputs = self.apply_on_layer_begin(i, layer_inputs)
hidden_states = self.encoderLayer[i](*layer_inputs)
layer_inputs[0] = hidden_states
layer_inputs = self.apply_on_layer_end(i, layer_inputs)
if self.output_all_encoded_layers:
encoded_layers.append(hidden_states)
if not self.output_all_encoded_layers:
encoded_layers.append(hidden_states)
return [encoded_layers, conditional_emb]
class NEZHA(BERT):
"""华为推出的NAZHA模型
链接:https://arxiv.org/abs/1909.00204
"""
def __init__(self, *args, **kwargs):
# p_bias disables pos_embedding at the embedding stage; max_relative_position defaults to 64
kwargs.update({'p_bias': 'typical_relative', 'max_relative_position': kwargs.get('max_relative_position', 64)})
super(NEZHA, self).__init__(*args, **kwargs)
class RoFormer(BERT):
"""旋转式位置编码的BERT模型
链接:https://kexue.fm/archives/8265
"""
def __init__(self, *args, **kwargs):
kwargs.update({'p_bias': 'rotary'})
super(RoFormer, self).__init__(*args, **kwargs)
def load_variable(self, state_dict, name, prefix='roformer'):
return super().load_variable(state_dict, name, prefix)
def variable_mapping(self, prefix='roformer'):
mapping = super().variable_mapping(prefix)
del mapping['embeddings.position_embeddings.weight']  # no position embeddings
return mapping
class RoFormerV2(RoFormer):
"""RoFormerV2
改动:去掉bias,简化Norm,优化初始化等。目前初始化暂时还用的bert的初始化,finetune不受影响
"""
@delete_arguments('with_pool', 'with_nsp')
def __init__(self, *args, **kwargs):
kwargs.update({'p_bias': 'rotary', 'weight': False, 'bias': False, 'norm_mode': 'rmsnorm'})
super(RoFormerV2, self).__init__(*args, **kwargs)
if self.with_mlm:
del self.mlmLayerNorm
del self.mlmBias
del self.mlmDense
self.mlmDecoder.register_parameter('bias', None)
def variable_mapping(self, prefix='roformer'):
mapping = super().variable_mapping(prefix)
mapping_new = {}
for k, v in mapping.items():
if (not re.search('bias|layernorm', k.lower())) and (not re.search('bias|layernorm', v.lower())):
mapping_new[k] = v
return mapping_new
def apply_final_layers(self, inputs):
"""根据剩余参数决定输出
"""
# 获取最后一层隐藏层的输出
encoded_layers, conditional_emb = inputs
sequence_output = encoded_layers[-1]
# whether to keep only the last layer's output
if not self.output_all_encoded_layers:
encoded_layers = encoded_layers[-1]
# whether to add mlm
if self.with_mlm:
mlm_scores = self.mlmDecoder(sequence_output)
else:
mlm_scores = None
outputs = [value for value in [encoded_layers, mlm_scores] if value is not None]
return outputs if len(outputs) > 1 else outputs[0]
class GAU_alpha(RoFormerV2):
def __init__(self, *args, **kwargs):
kwargs.update({'p_bias': 'rotary', 'weight': False, 'bias': False, 'norm_mode': 'rmsnorm', 'normalization': 'softmax_plus'})
super().__init__(*args, **kwargs)
layer = self.GAU_Layer(**kwargs)
self.encoderLayer = nn.ModuleList([copy.deepcopy(layer) if layer_id in self.keep_hidden_layers else Identity() for layer_id in range(self.num_hidden_layers)])
def load_variable(self, state_dict, name, prefix=''):
variable = state_dict[name]
return self.load_embeddings(variable) if name in {'embeddings.word_embeddings.weight', 'mlmDecoder.weight'} else variable
def variable_mapping(self, prefix=''):
'''The convert script has already renamed the keys into the bert4torch format
'''
return {k: k for k, _ in self.named_parameters()}
class GAU_Layer(nn.Module):
def __init__(self, *args, **kwargs):
super().__init__()
self.gau = GatedAttentionUnit(**kwargs)
self.dropout1 = nn.Dropout(kwargs.get('dropout_rate'))
self.layerNorm1 = LayerNorm(**kwargs)
def forward(self, hidden_states, attention_mask, conditional_emb=None, encoder_hidden_states=None, encoder_attention_mask=None):
gau_hidden_states = self.gau(hidden_states, attention_mask)
hidden_states = hidden_states + self.dropout1(gau_hidden_states)
hidden_states = self.layerNorm1((hidden_states, conditional_emb))
return hidden_states
class ELECTRA(BERT):
"""Google推出的ELECTRA模型
链接:https://arxiv.org/abs/2003.10555
"""
@insert_arguments(with_discriminator=False)
@delete_arguments('with_pool', 'with_mlm', 'with_nsp')
def __init__(self, max_position, **kwargs):
super(ELECTRA, self).__init__(max_position, **kwargs)
if self.with_discriminator:
self.dense = nn.Linear(self.hidden_size, self.hidden_size)
self.dense_act = get_activation(self.hidden_act)
self.dense_prediction = nn.Linear(self.hidden_size, 1)
self.dense_prediction_act = get_activation('sigmoid') if self.with_discriminator is True else get_activation(self.with_discriminator)
def apply_final_layers(self, inputs):
hidden_states = super().apply_final_layers(inputs)  # the only output is hidden_states
if self.with_discriminator:
logit = self.dense_act(self.dense(hidden_states))
return [hidden_states, self.dense_prediction_act(self.dense_prediction(logit))]
else:
return hidden_states
def load_variable(self, state_dict, name):
"""加载单个变量的函数
"""
return super().load_variable(state_dict, name, prefix='electra')
def variable_mapping(self):
mapping = super(ELECTRA, self).variable_mapping(prefix='electra')
mapping.update({'dense.weight': 'discriminator_predictions.dense.weight',
'dense.bias': 'discriminator_predictions.dense.bias',
'dense_prediction.weight': 'discriminator_predictions.dense_prediction.weight',
'dense_prediction.bias': 'discriminator_predictions.dense_prediction.bias'}
)
for del_key in ['pooler.weight', 'pooler.bias', 'nsp.weight', 'nsp.bias', 'mlmDense.weight', 'mlmDense.bias',
'mlmLayerNorm.weight', 'mlmLayerNorm.bias', 'mlmBias', 'mlmDecoder.weight', 'mlmDecoder.bias']:
del mapping[del_key]
return mapping
class ERNIE(BERT):
"""百度文心 https://github.com/PaddlePaddle/ERNIE
"""
def __init__(self, *args, **kwargs):
super(ERNIE, self).__init__(*args, **kwargs)
def variable_mapping(self):
mapping = super(ERNIE, self).variable_mapping(prefix='ernie')
mapping.update({'mlmDecoder.weight': 'ernie.embeddings.word_embeddings.weight',
'mlmDecoder.bias': 'cls.predictions.bias'})
for k, v in mapping.items():
if ('LayerNorm.weight' in v) or ('LayerNorm.bias' in v):
v1 = v.replace('.weight', '.gamma').replace('.bias', '.beta')
mapping[k] = v1
for del_key in ['nsp.weight', 'nsp.bias']:
del mapping[del_key]
return mapping
def load_variable(self, state_dict, name, prefix='ernie'):
return super().load_variable(state_dict, name, prefix=prefix)
class Encoder(BERT):
def __init__(self, *args, **kwargs):
kwargs['vocab_size'] = kwargs.get('src_vocab_size', kwargs['vocab_size'])
super().__init__(*args, **kwargs)
# the encoder also needs to return encoder_attention_mask
self.encoder_attention_mask = None
def forward(self, inputs):
"""因为encoder需要返回encoder_attention_mask,因此这里从新定义一下,多返回一个参数
"""
# Embedding
outputs = self.apply_embeddings(inputs)
encoder_attention_mask = [outputs[1]]
# Main
outputs = self.apply_main_layers(outputs)
# Final
outputs = self.apply_final_layers(outputs)
return ([outputs] if isinstance(outputs, torch.Tensor) else outputs) + encoder_attention_mask
class Decoder(LM_Mask, BERT):
@delete_arguments('with_pool', 'with_mlm', 'with_nsp')
def __init__(self, *args, with_lm=True, tie_emb_prj_weight=False, logit_scale=True, **kwargs):
kwargs['vocab_size'] = kwargs.get('tgt_vocab_size', kwargs['vocab_size'])
kwargs['is_decoder'] = True  # mark this as a decoder
super().__init__(*args, **kwargs)
self.decoderLayer = self.encoderLayer
del self.encoderLayer
self.with_lm = with_lm
# map hidden_states to logits
if self.with_lm:
self.final_dense = nn.Linear(self.hidden_size, self.vocab_size, bias=False)
# share the decoder's bottom embedding with its top projection
# [True]: fudan_bart and uer_t5's t5; [False]: mt5 and t5_pegasus
if tie_emb_prj_weight:
self.final_dense.weight = self.embeddings.word_embeddings.weight
if logit_scale:  # T5 uses logit_scale by default, bart does not, so pass False for bart
self.x_logit_scale = (self.hidden_size ** -0.5)
else:
self.x_logit_scale = 1.
def apply_main_layers(self, inputs):
"""Dencoder主体是基于Self-Attention、Cross-Attention的模块
顺序:Att1 --> Add --> LN --> Att2 --> Add --> LN --> FFN --> Add --> LN
"""
hidden_states, attention_mask, conditional_emb, encoder_hidden_state, encoder_attention_mask = inputs[:5]
decoded_layers = [hidden_states]  # include the embedding output
layer_inputs = [hidden_states, attention_mask, conditional_emb, encoder_hidden_state, encoder_attention_mask]
for i, layer_module in enumerate(self.decoderLayer):
layer_inputs = self.apply_on_layer_begin(i, layer_inputs)
hidden_states = layer_module(*layer_inputs)
layer_inputs[0] = hidden_states
layer_inputs = self.apply_on_layer_end(i, layer_inputs)
if self.output_all_encoded_layers:
decoded_layers.append(hidden_states)
if not self.output_all_encoded_layers:
decoded_layers.append(hidden_states)
return [decoded_layers, conditional_emb]
def apply_final_layers(self, inputs):
outputs = []
hidden_states = super().apply_final_layers(inputs)  # hidden_states of the decoder's top layer [btz, seq_len, hdsz]
outputs.append(hidden_states)
if self.with_lm:
logits = self.final_dense(hidden_states) * self.x_logit_scale  # logits of shape [btz, seq_len, vocab_size]
activation = get_activation('linear' if self.with_lm is True else self.with_lm)  # apply an activation, usually linear or softmax
logits = activation(logits)
outputs.append(logits)
return outputs
def variable_mapping(self, prefix='bert'):
raw_mapping = super().variable_mapping(prefix)
mapping = {}
for k, v in raw_mapping.items():
mapping[k.replace('encoderLayer', 'decoderLayer')] = v
# for i in range(self.num_hidden_layers):
# prefix_i = f'{prefix}.encoder.layer.%d.' % i
# mapping.update({
# f'decoderLayer.{i}.crossAttention.q.weight': prefix_i + 'crossattention.self.query.weight',
# f'decoderLayer.{i}.crossAttention.q.bias': prefix_i + 'crossattention.self.query.bias',
# f'decoderLayer.{i}.crossAttention.k.weight': prefix_i + 'crossattention.self.key.weight',
# f'decoderLayer.{i}.crossAttention.k.bias': prefix_i + 'crossattention.self.key.bias',
# f'decoderLayer.{i}.crossAttention.v.weight': prefix_i + 'crossattention.self.value.weight',
# f'decoderLayer.{i}.crossAttention.v.bias': prefix_i + 'crossattention.self.value.bias',
# f'decoderLayer.{i}.crossAttention.o.weight': prefix_i + 'crossattention.output.dense.weight',
# f'decoderLayer.{i}.crossAttention.o.bias': prefix_i + 'crossattention.output.dense.bias',
# f'decoderLayer.{i}.layerNorm3.weight': prefix_i + 'crossattention.output.LayerNorm.weight',
# f'decoderLayer.{i}.layerNorm3.bias': prefix_i + 'crossattention.output.LayerNorm.bias'
# })
return mapping
class Transformer(BERT_BASE):
'''Encoder-decoder architecture
'''
@delete_arguments('with_pool', 'with_mlm', 'with_nsp')
def __init__(self, *args, tie_emb_src_tgt_weight=False, **kwargs):
super(Transformer, self).__init__(*args, **kwargs)
# encoder
self.encoder = Encoder(*args, **kwargs)
self.encoder.build(**kwargs)
# decoder
self.decoder = Decoder(*args, **kwargs)
self.decoder.build(**kwargs)
if tie_emb_src_tgt_weight:
# share the embedding weights between encoder and decoder
assert self.encoder.vocab_size == self.decoder.vocab_size, "To share word embedding, the vocab size of src/tgt shall be the same."
self.encoder.embeddings.word_embeddings.weight = self.decoder.embeddings.word_embeddings.weight
def forward(self, inputs):
"""定义模型的执行流程
"""
encoder_input, decoder_input = inputs[:2]
# encoder
# encoder_emb = self.encoder.apply_embeddings(encoder_input)
# encode_outputs = self.encoder.apply_main_layers(encoder_emb)
# encoder_hidden_state = self.encoder.apply_final_layers(encode_outputs)
# encoder_attention_mask = encoder_emb[1]
encoder_hidden_state, encoder_attention_mask = self.encoder(encoder_input)
# decoder
# decoder_emb = self.decoder.apply_embeddings(decoder_input)
# decoder_outputs = self.decoder.apply_main_layers([*decoder_emb, encoder_hidden_state, encoder_attention_mask])
# decoder_outputs = self.decoder.apply_final_layers(decoder_outputs) # [hidden_states, logits]
decoder_outputs = self.decoder(decoder_input + [encoder_hidden_state, encoder_attention_mask])
return [encoder_hidden_state] + decoder_outputs  # return both encoder_hidden_state and the decoder outputs to support multi-task scenarios
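# Illustrative sketch (assumption, with segment_vocab_size=0 so only token ids are needed):
# inputs is [encoder_input, decoder_input], each itself a list, e.g.
#   encoder_hidden, decoder_hidden, logits = model([[src_token_ids], [tgt_token_ids]])
# where logits is only present when the decoder is built with with_lm=True.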
class BART(Transformer):
'''Encoder-decoder architecture
'''
def __init__(self, *args, tie_emb_src_tgt_weight=True, **kwargs):
kwargs['logit_scale'] = kwargs.get('logit_scale', False)
kwargs['tie_emb_prj_weight'] = kwargs.get('tie_emb_prj_weight', True)
super(BART, self).__init__(*args, tie_emb_src_tgt_weight=tie_emb_src_tgt_weight, **kwargs)
self.tie_emb_src_tgt_weight = tie_emb_src_tgt_weight
def load_variable(self, state_dict, name, prefix=''):
"""加载单个变量的函数
"""
variable = state_dict[name]
if name in {
'shared.weight',
'encoder.embed_tokens.weight',
'decoder.embed_tokens.weight',
}:
return self.load_embeddings(variable)
elif name in {'encoder.embed_positions.weight', 'decoder.embed_positions.weight'}:
return self.load_pos_embeddings(variable)
else:
return variable
def variable_mapping(self, prefix=''):
# inspecting the checkpoint reveals 'shared.weight'
mapping = {
'encoder.embeddings.word_embeddings.weight': 'shared.weight' if self.tie_emb_src_tgt_weight else 'encoder.embed_tokens.weight',
'encoder.embeddings.position_embeddings.weight': 'encoder.embed_positions.weight',
'encoder.embeddings.layerNorm.weight': 'encoder.layernorm_embedding.weight',
'encoder.embeddings.layerNorm.bias': 'encoder.layernorm_embedding.bias',
'decoder.embeddings.word_embeddings.weight': 'shared.weight' if self.tie_emb_src_tgt_weight else 'decoder.embed_tokens.weight',
'decoder.embeddings.position_embeddings.weight': 'decoder.embed_positions.weight',
'decoder.embeddings.layerNorm.weight': 'decoder.layernorm_embedding.weight',
'decoder.embeddings.layerNorm.bias': 'decoder.layernorm_embedding.bias',
}
for i in range(self.num_hidden_layers):
mapping.update(
{
f'encoder.encoderLayer.{i}.multiHeadAttention.q.weight': f'encoder.layers.{i}.self_attn.q_proj.weight',
f'encoder.encoderLayer.{i}.multiHeadAttention.q.bias': f'encoder.layers.{i}.self_attn.q_proj.bias',
f'encoder.encoderLayer.{i}.multiHeadAttention.k.weight': f'encoder.layers.{i}.self_attn.k_proj.weight',
f'encoder.encoderLayer.{i}.multiHeadAttention.k.bias': f'encoder.layers.{i}.self_attn.k_proj.bias',
f'encoder.encoderLayer.{i}.multiHeadAttention.v.weight': f'encoder.layers.{i}.self_attn.v_proj.weight',
f'encoder.encoderLayer.{i}.multiHeadAttention.v.bias': f'encoder.layers.{i}.self_attn.v_proj.bias',
f'encoder.encoderLayer.{i}.multiHeadAttention.o.weight': f'encoder.layers.{i}.self_attn.out_proj.weight',
f'encoder.encoderLayer.{i}.multiHeadAttention.o.bias': f'encoder.layers.{i}.self_attn.out_proj.bias',
f'encoder.encoderLayer.{i}.layerNorm1.weight': f'encoder.layers.{i}.self_attn_layer_norm.weight',
f'encoder.encoderLayer.{i}.layerNorm1.bias': f'encoder.layers.{i}.self_attn_layer_norm.bias',
f'encoder.encoderLayer.{i}.feedForward.intermediateDense.weight': f'encoder.layers.{i}.fc1.weight',
f'encoder.encoderLayer.{i}.feedForward.intermediateDense.bias': f'encoder.layers.{i}.fc1.bias',
f'encoder.encoderLayer.{i}.feedForward.outputDense.weight': f'encoder.layers.{i}.fc2.weight',
f'encoder.encoderLayer.{i}.feedForward.outputDense.bias': f'encoder.layers.{i}.fc2.bias',
f'encoder.encoderLayer.{i}.layerNorm2.weight': f'encoder.layers.{i}.final_layer_norm.weight',
f'encoder.encoderLayer.{i}.layerNorm2.bias': f'encoder.layers.{i}.final_layer_norm.bias',
f'decoder.decoderLayer.{i}.multiHeadAttention.q.weight': f'decoder.layers.{i}.self_attn.q_proj.weight',
f'decoder.decoderLayer.{i}.multiHeadAttention.q.bias': f'decoder.layers.{i}.self_attn.q_proj.bias',
f'decoder.decoderLayer.{i}.multiHeadAttention.k.weight': f'decoder.layers.{i}.self_attn.k_proj.weight',
f'decoder.decoderLayer.{i}.multiHeadAttention.k.bias': f'decoder.layers.{i}.self_attn.k_proj.bias',
f'decoder.decoderLayer.{i}.multiHeadAttention.v.weight': f'decoder.layers.{i}.self_attn.v_proj.weight',
f'decoder.decoderLayer.{i}.multiHeadAttention.v.bias': f'decoder.layers.{i}.self_attn.v_proj.bias',
f'decoder.decoderLayer.{i}.multiHeadAttention.o.weight': f'decoder.layers.{i}.self_attn.out_proj.weight',
f'decoder.decoderLayer.{i}.multiHeadAttention.o.bias': f'decoder.layers.{i}.self_attn.out_proj.bias',
f'decoder.decoderLayer.{i}.layerNorm1.weight': f'decoder.layers.{i}.self_attn_layer_norm.weight',
f'decoder.decoderLayer.{i}.layerNorm1.bias': f'decoder.layers.{i}.self_attn_layer_norm.bias',
f'decoder.decoderLayer.{i}.crossAttention.q.weight': f'decoder.layers.{i}.encoder_attn.q_proj.weight',
f'decoder.decoderLayer.{i}.crossAttention.q.bias': f'decoder.layers.{i}.encoder_attn.q_proj.bias',
f'decoder.decoderLayer.{i}.crossAttention.k.weight': f'decoder.layers.{i}.encoder_attn.k_proj.weight',
f'decoder.decoderLayer.{i}.crossAttention.k.bias': f'decoder.layers.{i}.encoder_attn.k_proj.bias',
f'decoder.decoderLayer.{i}.crossAttention.v.weight': f'decoder.layers.{i}.encoder_attn.v_proj.weight',
f'decoder.decoderLayer.{i}.crossAttention.v.bias': f'decoder.layers.{i}.encoder_attn.v_proj.bias',
f'decoder.decoderLayer.{i}.crossAttention.o.weight': f'decoder.layers.{i}.encoder_attn.out_proj.weight',
f'decoder.decoderLayer.{i}.crossAttention.o.bias': f'decoder.layers.{i}.encoder_attn.out_proj.bias',
f'decoder.decoderLayer.{i}.layerNorm3.weight': f'decoder.layers.{i}.encoder_attn_layer_norm.weight',
f'decoder.decoderLayer.{i}.layerNorm3.bias': f'decoder.layers.{i}.encoder_attn_layer_norm.bias',
f'decoder.decoderLayer.{i}.feedForward.intermediateDense.weight': f'decoder.layers.{i}.fc1.weight',
f'decoder.decoderLayer.{i}.feedForward.intermediateDense.bias': f'decoder.layers.{i}.fc1.bias',
f'decoder.decoderLayer.{i}.feedForward.outputDense.weight': f'decoder.layers.{i}.fc2.weight',
f'decoder.decoderLayer.{i}.feedForward.outputDense.bias': f'decoder.layers.{i}.fc2.bias',
f'decoder.decoderLayer.{i}.layerNorm2.weight': f'decoder.layers.{i}.final_layer_norm.weight',
f'decoder.decoderLayer.{i}.layerNorm2.bias': f'decoder.layers.{i}.final_layer_norm.bias'
})
return mapping
class T5_Encoder(Encoder):
@insert_arguments(version='t5.1.0')
def __init__(self, *args, **kwargs):
kwargs.update({'p_bias': 't5_relative', 'relative_attention_num_buckets': kwargs.get('relative_attention_num_buckets'), 'version': self.version,
'bias': False, 'norm_mode': 'rmsnorm'})  # p_bias disables pos_embedding at the embedding stage; t5 uses no bias and uses rmsnorm
super().__init__(*args, **kwargs)
del self.embeddings.layerNorm
# t5 applies layernorm before each sublayer, so the layer is redefined here
layer = T5Layer(self.hidden_size, self.num_attention_heads, self.dropout_rate, self.attention_probs_dropout_prob, self.intermediate_size, self.hidden_act, is_dropout=self.is_dropout,
conditional_size=self.conditional_size, **get_kw(BertLayer, kwargs))
self.encoderLayer = nn.ModuleList([copy.deepcopy(layer) for _ in range(self.num_hidden_layers)])
# tie the relative position encoding weights of layers 2+ to the first layer, so effectively only the first layer computes them
for i in range(1, self.num_hidden_layers):
self.encoderLayer[i].multiHeadAttention.relative_positions_encoding.weight = self.encoderLayer[0].multiHeadAttention.relative_positions_encoding.weight
self.final_layer_norm = LayerNorm(self.hidden_size, eps=1e-12, conditional_size=self.conditional_size, bias=False, norm_mode='rmsnorm')
self.dropout = nn.Dropout(self.dropout_rate)
def apply_final_layers(self, inputs):
hidden_states = super().apply_final_layers(inputs)
return self.dropout(self.final_layer_norm([hidden_states]))
def load_variable(self, state_dict, name, prefix=''):
"""加载单个变量的函数
"""
variable = state_dict[name]
if name in {'encoder.embed_tokens.weight', 'shared.weight'}:
return self.load_embeddings(variable)
else:
return variable
def variable_mapping(self, prefix=''):
# inspecting the checkpoint reveals 'shared.weight'
mapping = {f'{prefix}embeddings.word_embeddings.weight': 'encoder.embed_tokens.weight',
f'{prefix}encoderLayer.0.multiHeadAttention.relative_positions_encoding.weight': 'encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight',
f'{prefix}final_layer_norm.weight': 'encoder.final_layer_norm.weight'}
for i in range(self.num_hidden_layers):
mapping.update(
{
f'{prefix}encoderLayer.{i}.multiHeadAttention.q.weight': f'encoder.block.{i}.layer.0.SelfAttention.q.weight',
f'{prefix}encoderLayer.{i}.multiHeadAttention.k.weight': f'encoder.block.{i}.layer.0.SelfAttention.k.weight',
f'{prefix}encoderLayer.{i}.multiHeadAttention.v.weight': f'encoder.block.{i}.layer.0.SelfAttention.v.weight',
f'{prefix}encoderLayer.{i}.multiHeadAttention.o.weight': f'encoder.block.{i}.layer.0.SelfAttention.o.weight',
f'{prefix}encoderLayer.{i}.layerNorm1.weight': f'encoder.block.{i}.layer.0.layer_norm.weight',
f'{prefix}encoderLayer.{i}.feedForward.outputDense.weight': f'encoder.block.{i}.layer.1.DenseReluDense.wo.weight',
f'{prefix}encoderLayer.{i}.layerNorm2.weight': f'encoder.block.{i}.layer.1.layer_norm.weight',
})
if self.version.endswith('t5.1.0'):
mapping.update({f'{prefix}encoderLayer.{i}.feedForward.intermediateDense.weight': f'encoder.block.{i}.layer.1.DenseReluDense.wi.weight'})
elif self.version.endswith('t5.1.1'):
mapping.update({f'{prefix}encoderLayer.{i}.feedForward.intermediateDense.weight': f'encoder.block.{i}.layer.1.DenseReluDense.wi_0.weight',
f'{prefix}encoderLayer.{i}.feedForward.intermediateDense1.weight': f'encoder.block.{i}.layer.1.DenseReluDense.wi_1.weight'})
return mapping
class T5_Decoder(Decoder):
@insert_arguments(version='t5.1.0')
def __init__(self, *args, **kwargs):
kwargs.update({'p_bias': 't5_relative', 'relative_attention_num_buckets': kwargs.get('relative_attention_num_buckets'), 'version': self.version,
'bias': False, 'norm_mode': 'rmsnorm'})  # p_bias disables pos_embedding at the embedding stage; t5 uses no bias and uses rmsnorm
super().__init__(*args, **kwargs)
del self.embeddings.layerNorm
# t5 applies layernorm before each sublayer, so the layer is redefined here
layer = T5Layer(self.hidden_size, self.num_attention_heads, self.dropout_rate, self.attention_probs_dropout_prob, self.intermediate_size, self.hidden_act, is_dropout=self.is_dropout,
conditional_size=self.conditional_size, is_decoder=True, **get_kw(BertLayer, kwargs))
self.decoderLayer = nn.ModuleList([copy.deepcopy(layer) for _ in range(self.num_hidden_layers)])
# tie the relative position encoding weights of layers 2+ to the first layer, so effectively only the first layer computes them
for i in range(1, self.num_hidden_layers):
self.decoderLayer[i].multiHeadAttention.relative_positions_encoding.weight = self.decoderLayer[0].multiHeadAttention.relative_positions_encoding.weight
self.final_layer_norm = LayerNorm(self.hidden_size, eps=1e-12, conditional_size=self.conditional_size, bias=False, norm_mode='rmsnorm')
self.dropout = nn.Dropout(self.dropout_rate)
def apply_final_layers(self, inputs):
inputs[0][1] = self.dropout(self.final_layer_norm([inputs[0][1]]))  # apply layernorm to the last layer's hidden_states before converting to logits
return super().apply_final_layers(inputs)
def load_variable(self, state_dict, name, prefix=''):
"""加载单个变量的函数
"""
variable = state_dict[name]
if name in {'decoder.embed_tokens.weight', 'lm_head.weight', 'shared.weight'}:
return self.load_embeddings(variable)
else:
return variable
def variable_mapping(self, prefix=''):
# inspecting the checkpoint shows the embeddings are stored as 'shared.weight'
mapping = {f'{prefix}embeddings.word_embeddings.weight': 'decoder.embed_tokens.weight',
f'{prefix}decoderLayer.0.multiHeadAttention.relative_positions_encoding.weight': 'decoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight',
f'{prefix}final_layer_norm.weight': 'decoder.final_layer_norm.weight',
f'{prefix}final_dense.weight': 'lm_head.weight'}
for i in range(self.num_hidden_layers):
mapping.update(
{
f'{prefix}decoderLayer.{i}.multiHeadAttention.q.weight': f'decoder.block.{i}.layer.0.SelfAttention.q.weight',
f'{prefix}decoderLayer.{i}.multiHeadAttention.k.weight': f'decoder.block.{i}.layer.0.SelfAttention.k.weight',
f'{prefix}decoderLayer.{i}.multiHeadAttention.v.weight': f'decoder.block.{i}.layer.0.SelfAttention.v.weight',
f'{prefix}decoderLayer.{i}.multiHeadAttention.o.weight': f'decoder.block.{i}.layer.0.SelfAttention.o.weight',
f'{prefix}decoderLayer.{i}.layerNorm1.weight': f'decoder.block.{i}.layer.0.layer_norm.weight',
f'{prefix}decoderLayer.{i}.crossAttention.q.weight': f'decoder.block.{i}.layer.1.EncDecAttention.q.weight',
f'{prefix}decoderLayer.{i}.crossAttention.k.weight': f'decoder.block.{i}.layer.1.EncDecAttention.k.weight',
f'{prefix}decoderLayer.{i}.crossAttention.v.weight': f'decoder.block.{i}.layer.1.EncDecAttention.v.weight',
f'{prefix}decoderLayer.{i}.crossAttention.o.weight': f'decoder.block.{i}.layer.1.EncDecAttention.o.weight',
f'{prefix}decoderLayer.{i}.layerNorm3.weight': f'decoder.block.{i}.layer.1.layer_norm.weight',
f'{prefix}decoderLayer.{i}.feedForward.outputDense.weight': f'decoder.block.{i}.layer.2.DenseReluDense.wo.weight',
f'{prefix}decoderLayer.{i}.layerNorm2.weight': f'decoder.block.{i}.layer.2.layer_norm.weight',
})
if self.version.endswith('t5.1.0'):
mapping.update({f'{prefix}decoderLayer.{i}.feedForward.intermediateDense.weight': f'decoder.block.{i}.layer.2.DenseReluDense.wi.weight'})
elif self.version.endswith('t5.1.1'):
mapping.update({f'{prefix}decoderLayer.{i}.feedForward.intermediateDense.weight': f'decoder.block.{i}.layer.2.DenseReluDense.wi_0.weight',
f'{prefix}decoderLayer.{i}.feedForward.intermediateDense1.weight': f'decoder.block.{i}.layer.2.DenseReluDense.wi_1.weight'})
return mapping
class T5(Transformer):
"""Google的T5模型(Encoder-Decoder)
"""
@delete_arguments('with_pool', 'with_mlm', 'with_nsp')
def __init__(self, *args, tie_emb_src_tgt_weight=True, **kwargs):
super(T5, self).__init__(*args, **kwargs)
self.tie_emb_src_tgt_weight = tie_emb_src_tgt_weight
# encoder
self.encoder = T5_Encoder(*args, **kwargs)
self.encoder.build(**kwargs)
# decoder
self.decoder = T5_Decoder(*args, **kwargs)
self.decoder.build(**kwargs)
def load_variable(self, state_dict, name, prefix=''):
"""加载单个变量的函数
"""
variable = state_dict[name]
if name in {'shared.weight', 'encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight'}:
return self.load_embeddings(variable)
else:
return variable
def variable_mapping(self, prefix=''):
mapping = self.encoder.variable_mapping(prefix='encoder.')
mapping.update(self.decoder.variable_mapping(prefix='decoder.'))
if self.tie_emb_src_tgt_weight:
mapping.update({'encoder.embeddings.word_embeddings.weight': 'shared.weight',
'decoder.embeddings.word_embeddings.weight': 'shared.weight'})
return mapping
class GPT(LM_Mask, BERT):
"""构建GPT模型
链接:https://github.com/openai/finetune-transformer-lm
"""
@insert_arguments(final_activation='softmax')
@delete_arguments('with_pool', 'with_mlm', 'with_nsp')
def __init__(self, max_position, **kwargs):
"""GPT的embedding是token、position、segment三者embedding之和,跟BERT的主要区别是三者相加之后没有加LayerNormalization层。
使用LM_Mask实现预训练ckpt中的bias参数,最后的全连接层由于和embedding层权重一致,因此直接从word_embedding取
"""
super(GPT, self).__init__(max_position, **kwargs)
del self.embeddings.layerNorm
self.dense = nn.Linear(self.hidden_size, self.vocab_size, bias=False)
self.dense.weight = self.embeddings.word_embeddings.weight
self.final_activation = get_activation(self.final_activation)
def apply_final_layers(self, inputs):
hidden_state = super().apply_final_layers(inputs)
logit = self.dense(hidden_state)
return self.final_activation(logit)
def load_variable(self, state_dict, name):
return super(GPT, self).load_variable(state_dict, name, prefix='gpt')
def variable_mapping(self):
"""映射到GPT权重格式
"""
mapping = super(GPT, self).variable_mapping(prefix='gpt')
return mapping
class GPT2(LM_Mask, BERT):
"""构建GPT模型
链接:https://github.com/openai/finetune-transformer-lm
"""
@insert_arguments(final_activation='softmax')
@delete_arguments('with_pool', 'with_mlm', 'with_nsp')
def __init__(self, max_position, **kwargs):
"""GPT2的embedding是token、position两者embedding之和
1、跟BERT的主要区别是三者相加之后没有加LayerNormalization层。
2、bert的layernorm是在attn/ffc之后,OpenAi-gpt2是在之前。
使用LM_Mask实现预训练ckpt中的bias参数,最后的全连接层由于和embedding层权重一致,因此直接从word_embedding取
"""
super(GPT2, self).__init__(max_position, **kwargs)
del self.embeddings.layerNorm
layer = self.Gpt2Layer(self.hidden_size, self.num_attention_heads, self.dropout_rate, self.attention_probs_dropout_prob, self.intermediate_size, self.hidden_act, is_dropout=self.is_dropout, conditional_size=self.conditional_size)
self.encoderLayer = nn.ModuleList([copy.deepcopy(layer) if layer_id in self.keep_hidden_layers else Identity() for layer_id in range(self.num_hidden_layers)])
self.LayerNormFinal = LayerNorm(self.hidden_size, eps=1e-12, conditional_size=self.conditional_size)
self.dense = nn.Linear(self.hidden_size, self.vocab_size, bias=False)
self.dense.weight = self.embeddings.word_embeddings.weight
self.final_activation = get_activation(self.final_activation)
def apply_final_layers(self, inputs):
hidden_state = super().apply_final_layers(inputs)
logit = self.dense(self.LayerNormFinal([hidden_state]))
return self.final_activation(logit)
def load_variable(self, state_dict, name):
return super(GPT2, self).load_variable(state_dict, name, prefix='gpt2')
def variable_mapping(self):
"""映射到GPT权重格式
"""
mapping = super(GPT2, self).variable_mapping(prefix='gpt2')
mapping.update({'LayerNormFinal.weight': 'gpt2.LayerNormFinal.weight',
'LayerNormFinal.bias': 'gpt2.LayerNormFinal.bias'})
return mapping
class Gpt2Layer(BertLayer):
'''Not defined in layer.py because this layer is specific to the gpt2 model and not reusable
Order: LN --> Att --> Add --> LN --> FFN --> Add
'''
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
def forward(self, hidden_states, attention_mask, conditional_emb=None, encoder_hidden_states=None, encoder_attention_mask=None):
# bert applies layernorm after attn/ffn; OpenAI gpt2 applies it before
x = self.layerNorm1((hidden_states, conditional_emb))
self_attn_output = self.multiHeadAttention(x, attention_mask)
hidden_states = hidden_states + self.dropout1(self_attn_output)
x = self.layerNorm2((hidden_states, conditional_emb))
ffn_output = self.feedForward(x)
hidden_states = hidden_states + self.dropout2(ffn_output)
return hidden_states
class GPT2_ML(LM_Mask, BERT):
"""构建GPT2_ML模型
链接: https://github.com/imcaspar/gpt2-ml
注意:GPT2_ML虽然号称GPT2,但是它的结构其实更接近GPT,它自称GPT2的原因大概是因为它开源的版本参数量达到了GPT2的15亿参数。
看完ckpt中的key,和GPT的区别是embedding后也有layernorm,和bert的区别是第一个跳跃链接是在layernorm前,bert是在之后
"""
@insert_arguments(final_activation='softmax')
@delete_arguments('with_pool', 'with_mlm', 'with_nsp')
def __init__(self, max_position, **kwargs):
super().__init__(max_position, **kwargs)
layer = self.Gpt2MlLayer(self.hidden_size, self.num_attention_heads, self.dropout_rate, self.attention_probs_dropout_prob, self.intermediate_size, self.hidden_act, is_dropout=self.is_dropout, conditional_size=self.conditional_size)
self.encoderLayer = nn.ModuleList([copy.deepcopy(layer) if layer_id in self.keep_hidden_layers else Identity() for layer_id in range(self.num_hidden_layers)])
self.dense = nn.Linear(self.hidden_size, self.vocab_size, bias=False)
self.dense.weight = self.embeddings.word_embeddings.weight
self.final_activation = get_activation(self.final_activation)
def apply_final_layers(self, inputs):
hidden_state = super().apply_final_layers(inputs)
logit = self.dense(hidden_state)
return self.final_activation(logit)
def load_variable(self, state_dict, name):
return super(GPT2_ML, self).load_variable(state_dict, name, prefix='gpt2_ml')
def variable_mapping(self):
"""映射到GPT2权重格式
"""
mapping = super(GPT2_ML, self).variable_mapping(prefix='gpt2_ml')
return mapping
class Gpt2MlLayer(BertLayer):
'''Not defined in layer.py because this layer is specific to the gpt2_ml model and not reusable
Order: Att --> Add --> LN --> FFN --> Add --> LN
'''
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
def forward(self, hidden_states, attention_mask, conditional_emb=None, encoder_hidden_states=None, encoder_attention_mask=None):
self_attn_output = self.multiHeadAttention(hidden_states, attention_mask)
hidden_states = hidden_states + self.dropout1(self_attn_output)
x = self.layerNorm1((hidden_states, conditional_emb))
# bert takes the residual connection after layerNorm; gpt2_ml takes it before layerNorm
ffn_output = self.feedForward(x)
hidden_states = hidden_states + self.dropout2(ffn_output)
hidden_states = self.layerNorm2((hidden_states, conditional_emb))
return hidden_states
class Transformer_XL(BERT):
'''Build the transformer-xl model (weight loading verified)
Project: https://github.com/kimiyoung/transformer-xl
Differences from the original:
1) the original AdaptiveEmbedding is simplified (optional) and ProjectedAdaptiveLogSoftmax is not used; last_hidden_state is returned directly
2) mems are initialized as zero_tensor instead of the original project's empty_tensor, and changed to include the last layer
3) SinusoidalPositionEncoding normally interleaves sin and cos; here all sin come first, then all cos
4) in multi_attn the attention_mask uses 1e30 in place of the original 1000
'''
@delete_arguments('with_pool', 'with_nsp', 'with_mlm')
@insert_arguments(with_lm=False)
def __init__(self, *args, mem_len=0, same_length=False, clamp_len=-1, **kwargs):
# p_bias controls that no pos_embedding is added at the embedding stage
kwargs.update({'p_bias': 'other_relative'})
super().__init__(*args, **kwargs)
self.mem_len, self.same_length, self.clamp_len = mem_len, same_length, clamp_len
self.attn_type = kwargs.get('attn_type', 0)
# embedding
if kwargs.get('adaptive_embedding'):
cutoffs, div_val, sample_softmax = kwargs.get('cutoffs', []), kwargs.get('div_val', 1), kwargs.get('sample_softmax', False)
self.embeddings = AdaptiveEmbedding(self.vocab_size, self.embedding_size, self.hidden_size, cutoffs, div_val, sample_softmax, **get_kw(AdaptiveEmbedding, kwargs))
else:
self.embeddings = nn.Embedding(self.vocab_size, self.embedding_size)
self.pos_embeddings = XlnetPositionsEncoding(self.embedding_size)
self.dropout = nn.Dropout(self.dropout_rate)
# whether each layer owns its r_w_bias and r_r_bias, or they are shared globally
if not kwargs.get('untie_r'):
self.r_w_bias = nn.Parameter(torch.FloatTensor(self.num_attention_heads, self.attention_head_size)) # global content bias
self.r_r_bias = nn.Parameter(torch.FloatTensor(self.num_attention_heads, self.attention_head_size)) # global position bias
if self.segment_vocab_size > 0:
self.r_s_bias = nn.Parameter(torch.FloatTensor(self.num_attention_heads, self.attention_head_size)) # global segment bias
else:
self.r_w_bias, self.r_r_bias = None, None
self.r_s_bias = None
# transformer block
layer = XlnetLayer(self.hidden_size, self.num_attention_heads, self.dropout_rate, self.attention_probs_dropout_prob, self.intermediate_size,
self.hidden_act, is_dropout=self.is_dropout, conditional_size=self.conditional_size, r_w_bias=self.r_w_bias, r_r_bias=self.r_r_bias,
r_s_bias=None, **get_kw(BertLayer, kwargs))
self.encoderLayer = nn.ModuleList([copy.deepcopy(layer) if layer_id in self.keep_hidden_layers else Identity() for layer_id in range(self.num_hidden_layers)])
# output projection
if self.with_lm:
self.dense = nn.Linear(self.hidden_size, self.vocab_size, bias=True)
def init_mems(self, bsz):
'''Initialize mems, which cache mlen steps of every layer's hidden states
'''
if isinstance(self.mem_len, (int, float)) and (self.mem_len > 0):
mems = []
param = next(self.parameters())
for _ in range(self.num_hidden_layers+1):
empty = torch.zeros(bsz, self.mem_len, self.hidden_size, dtype=param.dtype, device=param.device)
mems.append(empty)
return mems
else:
return None
def _update_mems(self, hids, mlen, qlen):
'''Update mems
'''
# does not deal with None
if self.mems is None:
return None
# mems is not None
assert len(hids) == len(self.mems), "len(hids) != len(mems)"
# There are `mlen + qlen` steps that can be cached into mems
with torch.no_grad():
new_mems = []
end_idx = mlen + max(0, qlen)
beg_idx = max(0, end_idx - self.mem_len)
for i in range(len(hids)):
cat = torch.cat([self.mems[i], hids[i]], dim=1)
new_mems.append(cat[:, beg_idx:end_idx].detach())
self.mems = new_mems
def relative_positional_encoding(self, qlen, klen, device):
# generate pos_emb; sincos position encoding is used here, kept consistent with the xlnet signature
pos_seq = torch.arange(klen-1, -1, -1.0, device=device, dtype=torch.long)
if self.clamp_len > 0:
pos_seq.clamp_(max=self.clamp_len)
pos_emb = self.dropout(self.pos_embeddings(pos_seq)) # reuse word_emb's dropout
return pos_emb
def create_mask(self, word_emb, qlen, klen, mlen):
# adjust the attention_mask: all mlen positions are visible, while qlen positions can only attend to steps <= t; similar to the Unilm mask, except Unilm controls it via segment_ids
if self.same_length: # can only attend to a fixed-length window
all_ones = word_emb.new_ones(qlen, klen)
mask_len = klen - self.mem_len
mask_shift_len = qlen - mask_len if mask_len > 0 else qlen
attention_mask = 1-(torch.triu(all_ones, 1+mlen) + torch.tril(all_ones, -mask_shift_len)).byte() # -1
else:
attention_mask = torch.tril(word_emb.new_ones(qlen, klen), diagonal=mlen).byte() # [q_len, k_len], lower-triangular matrix of ones
attention_mask = attention_mask[None, None, :, :]
return attention_mask
def apply_embeddings(self, inputs):
'''Expected inputs: [token_ids, segment_ids]; conditional LayerNorm input is not supported yet
'''
assert isinstance(inputs, (list, tuple)), f'Inputs only support list/tuple format, but got {type(inputs)}'
self.mems = self.init_mems(inputs[0].size(0)) # create mems
# after simplification, the embeddings compute only the word_embedding
word_emb = self.dropout(self.embeddings(inputs[0]))
index_ = 1
btz, qlen = inputs[0].shape[:2] # query length
mlen = self.mems[0].size(1) if self.mems is not None else 0
klen = mlen + qlen
# relative position encoding
pos_emb = self.relative_positional_encoding(qlen, klen, word_emb.device)
# segment embedding
if self.segment_vocab_size > 0:
segment_ids = inputs[index_]
if mlen > 0:
mem_pad = torch.zeros([btz, mlen], dtype=torch.long, device=word_emb.device)
cat_ids = torch.cat([mem_pad, segment_ids], dim=1)
else:
cat_ids = segment_ids
# `1` indicates not in the same segment [qlen x klen x bsz]
segment_ids = (segment_ids[:, :, None] != cat_ids[:, None]).long()
index_ += 1
else:
segment_ids = None
if self.attn_type in {'uni', 0}: # compatible with transformer_xl's setting: 0
attention_mask = self.create_mask(word_emb, qlen, klen, mlen)
elif self.attn_type == 'bi':
attention_mask = (inputs[0] != self.token_pad_ids).long().unsqueeze(1).unsqueeze(2)
non_tgt_mask = torch.eye(qlen).to(attention_mask)[None, None, :, :]
non_tgt_mask = ((1 - attention_mask - non_tgt_mask) <= 0).long()
return [word_emb, segment_ids, pos_emb, non_tgt_mask, None]
def apply_main_layers(self, inputs):
hidden_states, segment_ids, pos_emb, attention_mask, conditional_emb = inputs[:5]
encoded_layers = [hidden_states] # include the embedding output
layer_inputs = [hidden_states, segment_ids, pos_emb, attention_mask, None, conditional_emb]
for i, layer_module in enumerate(self.encoderLayer):
mems_i = None if self.mems is None else self.mems[i]
layer_inputs[-2] = mems_i
layer_inputs = self.apply_on_layer_begin(i, layer_inputs)
hidden_states = layer_module(*layer_inputs)
layer_inputs[0] = hidden_states
layer_inputs = self.apply_on_layer_end(i, layer_inputs)
encoded_layers.append(hidden_states)
# in the original implementation word_emb, pos_emb and core_out (hidden_states) share the same dropout
hidden_states = self.dropout(hidden_states)
qlen = inputs[0].size(1) # query length
mlen = self.mems[0].size(1) if self.mems is not None else 0 # mems are batch-first [btz, mlen, hdsz], so mlen is size(1)
self._update_mems(encoded_layers, mlen, qlen)
if not self.output_all_encoded_layers:
# do not return all layers, i.e. return only the top layer (plus the embedding output)
encoded_layers = encoded_layers[:1] + [hidden_states]
return [encoded_layers, conditional_emb]
def load_variable(self, state_dict, name, prefix=''):
# not supported yet here because the pretrained model uses AdaptiveEmbedding
if (self.keep_tokens is not None) or (self.compound_tokens is not None):
raise ValueError('Custom keep_tokens and compound_tokens are not yet supported in Transformer_XL')
return state_dict[name]
def variable_mapping(self, prefix=''):
return {k:k for k, v in self.named_parameters()}
class XLNET(Transformer_XL):
'''Build the xlnet model; simplified here for finetuning only, i.e. without the perm_mask and target_mapping inputs
Expected inputs: [token_ids, segment_ids]
'''
def __init__(self, *args, bi_data=False, **kwargs):
self.attn_type = kwargs.get('attn_type', 'bi')
self.bi_data = bi_data
kwargs['rel_shift_opt'] = 'xlnet'
super().__init__(*args, **kwargs)
def relative_positional_encoding(self, qlen, klen, device):
# generate pos_emb using sincos position encoding; transformer_xl has a -1 offset here
if self.attn_type == 'bi':
beg, end = klen, -qlen
elif self.attn_type == "uni":
beg, end = klen, -1
else:
raise ValueError(f"Unknown `attn_type` {self.attn_type}.")
# forward embeddings
pos_seq = torch.arange(beg, end, -1.0, device=device, dtype=torch.long)
if self.clamp_len > 0:
pos_seq.clamp_(max=self.clamp_len)
fwd_pos_emb = self.pos_embeddings(pos_seq)
# bidirectional data
if self.bi_data:
pos_seq = torch.arange(-beg, -end, -1.0, device=device, dtype=torch.long)
if self.clamp_len > 0:
pos_seq.clamp_(max=self.clamp_len)
bwd_pos_emb = self.pos_embeddings(pos_seq)
pos_emb = torch.cat([fwd_pos_emb, bwd_pos_emb], dim=0)
else:
pos_emb = fwd_pos_emb
pos_emb = self.dropout(pos_emb) # reuse word_emb's dropout
return pos_emb
def apply_final_layers(self, inputs):
hidden_state = super().apply_final_layers(inputs)
if self.with_lm:
return [hidden_state, self.dense(hidden_state)]
else:
return hidden_state
def load_variable(self, state_dict, name, prefix='transformer'):
"""加载单个变量的函数
"""
variable = state_dict[name]
if name in {f'{prefix}.word_embedding.weight', 'lm_loss.weight', 'lm_loss.bias'}:
return self.load_embeddings(variable)
elif re.search(r'rel_attn\.(q|k|v|r)$', name): # raw strings avoid invalid escape sequences
return variable.reshape(variable.shape[0], -1).T
# elif re.search(r'rel_attn\.(o|seg_embed)$', name):
elif re.search(r'rel_attn\.(o)$', name):
return variable.reshape(variable.shape[0], -1)
else:
return variable
def variable_mapping(self, prefix='transformer'):
mapping = {
'embeddings.weight': f'{prefix}.word_embedding.weight',
'dense.weight': 'lm_loss.weight',
'dense.bias': 'lm_loss.bias',
}
for i in range(self.num_hidden_layers):
prefix_i = f'{prefix}.layer.%d.' % i
mapping.update({f'encoderLayer.{i}.multiHeadAttention.q.weight': prefix_i + 'rel_attn.q',
f'encoderLayer.{i}.multiHeadAttention.k.weight': prefix_i + 'rel_attn.k',
f'encoderLayer.{i}.multiHeadAttention.v.weight': prefix_i + 'rel_attn.v',
f'encoderLayer.{i}.multiHeadAttention.o.weight': prefix_i + 'rel_attn.o',
f'encoderLayer.{i}.multiHeadAttention.r.weight': prefix_i + 'rel_attn.r',
f'encoderLayer.{i}.multiHeadAttention.r_r_bias': prefix_i + 'rel_attn.r_r_bias',
f'encoderLayer.{i}.multiHeadAttention.r_s_bias': prefix_i + 'rel_attn.r_s_bias',
f'encoderLayer.{i}.multiHeadAttention.r_w_bias': prefix_i + 'rel_attn.r_w_bias',
# f'encoderLayer.{i}.multiHeadAttention.seg_embed.weight': prefix_i + 'rel_attn.seg_embed',
f'encoderLayer.{i}.multiHeadAttention.seg_embed': prefix_i + 'rel_attn.seg_embed',
f'encoderLayer.{i}.layerNorm1.weight': prefix_i + 'rel_attn.layer_norm.weight',
f'encoderLayer.{i}.layerNorm1.bias': prefix_i + 'rel_attn.layer_norm.bias',
f'encoderLayer.{i}.feedForward.intermediateDense.weight': prefix_i + 'ff.layer_1.weight',
f'encoderLayer.{i}.feedForward.intermediateDense.bias': prefix_i + 'ff.layer_1.bias',
f'encoderLayer.{i}.feedForward.outputDense.weight': prefix_i + 'ff.layer_2.weight',
f'encoderLayer.{i}.feedForward.outputDense.bias': prefix_i + 'ff.layer_2.bias',
f'encoderLayer.{i}.layerNorm2.weight': prefix_i + 'ff.layer_norm.weight',
f'encoderLayer.{i}.layerNorm2.bias': prefix_i + 'ff.layer_norm.bias'
})
return mapping
def build_transformer_model(
config_path=None,
checkpoint_path=None,
model='bert',
application='encoder',
**kwargs
):
"""根据配置文件构建模型,可选加载checkpoint权重
"""
configs = {}
if config_path is not None:
with open(config_path) as f:
configs.update(json.load(f))
configs.update(kwargs)
if 'max_position' not in configs:
configs['max_position'] = configs.get('max_position_embeddings', 512)
if 'dropout_rate' not in configs:
configs['dropout_rate'] = configs.get('hidden_dropout_prob')
if 'segment_vocab_size' not in configs:
configs['segment_vocab_size'] = configs.get('type_vocab_size', 2)
models = {
'bert': BERT,
'roberta': BERT,
'albert': ALBERT,
'albert_unshared': ALBERT_Unshared,
'nezha': NEZHA,
'roformer': RoFormer,
'roformer_v2': RoFormerV2,
'gau_alpha': GAU_alpha,
'electra': ELECTRA,
'ernie': ERNIE,
'encoder': Encoder,
'decoder': Decoder,
'transformer': Transformer,
'bart': BART,
'gpt': GPT,
'gpt2': GPT2,
'gpt2_ml': GPT2_ML,
't5': T5,
't5_encoder': T5_Encoder,
't5_decoder': T5_Decoder,
't5.1.0': T5,
't5.1.0_encoder': T5_Encoder,
't5.1.0_decoder': T5_Decoder,
't5.1.1': T5,
't5.1.1_encoder': T5_Encoder,
't5.1.1_decoder': T5_Decoder,
'mt5.1.1': T5,
'mt5.1.1_encoder': T5_Encoder,
'mt5.1.1_decoder': T5_Decoder,
'transformer_xl': Transformer_XL,
'xlnet': XLNET,
}
if isinstance(model, str): # a string selects one of the built-in models
MODEL = models[model.lower()]
if model.endswith('t5.1.1'):
configs['version'] = model
elif isinstance(model, type) and issubclass(model, BERT_BASE): # a class means a user-defined model
MODEL = model
else:
raise ValueError('"model" argument should be a string or a subclass of BERT_BASE')
application = application.lower()
if application in ['lm', 'unilm'] and model in ['electra', 't5']:
raise ValueError(f'"{model}" model cannot be used for the "{application}" application.\n')
if application == 'lm':
MODEL = extend_with_language_model(MODEL)
elif application == 'unilm':
MODEL = extend_with_unified_language_model(MODEL)
transformer = MODEL(**configs)
transformer.build(**configs)
transformer.apply(transformer.init_model_weights) # initialize weights
if checkpoint_path is not None:
transformer.load_weights_from_pytorch_checkpoint(checkpoint_path)
transformer.configs = configs
return transformer
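# ---- Usage sketch for build_transformer_model (illustrative only: the paths below are
# hypothetical placeholders for a converted config/checkpoint, not files shipped here) ----
def _example_build_transformer_model():
    model = build_transformer_model(
        config_path='/path/to/bert_config.json',       # hypothetical path
        checkpoint_path='/path/to/pytorch_model.bin',  # hypothetical path
        model='bert',
        application='encoder',
    )
    return model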
from torch.optim.lr_scheduler import LambdaLR
def get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, last_epoch=-1):
"""带warmup的schedule, 源自transformers包optimization.py中
参数
num_warmup_steps:
需要warmup的步数, 一般为 num_training_steps * warmup_proportion(warmup的比例, 建议0.05-0.15)
num_training_steps:
总的训练步数, 一般为 train_batches * num_epoch
"""
def lr_lambda(current_step: int):
if current_step < num_warmup_steps:
return float(current_step) / float(max(1, num_warmup_steps))
return max(0.0, float(num_training_steps - current_step) / float(max(1, num_training_steps - num_warmup_steps)))
return LambdaLR(optimizer, lr_lambda, last_epoch)
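# ---- Minimal sketch of the warmup schedule above (the toy model, step counts and
# warmup ratio are illustrative assumptions, not library defaults) ----
def _example_linear_warmup():
    toy_model = nn.Linear(4, 2)
    optimizer = torch.optim.Adam(toy_model.parameters(), lr=2e-5)
    num_training_steps = 1000                          # train_batches * num_epoch
    num_warmup_steps = int(num_training_steps * 0.1)   # warmup_proportion = 0.1
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps)
    for _ in range(num_training_steps):
        optimizer.step()       # parameter update first
        scheduler.step()       # then advance the lr schedule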
def extend_with_exponential_moving_average(model, decay=0.999):
class ExponentialMovingAverage():
''' Exponential moving average (EMA) of the model weights; it takes no part in gradient updates and only tracks the averaged parameters for use at inference time
Note this differs from adaptive optimizers such as adam, whose exponential moving averages are over first/second moments of the gradients; the two are entirely unrelated
Example:
# initialization
ema = ExponentialMovingAverage(model, 0.999)
# during training, update the ema weights right after the parameter update
def train():
optimizer.step()
ema.step()
# before eval call apply_ema_weights(); after eval call restore_raw_weights() to recover the raw model weights
def evaluate():
ema.apply_ema_weights()
# evaluate
# to save the EMA version of the model, call torch.save() before restore
ema.restore_raw_weights()
'''
def __init__(self, model, decay):
self.model = model
self.decay = decay
# shadow weights (the moving average of every parameter at the current step)
self.ema_weights = {}
# raw model weights saved during evaluate; restored from the ema weights once evaluate finishes
self.model_weights = {}
# initialize ema_weights from the model weights
for name, param in self.model.named_parameters():
if param.requires_grad:
self.ema_weights[name] = param.data.clone()
def step(self):
for name, param in self.model.named_parameters():
if param.requires_grad:
assert name in self.ema_weights
new_average = (1.0 - self.decay) * param.data + self.decay * self.ema_weights[name]
self.ema_weights[name] = new_average.clone()
def apply_ema_weights(self):
for name, param in self.model.named_parameters():
if param.requires_grad:
assert name in self.ema_weights
self.model_weights[name] = param.data
param.data = self.ema_weights[name]
def restore_raw_weights(self):
for name, param in self.model.named_parameters():
if param.requires_grad:
assert name in self.model_weights
param.data = self.model_weights[name]
self.model_weights = {}
return ExponentialMovingAverage(model, decay)
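# ---- Runnable sketch of the EMA workflow described in the docstring above
# (the linear model, optimizer and step count are stand-ins) ----
def _example_ema():
    toy_model = nn.Linear(4, 2)
    optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)
    ema = extend_with_exponential_moving_average(toy_model, decay=0.999)
    for _ in range(10):
        loss = toy_model(torch.randn(8, 4)).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        ema.step()                  # update the shadow weights right after optimizer.step()
    ema.apply_ema_weights()         # evaluate (or torch.save) with the EMA weights
    ema.restore_raw_weights()       # then switch back to the raw weights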
#! -*- coding: utf-8 -*-
# miscellaneous utilities
import unicodedata
import six
import numpy as np
import re
import torch
from torch.nn.utils.rnn import pad_sequence
import time
import sys
import collections
import torch.nn as nn
from torch.utils.data import Dataset, IterableDataset
import math
import gc
import inspect
import json
import torch.nn.functional as F
import random
from datetime import datetime
from typing import List, Tuple, Dict, Union, Optional
import warnings
import os
is_py2 = six.PY2
if not is_py2:
basestring = str
def take_along_dim(input_tensor, indices, dim=None):
'''Compatibility shim for older pytorch versions that lack torch.take_along_dim
'''
if tuple(int(v) for v in torch.__version__.split('+')[0].split('.')[:2]) >= (1, 9): # tuple compare: string comparison misorders '1.10' vs '1.9'
return torch.take_along_dim(input_tensor, indices, dim)
else:
# this fallback has only been tested on a small amount of data; please report any bugs
if dim is None:
res = input_tensor.flatten()[indices]
else:
res = np.take_along_axis(input_tensor.cpu().numpy(), indices.cpu().numpy(), axis=dim)
res = torch.from_numpy(res).to(input_tensor.device)
# assert res.equal(torch.take_along_dim(input_tensor, indices, dim))
return res
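# ---- Quick self-check for the shim above (shapes follow torch.take_along_dim) ----
def _example_take_along_dim():
    t = torch.tensor([[10., 30., 20.]])
    idx = t.argsort(dim=1, descending=True)
    return take_along_dim(t, idx, dim=1)   # tensor([[30., 20., 10.]])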
def is_string(s):
"""判断是否是字符串
"""
return isinstance(s, basestring)
def truncate_sequences(maxlen, indices, *sequences):
"""截断总长度至不超过maxlen
"""
sequences = [s for s in sequences if s]
if not isinstance(indices, (list, tuple)):
indices = [indices] * len(sequences)
while True:
lengths = [len(s) for s in sequences]
if sum(lengths) > maxlen:
i = np.argmax(lengths)
sequences[i].pop(indices[i])
else:
return sequences
def text_segmentate(text, maxlen, seps='\n', strips=None, truncate=True):
"""将文本按照标点符号划分为若干个短句
truncate: True表示标点符号切分后仍然超长时, 按照maxlen硬截断分成若干个短句
"""
text = text.strip().strip(strips)
if seps and len(text) > maxlen:
pieces = text.split(seps[0])
text, texts = '', []
for i, p in enumerate(pieces):
if text and p and len(text) + len(p) > maxlen - 1:
texts.extend(text_segmentate(text, maxlen, seps[1:], strips, truncate))
text = ''
if i + 1 == len(pieces):
text = text + p
else:
text = text + p + seps[0]
if text:
texts.extend(text_segmentate(text, maxlen, seps[1:], strips, truncate))
return texts
elif truncate and (not seps) and (len(text) > maxlen):
# separators exhausted, still too long, and truncate=True was set
return [text[i*maxlen:(i+1)*maxlen] for i in range(0, int(np.ceil(len(text)/maxlen)))]
else:
return [text]
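# ---- Example: split on '\n' first, then on '。', hard-truncating any leftover long piece
# (the sample text is made up) ----
def _example_text_segmentate():
    text = '第一句。第二句。第三句有一点长。' * 3
    return text_segmentate(text, maxlen=12, seps='\n。', strips=None, truncate=True)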
def merge_segmentate(sequences, maxlen, sep=''):
'''Merge m sentences into n sentences each no longer than maxlen; mainly used to merge fragments
'''
sequences_new = []
text = ''
for t in sequences:
if text and len(text + sep + t) <= maxlen:
text = text + sep + t
elif text:
sequences_new.append(text)
text = t
elif len(t) < maxlen: # text is empty here
text = t
else:
sequences_new.append(t)
text = ''
if text:
sequences_new.append(text)
return sequences_new
def text_augmentation(texts, noise_dict=None, noise_len=0, noise_p=0.0, skip_words=None, strategy='random', allow_dup=True):
'''Simple EDA strategies: insert, delete, replace
texts: text / list of texts to augment
noise_dict: noise tokens; a list, tuple or set of str
noise_len: number of positions to noise; takes precedence over noise_p
noise_p: fraction of positions to noise
skip_words: phrases to protect from editing, string/list
strategy: edit strategy, one of insert, delete, replace, random
allow_dup: whether the same position may be edited more than once
'''
def insert(text, insert_idx, noise_dict):
text = list(text)
for i in insert_idx:
text[i] = text[i] + random.choice(noise_dict)
return ''.join(text)
def delete(text, delete_idx):
text = list(text)
for i in delete_idx:
text[i] = ''
return ''.join(text)
def replace(text, replace_idx, noise_dict):
text = list(text)
for i in replace_idx:
text[i] = random.choice(noise_dict)
return ''.join(text)
def search(pattern, sequence, keep_last=True):
"""从sequence中寻找子串pattern, 返回符合pattern的id集合
"""
n = len(pattern)
pattern_idx_set = set()
for i in range(len(sequence)):
if sequence[i:i + n] == pattern:
pattern_idx_set = pattern_idx_set.union(set(range(i, i+n))) if keep_last else pattern_idx_set.union(set(range(i, i+n-1)))
return pattern_idx_set
if (noise_len==0) and (noise_p==0):
return texts
assert strategy in {'insert', 'delete', 'replace', 'random'}, 'EDA strategy only support insert, delete, replace, random'
if isinstance(texts, str):
texts = [texts]
if skip_words is None:
skip_words = []
elif isinstance(skip_words, str):
skip_words = [skip_words]
for t_id, text in enumerate(texts): # renamed from `id` to avoid shadowing the builtin
sel_len = noise_len if noise_len > 0 else int(len(text)*noise_p) # noise length
skip_idx = set() # indices that must not be modified
for item in skip_words:
# for insert, inserting after the last char of a skip word is allowed
skip_idx = skip_idx.union(search(item, text, strategy!='insert'))
sel_idxs = [i for i in range(len(text)) if i not in skip_idx] # candidate indices
sel_len = sel_len if allow_dup else min(sel_len, len(sel_idxs)) # sampling without replacement needs sample size <= population
if (sel_len == 0) or (len(sel_idxs) == 0): # skip if nothing can be sampled
continue
sel_idx = np.random.choice(sel_idxs, sel_len, replace=allow_dup)
if strategy == 'insert':
texts[t_id] = insert(text, sel_idx, noise_dict)
elif strategy == 'delete':
texts[t_id] = delete(text, sel_idx)
elif strategy == 'replace':
texts[t_id] = replace(text, sel_idx, noise_dict)
elif strategy == 'random':
if random.random() < 0.333:
skip_idx = set() # indices that must not be modified
for item in skip_words:
# for insert, inserting after the last char of a skip word is allowed
skip_idx = skip_idx.union(search(item, text, keep_last=False))
texts[t_id] = insert(text, sel_idx, noise_dict)
elif random.random() < 0.667:
texts[t_id] = delete(text, sel_idx)
else:
texts[t_id] = replace(text, sel_idx, noise_dict)
return texts if len(texts) > 1 else texts[0]
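# ---- Example: random-replacement EDA that protects a phrase
# (noise_dict/skip_words values are made up) ----
def _example_text_augmentation():
    return text_augmentation('bert4torch是一个训练框架', noise_dict=['哈', '呀'],
                             noise_p=0.2, skip_words='bert4torch', strategy='replace')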
def lowercase_and_normalize(text, never_split=()):
"""Lowercase the text and apply simple normalization
"""
if is_py2:
text = unicode(text)
# convert non-special tokens to lowercase
if never_split: # guard: with an empty never_split the combined pattern would match empty groups and crash
escaped_special_toks = [re.escape(s_tok) for s_tok in never_split]
pattern = r"(" + r"|".join(escaped_special_toks) + r")|" + r"(.+?)"
text = re.sub(pattern, lambda m: m.groups()[0] or m.groups()[1].lower(), text)
else:
text = text.lower()
# strip combining marks (accents)
text = unicodedata.normalize('NFD', text)
text = ''.join([ch for ch in text if unicodedata.category(ch) != 'Mn'])
return text
def sequence_padding(inputs, length=None, value=0, seq_dims=1, mode='post'):
"""将序列padding到同一长度
"""
if isinstance(inputs[0], (np.ndarray, list)):
if length is None:
length = np.max([np.shape(x)[:seq_dims] for x in inputs], axis=0)
elif not hasattr(length, '__getitem__'):
length = [length]
slices = [np.s_[:length[i]] for i in range(seq_dims)]
slices = tuple(slices) if len(slices) > 1 else slices[0]
pad_width = [(0, 0) for _ in np.shape(inputs[0])]
outputs = []
for x in inputs:
x = x[slices]
for i in range(seq_dims):
if mode == 'post':
pad_width[i] = (0, length[i] - np.shape(x)[i])
elif mode == 'pre':
pad_width[i] = (length[i] - np.shape(x)[i], 0)
else:
raise ValueError('"mode" argument must be "post" or "pre".')
x = np.pad(x, pad_width, 'constant', constant_values=value)
outputs.append(x)
return np.array(outputs)
elif isinstance(inputs[0], torch.Tensor):
assert mode == 'post', '"mode" argument must be "post" when element is torch.Tensor'
if length is not None:
inputs = [i[:length] for i in inputs]
return pad_sequence(inputs, padding_value=value, batch_first=True)
else:
raise ValueError('"input" argument must be tensor/list/ndarray.')
def insert_arguments(**arguments):
"""装饰器,为类方法增加参数(主要用于类的__init__方法)
"""
def actual_decorator(func):
def new_func(self, *args, **kwargs):
for k, v in arguments.items():
if k in kwargs:
v = kwargs.pop(k)
setattr(self, k, v)
return func(self, *args, **kwargs)
return new_func
return actual_decorator
def delete_arguments(*arguments):
"""装饰器,为类方法删除参数(主要用于类的__init__方法)
"""
def actual_decorator(func):
def new_func(self, *args, **kwargs):
for k in arguments:
if k in kwargs:
raise TypeError(
'%s got an unexpected keyword argument \'%s\'' %
(self.__class__.__name__, k)
)
return func(self, *args, **kwargs)
return new_func
return actual_decorator
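# ---- Toy class showing the two decorators above working together
# (_ExampleModel is illustrative, not part of the library) ----
class _ExampleModel:
    @insert_arguments(final_activation='softmax')  # becomes self.final_activation, default 'softmax'
    @delete_arguments('with_pool')                 # passing with_pool now raises TypeError
    def __init__(self, **kwargs):
        self.kwargs = kwargs
# _ExampleModel(final_activation='tanh').final_activation == 'tanh'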
class Progbar(object):
"""Displays a progress bar.
# Arguments
target: Total number of steps expected, None if unknown.
width: Progress bar width on screen.
verbose: Verbosity mode, 0 (silent), 1 (verbose), 2 (semi-verbose)
stateful_metrics: Iterable of string names of metrics that
should *not* be averaged over time. Metrics in this list
will be displayed as-is. All others will be averaged
by the progbar before display.
interval: Minimum visual progress update interval (in seconds).
"""
def __init__(self, target, width=30, verbose=1, interval=0.05, stateful_metrics=None):
self.target = target
self.width = width
self.verbose = verbose
self.interval = interval
if stateful_metrics:
self.stateful_metrics = set(stateful_metrics)
else:
self.stateful_metrics = set()
self._dynamic_display = ((hasattr(sys.stdout, 'isatty') and
sys.stdout.isatty()) or
'ipykernel' in sys.modules)
self._total_width = 0
self._seen_so_far = 0
self._values = collections.OrderedDict()
self._start = time.time()
self._last_update = 0
def update(self, current, values=None):
"""Updates the progress bar.
# Arguments
current: Index of current step.
values: List of tuples:
`(name, value_for_last_step)`.
If `name` is in `stateful_metrics`,
`value_for_last_step` will be displayed as-is.
Else, an average of the metric over time will be displayed.
"""
values = values or []
for k, v in values:
if k not in self.stateful_metrics:
if k not in self._values:
self._values[k] = [v * (current - self._seen_so_far),
current - self._seen_so_far]
else:
self._values[k][0] += v * (current - self._seen_so_far)
self._values[k][1] += (current - self._seen_so_far)
else:
# Stateful metrics output a numeric value. This representation
# means "take an average from a single value" but keeps the
# numeric formatting.
self._values[k] = [v, 1]
self._seen_so_far = current
now = time.time()
info = ' - %.0fs' % (now - self._start)
if self.verbose == 1:
if (now - self._last_update < self.interval and
self.target is not None and current < self.target):
return
prev_total_width = self._total_width
if self._dynamic_display:
sys.stdout.write('\b' * prev_total_width)
sys.stdout.write('\r')
else:
sys.stdout.write('\n')
if self.target is not None:
numdigits = int(np.floor(np.log10(self.target))) + 1
barstr = '%%%dd/%d [' % (numdigits, self.target)
bar = barstr % current
prog = float(current) / self.target
prog_width = int(self.width * prog)
if prog_width > 0:
bar += ('=' * (prog_width - 1))
if current < self.target:
bar += '>'
else:
bar += '='
bar += ('.' * (self.width - prog_width))
bar += ']'
else:
bar = '%7d/Unknown' % current
self._total_width = len(bar)
sys.stdout.write(bar)
if current:
time_per_unit = (now - self._start) / current
else:
time_per_unit = 0
if self.target is not None and current < self.target:
eta = time_per_unit * (self.target - current)
if eta > 3600:
eta_format = ('%d:%02d:%02d' %
(eta // 3600, (eta % 3600) // 60, eta % 60))
elif eta > 60:
eta_format = '%d:%02d' % (eta // 60, eta % 60)
else:
eta_format = '%ds' % eta
info = ' - ETA: %s' % eta_format
else:
if time_per_unit >= 1:
info += ' %.0fs/step' % time_per_unit
elif time_per_unit >= 1e-3:
info += ' %.0fms/step' % (time_per_unit * 1e3)
else:
info += ' %.0fus/step' % (time_per_unit * 1e6)
for k in self._values:
info += ' - %s:' % k
if isinstance(self._values[k], list):
avg = np.mean(
self._values[k][0] / max(1, self._values[k][1]))
if abs(avg) > 1e-3:
info += ' %.4f' % avg
else:
info += ' %.4e' % avg
else:
info += ' %s' % self._values[k]
self._total_width += len(info)
if prev_total_width > self._total_width:
info += (' ' * (prev_total_width - self._total_width))
if self.target is not None and current >= self.target:
info += '\n'
sys.stdout.write(info)
sys.stdout.flush()
elif self.verbose == 2:
if self.target is None or current >= self.target:
for k in self._values:
info += ' - %s:' % k
avg = np.mean(
self._values[k][0] / max(1, self._values[k][1]))
if avg > 1e-3:
info += ' %.4f' % avg
else:
info += ' %.4e' % avg
info += '\n'
sys.stdout.write(info)
sys.stdout.flush()
self._last_update = now
def add(self, n, values=None):
self.update(self._seen_so_far + n, values)
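# ---- Minimal Progbar usage sketch (loss values are fabricated) ----
def _example_progbar():
    bar = Progbar(target=100)
    for step in range(1, 101):
        bar.update(step, values=[('loss', 1.0 / step)])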
class Callback(object):
'''Base class for callbacks
'''
def __init__(self):
pass
def on_train_begin(self, logs=None):
pass
def on_train_end(self, logs=None):
pass
def on_epoch_begin(self, global_step, epoch, logs=None):
pass
def on_epoch_end(self, global_step, epoch, logs=None):
pass
def on_batch_begin(self, global_step, local_step, logs=None):
pass
def on_batch_end(self, global_step, local_step, logs=None):
pass
def on_dataloader_end(self, logs=None):
pass
class ProgbarLogger(Callback):
"""Callback that prints metrics to stdout.
# Arguments
count_mode: One of "steps" or "samples".
Whether the progress bar should
count samples seen or steps (batches) seen.
stateful_metrics: Iterable of string names of metrics that
should *not* be averaged over an epoch.
Metrics in this list will be logged as-is.
All others will be averaged over time (e.g. loss, etc).
# Raises
ValueError: In case of invalid `count_mode`.
"""
def __init__(self, epochs, steps, metrics, stateful_metrics=None, verbose=1):
super(ProgbarLogger, self).__init__()
if stateful_metrics:
self.stateful_metrics = set(stateful_metrics)
else:
self.stateful_metrics = set()
self.params = {'epochs': epochs, 'steps': steps, 'verbose': verbose, 'metrics': metrics}
self.verbose = verbose
self.epochs = epochs
def add_metrics(self, metrics, stateful_metrics=None, add_position=None):
if add_position is None:
add_position = len(self.params['metrics'])
metrics = [metrics] if isinstance(metrics, str) else metrics
if stateful_metrics:
stateful_metrics = [stateful_metrics] if isinstance(stateful_metrics, str) else stateful_metrics
self.stateful_metrics.update(set(stateful_metrics))
self.progbar.stateful_metrics.update(set(stateful_metrics))
add_metrics = []
for metric in metrics:
if metric not in self.params['metrics']:
add_metrics.append(metric)
self.params['metrics'] = self.params['metrics'][:add_position] + add_metrics + self.params['metrics'][add_position:]
def on_train_begin(self, logs=None):
if self.verbose:
print('Start Training'.center(40, '='))
def on_epoch_begin(self, global_step=None, epoch=None, logs=None):
if self.verbose:
time_start = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
print('%s - Epoch: %d/%d' % (time_start, epoch + 1, self.epochs))
self.target = self.params['steps']
self.progbar = Progbar(target=self.target, verbose=self.verbose, stateful_metrics=self.stateful_metrics)
self.seen = 0
def on_batch_begin(self, global_step=None, local_step=None, logs=None):
if self.seen < self.target:
self.log_values = []
def on_batch_end(self, global_step=None, local_step=None, logs=None):
logs = logs or {}
self.seen += 1
for k in self.params['metrics']:
if k in logs:
self.log_values.append((k, logs[k]))
# Skip progbar update for the last batch;
# will be handled by on_epoch_end.
if self.verbose and self.seen < self.target:
self.progbar.update(self.seen, self.log_values)
def on_epoch_end(self, global_step=None, epoch=None, logs=None):
logs = logs or {}
for k in self.params['metrics']:
if k in logs:
self.log_values.append((k, logs[k]))
if self.verbose:
self.progbar.update(self.seen, self.log_values)
def on_train_end(self, logs=None):
if self.verbose:
print('Finish Training'.center(40, '='))
class EarlyStopping(Callback):
'''Early-stopping strategy, ported from keras
'''
def __init__(self, monitor='loss', min_delta=0, patience=0, verbose=0, mode='auto', baseline=None):
super(EarlyStopping, self).__init__()
self.monitor = monitor
self.baseline = baseline
self.patience = patience
self.verbose = verbose
self.min_delta = min_delta
self.wait = 0
self.stopped_epoch = 0
if mode not in ['auto', 'min', 'max']:
warnings.warn('EarlyStopping mode %s is unknown, fallback to auto mode.' % mode, RuntimeWarning)
mode = 'auto'
if mode == 'min':
self.monitor_op = np.less
elif mode == 'max':
self.monitor_op = np.greater
else:
self.monitor_op = np.greater if 'acc' in self.monitor else np.less
self.min_delta = self.min_delta if self.monitor_op == np.greater else -self.min_delta
def on_train_begin(self, logs=None):
# Allow instances to be re-used
self.wait = 0
self.stopped_epoch = 0
if self.baseline is not None:
self.best = self.baseline
else:
self.best = np.inf if self.monitor_op == np.less else -np.inf # np.Inf alias was removed in numpy 2.0
def on_epoch_end(self, steps, epoch, logs=None):
current = self.get_monitor_value(logs)
if current is None:
return
if self.monitor_op(current - self.min_delta, self.best):
self.best = current
self.wait = 0
else:
self.wait += 1
if self.wait >= self.patience:
self.stopped_epoch = epoch
def on_train_end(self, logs=None):
if self.stopped_epoch > 0 and self.verbose > 0:
print(f'Epoch {self.stopped_epoch+1}: early stopping\n')
def get_monitor_value(self, logs):
monitor_value = logs.get(self.monitor)
if monitor_value is None:
warnings.warn('Early stopping conditioned on metric `%s` '
'which is not available. Available metrics are: %s' %
(self.monitor, ','.join(list(logs.keys()))), RuntimeWarning)
return monitor_value
class Logger(Callback):
'''Default logging callback
For valid/dev and test logs, assign to the log dict after evaluate (e.g. log['dev_f1']=f1) and invoke this callback after the Evaluator
If the dev set is evaluated every N steps, set this Logger's interval equal to (or a divisor of) the Evaluator's, so the values are guaranteed to be logged
'''
def __init__(self, filename, interval=10, verbosity=1, name=None):
super(Logger, self).__init__()
self.interval = interval
import logging
level_dict = {0: logging.DEBUG, 1: logging.INFO, 2: logging.WARNING}
formatter = logging.Formatter("[%(asctime)s][%(filename)s][line:%(lineno)d][%(levelname)s] %(message)s")
self.logger = logging.getLogger(name)
self.logger.setLevel(level_dict[verbosity])
fh = logging.FileHandler(filename, "a")
fh.setFormatter(formatter)
self.logger.addHandler(fh)
def on_train_begin(self, logs=None):
self.logger.info('Start Training'.center(40, '='))
def on_train_end(self, logs=None):
self.logger.info('Finish Training'.center(40, '='))
def on_epoch_begin(self, global_step, epoch, logs=None):
self.logger.info(f'Epoch {epoch}'.center(40, '='))
def on_epoch_end(self, global_step, epoch, logs=None):
log_str = '\t '.join([f'{k}={v:.5f}' for k, v in logs.items()])
self.logger.info(f'epoch={epoch+1}\t {log_str}')
def on_batch_end(self, global_step, local_step, logs=None):
if (global_step+1) % self.interval == 0:
log_str = '\t '.join([f'{k}={v:.5f}' for k, v in logs.items()])
self.logger.info(f'step={global_step+1}\t {log_str}')
class Tensorboard(Callback):
'''Default Tensorboard callback
For valid/dev and test metrics, assign to the log dict after evaluate (e.g. log['dev/f1']=f1) and invoke this callback after the Evaluator
Use '/' in the key to place the metric in its own section
If the dev set is evaluated every N steps, set this interval equal to (or a divisor of) the Evaluator's, so the values are guaranteed to be recorded
'''
def __init__(self, dirname, interval=10, prefix='train'):
super(Tensorboard, self).__init__()
self.interval = interval
self.prefix = prefix
from tensorboardX import SummaryWriter
self.writer = SummaryWriter(log_dir=str(dirname)) # prepare summary writer
def on_epoch_end(self, global_step, epoch, logs=None):
for k, v in logs.items():
index = k if '/' in k else f"{self.prefix}/{k}"
self.writer.add_scalar(index, v, global_step)
def on_batch_end(self, global_step, local_step, logs=None):
if (global_step+1) % self.interval == 0:
for k, v in logs.items():
index = k if '/' in k else f"{self.prefix}/{k}"
self.writer.add_scalar(index, v, global_step)
def metric_mapping(metric, func, y_pred, y_true):
# custom metrics
if inspect.isfunction(func):
metric_res = func(y_pred, y_true)
if inspect.isfunction(metric):
# if a bare callback is passed (no key), it must return a Dict[str, int/float]
assert isinstance(metric_res, dict), 'Custom metrics callbacks should return "Dict[String: Int/Float]" value'
elif isinstance(metric, str):
# if the callback is passed under a string key, it must return an int/float
assert isinstance(metric_res, (int, float)), 'Custom metrics callbacks should return "Int, Float" value'
return metric_res
elif metric == 'loss':
pass
# built-in metrics
elif isinstance(metric, str):
# if forward returned a list/tuple, take the first item
y_pred_tmp = y_pred[0] if isinstance(y_pred, (list, tuple)) else y_pred
y_true_tmp = y_true[0] if isinstance(y_true, (list, tuple)) else y_true
# preprocess according to shape
if len(y_pred_tmp.shape) == len(y_true_tmp.shape) + 1:
y_pred_tmp = torch.argmax(y_pred_tmp, dim=-1)
elif len(y_pred_tmp.shape) == len(y_true_tmp.shape):
pass
else:
raise ValueError(f'y_pred_tmp.shape={y_pred_tmp.shape} while y_true_tmp.shape={y_true_tmp.shape}')
# compute the built-in metric
if metric in {'accuracy', 'acc'}:
return torch.sum(y_pred_tmp.eq(y_true_tmp)).item() / y_true_tmp.numel()
elif metric in {'mae', 'MAE', 'mean_absolute_error'}:
return torch.mean(torch.abs(y_pred_tmp - y_true_tmp)).item()
elif metric in {'mse', 'MSE', 'mean_squared_error'}:
return torch.mean(torch.square(y_pred_tmp - y_true_tmp)).item()
elif metric in {'mape', 'MAPE', 'mean_absolute_percentage_error'}:
diff = torch.abs((y_true_tmp - y_pred_tmp) / torch.clamp(torch.abs(y_true_tmp), 1e-7, None))
return 100. * torch.mean(diff).item()
elif metric in {'msle', 'MSLE', 'mean_squared_logarithmic_error'}:
first_log = torch.log(torch.clamp(y_pred_tmp, 1e-7, None) + 1.)
second_log = torch.log(torch.clamp(y_true_tmp, 1e-7, None) + 1.)
return torch.mean(torch.square(first_log - second_log)).item()
return None
def softmax(x, axis=-1):
"""numpy版softmax
"""
x = x - x.max(axis=axis, keepdims=True)
x = np.exp(x)
return x / x.sum(axis=axis, keepdims=True)
class AutoRegressiveDecoder(object):
"""通用自回归生成模型解码基类
包含beam search和random sample两种策略
"""
def __init__(self, start_id, end_id, maxlen, minlen=1, device='cpu'):
self.start_id = start_id
self.end_id = end_id
self.maxlen = maxlen
self.minlen = minlen
self.models = {}
self.device = device
if start_id is None:
self.first_output_ids = torch.empty((1, 0), dtype=int, device=device)
else:
self.first_output_ids = torch.tensor([[self.start_id]], device=device)
@staticmethod
def wraps(default_rtype='probas', use_states=False):
"""用来进一步完善predict函数
目前包含: 1. 设置rtype参数,并做相应处理;
2. 确定states的使用,并做相应处理;
3. 设置温度参数,并做相应处理。
"""
def actual_decorator(predict):
def new_predict(self, inputs, output_ids, states, temperature=1, rtype=default_rtype):
assert rtype in ['probas', 'logits']
prediction = predict(self, inputs, output_ids, states)
if not use_states:
prediction = (prediction, None)
if default_rtype == 'logits':
prediction = (nn.Softmax(dim=-1)(prediction[0] / temperature), prediction[1])
elif temperature != 1:
probas = torch.pow(prediction[0], 1.0 / temperature) # torch.pow, not the nonexistent torch.power
probas = probas / probas.sum(dim=-1, keepdim=True)
prediction = (probas, prediction[1])
if rtype == 'probas':
return prediction
else:
return torch.log(prediction[0] + 1e-12), prediction[1]
return new_predict
return actual_decorator
def predict(self, inputs, output_ids, states=None):
"""用户需自定义递归预测函数
说明: 定义的时候,需要用wraps方法进行装饰,传入default_rtype和use_states,
其中default_rtype为字符串logits或probas,probas时返回归一化的概率,
rtype=logits时则返回softmax前的结果或者概率对数。
返回: 二元组 (得分或概率, states)
"""
raise NotImplementedError
def beam_search(self, inputs_raw, topk, states=None, temperature=1, min_ends=1, add_btz_dim=True):
"""beam search解码
说明: 这里的topk即beam size;
返回: 最优解码序列。
"""
inputs = []
for i in inputs_raw:
if isinstance(i, torch.Tensor):
pass
elif isinstance(i, (list, tuple, np.ndarray)) and add_btz_dim:
i = torch.tensor([i], device=self.device)
elif isinstance(i, (list, tuple, np.ndarray)) and not add_btz_dim:
i = torch.tensor(i, device=self.device)
else:
raise ValueError('Beam search input elements only support tensor/array/list/tuple')
inputs.append(i)
output_ids, output_scores = self.first_output_ids, torch.zeros(1, device=self.device)
for step in range(self.maxlen):
scores, states = self.predict(inputs, output_ids, states, temperature, 'logits') # scores for the current step
if step == 0: # after the first step, repeat the inputs topk times
inputs = [i.repeat([topk]+[1]*(len(i.shape)-1)) for i in inputs]
scores = output_scores.reshape((-1, 1)) + scores # accumulate scores
indices = scores.flatten().argsort(dim=-1, descending=True)[:topk] # keep only topk
if tuple(int(v) for v in torch.__version__.split('+')[0].split('.')[:2]) < (1, 8): # tuple compare avoids the '1.10' < '1.9' string bug
indices_1 = indices // scores.shape[1] # row indices (legacy division for old versions)
else:
indices_1 = torch.div(indices, scores.shape[1], rounding_mode='floor') # row indices
indices_2 = (indices % scores.shape[1]).reshape((-1, 1)) # column indices
output_ids = torch.cat([output_ids[indices_1], indices_2], 1) # update outputs
output_scores = take_along_dim(scores, indices, dim=None) # update scores
is_end = output_ids[:, -1] == self.end_id # whether each sequence just produced the end token
end_counts = (output_ids == self.end_id).sum(1) # count end tokens per sequence
if output_ids.shape[1] >= self.minlen: # minimum length check
best = output_scores.argmax() # highest-scoring beam
if is_end[best] and end_counts[best] >= min_ends: # if it has terminated
return output_ids[best] # output it directly
else: # otherwise keep only the unfinished beams
flag = ~is_end | (end_counts < min_ends) # mark unfinished sequences
if not flag.all(): # if some have finished
inputs = [i[flag] for i in inputs] # drop finished sequences
output_ids = output_ids[flag] # drop finished sequences
output_scores = output_scores[flag] # drop finished scores
end_counts = end_counts[flag] # drop finished end counts
topk = flag.sum() # shrink topk accordingly
# reached maxlen; output the best beam
return output_ids[output_scores.argmax()]
def random_sample(self, inputs, n, topk=None, topp=None, states=None, temperature=1, min_ends=1):
"""随机采样n个结果
说明: 非None的topk表示每一步只从概率最高的topk个中采样;而非None的topp
表示每一步只从概率最高的且概率之和刚好达到topp的若干个token中采样。
返回: n个解码序列组成的list。
"""
inputs = [torch.tensor([i], device=self.device) for i in inputs]
output_ids = self.first_output_ids
results = []
for step in range(self.maxlen):
probas, states = self.predict(inputs, output_ids, states, temperature, 'probas') # probabilities for the current step
probas /= probas.sum(dim=-1, keepdims=True) # ensure normalization
if step == 0: # after the first step, repeat everything n times
probas = probas.repeat([n]+[1]*(len(probas.shape)-1))
inputs = [i.repeat([n]+[1]*(len(i.shape)-1)) for i in inputs]
output_ids = output_ids.repeat([n]+[1]*(len(output_ids.shape)-1))
if topk is not None:
k_indices = probas.argsort(dim=-1, descending=True)[:, :topk] # keep only topk
probas = take_along_dim(probas, k_indices, dim=1) # topk probabilities
probas /= probas.sum(dim=1, keepdims=True) # renormalize
if topp is not None:
p_indices = probas.argsort(dim=-1, descending=True) # sort descending
probas = take_along_dim(probas, p_indices, dim=-1) # sorted probabilities
cumsum_probas = torch.cumsum(probas, dim=-1) # cumulative probabilities
flag = torch.roll(cumsum_probas >= topp, 1, dims=1) # mark the part beyond topp
flag[:, 0] = False # together with torch.roll above, shifts the mask by one position
probas[flag] = 0 # zero out everything beyond topp
probas /= probas.sum(dim=1, keepdims=True) # renormalize
sample_func = lambda p: torch.multinomial(p, 1) # sample by probability
sample_ids = torch.stack([sample_func(p) for p in probas])
sample_ids = sample_ids.reshape((-1, 1)) # align shape
if topp is not None:
sample_ids = take_along_dim(p_indices, sample_ids, dim=1) # map back to the original ids
if topk is not None:
sample_ids = take_along_dim(k_indices, sample_ids, dim=1) # map back to the original ids
output_ids = torch.cat([output_ids, sample_ids], 1) # update outputs
is_end = output_ids[:, -1] == self.end_id # whether each sequence just produced the end token
end_counts = (output_ids == self.end_id).sum(1) # count end tokens per sequence
if output_ids.shape[1] >= self.minlen: # minimum length check
flag = is_end & (end_counts >= min_ends) # mark finished sequences
if flag.any(): # if some have finished
for ids in output_ids[flag]: # save the finished sequences
results.append(ids)
flag = ~flag # mark unfinished sequences
inputs = [i[flag] for i in inputs] # keep only unfinished inputs
output_ids = output_ids[flag] # keep only unfinished candidates
end_counts = end_counts[flag] # keep only unfinished end counts
if len(output_ids) == 0:
break
# any sequences still unfinished go straight into the results
for ids in output_ids:
results.append(ids)
# return the results
return results
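# ---- Minimal subclass sketch: predict() must be decorated with wraps(); the random
# logits below are a stand-in for a real model, just to show the wiring end to end ----
class _ToyDecoder(AutoRegressiveDecoder):
    @AutoRegressiveDecoder.wraps(default_rtype='logits')
    def predict(self, inputs, output_ids, states):
        vocab_size = 8
        return torch.randn(output_ids.shape[0], vocab_size)  # [num_beams, vocab_size]
# _ToyDecoder(start_id=1, end_id=2, maxlen=5).beam_search([[1, 3, 4]], topk=3)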
def search_layer(model, layer_name, retrun_first=True):
'''Search parameters by layer_name and return a single parameter / list of parameters
'''
return_list = []
for name, param in model.named_parameters():
if param.requires_grad and layer_name in name:
return_list.append(param)
if len(return_list) == 0:
return None
if retrun_first:
return return_list[0]
else:
return return_list
class ListDataset(Dataset):
'''Dataset backed by a list; accepts either a file_path or pre-loaded data (a list)
'''
def __init__(self, file_path=None, data=None, **kwargs):
self.kwargs = kwargs
if isinstance(file_path, (str, tuple, list)):
self.data = self.load_data(file_path)
elif isinstance(data, list):
self.data = data
else:
raise ValueError('Input should be a file_path (str/tuple/list) or a list-format dataset')
def __len__(self):
return len(self.data)
def __getitem__(self, index):
return self.data[index]
@staticmethod
def load_data(file_path):
return file_path
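# ---- Example subclass: load_data turns a (hypothetical) tab-separated file into a list ----
class _ExampleDataset(ListDataset):
    @staticmethod
    def load_data(file_path):
        samples = []
        with open(file_path, encoding='utf-8') as f:
            for line in f:
                text, label = line.rstrip('\n').split('\t')
                samples.append((text, int(label)))
        return samples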
class IterDataset(IterableDataset):
'''Stream files line by line; intended for large datasets split across many small files
Note that steps_per_epoch must not be None when using this dataset
'''
def __init__(self, file_path=None, **kwargs):
self.kwargs = kwargs
if isinstance(file_path, (str, tuple, list)):
self.file_path = file_path
else:
raise ValueError('Input should be a file_path in str/tuple/list format')
def __iter__(self):
return self.load_data(self.file_path)
@staticmethod
def load_data(file_path, verbose=0):
if isinstance(file_path, (tuple, list)):
for file in file_path:
if verbose != 0:
print("Load data: ", file)
with open(file, 'r') as file_obj:
for line in file_obj:
yield line
elif isinstance(file_path, str):
with open(file_path, 'r') as file_obj:
for line in file_obj:
yield line
def get_sinusoid_encoding_table(n_position, d_hid, padding_idx=None):
''' Sinusoidal position encoding
Returns: [seq_len, d_hid]
'''
position = torch.arange(0, n_position, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_hid, 2).float() * (-math.log(10000.0) / d_hid))
embeddings_table = torch.zeros(n_position, d_hid)
embeddings_table[:, 0::2] = torch.sin(position * div_term)
embeddings_table[:, 1::2] = torch.cos(position * div_term)
return embeddings_table
# alternative implementation (unreachable after the return above; kept for reference only)
position_ids = torch.arange(0, n_position).unsqueeze(1)
position_ids = position_ids.expand(-1, d_hid)
indices = torch.arange(0, d_hid)
position_ids = position_ids * torch.pow(10000, -2 * torch.true_divide(torch.floor_divide(indices, 2), d_hid))
position_ids[:, ::2] = torch.sin(position_ids[:, ::2])
position_ids[:, 1::2] = torch.cos(position_ids[:, 1::2])
return position_ids
def cal_ts_num(tensor_shape):
'''Count how many tensors of a given shape are tracked by gc (useful for hunting memory leaks)
'''
cal_num = 0
for obj in gc.get_objects():
try:
if torch.is_tensor(obj): # or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
tensor = obj
else:
continue
if tensor.is_cuda and tensor.size() == tensor_shape:
print(tensor.shape)
cal_num+=1
except Exception as e:
print('A trivial exception occurred: {}'.format(e))
print(cal_num)
def get_kw(cls, kwargs):
'''Keep the kwargs left over after excluding cls's own arguments
'''
kwargs_new = {}
for k in kwargs:
if k not in set(inspect.getfullargspec(cls)[0]): # getargspec is deprecated and removed in python 3.11+
kwargs_new[k] = kwargs[k]
return kwargs_new
class FGM():
'''Adversarial training: FGM (fast gradient method)
'''
def __init__(self, model):
self.model = model
self.backup = {}
def attack(self, epsilon=1., emb_name='word_embeddings', **kwargs):
# set emb_name to the name of the embedding parameter in your model,
# e.g. for self.emb = nn.Embedding(5000, 100) use emb_name='emb'
for name, param in self.model.named_parameters():
if param.requires_grad and emb_name in name:
self.backup[name] = param.data.clone()
norm = torch.norm(param.grad) # L2 norm by default
if norm != 0 and not torch.isnan(norm): # the nan check guards against apex mixed precision
r_at = epsilon * param.grad / norm
param.data.add_(r_at)
def restore(self, emb_name='word_embeddings', **kwargs): # default aligned with attack(); originally 'emb'
# set emb_name to the name of the embedding parameter in your model
for name, param in self.model.named_parameters():
if param.requires_grad and emb_name in name:
assert name in self.backup
param.data = self.backup[name]
self.backup = {}
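# ---- Canonical FGM training-step sketch (model/loss_fn/optimizer/batch are stand-ins;
# the embedding parameter is assumed to contain 'word_embeddings' in its name) ----
def _example_fgm_step(model, loss_fn, optimizer, inputs, labels):
    fgm = FGM(model)
    loss = loss_fn(model(inputs), labels)
    loss.backward()                                    # gradients on the clean batch
    fgm.attack(epsilon=1., emb_name='word_embeddings') # perturb the embeddings
    adv_loss = loss_fn(model(inputs), labels)
    adv_loss.backward()                                # accumulate adversarial gradients
    fgm.restore(emb_name='word_embeddings')            # undo the perturbation
    optimizer.step()
    optimizer.zero_grad()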
class PGD():
'''Adversarial training: PGD (projected gradient descent)
'''
def __init__(self, model):
self.model = model
self.emb_backup = {}
self.grad_backup = {}
def attack(self, epsilon=1., alpha=0.3, emb_name='word_embeddings', is_first_attack=False, **kwargs):
# set emb_name to the name of the embedding parameter in your model
for name, param in self.model.named_parameters():
if param.requires_grad and emb_name in name:
if is_first_attack:
self.emb_backup[name] = param.data.clone()
norm = torch.norm(param.grad)
if norm != 0 and not torch.isnan(norm): # the nan check guards against apex mixed precision
r_at = alpha * param.grad / norm
param.data.add_(r_at)
param.data = self.project(name, param.data, epsilon)
def restore(self, emb_name='word_embeddings', **kwargs): # default aligned with attack(); originally 'emb'
# set emb_name to the name of the embedding parameter in your model
for name, param in self.model.named_parameters():
if param.requires_grad and emb_name in name:
assert name in self.emb_backup
param.data = self.emb_backup[name]
self.emb_backup = {}
def project(self, param_name, param_data, epsilon):
r = param_data - self.emb_backup[param_name]
if torch.norm(r) > epsilon:
r = epsilon * r / torch.norm(r)
return self.emb_backup[param_name] + r
def backup_grad(self):
for name, param in self.model.named_parameters():
# handles parameters (e.g. a pooling layer) that join forward but not backward, where grad is None
if param.requires_grad and (param.grad is not None):
self.grad_backup[name] = param.grad.clone()
def restore_grad(self):
for name, param in self.model.named_parameters():
if param.requires_grad and (param.grad is not None):
param.grad = self.grad_backup[name]
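# ---- Canonical K-step PGD training-step sketch (stand-in model/loss_fn/optimizer/batch) ----
def _example_pgd_step(model, loss_fn, optimizer, inputs, labels, K=3):
    pgd = PGD(model)
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    pgd.backup_grad()                          # save the clean gradients
    for t in range(K):
        pgd.attack(is_first_attack=(t == 0))   # perturb embeddings (projected each step)
        if t != K - 1:
            model.zero_grad()                  # intermediate steps only shape the perturbation
        else:
            pgd.restore_grad()                 # last step: restore clean grads, then add adv grads
        adv_loss = loss_fn(model(inputs), labels)
        adv_loss.backward()
    pgd.restore()                              # undo the perturbation
    optimizer.step()
    optimizer.zero_grad()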
class VAT():
'''Virtual adversarial training, following https://github.com/namisan/mt-dnn/blob/v0.2/alum/adv_masked_lm.py
'''
def __init__(self, model, emb_name='word_embeddings', noise_var=1e-5, noise_gamma=1e-6, adv_step_size=1e-3,
adv_alpha=1, norm_type='l2', **kwargs):
self.model = model
self.noise_var = noise_var # variance of the noise
self.noise_gamma = noise_gamma # eps
self.adv_step_size = adv_step_size # learning rate of the inner step
self.adv_alpha = adv_alpha # weight of the adversarial loss
self.norm_type = norm_type # normalization type
self.embed = None
for (name, module) in self.model.named_modules():
if emb_name in name:
module.register_forward_hook(hook=self.hook)
def hook(self, module, fea_in, fea_out):
self.embed = fea_out
return None
def forward_(self, train_X, new_embed):
# replace the token_ids in train_X with their embedding form
if isinstance(train_X, (tuple, list)):
new_train_X = [new_embed] + train_X[1:]
adv_output = self.model.forward(*new_train_X) if self.model.forward.__code__.co_argcount >= 3 else self.model.forward(new_train_X)
elif isinstance(train_X, torch.Tensor):
adv_output = self.model.forward(new_embed)
return adv_output
def virtual_adversarial_training(self, train_X, logits):
# 初始扰动 r
noise = self.embed.data.new(self.embed.size()).normal_(0, 1) * self.noise_var
noise.requires_grad_()
# x + r
new_embed = self.embed.data.detach() + noise
adv_output = self.forward_(train_X, new_embed) # forward第一次
adv_logits = adv_output[0] if isinstance(adv_output, (list, tuple)) else adv_output
adv_loss = self.kl(adv_logits, logits.detach(), reduction="batchmean")
delta_grad, = torch.autograd.grad(adv_loss, noise, only_inputs=True)
norm = delta_grad.norm()
# 梯度消失,退出
if torch.isnan(norm) or torch.isinf(norm):
return None
# inner sum
noise = noise + delta_grad * self.adv_step_size
# projection
noise = self.adv_project(noise, norm_type=self.norm_type, eps=self.noise_gamma)
new_embed = self.embed.data.detach() + noise
new_embed = new_embed.detach()
# 在进行一次训练
adv_output = self.forward_(train_X, new_embed) # forward第二次
adv_logits = adv_output[0] if isinstance(adv_output, (list, tuple)) else adv_output
adv_loss_f = self.kl(adv_logits, logits.detach())
adv_loss_b = self.kl(logits, adv_logits.detach())
# 在预训练时设置为10,下游任务设置为1
adv_loss = (adv_loss_f + adv_loss_b) * self.adv_alpha
return adv_loss
@staticmethod
def kl(inputs, targets, reduction="sum"):
"""
计算kl散度
inputs:tensor,logits
targets:tensor,logits
"""
loss = F.kl_div(F.log_softmax(inputs, dim=-1), F.softmax(targets, dim=-1), reduction=reduction)
return loss
@staticmethod
def adv_project(grad, norm_type='inf', eps=1e-6):
"""
L0,L1,L2正则,对于扰动计算
"""
if norm_type == 'l2':
direction = grad / (torch.norm(grad, dim=-1, keepdim=True) + eps)
elif norm_type == 'l1':
direction = grad.sign()
else:
direction = grad / (grad.abs().max(-1, keepdim=True)[0] + eps)
return direction
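# Usage sketch (illustrative only, not part of the library API): adding the VAT
# loss on top of a supervised loss. `model`, `loss_fn`, `optimizer` and
# `dataloader` are hypothetical stand-ins; token_ids must be the model's first input.
def _example_vat_usage(model, loss_fn, optimizer, dataloader):
    vat = VAT(model, emb_name='word_embeddings', adv_alpha=1)
    for token_ids, labels in dataloader:
        logits = model(token_ids)  # the forward hook records the embedding output
        loss = loss_fn(logits, labels)
        adv_loss = vat.virtual_adversarial_training(token_ids, logits)
        if adv_loss is not None:  # None means the perturbation gradient vanished/exploded
            loss = loss + adv_loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()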
class WebServing(object):
    """A simple web-serving wrapper
    Usage:
        arguments = {'text': (None, True), 'n': (int, False)}
        web = WebServing(port=8864)
        web.route('/gen_synonyms', gen_synonyms, arguments)
        web.start()
        # then visit http://127.0.0.1:8864/gen_synonyms?text=你好
    Notes:
        A thin wrapper around bottlepy, intended for quick ad-hoc testing only,
        with no performance guarantees (originally only verified with
        Tensorflow 1.x + Keras <= 2.3.1). Improvements from experienced
        developers are welcome.
    Dependencies:
        pip install bottle
        pip install paste
        (paste is only needed when server='paste')
    """
    def __init__(self, host='0.0.0.0', port=8000, server='paste'):
        import bottle
        self.host = host
        self.port = port
        self.server = server
        self.bottle = bottle
    def wraps(self, func, arguments, method='GET'):
        """Wrap func as a web endpoint
        Arguments:
            func: the function to expose; its output must be JSON-serializable,
                i.e. json.dumps(func(inputs)) must succeed;
            arguments: declares the parameters func needs; each key is a parameter
                name, value[0] is the conversion function (request parameters
                arrive as strings) and value[1] marks whether it is required;
            method: GET or POST.
        """
def new_func():
outputs = {'code': 0, 'desc': u'succeeded', 'data': {}}
kwargs = {}
for key, value in arguments.items():
if method == 'GET':
result = self.bottle.request.GET.getunicode(key)
else:
result = self.bottle.request.POST.getunicode(key)
if result is None:
if value[1]:
outputs['code'] = 1
outputs['desc'] = 'lack of "%s" argument' % key
return json.dumps(outputs, ensure_ascii=False)
else:
if value[0] is not None:
result = value[0](result)
kwargs[key] = result
try:
outputs['data'] = func(**kwargs)
except Exception as e:
outputs['code'] = 2
outputs['desc'] = str(e)
return json.dumps(outputs, ensure_ascii=False)
return new_func
    def route(self, path, func, arguments, method='GET'):
        """Register an endpoint
        """
        func = self.wraps(func, arguments, method)
        self.bottle.route(path, method=method)(func)
    def start(self):
        """Start the server
        """
        self.bottle.run(host=self.host, port=self.port, server=self.server)
def get_pool_emb(hidden_state=None, pooler=None, attention_mask=None, pool_strategy='cls', custom_layer=None):
    ''' Pool token-level hidden states into a sentence embedding
    '''
    if pool_strategy == 'pooler':
        return pooler
    elif pool_strategy == 'cls':
        if isinstance(hidden_state, (list, tuple)):
            hidden_state = hidden_state[-1]
        assert isinstance(hidden_state, torch.Tensor), f'{pool_strategy} strategy requires tensor hidden_state'
        return hidden_state[:, 0]
    elif pool_strategy in {'last-avg', 'mean'}:
        if isinstance(hidden_state, (list, tuple)):
            hidden_state = hidden_state[-1]
        assert isinstance(hidden_state, torch.Tensor), f'{pool_strategy} pooling strategy requires tensor hidden_state'
        hid = torch.sum(hidden_state * attention_mask[:, :, None], dim=1)
        attention_mask = torch.sum(attention_mask, dim=1)[:, None]
        return hid / attention_mask
    elif pool_strategy in {'last-max', 'max'}:
        if isinstance(hidden_state, (list, tuple)):
            hidden_state = hidden_state[-1]
        assert isinstance(hidden_state, torch.Tensor), f'{pool_strategy} pooling strategy requires tensor hidden_state'
        hid = hidden_state * attention_mask[:, :, None]
        return torch.max(hid, dim=1)[0]  # torch.max with dim returns a (values, indices) tuple
    elif pool_strategy == 'first-last-avg':
        assert isinstance(hidden_state, list), f'{pool_strategy} pooling strategy requires list hidden_state'
        hid = torch.sum(hidden_state[1] * attention_mask[:, :, None], dim=1)  # layer 0 (the embedding output) is skipped
        hid += torch.sum(hidden_state[-1] * attention_mask[:, :, None], dim=1)
        attention_mask = torch.sum(attention_mask, dim=1)[:, None]
        return hid / (2 * attention_mask)
    elif pool_strategy == 'custom':
        # pool over user-specified layers
        assert isinstance(hidden_state, list), f'{pool_strategy} pooling strategy requires list hidden_state'
        assert isinstance(custom_layer, (int, list, tuple)), f'{pool_strategy} pooling strategy requires int/list/tuple custom_layer'
        custom_layer = [custom_layer] if isinstance(custom_layer, int) else custom_layer
        hid = 0
        for i, layer in enumerate(custom_layer, start=1):
            hid += torch.sum(hidden_state[layer] * attention_mask[:, :, None], dim=1)
        attention_mask = torch.sum(attention_mask, dim=1)[:, None]
        return hid / (i * attention_mask)
    else:
        raise ValueError('pool_strategy illegal')
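# Usage sketch (illustrative only): pooling [batch, seq_len, hidden] states into
# sentence vectors with different strategies, using toy shapes.
def _example_get_pool_emb():
    hidden_state = torch.randn(2, 8, 768)
    attention_mask = torch.ones(2, 8, dtype=torch.long)
    cls_vec = get_pool_emb(hidden_state, pool_strategy='cls')  # [2, 768], first token
    mean_vec = get_pool_emb(hidden_state, attention_mask=attention_mask, pool_strategy='mean')  # masked average
    max_vec = get_pool_emb(hidden_state, attention_mask=attention_mask, pool_strategy='max')  # masked max
    return cls_vec, mean_vec, max_vec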
def seed_everything(seed=None):
    '''Fix the random seeds of python, numpy and torch for reproducibility
    '''
max_seed_value = np.iinfo(np.uint32).max
min_seed_value = np.iinfo(np.uint32).min
if (seed is None) or not (min_seed_value <= seed <= max_seed_value):
seed = random.randint(np.iinfo(np.uint32).min, np.iinfo(np.uint32).max)
print(f"Global seed set to {seed}")
os.environ["PYTHONHASHSEED"] = str(seed)
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
return seed
def parallel_apply_generator(func, iterable, workers, max_queue_size, dummy=False, random_seeds=True):
    """Apply func to every element of iterable with multiple processes or threads
    (ported directly from bert4keras). Note that this apply is asynchronous and
    unordered: inputs a, b, c may come back as func(c), func(a), func(b). The
    results are returned as a generator whose items are (input index, result).
    Arguments:
        dummy: False for multiprocessing, True for multithreading;
        random_seeds: random seed for each worker.
    """
if dummy:
from multiprocessing.dummy import Pool, Queue
else:
from multiprocessing import Pool, Queue
in_queue, out_queue, seed_queue = Queue(max_queue_size), Queue(), Queue()
if random_seeds is True:
random_seeds = [None] * workers
elif random_seeds is None or random_seeds is False:
random_seeds = []
for seed in random_seeds:
seed_queue.put(seed)
    def worker_step(in_queue, out_queue):
        """Run the single-step function in a loop
        """
if not seed_queue.empty():
np.random.seed(seed_queue.get())
while True:
i, d = in_queue.get()
r = func(d)
out_queue.put((i, r))
    # start the worker processes/threads
pool = Pool(workers, worker_step, (in_queue, out_queue))
    # feed in inputs and drain results
in_count, out_count = 0, 0
for i, d in enumerate(iterable):
in_count += 1
while True:
try:
in_queue.put((i, d), block=False)
break
except six.moves.queue.Full:
while out_queue.qsize() > max_queue_size:
yield out_queue.get()
out_count += 1
if out_queue.qsize() > 0:
yield out_queue.get()
out_count += 1
while out_count != in_count:
yield out_queue.get()
out_count += 1
pool.terminate()
def parallel_apply(func, iterable, workers, max_queue_size, callback=None, dummy=False, random_seeds=True, unordered=True):
    """Apply func to every element of iterable with multiple processes or threads
    (ported directly from bert4keras). Note that this apply is asynchronous and
    unordered: inputs a, b, c may come back as func(c), func(a), func(b).
    Arguments:
        callback: callback applied to each single output;
        dummy: False for multiprocessing, True for multithreading; on Windows set dummy=True;
        random_seeds: random seed for each worker;
        unordered: if False, results are returned in input order (only effective when callback is None).
    """
generator = parallel_apply_generator(func, iterable, workers, max_queue_size, dummy, random_seeds)
if callback is None:
if unordered:
return [d for i, d in generator]
else:
results = sorted(generator, key=lambda d: d[0])
return [d for i, d in results]
else:
for i, d in generator:
callback(d)
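def _square(x):
    # defined at module level so multiprocessing can pickle it
    return x * x

# Usage sketch (illustrative only): run a function over an iterable with 4
# workers; unordered=False restores the input order. On Windows pass dummy=True.
def _example_parallel_apply():
    results = parallel_apply(_square, range(100), workers=4, max_queue_size=10, unordered=False)
    print(results[:5])  # -> [0, 1, 4, 9, 16]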
# coding=utf-8
"""Tokenization classes."""
from __future__ import absolute_import, division, print_function, unicode_literals
import collections
import logging
import unicodedata
from io import open
from bert4torch.snippets import truncate_sequences, is_string, lowercase_and_normalize
import re
import six
from collections import OrderedDict
logger = logging.getLogger(__name__)
is_py2 = six.PY2
def load_vocab(dict_path, encoding="utf-8", simplified=False, startswith=None):
    """Load a vocab file into a dict"""
token_dict = collections.OrderedDict()
index = 0
with open(dict_path, "r", encoding=encoding) as reader:
while True:
token = reader.readline()
if not token:
break
token = token.strip()
token_dict[token] = index
index += 1
    if simplified:  # filter out redundant tokens such as [unused1]
new_token_dict, keep_tokens = {}, []
startswith = startswith or []
for t in startswith:
new_token_dict[t] = len(new_token_dict)
keep_tokens.append(token_dict[t])
for t, _ in sorted(token_dict.items(), key=lambda s: s[1]):
if t not in new_token_dict and not Tokenizer._is_redundant(t):
new_token_dict[t] = len(new_token_dict)
keep_tokens.append(token_dict[t])
return new_token_dict, keep_tokens
else:
return token_dict
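# Usage sketch (illustrative only): `dict_path` is a hypothetical local path to a
# BERT-style vocab.txt; simplified=True drops redundant tokens such as [unused1].
def _example_load_vocab(dict_path):
    token_dict = load_vocab(dict_path)
    simplified_dict, keep_tokens = load_vocab(
        dict_path, simplified=True,
        startswith=['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'])
    return token_dict, simplified_dict, keep_tokens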
def whitespace_tokenize(text):
    """Strip the text and split it on whitespace"""
text = text.strip()
if not text:
return []
tokens = text.split()
return tokens
class TokenizerBase(object):
    """Base class for tokenizers
    """
    def __init__(self, token_start='[CLS]', token_end='[SEP]', token_unk='[UNK]', token_pad='[PAD]', token_mask='[MASK]',
                 add_special_tokens=None, pre_tokenize=None, token_translate=None):
        """Arguments:
        token_unk: unknown-token marker
        token_end: sentence separator; with a single input sentence it only acts as the end marker,
            with two input sentences it acts both as the separator and as the end marker of the second sentence
        token_pad: padding marker
        token_start: classification marker, placed at the very beginning of the sequence
        token_mask: mask marker
        pre_tokenize: externally supplied function used to pre-tokenize the text; if given,
            pre_tokenize(text) runs first and the regular tokenize then runs on its output;
        token_translate: mapping dict applied after tokenize to replace certain special tokens with others.
        """
self._token_pad = token_pad
self._token_unk = token_unk
self._token_mask = token_mask
self._token_start = token_start
self._token_end = token_end
self.never_split = [self._token_unk, self._token_end, self._token_pad, self._token_start, self._token_mask]
if add_special_tokens is not None:
if isinstance(add_special_tokens, (tuple, list)):
self.never_split.extend(add_special_tokens)
elif isinstance(add_special_tokens, str):
self.never_split.append(add_special_tokens)
        self.tokens_trie = self._create_trie(self.never_split)  # the trie is mainly for splitting on special_tokens
self._pre_tokenize = pre_tokenize
self._token_translate = token_translate or {}
self._token_translate_inv = {v: k for k, v in self._token_translate.items()}
def _create_trie(self, unique_no_split_tokens):
trie = Trie()
for token in unique_no_split_tokens:
trie.add(token)
return trie
    def tokenize(self, text, maxlen=None):
        """Tokenize the text, adding the start/end markers and truncating to maxlen if given
        """
tokens = [self._token_translate.get(token) or token for token in self._tokenize(text)]
if self._token_start is not None:
tokens.insert(0, self._token_start)
if self._token_end is not None:
tokens.append(self._token_end)
if maxlen is not None:
index = int(self._token_end is not None) + 1
truncate_sequences(maxlen, -index, tokens)
return tokens
    def token_to_id(self, token):
        """Convert a token to its id
        """
        raise NotImplementedError
    def tokens_to_ids(self, tokens):
        """Convert a token sequence to the corresponding id sequence
        """
        return [self.token_to_id(token) for token in tokens]
    def _encode(self, first_text, second_text=None, maxlen=None, pattern='S*E*E', truncate_from='right', return_offsets=False):
        """Return the token ids and segment ids of the given text (pair)
        """
first_tokens = self.tokenize(first_text) if is_string(first_text) else first_text
if second_text is None:
second_tokens = None
elif is_string(second_text):
second_tokens = self.tokenize(second_text)
else:
second_tokens = second_text
if maxlen is not None:
            # truncation strategy: preferentially truncate the longer sequence
if truncate_from == 'right':
index = -int(self._token_end is not None) - 1
elif truncate_from == 'left':
index = int(self._token_start is not None)
else:
index = truncate_from
if second_text is not None and pattern == 'S*E*E':
maxlen += 1
truncate_sequences(maxlen, index, first_tokens, second_tokens)
first_token_ids = self.tokens_to_ids(first_tokens)
first_segment_ids = [0] * len(first_token_ids)
if second_text is not None:
if pattern == 'S*E*E':
idx = int(bool(self._token_start))
second_tokens = second_tokens[idx:]
second_token_ids = self.tokens_to_ids(second_tokens)
second_segment_ids = [1] * len(second_token_ids)
first_token_ids.extend(second_token_ids)
first_segment_ids.extend(second_segment_ids)
encode_output = [first_token_ids, first_segment_ids]
        if return_offsets:
            offset = self.rematch(first_text, first_tokens)
            if second_text is not None:  # avoid calling rematch on a missing second text
                offset += self.rematch(second_text, second_tokens)
            if return_offsets == 'transformers':  # offsets in the transformers-tokenizer format
                encode_output.append([[0, 0] if not k else [k[0], k[-1]+1] for k in offset])
            else:
                encode_output.append(offset)
return encode_output
    def encode(self, first_texts, second_texts=None, maxlen=None, pattern='S*E*E', truncate_from='right', return_offsets=False):
        '''Encode a single sample or a batch of samples
        '''
        return_list = not isinstance(first_texts, str)
first_texts = [first_texts] if isinstance(first_texts, str) else first_texts
second_texts = [second_texts] if isinstance(second_texts, str) else second_texts
first_token_ids, first_segment_ids, offsets = [], [], []
if second_texts is None:
second_texts = [None] * len(first_texts)
assert len(first_texts) == len(second_texts), 'first_texts and second_texts should be same length'
        # process each sample in turn
for first_text, second_text in zip(first_texts, second_texts):
outputs = self._encode(first_text, second_text, maxlen, pattern, truncate_from, return_offsets)
first_token_ids.append(outputs[0])
first_segment_ids.append(outputs[1])
if len(outputs) >= 3:
offsets.append(outputs[2])
encode_outputs = [first_token_ids, first_segment_ids]
if return_offsets:
encode_outputs.append(offsets)
        if not return_list:  # the input was a single string, so unwrap the batch dimension
encode_outputs = [item[0] for item in encode_outputs]
return encode_outputs
    def id_to_token(self, i):
        """Convert an id to its token
        """
        raise NotImplementedError
    def ids_to_tokens(self, ids):
        """Convert an id sequence to the corresponding token sequence
        """
        return [self.id_to_token(i) for i in ids]
    def decode(self, ids):
        """Convert ids back to readable text
        """
        raise NotImplementedError
    def _tokenize(self, text):
        """Core tokenization function
        """
        raise NotImplementedError
    def rematch(self, text, tokens):
        """Build the mapping between the original text and its tokens
        """
        pass
class Tokenizer(TokenizerBase):
    """The native BERT tokenizer
    """
    def __init__(self, token_dict, do_lower_case=True, do_basic_tokenize=True, do_tokenize_unk=False, **kwargs):
        """
        Arguments:
            token_dict:
                vocab file path, or an already-loaded token dict
            do_lower_case:
                whether to lowercase the input
            do_basic_tokenize:
                whether to run basic tokenization before wordpiece
            do_tokenize_unk:
                whether to emit the [UNK] marker during tokenize (True) or defer it to the encode stage (False)
        """
super(Tokenizer, self).__init__(**kwargs)
if is_string(token_dict):
token_dict = load_vocab(token_dict)
self._do_lower_case = do_lower_case
self._vocab_size = len(token_dict)
self._token_dict = token_dict
self._token_dict_inv = {v: k for k, v in token_dict.items()}
self.do_basic_tokenize = do_basic_tokenize
if do_basic_tokenize:
self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case, never_split=self.never_split)
self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self._token_dict, unk_token=self._token_unk, do_tokenize_unk=do_tokenize_unk)
        for token in ['pad', 'unk', 'mask', 'start', 'end']:
            try:
                _token_id = token_dict[getattr(self, '_token_%s' % token)]
                setattr(self, '_token_%s_id' % token, _token_id)
            except KeyError:  # the token is not in the vocab (e.g. token_start=None)
                pass
    def _tokenize(self, text, pre_tokenize=True):
        """Core tokenization function
        """
        # the pre_tokenize logic below follows bert4keras
if self._do_lower_case:
text = lowercase_and_normalize(text, never_split=self.never_split)
if pre_tokenize and self._pre_tokenize is not None:
tokens = []
for token in self._pre_tokenize(text):
if token in self._token_dict:
tokens.append(token)
else:
tokens.extend(self._tokenize(token, False))
return tokens
        # the logic below follows the pytorch BERT tokenizer
        text_pieces = self.tokens_trie.split(text)  # added step, mainly to split on special_tokens
split_tokens = []
for text_piece in text_pieces:
if not text_piece:
continue
elif text_piece in self._token_dict:
split_tokens.append(text_piece)
elif self.do_basic_tokenize:
for token in self.basic_tokenizer.tokenize(text_piece):
for sub_token in self.wordpiece_tokenizer.tokenize(token):
split_tokens.append(sub_token)
else:
split_tokens.extend(self.wordpiece_tokenizer.tokenize(text_piece))
return split_tokens
    def token_to_id(self, token):
        """Convert a token to its vocab id"""
        return self._token_dict.get(token, self._token_unk_id)
    def id_to_token(self, id):
        """Convert a vocab id to its token"""
        return self._token_dict_inv[id]
    def decode(self, ids, tokens=None):
        """Convert ids back to readable text
        """
tokens = tokens or self.ids_to_tokens(ids)
tokens = [token for token in tokens if not self._is_special(token)]
text, flag = '', False
for i, token in enumerate(tokens):
if token[:2] == '##':
text += token[2:]
elif len(token) == 1 and self._is_cjk_character(token):
text += token
elif len(token) == 1 and self._is_punctuation(token):
text += token
text += ' '
elif i > 0 and self._is_cjk_character(text[-1]):
text += token
else:
text += ' '
text += token
text = re.sub(' +', ' ', text)
text = re.sub('\' (re|m|s|t|ve|d|ll) ', '\'\\1 ', text)
punctuation = self._cjk_punctuation() + '+-/={(<['
punctuation_regex = '|'.join([re.escape(p) for p in punctuation])
punctuation_regex = '(%s) ' % punctuation_regex
text = re.sub(punctuation_regex, '\\1', text)
        text = re.sub(r'(\d\.) (\d)', '\\1\\2', text)
return text.strip()
    @staticmethod
    def stem(token):
        """Return the "stem" of a token (strip the leading ## if present)
        """
if token[:2] == '##':
return token[2:]
else:
return token
@staticmethod
    def _is_space(ch):
        """Whitespace character check
        """
return ch == ' ' or ch == '\n' or ch == '\r' or ch == '\t' or \
unicodedata.category(ch) == 'Zs'
    @staticmethod
    def _is_punctuation(ch):
        """Punctuation character check (covers both full- and half-width forms)
        Note: unicodedata.category may behave differently under py2 and py3;
        e.g. u'§' is 'So' under py2 but 'Po' under py3.
        """
code = ord(ch)
return 33 <= code <= 47 or \
58 <= code <= 64 or \
91 <= code <= 96 or \
123 <= code <= 126 or \
unicodedata.category(ch).startswith('P')
@staticmethod
def _cjk_punctuation():
return u'\uff02\uff03\uff04\uff05\uff06\uff07\uff08\uff09\uff0a\uff0b\uff0c\uff0d\uff0f\uff1a\uff1b\uff1c\uff1d\uff1e\uff20\uff3b\uff3c\uff3d\uff3e\uff3f\uff40\uff5b\uff5c\uff5d\uff5e\uff5f\uff60\uff62\uff63\uff64\u3000\u3001\u3003\u3008\u3009\u300a\u300b\u300c\u300d\u300e\u300f\u3010\u3011\u3014\u3015\u3016\u3017\u3018\u3019\u301a\u301b\u301c\u301d\u301e\u301f\u3030\u303e\u303f\u2013\u2014\u2018\u2019\u201b\u201c\u201d\u201e\u201f\u2026\u2027\ufe4f\ufe51\ufe54\u00b7\uff01\uff1f\uff61\u3002'
    @staticmethod
    def _is_cjk_character(ch):
        """CJK character check (Chinese characters included)
        See: https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
        """
code = ord(ch)
return 0x4E00 <= code <= 0x9FFF or \
0x3400 <= code <= 0x4DBF or \
0x20000 <= code <= 0x2A6DF or \
0x2A700 <= code <= 0x2B73F or \
0x2B740 <= code <= 0x2B81F or \
0x2B820 <= code <= 0x2CEAF or \
0xF900 <= code <= 0xFAFF or \
0x2F800 <= code <= 0x2FA1F
    @staticmethod
    def _is_control(ch):
        """Control character check
        """
return unicodedata.category(ch) in ('Cc', 'Cf')
    @staticmethod
    def _is_special(ch):
        """Check whether the token is a special marker such as [CLS]
        """
return bool(ch) and (ch[0] == '[') and (ch[-1] == ']')
    @staticmethod
    def _is_redundant(token):
        """Check whether the token is redundant (cannot be produced under default settings)
        """
if len(token) > 1:
for ch in Tokenizer.stem(token):
if (
Tokenizer._is_cjk_character(ch) or
Tokenizer._is_punctuation(ch)
):
return True
    def rematch(self, text, tokens):
        """Return the mapping between the original text and the tokenized tokens
        """
if is_py2:
text = unicode(text)
if self._do_lower_case:
text = text.lower()
normalized_text, char_mapping = '', []
for i, ch in enumerate(text):
if self._do_lower_case:
ch = lowercase_and_normalize(ch, self.never_split)
ch = ''.join([
c for c in ch
if not (ord(c) == 0 or ord(c) == 0xfffd or self._is_control(c))
])
normalized_text += ch
char_mapping.extend([i] * len(ch))
text, token_mapping, offset = normalized_text, [], 0
for token in tokens:
if self._is_special(token):
token_mapping.append([])
else:
token = self.stem(token)
start = text[offset:].index(token) + offset
end = start + len(token)
token_mapping.append(char_mapping[start:end])
offset = end
return token_mapping
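# Usage sketch (illustrative only): a round trip with the BERT tokenizer.
# `vocab_path` is a hypothetical local path to a BERT-style vocab.txt.
def _example_tokenizer_usage(vocab_path):
    tokenizer = Tokenizer(vocab_path, do_lower_case=True)
    tokens = tokenizer.tokenize(u'语言模型')  # ['[CLS]', '语', '言', '模', '型', '[SEP]']
    token_ids, segment_ids = tokenizer.encode(u'语言模型')
    print(tokenizer.decode(token_ids))  # -> '语言模型' (special tokens stripped)
    print(tokenizer.rematch(u'语言模型', tokens))  # -> [[], [0], [1], [2], [3], []]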
class BasicTokenizer(object):
"""Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
def __init__(self, do_lower_case=True, never_split=("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]")):
"""Constructs a BasicTokenizer.
Args:
do_lower_case: Whether to lower case the input.
"""
self.do_lower_case = do_lower_case
self.never_split = never_split
    def tokenize(self, text):
        """Split text into tokens"""
text = self._clean_text(text)
text = self._tokenize_chinese_chars(text)
orig_tokens = whitespace_tokenize(text)
split_tokens = []
for token in orig_tokens:
if self.do_lower_case and token not in self.never_split:
token = token.lower()
token = self._run_strip_accents(token)
split_tokens.extend(self._run_split_on_punc(token))
output_tokens = whitespace_tokenize(" ".join(split_tokens))
return output_tokens
def _run_strip_accents(self, text):
"""Strips accents from a piece of text."""
text = unicodedata.normalize("NFD", text)
output = []
for char in text:
cat = unicodedata.category(char)
if cat == "Mn":
continue
output.append(char)
return "".join(output)
def _run_split_on_punc(self, text):
"""Splits punctuation on a piece of text."""
if text in self.never_split:
return [text]
chars = list(text)
i = 0
start_new_word = True
output = []
while i < len(chars):
char = chars[i]
if _is_punctuation(char):
output.append([char])
start_new_word = True
else:
if start_new_word:
output.append([])
start_new_word = False
output[-1].append(char)
i += 1
return ["".join(x) for x in output]
def _tokenize_chinese_chars(self, text):
"""Adds whitespace around any CJK character."""
output = []
for char in text:
cp = ord(char)
if self._is_chinese_char(cp):
output.append(" ")
output.append(char)
output.append(" ")
else:
output.append(char)
return "".join(output)
def _is_chinese_char(self, cp):
"""Checks whether CP is the codepoint of a CJK character."""
# This defines a "chinese character" as anything in the CJK Unicode block:
# https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
#
# Note that the CJK Unicode block is NOT all Japanese and Korean characters,
# despite its name. The modern Korean Hangul alphabet is a different block,
# as is Japanese Hiragana and Katakana. Those alphabets are used to write
# space-separated words, so they are not treated specially and handled
        # like all of the other languages.
if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
(cp >= 0x3400 and cp <= 0x4DBF) or #
(cp >= 0x20000 and cp <= 0x2A6DF) or #
(cp >= 0x2A700 and cp <= 0x2B73F) or #
(cp >= 0x2B740 and cp <= 0x2B81F) or #
(cp >= 0x2B820 and cp <= 0x2CEAF) or
(cp >= 0xF900 and cp <= 0xFAFF) or #
(cp >= 0x2F800 and cp <= 0x2FA1F)): #
return True
return False
def _clean_text(self, text):
"""Performs invalid character removal and whitespace cleanup on text."""
output = []
for char in text:
cp = ord(char)
if cp == 0 or cp == 0xfffd or _is_control(char):
continue
if _is_whitespace(char):
output.append(" ")
else:
output.append(char)
return "".join(output)
class WordpieceTokenizer(object):
"""Runs WordPiece tokenization."""
def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100, do_tokenize_unk=False):
self.vocab = vocab
self.unk_token = unk_token
self.max_input_chars_per_word = max_input_chars_per_word
self.do_tokenize_unk = do_tokenize_unk
def tokenize(self, text):
"""Tokenizes a piece of text into its word pieces.
This uses a greedy longest-match-first algorithm to perform tokenization
using the given vocabulary.
For example:
input = "unaffable"
output = ["un", "##aff", "##able"]
Args:
text: A single token or whitespace separated tokens. This should have
already been passed through `BasicTokenizer`.
Returns:
A list of wordpiece tokens.
"""
output_tokens = []
for token in whitespace_tokenize(text):
chars = list(token)
if len(chars) > self.max_input_chars_per_word:
                output_tokens.append(self.unk_token if self.do_tokenize_unk else token)  # token exceeds the max length
continue
is_bad = False
start = 0
sub_tokens = []
while start < len(chars):
end = len(chars)
cur_substr = None
while start < end:
substr = "".join(chars[start:end])
if start > 0:
substr = "##" + substr
if (substr in self.vocab) or (not self.do_tokenize_unk):
cur_substr = substr
break
end -= 1
if cur_substr is None:
is_bad = True
break
sub_tokens.append(cur_substr)
start = end
            if self.do_tokenize_unk and is_bad:  # emit UNK already at the tokenize stage
output_tokens.append(self.unk_token)
else:
output_tokens.extend(sub_tokens)
return output_tokens
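# Usage sketch (illustrative only): greedy longest-match-first wordpiece
# splitting over a toy vocab.
def _example_wordpiece():
    vocab = {'un': 0, '##aff': 1, '##able': 2, '[UNK]': 3}
    wp = WordpieceTokenizer(vocab, do_tokenize_unk=True)
    print(wp.tokenize('unaffable'))  # -> ['un', '##aff', '##able']
    print(wp.tokenize('xyz'))  # -> ['[UNK]'], no vocab pieces cover it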
def _is_whitespace(char):
"""Checks whether `chars` is a whitespace character."""
    # \t, \n, and \r are technically control characters but we treat them
    # as whitespace since they are generally considered as such.
if char == " " or char == "\t" or char == "\n" or char == "\r":
return True
cat = unicodedata.category(char)
if cat == "Zs":
return True
return False
def _is_control(char):
"""Checks whether `chars` is a control character."""
# These are technically control characters but we count them as whitespace
# characters.
if char == "\t" or char == "\n" or char == "\r":
return False
cat = unicodedata.category(char)
if cat.startswith("C"):
return True
return False
def _is_punctuation(char):
"""Checks whether `chars` is a punctuation character."""
cp = ord(char)
# We treat all non-letter/number ASCII as punctuation.
# Characters such as "^", "$", and "`" are not in the Unicode
# Punctuation class but we treat them as punctuation anyways, for
# consistency.
if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
(cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
return True
cat = unicodedata.category(char)
if cat.startswith("P"):
return True
return False
def convert_to_unicode(text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
class SpTokenizer(TokenizerBase):
    """Wrapper around a SentencePiece model; usage is mostly identical to Tokenizer.
    """
def __init__(self, sp_model_path, remove_space=True, keep_accents=False, do_lower_case=False, **kwargs):
super(SpTokenizer, self).__init__(**kwargs)
import sentencepiece as spm
self.sp_model = spm.SentencePieceProcessor()
self.sp_model.Load(sp_model_path)
self._token_pad = self.sp_model.id_to_piece(self.sp_model.pad_id())
self._token_unk = self.sp_model.id_to_piece(self.sp_model.unk_id())
self._vocab_size = self.sp_model.get_piece_size()
self.remove_space = remove_space
self.keep_accents = keep_accents
self.do_lower_case = do_lower_case
        for token in ['pad', 'unk', 'mask', 'start', 'end']:
            try:
                _token = getattr(self, '_token_%s' % token)
                _token_id = self.sp_model.piece_to_id(_token)
                setattr(self, '_token_%s_id' % token, _token_id)
            except Exception:  # the piece may be undefined for this model
                pass
    def preprocess_text(self, inputs):
        '''Ported from tokenization_xlnet in the transformers package; the main difference is the punctuation handling
        '''
if self.remove_space:
outputs = " ".join(inputs.strip().split())
else:
outputs = inputs
outputs = outputs.replace("``", '"').replace("''", '"')
if not self.keep_accents:
outputs = unicodedata.normalize("NFKD", outputs)
outputs = "".join([c for c in outputs if not unicodedata.combining(c)])
if self.do_lower_case:
outputs = outputs.lower()
return outputs
    def token_to_id(self, token):
        """Convert a token to its id
        """
        return self.sp_model.piece_to_id(token)
    def id_to_token(self, i):
        """Convert an id to its token
        """
if i < self._vocab_size:
return self.sp_model.id_to_piece(i)
else:
return ''
    def decode(self, ids):
        """Convert ids back to readable text
        """
tokens = [self._token_translate_inv.get(token) or token for token in self.ids_to_tokens(ids)]
text = self.sp_model.decode_pieces(tokens)
return convert_to_unicode(text)
    def _tokenize(self, text):
        """Core tokenization function
        """
if self._pre_tokenize is not None:
text = ' '.join(self._pre_tokenize(text))
        text = self.preprocess_text(text)  # optionally strip spaces, normalize quotes, lowercase
tokens = self.sp_model.encode_as_pieces(text)
return tokens
    def _is_special(self, i):
        """Check whether the id is a special (control/unknown/unused) piece
        """
return self.sp_model.is_control(i) or \
self.sp_model.is_unknown(i) or \
self.sp_model.is_unused(i)
    def _is_decodable(self, i):
        """Check whether the id should appear in decoded output
        """
return (i < self._vocab_size) and not self._is_special(i)
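# Usage sketch (illustrative only): `sp_model_path` is a hypothetical local path
# to a trained SentencePiece .model file (e.g. from a T5 checkpoint) that
# defines pad/unk pieces.
def _example_sp_tokenizer_usage(sp_model_path):
    sp = SpTokenizer(sp_model_path, token_start=None, token_end=None)
    token_ids, segment_ids = sp.encode('hello world')
    print(sp.decode(token_ids))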
class Trie:
    """Ported directly from tokenization_utils.py in transformers, mainly for splitting on special_tokens
    """
def __init__(self):
self.data = {}
def add(self, word: str):
if not word:
# Prevent empty string
return
ref = self.data
for char in word:
            ref.setdefault(char, {})
            ref = ref[char]
ref[""] = 1
def split(self, text: str):
states = OrderedDict()
# This will contain every indices where we need
# to cut.
# We force to cut at offset 0 and len(text) (added later)
offsets = [0]
# This is used by the lookahead which needs to skip over
# some text where the full match exceeded the place in the initial
# for loop
skip = 0
# Main loop, Giving this algorithm O(n) complexity
for current, current_char in enumerate(text):
if skip and current < skip:
# Prevents the lookahead for matching twice
# like extra_id_100 and id_100
continue
# This will track every state
# that stop matching, we need to stop tracking them.
# If we look at "lowball", we're going to match "l" (add it to states), "o", "w", then
# fail on "b", we need to remove 0 from the valid states.
to_remove = set()
# Whenever we found a match, we need to drop everything
# this is a greedy algorithm, it will match on the first found token
reset = False
# In this case, we already have partial matches (But unfinished)
for start, trie_pointer in states.items():
if "" in trie_pointer:
# This is a final match, we need to reset and
# store the results in `offsets`.
# Lookahead to match longest first
# Important in case of extra_id_1 vs extra_id_100
# Here we are also actively looking for other earlier partial
# matches
# "[CLS]", "L", we need to match CLS even if L is special
for lookstart, looktrie_pointer in states.items():
if lookstart > start:
# This partial match is later, we can stop looking
break
elif lookstart < start:
# This partial match is earlier, the trie pointer
# was already updated, so index is + 1
lookahead_index = current + 1
end = current + 1
else:
# Here lookstart == start and
# looktrie_pointer == trie_pointer
# It wasn't updated yet so indices are current ones
lookahead_index = current
end = current
next_char = text[lookahead_index] if lookahead_index < len(text) else None
if "" in looktrie_pointer:
start = lookstart
end = lookahead_index
skip = lookahead_index
while next_char in looktrie_pointer:
looktrie_pointer = looktrie_pointer[next_char]
lookahead_index += 1
if "" in looktrie_pointer:
start = lookstart
end = lookahead_index
skip = lookahead_index
if lookahead_index == len(text):
# End of string
break
next_char = text[lookahead_index]
# End lookahead
# Storing and resetting
offsets.append(start)
offsets.append(end)
reset = True
break
elif current_char in trie_pointer:
# The current character being looked at has a match within the trie
# update the pointer (it will be stored back into states later).
trie_pointer = trie_pointer[current_char]
# Storing back the new pointer into the states.
# Partial matches got longer by one.
states[start] = trie_pointer
else:
# The new character has not match in the trie, we need
# to stop keeping track of this partial match.
# We can't do it directly within the loop because of how
# python iteration works
to_remove.add(start)
# Either clearing the full start (we found a real match)
# Or clearing only the partial matches that didn't work.
if reset:
states = {}
else:
for start in to_remove:
del states[start]
# If this character is a starting character within the trie
# start keeping track of this partial match.
if current >= skip and current_char in self.data:
states[current] = self.data[current_char]
# We have a cut at the end with states.
for start, trie_pointer in states.items():
if "" in trie_pointer:
# This is a final match, we need to reset and
# store the results in `offsets`.
end = len(text)
offsets.append(start)
offsets.append(end)
# Longest cut is always the one with lower start so the first
# item so we need to break.
break
return self.cut_text(text, offsets)
def cut_text(self, text, offsets):
# We have all the offsets now, we just need to do the actual splitting.
# We need to eventually add the first part of the string and the eventual
# last part.
offsets.append(len(text))
tokens = []
start = 0
for end in offsets:
if start > end:
logger.error(
"There was a bug in Trie algorithm in tokenization. Attempting to recover. Please report it anyway."
)
continue
elif start == end:
# This might happen if there's a match at index 0
# we're also preventing zero-width cuts in case of two
# consecutive matches
continue
tokens.append(text[start:end])
start = end
return tokens
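# Usage sketch (illustrative only): splitting text around special tokens.
def _example_trie_usage():
    trie = Trie()
    trie.add('[CLS]')
    trie.add('[SEP]')
    print(trie.split('[CLS]你好[SEP]'))  # -> ['[CLS]', '你好', '[SEP]']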
# 1. Text classification
## 1.1 Metrics across pretrained models
- [sentiment classification dataset](https://github.com/bojone/bert4keras/blob/master/examples/datasets/sentiment.zip) + classification on the cls position
| solution | epoch | valid_acc | test_acc | comment |
| ---- | ---- | ---- | ---- | ---- |
| albert_small | 10/10 | 94.46 | 93.98 | small variant |
| bert | 6/10 | 94.72 | 94.11 | —— |
| roberta | 4/10 | 94.77 | 94.64 | —— |
| nezha | 7/10 | 95.07 | 94.72 | —— |
| xlnet | 6/10 | 95.00 | 94.24 | —— |
| electra | 10/10 | 94.94 | 94.78 | —— |
| roformer | 9/10 | 94.85 | 94.42 | —— |
| roformer_v2 | 3/10 | 95.78 | 96.09 | —— |
| gau_alpha | 2/10 | 95.25 | 94.46 | —— |
## 1.2 Metrics across training tricks
- trick comparison + [sentiment classification dataset](https://github.com/bojone/bert4keras/blob/master/examples/datasets/sentiment.zip) + cls classification + no segment_input
| solution | epoch | valid_acc | test_acc | comment |
| ---- | ---- | ---- | ---- | ---- |
| bert | 10/10 | 94.90 | 94.78 | —— |
| fgm | 4/10 | 95.34 | 94.99 | —— |
| pgd | 6/10 | 95.34 | 94.64 | —— |
| gradient_penalty | 7/10 | 95.07 | 94.81 | —— |
| vat | 8/10 | 95.21 | 95.03 | —— |
| ema | 7/10 | 95.21 | 94.86 | —— |
| ema+warmup | 7/10 | 95.51 | 95.12 | —— |
| mix_up | 6/10 | 95.12 | 94.42 | —— |
| R-drop | 9/10 | 95.25 | 94.94 | —— |
| UDA | 8/10 | 94.90 | 95.56 | —— |
| semi-vat | 10/10 | 95.34 | 95.38 | —— |
| temporal_ensembling | 8/10 | 94.94 | 94.90 | —— |
# 2. Sequence labeling
- [People's Daily NER dataset](http://s3.bmio.net/kashgari/china-people-daily-ner-corpus.tar.gz) + bert pretrained model
- metrics on the valid set
| solution | epoch | f1_token | f1_entity | comment |
| ---- | ---- | ---- | ---- | ---- |
| bert+crf | 18/20 | 96.89 | 96.05 | —— |
| bert+crf+init | 18/20 | 96.93 | 96.08 | crf weights initialized from the training data |
| bert+crf+freeze | 11/20 | 96.89 | 96.13 | crf weights generated from the training data (not trained) |
| bert+cascade+crf | 5/20 | 98.10 | 96.26 | fewer crf classes, hence the higher f1_token |
| bert+crf+posseg | 13/20 | 97.32 | 96.55 | POS tags added as input |
| bert+global_pointer | 18/20 | —— | 95.66 | —— |
| bert+efficient_global_pointer | 17/20 | —— | 96.55 | —— |
| bert+mrc | 7/20 | —— | 95.75 | —— |
| bert+span | 13/20 | —— | 96.31 | —— |
| bert+tplinker_plus | 20/20 | —— | 95.71 | clearly limited by sequence length |
| uie | 20/20 | —— | 96.57 | zeroshot: f1=60.8, fewshot 100 samples: f1=85.82, 200 samples: f1=86.40 |
| W2NER | 18/20 | 97.37 | 96.32 | high GPU-memory demand |
# 3. Text representation
## 3.1 Unsupervised semantic similarity
- bert pretrained model + unsupervised finetune + cls-position sentence vector (except PromptBert)
- five Chinese datasets + best value over 5 epochs + spearman correlation on valid
- continued finetuning brings small gains on some datasets
- experiments show that dropout_rate strongly affects the results
| solution | ATEC | BQ | LCQMC | PAWSX | STS-B | comment |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Bert-whitening | 26.79 | 31.81| 56.34 | 17.22 | 67.45 | cls+不降维 |
| CT | 30.65 | 44.50| 68.67 | 16.20 | 69.27 | dropout=0.1, slow to converge so trained for 10 epochs |
| CT_In_Batch_Neg | 32.47 | 47.09| 68.56 | 27.50 | 74.00 | dropout=0.1 |
| TSDAE | —— | 46.65| 65.30 | 12.54 | —— | dropout=0.1, —— marks an abnormal metric that was not recorded |
| SimCSE | 33.90 | 50.29| 71.81 | 13.14 | 71.09 | dropout=0.3 |
| ESimCSE | 34.05 | 50.54| 71.58 | 12.53 | 71.27 | dropout=0.3 |
| DiffCSE | 33.04 | 48.17| 71.51 | 12.91 | 71.10 | dropout=0.3, little effect |
| PromptBert | 33.98 | 49.89| 73.18 | 13.30 | 73.42 | dropout=0.3 |
## 3.2 Supervised semantic similarity
- bert pretrained model + finetune on the training data + cls-position sentence vector
- five Chinese datasets + best value over 5 epochs + spearman correlation on valid/test
- the STS-B task is 5-way classification, the others are binary
| solution | ATEC | BQ | LCQMC | PAWSX | STS-B | comment |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| CoSENT |50.61 / 49.81|72.84 / 71.61|77.79 / 78.74|55.00 / 56.00|83.48 / 80.06| |
| ContrastiveLoss |50.02 / 49.19|72.52 / 70.98|77.49 / 78.27|58.21 / 57.65|69.87 / 68.58| STS-B converted to binary |
| InfoNCE |47.77 / 46.99|69.86 / 68.14|71.74 / 74.54|52.82 / 54.21|83.31 / 78.72| STS-B converted to binary |
|concat CrossEntropy|48.71 / 47.62|72.16 / 70.07|78.44 / 78.77|51.46 / 52.28|61.31 / 56.62| STS-B converted to binary |
| CosineMSELoss |46.89 / 45.86|72.27 / 71.35|75.29 / 77.19|54.92 / 54.35|81.64 / 77.76| STS-B labels normalized to 0-1 |
# 4. Relation extraction
- [Baidu relation extraction dataset](http://ai.baidu.com/broad/download?dataset=sked)
| solution | f1 | comment |
| ---- | ---- | ---- |
| CasRel | 81.87 | |
| gplinker | 81.88 | |
| tplinker | 74.49 | seq_len=64, not fully converged |
| tplinker_plus | 79.30 | seq_len=64 |
# 5. Text generation
- [CSL dataset](https://github.com/CLUEbenchmark/CLGE); note this is the version with roughly 10k training samples; dev/test metrics respectively
| solution | Rouge-L | Rouge-1 | Rouge-2 | BLEU | comment |
| ---- | ---- | ---- | ---- | ---- | ---- |
|bert+unlim|63.65 / 63.01|66.25 / 66.34|54.48 / 54.81|44.21 / 44.60| |
| bart |64.62 / 64.99|67.72 / 68.40|56.08 / 57.26|46.15 / 47.67| |
| mt5 |67.67 / 65.98|70.39 / 69.36|59.60 / 59.05|50.34 / 50.11| |
|t5_pegasus|66.07 / 66.11|68.94 / 69.61|57.12 / 58.38|46.14 / 47.95| |
| uer_t5 |63.59 / 63.11|66.56 / 66.48|54.65 / 54.82|44.27 / 44.60| |
## Overview of the examples
### Basic tests
- [basic_test_tokenizer.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/basic/basic_test_tokenizer.py): checks that the tokenizer matches the results of the transformers package.
- [basic_test_parallel_apply.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/basic/basic_test_parallel_apply.py): tests the behaviour of parallel_apply.
- [basic_extract_features.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/basic/basic_extract_features.py): tests BERT's encoding of a sentence.
- [basic_gibbs_sampling_via_mlm.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/basic/basic_gibbs_sampling_via_mlm.py): random text generation with BERT + Gibbs sampling, see [here](https://kexue.fm/archives/8119)
- [basic_language_model_nezha_gen_gpt.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/basic/basic_language_model_nezha_gen_gpt.py): tests the generation quality of [GPTBase (a.k.a. NEZHA-GEN)](https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/NEZHA-Gen-TensorFlow).
- [basic_make_uncased_model_cased.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/basic/basic_make_uncased_model_cased.py): makes an uncased model case-sensitive through a simple vocab modification.
- [basic_language_model_bert.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/basic/basic_language_model_bert.py): tests BERT's MLM head.
- [basic_language_model_ernie.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/basic/basic_language_model_ernie.py): tests the MLM head of Baidu's ERNIE.
- [basic_language_model_GAU_alpha.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/basic/basic_language_model_GAU_alpha.py): tests the MLM head of [GAU-alpha](https://github.com/ZhuiyiTechnology/GAU-alpha).
- [basic_language_model_roformer.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/basic/basic_language_model_roformer.py): tests roformer's MLM head.
- [basic_language_model_CDial_GPT.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/basic/basic_language_model_CDial_GPT.py): tests dialogue generation with [CDial_GPT](https://github.com/thu-coai/CDial-GPT).
- [basic_language_model_gpt2_ml.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/basic/basic_language_model_gpt2_ml.py): tests generation with [gpt2-ml](https://github.com/imcaspar/gpt2-ml).
- [basic_language_model_cpm_lm.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/basic/basic_language_model_cpm_lm.py): tests generation with [CPM-Generate](https://github.com/TsinghuaAI/CPM-Generate).
- [basic_language_model_uer_t5.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/basic/basic_language_model_uer_t5.py): tests generation with [uer-t5-small](https://huggingface.co/uer/t5-small-chinese-cluecorpussmall) and [uer-t5-base](https://huggingface.co/uer/t5-base-chinese-cluecorpussmall).
- [basic_language_model_t5_pegasus.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/basic/basic_language_model_t5_pegasus.py): tests generation with [t5_pegasus](https://github.com/ZhuiyiTechnology/t5-pegasus).
- [basic_language_model_simbert.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/basic/basic_language_model_simbert.py): tests generation and sentence-similarity quality of [simbert](https://github.com/ZhuiyiTechnology/simbert) and [roformer-sim](https://github.com/ZhuiyiTechnology/roformer-sim).
- [basic_language_model_transformer_xl.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/basic/basic_language_model_transformer_xl.py): tests the transformer_xl model, somewhat simplified; only an English pretrained model is available.
- [basic_language_model_xlnet.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/basic/basic_language_model_xlnet.py): tests the xlnet model.
- [basic_language_model_nezha_gpt_dialog.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/basic/basic_language_model_nezha_gpt_dialog.py): tests [nezha_gpt_dialog](https://kexue.fm/archives/7718).
- [basic_language_model_bart.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/basic/basic_language_model_bart.py): tests the bart model.
### Text classification
- [task_sentence_similarity_lcqmc.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_classfication/task_sentence_similarity_lcqmc.py): sentence-pair classification task.
- [task_sentiment_classification.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_classfication/task_sentiment_classification.py): sentiment classification, plain text classification with bert
- [task_sentiment_classification_albert.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_classfication/task_sentiment_classification_albert.py): sentiment classification with the ALBERT model.
- [task_sentiment_classification_xlnet.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_classfication/task_sentiment_classification_xlnet.py): sentiment classification with the XLNET model.
- [task_sentiment_classification_hierarchical_position.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_classfication/task_sentiment_classification_hierarchical_position.py): sentiment classification, hierarchically decomposed position encodings to initialize long-text position embeddings
- [task_sentiment_classification_nezha.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_classfication/task_sentiment_classification_nezha.py): sentiment classification with the nezha model
- [task_sentiment_classification_roformer.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_classfication/task_sentiment_classification_roformer.py): sentiment classification with roformer weights
- [task_sentiment_classification_roformer_v2.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_classfication/task_sentiment_classification_roformer_v2.py): sentiment classification with roformer_v2 weights
- [task_sentiment_classification_electra.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_classfication/task_sentiment_classification_electra.py): sentiment classification with electra weights
- [task_sentiment_classification_GAU_alpha.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_classfication/task_sentiment_classification_GAU_alpha.py): sentiment classification with GAU-alpha weights
- [task_sentiment_classification_PET.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_classfication/task_sentiment_classification_PET.py): sentiment classification with [Pattern-Exploiting-Training](https://github.com/bojone/Pattern-Exploiting-Training), [bert4keras example](https://github.com/bojone/Pattern-Exploiting-Training)
- [task_sentiment_classification_P_tuning.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_classfication/task_sentiment_classification_P_tuning.py): sentiment classification with [P-tuning](https://github.com/THUDM/P-tuning), [bert4keras example](https://github.com/bojone/P-tuning)
- [Sohu_2022_ABSA](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_classfication/Sohu_2022_ABSA): reproduction of the Top-1 solution of the Sohu 2022 entity sentiment classification contest, plus our own baseline
- [Tianchi_News_Classification](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_classfication/Tianchi_News_Classification): reproduction of the Top-1 solution of the Tianchi beginner NLP news classification contest
### Sequence labeling
- [task_sequence_labeling_ner_efficient_global_pointer.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sequence_labeling/task_sequence_labeling_ner_efficient_global_pointer.py): NER example, pytorch implementation of efficient_global_pointer
- [task_sequence_labeling_ner_global_pointer.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sequence_labeling/task_sequence_labeling_ner_global_pointer.py): NER example, pytorch implementation of global_pointer
- [task_sequence_labeling_ner_crf.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sequence_labeling/task_sequence_labeling_ner_crf.py): NER example, bert+crf
- [task_sequence_labeling_ner_crf_freeze.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sequence_labeling/task_sequence_labeling_ner_crf_freeze.py): NER example, bert+crf where the crf weights are either generated from the dataset (frozen) or used only for initialization
- [task_sequence_labeling_ner_cascade_crf.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sequence_labeling/task_sequence_labeling_ner_cascade_crf.py): NER example, bert+crf+cascade
- [task_sequence_labeling_ner_crf_add_posseg.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sequence_labeling/task_sequence_labeling_ner_crf_add_posseg.py): NER example, bert+crf with POS tags as extra input
- [task_sequence_labeling_ner_tplinker_plus.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sequence_labeling/task_sequence_labeling_ner_tplinker_plus.py): NER example, adapted from the relation-extraction model [TPLinker](https://github.com/131250208/TPlinker-joint-extraction)
- [task_sequence_labeling_ner_mrc.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sequence_labeling/task_sequence_labeling_ner_mrc.py): NER example, [mrc approach](https://github.com/z814081807/DeepNER), framed as reading comprehension
- [task_sequence_labeling_ner_span.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sequence_labeling/task_sequence_labeling_ner_span.py): NER example, [span approach](https://github.com/z814081807/DeepNER), half-pointer half-tagging scheme
- [uie](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sequence_labeling/uie): NER example, [uie approach](https://github.com/universal-ie/UIE), prompt+mrc model architecture
- [task_sequence_labeling_ner_W2NER.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sequence_labeling/task_sequence_labeling_ner_W2NER.py): NER example, [W2NER](https://github.com/ljynlp/W2NER)
### Text representation
- [task_sentence_embedding_unsup_bert_whitening.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_embedding/task_sentence_embedding_unsup_bert_whitening.py): see [bert_whitening](https://github.com/bojone/BERT-whitening)
- [task_sentence_embedding_unsup_CT.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_embedding/task_sentence_embedding_unsup_CT.py): see [SentenceTransformer](https://www.sbert.net/index.html)
- [task_sentence_embedding_unsup_CT_In-Batch_Negatives.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_embedding/task_sentence_embedding_unsup_CT_In-Batch_Negatives.py): see [SentenceTransformer](https://www.sbert.net/index.html)
- [task_sentence_embedding_unsup_SimCSE.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_embedding/task_sentence_embedding_unsup_SimCSE.py): see [SentenceTransformer](https://www.sbert.net/index.html) and the [Chinese evaluation on kexue.fm](https://kexue.fm/archives/8348)
- [task_sentence_embedding_unsup_ESimCSE.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_embedding/task_sentence_embedding_unsup_ESimCSE.py): see the [ESimCSE paper](https://arxiv.org/pdf/2109.04380.pdf) and a [third-party implementation](https://github.com/shuxinyin/SimCSE-Pytorch)
- [task_sentence_embedding_unsup_TSDAE.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_embedding/task_sentence_embedding_unsup_TSDAE.py): see [SentenceTransformer](https://www.sbert.net/index.html)
- [task_sentence_embedding_unsup_PromptBert.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_embedding/task_sentence_embedding_unsup_PromptBert.py): the [PromptBert](https://github.com/kongds/Prompt-BERT) approach
- [task_sentence_embedding_unsup_DiffCSE.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_embedding/task_sentence_embedding_unsup_DiffCSE.py): [DiffCSE](https://github.com/voidism/DiffCSE)
- [task_sentence_embedding_sup_ContrastiveLoss.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_embedding/task_sentence_embedding_sup_ContrastiveLoss.py): see [SentenceTransformer](https://www.sbert.net/index.html)
- [task_sentence_embedding_sup_CosineMSELoss.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_embedding/task_sentence_embedding_sup_CosineMSELoss.py): see [SentenceTransformer](https://www.sbert.net/index.html)
- [task_sentence_embedding_sup_concat_CrossEntropyLoss.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_embedding/task_sentence_embedding_sup_concat_CrossEntropyLoss.py): see [SentenceTransformer](https://www.sbert.net/index.html)
- [task_sentence_embedding_sup_MultiNegtiveRankingLoss.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_embedding/task_sentence_embedding_sup_MultiNegtiveRankingLoss.py): see [SentenceTransformer](https://www.sbert.net/index.html)
- [task_sentence_embedding_sup_CoSENT.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_embedding/task_sentence_embedding_sup_CoSENT.py): see [CoSENT](https://kexue.fm/archives/8847)
- [task_sentence_embedding_DimensionalityReduction.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_embedding/task_sentence_embedding_DimensionalityReduction.py): see [SentenceTransformer](https://www.sbert.net/index.html)
- [task_sentence_embedding_model_distillation.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_embedding/task_sentence_embedding_model_distillation.py): see [SentenceTransformer](https://www.sbert.net/index.html)
- [FinanceFAQ](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_embedding/FinanceFAQ): two-stage (recall + ranking) FAQ pipeline for the finance domain
### Relation extraction
- [task_relation_extraction_CasRel.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/relation_extraction/task_relation_extraction_CasRel.py): [relation extraction](https://kexue.fm/archives/7161) combining BERT with a custom half-pointer half-tagging structure
- [task_relation_extraction_gplinker.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/relation_extraction/task_relation_extraction_gplinker.py): relation extraction with GlobalPointer, [GPLinker](https://kexue.fm/archives/8888)
- [task_relation_extraction_tplinker.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/relation_extraction/task_relation_extraction_tplinker.py): tplinker relation extraction, [TPLinker](https://github.com/131250208/TPlinker-joint-extraction)
- [task_relation_extraction_tplinker_plus.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/relation_extraction/task_relation_extraction_tplinker_plus.py): tplinker relation extraction, [TPLinkerPlus](https://github.com/131250208/TPlinker-joint-extraction)
### Text generation
- [task_seq2seq_autotitle_unilm.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/seq2seq/task_seq2seq_autotitle_unilm.py): news headline generation with a [UniLM](https://kexue.fm/archives/6933)-style Seq2Seq model.
- [task_seq2seq_autotitle_csl_bart.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/seq2seq/task_seq2seq_autotitle_csl_bart.py): news headline generation with BART
- [task_seq2seq_autotitle_csl_uer_t5.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/seq2seq/task_seq2seq_autotitle_csl_uer_t5.py): news headline generation with T5, using [uer-t5-small](https://huggingface.co/uer/t5-small-chinese-cluecorpussmall)
- [task_seq2seq_autotitle_csl.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/seq2seq/task_seq2seq_autotitle_csl.py): paper title generation with a [UniLM](https://kexue.fm/archives/6933)-style Seq2Seq model.
- [task_seq2seq_autotitle_csl_mt5.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/seq2seq/task_seq2seq_autotitle_csl_mt5.py): paper title generation with the [google mt5](https://huggingface.co/google/mt5-base) Seq2Seq model.
- [task_question_answer_generation_by_seq2seq.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/seq2seq/task_question_answer_generation_by_seq2seq.py): [automatic construction of QA pairs](https://kexue.fm/archives/7630) with a [UniLM](https://kexue.fm/archives/6933)-style Seq2Seq model; autoregressive text generation.
- [task_reading_comprehension_by_mlm.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/seq2seq/task_reading_comprehension_by_mlm.py): [reading-comprehension QA](https://kexue.fm/archives/7148) with an MLM model; simple non-autoregressive text generation.
- [task_reading_comprehension_by_seq2seq.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/seq2seq/task_reading_comprehension_by_seq2seq.py): [reading-comprehension QA](https://kexue.fm/archives/7115) with a [UniLM](https://kexue.fm/archives/6933)-style Seq2Seq model; autoregressive text generation.
- [task_seq2seq_simbert.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/seq2seq/task_seq2seq_simbert.py): similar-question generation for data augmentation, see [SimBERT](https://kexue.fm/archives/7427)
- [task_seq2seq_ape210k_math_word_problem.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/seq2seq/task_seq2seq_ape210k_math_word_problem.py): bert+unilm takes on primary-school math word problems, see the [blog post](https://kexue.fm/archives/7809)
- [task_kgclue_seq2seq.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/seq2seq/task_kgclue_seq2seq.py): seq2seq + prefix tree, see the [blog post](https://kexue.fm/archives/8802)
### Training tricks
- [task_sentiment_adversarial_training.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/training_trick/task_sentiment_adversarial_training.py): improves classification with adversarial training, virtual adversarial training, gradient penalty and similar measures.
- [task_sentiment_virtual_adversarial_training.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/training_trick/task_sentiment_virtual_adversarial_training.py): improves classification with semi-supervised virtual adversarial training.
- [task_sentiment_UDA.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/training_trick/task_sentiment_UDA.py): improves classification with [UDA](https://arxiv.org/abs/1904.12848) semi-supervised learning, adding a consistency loss on top of the original loss.
- [task_sentiment_mixup.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/training_trick/task_sentiment_mixup.py): improves generalization with [Mixup](https://github.com/vikasverma1077/manifold_mixup).
- [task_sentiment_exponential_moving_average.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/training_trick/task_sentiment_exponential_moving_average.py): EMA (exponential moving average)
- [task_sentiment_exponential_moving_average_warmup.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/training_trick/task_sentiment_exponential_moving_average_warmup.py): the EMA and warmup strategies combined
- [task_sentiment_TemporalEnsembling.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/training_trick/task_sentiment_TemporalEnsembling.py): improves generalization following the [official TemporalEnsembling project](https://github.com/s-laine/tempens) and a [third-party pytorch implementation](https://github.com/ferretj/temporal-ensembling).
- [task_sentiment_R-Drop.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/training_trick/task_sentiment_R-Drop.py): improves classification with [R-Drop](https://github.com/dropreg/R-Drop), which can be viewed as UDA with dropout as the noise source.
- [task_amp.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/training_trick/task_amp.py): mixed-precision training with Pytorch amp
- [task_data_parallel.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/training_trick/task_data_parallel.py): multi-GPU training in DataParallel mode
- [task_distributed_data_parallel.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/training_trick/task_distributed_data_parallel.py): multi-GPU training in DistributedDataParallel mode
### Pretraining
- [roberta_pretrain](https://github.com/Tongjilibo/bert4torch/blob/master/examples/pretrain/roberta_pretrain): roberta MLM pretraining, with data-generation and training code
- [simbert_v2_pretrain](https://github.com/Tongjilibo/bert4torch/blob/master/examples/pretrain/simbert_v2_pretrain): similar-question generation for data augmentation in three stages: 1-[weak supervision](https://github.com/Tongjilibo/bert4torch/blob/master/examples/pretrain/simbert_v2_pretrain/simbert_v2_stage1.py), 2-[distillation](https://github.com/Tongjilibo/bert4torch/blob/master/examples/pretrain/simbert_v2_pretrain/simbert_v2_stage2.py), 3-[supervised](https://github.com/Tongjilibo/bert4torch/blob/master/examples/pretrain/simbert_v2_pretrain/simbert_v2_supervised.py), see [SimBERT-V2](https://kexue.fm/archives/8454)
### Model deployment
- [basic_simple_web_serving_simbert.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/serving/basic_simple_web_serving_simbert.py): tests the built-in WebServing (exposes a model as a web endpoint).
- [task_bert_cls_onnx.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/serving/task_bert_cls_onnx.py): exporting bert weights to ONNX
- [task_bert_cls_onnx_tensorrt.md](https://github.com/Tongjilibo/bert4torch/blob/master/examples/serving/task_bert_cls_onnx_tensorrt.md): ONNX + TensorRT deployment
- [sanic_server](https://github.com/Tongjilibo/bert4torch/blob/master/examples/serving/sanic_server): sanic + onnx deployment
### Others
- [task_conditional_language_model.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/others/task_conditional_language_model.py): a conditional language model built from BERT + [Conditional Layer Normalization](https://kexue.fm/archives/7124) (a minimal sketch of the layer follows this list).
- [task_language_model.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/others/task_language_model.py): load pretrained BERT weights as an unconditional language model, effectively equivalent to GPT.
- [task_iflytek_bert_of_theseus.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/others/task_iflytek_bert_of_theseus.py): model compression via [BERT-of-Theseus](https://kexue.fm/archives/7575).
- [task_language_model_chinese_chess.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/others/task_language_model_chinese_chess.py): play Chinese chess GPT-style; see the [blog post](https://kexue.fm/archives/7877) for details.
- [task_nl2sql_baseline.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/others/task_nl2sql_baseline.py): [a baseline for Zhuiyi Technology's 2019 NL2SQL challenge](https://kexue.fm/archives/6771).
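The trick behind Conditional Layer Normalization is to let a condition vector produce per-sample offsets for LayerNorm's gain and bias, so the same encoder can be steered by an external signal. A minimal illustrative sketch, not the layer as implemented in bert4torch (zero-initializing the two projections makes it start out as a plain LayerNorm, which matches the recipe in the linked blog):

```python
import torch
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    """Sketch: a condition vector modulates LayerNorm's gain and bias per sample."""
    def __init__(self, hidden_size: int, cond_size: int):
        super().__init__()
        self.ln = nn.LayerNorm(hidden_size, elementwise_affine=False)
        self.gamma = nn.Linear(cond_size, hidden_size, bias=False)
        self.beta = nn.Linear(cond_size, hidden_size, bias=False)
        nn.init.zeros_(self.gamma.weight)   # zero init: behaves as plain LayerNorm at first
        nn.init.zeros_(self.beta.weight)

    def forward(self, x, cond):
        # x: (batch, seq, hidden); cond: (batch, cond_size)
        gain = 1 + self.gamma(cond).unsqueeze(1)
        bias = self.beta(cond).unsqueeze(1)
        return gain * self.ln(x) + bias

x, cond = torch.randn(2, 10, 768), torch.randn(2, 128)
print(ConditionalLayerNorm(768, 128)(x, cond).shape)  # torch.Size([2, 10, 768])
```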
### Tutorials
- [Tutorials](https://github.com/Tongjilibo/bert4torch/blob/master/examples/tutorials/Tutorials): tutorial documentation.
- [tutorials_custom_fit_progress.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/tutorials/tutorials_custom_fit_progress.py): customize the fit training loop (with an integrated progress bar), useful for advanced needs such as mixed precision and gradient clipping (a toy sketch follows this list).
- [tutorials_load_transformers_model.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/tutorials/tutorials_load_transformers_model.py): load a model from the transformers package and still use bert4torch tricks such as adversarial training.
- [tutorials_small_tips.py](https://github.com/Tongjilibo/bert4torch/blob/master/examples/tutorials/tutorials_small_tips.py): a collection of common tips.
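To make the "advanced needs" above concrete, here is one training step combining PyTorch AMP with gradient clipping. This is a self-contained toy on a dummy linear model (it assumes a CUDA device), not the tutorial's own code; note that gradients must be unscaled before clipping:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 2).cuda()                 # toy stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

for _ in range(3):                              # a few dummy steps
    x = torch.randn(8, 16, device='cuda')
    y = torch.randint(0, 2, (8,), device='cuda')
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():             # forward in mixed precision
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                  # unscale before clipping
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
```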
## Datasets used
| Dataset | Task | Download |
| ---- | ---- | ---- |
| People's Daily | NER | [china-people-daily-ner-corpus](http://s3.bmio.net/kashgari/china-people-daily-ner-corpus.tar.gz) |
| Baidu relation extraction | relation extraction | [BD_Knowledge_Extraction](http://ai.baidu.com/broad/download?dataset=sked) |
| Sentiment | sentiment classification | [Sentiment](https://github.com/bojone/bert4keras/blob/master/examples/datasets/sentiment.zip) |
| THUCNews | text classification, text generation | [THUCNews](http://thuctc.thunlp.org/#%E4%B8%AD%E6%96%87%E6%96%87%E6%9C%AC%E5%88%86%E7%B1%BB%E6%95%B0%E6%8D%AE%E9%9B%86THUCNews) |
| ATEC | text similarity | [ATEC](https://github.com/IceFlameWorm/NLP_Datasets/tree/master/ATEC) |
| BQ | text similarity | [BQ](http://icrc.hitsz.edu.cn/info/1037/1162.htm) |
| LCQMC | text similarity | [LCQMC](http://icrc.hitsz.edu.cn/Article/show/171.html) |
| PAWSX | text similarity | [PAWSX](https://arxiv.org/abs/1908.11828) |
| STS-B | text similarity | [STS-B](https://github.com/pluto-junzeng/CNSD) |
| CSL | text generation | [CSL](https://github.com/CLUEbenchmark/CLGE) |
#! -*- coding: utf-8 -*-
# Sanity check: feature extraction
import torch
from bert4torch.models import build_transformer_model
from bert4torch.tokenizers import Tokenizer
root_model_path = "F:/Projects/pretrain_ckpt/bert/[google_tf_base]--chinese_L-12_H-768_A-12"
vocab_path = root_model_path + "/vocab.txt"
config_path = root_model_path + "/bert_config.json"
checkpoint_path = root_model_path + '/pytorch_model.bin'
tokenizer = Tokenizer(vocab_path, do_lower_case=True)  # build the tokenizer
model = build_transformer_model(config_path, checkpoint_path)  # build the model and load weights
# encoding test
token_ids, segment_ids = tokenizer.encode(u'语言模型')
token_ids, segment_ids = torch.tensor([token_ids]), torch.tensor([segment_ids])
print('\n ===== predicting =====\n')
model.eval()
with torch.no_grad():
print(model([token_ids, segment_ids])[0])
"""
Output:
[[[-0.63251007 0.2030236 0.07936534 ... 0.49122632 -0.20493352
0.2575253 ]
[-0.7588351 0.09651865 1.0718756 ... -0.6109694 0.04312154
0.03881441]
[ 0.5477043 -0.792117 0.44435206 ... 0.42449304 0.41105673
0.08222899]
[-0.2924238 0.6052722 0.49968526 ... 0.8604137 -0.6533166
0.5369075 ]
[-0.7473459 0.49431565 0.7185162 ... 0.3848612 -0.74090636
0.39056838]
[-0.8741375 -0.21650358 1.338839 ... 0.5816864 -0.4373226
0.56181806]]]
"""
#! -*- coding: utf-8 -*-
# Sanity check: Gibbs sampling with an MLM
from tqdm import tqdm
import numpy as np
from bert4torch.models import build_transformer_model
from bert4torch.tokenizers import Tokenizer
import torch
import torch.nn as nn
root_model_path = "F:/Projects/pretrain_ckpt/bert/[google_tf_base]--chinese_L-12_H-768_A-12"
vocab_path = root_model_path + "/vocab.txt"
config_path = root_model_path + "/bert_config.json"
checkpoint_path = root_model_path + '/pytorch_model.bin'
tokenizer = Tokenizer(vocab_path, do_lower_case=True)  # build the tokenizer
model = build_transformer_model(
    config_path=config_path, checkpoint_path=checkpoint_path, with_mlm='softmax'
)  # build the model and load weights
sentences = []
init_sent = u'科学技术是第一生产力。'  # seed sentence; set to None to start from all [MASK]
minlen, maxlen = 8, 32
steps = 10000
converged_steps = 1000
vocab_size = tokenizer._vocab_size
if init_sent is None:
length = np.random.randint(minlen, maxlen + 1)
tokens = ['[CLS]'] + ['[MASK]'] * length + ['[SEP]']
token_ids = tokenizer.tokens_to_ids(tokens)
segment_ids = [0] * len(token_ids)
else:
token_ids, segment_ids = tokenizer.encode(init_sent)
length = len(token_ids) - 2
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)
model.eval()
with torch.no_grad():
for _ in tqdm(range(steps), desc='Sampling'):
        # Gibbs sampling step: randomly mask one token, then resample it from the MLM's conditional distribution.
i = np.random.choice(length) + 1
token_ids[i] = tokenizer._token_mask_id
token_ids_tensor, segment_ids_tensor = torch.tensor([token_ids], device=device), torch.tensor([segment_ids], device=device)
_, probas = model([token_ids_tensor, segment_ids_tensor])
probas = probas[0, i]
token = np.random.choice(vocab_size, p=probas.cpu().numpy())
token_ids[i] = token
sentences.append(tokenizer.decode(token_ids))
print('A few random samples (after burn-in):')
for _ in range(10):
print(np.random.choice(sentences[converged_steps:]))
#! -*- coding: utf-8 -*-
# Basic test: Chinese GPT model, base size, CDial-GPT variant
# Project: https://github.com/thu-coai/CDial-GPT
# Reference: https://github.com/bojone/CDial-GPT-tf
# Weights must be converted before loading; the conversion script is in the convert_script folder
import torch
from bert4torch.models import build_transformer_model
from bert4torch.tokenizers import Tokenizer
from bert4torch.snippets import AutoRegressiveDecoder
config_path = 'F:/Projects/pretrain_ckpt/gpt/[thu-coai_torch_base]--CDial-GPT-LCCC-base/bert4torch_config.json'
checkpoint_path = 'F:/Projects/pretrain_ckpt/gpt/[thu-coai_torch_base]--CDial-GPT-LCCC-base/bert4torch_pytorch_model.bin'
dict_path = 'F:/Projects/pretrain_ckpt/gpt/[thu-coai_torch_base]--CDial-GPT-LCCC-base/bert4torch_vocab.txt'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = Tokenizer(dict_path, do_lower_case=True)  # build the tokenizer
speakers = [tokenizer.token_to_id('[speaker1]'), tokenizer.token_to_id('[speaker2]')]
# shared_segment_embeddings=True is set in the config, so segment embeddings are generated from the word embedding weights
model = build_transformer_model(
config_path=config_path,
checkpoint_path=checkpoint_path,
model='gpt',
).to(device)  # build the model and load weights
class ChatBot(AutoRegressiveDecoder):
"""基于随机采样的闲聊回复
"""
@AutoRegressiveDecoder.wraps(default_rtype='probas')
def predict(self, inputs, output_ids, states):
token_ids, segment_ids = inputs
curr_segment_ids = torch.zeros_like(output_ids) + token_ids[0, -1]
token_ids = torch.cat([token_ids, output_ids], 1)
segment_ids = torch.cat([segment_ids, curr_segment_ids], 1)
logits = model.predict([token_ids, segment_ids])
return logits[:, -1, :]
def response(self, texts, n=1, topk=5):
token_ids = [tokenizer._token_start_id, speakers[0]]
segment_ids = [tokenizer._token_start_id, speakers[0]]
for i, text in enumerate(texts):
ids = tokenizer.encode(text)[0][1:-1] + [speakers[(i + 1) % 2]]
token_ids.extend(ids)
segment_ids.extend([speakers[i % 2]] * len(ids))
segment_ids[-1] = speakers[(i + 1) % 2]
        results = self.random_sample([token_ids, segment_ids], n, topk)  # random sampling
return tokenizer.decode(results[0].cpu().numpy())
chatbot = ChatBot(start_id=None, end_id=tokenizer._token_end_id, maxlen=32, device=device)
print(chatbot.response([u'别爱我没结果', u'你这样会失去我的', u'失去了又能怎样']))
"""
Replies are random, e.g.: 你还有我 | 那就不要爱我 | 你是不是傻 | and so on.
"""
#! -*- coding: utf-8 -*-
# Basic test: MLM prediction with GAU-alpha, verified to match the bert4keras version
# In testing, it performs noticeably better on medium and long text than on short text
# Blog: https://kexue.fm/archives/9052
# Weight-conversion script: ./convert_script/convert_GAU_alpha.py
from bert4torch.models import build_transformer_model
from bert4torch.tokenizers import Tokenizer
import torch
# load the model; replace the paths with your own
config_path = 'F:/Projects/pretrain_ckpt/gau/[sushen-torch]--chinese_GAU-alpha-char_L-24_H-768/bert_config.json'
checkpoint_path = 'F:/Projects/pretrain_ckpt/gau/[sushen-torch]--chinese_GAU-alpha-char_L-24_H-768/pytorch_model.bin'
dict_path = 'F:/Projects/pretrain_ckpt/gau/[sushen-torch]--chinese_GAU-alpha-char_L-24_H-768/vocab.txt'
# build the tokenizer
tokenizer = Tokenizer(dict_path, do_lower_case=True)
model = build_transformer_model(config_path, checkpoint_path, model='gau_alpha', with_mlm='softmax')  # build the model and load weights
token_ids, segments_ids = tokenizer.encode("近期正是上市公司财报密集披露的时间,但有多家龙头公司的业绩令投资者失望")
token_ids[5] = token_ids[6] = tokenizer._token_mask_id
print(''.join(tokenizer.ids_to_tokens(token_ids)))
tokens_ids_tensor = torch.tensor([token_ids])
segment_ids_tensor = torch.tensor([segments_ids])
# note: the with_mlm argument must be passed when building the model (done above)
model.eval()
with torch.no_grad():
_, probas = model([tokens_ids_tensor, segment_ids_tensor])
result = torch.argmax(probas[0, 5:7], dim=-1).numpy()
print(tokenizer.decode(result))
# Test the prediction quality of the BART language model
# bert4torch needs converted weights; see the convert folder
from transformers import BertTokenizer, BartForConditionalGeneration
tokenizer = BertTokenizer.from_pretrained("F:/Projects/pretrain_ckpt/bart/[FudanNLP_torch_base]/")
model = BartForConditionalGeneration.from_pretrained("F:/Projects/pretrain_ckpt/bart/[FudanNLP_torch_base]/")
input_ids = tokenizer.encode("北京是[MASK]的首都", return_tensors='pt')
pred_ids = model.generate(input_ids, num_beams=4, max_length=20)
print('transformers output: ', tokenizer.convert_ids_to_tokens(pred_ids[0]))
# Output: ['[SEP]', '[CLS]', '北', '京', '是', '中', '国', '的', '首', '都', '[SEP]']
from bert4torch.models import build_transformer_model
from bert4torch.tokenizers import Tokenizer
from bert4torch.snippets import AutoRegressiveDecoder
import torch
# bert config
config_path = 'F:/Projects/pretrain_ckpt/bart/[FudanNLP_torch_base]/bert4torch_config.json'
checkpoint_path = 'F:/Projects/pretrain_ckpt/bart/[FudanNLP_torch_base]/bert4torch_pytorch_model.bin'
dict_path = 'F:/Projects/pretrain_ckpt/bart/[FudanNLP_torch_base]/vocab.txt'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = Tokenizer(dict_path, do_lower_case=True)
model = build_transformer_model(config_path, checkpoint_path, model='bart', segment_vocab_size=0).to(device)
class AutoTitle(AutoRegressiveDecoder):
"""seq2seq解码器
"""
@AutoRegressiveDecoder.wraps(default_rtype='logits')
def predict(self, inputs, output_ids, states):
        return model.decoder.predict([output_ids] + inputs)[-1][:, -1, :]  # keep only the last position
def generate(self, text, topk=4):
token_ids, _ = tokenizer.encode(text, maxlen=128)
token_ids = torch.tensor([token_ids], device=device)
encoder_output = model.encoder.predict([token_ids])
        output_ids = self.beam_search(encoder_output, topk=topk)  # beam search
return tokenizer.decode(output_ids.cpu().numpy())
autotitle = AutoTitle(start_id=102, end_id=tokenizer._token_end_id, maxlen=32, device=device)
print('bert4torch output: ', autotitle.generate("北京是[MASK]的首都"))