"tests/python/pytorch/sparse/test_sddmm.py" did not exist on "56ce60b06107c281b6d3e425c96005f009d57ef4"
Commit 66a1d0d0 authored by yangzhong's avatar yangzhong
Browse files

提交初版bert4torch project

parents
Pipeline #519 canceled with stages
MIT License
Copyright (c) 2022 Bo仔很忙
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# bert4torch
## Model introduction
bert4torch is a PyTorch-based training framework. Its initial goal is to reproduce the main features of bert4keras, making it easy to load many kinds of pre-trained models for fine-tuning; the code carries Chinese comments to help users understand the model structure.
## Model structure
The main model is the most important component of BERT. BERT is first pre-trained: a dedicated module is attached after the main model to compute the pre-training loss, and pre-training yields the parameters of the main model. When moving to a downstream task, a task-specific module is attached after the main model instead; the main model is initialized with the pre-trained parameters, the downstream module is randomly initialized, and the whole model is then fine-tuned. (Note: during fine-tuning the parameters of both the main model and the downstream module are usually updated, although one part can be frozen while the other is tuned.)
The main model consists of three parts: the **embedding layer**, the **encoder**, and the **pooling layer**, as shown below:
![img](https://images.cnblogs.com/cnblogs_com/wangzb96/1789835/o_200618140451BERT%E4%B9%8B%E4%B8%BB%E6%A8%A1%E5%9E%8B.png)
Where:
- Input: mini-batches of `batch_size` sequences (sentences or sentence pairs), each sequence made up of discrete token ids.
- Embedding layer: converts the input sequences into continuous distributed representations, i.e. word embeddings (word vectors).
- Encoder: computes a non-linear representation for each sequence.
- Pooling layer: takes the representation of the `[CLS]` token as the representation of the whole sequence.
- Output: the encoder's last-layer representations (one per token) and the pooling layer's representation (one per sequence); a minimal loading sketch follows this list.
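To make these parts concrete, here is a minimal loading sketch. It assumes the bert-base-chinese files from the dataset section below have already been downloaded; the paths are placeholders, and `with_pool=True` asks the model for both outputs described above.
```python
import torch
from bert4torch.models import build_transformer_model
from bert4torch.tokenizers import Tokenizer

config_path = '/datasets/bert-base-chinese/config.json'           # placeholder paths
checkpoint_path = '/datasets/bert-base-chinese/pytorch_model.bin'
dict_path = '/datasets/bert-base-chinese/vocab.txt'

tokenizer = Tokenizer(dict_path, do_lower_case=True)
model = build_transformer_model(config_path, checkpoint_path, with_pool=True)

token_ids, segment_ids = tokenizer.encode('今天天气不错')
hidden_states, pooled_output = model([torch.tensor([token_ids]), torch.tensor([segment_ids])])
# hidden_states: per-token representations from the encoder's last layer
# pooled_output: the [CLS]-based representation of the whole sequence
```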
## Environment setup
### Docker
A Docker image can be pulled from the SourceFind (光源) registry as follows:
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.13.1-centos7.6-dtk-23.04-py37-latest
```
Install the dependencies and bert4torch:
```
pip install -r requirements.txt
cd bert4torch
python3 setup.py install
```
## Dataset and pre-trained model
Dataset download: https://s3.bmio.net/kashgari/china-people-daily-ner-corpus.tar.gz. Place the People's Daily NER corpus under the /datasets/bert-base-chinese directory and extract it there.
Pre-trained model download: https://huggingface.co/bert-base-chinese/tree/main. Download all the files into the /datasets/bert-base-chinese directory.
The resulting training data directory layout is:
```
dataset
└── bert-base-chinese
    ├── china-people-daily-ner-corpus
    │   ├── example.dev
    │   ├── example.test
    │   └── example.train
    ├── config.json
    ├── flax_model.msgpack
    ├── pytorch_model.bin
    └── vocab.txt
```
## Training
### Modify the configuration
```
cd examples/sequence_labeling/
# Training scripts whose configuration needs editing:
#   crf.py      # single-GPU training script
#   crf_ddp.py  # multi-GPU training script; uses torch DDP on top of the single-GPU code
# Only the configuration paths need changing: config_path, checkpoint_path, dict_path,
# train_dataloader, valid_dataloader; adjust batch_size as needed.
# Note: to test fp16, add use_amp=True to the model.compile() call in crf.py and crf_ddp.py.
```
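The fp16 switch mentioned in the note above amounts to one extra argument to `model.compile()`. A minimal sketch (the loss and optimizer shown are placeholders; keep whatever crf.py already defines):
```python
import torch.nn as nn
import torch.optim as optim

# enable automatic mixed precision by passing use_amp=True, as described in the note above
model.compile(
    loss=nn.CrossEntropyLoss(),                          # placeholder; keep the loss already used in crf.py
    optimizer=optim.Adam(model.parameters(), lr=2e-5),   # placeholder; keep the optimizer already used in crf.py
    use_amp=True,
)
```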
### Single node, single GPU
```
cd examples/sequence_labeling/
./single_train.sh
```
### Single node, multiple GPUs
```
cd examples/sequence_labeling/
./multi_train.sh
```
## Accuracy results
| GPUs | Type | batch_size | f1 | p | r |
| ---- | ---- | ---------- | ------ | ------ | ------ |
| 1 | fp32 | 64 | 0.9592 | 0.9643 | 0.9617 |
| 1 | fp16 | 64 | 0.9559 | 0.9596 | 0.9545 |
| 4 | fp32 | 256 | 0.9459 | 0.9398 | 0.9521 |
| 4 | fp16 | 256 | 0.9438 | 0.9398 | 0.9505 |
## Source repository and issue reporting
- https://developer.hpccube.com/codes/modelzoo/bert4torch
## References
- https://github.com/Tongjilibo/bert4torch
# bert4torch usage tutorial
## 1. End-to-end modeling example
```python
# build the tokenizer
tokenizer = Tokenizer(dict_path, do_lower_case=True)

# load the dataset; you can also subclass Dataset yourself
class MyDataset(ListDataset):
    @staticmethod
    def load_data(filenames):
        """Read the text files and arrange them into the required format."""
        D = []
        return D

def collate_fn(batch):
    '''Turn the batch produced by load_data above into tensors on the target device.
    Note: the return value is split into features and labels; the features may be a list or tuple.
    '''
    batch_token_ids, batch_segment_ids, batch_labels = [], [], []
    return [batch_token_ids, batch_segment_ids], batch_labels.flatten()

# build the dataloader
train_dataloader = DataLoader(MyDataset('file_path'), batch_size=batch_size, shuffle=True, collate_fn=collate_fn)

# define the model on top of BERT; binary text classification as an example
class Model(BaseModel):
    def __init__(self) -> None:
        super().__init__()
        self.bert = build_transformer_model(config_path, checkpoint_path, with_pool=True)
        self.dropout = nn.Dropout(0.1)
        self.dense = nn.Linear(768, 2)

    def forward(self, token_ids, segment_ids):
        # the model returned by build_transformer_model only accepts a list/tuple of inputs,
        # so wrap a single input as [token_ids]
        hidden_states, pooled_output = self.bert([token_ids, segment_ids])
        output = self.dropout(pooled_output)
        output = self.dense(output)
        return output

model = Model().to(device)

# define the loss and optimizer; custom implementations are supported
model.compile(
    loss=nn.CrossEntropyLoss(),  # custom losses are supported
    optimizer=optim.Adam(model.parameters(), lr=2e-5),  # custom optimizers are supported
    scheduler=None,  # custom schedulers are supported
    metrics=['accuracy']
)

# define the evaluation function
def evaluate(data):
    total, right = 0., 0.
    for x_true, y_true in data:
        y_pred = model.predict(x_true).argmax(axis=1)
        total += len(y_true)
        right += (y_true == y_pred).sum().item()
    return right / total

class Evaluator(Callback):
    """Evaluate and save; here it only runs at the end of each epoch."""
    def __init__(self):
        self.best_val_acc = 0.

    def on_epoch_end(self, global_step, epoch, logs=None):
        val_acc = evaluate(valid_dataloader)
        if val_acc > self.best_val_acc:
            self.best_val_acc = val_acc
            model.save_weights('best_model.pt')
        print(f'val_acc: {val_acc:.5f}, best_val_acc: {self.best_val_acc:.5f}\n')

if __name__ == '__main__':
    evaluator = Evaluator()
    model.fit(train_dataloader, epochs=20, steps_per_epoch=100, grad_accumulation_steps=2, callbacks=[evaluator])
```
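After training, the best checkpoint saved by the `Evaluator` can be reloaded for inference. A minimal sketch reusing the objects defined above (the dataloader is assumed to yield the same `(features, labels)` structure as during training):
```python
# reload the best weights and run prediction batch by batch
model.load_weights('best_model.pt')
for x_true, y_true in valid_dataloader:
    y_pred = model.predict(x_true).argmax(axis=1)  # predicted class index per sample
    print(y_pred)
```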
## 2. Main modules
### 1) Data processing
#### a. Simplify the vocabulary and build the tokenizer
```python
token_dict, keep_tokens = load_vocab(
    dict_path=dict_path,  # path to the vocab file
    simplified=True,  # filter out redundant tokens such as [unused1]
    startswith=['[PAD]', '[UNK]', '[CLS]', '[SEP]'],  # tokens placed at the start, e.g. [UNK] moves from BERT's default position 103 to 1
)
tokenizer = Tokenizer(token_dict, do_lower_case=True)  # if no simplification is needed, this line alone defines the tokenizer
```
#### b. Handy helper functions
- `text_segmentate()`: truncates the total length to at most maxlen; accepts multiple sequences and repeatedly trims the longest one; indices marks the token positions to remove
- `tokenizer.encode()`: converts text to token_ids, prepending [CLS] and appending [SEP] by default; returns token_ids and segment_ids, equivalent to calling `tokenizer.tokenize()` followed by `tokenizer.tokens_to_ids()`
- `tokenizer.decode()`: converts token_ids back to text, removing special tokens such as [CLS], [SEP], [UNK] by default; equivalent to `tokenizer.ids_to_tokens()` plus some post-processing
- `sequence_padding()`: pads sequences to the same length; takes a list whose elements are lists, ndarrays, or tensors and returns an ndarray or tensor (a usage sketch follows this list)
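A short usage sketch for the helpers above (the import path follows the package layout in SOURCES.txt; the example sentences and shapes are illustrative):
```python
import torch
from bert4torch.snippets import sequence_padding

# encode a sentence pair: returns token_ids for [CLS] A [SEP] B [SEP] plus segment_ids
token_ids, segment_ids = tokenizer.encode('今天天气不错', '适合出去走走')
text = tokenizer.decode(token_ids)  # back to text, special tokens removed

# pad a batch of variable-length sequences to the same length
batch_token_ids = [tokenizer.encode(t)[0] for t in ['今天天气不错', '适合出去走走']]
batch_token_ids = torch.tensor(sequence_padding(batch_token_ids), dtype=torch.long)
```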
### 2) Model definition
- Creating the model
```python
'''
After calling the model: if with_pool, with_nsp, or with_mlm is set, the return value is
[hidden_states, pool_emb/nsp_emb, mlm_scores] in that order; otherwise only hidden_states is returned.
'''
build_transformer_model(
    config_path=config_path,  # path to the model's config file
    checkpoint_path=checkpoint_path,  # path to the weights file; the default None means no pre-trained weights are loaded
    model='bert',  # model architecture to load; a custom nn.Module-based Model can also be passed in
    application='encoder',  # model application; supports encoder, lm and unilm
    segment_vocab_size=2,  # number of type_token_ids, 2 by default; set to 0 if no segment_ids are passed in
    with_pool=False,  # whether to include the pooler part
    with_nsp=False,  # whether to include the NSP part
    with_mlm=False,  # whether to include the MLM part
    return_model_config=False,  # whether to also return the model config
    output_all_encoded_layers=False,  # whether to return the hidden states of all layers
)
```
- Defining the loss, optimizer, scheduler, etc.
```python
'''
Define the loss and optimizer to use; custom implementations are supported.
'''
model.compile(
    loss=nn.CrossEntropyLoss(),  # custom losses are supported
    optimizer=optim.Adam(model.parameters(), lr=2e-5),  # custom optimizers are supported
    scheduler=None,  # custom schedulers are supported
    adversarial_train={'name': 'fgm'},  # adversarial training tricks; supports fgm, pgd, gradient_penalty, vat
    metrics=['accuracy']  # fields printed by default (e.g. loss) need not be listed
)
```
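One concrete choice for the `scheduler` argument is the warmup helper shipped in bert4torch/optimizers.py; a sketch (the epoch count and warmup proportion are illustrative):
```python
from bert4torch.optimizers import get_linear_schedule_with_warmup

total_steps = len(train_dataloader) * 10  # train_batches * num_epoch
optimizer = optim.Adam(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=int(0.1 * total_steps), num_training_steps=total_steps)
model.compile(loss=nn.CrossEntropyLoss(), optimizer=optimizer, scheduler=scheduler)
```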
- Custom models
```python
'''
Various tweaks built on top of BERT, e.g. last2layer_average, token_first_last_average.
'''
class Model(BaseModel):
    # must inherit from BaseModel
    def __init__(self):
        super().__init__()
        self.bert = build_transformer_model(config_path, checkpoint_path)

    def forward(self):
        pass
```
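A sketch of one such tweak, averaging the last two encoder layers to build a sentence embedding. It assumes that `output_all_encoded_layers=True` makes the model return the hidden states of every layer, as listed in the creation options above:
```python
class Last2LayerAverageModel(BaseModel):
    def __init__(self):
        super().__init__()
        # ask for the hidden states of all layers so the last two can be averaged
        self.bert = build_transformer_model(config_path, checkpoint_path, output_all_encoded_layers=True)

    def forward(self, token_ids, segment_ids):
        all_hidden_states = self.bert([token_ids, segment_ids])      # one tensor per layer
        last2 = (all_hidden_states[-1] + all_hidden_states[-2]) / 2  # [btz, seq_len, hidden_size]
        return last2.mean(dim=1)                                     # mean-pool over tokens
```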
- [Custom training loop](https://github.com/Tongjilibo/bert4torch/blob/master/examples/others/task_custom_fit_progress.py)
```python
'''
Custom fit() loop, for when the built-in fit() does not meet your needs.
'''
class Model(BaseModel):
    def fit(self, train_dataloader, steps_per_epoch, epochs):
        train_dataloader = cycle(train_dataloader)
        self.train()
        for epoch in range(epochs):
            for bti in range(steps_per_epoch):
                train_X, train_y = next(train_dataloader)
                output = self.forward(*train_X)
                loss = self.criterion(output, train_y)
                loss.backward()
                self.optimizer.step()
                self.optimizer.zero_grad()
```
- Saving and loading model weights
```python
'''
prefix: whether to save with the original keys, e.g. the original key of the word embedding is
bert.embeddings.word_embeddings.weight. The default None disables this; for a model customized on top of
BaseModel, set prefix to the attribute name of its bert member, and set it to '' when using the built model directly.
This mainly makes the weights easy to load from other training frameworks.
'''
model.save_weights(save_path, prefix=None)
model.load_weights(load_path, strict=True, prefix=None)
```
- [Training with a model loaded from transformers](https://github.com/Tongjilibo/bert4torch/blob/master/examples/others/task_load_transformers_model.py)
```python
from transformers import AutoModelForSequenceClassification

class Model(BaseModel):
    def __init__(self):
        super().__init__()
        self.bert = AutoModelForSequenceClassification.from_pretrained("file_path", num_labels=2)

    def forward(self, token_ids, attention_mask, segment_ids):
        output = self.bert(input_ids=token_ids, attention_mask=attention_mask, token_type_ids=segment_ids)
        return output.logits
```
### 3) Model evaluation
```python
'''Callbacks can run at several points during training.
'''
class Evaluator(Callback):
    """Evaluate and save."""
    def __init__(self):
        self.best_val_acc = 0.

    def on_train_begin(self, logs=None):  # at the start of training
        pass

    def on_train_end(self, logs=None):  # at the end of training
        pass

    def on_batch_begin(self, global_step, batch, logs=None):  # at the start of a batch
        pass

    def on_batch_end(self, global_step, batch, logs=None):  # at the end of a batch
        # e.g. every N steps, write logs in the background, write to tensorboard, etc.
        # avoid print() inside batch_begin/batch_end so the progress bar is not broken
        pass

    def on_epoch_begin(self, global_step, epoch, logs=None):  # at the start of an epoch
        pass

    def on_epoch_end(self, global_step, epoch, logs=None):  # at the end of an epoch
        val_acc = evaluate(valid_dataloader)
        if val_acc > self.best_val_acc:
            self.best_val_acc = val_acc
            model.save_weights('best_model.pt')
        print(f'val_acc: {val_acc:.5f}, best_val_acc: {self.best_val_acc:.5f}\n')
```
## 3. Other features
### 1) Single-node multi-GPU training
#### a. Using DataParallel
```python
'''There are two ways to use DP: forward computes only the logits, or forward computes the loss directly.
The second is recommended, since it partially mitigates the load-imbalance problem.
'''
from bert4torch.models import BaseModelDP
# =========== prepare the data and define the model ===========
model = BaseModelDP(model)  # wrap the model so DP uses multiple GPUs
model.compile(
    loss=lambda x, _: x.mean(),  # mean of the losses computed on each GPU
    optimizer=optim.Adam(model.parameters(), lr=2e-5),
)
```
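A sketch of the recommended second option, where `forward` computes the loss inside the model so that each GPU only returns a scalar. It assumes the labels are included in the feature tuple produced by `collate_fn`:
```python
class ModelWithLoss(BaseModel):
    def __init__(self):
        super().__init__()
        self.bert = build_transformer_model(config_path, checkpoint_path, with_pool=True)
        self.dense = nn.Linear(768, 2)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, token_ids, segment_ids, labels):
        # labels must be part of the features returned by collate_fn for this layout
        _, pooled_output = self.bert([token_ids, segment_ids])
        logits = self.dense(pooled_output)
        return self.loss_fn(logits, labels)  # per-replica loss; averaged by the lambda passed to compile above
```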
#### b. Using DistributedDataParallel
```python
'''DDP is launched from the command line via torch.distributed.launch.
'''
# command-line arguments are required
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()
torch.cuda.set_device(args.local_rank)
device = torch.device('cuda', args.local_rank)
torch.distributed.init_process_group(backend='nccl')
# =========== prepare the data and define the model ===========
# wrap the model so DDP uses multiple GPUs; master_rank is the local_rank that prints training progress
model = BaseModelDDP(model, master_rank=0, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=False)
# define the loss and optimizer; custom implementations are supported
model.compile(
    loss=lambda x, _: x,  # pass the loss computed inside forward straight through
    optimizer=optim.Adam(model.parameters(), lr=2e-5),
)
```
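Two things the snippet above relies on but does not show: each process needs its own data shard (a `DistributedSampler`), and the script is started with one process per GPU. A sketch reusing the dataset and `collate_fn` from section 1 (script name and GPU count are illustrative):
```python
# launch with one process per GPU, e.g.
#   python -m torch.distributed.launch --nproc_per_node=4 crf_ddp.py
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

train_dataset = MyDataset('file_path')
train_sampler = DistributedSampler(train_dataset)  # shards and shuffles the data per process
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, sampler=train_sampler, collate_fn=collate_fn)
```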
### 2) Logging training to tensorboard
```python
from tensorboardX import SummaryWriter
writer = SummaryWriter(log_dir='./tensorboard')  # create the writer; log_dir can be any directory

class Evaluator(Callback):
    """Evaluate every N steps and log to tensorboard."""
    def on_batch_end(self, global_step, batch, logs=None):
        if global_step % 100 == 0:
            writer.add_scalar("train/loss", logs['loss'], global_step)
            val_acc = evaluate(valid_dataloader)
            writer.add_scalar("valid/acc", val_acc, global_step)
```
### 3) Printing model parameters
```python
from torchinfo import summary
summary(model, input_data=next(iter(train_dataloader))[0])
```
Metadata-Version: 2.1
Name: bert4torch
Version: 0.1.9
Summary: an elegant bert4torch
Home-page: https://github.com/Tongjilibo/bert4torch
Author: Tongjilibo
License: MIT Licence
Platform: UNKNOWN
License-File: LICENSE
bert4torch: https://github.com/Tongjilibo/bert4torch
LICENSE
README.md
setup.py
bert4torch/__init__.py
bert4torch/activations.py
bert4torch/layers.py
bert4torch/losses.py
bert4torch/models.py
bert4torch/optimizers.py
bert4torch/snippets.py
bert4torch/tokenizers.py
bert4torch.egg-info/PKG-INFO
bert4torch.egg-info/SOURCES.txt
bert4torch.egg-info/dependency_links.txt
bert4torch.egg-info/requires.txt
bert4torch.egg-info/top_level.txt
#! -*- coding: utf-8 -*-
__version__ = '0.1.9'
# Activations ported from transformers; the original bert4keras does not have these.
import math
import torch
from packaging import version
from torch import nn

def _gelu_python(x):
    """
    Original Implementation of the GELU activation function in Google BERT repo when initially created. For
    information: OpenAI GPT's GELU is slightly different (and gives slightly different results): 0.5 * x * (1 +
    torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3)))) This is now written in C in nn.functional
    Also see the Gaussian Error Linear Units paper: https://arxiv.org/abs/1606.08415
    """
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

def gelu_new(x):
    """
    Implementation of the GELU activation function currently in Google BERT repo (identical to OpenAI GPT). Also see
    the Gaussian Error Linear Units paper: https://arxiv.org/abs/1606.08415
    """
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))

if version.parse(torch.__version__) < version.parse("1.4"):
    gelu = _gelu_python
else:
    gelu = nn.functional.gelu

def gelu_fast(x):
    return 0.5 * x * (1.0 + torch.tanh(x * 0.7978845608 * (1.0 + 0.044715 * x * x)))

def quick_gelu(x):
    return x * torch.sigmoid(1.702 * x)

def _silu_python(x):
    """
    See Gaussian Error Linear Units (Hendrycks et al., https://arxiv.org/abs/1606.08415) where the SiLU (Sigmoid Linear
    Unit) was originally introduced and coined, and see Sigmoid-Weighted Linear Units for Neural Network Function
    Approximation in Reinforcement Learning (Elfwing et al., https://arxiv.org/abs/1702.03118) and Swish: a Self-Gated
    Activation Function (Ramachandran et al., https://arxiv.org/abs/1710.05941v1) where the SiLU was experimented with
    later.
    """
    return x * torch.sigmoid(x)

if version.parse(torch.__version__) < version.parse("1.7"):
    silu = _silu_python
else:
    silu = nn.functional.silu

def _mish_python(x):
    """
    See Mish: A Self-Regularized Non-Monotonic Activation Function (Misra., https://arxiv.org/abs/1908.08681). Also
    visit the official repository for the paper: https://github.com/digantamisra98/Mish
    """
    return x * torch.tanh(nn.functional.softplus(x))

if version.parse(torch.__version__) < version.parse("1.9"):
    mish = _mish_python
else:
    mish = nn.functional.mish

def linear_act(x):
    return x

ACT2FN = {
    "relu": nn.functional.relu,
    "silu": silu,
    "swish": silu,
    "gelu": gelu,
    "tanh": torch.tanh,
    "gelu_new": gelu_new,
    "gelu_fast": gelu_fast,
    "quick_gelu": quick_gelu,
    "mish": mish,
    "linear": linear_act,
    "sigmoid": torch.sigmoid,
    "softmax": nn.Softmax(dim=-1)
}

def get_activation(activation_string):
    if activation_string in ACT2FN:
        return ACT2FN[activation_string]
    else:
        raise KeyError(f"function {activation_string} not found in ACT2FN mapping {list(ACT2FN.keys())}")
from ast import arg
from tracemalloc import start
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import numpy as np

class FocalLoss(nn.Module):
    '''Multi-class Focal loss implementation'''
    def __init__(self, gamma=2, weight=None, ignore_index=-100):
        super(FocalLoss, self).__init__()
        self.gamma = gamma
        self.weight = weight
        self.ignore_index = ignore_index
    def forward(self, input, target):
        """
        input: [N, C]
        target: [N, ]
        """
        logpt = F.log_softmax(input, dim=1)
        pt = torch.exp(logpt)
        logpt = (1 - pt) ** self.gamma * logpt
        loss = F.nll_loss(logpt, target, self.weight, ignore_index=self.ignore_index)
        return loss

class LabelSmoothingCrossEntropy(nn.Module):
    def __init__(self, eps=0.1, reduction='mean', ignore_index=-100):
        super(LabelSmoothingCrossEntropy, self).__init__()
        self.eps = eps
        self.reduction = reduction
        self.ignore_index = ignore_index
    def forward(self, output, target):
        c = output.size()[-1]
        log_preds = F.log_softmax(output, dim=-1)
        if self.reduction == 'sum':
            loss = -log_preds.sum()
        else:
            loss = -log_preds.sum(dim=-1)
            if self.reduction == 'mean':
                loss = loss.mean()
        return loss * self.eps / c + (1 - self.eps) * F.nll_loss(log_preds, target, reduction=self.reduction,
                                                                 ignore_index=self.ignore_index)

class MultilabelCategoricalCrossentropy(nn.Module):
    """Multi-label classification cross-entropy.
    Note: y_true and y_pred must have the same shape; the elements of y_true are either 0 or 1,
    where 1 marks a target class and 0 a non-target class.
    Warning: make sure y_pred ranges over all real numbers; in other words, y_pred should generally not go
    through an activation, and in particular must not go through sigmoid or softmax! At prediction time,
    output the classes whose y_pred is greater than 0. If in doubt, read the reference carefully.
    Reference: https://kexue.fm/archives/7359
    """
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
    def forward(self, y_pred, y_true):
        """ y_true ([Tensor]): [..., num_classes]
            y_pred ([Tensor]): [..., num_classes]
        """
        y_pred = (1 - 2 * y_true) * y_pred
        y_pred_pos = y_pred - (1 - y_true) * 1e12
        y_pred_neg = y_pred - y_true * 1e12
        y_pred_pos = torch.cat([y_pred_pos, torch.zeros_like(y_pred_pos[..., :1])], dim=-1)
        y_pred_neg = torch.cat([y_pred_neg, torch.zeros_like(y_pred_neg[..., :1])], dim=-1)
        pos_loss = torch.logsumexp(y_pred_pos, dim=-1)
        neg_loss = torch.logsumexp(y_pred_neg, dim=-1)
        return (pos_loss + neg_loss).mean()

class SparseMultilabelCategoricalCrossentropy(nn.Module):
    """Sparse version of the multi-label classification cross-entropy.
    Notes:
    1. y_true.shape = [..., num_positive],
       y_pred.shape = [..., num_classes];
    2. make sure y_pred ranges over all real numbers, i.e. it should generally not go through an activation,
       and in particular not sigmoid or softmax;
    3. at prediction time, output the classes whose y_pred is greater than 0;
    4. see https://kexue.fm/archives/7359 for details.
    """
    def __init__(self, mask_zero=False, epsilon=1e-7, **kwargs):
        super().__init__(**kwargs)
        self.mask_zero = mask_zero
        self.epsilon = epsilon
    def forward(self, y_pred, y_true):
        zeros = torch.zeros_like(y_pred[..., :1])
        y_pred = torch.cat([y_pred, zeros], dim=-1)
        if self.mask_zero:
            infs = zeros + float('inf')
            y_pred = torch.cat([infs, y_pred[..., 1:]], dim=-1)
        y_pos_2 = torch.gather(y_pred, dim=-1, index=y_true)
        y_pos_1 = torch.cat([y_pos_2, zeros], dim=-1)
        if self.mask_zero:
            y_pred = torch.cat([-infs, y_pred[..., 1:]], dim=-1)
            y_pos_2 = torch.gather(y_pred, dim=-1, index=y_true)
        pos_loss = torch.logsumexp(-y_pos_1, dim=-1)
        all_loss = torch.logsumexp(y_pred, dim=-1)  # a
        aux_loss = torch.logsumexp(y_pos_2, dim=-1) - all_loss  # b-a
        aux_loss = torch.clamp(1 - torch.exp(aux_loss), self.epsilon, 1)  # 1-exp(b-a)
        neg_loss = all_loss + torch.log(aux_loss)  # a + log[1-exp(b-a)]
        return pos_loss + neg_loss

class ContrastiveLoss(nn.Module):
    """Contrastive loss: pull positive pairs closer together and push positive/negative pairs further apart.
    Formula: labels * distance_matrix.pow(2) + (1-labels) * F.relu(margin - distance_matrix).pow(2)
    https://www.sbert.net/docs/package_reference/losses.html
    """
    def __init__(self, margin=0.5, size_average=True, online=False):
        super(ContrastiveLoss, self).__init__()
        self.margin = margin
        self.size_average = size_average
        self.online = online
    def forward(self, distances, labels, pos_id=1, neg_id=0):
        if not self.online:
            losses = 0.5 * (labels.float() * distances.pow(2) + (1 - labels).float() * F.relu(self.margin - distances).pow(2))
            return losses.mean() if self.size_average else losses.sum()
        else:
            negs = distances[labels == neg_id]
            poss = distances[labels == pos_id]
            # select hard positive and hard negative pairs
            negative_pairs = negs[negs < (poss.max() if len(poss) > 1 else negs.mean())]
            positive_pairs = poss[poss > (negs.min() if len(negs) > 1 else poss.mean())]
            positive_loss = positive_pairs.pow(2).sum()
            negative_loss = F.relu(self.margin - negative_pairs).pow(2).sum()
            return positive_loss + negative_loss

class RDropLoss(nn.Module):
    '''Loss for R-Drop; official project: https://github.com/dropreg/R-Drop
    '''
    def __init__(self, alpha=4, rank='adjacent'):
        super().__init__()
        self.alpha = alpha
        # two layouts are supported: interleaved odd/even rows, or stacked top/bottom halves
        assert rank in {'adjacent', 'updown'}, "rank kwarg only support 'adjacent' and 'updown' "
        self.rank = rank
        self.loss_sup = nn.CrossEntropyLoss()
        self.loss_rdrop = nn.KLDivLoss(reduction='none')
    def forward(self, *args):
        '''Two call signatures are supported: (y_pred, y_true) or (y_pred1, y_pred2, y_true)
        '''
        assert len(args) in {2, 3}, 'RDropLoss only support 2 or 3 input args'
        # y_pred is a single tensor
        if len(args) == 2:
            y_pred, y_true = args
            loss_sup = self.loss_sup(y_pred, y_true)  # computed over both copies
            if self.rank == 'adjacent':
                y_pred1 = y_pred[1::2]
                y_pred2 = y_pred[::2]
            elif self.rank == 'updown':
                half_btz = y_true.shape[0] // 2
                y_pred1 = y_pred[:half_btz]
                y_pred2 = y_pred[half_btz:]
        # y_pred comes as two tensors
        else:
            y_pred1, y_pred2, y_true = args
            loss_sup = self.loss_sup(y_pred1, y_true)
        loss_rdrop1 = self.loss_rdrop(F.log_softmax(y_pred1, dim=-1), F.softmax(y_pred2, dim=-1))
        loss_rdrop2 = self.loss_rdrop(F.log_softmax(y_pred2, dim=-1), F.softmax(y_pred1, dim=-1))
        return loss_sup + torch.mean(loss_rdrop1 + loss_rdrop2) / 4 * self.alpha

class UDALoss(nn.Module):
    '''UDA loss. Subclass it when using, because forward needs global_step and total_steps.
    https://arxiv.org/abs/1904.12848
    '''
    def __init__(self, tsa_schedule=None, total_steps=None, start_p=0, end_p=1, return_all_loss=True):
        super().__init__()
        self.loss_sup = nn.CrossEntropyLoss()
        self.loss_unsup = nn.KLDivLoss(reduction='batchmean')
        self.tsa_schedule = tsa_schedule
        self.start = start_p
        self.end = end_p
        if self.tsa_schedule:
            assert self.tsa_schedule in {'linear_schedule', 'exp_schedule', 'log_schedule'}, 'tsa_schedule config illegal'
        self.return_all_loss = return_all_loss
    def forward(self, y_pred, y_true_sup, global_step, total_steps):
        sup_size = y_true_sup.size(0)
        unsup_size = (y_pred.size(0) - sup_size) // 2
        # supervised part: cross-entropy loss
        y_pred_sup = y_pred[:sup_size]
        if self.tsa_schedule is None:
            loss_sup = self.loss_sup(y_pred_sup, y_true_sup)
        else:  # use TSA to drop supervised samples whose predicted probability is already high
            threshold = self.get_tsa_threshold(self.tsa_schedule, global_step, total_steps, self.start, self.end)
            true_prob = torch.gather(F.softmax(y_pred_sup, dim=-1), dim=1, index=y_true_sup[:, None])
            sel_rows = true_prob.lt(threshold).sum(dim=-1).gt(0)  # keep only the samples below the threshold
            loss_sup = self.loss_sup(y_pred_sup[sel_rows], y_true_sup[sel_rows]) if sel_rows.sum() > 0 else 0
        # unsupervised part: KL divergence here, cross-entropy would also work
        y_true_unsup = y_pred[sup_size:sup_size + unsup_size]
        y_true_unsup = F.softmax(y_true_unsup.detach(), dim=-1)
        y_pred_unsup = F.log_softmax(y_pred[sup_size + unsup_size:], dim=-1)
        loss_unsup = self.loss_unsup(y_pred_unsup, y_true_unsup)
        if self.return_all_loss:
            return loss_sup + loss_unsup, loss_sup, loss_unsup
        else:
            return loss_sup + loss_unsup
    @staticmethod
    def get_tsa_threshold(schedule, global_step, num_train_steps, start, end):
        training_progress = global_step / num_train_steps
        if schedule == "linear_schedule":
            threshold = training_progress
        elif schedule == "exp_schedule":
            scale = 5
            threshold = math.exp((training_progress - 1) * scale)
        elif schedule == "log_schedule":
            scale = 5
            threshold = 1 - math.exp((-training_progress) * scale)
        return threshold * (end - start) + start

class TemporalEnsemblingLoss(nn.Module):
    '''Implementation of temporal ensembling: add an MSE consistency loss on top of the supervised loss.
    Official project: https://github.com/s-laine/tempens
    Third-party pytorch implementation: https://github.com/ferretj/temporal-ensembling
    When using this loss, the train_dataloader must be created with shuffle=False.
    '''
    def __init__(self, epochs, max_val=10.0, ramp_up_mult=-5.0, alpha=0.5, max_batch_num=100, hist_device='cpu'):
        super().__init__()
        self.loss_sup = nn.CrossEntropyLoss()
        self.max_epochs = epochs
        self.max_val = max_val
        self.ramp_up_mult = ramp_up_mult
        self.alpha = alpha
        self.max_batch_num = max_batch_num  # set to None to keep the full history; costs memory on large datasets
        self.hist_unsup = []  # history of unsupervised logits
        self.hist_sup = []  # history of supervised logits
        self.hist_device = hist_device
        self.hist_input_y = []  # history of supervised labels y
        assert (self.alpha >= 0) & (self.alpha < 1)  # with alpha == 1 the denominator in update() becomes 0
    def forward(self, y_pred_sup, y_pred_unsup, y_true_sup, epoch, bti):
        self.same_batch_check(y_pred_sup, y_pred_unsup, y_true_sup, bti)
        if (self.max_batch_num is None) or (bti < self.max_batch_num):
            self.init_hist(bti, y_pred_sup, y_pred_unsup)  # initialize the history
            sup_ratio = float(len(y_pred_sup)) / (len(y_pred_sup) + len(y_pred_unsup))  # proportion of supervised samples
            w = self.weight_schedule(epoch, sup_ratio)
            sup_loss, unsup_loss = self.temporal_loss(y_pred_sup, y_pred_unsup, y_true_sup, bti)
            # update the history
            self.hist_unsup[bti] = self.update(self.hist_unsup[bti], y_pred_unsup.detach(), epoch)
            self.hist_sup[bti] = self.update(self.hist_sup[bti], y_pred_sup.detach(), epoch)
            # if bti == 0:  # used to check that the data order is identical across epochs
            #     print(w, sup_loss.item(), w * unsup_loss.item())
            #     print(y_true_sup)
            return sup_loss + w * unsup_loss, sup_loss, w * unsup_loss
        else:
            return self.loss_sup(y_pred_sup, y_true_sup)
    def same_batch_check(self, y_pred_sup, y_pred_unsup, y_true_sup, bti):
        '''The first few batches (hard-coded to 10 here) must be identical across epochs.
        '''
        if bti >= 10:
            return
        if bti >= len(self.hist_input_y):
            self.hist_input_y.append(y_true_sup.to(self.hist_device))
        else:  # check
            err_msg = 'TemporalEnsemblingLoss requests the same sort dataloader, you may need to set train_dataloader shuffle=False'
            assert self.hist_input_y[bti].equal(y_true_sup.to(self.hist_device)), err_msg
    def update(self, hist, y_pred, epoch):
        '''Update the historical logits, gated by alpha.
        '''
        Z = self.alpha * hist.to(y_pred) + (1. - self.alpha) * y_pred
        output = Z * (1. / (1. - self.alpha ** (epoch + 1)))
        return output.to(self.hist_device)
    def weight_schedule(self, epoch, sup_ratio):
        max_val = self.max_val * sup_ratio
        if epoch == 0:
            return 0.
        elif epoch >= self.max_epochs:
            return max_val
        return max_val * np.exp(self.ramp_up_mult * (1. - float(epoch) / self.max_epochs) ** 2)
    def temporal_loss(self, y_pred_sup, y_pred_unsup, y_true_sup, bti):
        # MSE between current and temporal outputs
        def mse_loss(out1, out2):
            quad_diff = torch.sum((F.softmax(out1, dim=1) - F.softmax(out2, dim=1)) ** 2)
            return quad_diff / out1.data.nelement()
        sup_loss = self.loss_sup(y_pred_sup, y_true_sup)
        # the original implementation computes sup and unsup together in one tensor; here they are two tensors, so compute them separately
        unsup_loss = mse_loss(y_pred_unsup, self.hist_unsup[bti].to(y_pred_unsup))
        unsup_loss += mse_loss(y_pred_sup, self.hist_sup[bti].to(y_pred_sup))
        return sup_loss, unsup_loss
    def init_hist(self, bti, y_pred_sup, y_pred_unsup):
        if bti >= len(self.hist_sup):
            self.hist_sup.append(torch.zeros_like(y_pred_sup).to(self.hist_device))
            self.hist_unsup.append(torch.zeros_like(y_pred_unsup).to(self.hist_device))
from torch.optim.lr_scheduler import LambdaLR

def get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, last_epoch=-1):
    """Schedule with warmup, taken from optimization.py in the transformers package.
    Args:
        num_warmup_steps:
            number of warmup steps, usually num_training_steps * warmup_proportion (recommended warmup proportion: 0.05-0.15)
        num_training_steps:
            total number of training steps, usually train_batches * num_epoch
    """
    def lr_lambda(current_step: int):
        if current_step < num_warmup_steps:
            return float(current_step) / float(max(1, num_warmup_steps))
        return max(0.0, float(num_training_steps - current_step) / float(max(1, num_training_steps - num_warmup_steps)))
    return LambdaLR(optimizer, lr_lambda, last_epoch)

def extend_with_exponential_moving_average(model, decay=0.999):
    class ExponentialMovingAverage():
        '''Exponential moving average of the model weights. It does not take part in gradient updates;
        it only tracks the moving average of the parameters, which is then used for prediction.
        Note this is different from adaptive-learning-rate optimizers such as Adam, which keep exponential
        moving averages of first- and second-order gradient moments; the two are completely different things.
        Example:
            # initialization
            ema = ExponentialMovingAverage(model, 0.999)
            # during training, after the parameters are updated, also update the ema weights
            def train():
                optimizer.step()
                ema.step()
            # before eval, call apply_ema_weights(); after eval, call restore_raw_weights() to restore the raw weights
            def evaluate():
                ema.apply_ema_weights()
                # evaluate
                # to save the EMA model, call torch.save() before the restore step
                ema.restore_raw_weights()
        '''
        def __init__(self, model, decay):
            self.model = model
            self.decay = decay
            # ema weights (moving average of every layer at the current step)
            self.ema_weights = {}
            # raw model weights saved while evaluating; restored from the ema weights after evaluation finishes
            self.model_weights = {}
            # initialize ema_weights from the model weights
            for name, param in self.model.named_parameters():
                if param.requires_grad:
                    self.ema_weights[name] = param.data.clone()
        def step(self):
            for name, param in self.model.named_parameters():
                if param.requires_grad:
                    assert name in self.ema_weights
                    new_average = (1.0 - self.decay) * param.data + self.decay * self.ema_weights[name]
                    self.ema_weights[name] = new_average.clone()
        def apply_ema_weights(self):
            for name, param in self.model.named_parameters():
                if param.requires_grad:
                    assert name in self.ema_weights
                    self.model_weights[name] = param.data
                    param.data = self.ema_weights[name]
        def restore_raw_weights(self):
            for name, param in self.model.named_parameters():
                if param.requires_grad:
                    assert name in self.model_weights
                    param.data = self.model_weights[name]
            self.model_weights = {}
    return ExponentialMovingAverage(model, decay)