"src/target/codegen_cuda.cc" did not exist on "64f17c2f369e612cc297d358f607307a615bbb59"
Commit a2210436 authored by Tang Lang, committed by QuanluZhang

Knowledge Distillation support (#1819)

parent 665790fc
@@ -6,7 +6,9 @@ NNI provides an easy-to-use toolkit to help user design and use compression algo
## Supported algorithms
-We have provided two naive compression algorithms and three popular ones for users, including two pruning algorithms and three quantization algorithms:
+We have provided several compression algorithms, including pruning algorithms and quantization algorithms:
**Pruning**
|Name|Brief Introduction of Algorithm|
|---|---|
@@ -16,6 +18,11 @@ We have provided two naive compression algorithms and three popular ones for use
| [Slim Pruner](./Pruner.md#slim-pruner) | Pruning channels in convolution layers by pruning scaling factors in BN layers (Learning Efficient Convolutional Networks through Network Slimming) [Reference Paper](https://arxiv.org/abs/1708.06519) |
| [Lottery Ticket Pruner](./Pruner.md#agp-pruner) | The pruning process used by "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks". It prunes a model iteratively. [Reference Paper](https://arxiv.org/abs/1803.03635)|
| [FPGM Pruner](./Pruner.md#fpgm-pruner) | Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration [Reference Paper](https://arxiv.org/pdf/1811.00250.pdf)|
**Quantization**
|Name|Brief Introduction of Algorithm|
|---|---|
| [Naive Quantizer](./Quantizer.md#naive-quantizer) | Quantize weights to default 8 bits |
| [QAT Quantizer](./Quantizer.md#qat-quantizer) | Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. [Reference Paper](http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf)|
| [DoReFa Quantizer](./Quantizer.md#dorefa-quantizer) | DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. [Reference Paper](https://arxiv.org/abs/1606.06160)|
@@ -24,25 +31,26 @@ We have provided two naive compression algorithms and three popular ones for use
We use a simple example to show how to modify your trial code in order to apply the compression algorithms. Let's say you want to prune all weights to 80% sparsity with Level Pruner; you can add the following three lines into your code before training your model ([here](https://github.com/microsoft/nni/tree/master/examples/model_compress) is the complete code).
-Tensorflow code
+PyTorch code
```python
-from nni.compression.tensorflow import LevelPruner
+from nni.compression.torch import LevelPruner
config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }]
-pruner = LevelPruner(tf.get_default_graph(), config_list)
+pruner = LevelPruner(model, config_list)
pruner.compress()
```
-PyTorch code
+Tensorflow code
```python
-from nni.compression.torch import LevelPruner
+from nni.compression.tensorflow import LevelPruner
config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }]
-pruner = LevelPruner(model, config_list)
+pruner = LevelPruner(tf.get_default_graph(), config_list)
pruner.compress()
```
-You can use other compression algorithms in the package of `nni.compression`. The algorithms are implemented in both PyTorch and Tensorflow, under `nni.compression.torch` and `nni.compression.tensorflow` respectively. You can refer to [Pruner](./Pruner.md) and [Quantizer](./Quantizer.md) for detail description of supported algorithms.
+You can use other compression algorithms in the package of `nni.compression`. The algorithms are implemented in both PyTorch and Tensorflow, under `nni.compression.torch` and `nni.compression.tensorflow` respectively. You can refer to [Pruner](./Pruner.md) and [Quantizer](./Quantizer.md) for a detailed description of the supported algorithms. Also, if you want to use knowledge distillation, you can refer to [KDExample](../TrialExample/KDExample.md).
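For instance, switching from the Level Pruner above to the L1-norm filter pruner (the pruner used by the knowledge distillation example added in this commit) only changes the import and the config; a PyTorch sketch:

```python
from nni.compression.torch import L1FilterPruner

# prune 80% of the filters of every Conv2d layer, ranked by L1 norm
config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
pruner = L1FilterPruner(model, config_list)
model = pruner.compress()
```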
The function call `pruner.compress()` modifies the user-defined model (in Tensorflow the model can be obtained with `tf.get_default_graph()`, while in PyTorch the model is the defined model class), inserting masks into it. Then when you run the model, the masks take effect. The masks can be adjusted at runtime by the algorithms.
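Concretely, a minimal PyTorch sketch of what follows `compress()`: the pruner wraps `model` in place, so the usual training and evaluation loop stays unchanged, and the result can be exported afterwards (`train`, `test`, `fine_tune_epochs` and the file names stand for your own functions and choices):

```python
model = pruner.compress()
for epoch in range(fine_tune_epochs):                # number of fine-tuning epochs is up to you
    train(model, device, train_loader, optimizer)    # your existing training loop; masks are applied on every forward pass
    test(model, device, test_loader)                 # your existing evaluation loop
# export the masked weights and the masks themselves
pruner.export_model(model_path='pruned_model.pth', mask_path='mask.pth')
```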
Knowledge Distillation on NNI Compressor
===
## KnowledgeDistill
NNI supports knowledge distillation. As described in [Distilling the Knowledge in a Neural Network](https://arxiv.org/abs/1503.02531), the compressed model is trained to mimic a pre-trained, larger model. This training setting is also referred to as "teacher-student", where the large model is the teacher and the small model is the student.
![](../../img/distill.png)
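The student is trained on a weighted sum of its own task loss and a distillation term that compares the temperature-softened outputs of the two models. With the temperature `kd_T` written as $T$, the objective used in the snippet below is roughly

$$\mathcal{L} = \alpha \cdot \mathrm{CE}(z_s, y) + \beta \cdot \mathrm{KL}\big(\mathrm{softmax}(z_t / T) \,\|\, \mathrm{softmax}(z_s / T)\big)$$

where $z_s$ and $z_t$ are the student's and teacher's logits and $\alpha$, $\beta$ are user-chosen weights.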
### Usage
PyTorch code
```python
from knowledge_distill.knowledge_distill import KnowledgeDistill
kd = KnowledgeDistill(kd_teacher_model, kd_T=5)
alpha = 1
beta = 0.8
for batch_idx, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    output = model(data)
    loss = F.cross_entropy(output, target)
    # you only need to add the following line to fine-tune with knowledge distillation
    loss = alpha * loss + beta * kd.loss(data=data, student_out=output)
    loss.backward()
    optimizer.step()
```
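The snippet above assumes `kd_teacher_model` is an already trained PyTorch model. A minimal sketch of preparing it (the `TeacherNet` class and the checkpoint path are placeholders for your own teacher):

```python
import torch
from knowledge_distill.knowledge_distill import KnowledgeDistill

kd_teacher_model = TeacherNet()                              # placeholder: your pre-trained teacher architecture
kd_teacher_model.load_state_dict(torch.load('teacher.pth'))  # placeholder checkpoint path
kd_teacher_model.to(device)
kd_teacher_model.eval()                                      # the teacher is only queried, never trained
kd = KnowledgeDistill(kd_teacher_model, kd_T=5)
```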
#### User configuration for KnowledgeDistill
* **kd_teacher_model:** The pre-trained teacher model
* **kd_T:** Temperature for smoothing the teacher model's output (see the sketch below)
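To get a feel for the temperature, a quick standalone sketch (not part of the example) comparing the softmax of the same logits at `kd_T=1` and `kd_T=5`:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])
print(F.softmax(logits / 1.0, dim=0))  # sharper: roughly [0.66, 0.24, 0.10]
print(F.softmax(logits / 5.0, dim=0))  # softer:  roughly [0.40, 0.33, 0.27]
```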
The complete code can be found here.
import logging
import torch
import torch.nn.functional as F

_logger = logging.getLogger(__name__)


class KnowledgeDistill():
    """
    Knowledge Distillation support while fine-tuning the compressed model

    Geoffrey Hinton, Oriol Vinyals, Jeff Dean
    "Distilling the Knowledge in a Neural Network"
    https://arxiv.org/abs/1503.02531
    """
    def __init__(self, teacher_model, kd_T=1):
        """
        Parameters
        ----------
        teacher_model : pytorch model
            the teacher_model for teaching the student model; it should be pretrained
        kd_T : float
            kd_T is the temperature parameter; when kd_T=1 we get the standard softmax function.
            As kd_T grows, the probability distribution generated by the softmax function becomes softer
        """
        self.teacher_model = teacher_model
        self.kd_T = kd_T

    def _get_kd_loss(self, data, student_out, teacher_out_preprocess=None):
        """
        Parameters
        ----------
        data : torch.Tensor
            the input training data
        student_out : torch.Tensor
            output of the student network
        teacher_out_preprocess : function
            a function for pre-processing the teacher model's output,
            e.g. teacher_out_preprocess=lambda x: x[0]
            extracts the first tensor of a tuple output (tensor1, tensor2) -> tensor1

        Returns
        -------
        torch.Tensor
            distillation loss
        """
        # the teacher only provides targets, so no gradients are needed for its forward pass
        with torch.no_grad():
            kd_out = self.teacher_model(data)
        if teacher_out_preprocess is not None:
            kd_out = teacher_out_preprocess(kd_out)
        assert type(kd_out) is torch.Tensor
        assert type(student_out) is torch.Tensor
        assert kd_out.shape == student_out.shape
        # soften both distributions with the temperature and compute the KL divergence
        soft_log_out = F.log_softmax(student_out / self.kd_T, dim=1)
        soft_t = F.softmax(kd_out / self.kd_T, dim=1)
        loss_kd = F.kl_div(soft_log_out, soft_t.detach(), reduction='batchmean')
        return loss_kd

    def loss(self, data, student_out):
        """
        Parameters
        ----------
        data : torch.Tensor
            Input of the student model
        student_out : torch.Tensor
            Output of the student model

        Returns
        -------
        torch.Tensor
            Distillation loss, to be combined with the student's own loss by the caller
        """
        return self._get_kd_loss(data, student_out)
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
from nni.compression.torch import L1FilterPruner
from knowledge_distill.knowledge_distill import KnowledgeDistill
class vgg(nn.Module):
    def __init__(self, init_weights=True):
        super(vgg, self).__init__()
        cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512]
        self.cfg = cfg
        self.feature = self.make_layers(cfg, True)
        num_classes = 10
        self.classifier = nn.Sequential(
            nn.Linear(cfg[-1], 512),
            nn.BatchNorm1d(512),
            nn.ReLU(inplace=True),
            nn.Linear(512, num_classes)
        )
        if init_weights:
            self._initialize_weights()

    def make_layers(self, cfg, batch_norm=True):
        layers = []
        in_channels = 3
        for v in cfg:
            if v == 'M':
                layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
            else:
                conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1, bias=False)
                if batch_norm:
                    layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
                else:
                    layers += [conv2d, nn.ReLU(inplace=True)]
                in_channels = v
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.feature(x)
        x = nn.AvgPool2d(2)(x)
        x = x.view(x.size(0), -1)
        y = self.classifier(x)
        return y

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
                if m.bias is not None:
                    m.bias.data.zero_()
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(0.5)
                m.bias.data.zero_()
            elif isinstance(m, nn.Linear):
                m.weight.data.normal_(0, 0.01)
                m.bias.data.zero_()
def train(model, device, train_loader, optimizer, kd=None):
    alpha = 1
    beta = 0.8
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        student_loss = F.cross_entropy(output, target)
        if kd is not None:
            # combine the ordinary task loss with the distillation loss
            kd_loss = kd.loss(data=data, student_out=output)
            loss = alpha * student_loss + beta * kd_loss
        else:
            loss = student_loss
        loss.backward()
        optimizer.step()
        if batch_idx % 100 == 0:
            print('{:2.0f}%  Loss {}'.format(100 * batch_idx / len(train_loader), loss.item()))
def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            # the model outputs raw logits, so use cross_entropy for the summed test loss
            test_loss += F.cross_entropy(output, target, reduction='sum').item()
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()
    test_loss /= len(test_loader.dataset)
    acc = 100 * correct / len(test_loader.dataset)
    print('Loss: {}  Accuracy: {}%\n'.format(test_loss, acc))
    return acc
def main():
    torch.manual_seed(0)
    device = torch.device('cuda')
    train_loader = torch.utils.data.DataLoader(
        datasets.CIFAR10('./data.cifar10', train=True, download=True,
                         transform=transforms.Compose([
                             transforms.Pad(4),
                             transforms.RandomCrop(32),
                             transforms.RandomHorizontalFlip(),
                             transforms.ToTensor(),
                             transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
                         ])),
        batch_size=64, shuffle=True)
    test_loader = torch.utils.data.DataLoader(
        datasets.CIFAR10('./data.cifar10', train=False,
                         transform=transforms.Compose([
                             transforms.ToTensor(),
                             transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
                         ])),
        batch_size=200, shuffle=False)
    model = vgg()
    model.to(device)

    # Train the base VGG-16 model
    print('=' * 10 + 'Train the unpruned base model' + '=' * 10)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
    lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, 160, 0)
    for epoch in range(160):
        print('# Epoch {} #'.format(epoch))
        train(model, device, train_loader, optimizer)
        test(model, device, test_loader)
        lr_scheduler.step(epoch)
    torch.save(model.state_dict(), 'vgg16_cifar10.pth')

    # Test base model accuracy
    print('=' * 10 + 'Test on the original model' + '=' * 10)
    model.load_state_dict(torch.load('vgg16_cifar10.pth'))
    test(model, device, test_loader)
    # top1 = 93.51%

    # Pruning configuration: prune 80% of the filters of every convolution layer according to the L1 norm
    configure_list = [{
        'sparsity': 0.8,
        'op_types': ['Conv2d'],
    }]

    # Prune model and test accuracy without fine tuning.
    print('=' * 10 + 'Test on the pruned model before fine tune' + '=' * 10)
    pruner = L1FilterPruner(model, configure_list)
    model = pruner.compress()
    test(model, device, test_loader)
    # top1 = 10.00%

    # Fine tune the pruned model for 40 epochs with knowledge distillation and test accuracy
    print('=' * 10 + 'Fine tuning' + '=' * 10)
    optimizer_finetune = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-4)
    best_top1 = 0
    kd_teacher_model = vgg()
    kd_teacher_model.to(device)
    kd_teacher_model.load_state_dict(torch.load('vgg16_cifar10.pth'))
    kd = KnowledgeDistill(kd_teacher_model, kd_T=5)
    for epoch in range(40):
        pruner.update_epoch(epoch)
        print('# Epoch {} #'.format(epoch))
        train(model, device, train_loader, optimizer_finetune, kd)
        top1 = test(model, device, test_loader)
        if top1 > best_top1:
            best_top1 = top1
            # Export the best model: 'model_path' stores state_dict of the pruned model,
            # 'mask_path' stores mask_dict of the pruned model
            pruner.export_model(model_path='pruned_vgg16_cifar10.pth', mask_path='mask_vgg16_cifar10.pth')

    # Test the exported model
    print('=' * 10 + 'Test on the pruned model after fine tune' + '=' * 10)
    new_model = vgg()
    new_model.to(device)
    new_model.load_state_dict(torch.load('pruned_vgg16_cifar10.pth'))
    test(new_model, device, test_loader)
    # top1 = 85.43% with kd, top1 = 85.04% without kd


if __name__ == '__main__':
    main()