Knowledge Distillation on NNI
===

## KnowledgeDistill

Knowledge distillation is supported as described in [Distilling the Knowledge in a Neural Network](https://arxiv.org/abs/1503.02531): the compressed model is trained to mimic a pre-trained, larger model. This training setting is also referred to as "teacher-student", where the large model is the teacher and the small model is the student.

![](../../img/distill.png)
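For reference, the soft-target part of the objective in the cited paper compares temperature-softened student and teacher outputs with a KL divergence. Below is a minimal PyTorch sketch of that term; it illustrates the idea and is not NNI's internal implementation. In the usage below, `kd.loss(...)` supplies a term of this kind, with `kd_T` as the temperature.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=5.0):
    """Soft-target loss from the cited paper (a generic sketch, not NNI's code)."""
    # Soften both output distributions with temperature T.
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    # KL divergence between the softened distributions; the T*T factor keeps
    # gradient magnitudes comparable across temperatures, as the paper suggests.
    return F.kl_div(log_soft_student, soft_teacher, reduction='batchmean') * T * T
```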

### Usage

PyTorch code

```python
import torch.nn.functional as F

from knowledge_distill.knowledge_distill import KnowledgeDistill
kd = KnowledgeDistill(kd_teacher_model, kd_T=5)
alpha = 1
beta = 0.8
for batch_idx, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    output = model(data)
    loss = F.cross_entropy(output, target)
    # you only need to add the following line to fine-tune with knowledge distillation
    loss = alpha * loss + beta * kd.loss(data=data, student_out=output)
    loss.backward()
    optimizer.step()
```
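The snippet assumes a standard training context (`model`, `device`, `train_loader`, `optimizer`) plus a pre-trained teacher. One possible setup is sketched below; the model classes, checkpoint path, and optimizer choice are placeholders for illustration only:

```python
import torch

# Hypothetical setup for the snippet above; TeacherNet, StudentNet and
# 'teacher.pth' are placeholders for illustration, not part of the NNI API.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

kd_teacher_model = TeacherNet().to(device)
kd_teacher_model.load_state_dict(torch.load('teacher.pth', map_location=device))
kd_teacher_model.eval()                      # the teacher is only queried, never trained

model = StudentNet().to(device)              # the smaller student being fine-tuned
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```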

#### User configuration for KnowledgeDistill
* **kd_teacher_model:** The pre-trained teacher model
* **kd_T:** Temperature for smoothing the teacher model's output

The complete code can be found [here](https://github.com/microsoft/nni/tree/v1.3/examples/model_compress/knowledge_distill/).