# DistilBERT

This folder contains the original code used to train DistilBERT as well as examples showcasing how to use DistilBERT.

## What is DistilBERT

DistilBERT stands for Distilled BERT. It is a small, fast, cheap and light Transformer model based on the BERT architecture. It has 40% fewer parameters than `bert-base-uncased` and runs 60% faster, while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark. DistilBERT is trained using knowledge distillation, a technique that compresses a large model, called the teacher, into a smaller model, called the student. By distilling BERT, we obtain a smaller Transformer model that bears many similarities to the original BERT model while being lighter, smaller and faster to run. DistilBERT is thus an interesting option for putting large-scale pre-trained Transformer models into production.
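
Concretely, the distillation objective combines a soft-target term, which pushes the student to match the teacher's temperature-softened output distribution, with the usual hard-target loss. The snippet below is a minimal sketch of this core idea only; the function name and the `temperature` and `alpha` values are illustrative, not the exact settings used to train DistilBERT.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soft targets: KL divergence between the temperature-softened teacher and student
    # distributions; the T^2 factor keeps gradient magnitudes comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction='batchmean',
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy against the ground-truth (here, masked) tokens.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```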

For more information on DistilBERT, please refer to our [detailed blog post](https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-distilbert-a-distilled-version-of-bert-8cf3380435b5).

## How to use DistilBERT

PyTorch-Transformers includes two pre-trained DistilBERT models, currently only provided for English (we are investigating the possibility of training and releasing a multilingual version of DistilBERT):

- `distilbert-base-uncased`: DistilBERT English language model pretrained on the same data used to pretrain BERT (a concatenation of the Toronto Book Corpus and full English Wikipedia), using distillation with the supervision of the `bert-base-uncased` version of BERT. The model has 6 layers, a hidden size of 768 and 12 heads, for a total of 66M parameters.
- `distilbert-base-uncased-distilled-squad`: A version of `distilbert-base-uncased` fine-tuned with (a second step of) knowledge distillation on SQuAD 1.0. This model reaches an F1 score of 86.2 on the dev set (for comparison, the `bert-base-uncased` version of BERT reaches an F1 score of 88.5); see the question-answering sketch below.

Using DistilBERT is very similar to using BERT. DistilBERT shares the same tokenizer as BERT's `bert-base-uncased`; we simply expose it under the `DistilBertTokenizer` name for consistent naming across the library's models.

```python
import torch
from pytorch_transformers import DistilBertModel, DistilBertTokenizer  # the package is named `transformers` in more recent versions

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')

input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
```
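
The SQuAD-distilled checkpoint can be used in the same way with a question-answering head. The following is a rough sketch only: it assumes the `DistilBertForQuestionAnswering` class and decodes the predicted span naively by joining wordpieces, without the post-processing used for the reported scores.

```python
import torch
from pytorch_transformers import DistilBertTokenizer, DistilBertForQuestionAnswering  # `transformers` in more recent versions

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased-distilled-squad')

question = "Who developed DistilBERT?"
context = "DistilBERT was developed by Hugging Face as a distilled version of BERT."

# DistilBERT has no token type embeddings: the question and context are simply concatenated.
tokens = ['[CLS]'] + tokenizer.tokenize(question) + ['[SEP]'] + tokenizer.tokenize(context) + ['[SEP]']
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

start_logits, end_logits = model(input_ids)[:2]
start, end = start_logits.argmax().item(), end_logits.argmax().item()
print(' '.join(tokens[start:end + 1]))  # crude decoding of the predicted answer span
```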

## How to train DistilBERT

In the following, we will explain how you can train your own compressed model.

### A. Preparing the data

The weights we release are trained on a concatenation of the Toronto Book Corpus and English Wikipedia (the same training data as the English version of BERT).

To avoid processing the data several times, we do it once and for all before training. From now on, we will assume that you have a text file `dump.txt` containing one sequence per line (a sequence being composed of one or several coherent sentences).

First, we will binarize the data, i.e. tokenize the data and convert each token to an index in our model's vocabulary.

```bash
python scripts/binarized_data.py \
    --file_path data/dump.txt \
    --bert_tokenizer bert-base-uncased \
    --dump_file data/binarized_text
```
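
Conceptually, this step does little more than the following sketch; the actual script also handles special tokens, very long sequences and the exact serialization format, so treat the snippet as an illustration rather than a drop-in replacement.

```python
import pickle
from pytorch_transformers import BertTokenizer  # `transformers` in more recent versions

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# dump.txt contains one sequence per line; each sequence becomes a list of vocabulary indices.
with open('data/dump.txt', 'r', encoding='utf8') as f:
    sequences = [line.strip() for line in f if line.strip()]
token_ids = [tokenizer.convert_tokens_to_ids(tokenizer.tokenize(seq)) for seq in sequences]

with open('data/binarized_text.bert-base-uncased.pickle', 'wb') as f:
    pickle.dump(token_ids, f)
```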

Our implementation of the masked language modeling loss follows [XLM](https://github.com/facebookresearch/XLM)'s and smooths the probability of masking with a factor that puts more emphasis on rare words. We therefore count the occurrences of each token in the data:

```bash
python scripts/token_counts.py \
    --data_file data/binarized_text.bert-base-uncased.pickle \
    --token_counts_dump data/token_counts.bert-base-uncased.pickle
```
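
These counts are then turned into per-position masking probabilities: frequent tokens are down-weighted so that rare words get masked relatively more often. Below is a minimal sketch of the idea, with an illustrative smoothing exponent (see the training code for the exact value):

```python
import numpy as np

# Occurrences of each vocabulary token, as computed by scripts/token_counts.py
# (a toy array here; the real one has one entry per vocabulary index).
counts = np.array([120000, 4500, 80, 3, 1], dtype=np.float64)

alpha = 0.7  # illustrative smoothing exponent
weights = np.maximum(counts, 1) ** -alpha  # rare tokens get larger weights

def position_sampling_weights(sequence_token_ids):
    # Probability of selecting each position of a sequence for masking, proportional to its
    # token's weight; roughly 15% of positions are then sampled according to these probabilities.
    w = weights[np.array(sequence_token_ids)]
    return w / w.sum()

print(position_sampling_weights([0, 1, 2, 3, 4]))  # the rarest tokens dominate
```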

### B. Training

Training with distillation is really simple once you have pre-processed the data:

```bash
python train.py \
    --dump_path serialization_dir/my_first_training \
    --data_file data/binarized_text.bert-base-uncased.pickle \
    --token_counts data/token_counts.bert-base-uncased.pickle \
    --force # overwrites the `dump_path` if it already exists.
```

By default, this will launch training on a single GPU (even if more are available on the cluster). Other parameters are available on the command line; look in `train.py` or run `python train.py --help` to list them.

We highly encourage you to use distributed training for training DistilBERT, as the training corpus is quite large. Here is an example that runs distributed training on a single node with 4 GPUs:

```bash
export NODE_RANK=0
export N_NODES=1

export N_GPU_NODE=4
export WORLD_SIZE=4
export MASTER_PORT=<AN_OPEN_PORT>
export MASTER_ADDR=<I.P.>

pkill -f 'python -u train.py'

python -m torch.distributed.launch \
    --nproc_per_node=$N_GPU_NODE \
    --nnodes=$N_NODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    train.py \
        --force \
        --n_gpu $WORLD_SIZE \
        --data_file data/binarized_text.bert-base-uncased.pickle \
        --token_counts data/token_counts.bert-base-uncased.pickle \
        --dump_path serialization_dir/my_first_distillation
```

**Tip:** Starting distillation training with a good initialization of the model weights is crucial to reach decent performance. In our experiments, we initialized the student from a few layers of the teacher (BERT) itself! Please refer to `scripts/extract_for_distil.py` to create a valid initialization checkpoint, and use the `--from_pretrained_weights` and `--from_pretrained_config` arguments to load this initialization for the distillation training.
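
Here is a rough sketch of the idea; the actual script also renames parameters to match the student architecture, and the layer selection and output path below are purely illustrative.

```python
import re
import torch
from pytorch_transformers import BertForMaskedLM  # `transformers` in more recent versions

teacher = BertForMaskedLM.from_pretrained('bert-base-uncased')
kept_layers = [0, 2, 4, 7, 9, 11]  # illustrative: keep 6 of the teacher's 12 layers

student_state_dict = {}
for name, param in teacher.state_dict().items():
    match = re.match(r'bert\.encoder\.layer\.(\d+)\.(.+)', name)
    if match is None:
        # Embeddings, LM head, etc. are carried over as-is (modulo the renaming done by the script).
        student_state_dict[name] = param
    elif int(match.group(1)) in kept_layers:
        new_index = kept_layers.index(int(match.group(1)))
        student_state_dict['bert.encoder.layer.{}.{}'.format(new_index, match.group(2))] = param

torch.save(student_state_dict, 'serialization_dir/student_init_weights.pth')
```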

Happy distillation!