Perplexity of fixed-length models
=======================================================================================================================

Perplexity (PPL) is one of the most common metrics for evaluating language models. Before diving in, we should note
that the metric applies specifically to classical language models (sometimes called autoregressive or causal language
models) and is not well defined for masked language models like BERT (see :doc:`summary of the models
<model_summary>`).

Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. If we have a tokenized
sequence :math:`X = (x_0, x_1, \dots, x_t)`, then the perplexity of :math:`X` is

.. math::

    \text{PPL}(X)
    = \exp \left\{ {-\frac{1}{t}\sum_{i=1}^{t} \log p_\theta (x_i|x_{<i}) } \right\}

where :math:`\log p_\theta (x_i|x_{<i})` is the log-likelihood of the ith token conditioned on the preceding tokens
:math:`x_{<i}` according to our model. Intuitively, it can be thought of as an evaluation of the model's ability to
predict uniformly among the set of specified tokens in a corpus. Importantly, this means that the tokenization
procedure has a direct impact on a model's perplexity, which should always be taken into consideration when comparing
different models.
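
As a small worked example of the definition above, the following sketch computes perplexity from a list of per-token
log-likelihoods. The function name ``perplexity`` and the probabilities are made up for illustration; they do not come
from any model or library.

.. code-block:: python

    import math

    def perplexity(token_log_probs):
        """Exponentiated average negative log-likelihood of a sequence."""
        if not token_log_probs:
            raise ValueError("need at least one token log-likelihood")
        avg_nll = -sum(token_log_probs) / len(token_log_probs)
        return math.exp(avg_nll)

    # Hypothetical conditional probabilities p(x_i | x_<i) for four tokens.
    probs = [0.5, 0.25, 0.125, 0.5]
    ppl = perplexity([math.log(p) for p in probs])  # equals 2 ** 1.75

    # A model that assigns probability 1/8 to every token has a perplexity of 8.
    uniform_ppl = perplexity([math.log(1 / 8)] * 10)

The second case illustrates the "uniform" intuition: a perplexity of 8 means the model is, on average, as uncertain as
if it were choosing uniformly among 8 tokens at each step.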

This is also equivalent to the exponentiation of the cross-entropy between the data and model predictions. For more
intuition about perplexity and its relationship to Bits Per Character (BPC) and data compression, check out this
`fantastic blog post on The Gradient <https://thegradient.pub/understanding-evaluation-metrics-for-language-models/>`_.
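
The relationship to cross-entropy can be sketched in a few lines. The helper names below are hypothetical. Note that
the second helper returns bits per *token*; converting to bits per character would additionally require the average
number of characters per token, which depends on the tokenizer and is not computed here.

.. code-block:: python

    import math

    def ppl_from_cross_entropy(cross_entropy, base=math.e):
        """Perplexity is the base raised to the cross-entropy measured in that base."""
        return base ** cross_entropy

    def bits_per_token(ppl):
        """Cross-entropy in bits that corresponds to a given perplexity."""
        return math.log2(ppl)

    # The same model quality expressed in nats and in bits gives the same perplexity.
    ppl_nats = ppl_from_cross_entropy(math.log(8))   # cross-entropy in nats
    ppl_bits = ppl_from_cross_entropy(3.0, base=2)   # cross-entropy in bits
    bits = bits_per_token(8.0)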

Calculating PPL with fixed-length models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If we weren't limited by a model's context size, we would evaluate the model's perplexity by autoregressively
factorizing a sequence and conditioning on the entire preceding subsequence at each step, as shown below.

.. image:: imgs/ppl_full.gif
    :width: 600
    :alt: Full decomposition of a sequence with unlimited context length

When working with approximate models, however, we typically have a constraint on the number of tokens the model can
process. The largest version of :doc:`GPT-2 <model_doc/gpt2>`, for example, has a fixed length of 1024 tokens, so we
cannot calculate :math:`p_\theta(x_t|x_{<t})` directly when :math:`t` is greater than 1024.

Instead, the sequence is typically broken into subsequences equal to the model's maximum input size. If a model's max
input size is :math:`k`, we then approximate the likelihood of a token :math:`x_t` by conditioning only on the
:math:`k-1` tokens that precede it rather than the entire context. When evaluating the model's perplexity of a
sequence, a tempting but suboptimal approach is to break the sequence into disjoint chunks and add up the decomposed
log-likelihoods of each segment independently.

.. image:: imgs/ppl_chunked.gif
    :width: 600
    :alt: Suboptimal PPL not taking advantage of full available context

This is quick to compute since the perplexity of each segment can be computed in one forward pass, but serves as a poor
approximation of the fully-factorized perplexity and will typically yield a higher (worse) PPL because the model will
have less context at most of the prediction steps.
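
To make the loss of context concrete, here is a toy sketch (the function name is hypothetical) that lists how many
preceding tokens the model can condition on at each position when a sequence is cut into disjoint chunks.

.. code-block:: python

    def context_sizes_disjoint(seq_len, max_length):
        """Preceding tokens visible at each position with disjoint chunks of `max_length`."""
        # The context is reset to zero at the start of every chunk.
        return [pos % max_length for pos in range(seq_len)]

    # Toy sizes chosen for readability: 12 tokens, chunks of 4.
    sizes = context_sizes_disjoint(seq_len=12, max_length=4)
    average_context = sum(sizes) / len(sizes)

Here ``sizes`` is ``[0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]``: the first token of every chunk is predicted with no context
at all, and the average context is only 1.5 tokens even though the model could handle up to 3 preceding tokens.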

Instead, the PPL of fixed-length models should be evaluated with a sliding-window strategy. This involves repeatedly
sliding the context window so that the model has more context when making each prediction.

.. image:: imgs/ppl_sliding.gif
    :width: 600
    :alt: Sliding window PPL taking advantage of all available context

This is a closer approximation to the true decomposition of the sequence probability and will typically yield a more
favorable score. The downside is that it requires a separate forward pass for each token in the corpus. A good
practical compromise is to employ a strided sliding window, moving the context by larger strides rather than sliding by
1 token at a time. This allows computation to proceed much faster while still giving the model a large context to make
predictions at each step.
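
The strided scheme can be sketched in plain Python before involving any model. The function below is a hypothetical
helper that only enumerates windows: for each step it returns where the window begins and ends and how many of its
trailing tokens are actually scored (the rest serve as context only).

.. code-block:: python

    def strided_windows(seq_len, max_length, stride):
        """Return (begin, end, trg_len) for each forward pass of a strided sliding window."""
        if not 0 < stride <= max_length:
            raise ValueError("stride must be between 1 and max_length")
        windows = []
        for i in range(0, seq_len, stride):
            begin = max(i + stride - max_length, 0)  # reach back for extra context
            end = min(i + stride, seq_len)           # the last window may be shorter
            trg_len = end - i                        # tokens scored in this window
            windows.append((begin, end, trg_len))
        return windows

    # Toy sizes chosen for readability: 10 tokens, a context of 4, a stride of 2.
    windows = strided_windows(seq_len=10, max_length=4, stride=2)
    num_passes_disjoint = len(strided_windows(seq_len=10, max_length=4, stride=4))
    num_passes_full = len(strided_windows(seq_len=10, max_length=4, stride=1))

With these toy sizes, a stride of 2 needs 5 forward passes, compared with 3 for disjoint chunks and 10 for a
token-by-token sliding window. Every token is scored exactly once, since the ``trg_len`` values sum to the sequence
length.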

Example: Calculating perplexity with GPT-2 in 🤗 Transformers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Let's demonstrate this process with GPT-2.

.. code-block:: python

    from transformers import GPT2LMHeadModel, GPT2TokenizerFast
    device = 'cuda'
    model_id = 'gpt2-large'
    model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
    tokenizer = GPT2TokenizerFast.from_pretrained(model_id)

We'll load in the WikiText-2 dataset and evaluate the perplexity using a few different sliding-window strategies. Since
this dataset is small and we're just doing one forward pass over the set, we can just load and encode the entire
dataset in memory.

.. code-block:: python

    # the `nlp` library has since been renamed to `datasets`
    from datasets import load_dataset
    test = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
    encodings = tokenizer('\n\n'.join(test['text']), return_tensors='pt')

With 馃 Transformers, we can simply pass the ``input_ids`` as the ``labels`` to our model, and the average
log-likelihood for each token is returned as the loss. With our sliding window approach, however, there is overlap in
the tokens we pass to the model at each iteration. We don't want the log-likelihood for the tokens we're just treating
as context to be included in our loss, so we can set these targets to ``-100`` so that they are ignored. The following
is an example of how we could do this with a stride of ``512``. This means that the model will have at least 512 tokens
for context when calculating the conditional likelihood of any one token (provided there are 512 preceding tokens
available to condition on).

.. code-block:: python

    import torch
    from tqdm import tqdm

    max_length = model.config.n_positions
    stride = 512
    seq_len = encodings.input_ids.size(1)

    nlls = []
    for i in tqdm(range(0, seq_len, stride)):
        begin_loc = max(i + stride - max_length, 0)
        end_loc = min(i + stride, seq_len)
        trg_len = end_loc - i    # may be different from stride on last loop
        input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100    # ignore tokens used only as context

        with torch.no_grad():
            outputs = model(input_ids, labels=target_ids)
            # outputs[0] is the loss, i.e. the average negative log-likelihood of
            # the target tokens; scale it back up to a sum for this window
            neg_log_likelihood = outputs[0] * trg_len

        nlls.append(neg_log_likelihood)

    ppl = torch.exp(torch.stack(nlls).sum() / seq_len)
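
The effect of the ``-100`` targets can be illustrated with a simplified, pure-Python stand-in for the loss
computation. This is only a sketch: ``masked_mean_nll`` is a hypothetical helper, the per-token values are invented,
and the shifting of labels that the real model performs internally is deliberately left out.

.. code-block:: python

    IGNORE_INDEX = -100  # targets with this value do not contribute to the loss

    def masked_mean_nll(token_nlls, labels):
        """Average the per-token negative log-likelihoods whose label is not ignored."""
        kept = [nll for nll, label in zip(token_nlls, labels) if label != IGNORE_INDEX]
        return sum(kept) / len(kept)

    # Invented per-token negative log-likelihoods for a window of four tokens.
    token_nlls = [2.0, 1.0, 4.0, 3.0]
    labels = [11, 12, 13, 14]

    # Only the last `trg_len` tokens are targets; the rest are context.
    trg_len = 2
    labels[:-trg_len] = [IGNORE_INDEX] * (len(labels) - trg_len)

    loss = masked_mean_nll(token_nlls, labels)  # averages 4.0 and 3.0 only
    window_nll = loss * trg_len                 # summed NLL for this window

Multiplying the averaged loss by ``trg_len`` recovers the summed negative log-likelihood of the scored tokens, which is
what gets accumulated across windows before the final division and exponentiation.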

Running this with the stride length equal to the max input length is equivalent to the suboptimal, non-sliding-window
strategy we discussed above. The smaller the stride, the more context the model will have in making each prediction,
and the better the reported perplexity will typically be.

When we run the above with ``stride = 1024``, i.e. no overlap, the resulting PPL is ``19.64``, which is about the same
as the ``19.93`` reported in the GPT-2 paper. With ``stride = 512``, i.e. the strided sliding-window strategy, it drops
to ``16.53``. This is not only a more favorable score, but is calculated in a way that is closer to the true
autoregressive decomposition of a sequence likelihood.