Commit 38360512 authored by Igor Ostrovsky

Fix bits_per_byte metric in PerplexityTask

bits_per_byte was calculated as the average per-byte loglikelihood, which would be correct if the loglikelihood were a base-2 logarithm, but it is a natural logarithm. To correct for that, the value should be divided by math.log(2).

It should also hold that 2^bits_per_byte == byte_perplexity, and after the fix it does.
parent df5d7cf0
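
As a quick sanity check of that identity (not part of the commit), here is a minimal sketch for a single document, using made-up numbers:

import math

loglikelihood = -123.4  # hypothetical natural-log likelihood of one document
bytes_ = 100            # hypothetical byte count of that document

byte_perplexity = math.exp(-loglikelihood / bytes_)
bits_per_byte = -loglikelihood / bytes_ / math.log(2)

# Dividing the per-byte nats by math.log(2) converts them to bits, so
# 2^bits_per_byte recovers exactly e^(nats per byte) == byte_perplexity.
assert math.isclose(2 ** bits_per_byte, byte_perplexity)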
@@ -10,7 +10,7 @@ from tqdm import tqdm
 import torch
 import torch.nn.functional as F
-from lm_eval.metrics import mean, weighted_perplexity, weighted_mean
+from lm_eval.metrics import mean, weighted_perplexity, weighted_mean, bits_per_byte
 from lm_eval import utils
 from abc import abstractmethod
@@ -560,14 +560,14 @@ class PerplexityTask(Task, abc.ABC):
         return {
             "word_perplexity": (loglikelihood, words),
             "byte_perplexity": (loglikelihood, bytes_),
-            "bits_per_byte": (-loglikelihood, self.count_bytes(doc))
+            "bits_per_byte": (loglikelihood, bytes_),
         }

     def aggregation(self):
         return {
             "word_perplexity": weighted_perplexity,
             "byte_perplexity": weighted_perplexity,
-            "bits_per_byte": weighted_mean
+            "bits_per_byte": bits_per_byte,
         }

     @classmethod
@@ -102,6 +102,9 @@ def weighted_mean(items):

 def weighted_perplexity(items):
     return math.exp(-weighted_mean(items))

+def bits_per_byte(items):
+    return -weighted_mean(items) / math.log(2)
+
 def bleu(items):
     """The Bilingual Evaluation Understudy Score, or BLEU for short, is a metric
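
Taken together, process_results now emits a (loglikelihood, byte_count) pair for bits_per_byte and aggregation routes it through the new metric. A minimal sketch of the aggregate round trip, assuming weighted_mean divides the summed loglikelihoods by the summed weights (its body is outside this diff), with made-up numbers:

import math

def weighted_mean(items):
    # Assumed definition: total loglikelihood over total byte count.
    a, b = zip(*items)
    return sum(a) / sum(b)

def weighted_perplexity(items):
    return math.exp(-weighted_mean(items))

def bits_per_byte(items):
    return -weighted_mean(items) / math.log(2)

# Made-up (loglikelihood, byte_count) pairs, as process_results would emit.
items = [(-123.4, 100), (-56.7, 40)]

# Before the fix, the aggregate was -weighted_mean(items): nats per byte,
# not bits. With the fix, the commit-message identity holds:
assert math.isclose(2 ** bits_per_byte(items), weighted_perplexity(items))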