Character error rate (CER) is a common metric of the performance of an automatic speech recognition system.
CER is similar to Word Error Rate (WER), but operates on characters instead of words. Please refer to the WER documentation for further information.
Character error rate can be computed as:
CER = (S + D + I) / N = (S + D + I) / (S + D + C)
where
S is the number of substitutions,
D is the number of deletions,
I is the number of insertions,
C is the number of correct characters,
N is the number of characters in the reference (N=S+D+C).
CER's output is not always a number between 0 and 1, in particular when there is a high number of insertions. This value is often associated with the percentage of characters that were incorrectly predicted. The lower the value, the better the
performance of the ASR system, with a CER of 0 being a perfect score.
---
# Metric Card for CER
## Metric description
Character error rate (CER) is a common metric of the performance of an automatic speech recognition (ASR) system. CER is similar to Word Error Rate (WER), but operates on characters instead of words.
Character error rate can be computed as:
`CER = (S + D + I) / N = (S + D + I) / (S + D + C)`
where
`S` is the number of substitutions,
`D` is the number of deletions,
`I` is the number of insertions,
`C` is the number of correct characters,
`N` is the number of characters in the reference (`N=S+D+C`).
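To make the formula concrete, here is a minimal, illustrative sketch (not this metric's actual implementation) that computes CER as a character-level Levenshtein distance divided by the reference length:
```python
# Illustrative sketch only: CER as character-level edit distance / reference length.
def character_error_rate(prediction: str, reference: str) -> float:
    n, m = len(reference), len(prediction)
    # dp[i][j] = minimum number of edits turning reference[:i] into prediction[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i  # i deletions
    for j in range(m + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if reference[i - 1] == prediction[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,                # deletion
                dp[i][j - 1] + 1,                # insertion
                dp[i - 1][j - 1] + substitution  # substitution or match
            )
    return dp[n][m] / n  # (S + D + I) / N

print(character_error_rate("helo wrld", "hello world"))  # 2 edits / 11 reference characters ≈ 0.18
```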
## How to use
The metric takes two inputs: references (a list of references for each speech input) and predictions (a list of transcriptions to score).
This metric outputs a float representing the character error rate.
```python
from evaluate import load

cer = load("cer")
predictions = ["this is the prediction", "there is an other sample"]
references = ["this is the reference", "there is another one"]
cer_score = cer.compute(predictions=predictions, references=references)
print(cer_score)
# 0.34146341463414637
```
The **lower** the CER value, the **better** the performance of the ASR system, with a CER of 0 being a perfect score.
However, CER's output is not always a number between 0 and 1, in particular when there is a high number of insertions (see the worked illustration below).
### Values from popular papers
This metric is highly dependent on the content and quality of the dataset; users can therefore expect very different values for the same model on different datasets.
Multilingual datasets such as [Common Voice](https://huggingface.co/datasets/common_voice) report different CERs depending on the language, ranging from 0.02-0.03 for languages such as French and Italian, to 0.05-0.07 for English (see [here](https://github.com/speechbrain/speechbrain/tree/develop/recipes/CommonVoice/ASR/CTC) for more values).
CER is useful for comparing different models for tasks such as automatic speech recognition (ASR) and optical character recognition (OCR), especially for multilingual datasets where WER is not suitable given the diversity of languages. However, CER provides no details on the nature of the errors, and further work is therefore required to identify the main source(s) of error and to focus any research effort.
Also, in some cases, instead of reporting the raw CER, a normalized CER is reported where the number of mistakes is divided by the sum of the number of edit operations (`I` + `S` + `D`) and `C` (the number of correct characters), which results in CER values that fall within the range of 0–100%.
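As a hypothetical illustration of both points, take the reference `"ab"` and the prediction `"abxyz"`: every reference character is matched (`C = 2`, `S = D = 0`) but three extra characters are inserted (`I = 3`), so:
```
raw CER        = (S + D + I) / N               = (0 + 0 + 3) / 2               = 1.5
normalized CER = (S + D + I) / (S + D + I + C) = (0 + 0 + 3) / (0 + 0 + 3 + 2) = 0.6
```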
## Citation
```bibtex
@inproceedings{morris2004,
author={Morris, Andrew and Maier, Viktoria and Green, Phil},
year={2004},
month={01},
pages={},
title={From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition.}
}
```
## Further References
- [Hugging Face Tasks -- Automatic Speech Recognition](https://huggingface.co/tasks/automatic-speech-recognition)
_CITATION = """\
@inproceedings{morris2004,
author = {Morris, Andrew and Maier, Viktoria and Green, Phil},
year = {2004},
month = {01},
pages = {},
title = {From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition.}
}
"""
_DESCRIPTION="""\
Character error rate (CER) is a common metric of the performance of an automatic speech recognition system.
CER is similar to Word Error Rate (WER), but operates on characters instead of words. Please refer to the WER documentation for further information.
Character error rate can be computed as:
CER = (S + D + I) / N = (S + D + I) / (S + D + C)
where
S is the number of substitutions,
D is the number of deletions,
I is the number of insertions,
C is the number of correct characters,
N is the number of characters in the reference (N=S+D+C).
CER's output is not always a number between 0 and 1, in particular when there is a high number of insertions. This value is often associated with the percentage of characters that were incorrectly predicted. The lower the value, the better the
performance of the ASR system, with a CER of 0 being a perfect score.
"""
_KWARGS_DESCRIPTION="""
Computes CER score of transcribed segments against references.
Args:
references: list of references for each speech input.
predictions: list of transcriptions to score.
concatenate_texts: Whether or not to concatenate sentences before evaluation; set to True for a more accurate result.
Returns:
(float): the character error rate
Examples:
>>> predictions = ["this is the prediction", "there is an other sample"]
>>> references = ["this is the reference", "there is another one"]
>>> cer = evaluate.load("cer")
>>> cer_score = cer.compute(predictions=predictions, references=references)
>>> print(cer_score)
0.34146341463414637
"""
---
# Metric Card for CharCut
### Inputs
- **predictions**: a single prediction or a list of predictions to score. Each prediction should be a string with tokens separated by spaces.
- **references**: a single reference or a list of references, one for each prediction. Each reference should be a string with tokens separated by spaces.
### Output Values
- **`charcut_mt`**: the CharCut evaluation score (lower is better)
### Output Example
```python
{'charcut_mt': 0.1971153846153846}
```
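A minimal usage sketch, assuming the metric is exposed through the Hugging Face `evaluate` library under the id `charcut_mt` (the id is an assumption here, inferred from the output key above):
```python
import evaluate

# Hypothetical example inputs; tokens are separated by spaces as described above.
predictions = ["this is an example prediction"]
references = ["this is an example reference"]

charcut = evaluate.load("charcut_mt")  # assumed metric id
results = charcut.compute(predictions=predictions, references=references)
print(results)  # e.g. {'charcut_mt': <float>}; lower is better
```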
## Citation
```bibtex
@inproceedings{lardilleux-lepage-2017-charcut,
title="{CHARCUT}: Human-Targeted Character-Based {MT} Evaluation with Loose Differences",
author="Lardilleux, Adrien and
Lepage, Yves",
booktitle="Proceedings of the 14th International Conference on Spoken Language Translation",
month=dec#" 14-15",
year="2017",
address="Tokyo, Japan",
publisher="International Workshop on Spoken Language Translation",
url="https://aclanthology.org/2017.iwslt-1.20",
pages="146--153",
abstract="We present CHARCUT, a character-based machine translation evaluation metric derived from a human-targeted segment difference visualisation algorithm. It combines an iterative search for longest common substrings between the candidate and the reference translation with a simple length-based threshold, enabling loose differences that limit noisy character matches. Its main advantage is to produce scores that directly reflect human-readable string differences, making it a useful support tool for the manual analysis of MT output and its display to end users. Experiments on WMT16 metrics task data show that it is on par with the best {``}un-trained{''} metrics in terms of correlation with human judgement, well above BLEU and TER baselines, on both system and segment tasks.",
}
```
## Further References
- Repackaged version that is used in this HF implementation: [https://github.com/BramVanroy/CharCut](https://github.com/BramVanroy/CharCut)
- Original version: [https://github.com/alardill/CharCut](https://github.com/alardill/CharCut)
ChrF and ChrF++ are two MT evaluation metrics. They both use the F-score statistic for character n-gram matches,
and ChrF++ adds word n-grams as well, which correlates more strongly with direct assessment. We use the implementation
that is already present in sacrebleu.
The implementation here is slightly different from sacrebleu in terms of the required input format. The references and
hypotheses lists need to be the same length, so you may need to transpose your references compared to
sacrebleu's required input format. See https://github.com/huggingface/datasets/issues/3154#issuecomment-950746534
See the README.md file at https://github.com/mjpost/sacreBLEU#chrf--chrf for more information.
---
# Metric Card for chrF(++)
## Metric Description
ChrF and ChrF++ are two MT evaluation metrics that use the F-score statistic for character n-gram matches. ChrF++ additionally includes word n-grams, which correlate more strongly with direct assessment. We use the implementation that is already present in sacrebleu.
While this metric is included in sacreBLEU, the implementation here is slightly different from sacreBLEU in terms of the required input format: the references and hypotheses lists need to be the same length, so you may need to transpose your references compared to sacreBLEU's required input format. See https://github.com/huggingface/datasets/issues/3154#issuecomment-950746534
See the [sacreBLEU README.md](https://github.com/mjpost/sacreBLEU#chrf--chrf) for more information.
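As a sketch of what that transposition might look like (hypothetical reference strings; sacreBLEU groups references by reference stream, while this implementation groups them per prediction):
```python
# sacreBLEU-style layout: one sub-list per reference stream
# (all first references, then all second references, ...).
sacrebleu_refs = [
    ["first reference for sentence 1", "first reference for sentence 2"],
    ["second reference for sentence 1", "second reference for sentence 2"],
]

# This implementation expects one sub-list of references per prediction,
# i.e. the transpose of the layout above.
hf_refs = [list(refs) for refs in zip(*sacrebleu_refs)]
# -> [["first reference for sentence 1", "second reference for sentence 1"],
#     ["first reference for sentence 2", "second reference for sentence 2"]]
```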
## How to Use
At minimum, this metric requires a `list` of predictions and a `list` of `list`s of references:
```python
>>> import evaluate
>>> prediction = ["The relationship between cats and dogs is not exactly friendly.", "a good bookshop is just a genteel black hole that knows how to read."]
>>> reference = [["The relationship between dogs and cats is not exactly friendly."], ["A good bookshop is just a genteel Black Hole that knows how to read."]]
>>> chrf = evaluate.load("chrf")
>>> results = chrf.compute(predictions=prediction, references=reference)
>>> # `results` is a dict with the keys 'score', 'char_order', 'word_order' and 'beta'
```
### Inputs
- **`predictions`** (`list` of `str`): The predicted sentences.
- **`references`** (`list` of `list` of `str`): The references. There should be one reference sub-list for each prediction sentence.
- **`char_order`** (`int`): Character n-gram order. Defaults to `6`.
- **`word_order`** (`int`): Word n-gram order. If set to `2`, the metric is referred to as chrF++. Defaults to `0`.
- **`beta`** (`int`): Determines the importance of recall w.r.t. precision. Defaults to `2`.
- **`lowercase`** (`bool`): If `True`, enables case-insensitivity. Defaults to `False`.
- **`whitespace`** (`bool`): If `True`, includes whitespace when extracting character n-grams. Defaults to `False`.
- **`eps_smoothing`** (`bool`): If `True`, applies epsilon smoothing similar to the reference chrF++.py, NLTK, and Moses implementations. If `False`, takes into account effective match order similar to sacreBLEU < 2.0.0. Defaults to `False`.
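As a rough sketch of how several of these options can be combined in one call (illustrative values, using the `evaluate` interface shown above):
```python
import evaluate

chrf = evaluate.load("chrf")
prediction = ["The cat sat on the mat."]
reference = [["The cat sat on the mat.", "A cat was sitting on the mat."]]

results = chrf.compute(
    predictions=prediction,
    references=reference,
    char_order=6,     # character n-gram order (default)
    word_order=2,     # 2 -> chrF++
    beta=2,           # weight of recall relative to precision (default)
    lowercase=True,   # case-insensitive matching
)
print(results["score"])
```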
### Output Values
The output is a dictionary containing the following fields:
- **`'score'`** (`float`): The chrF (chrF++) score.
- **`'char_order'`** (`int`): The character n-gram order.
- **`'word_order'`** (`int`): The word n-gram order. If equal to `2`, the metric is referred to as chrF++.
- **`'beta'`** (`int`): Determines the importance of recall w.r.t. precision.
The chrF(++) score can be any value between `0.0` and `100.0`, inclusive.
#### Values from Popular Papers
### Examples
A simple example of calculating chrF:
```python
>>> prediction = ["The relationship between cats and dogs is not exactly friendly.", "a good bookshop is just a genteel black hole that knows how to read."]
>>> reference = [["The relationship between dogs and cats is not exactly friendly."], ["A good bookshop is just a genteel Black Hole that knows how to read."]]
>>> chrf = evaluate.load("chrf")
>>> results = chrf.compute(predictions=prediction, references=reference)
>>> # results["score"] is a float between 0.0 and 100.0
```
The same example, but with the argument `word_order=2`, to calculate chrF++ instead of chrF:
```python
>>> prediction = ["The relationship between cats and dogs is not exactly friendly.", "a good bookshop is just a genteel black hole that knows how to read."]
>>> reference = [["The relationship between dogs and cats is not exactly friendly."], ["A good bookshop is just a genteel Black Hole that knows how to read."]]
>>> chrf = evaluate.load("chrf")
>>> results = chrf.compute(predictions=prediction, references=reference, word_order=2)
```
The same chrF++ example as above, but with `lowercase=True` to normalize all case:
```python
>>> prediction = ["The relationship between cats and dogs is not exactly friendly.", "a good bookshop is just a genteel black hole that knows how to read."]
>>> reference = [["The relationship between dogs and cats is not exactly friendly."], ["A good bookshop is just a genteel Black Hole that knows how to read."]]
>>> chrf = evaluate.load("chrf")
>>> results = chrf.compute(predictions=prediction, references=reference, word_order=2, lowercase=True)
```
- According to [Popović 2017](https://www.statmt.org/wmt17/pdf/WMT70.pdf), chrF+ (where `word_order=1`) and chrF++ (where `word_order=2`) produce scores that correlate better with human judgements than chrF (where `word_order=0`) does.
## Citation
```bibtex
@inproceedings{popovic-2015-chrf,
title="chr{F}: character n-gram {F}-score for automatic {MT} evaluation",
author="Popovi{\'c}, Maja",
booktitle="Proceedings of the Tenth Workshop on Statistical Machine Translation",
month=sep,
year="2015",
address="Lisbon, Portugal",
publisher="Association for Computational Linguistics",
url="https://aclanthology.org/W15-3049",
doi="10.18653/v1/W15-3049",
pages="392--395",
}
@inproceedings{popovic-2017-chrf,
title="chr{F}++: words helping character n-grams",
author="Popovi{\'c}, Maja",
booktitle="Proceedings of the Second Conference on Machine Translation",
month=sep,
year="2017",
address="Copenhagen, Denmark",
publisher="Association for Computational Linguistics",
url="https://aclanthology.org/W17-4770",
doi="10.18653/v1/W17-4770",
pages="612--618",
}
@inproceedings{post-2018-call,
title="A Call for Clarity in Reporting {BLEU} Scores",
author="Post, Matt",
booktitle="Proceedings of the Third Conference on Machine Translation: Research Papers",
month=oct,
year="2018",
address="Belgium, Brussels",
publisher="Association for Computational Linguistics",
url="https://www.aclweb.org/anthology/W18-6319",
pages="186--191",
}
```
## Further References
- See the [sacreBLEU README.md](https://github.com/mjpost/sacreBLEU#chrf--chrf) for more information on this implementation.
_DESCRIPTION = """\
ChrF and ChrF++ are two MT evaluation metrics. They both use the F-score statistic for character n-gram matches,
and ChrF++ adds word n-grams as well, which correlates more strongly with direct assessment. We use the implementation
that is already present in sacrebleu.
The implementation here is slightly different from sacrebleu in terms of the required input format. The references and
hypotheses lists need to be the same length, so you may need to transpose your references compared to
sacrebleu's required input format. See https://github.com/huggingface/datasets/issues/3154#issuecomment-950746534
See the README.md file at https://github.com/mjpost/sacreBLEU#chrf--chrf for more information.
"""
_KWARGS_DESCRIPTION="""
Produces ChrF(++) scores for hypotheses given reference translations.
Args:
predictions (list of str): The predicted sentences.
references (list of list of str): The references. There should be one reference sub-list for each prediction sentence.
char_order (int): Character n-gram order. Defaults to `6`.
word_order (int): Word n-gram order. If equals to `2`, the metric is referred to as chrF++. Defaults to `0`.
beta (int): Determines the importance of recall w.r.t. precision. Defaults to `2`.
lowercase (bool): If `True`, enables case-insensitivity. Defaults to `False`.
whitespace (bool): If `True`, includes whitespace when extracting character n-grams. Defaults to `False`.
eps_smoothing (bool): If `True`, applies epsilon smoothing similar
to reference chrF++.py, NLTK and Moses implementations. If `False`,
it takes into account effective match order similar to sacreBLEU < 2.0.0. Defaults to `False`.
Returns:
'score' (float): The chrF (chrF++) score,
'char_order' (int): The character n-gram order,
'word_order' (int): The word n-gram order. If equal to 2, the metric is referred to as chrF++,
'beta' (int): Determines the importance of recall w.r.t. precision.
Examples:
Example 1--a simple example of calculating chrF:
>>> prediction = ["The relationship between cats and dogs is not exactly friendly.", "a good bookshop is just a genteel black hole that knows how to read."]
>>> reference = [["The relationship between dogs and cats is not exactly friendly."], ["A good bookshop is just a genteel Black Hole that knows how to read."]]
>>> chrf = evaluate.load("chrf")
>>> results = chrf.compute(predictions=prediction, references=reference)
Example 2--the same example, but with the argument word_order=2, to calculate chrF++ instead of chrF:
>>> prediction = ["The relationship between cats and dogs is not exactly friendly.", "a good bookshop is just a genteel black hole that knows how to read."]
>>> reference = [["The relationship between dogs and cats is not exactly friendly."], ["A good bookshop is just a genteel Black Hole that knows how to read."]]
>>> chrf = evaluate.load("chrf")
>>> results = chrf.compute(predictions=prediction, references=reference, word_order=2)
Example 3--the same chrF++ example as above, but with `lowercase=True` to normalize all case:
>>> prediction = ["The relationship between cats and dogs is not exactly friendly.", "a good bookshop is just a genteel black hole that knows how to read."]
>>> reference = [["The relationship between dogs and cats is not exactly friendly."], ["A good bookshop is just a genteel Black Hole that knows how to read."]]
>>> chrf = evaluate.load("chrf")
>>> results = chrf.compute(predictions=prediction, references=reference, word_order=2, lowercase=True)