),"Input model must already have a BOS token if using add_start_token=True. Please use a different model, or set add_start_token=False"
max_tokenized_len=max_length-1
else:
max_tokenized_len=max_length
encodings=tokenizer(
data,
add_special_tokens=False,
padding=True,
truncation=Trueifmax_tokenized_lenelseFalse,
max_length=max_tokenized_len,
return_tensors="pt",
return_attention_mask=True,
).to(device)
encoded_texts=encodings["input_ids"]
attn_masks=encodings["attention_mask"]
# check that each input is long enough:
ifadd_start_token:
asserttorch.all(torch.ge(attn_masks.sum(1),1)),"Each input text must be at least one token long."
else:
asserttorch.all(
torch.ge(attn_masks.sum(1),2)
),"When add_start_token=False, each input text must be at least two tokens long. Run with add_start_token=True if inputting strings of only one token, and remove all empty input strings."
Regard aims to measure language polarity towards and social perceptions of a demographic (e.g. gender, race, sexual orientation).
---
# Measurement Card for Regard
## Measurement Description
The `regard` measurement returns the estimated language polarity towards and social perceptions of a demographic (e.g. gender, race, sexual orientation).
It uses a model trained on labelled data from the paper ["The Woman Worked as a Babysitter: On Biases in Language Generation" (EMNLP 2019)](https://arxiv.org/abs/1909.01326).
## How to Use
This measurement requires two lists of strings as input, enabling a comparison of the estimated polarity between the two groups.

```python
>>> regard = evaluate.load("regard", "compare")
>>> group1 = ['xyz are described as mean', 'xyz are thought of as being too ambitious']
>>> group2 = ['xyz are known for making too much noise', 'xyz are described as often violent']
>>> regard.compute(data=group1, references=group2)
```
### Inputs
- **data** (list of `str`): prediction/candidate sentences, e.g. sentences describing a given demographic group.
- **references** (list of `str`) (optional): reference/comparison sentences, e.g. sentences describing a different demographic group to compare against.
- **aggregation** (`str`) (optional): determines the type of aggregation performed (illustrated in the sketch after this list).
  If set to `None`, the difference between the regard scores for the two groups is returned.
  Otherwise:
    - `average`: returns the average regard for each category (negative, positive, neutral, other) for each group
    - `maximum`: returns the maximum regard for each group
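As an illustration of the `aggregation` parameter, here is a sketch reusing `group1` and `group2` from the example above with the `compare` configuration (exact scores depend on the underlying classifier, so outputs are omitted):

```python
>>> regard = evaluate.load("regard", "compare")
>>> # average regard per category (negative, positive, neutral, other) for each group;
>>> # returns a dict with 'average_data_regard' and 'average_references_regard'
>>> results = regard.compute(data=group1, references=group2, aggregation="average")
>>> # maximum regard for each group;
>>> # returns a dict with 'max_data_regard' and 'max_references_regard'
>>> results = regard.compute(data=group1, references=group2, aggregation="maximum")
```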
### Output Values
**With a single input** (`data` only):

`regard`: the regard scores of each string in the input list (if no aggregation is specified)

**With two lists of inputs** (`data` and `references`, i.e. the `compare` config):

By default, this measurement outputs a dictionary (`regard_difference`) containing the difference in regard between the two groups, with one score for each category (negative, positive, neutral, other).
author = {Sheng, Emily and Chang, Kai-Wei and Natarajan, Premkumar and Peng, Nanyun},
title = {The Woman Worked as a Babysitter: On Biases in Language Generation},
publisher = {arXiv},
year = {2019}
}
"""
_DESCRIPTION="""\
Regard aims to measure language polarity towards and social perceptions of a demographic (e.g. gender, race, sexual orientation).
"""
_KWARGS_DESCRIPTION = """
Compute the regard of the input sentences.
Args:
`data` (list of str): prediction/candidate sentences, e.g. sentences describing a given demographic group.
`references` (list of str) (optional): reference/comparison sentences, e.g. sentences describing a different demographic group to compare against.
`aggregation` (str) (optional): determines the type of aggregation performed.
If set to `None`, the difference between the regard scores for the two groups is returned.
Otherwise:
- 'average' : returns the average regard for each category (negative, positive, neutral, other) for each group
- 'maximum': returns the maximum regard for each group
Returns:
With only `data` as input (default config):
`regard` : the regard scores of each string in the input list (if no aggregation is specified)
`average_regard`: the average regard for each category (negative, positive, neutral, other) (if `aggregation` = `average`)
`max_regard`: the maximum regard across all input strings (if `aggregation` = `maximum`)
With `data` and `references` as input (`compare` config):
`regard_difference`: the difference between the regard scores for the two groups (if no aggregation is specified)
`average_data_regard` and `average_references_regard`: the average regard for each category (negative, positive, neutral, other) (if `aggregation` = `average`)
`max_data_regard` and `max_references_regard`: the maximum regard for each group (if `aggregation` = `maximum`)
Examples:
Example 1 (single input):
>>> regard = evaluate.load("regard")
>>> group1 = ['xyz are described as mean', 'xyz are thought of as being too ambitious']
>>> results = regard.compute(data = group1)
>>> for d in results['regard']:
... print({l['label']: round(l['score'],2) for l in d})
The toxicity measurement aims to quantify the toxicity of the input texts using a pretrained hate speech classification model.
---
# Measurement Card for Toxicity
## Measurement description
The toxicity measurement aims to quantify the toxicity of the input texts using a pretrained hate speech classification model.
## How to use
The default model used is [roberta-hate-speech-dynabench-r4](https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target). In this model, ‘hate’ is defined as “abusive speech targeting specific group characteristics, such as ethnic origin, religion, gender, or sexual orientation.” Definitions used by other classifiers may vary.
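For example, a minimal usage sketch with the default model (the example sentences are made up, and the scores depend on the classifier, so outputs are omitted):

```python
>>> import evaluate
>>> toxicity = evaluate.load("toxicity", module_type="measurement")
>>> results = toxicity.compute(predictions=["she went to the library", "he played with his dog"])
>>> results["toxicity"]  # list of toxicity scores, one per sentence in `predictions`
```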
When loading the measurement, you can also specify another model:
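For example (a sketch; `"path/to/alternative-hate-speech-model"` below is a placeholder for whichever Hub checkpoint you want to use):

```python
>>> toxicity = evaluate.load("toxicity", "path/to/alternative-hate-speech-model", module_type="measurement")
```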
The model should be compatible with the AutoModelForSequenceClassification class.
For more information, see [the AutoModelForSequenceClassification documentation](https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoModelForSequenceClassification).
Args:
`predictions` (list of str): prediction/candidate sentences
`toxic_label` (str) (optional): the toxic label that you want to detect, depending on the labels that the model has been trained on.
This can be found by inspecting the model's `id2label` mapping (part of its configuration), e.g.:
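For instance (a sketch; the checkpoint name is a placeholder, and the labels in the comment are only an example of what such a mapping might contain):

```python
>>> from transformers import AutoModelForSequenceClassification
>>> model = AutoModelForSequenceClassification.from_pretrained("path/to/alternative-hate-speech-model")  # placeholder checkpoint
>>> model.config.id2label  # e.g. {0: 'not offensive', 1: 'offensive'}
```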
In this case, the `toxic_label` would be `offensive`.
`aggregation` (optional): determines the type of aggregation performed on the data. If set to `None`, the scores for each prediction are returned.
Otherwise:
- 'maximum': returns the maximum toxicity over all predictions
- 'ratio': the percentage of predictions with toxicity above a certain threshold.
`threshold` (`float`) (optional): the toxicity detection threshold used for calculating the 'ratio' aggregation, described above. The default threshold is 0.5, based on the one established by [RealToxicityPrompts](https://arxiv.org/abs/2009.11462).
## Output values
`toxicity`: a list of toxicity scores, one for each sentence in `predictions` (default behavior)
`max_toxicity`: the maximum toxicity over all scores (if `aggregation` = `maximum`)
`toxicity_ratio`: the percentage of predictions with toxicity at or above the threshold (0.5 by default) (if `aggregation` = `ratio`)
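A sketch of the two aggregation modes (reusing the example sentences from above; scores depend on the classifier, so outputs are omitted):

```python
>>> preds = ["she went to the library", "he played with his dog"]
>>> toxicity = evaluate.load("toxicity", module_type="measurement")
>>> toxicity.compute(predictions=preds, aggregation="maximum")               # {'max_toxicity': ...}
>>> toxicity.compute(predictions=preds, aggregation="ratio", threshold=0.5)  # {'toxicity_ratio': ...}
```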
Returns the total number of words, and the number of unique words in the input data.
---
# Measurement Card for Word Count
## Measurement Description
The `word_count` measurement returns the total number of words and the number of unique words in the input string(s), using scikit-learn's [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
## How to Use
This measurement requires a list of strings as input:
```python
>>>data=["hello world and hello moon"]
>>>wordcount=evaluate.load("word_count")
>>>results=wordcount.compute(data=data)
```
### Inputs
- **data** (list of `str`): the input list of strings for which the word count is calculated.
- **max_vocab** (`int`) (optional): the maximum number of (most frequent) words to keep in the vocabulary; can be specified if the dataset is large (see the sketch after this list).
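For illustration, a sketch of `max_vocab` (the data is made up; with `max_vocab=2`, only the two most frequent words are counted):

```python
>>> data = ["hello world and hello moon"]
>>> wordcount = evaluate.load("word_count")
>>> results = wordcount.compute(data=data, max_vocab=2)
```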
### Output Values
- **total_word_count** (`int`): the total number of words in the input string(s).
- **unique_words** (`int`): the number of unique words in the input string(s).

Output Example(s):

```python
{'total_word_count': 5, 'unique_words': 4}
```
### Examples
Example for a single string
```python
>>> data = ["hello sun and goodbye moon"]
>>> wordcount = evaluate.load("word_count")
>>> results = wordcount.compute(data=data)
>>> print(results)
{'total_word_count': 5, 'unique_words': 5}
```
Example for multiple strings
```python
>>> data = ["hello sun and goodbye moon", "foo bar foo bar"]
>>> wordcount = evaluate.load("word_count")
>>> results = wordcount.compute(data=data)
>>> print(results)
{'total_word_count': 9, 'unique_words': 7}
```
Returns the average length (in terms of the number of words) of the input data.
---
# Measurement Card for Word Length
## Measurement Description
The `word_length` measurement returns the average word count of the input strings, based on tokenization using [NLTK word_tokenize](https://www.nltk.org/api/nltk.tokenize.html).
## How to Use
This measurement requires a list of strings as input:
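For example, a minimal sketch (the input strings are illustrative):

```python
>>> data = ["hello world", "the quick brown fox jumps over the lazy dog"]
>>> wordlength = evaluate.load("word_length", module_type="measurement")
>>> results = wordlength.compute(data=data)
>>> results["average_word_length"]  # average number of words per input string
```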
### Inputs
- **data** (list of `str`): the input list of strings for which the average word length is calculated.
- **tokenizer** (`Callable`) (optional): the approach used for tokenizing `data`. The default tokenizer is [NLTK's `word_tokenize`](https://www.nltk.org/api/nltk.tokenize.html). This can be replaced by any function that takes a string as input and returns a list of tokens as output (see the sketch after this list).
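For instance, a sketch with a custom tokenizer (plain `str.split`, which simply splits on whitespace, reusing the `wordlength` object loaded above):

```python
>>> results = wordlength.compute(data=data, tokenizer=str.split)
```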
### Output Values
- **average_word_length** (`float`): the average number of words in the input string(s).
Output Example(s):
```python
{"average_word_length":245}
```
This measurement outputs a dictionary containing the average number of words in the input string(s) (`average_word_length`).