"""Performs pooling (max or mean) on the token embeddings.
"""
Performs pooling (max or mean) on the token embeddings.
Using pooling, it generates from a variable sized sentence a fixed sized sentence embedding. This layer also allows
to use the CLS token if it is returned by the underlying word embedding model. You can concatenate multiple poolings
together.
Args:
    word_embedding_dimension: Dimensions for the word embeddings
    pooling_mode: Either "cls", "lasttoken", "max", "mean", "mean_sqrt_len_tokens", or "weightedmean".
        If set, overwrites the other pooling_mode_* settings
    pooling_mode_cls_token: Use the first token (CLS token) as text representations
    pooling_mode_max_tokens: Use max in each dimension over all tokens.
    pooling_mode_mean_tokens: Perform mean-pooling
    pooling_mode_mean_sqrt_len_tokens: Perform mean-pooling, but divide by sqrt(input_length).
    pooling_mode_weightedmean_tokens: Perform (position) weighted mean pooling. See `SGPT: GPT Sentence
        Embeddings for Semantic Search <https://arxiv.org/abs/2202.08904>`_.
    pooling_mode_lasttoken: Perform last token pooling. See `SGPT: GPT Sentence Embeddings for Semantic
        Search <https://arxiv.org/abs/2202.08904>`_ and `Text and Code Embeddings by Contrastive
        Pre-Training <https://arxiv.org/abs/2201.10005>`_.
"""
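As a rough illustration (not the module's actual implementation), a minimal sketch of masked mean pooling, assuming the usual ``token_embeddings`` / ``attention_mask`` feature tensors::

    import torch

    def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, dim); attention_mask: (batch, seq_len)
        mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)
        summed = (token_embeddings * mask).sum(dim=1)      # sum over non-padding tokens only
        counts = mask.sum(dim=1).clamp(min=1e-9)           # avoid division by zero
        return summed / counts                             # one fixed-size vector per sentence: (batch, dim)

In practice the behaviour is selected via ``pooling_mode`` (e.g. ``models.Pooling(768, pooling_mode="mean")``) rather than reimplemented.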
"""Initializes the WordWeights class.

Args:
    vocab (List[str]): Vocabulary of the tokenizer.
    word_weights (Dict[str, float]): Mapping of tokens to a float weight value. Word embeddings are multiplied
        by this float value. The tokens in word_weights do not have to match the vocab exactly (it can contain
        more or fewer values).
    unknown_word_weight (float, optional): Weight for words in the vocab that do not appear in the word_weights
        lookup. These can be, for example, rare words in the vocab for which no weight exists. Defaults to 1.
"""
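For context, a usage sketch (placeholder model name and made-up idf-style weights) showing where WordWeights sits between a transformer and the pooling module::

    from sentence_transformers import SentenceTransformer, models

    word_embedding_model = models.Transformer("bert-base-uncased")  # placeholder model name
    vocab = list(word_embedding_model.tokenizer.get_vocab().keys())

    # Hypothetical idf-style weights; tokens missing from this dict fall back to unknown_word_weight
    word_weights = {"the": 0.1, "movie": 2.5, "great": 3.0}

    word_weights_model = models.WordWeights(vocab=vocab, word_weights=word_weights, unknown_word_weight=1.0)
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode="mean")

    model = SentenceTransformer(modules=[word_embedding_model, word_weights_model, pooling_model])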
class STSDataReader:
    """Reads in the STS dataset. Each line contains two sentences (s1_col_idx, s2_col_idx) and one label (score_col_idx).

    The default values expect a tab-separated file with the sentence pair in the first and second columns and the
    score (0...1) in the third column. The default config normalizes scores from 0...5 to 0...1.
    """
...
...
        self.max_score = max_score

    def get_examples(self, filename, max_examples=0):
"""
filename specified which data split to use (train.csv, dev.csv, test.csv).
"""
"""filename specified which data split to use (train.csv, dev.csv, test.csv)."""