"""Performs pooling (max or mean) on the token embeddings.
"""
Performs pooling (max or mean) on the token embeddings.
Using pooling, it generates from a variable sized sentence a fixed sized sentence embedding. This layer also allows
Using pooling, it generates from a variable sized sentence a fixed sized sentence embedding. This layer also allows
to use the CLS token if it is returned by the underlying word embedding model. You can concatenate multiple poolings
to use the CLS token if it is returned by the underlying word embedding model. You can concatenate multiple poolings
together.
together.
Args:
    word_embedding_dimension: Dimensions for the word embeddings
    pooling_mode: Either "cls", "lasttoken", "max", "mean",
        "mean_sqrt_len_tokens", or "weightedmean". If set,
        overwrites the other pooling_mode_* settings
    pooling_mode_cls_token: Use the first token (CLS token) as text
        representations
    pooling_mode_max_tokens: Use max in each dimension over all
        tokens.
    pooling_mode_mean_tokens: Perform mean-pooling
    pooling_mode_mean_sqrt_len_tokens: Perform mean-pooling, but
        divide by sqrt(input_length).
    pooling_mode_weightedmean_tokens: Perform (position) weighted mean pooling. See `SGPT: GPT Sentence
        Embeddings for Semantic Search <https://arxiv.org/abs/2202.08904>`_.
    pooling_mode_lasttoken: Perform last token pooling. See `SGPT: GPT Sentence Embeddings for Semantic
        Search <https://arxiv.org/abs/2202.08904>`_ and `Text and Code Embeddings by Contrastive
        Pre-Training <https://arxiv.org/abs/2201.10005>`_.
"""
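The pooling modes documented above can be sketched in plain Python. This is a toy illustration, not the library's actual PyTorch implementation; the `pool` helper, its signature, and the list-based "embeddings" are made up for this example:

```python
def pool(token_embeddings, attention_mask, mode="mean"):
    """Collapse per-token embeddings into one fixed-size sentence embedding."""
    # Keep only real tokens; padding positions (attention_mask == 0) are excluded.
    rows = [emb for emb, m in zip(token_embeddings, attention_mask) if m]
    dims = range(len(rows[0]))
    if mode == "mean":
        # Average each dimension over the non-padding tokens.
        return [sum(r[d] for r in rows) / len(rows) for d in dims]
    if mode == "max":
        # Per-dimension maximum over the non-padding tokens.
        return [max(r[d] for r in rows) for d in dims]
    if mode == "cls":
        # First token is the CLS token in BERT-style models.
        return list(token_embeddings[0])
    raise ValueError(f"unknown pooling mode: {mode}")

emb = [[1.0, 4.0], [3.0, 2.0], [9.0, 9.0]]  # 3 tokens, embedding dim = 2
mask = [1, 1, 0]                            # last token is padding
print(pool(emb, mask, "mean"))  # [2.0, 3.0]
print(pool(emb, mask, "max"))   # [3.0, 4.0]
```

Whatever the mode, the output dimension stays fixed regardless of how many tokens the sentence has, which is what makes the result usable as a sentence embedding.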
...

Args:
    word_weights (Dict[str, float]): Mapping of tokens to a float weight value. Word embeddings are multiplied
        by this float value. Tokens in word_weights need not match the vocab (it can contain more or fewer
        entries).
    unknown_word_weight (float, optional): Weight for words in vocab that do not appear in the word_weights
        lookup. These can be, for example, rare words in the vocab where no weight exists. Defaults to 1.
"""
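The weighting described above amounts to scaling each token's embedding by a per-token factor, falling back to `unknown_word_weight` for tokens missing from the lookup. A toy sketch, with an illustrative function name and list-based embeddings (not the library's actual code):

```python
def weight_embeddings(tokens, token_embeddings, word_weights, unknown_word_weight=1.0):
    """Scale each token embedding by its per-token weight (e.g. an IDF value)."""
    weighted = []
    for token, emb in zip(tokens, token_embeddings):
        # Tokens absent from the lookup fall back to unknown_word_weight.
        w = word_weights.get(token, unknown_word_weight)
        weighted.append([w * x for x in emb])
    return weighted

weights = {"the": 0.1}  # down-weight a very common word
out = weight_embeddings(["the", "cat"], [[1.0, 2.0], [3.0, 4.0]], weights)
print(out)  # [[0.1, 0.2], [3.0, 4.0]]
```

Because the lookup only changes per-token scaling, it composes naturally with any of the pooling modes applied afterwards.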
"""Reads in the STS dataset. Each line contains two sentences (s1_col_idx, s2_col_idx) and one label (score_col_idx)
Reads in the STS dataset. Each line contains two sentences (s1_col_idx, s2_col_idx) and one label (score_col_idx)
Default values expects a tab separated file with the first & second column the sentence pair and third column the score (0...1). Default config normalizes scores from 0...5 to 0...1
Default values expects a tab separated file with the first & second column the sentence pair and third column the score (0...1). Default config normalizes scores from 0...5 to 0...1
"""
"""
...

@@ -34,9 +34,7 @@ class STSDataReader:
        self.max_score = max_score

    def get_examples(self, filename, max_examples=0):
"""
"""filename specified which data split to use (train.csv, dev.csv, test.csv)."""
filename specified which data split to use (train.csv, dev.csv, test.csv).
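A minimal sketch of what such a reader does with each row, using only the standard library. The delimiter, column indices, and the 0...5 to 0...1 normalization follow the defaults described in the docstring above; the function name and tuple output are illustrative, not the class's real API:

```python
import csv
from io import StringIO

def read_sts_examples(fileobj, s1_col_idx=0, s2_col_idx=1, score_col_idx=2,
                      delimiter="\t", min_score=0.0, max_score=5.0):
    """Yield (sentence1, sentence2, normalized_score) tuples from a delimited file."""
    for row in csv.reader(fileobj, delimiter=delimiter):
        score = float(row[score_col_idx])
        # Normalize the gold score from [min_score, max_score] to [0, 1].
        normalized = (score - min_score) / (max_score - min_score)
        yield row[s1_col_idx], row[s2_col_idx], normalized

data = "A man is eating.\tA person eats.\t4.0\n"
examples = list(read_sts_examples(StringIO(data)))
print(examples[0])  # ('A man is eating.', 'A person eats.', 0.8)
```

Normalizing to the 0...1 range matters because downstream similarity losses typically compare the label directly against a cosine similarity, which lives on a bounded scale.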