Mapping of tokens to a float weight value. Words embeddings are multiplied by this float value. Tokens in word_weights must not be equal to the vocab (can contain more or less values)
:param unknown_word_weight:
Weight for words in vocab, that do not appear in the word_weights lookup. These can be for example rare words in the vocab, where no weight exists.
"""Tokenizes the text with respect to existent phrases in the vocab.
This tokenizers respects phrases that are in the vocab. Phrases are separated with 'ngram_separator', for example,
in Google News word2vec file, ngrams are separated with a _ like New_York. These phrases are detected in text and merged as one special token. (New York is the ... => [New_York, is, the])
Reads in the STS dataset. Each line contains two sentences (s1_col_idx, s2_col_idx) and one label (score_col_idx)
Default values expects a tab separated file with the first & second column the sentence pair and third column the score (0...1). Default config normalizes scores from 0...5 to 0...1
"""
def__init__(
self,
dataset_folder,
s1_col_idx=0,
s2_col_idx=1,
score_col_idx=2,
delimiter="\t",
quoting=csv.QUOTE_NONE,
normalize_scores=True,
min_score=0,
max_score=5,
):
self.dataset_folder=dataset_folder
self.score_col_idx=score_col_idx
self.s1_col_idx=s1_col_idx
self.s2_col_idx=s2_col_idx
self.delimiter=delimiter
self.quoting=quoting
self.normalize_scores=normalize_scores
self.min_score=min_score
self.max_score=max_score
defget_examples(self,filename,max_examples=0):
"""
filename specified which data split to use (train.csv, dev.csv, test.csv).
Given a list of sentences / texts, this function performs paraphrase mining. It compares all sentences against all
other sentences and returns a list with the pairs that have the highest cosine similarity score.
:param model: SentenceTransformer model for embedding computation
:param sentences: A list of strings (texts or sentences)
:param show_progress_bar: Plotting of a progress bar
:param batch_size: Number of texts that are encoded simultaneously by the model
:param query_chunk_size: Search for most similar pairs for #query_chunk_size at the same time. Decrease, to lower memory footprint (increases run-time).
:param corpus_chunk_size: Compare a sentence simultaneously against #corpus_chunk_size other sentences. Decrease, to lower memory footprint (increases run-time).
:param max_pairs: Maximal number of text pairs returned.
:param top_k: For each sentence, we retrieve up to top_k other sentences
:param score_function: Function for computing scores. By default, cosine similarity.
:return: Returns a list of triplets with the format [score, id1, id2]
Given a list of sentences / texts, this function performs paraphrase mining. It compares all sentences against all
other sentences and returns a list with the pairs that have the highest cosine similarity score.
:param embeddings: A tensor with the embeddings
:param query_chunk_size: Search for most similar pairs for #query_chunk_size at the same time. Decrease, to lower memory footprint (increases run-time).
:param corpus_chunk_size: Compare a sentence simultaneously against #corpus_chunk_size other sentences. Decrease, to lower memory footprint (increases run-time).
:param max_pairs: Maximal number of text pairs returned.
:param top_k: For each sentence, we retrieve up to top_k other sentences
:param score_function: Function for computing scores. By default, cosine similarity.
:return: Returns a list of triplets with the format [score, id1, id2]
"""
top_k+=1# A sentence has the highest similarity to itself. Increase +1 as we are interest in distinct pairs
This function performs a cosine similarity search between a list of query embeddings and a list of corpus embeddings.
It can be used for Information Retrieval / Semantic Search for corpora up to about 1 Million entries.
:param query_embeddings: A 2 dimensional tensor with the query embeddings.
:param corpus_embeddings: A 2 dimensional tensor with the corpus embeddings.
:param query_chunk_size: Process 100 queries simultaneously. Increasing that value increases the speed, but requires more memory.
:param corpus_chunk_size: Scans the corpus 100k entries at a time. Increasing that value increases the speed, but requires more memory.
:param top_k: Retrieve top k matching entries.
:param score_function: Function for computing scores. By default, cosine similarity.
:return: Returns a list with one entry for each query. Each entry is a list of dictionaries with the keys 'corpus_id' and 'score', sorted by decreasing cosine similarity scores.