from sentence_transformers import SentenceTransformer, util, models
from PIL import Image
###########
image = Image.open("two_dogs_in_snow.jpg")
from transformers import CLIPProcessor, CLIPModel
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
inputs = processor(text=["a cat", "a dog"], images=[image], return_tensors="pt", padding=True)
output = model(**inputs)
# vision_outputs = model.vision_model(pixel_values=inputs['pixel_values'])
# image_embeds = model.visual_projection(vision_outputs[1])
# print(image_embeds.shape)
# exit()
# Load CLIP model
clip = models.CLIPModel()
model = SentenceTransformer(modules=[clip])
model.save("tmp-clip-model")
model = SentenceTransformer("tmp-clip-model")
# Encode an image:
img_emb = model.encode(Image.open("two_dogs_in_snow.jpg"))
# Encode text descriptions
text_emb = model.encode(["Two dogs in the snow", "A cat on a table", "A picture of London at night"])
# Compute cosine similarities
cos_scores = util.cos_sim(img_emb, text_emb)
print(cos_scores)
# Translated Sentence Mining
Bitext mining describes the process of finding parallel (translated) sentence pairs in monolingual corpora. For example, you have a set of English sentences:
```
This is an example sentence.
Hello World!
My final third sentence in this list.
```
And a set of German sentences:
```
Hallo Welt!
Dies ist ein Beispielsatz.
Dieser Satz taucht im Englischen nicht auf.
```
Here, you want to find all translation pairs between the English and the German set of sentences.
The two correct translation pairs are:
```
Hello World! Hallo Welt!
This is an example sentence. Dies ist ein Beispielsatz.
```
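For the toy example above, a plain nearest-neighbor search over sentence embeddings already finds these pairs. A minimal sketch, assuming the LaBSE model recommended in the next section (the margin-based method below is more robust for large corpora):
```python
from sentence_transformers import SentenceTransformer, util

# Encode both monolingual sets and match each English sentence to its closest German sentence.
model = SentenceTransformer("LaBSE")
english = ["This is an example sentence.", "Hello World!", "My final third sentence in this list."]
german = ["Hallo Welt!", "Dies ist ein Beispielsatz.", "Dieser Satz taucht im Englischen nicht auf."]

en_emb = model.encode(english)
de_emb = model.encode(german)
cos_scores = util.cos_sim(en_emb, de_emb)  # shape: (len(english), len(german))
for i, sentence in enumerate(english):
    j = int(cos_scores[i].argmax())
    print(f"{sentence}  <->  {german[j]}  (cosine score: {float(cos_scores[i][j]):.3f})")
```
Note that this simple matching has no notion of a threshold, so the third English sentence is also paired with some German sentence even though it has no translation; the margin-based scoring below addresses exactly this.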
Usually you apply this method to large corpora, for example, to find all translated sentences between the English Wikipedia and the Chinese Wikipedia.
## Margin Based Mining
We follow the setup from [Artetxe and Schwenk, Section 4.3](https://arxiv.org/pdf/1812.10464.pdf) to find translated sentences in two datasets:
1) First, we encode all sentences to their respective embeddings. As shown in [our paper](https://arxiv.org/abs/2004.09813), [LaBSE](https://tfhub.dev/google/LaBSE/1) is currently the best method for bitext mining. The model is integrated in Sentence-Transformers.
2) Once we have all embeddings, we find the *k* nearest neighbor sentences for all sentences in both directions. Typical choices for k are between 4 and 16.
3) Then, we score all possible sentence combinations using the formula mentioned in Section 4.3 (see the sketch after this list).
4) The pairs with the highest scores are most likely translated sentences. Note that the score can be larger than 1, so you usually have to choose a cut-off threshold and ignore pairs below it. For high-quality pairs, a threshold of about 1.2 - 1.3 works quite well.
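The ratio margin used in step 3 divides the cosine similarity of a candidate pair by the average similarity of each sentence to its *k* nearest neighbors. A minimal sketch, mirroring the `score` function in `bitext_mining_utils.py` further below (embeddings are assumed to be L2-normalized):
```python
import numpy as np

def margin_score(x_emb, y_emb, fwd_mean, bwd_mean):
    # x_emb, y_emb: normalized embeddings of a candidate sentence pair
    # fwd_mean: mean cosine similarity of x to its k nearest neighbors in the target set
    # bwd_mean: mean cosine similarity of y to its k nearest neighbors in the source set
    cos_sim = float(np.dot(x_emb, y_emb))
    return cos_sim / ((fwd_mean + bwd_mean) / 2)  # ratio margin; can be larger than 1
```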
## Examples
- **[bucc2018.py](bucc2018.py)** - This script contains an example for the [BUCC 2018 shared task](https://comparable.limsi.fr/bucc2018/bucc2018-task.html) on finding parallel sentences. This dataset can be used to evaluate different strategies, as we know which sentences are parallel in the two corpora. The script mines for parallel sentences and then prints the optimal threshold that leads to the highest F1-score.
- **[bitext_mining.py](bitext_mining.py)** - This file reads in two text files (with a single sentence in each line) and outputs parallel sentences to *parallel-sentences-out.tsv.gz*.
- **[In-domain Data Selection for MT](https://www.clinjournal.org/clinj/article/view/137)** - This paper also employed S-BERT to generate/select in-domain parallel data for machine translation systems – using monolingual texts.
"""
This script shows how to mine parallel (translated) sentences from two lists of monolingual sentences.
As input, you specify two text files that contain one sentence per line. Then, the
LaBSE model is used to find parallel (translated) sentences across these two files.
The result is written to disk.
A large source for monolingual sentences in different languages is:
http://data.statmt.org/cc-100/
This script requires that you have FAISS installed:
https://github.com/facebookresearch/faiss
"""
from sentence_transformers import SentenceTransformer, models
import numpy as np
from bitext_mining_utils import score_candidates, kNN, file_open
import gzip
import tqdm
from sklearn.decomposition import PCA
import torch
# Model we want to use for bitext mining. LaBSE achieves state-of-the-art performance
model_name = "LaBSE"
model = SentenceTransformer(model_name)
# Input files. We interpret every line as sentence.
source_file = "data/so.txt.xz"
target_file = "data/yi.txt.xz"
# Only consider sentences that are between min_sent_len and max_sent_len characters long
min_sent_len = 10
max_sent_len = 200
# We base the scoring on k nearest neighbors for each element
knn_neighbors = 4
# Min score for text pairs. Note, score can be larger than 1
min_threshold = 1
# Do we want to use exact search or approximate nearest neighbor search (ANN)
# Exact search: Slower, but we don't miss any parallel sentences
# ANN: Faster, but the recall will be lower
use_ann_search = True
# Number of clusters for ANN. Each cluster should have at least 10k entries
ann_num_clusters = 32768
# How many clusters to explore during search. Higher number = better recall, slower
ann_num_cluster_probe = 3
# To save memory, we can use PCA to reduce the dimensionality from 768 to for example 128 dimensions
# The encoded embeddings will hence require 6 times less memory. However, we observe a small drop in performance.
use_pca = True
pca_dimensions = 128
if use_pca:
# We use a smaller number of training sentences to learn the PCA
train_sent = []
num_train_sent = 20000
with file_open(source_file) as fSource, file_open(target_file) as fTarget:
for line_source, line_target in zip(fSource, fTarget):
if min_sent_len <= len(line_source.strip()) <= max_sent_len:
sentence = line_source.strip()
train_sent.append(sentence)
if min_sent_len <= len(line_target.strip()) <= max_sent_len:
sentence = line_target.strip()
train_sent.append(sentence)
if len(train_sent) >= num_train_sent:
break
print("Encode training embeddings for PCA")
train_matrix = model.encode(train_sent, show_progress_bar=True, convert_to_numpy=True)
pca = PCA(n_components=pca_dimensions)
pca.fit(train_matrix)
dense = models.Dense(
in_features=model.get_sentence_embedding_dimension(),
out_features=pca_dimensions,
bias=False,
activation_function=torch.nn.Identity(),
)
dense.linear.weight = torch.nn.Parameter(torch.tensor(pca.components_))
model.add_module("dense", dense)
print("Read source file")
source_sentences = set()
with file_open(source_file) as fIn:
for line in tqdm.tqdm(fIn):
line = line.strip()
if len(line) >= min_sent_len and len(line) <= max_sent_len:
source_sentences.add(line)
print("Read target file")
target_sentences = set()
with file_open(target_file) as fIn:
for line in tqdm.tqdm(fIn):
line = line.strip()
if len(line) >= min_sent_len and len(line) <= max_sent_len:
target_sentences.add(line)
print("Source Sentences:", len(source_sentences))
print("Target Sentences:", len(target_sentences))
### Encode source sentences
source_sentences = list(source_sentences)
print("Encode source sentences")
source_embeddings = model.encode(source_sentences, show_progress_bar=True, convert_to_numpy=True)
### Encode target sentences
target_sentences = list(target_sentences)
print("Encode target sentences")
target_embeddings = model.encode(target_sentences, show_progress_bar=True, convert_to_numpy=True)
# Normalize embeddings
x = source_embeddings
x = x / np.linalg.norm(x, axis=1, keepdims=True)
y = target_embeddings
y = y / np.linalg.norm(y, axis=1, keepdims=True)
# Perform kNN in both directions
x2y_sim, x2y_ind = kNN(x, y, knn_neighbors, use_ann_search, ann_num_clusters, ann_num_cluster_probe)
x2y_mean = x2y_sim.mean(axis=1)
y2x_sim, y2x_ind = kNN(y, x, knn_neighbors, use_ann_search, ann_num_clusters, ann_num_cluster_probe)
y2x_mean = y2x_sim.mean(axis=1)
# Compute forward and backward scores
margin = lambda a, b: a / b
fwd_scores = score_candidates(x, y, x2y_ind, x2y_mean, y2x_mean, margin)
bwd_scores = score_candidates(y, x, y2x_ind, y2x_mean, x2y_mean, margin)
fwd_best = x2y_ind[np.arange(x.shape[0]), fwd_scores.argmax(axis=1)]
bwd_best = y2x_ind[np.arange(y.shape[0]), bwd_scores.argmax(axis=1)]
indices = np.stack(
[np.concatenate([np.arange(x.shape[0]), bwd_best]), np.concatenate([fwd_best, np.arange(y.shape[0])])], axis=1
)
scores = np.concatenate([fwd_scores.max(axis=1), bwd_scores.max(axis=1)])
seen_src, seen_trg = set(), set()
# Extract list of parallel sentences
print("Write sentences to disc")
sentences_written = 0
with gzip.open("parallel-sentences-out.tsv.gz", "wt", encoding="utf8") as fOut:
for i in np.argsort(-scores):
src_ind, trg_ind = indices[i]
src_ind = int(src_ind)
trg_ind = int(trg_ind)
if scores[i] < min_threshold:
break
if src_ind not in seen_src and trg_ind not in seen_trg:
seen_src.add(src_ind)
seen_trg.add(trg_ind)
fOut.write(
"{:.4f}\t{}\t{}\n".format(
scores[i],
source_sentences[src_ind].replace("\t", " "),
target_sentences[trg_ind].replace("\t", " "),
)
)
sentences_written += 1
print("Done. {} sentences written".format(sentences_written))
"""
This file contains some utility functions used to find parallel sentences
in two monolingual corpora.
Code in this file has been adapted from the LASER repository:
https://github.com/facebookresearch/LASER
"""
import faiss
import numpy as np
import time
import gzip
import lzma
######## Functions to find and score candidates
def score(x, y, fwd_mean, bwd_mean, margin):
    return margin(x.dot(y), (fwd_mean + bwd_mean) / 2)


def score_candidates(x, y, candidate_inds, fwd_mean, bwd_mean, margin):
    scores = np.zeros(candidate_inds.shape)
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            k = candidate_inds[i, j]
            scores[i, j] = score(x[i], y[k], fwd_mean[i], bwd_mean[k], margin)
    return scores


def kNN(x, y, k, use_ann_search=False, ann_num_clusters=32768, ann_num_cluster_probe=3):
    start_time = time.time()
    if use_ann_search:
        print("Perform approx. kNN search")
        n_cluster = min(ann_num_clusters, int(y.shape[0] / 1000))
        quantizer = faiss.IndexFlatIP(y.shape[1])
        index = faiss.IndexIVFFlat(quantizer, y.shape[1], n_cluster, faiss.METRIC_INNER_PRODUCT)
        index.nprobe = ann_num_cluster_probe
        index.train(y)
        index.add(y)
        sim, ind = index.search(x, k)
    else:
        print("Perform exact search")
        idx = faiss.IndexFlatIP(y.shape[1])
        idx.add(y)
        sim, ind = idx.search(x, k)
    print("Done: {:.2f} sec".format(time.time() - start_time))
    return sim, ind


def file_open(filepath):
    # Allows opening files based on the file extension (.gz, .xz, or plain text)
    if filepath.endswith(".gz"):
        return gzip.open(filepath, "rt", encoding="utf8")
    elif filepath.endswith("xz"):
        return lzma.open(filepath, "rt", encoding="utf8")
    else:
        return open(filepath, "r", encoding="utf8")
"""
This script tests the approach on the BUCC 2018 shared task on finding parallel sentences:
https://comparable.limsi.fr/bucc2018/bucc2018-task.html
You can download the necessary files from there.
We have used it in our paper (https://arxiv.org/pdf/2004.09813.pdf) in Section 4.2 to evaluate different multilingual models.
This script requires that you have FAISS installed:
https://github.com/facebookresearch/faiss
"""
from sentence_transformers import SentenceTransformer, models
from collections import defaultdict
import os
import pickle
from sklearn.decomposition import PCA
import torch
from bitext_mining_utils import score_candidates, kNN
import numpy as np
# Model we want to use for bitext mining. LaBSE achieves state-of-the-art performance
model_name = "LaBSE"
model = SentenceTransformer(model_name)
# Input files for BUCC2018 shared task
source_file = "bucc2018/de-en/de-en.training.de"
target_file = "bucc2018/de-en/de-en.training.en"
labels_file = "bucc2018/de-en/de-en.training.gold"
# We base the scoring on k nearest neighbors for each element
knn_neighbors = 4
# Min score for text pairs. Note, score can be larger than 1
min_threshold = 1
# Do we want to use exact search or approximate nearest neighbor search (ANN)
# Exact search: Slower, but we don't miss any parallel sentences
# ANN: Faster, but the recall will be lower
use_ann_search = True
# Number of clusters for ANN. Optimal number depends on dataset size
ann_num_clusters = 32768
# How many clusters to explore during search. Higher number = better recall, slower
ann_num_cluster_probe = 5
# To save memory, we can use PCA to reduce the dimensionality from 768 to for example 128 dimensions
# The encoded embeddings will hence require 6 times less memory. However, we observe a small drop in performance.
use_pca = False
pca_dimensions = 128
# We store the embeddings on disc, so that they can later be loaded from disc
source_embedding_file = "{}_{}_{}.emb".format(
model_name, os.path.basename(source_file), pca_dimensions if use_pca else model.get_sentence_embedding_dimension()
)
target_embedding_file = "{}_{}_{}.emb".format(
model_name, os.path.basename(target_file), pca_dimensions if use_pca else model.get_sentence_embedding_dimension()
)
# Use PCA to reduce the dimensionality of the sentence embedding model
if use_pca:
# We use a smaller number of training sentences to learn the PCA
train_sent = []
num_train_sent = 20000
with open(source_file, encoding="utf8") as fSource, open(target_file, encoding="utf8") as fTarget:
for line_source, line_target in zip(fSource, fTarget):
id, sentence = line_source.strip().split("\t", maxsplit=1)
train_sent.append(sentence)
id, sentence = line_target.strip().split("\t", maxsplit=1)
train_sent.append(sentence)
if len(train_sent) >= num_train_sent:
break
print("Encode training embeddings for PCA")
train_matrix = model.encode(train_sent, show_progress_bar=True, convert_to_numpy=True)
pca = PCA(n_components=pca_dimensions)
pca.fit(train_matrix)
dense = models.Dense(
in_features=model.get_sentence_embedding_dimension(),
out_features=pca_dimensions,
bias=False,
activation_function=torch.nn.Identity(),
)
dense.linear.weight = torch.nn.Parameter(torch.tensor(pca.components_))
model.add_module("dense", dense)
print("Read source file")
source = {}
with open(source_file, encoding="utf8") as fIn:
for line in fIn:
id, sentence = line.strip().split("\t", maxsplit=1)
source[id] = sentence
print("Read target file")
target = {}
with open(target_file, encoding="utf8") as fIn:
for line in fIn:
id, sentence = line.strip().split("\t", maxsplit=1)
target[id] = sentence
labels = defaultdict(lambda: defaultdict(bool))
num_total_parallel = 0
with open(labels_file) as fIn:
for line in fIn:
src_id, trg_id = line.strip().split("\t")
if src_id in source and trg_id in target:
labels[src_id][trg_id] = True
labels[trg_id][src_id] = True
num_total_parallel += 1
print("Source Sentences:", len(source))
print("Target Sentences:", len(target))
print("Num Parallel:", num_total_parallel)
### Encode source sentences
source_ids = list(source.keys())
source_sentences = [source[id] for id in source_ids]
if not os.path.exists(source_embedding_file):
print("Encode source sentences")
source_embeddings = model.encode(source_sentences, show_progress_bar=True, convert_to_numpy=True)
with open(source_embedding_file, "wb") as fOut:
pickle.dump(source_embeddings, fOut)
else:
with open(source_embedding_file, "rb") as fIn:
source_embeddings = pickle.load(fIn)
### Encode target sentences
target_ids = list(target.keys())
target_sentences = [target[id] for id in target_ids]
if not os.path.exists(target_embedding_file):
print("Encode target sentences")
target_embeddings = model.encode(target_sentences, show_progress_bar=True, convert_to_numpy=True)
with open(target_embedding_file, "wb") as fOut:
pickle.dump(target_embeddings, fOut)
else:
with open(target_embedding_file, "rb") as fIn:
target_embeddings = pickle.load(fIn)
##### Now we start to search for parallel (translated) sentences
# Normalize embeddings
x = source_embeddings
y = target_embeddings
print("Shape Source:", x.shape)
print("Shape Target:", y.shape)
x = x / np.linalg.norm(x, axis=1, keepdims=True)
y = y / np.linalg.norm(y, axis=1, keepdims=True)
# Perform kNN in both directions
x2y_sim, x2y_ind = kNN(x, y, knn_neighbors, use_ann_search, ann_num_clusters, ann_num_cluster_probe)
x2y_mean = x2y_sim.mean(axis=1)
y2x_sim, y2x_ind = kNN(y, x, knn_neighbors, use_ann_search, ann_num_clusters, ann_num_cluster_probe)
y2x_mean = y2x_sim.mean(axis=1)
# Compute forward and backward scores
margin = lambda a, b: a / b
fwd_scores = score_candidates(x, y, x2y_ind, x2y_mean, y2x_mean, margin)
bwd_scores = score_candidates(y, x, y2x_ind, y2x_mean, x2y_mean, margin)
fwd_best = x2y_ind[np.arange(x.shape[0]), fwd_scores.argmax(axis=1)]
bwd_best = y2x_ind[np.arange(y.shape[0]), bwd_scores.argmax(axis=1)]
indices = np.stack(
[np.concatenate([np.arange(x.shape[0]), bwd_best]), np.concatenate([fwd_best, np.arange(y.shape[0])])], axis=1
)
scores = np.concatenate([fwd_scores.max(axis=1), bwd_scores.max(axis=1)])
seen_src, seen_trg = set(), set()
# Extract list of parallel sentences
bitext_list = []
for i in np.argsort(-scores):
src_ind, trg_ind = indices[i]
src_ind = int(src_ind)
trg_ind = int(trg_ind)
if scores[i] < min_threshold:
break
if src_ind not in seen_src and trg_ind not in seen_trg:
seen_src.add(src_ind)
seen_trg.add(trg_ind)
bitext_list.append([scores[i], source_ids[src_ind], target_ids[trg_ind]])
# Measure Performance by computing the threshold
# that leads to the best F1 score performance
bitext_list = sorted(bitext_list, key=lambda x: x[0], reverse=True)
n_extract = n_correct = 0
threshold = 0
best_f1 = best_recall = best_precision = 0
average_precision = 0
for idx in range(len(bitext_list)):
score, id1, id2 = bitext_list[idx]
n_extract += 1
if labels[id1][id2] or labels[id2][id1]:
n_correct += 1
precision = n_correct / n_extract
recall = n_correct / num_total_parallel
f1 = 2 * precision * recall / (precision + recall)
average_precision += precision
if f1 > best_f1:
best_f1 = f1
best_precision = precision
best_recall = recall
threshold = (bitext_list[idx][0] + bitext_list[min(idx + 1, len(bitext_list) - 1)][0]) / 2
print("Best Threshold:", threshold)
print("Recall:", best_recall)
print("Precision:", best_precision)
print("F1:", best_f1)
# Paraphrase Mining
Paraphrase mining is the task of finding paraphrases (texts with identical / similar meaning) in a large corpus of sentences. In [Semantic Textual Similarity](../../../docs/usage/semantic_textual_similarity.md) we saw a simplified version of finding paraphrases in a list of sentences. The approach presented there used a brute-force approach to score and rank all pairs.
However, as this has a quadratic runtime, it fails to scale to large collections of sentences (10,000 or more).
For larger collections, *util* offers the *paraphrase_mining* function that can be used like this:
```python
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("all-MiniLM-L6-v2")
# Single list of sentences - Possible tens of thousands of sentences
sentences = [
"The cat sits outside",
"A man is playing guitar",
"I love pasta",
"The new movie is awesome",
"The cat plays in the garden",
"A woman watches TV",
"The new movie is so great",
"Do you like pizza?",
]
paraphrases = util.paraphrase_mining(model, sentences)
for paraphrase in paraphrases[0:10]:
score, i, j = paraphrase
print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], score))
```
The **paraphrase_mining()**-method accepts the following parameters:
```eval_rst
.. autofunction:: sentence_transformers.util.paraphrase_mining
```
Instead of computing all pairwise cosine scores and ranking all possible combinations, the approach is a bit more involved (and hence more efficient). We chunk our corpus into smaller pieces, whose sizes are defined by *query_chunk_size* and *corpus_chunk_size*. For example, if we set *query_chunk_size=1000*, we search paraphrases for 1,000 sentences at a time in the remaining corpus (all other sentences). However, the remaining corpus is also chunked: for example, if we set *corpus_chunk_size=10000*, we look for paraphrases in 10k sentences at a time.
If we pass a list of 20k sentences, we will chunk it into 20x1,000 sentences, and each query chunk is compared first against sentences 0-10k and then against sentences 10k-20k.
This is done to reduce the memory requirement. Increasing both values improves the speed, but also increases the memory requirement.
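For example, the chunk sizes from the description above can be passed explicitly (both parameters exist in `util.paraphrase_mining`; the values are just the illustrative ones used here):
```python
# Compare 1,000 query sentences at a time against 10,000 corpus sentences at a time.
# Larger chunks are faster but require more memory.
paraphrases = util.paraphrase_mining(
    model, sentences, query_chunk_size=1000, corpus_chunk_size=10000
)
```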
The next critical step is finding the pairs with the highest similarities. Instead of getting and sorting all n^2 pairwise scores, we take for each query only the *top_k* scores. So with *top_k=100*, we find at most 100 paraphrases per sentence per chunk. You can play around with *top_k* to ensure a certain behaviour.
So for example, with
```python
paraphrases = util.paraphrase_mining(model, sentences, corpus_chunk_size=len(sentences), top_k=1)
```
You will get for each sentence only its single most similar other sentence. Note that if B is the most similar sentence to A, A is not necessarily the most similar sentence to B. So the returned list can contain entries like (A, B) and (B, C).
The final relevant parameter is *max_pairs*, which determines the maximum number of paraphrase pairs you want returned. If you set it to e.g. *max_pairs=100*, you will not get more than 100 paraphrase pairs. Usually, you get fewer pairs, as the list is cleaned of duplicates, e.g., if it contains (A, B) and (B, A), then only one of them is returned.
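For example, to cap the output at 100 pairs while considering up to 20 candidates per sentence per chunk (both parameters exist in `util.paraphrase_mining`):
```python
# Return at most 100 paraphrase pairs; duplicates such as (A, B) / (B, A) are removed first.
paraphrases = util.paraphrase_mining(model, sentences, top_k=20, max_pairs=100)
```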
# Retrieve & Re-Rank
In [Semantic Search](../semantic-search/README.md) we have shown how to use SentenceTransformer to compute embeddings for queries, sentences, and paragraphs and how to use this for semantic search.
For complex search tasks, for example question answering retrieval, the search can be significantly improved by using **Retrieve & Re-Rank**.
## Retrieve & Re-Rank Pipeline
A pipeline for information retrieval / question answering retrieval that works well is the following. All components are provided and explained in this article:
![InformationRetrieval](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/InformationRetrieval.png)
Given a search query, we first use a **retrieval system** that retrieves a large list of e.g. 100 possible hits which are potentially relevant for the query. For the retrieval, we can use either lexical search, e.g. with Elasticsearch, or we can use dense retrieval with a bi-encoder.
However, the retrieval system might retrieve documents that are not that relevant for the search query. Hence, in a second stage, we use a **re-ranker** based on a **cross-encoder** that scores the relevancy of all candidates for the given search query.
The output will be a ranked list of hits we can present to the user.
## Retrieval: Bi-Encoder
For the retrieval of the candidate set, we can either use lexical search (e.g. [Elasticsearch](https://www.elastic.co/elasticsearch/)), or we can use a bi-encoder which is implemented in this repository.
Lexical search looks for literal matches of the query words in your document collection. It will not recognize synonyms, acronyms or spelling variations. In contrast, semantic search (or dense retrieval) encodes the search query into vector space and retrieves the document embeddings that are close in vector space.
![SemanticSearch](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SemanticSearch.png)
Semantic search overcomes the shortcomings of lexical search and can recognize synonyms and acronyms. Have a look at the [semantic search article](../semantic-search/README.md) for different options to implement semantic search.
## Re-Ranker: Cross-Encoder
The retriever has to be efficient for large document collections with millions of entries. However, it might return irrelevant candidates.
A re-ranker based on a Cross-Encoder can substantially improve the final results for the user. The query and a possible document are passed simultaneously to the transformer network, which then outputs a single score between 0 and 1 indicating how relevant the document is for the given query.
![CrossEncoder](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/CrossEncoder.png)
The advantage of Cross-Encoders is the higher performance, as they perform attention across the query and the document.
Scoring thousands or millions of (query, document)-pairs would be rather slow. Hence, we use the retriever to create a set of e.g. 100 possible candidates which are then re-ranked by the Cross-Encoder.
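A minimal sketch of this re-ranking step, assuming the candidate passages have already been returned by the retriever (the model name is one of the pre-trained MS MARCO Cross-Encoders listed below):
```python
from sentence_transformers import CrossEncoder

query = "What is Python?"
# Hypothetical candidates returned by the retriever for this query
candidates = [
    "Python is a programming language.",
    "My first paragraph. That contains information",
]

# Score every (query, passage) pair and sort the passages by relevance
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = cross_encoder.predict([[query, passage] for passage in candidates])
for score, passage in sorted(zip(scores, candidates), reverse=True):
    print("{:.3f}\t{}".format(score, passage))
```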
## Example Scripts
* **[retrieve_rerank_simple_wikipedia.ipynb](retrieve_rerank_simple_wikipedia.ipynb)** [ [Colab Version](https://colab.research.google.com/github/UKPLab/sentence-transformers/blob/master/examples/applications/retrieve_rerank/retrieve_rerank_simple_wikipedia.ipynb) ]: This script uses the smaller [Simple English Wikipedia](https://simple.wikipedia.org/wiki/Main_Page) as document collection to provide answers to user questions / search queries. First, we split all Wikipedia articles into paragraphs and encode them with a bi-encoder. If a new query / question is entered, it is encoded by the same bi-encoder and the paragraphs with the highest cosine-similarity are retrieved (see [semantic search](../semantic-search/README.md)). Next, the retrieved candidates are scored by a Cross-Encoder re-ranker and the 5 passages with the highest score from the Cross-Encoder are presented to the user.
* **[in_document_search_crossencoder.py](in_document_search_crossencoder.py):** If you only have a small set of paragraphs, the retrieval stage can be skipped. This is, for example, the case if you want to perform search within a single document. In this example, we take the Wikipedia article about Europe and split it into paragraphs. Then, the search query / question and all paragraphs are scored using the Cross-Encoder re-ranker. The most relevant passages for the query are returned.
## Pre-trained Bi-Encoders (Retrieval)
The bi-encoder produces embeddings independently for your paragraphs and for your search queries. You can use it like this:
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("model_name")
docs = [
"My first paragraph. That contains information",
"Python is a programming language.",
]
document_embeddings = model.encode(docs)
query = "What is Python?"
query_embedding = model.encode(query)
```
For more details on how to compare the embeddings, see [semantic search](../semantic-search/README.md).
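As a quick illustration with the embeddings from the snippet above (for large collections, `util.semantic_search` is the more scalable option described in that article):
```python
from sentence_transformers import util

# Cosine similarity between the query and every document embedding
scores = util.cos_sim(query_embedding, document_embeddings)  # shape: (1, len(docs))
best = int(scores.argmax())
print(docs[best], "| score: {:.3f}".format(float(scores[0][best])))
```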
We provide pre-trained models based on:
- **MS MARCO:** 500k real user queries from the Bing search engine. See [MS MARCO models](https://www.sbert.net/docs/pretrained-models/msmarco-v3.html)
## Pre-trained Cross-Encoders (Re-Ranker)
For pre-trained models, see: [MS MARCO Cross-Encoders](https://www.sbert.net/docs/pretrained-models/ce-msmarco.html)
"""
This example shows how in-document search can be used with a CrossEncoder.
The document is split into passages. Here, we use three consecutive sentences as a passage. You can use shorter passages, for example individual sentences,
or longer passages, like full paragraphs.
The CrossEncoder takes the search query and scores how relevant every passage is for the given query. The five passages with the highest score are then returned.
As CrossEncoder, we use cross-encoder/ms-marco-TinyBERT-L-2, a BERT model with only 2 layers trained on the MS MARCO dataset. This is an extremely quick model able to score up to 9000 passages per second (on a V100 GPU). You can also use a larger model, which gives better results but is also slower.
Note: As we score the [query, passage]-pair for every new query, this search method
becomes inefficient at some point if the document gets too large.
Usage: python in_document_search_crossencoder.py
Note: Requires NLTK: `pip install nltk`
"""
from sentence_transformers import CrossEncoder
from nltk import sent_tokenize
import time
# As document, we take the first two section from the Wikipedia article about Europe
document = """Europe is a continent located entirely in the Northern Hemisphere and mostly in the Eastern Hemisphere. It comprises the westernmost part of Eurasia and is bordered by the Arctic Ocean to the north, the Atlantic Ocean to the west, the Mediterranean Sea to the south, and Asia to the east. Europe is commonly considered to be separated from Asia by the watershed of the Ural Mountains, the Ural River, the Caspian Sea, the Greater Caucasus, the Black Sea, and the waterways of the Turkish Straits. Although some of this border is over land, Europe is generally accorded the status of a full continent because of its great physical size and the weight of history and tradition.
Europe covers about 10,180,000 square kilometres (3,930,000 sq mi), or 2% of the Earth's surface (6.8% of land area), making it the second smallest continent. Politically, Europe is divided into about fifty sovereign states, of which Russia is the largest and most populous, spanning 39% of the continent and comprising 15% of its population. Europe had a total population of about 741 million (about 11% of the world population) as of 2018. The European climate is largely affected by warm Atlantic currents that temper winters and summers on much of the continent, even at latitudes along which the climate in Asia and North America is severe. Further from the sea, seasonal differences are more noticeable than close to the coast.
European culture is the root of Western civilization, which traces its lineage back to ancient Greece and ancient Rome. The fall of the Western Roman Empire in 476 AD and the subsequent Migration Period marked the end of Europe's ancient history and the beginning of the Middle Ages. Renaissance humanism, exploration, art and science led to the modern era. Since the Age of Discovery, started by Portugal and Spain, Europe played a predominant role in global affairs. Between the 16th and 20th centuries, European powers colonized at various times the Americas, almost all of Africa and Oceania, and the majority of Asia.
The Age of Enlightenment, the subsequent French Revolution and the Napoleonic Wars shaped the continent culturally, politically and economically from the end of the 17th century until the first half of the 19th century. The Industrial Revolution, which began in Great Britain at the end of the 18th century, gave rise to radical economic, cultural and social change in Western Europe and eventually the wider world. Both world wars took place for the most part in Europe, contributing to a decline in Western European dominance in world affairs by the mid-20th century as the Soviet Union and the United States took prominence. During the Cold War, Europe was divided along the Iron Curtain between NATO in the West and the Warsaw Pact in the East, until the revolutions of 1989 and fall of the Berlin Wall.
In 1949, the Council of Europe was founded with the idea of unifying Europe to achieve common goals. Further European integration by some states led to the formation of the European Union (EU), a separate political entity that lies between a confederation and a federation. The EU originated in Western Europe but has been expanding eastward since the fall of the Soviet Union in 1991. The currency of most countries of the European Union, the euro, is the most commonly used among Europeans; and the EU's Schengen Area abolishes border and immigration controls between most of its member states. There exists a political movement favoring the evolution of the European Union into a single federation encompassing much of the continent.
In classical Greek mythology, Europa (Ancient Greek: Εὐρώπη, Eurṓpē) was a Phoenician princess. One view is that her name derives from the ancient Greek elements εὐρύς (eurús), "wide, broad" and ὤψ (ōps, gen. ὠπός, ōpós) "eye, face, countenance", hence their composite Eurṓpē would mean "wide-gazing" or "broad of aspect". Broad has been an epithet of Earth herself in the reconstructed Proto-Indo-European religion and the poetry devoted to it. An alternative view is that of R.S.P. Beekes who has argued in favor of a Pre-Indo-European origin for the name, explaining that a derivation from ancient Greek eurus would yield a different toponym than Europa. Beekes has located toponyms related to that of Europa in the territory of ancient Greece and localities like that of Europos in ancient Macedonia.
There have been attempts to connect Eurṓpē to a Semitic term for "west", this being either Akkadian erebu meaning "to go down, set" (said of the sun) or Phoenician 'ereb "evening, west", which is at the origin of Arabic Maghreb and Hebrew ma'arav. Michael A. Barry finds the mention of the word Ereb on an Assyrian stele with the meaning of "night, [the country of] sunset", in opposition to Asu "[the country of] sunrise", i.e. Asia. The same naming motive according to "cartographic convention" appears in Greek Ἀνατολή (Anatolḗ "[sun] rise", "east", hence Anatolia). Martin Litchfield West stated that "phonologically, the match between Europa's name and any form of the Semitic word is very poor", while Beekes considers a connection to Semitic languages improbable. Next to these hypotheses there is also a Proto-Indo-European root *h1regʷos, meaning "darkness", which also produced Greek Erebus.
Most major world languages use words derived from Eurṓpē or Europa to refer to the continent. Chinese, for example, uses the word Ōuzhōu (歐洲/欧洲), which is an abbreviation of the transliterated name Ōuluóbā zhōu (歐羅巴洲) (zhōu means "continent"); a similar Chinese-derived term Ōshū (欧州) is also sometimes used in Japanese such as in the Japanese name of the European Union, Ōshū Rengō (欧州連合), despite the katakana Yōroppa (ヨーロッパ) being more commonly used. In some Turkic languages, the originally Persian name Frangistan ("land of the Franks") is used casually in referring to much of Europe, besides official names such as Avrupa or Evropa."""
## We split this article into paragraphs and then every paragraph into sentences
paragraphs = []
for paragraph in document.replace("\r\n", "\n").split("\n\n"):
if len(paragraph.strip()) > 0:
paragraphs.append(sent_tokenize(paragraph.strip()))
# We combine up to 3 sentences into a passage. You can choose smaller or larger values for window_size
# Smaller values: Context from other sentences might get lost
# Larger values: More context from the paragraph remains, but results are longer
window_size = 3
passages = []
for paragraph in paragraphs:
for start_idx in range(0, len(paragraph), window_size):
end_idx = min(start_idx + window_size, len(paragraph))
passages.append(" ".join(paragraph[start_idx:end_idx]))
print("Paragraphs: ", len(paragraphs))
print("Sentences: ", sum([len(p) for p in paragraphs]))
print("Passages: ", len(passages))
## Load our cross-encoder. Use fast tokenizer to speed up the tokenization
model = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-2")
## Some queries we want to search for in the document
queries = [
"How large is Europe?",
"Is Europe a continent?",
"What is the currency in EU?",
"Fall Roman Empire when", # We can also search for key word queries
"Is Europa in the south part of the globe?",
] # Europe is miss-spelled & the matching sentences does not mention any of the content words
# Search in a loop for the individual queries
for query in queries:
start_time = time.time()
# Concatenate the query and all passages and predict the scores for the pairs [query, passage]
model_inputs = [[query, passage] for passage in passages]
scores = model.predict(model_inputs)
# Sort the scores in decreasing order
results = [{"input": inp, "score": score} for inp, score in zip(model_inputs, scores)]
results = sorted(results, key=lambda x: x["score"], reverse=True)
print("Query:", query)
print("Search took {:.2f} seconds".format(time.time() - start_time))
for hit in results[0:5]:
print("Score: {:.2f}".format(hit["score"]), "\t", hit["input"][1])
print("==========")
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "ZyP3dXRfcXLa"
},
"source": [
"# Retrieve & Re-Rank Demo over Simple Wikipedia\n",
"\n",
"This examples demonstrates the Retrieve & Re-Rank Setup and allows to search over [Simple Wikipedia](https://simple.wikipedia.org/wiki/Main_Page).\n",
"\n",
"You can input a query or a question. The script then uses semantic search\n",
"to find relevant passages in Simple English Wikipedia (as it is smaller and fits better in RAM).\n",
"\n",
"For semantic search, we use `SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')` and retrieve\n",
"32 potentially passages that answer the input query.\n",
"\n",
"Next, we use a more powerful CrossEncoder (`cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')`) that\n",
"scores the query and all retrieved passages for their relevancy. The cross-encoder further boost the performance,\n",
"especially when you search over a corpus for which the bi-encoder was not trained for.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "X2R9TjVzNV_E",
"outputId": "97bda76a-a58e-471b-b305-c8022aaf9b84"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already up-to-date: sentence-transformers in /opt/conda/lib/python3.8/site-packages (2.0.0)\n",
"Requirement already up-to-date: rank_bm25 in /opt/conda/lib/python3.8/site-packages (0.2.1)\n",
"Requirement already satisfied, skipping upgrade: torch>=1.6.0 in /opt/conda/lib/python3.8/site-packages (from sentence-transformers) (1.8.1)\n",
"Requirement already satisfied, skipping upgrade: tqdm in /opt/conda/lib/python3.8/site-packages (from sentence-transformers) (4.51.0)\n",
"Requirement already satisfied, skipping upgrade: huggingface-hub in /opt/conda/lib/python3.8/site-packages (from sentence-transformers) (0.0.8)\n",
"Requirement already satisfied, skipping upgrade: torchvision in /opt/conda/lib/python3.8/site-packages (from sentence-transformers) (0.9.1)\n",
"Requirement already satisfied, skipping upgrade: transformers<5.0.0,>=4.6.0 in /opt/conda/lib/python3.8/site-packages (from sentence-transformers) (4.6.1)\n",
"Requirement already satisfied, skipping upgrade: scipy in /opt/conda/lib/python3.8/site-packages (from sentence-transformers) (1.6.3)\n",
"Requirement already satisfied, skipping upgrade: numpy in /opt/conda/lib/python3.8/site-packages (from sentence-transformers) (1.19.2)\n",
"Requirement already satisfied, skipping upgrade: nltk in /opt/conda/lib/python3.8/site-packages (from sentence-transformers) (3.6.2)\n",
"Requirement already satisfied, skipping upgrade: sentencepiece in /opt/conda/lib/python3.8/site-packages (from sentence-transformers) (0.1.95)\n",
"Requirement already satisfied, skipping upgrade: scikit-learn in /opt/conda/lib/python3.8/site-packages (from sentence-transformers) (0.24.2)\n",
"Requirement already satisfied, skipping upgrade: typing_extensions in /opt/conda/lib/python3.8/site-packages (from torch>=1.6.0->sentence-transformers) (3.7.4.3)\n",
"Requirement already satisfied, skipping upgrade: filelock in /opt/conda/lib/python3.8/site-packages (from huggingface-hub->sentence-transformers) (3.0.12)\n",
"Requirement already satisfied, skipping upgrade: requests in /opt/conda/lib/python3.8/site-packages (from huggingface-hub->sentence-transformers) (2.24.0)\n",
"Requirement already satisfied, skipping upgrade: pillow>=4.1.1 in /opt/conda/lib/python3.8/site-packages (from torchvision->sentence-transformers) (8.1.2)\n",
"Requirement already satisfied, skipping upgrade: tokenizers<0.11,>=0.10.1 in /opt/conda/lib/python3.8/site-packages (from transformers<5.0.0,>=4.6.0->sentence-transformers) (0.10.2)\n",
"Requirement already satisfied, skipping upgrade: sacremoses in /opt/conda/lib/python3.8/site-packages (from transformers<5.0.0,>=4.6.0->sentence-transformers) (0.0.45)\n",
"Requirement already satisfied, skipping upgrade: regex!=2019.12.17 in /opt/conda/lib/python3.8/site-packages (from transformers<5.0.0,>=4.6.0->sentence-transformers) (2021.4.4)\n",
"Requirement already satisfied, skipping upgrade: packaging in /opt/conda/lib/python3.8/site-packages (from transformers<5.0.0,>=4.6.0->sentence-transformers) (20.9)\n",
"Requirement already satisfied, skipping upgrade: click in /opt/conda/lib/python3.8/site-packages (from nltk->sentence-transformers) (7.1.2)\n",
"Requirement already satisfied, skipping upgrade: joblib in /opt/conda/lib/python3.8/site-packages (from nltk->sentence-transformers) (1.0.1)\n",
"Requirement already satisfied, skipping upgrade: threadpoolctl>=2.0.0 in /opt/conda/lib/python3.8/site-packages (from scikit-learn->sentence-transformers) (2.1.0)\n",
"Requirement already satisfied, skipping upgrade: idna<3,>=2.5 in /opt/conda/lib/python3.8/site-packages (from requests->huggingface-hub->sentence-transformers) (2.10)\n",
"Requirement already satisfied, skipping upgrade: certifi>=2017.4.17 in /opt/conda/lib/python3.8/site-packages (from requests->huggingface-hub->sentence-transformers) (2020.12.5)\n",
"Requirement already satisfied, skipping upgrade: chardet<4,>=3.0.2 in /opt/conda/lib/python3.8/site-packages (from requests->huggingface-hub->sentence-transformers) (3.0.4)\n",
"Requirement already satisfied, skipping upgrade: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/conda/lib/python3.8/site-packages (from requests->huggingface-hub->sentence-transformers) (1.25.11)\n",
"Requirement already satisfied, skipping upgrade: six in /opt/conda/lib/python3.8/site-packages (from sacremoses->transformers<5.0.0,>=4.6.0->sentence-transformers) (1.15.0)\n",
"Requirement already satisfied, skipping upgrade: pyparsing>=2.0.2 in /opt/conda/lib/python3.8/site-packages (from packaging->transformers<5.0.0,>=4.6.0->sentence-transformers) (2.4.7)\n"
]
}
],
"source": [
"!pip install -U sentence-transformers rank_bm25"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 431,
"referenced_widgets": [
"89533b9769bd4fe5be0068b381a7b554",
"95bcab1f8336422aafa453f8412d9eed",
"7e15806e1ab74d6893f9c5f4ede9b5af",
"f9a01234d31e48cbb2b02f1b34acbbd1",
"89d89e63246f4e1d80b8a2a05db28a19",
"4d2cd27435f44a4f8bdbdd3e17c8c86a",
"27d20db7a7f346ea81e3abe88868cfcd",
"e47e734b2bbc42d1b9a32e1b1dca6451",
"fae4387adeac4391ba73c21643e6034f",
"be0fac5d7a974f47ac5b8a5e6ed9e4d6",
"938510ed19db4fcbadbe730584ca5e58",
"33b91b75207249a7b885c317c95503dd",
"7d0105cc6432438581e71fad681ae2d8",
"16d9436e5caf448a883bbdac3dc930dd",
"614511ff768642299c1b9c19abfb56a6",
"362ab62385884739ba1de446dddf1b87",
"dd4f3ca182c54678887ff51f75877997",
"bbfae35891b7490f8da4a60627fa5a16",
"bf43844c257c4ee99d5f2cf181179c81",
"aec7bd589fb04ecb999ef9a60dd60afa",
"fc984d27599d41c2be86c13dcc66a171",
"a6c3c35f17394bc4a66d460f91cd6833",
"f10d1c0166f941bcba75f9470db86651",
"10c12f39a3b340a98f6db972e9d37921",
"1c198b0a475f42fe836c19813f838100",
"1655064d0c164b778220e9a4da96d2ea",
"4fd3399b50224953806f5a646d1ffd89",
"fe98ad30ffaa4bcfa09c5c01accdd9b8",
"cf475534803f42718e6a87e645585357",
"88a71f6953af4259accf096fda7356b1",
"f0dca993eee34ed8b9c0a9ead6cbbf68",
"85a56dd0bd8549e790759f3f4ad0df7f",
"11c91ed9d2f049c9aec2124c6c926f88",
"0bee2cf221bf4497b527369b3905553e",
"a39d7b521ea540cd8af250584340e427",
"7ea6f545038f4422850e4a2d75345e65",
"5d1b0a2aa5da48f3ad8e428d7b4ead27",
"297a74178198439ab2847d90f2ee56e0",
"3fd97a1c9ef74e80b999aba18dd20075",
"581eef89f62245a0aaae36179e688843",
"e45327599f404ae09e5452978695d4e8",
"258161ce772d470fbc004f7736cea24e",
"877f416c38944e3db52917a6497b3525",
"9c35672b3e224a34b7070099c5067426",
"e120277c4a7141fca20e0115486a0be1",
"33bd233b64664e639dea85c60f0372dd",
"4c308b4ed1644ae790c878e190642c68",
"55fd02cec9a74a92a7c0bf4cf713bb28",
"703fe20c0249435eb847fe29a5a23c52",
"6bceb6fffeb84566a07aeb340d7c920f",
"1390bf4686774d6581348f87f908a400",
"f3ba3908cf1d4000b0da73107cee1526",
"e6f53f85bbbb4e6caf992e913d9cce88",
"64eb0085f9a84129937350004868ad54",
"3b01947b6c50476db6c9bcb21b03491b",
"bf2f3698d10f47fdb65dc82e54a9e566",
"e00bfbac209b4caa92ac3c679381a5a5",
"44d63b5526824e0f81d62dfba320bdc1",
"442d3ee2958b4db999c5e1624986d201",
"cc09077698c2445c8c7a7be99a5eb931",
"ec1353a6b4e8450f84b4302a50e2fba6",
"c119469e196f4ceca7b1368f2a38d18b",
"67346148e616426c83f10f203be5da8d",
"566d714778c040908ce443cc51c926bb"
]
},
"id": "D_hDi8KzNgMM",
"outputId": "326a0b64-f3fd-4d28-b5f6-a18ace1debd9"
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "350c9b9046e845ce81311ac05c38d523",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(HTML(value='Downloading'), FloatProgress(value=0.0, max=737.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "ca5065fb69bc4be8b5888898a8275079",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(HTML(value='Downloading'), FloatProgress(value=0.0, max=9216.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "cab22180e2944146a8793e552f543a46",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(HTML(value='Downloading'), FloatProgress(value=0.0, max=612.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "85623323d08648eda0f84ad002319e29",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(HTML(value='Downloading'), FloatProgress(value=0.0, max=116.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "67ddbc49a5c341ef8f5f8eee5171c686",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(HTML(value='Downloading'), FloatProgress(value=0.0, max=25457.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "2e6ea096bb9e4e058adcf5be79e53139",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(HTML(value='Downloading'), FloatProgress(value=0.0, max=349.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "cff9b5a83d1a435daa021d4fd2489eec",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(HTML(value='Downloading'), FloatProgress(value=0.0, max=90888945.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "164b6918179c4d72ab2e545653356f81",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(HTML(value='Downloading'), FloatProgress(value=0.0, max=53.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "d09710be64b84512971b34ddae26adc5",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(HTML(value='Downloading'), FloatProgress(value=0.0, max=112.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "17819727fb494f158069daa1bf8963cf",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(HTML(value='Downloading'), FloatProgress(value=0.0, max=466247.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "914e5a8fd98d4c7a9722f1b05b24f53b",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(HTML(value='Downloading'), FloatProgress(value=0.0, max=383.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "02c97554783545118ceab68cc66ad47b",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(HTML(value='Downloading'), FloatProgress(value=0.0, max=13846.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "caf41586a73f4735a121e5cbe9b67606",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(HTML(value='Downloading'), FloatProgress(value=0.0, max=231508.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "c1059f1f0ef44c9db9e00d18afb23254",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(HTML(value='Downloading'), FloatProgress(value=0.0, max=190.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Passages: 169597\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "778a221e446740239c9f9ba037b93c6e",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(HTML(value='Batches'), FloatProgress(value=0.0, max=5300.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"import json\n",
"from sentence_transformers import SentenceTransformer, CrossEncoder, util\n",
"import gzip\n",
"import os\n",
"import torch\n",
"\n",
"if not torch.cuda.is_available():\n",
" print(\"Warning: No GPU found. Please add GPU to your notebook\")\n",
"\n",
"\n",
"#We use the Bi-Encoder to encode all passages, so that we can use it with semantic search\n",
"bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')\n",
"bi_encoder.max_seq_length = 256 #Truncate long passages to 256 tokens\n",
"top_k = 32 #Number of passages we want to retrieve with the bi-encoder\n",
"\n",
"#The bi-encoder will retrieve 100 documents. We use a cross-encoder, to re-rank the results list to improve the quality\n",
"cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')\n",
"\n",
"# As dataset, we use Simple English Wikipedia. Compared to the full English wikipedia, it has only\n",
"# about 170k articles. We split these articles into paragraphs and encode them with the bi-encoder\n",
"\n",
"wikipedia_filepath = 'simplewiki-2020-11-01.jsonl.gz'\n",
"\n",
"if not os.path.exists(wikipedia_filepath):\n",
" util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)\n",
"\n",
"passages = []\n",
"with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:\n",
" for line in fIn:\n",
" data = json.loads(line.strip())\n",
"\n",
" #Add all paragraphs\n",
" #passages.extend(data['paragraphs'])\n",
"\n",
" #Only add the first paragraph\n",
" passages.append(data['paragraphs'][0])\n",
"\n",
"print(\"Passages:\", len(passages))\n",
"\n",
"# We encode all passages into our vector space. This takes about 5 minutes (depends on your GPU speed)\n",
"corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True)\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 122,
"referenced_widgets": [
"ab19ef7e9763402589b15b4ccd1b5d6b",
"e5672d2910174d87be7434aa7bbe78c0",
"399bf19c69ed4a1490c6609284bacf27",
"c25ffe4acbac42ca9755248338ed864b",
"79e0e838f33f4c51b24bdfd12c7d0a00",
"e5595b23a62643aab844c875127a0a0a",
"dee30a855454476aaa572af57e6b7405",
"7d4f6eb85c0f4b06bf332c278da7f225"
]
},
"id": "0rueR6ovrs01",
"outputId": "965b70da-90bb-4f6f-fc5a-fe44ba26c28c"
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "2d834f0d787744c88abb8876c338a3b9",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=169597.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"# We also compare the results to lexical search (keyword search). Here, we use \n",
"# the BM25 algorithm which is implemented in the rank_bm25 package.\n",
"\n",
"from rank_bm25 import BM25Okapi\n",
"from sklearn.feature_extraction import _stop_words\n",
"import string\n",
"from tqdm.autonotebook import tqdm\n",
"import numpy as np\n",
"\n",
"\n",
"# We lower case our text and remove stop-words from indexing\n",
"def bm25_tokenizer(text):\n",
" tokenized_doc = []\n",
" for token in text.lower().split():\n",
" token = token.strip(string.punctuation)\n",
"\n",
" if len(token) > 0 and token not in _stop_words.ENGLISH_STOP_WORDS:\n",
" tokenized_doc.append(token)\n",
" return tokenized_doc\n",
"\n",
"\n",
"tokenized_corpus = []\n",
"for passage in tqdm(passages):\n",
" tokenized_corpus.append(bm25_tokenizer(passage))\n",
"\n",
"bm25 = BM25Okapi(tokenized_corpus)\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"id": "UlArb7kqN3Re"
},
"outputs": [],
"source": [
"# This function will search all wikipedia articles for passages that\n",
"# answer the query\n",
"def search(query):\n",
" print(\"Input question:\", query)\n",
"\n",
" ##### BM25 search (lexical search) #####\n",
" bm25_scores = bm25.get_scores(bm25_tokenizer(query))\n",
" top_n = np.argpartition(bm25_scores, -5)[-5:]\n",
" bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]\n",
" bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True)\n",
" \n",
" print(\"Top-3 lexical search (BM25) hits\")\n",
" for hit in bm25_hits[0:3]:\n",
" print(\"\\t{:.3f}\\t{}\".format(hit['score'], passages[hit['corpus_id']].replace(\"\\n\", \" \")))\n",
"\n",
" ##### Semantic Search #####\n",
" # Encode the query using the bi-encoder and find potentially relevant passages\n",
" question_embedding = bi_encoder.encode(query, convert_to_tensor=True)\n",
" question_embedding = question_embedding.cuda()\n",
" hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)\n",
" hits = hits[0] # Get the hits for the first query\n",
"\n",
" ##### Re-Ranking #####\n",
" # Now, score all retrieved passages with the cross_encoder\n",
" cross_inp = [[query, passages[hit['corpus_id']]] for hit in hits]\n",
" cross_scores = cross_encoder.predict(cross_inp)\n",
"\n",
" # Sort results by the cross-encoder scores\n",
" for idx in range(len(cross_scores)):\n",
" hits[idx]['cross-score'] = cross_scores[idx]\n",
"\n",
" # Output of top-5 hits from bi-encoder\n",
" print(\"\\n-------------------------\\n\")\n",
" print(\"Top-3 Bi-Encoder Retrieval hits\")\n",
" hits = sorted(hits, key=lambda x: x['score'], reverse=True)\n",
" for hit in hits[0:3]:\n",
" print(\"\\t{:.3f}\\t{}\".format(hit['score'], passages[hit['corpus_id']].replace(\"\\n\", \" \")))\n",
"\n",
" # Output of top-5 hits from re-ranker\n",
" print(\"\\n-------------------------\\n\")\n",
" print(\"Top-3 Cross-Encoder Re-ranker hits\")\n",
" hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)\n",
" for hit in hits[0:3]:\n",
" print(\"\\t{:.3f}\\t{}\".format(hit['cross-score'], passages[hit['corpus_id']].replace(\"\\n\", \" \")))\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "2J0Zxgw0artg",
"outputId": "91a700f5-bb26-444e-abd3-c41721a1b708"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Input question: What is the capital of the United States?\n",
"Top-3 lexical search (BM25) hits\n",
"\t13.316\tCapital punishment (the death penalty) has existed in the United States since before the United States was a country. As of 2017, capital punishment is legal in 30 of the 50 states. The federal government (including the United States military) also uses capital punishment.\n",
"\t11.434\tOhio is one of the 50 states in the United States. Its capital is Columbus. Columbus also is the largest city in Ohio.\n",
"\t11.179\tNevada is one of the United States' states. Its capital is Carson City. Other big cities are Las Vegas and Reno.\n",
"\n",
"-------------------------\n",
"\n",
"Top-3 Bi-Encoder Retrieval hits\n",
"\t0.622\tCities in the United States:\n",
"\t0.597\tThe United States Capitol is the building where the United States Congress meets. It is the center of the legislative branch of the U.S. federal government. It is in Washington, D.C., on top of Capitol Hill at the east end of the National Mall.\n",
"\t0.596\tIn the United States:\n",
"\n",
"-------------------------\n",
"\n",
"Top-3 Cross-Encoder Re-ranker hits\n",
"\t8.906\tWashington, D.C. (also known as simply Washington or D.C., and officially as the District of Columbia) is the capital of the United States. It is a federal district. The President of the USA and many major national government offices are in the territory. This makes it the political center of the United States of America.\n",
"\t3.755\tA capital city (or capital town or just capital) is a city or town, specified by law or constitution, by the government of a country, or part of a country, such as a state, province or county. It usually serves as the location of the government's central meeting place and offices. Most of the country's leaders and officials work in the capital city.\n",
"\t3.681\tThe United States Capitol is the building where the United States Congress meets. It is the center of the legislative branch of the U.S. federal government. It is in Washington, D.C., on top of Capitol Hill at the east end of the National Mall.\n"
]
}
],
"source": [
"search(query = \"What is the capital of the United States?\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "WawjqQBJa3FP",
"outputId": "95ce1de8-a937-4bcd-d07c-0eccb5f90af1"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Input question: What is the best orchestra in the world?\n",
"Top-3 lexical search (BM25) hits\n",
"\t15.328\tThe BBC Symphony Orchestra is the main orchestra of the British Broadcasting Corporation. It is one of the best orchestras in Britain.\n",
"\t15.320\tThe NHK Symphony Orchestra is a Japanese orchestra based in Tokyo, Japan. In Japanese it is written: NHK交響楽団, pronounced: Enueichikei Kōkyō Gakudan. When the orchestra was started in 1926 it was called \"New Symphony Orchestra\". It was the first large professional orchestra in Japan. Later, it changed its name to \"Japan Symphony Orchestra\". In 1951 it started to get money from the Japanese radio station NHK (Nippon Hōsō Kyōkai), so it changed its name again to the name it has now. It is thought of as the best orchestra in Japan. They have played in many parts of the world, including at the BBC Proms in London.\n",
"\t14.079\tThe Bamberger Symphoniker (Bamberg Symphony Orchestra) is a world-famous orchestra from the city of Bamberg, Germany. It was formed in 1946. Most of the musicians who formed the orchestra were Germans who had been forced to leave Czechoslovakia after the World War II. Most of them had previously been members of the German Philharmonic Orchestra of Prague.\n",
"\n",
"-------------------------\n",
"\n",
"Top-3 Bi-Encoder Retrieval hits\n",
"\t0.701\tThe Vienna Philharmonic (in German: die Wiener Philharmoniker) is an orchestra based in Vienna, Austria. It is thought of as one of the greatest orchestras in the world.\n",
"\t0.641\tThe Vienna Symphony () is an orchestra in Vienna, Austria.\n",
"\t0.640\tThe Berlin Philharmonic (in German: Die Berliner Philharmoniker), is an orchestra from Berlin, Germany. It is one of the greatest orchestras in the world. The conductor of the orchestra is Sir Simon Rattle.\n",
"\n",
"-------------------------\n",
"\n",
"Top-3 Cross-Encoder Re-ranker hits\n",
"\t5.952\tThe London Symphony Orchestra (LSO) is one of the most famous orchestras of the world. They are based in London's Barbican Centre, but they often tour to lots of different countries.\n",
"\t5.794\tThe Vienna Philharmonic (in German: die Wiener Philharmoniker) is an orchestra based in Vienna, Austria. It is thought of as one of the greatest orchestras in the world.\n",
"\t5.324\tThe Berlin Philharmonic (in German: Die Berliner Philharmoniker), is an orchestra from Berlin, Germany. It is one of the greatest orchestras in the world. The conductor of the orchestra is Sir Simon Rattle.\n"
]
}
],
"source": [
"search(query = \"What is the best orchestra in the world?\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Ei9Q1roSa9GE",
"outputId": "7b4323f1-badf-4b11-aed4-a04ee7a7be4c"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Input question: Number countries Europe\n",
"Top-3 lexical search (BM25) hits\n",
"\t13.795\tAmy MacDonald is a Scottish singer and songwriter. She became famous in 2007 with her first album \"This Is The Life\" and her first single \"Poison Prince\". She has become even more successful in Europe since her single \"This Is The Life\" charted at number 1 in many European countries.\n",
"\t13.758\tThe Croatian language is spoken mainly throughout the countries of Croatia and Bosnia and Herzegovina and in the surrounding countries of Europe.\n",
"\t13.019\tOrganization for Security and Co-operation in Europe (OSCE) is an international organization for peace and human rights. Presently, it has 57 countries as its members. Most of the member countries of the OSCE are from Europe, the Caucasus, Central Asia and North America.\n",
"\n",
"-------------------------\n",
"\n",
"Top-3 Bi-Encoder Retrieval hits\n",
"\t0.538\tThe Council of Europe (, ) is an international organization of 47 member states in the European region. One of its first successes was the European Convention on Human Rights in 1950, which serves as the basis for the European Court of Human Rights.\n",
"\t0.531\tEngland is a country in Europe. It is a country with over sixty cities in it. It is in a union with Scotland, Wales and Northern Ireland. All four countries are in the British Isles and are part of the United Kingdom (UK).\n",
"\t0.507\tEurope is a Swedish rock band. The band was started by Joey Tempest and John Norum in 1979. Their song \"The Final Countdown\" was a big hit in 1986.\n",
"\n",
"-------------------------\n",
"\n",
"Top-3 Cross-Encoder Re-ranker hits\n",
"\t5.199\tThe European Union (abbreviation: EU) is a confederation of 27 member countries in Europe established by the Maastricht Treaty in 1992-1993. The EU grew out of the European Economic Community (EEC) which was established by the Treaties of Rome in 1957. It has created a common economic area with Europe-wide laws allowing the citizens of EU countries to move and trade in other EU countries almost the same as they do in their own. Nineteen of these countries also share the same type of money: the euro.\n",
"\t3.202\tA European Union member state is any one of the twenty-seven countries that have joined the European Union (EU) since it was found in 1958 as the European Economic Community (EEC). From an original membership of six states, there have been five successive enlargements. The largest happened on 1 May 2004, when ten member states joined.\n",
"\t2.798\tThe Schengen Area is an area that includes 26 European countries. All of those countries have signed the Schengen Agreement in Schengen, Luxembourg in 1985.\n"
]
}
],
"source": [
"search(query = \"Number countries Europe\")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "YVQyKpIjbSdC",
"outputId": "5ab13487-4259-4516-b913-6e3d14c140ce"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Input question: When did the cold war end?\n",
"Top-3 lexical search (BM25) hits\n",
"\t17.374\tThe Cold War was the tense relationship between the United States (and its allies), and the Soviet Union (the USSR and its allies) between the end of World War II and the fall of the Soviet Union. It is called the \"Cold\" War because the US and the USSR never actually fought each other directly. Instead, they opposed each other in conflicts known as proxy wars, where each country chose a side to support.\n",
"\t17.291\tThe Reagan Doctrine was a document by the United States under the Reagan Administration. It was about being against the global influence of the Soviet Union during the final years of the Cold War. The doctrine lasted for less than a decade, it was the most important document of United States foreign policy from the early 1980s until the end of the Cold War in 1991.\n",
"\t15.420\tCold Norton is a village and civil parish in Maldon District, Essex, England. In 2001 there were 1103 people living in Cold Norton. Cold Norton is at the south-east end of the Danbury Ridge.\n",
"\n",
"-------------------------\n",
"\n",
"Top-3 Bi-Encoder Retrieval hits\n",
"\t0.613\tThe Cold War was the tense relationship between the United States (and its allies), and the Soviet Union (the USSR and its allies) between the end of World War II and the fall of the Soviet Union. It is called the \"Cold\" War because the US and the USSR never actually fought each other directly. Instead, they opposed each other in conflicts known as proxy wars, where each country chose a side to support.\n",
"\t0.557\tThe Continuation War was a war between Finland and the Soviet Union. It was fought between June 25, 1941 and September 19, 1944. It continued the Winter War. Nazi Germany helped Finland as part of the Eastern Front (World War II). The ceasefire started on September 4 at 7.00 am in the Finnish side. The Soviet Union's ceasefire started on September 5. The first peace treaty was signed on September 19 and the final one on February 10, 1947 in Paris.\n",
"\t0.548\tThe Winter War (30 November 1939 - 13 March 1940) was a conflict fought between the Soviet Union and Finland. It began when the Soviet Union tried to invade Finland soon after the Invasion of Poland. The Soviet military forces expected a victory over Finland in a few weeks, because the Soviet army had many more tanks and planes than the Finnish army.\n",
"\n",
"-------------------------\n",
"\n",
"Top-3 Cross-Encoder Re-ranker hits\n",
"\t5.203\tThe Cold War was the tense relationship between the United States (and its allies), and the Soviet Union (the USSR and its allies) between the end of World War II and the fall of the Soviet Union. It is called the \"Cold\" War because the US and the USSR never actually fought each other directly. Instead, they opposed each other in conflicts known as proxy wars, where each country chose a side to support.\n",
"\t2.531\tA speech was made by American President Harry S. Truman to the U.S. Congress on 12 March 1947. In this speech he said he thought that The United States should help Greece and Turkey to stop them being 'Totalitarianists' although he meant Soviet Communism. This became known as the Truman Doctrine. Some Historians believe that this was the start of the Cold War.\n",
"\t2.079\tThe Sino-Soviet split (1960–1989) was a time when the relations between the People's Republic of China and the Soviet Union weakened during the Cold War. Eventually, China's leader, Mao Zedong, decided to break the alliance with the Soviet Union.\n"
]
}
],
"source": [
"search(query = \"When did the cold war end?\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "iVomHSSsbcut",
"outputId": "83c8787f-cf25-4b63-caf1-2b5fea8efc4b"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Input question: How long do cats live?\n",
"Top-3 lexical search (BM25) hits\n",
"\t22.997\tReliable information on the lifespans of house cats is hard to find. However, research has been done to get an estimate (an educated guess) on how long cats usually live. Cats usually live for 13 to 20 years. Sometimes cats can live for 22 to 30 years but there are claims of cats dying at ages of more than 30 years old.\n",
"\t16.974\tThe sabertoothed cats or sabretooth cats are some of the best known and most popular extinct animals. They are among the most impressive carnivores that ever have lived. These cats had long canines and jaws which opened wider than modern cats. This suggests a different style of killing from modern felines.\n",
"\t16.490\tThe Cyprus cat is a breed of cat. These cats are thought to have first come from ancient Egypt or Palestine. They were brought to the island of Cyprus by St. Helen. These are now common domestic cats that live in homes or outside. Many of these cats still live all over Cyprus. But, a large number are now feral. This means they are not tame and they run wild.\n",
"\n",
"-------------------------\n",
"\n",
"Top-3 Bi-Encoder Retrieval hits\n",
"\t0.765\tReliable information on the lifespans of house cats is hard to find. However, research has been done to get an estimate (an educated guess) on how long cats usually live. Cats usually live for 13 to 20 years. Sometimes cats can live for 22 to 30 years but there are claims of cats dying at ages of more than 30 years old.\n",
"\t0.531\tCreme Puff (August 3, 1967 - August 10, 2005) was a female cat who died in 2005 at the age of 38. She was the oldest cat ever recorded, according to the 2010 edition of \"Guinness World Records\".\n",
"\t0.508\tThe cat righting reflex is a cat's natural ability to turn itself around as it falls so it will land on its feet. This righting reflex starts to happen at 3–4 weeks of age. The cat has entirely learned how to do this by 6–7 weeks. Cats are able to do this because they have a flexible backbone and a clavicle that does not move. The minimum height needed for this to happen safely in most cats is about 12 inches.\n",
"\n",
"-------------------------\n",
"\n",
"Top-3 Cross-Encoder Re-ranker hits\n",
"\t10.431\tReliable information on the lifespans of house cats is hard to find. However, research has been done to get an estimate (an educated guess) on how long cats usually live. Cats usually live for 13 to 20 years. Sometimes cats can live for 22 to 30 years but there are claims of cats dying at ages of more than 30 years old.\n",
"\t2.998\tThe sand cat (\"Felis margarita\") is a small wild cat in the Felinae subfamily. It is distributed over African and Asian deserts. Sometimes people call it \"desert cat,\" but that is really the name of a different animal. The sand cat does live in deserts, even the Sahara and Arabian Desert. It is also found in Iran and Pakistan. In zoos, this cat can live for up to 13 years.\n",
"\t2.348\tBobcat (\"Lynx rufus\") are fierce cats that live in forests, swamps, mountains, prairie, and deserts in much of North America. Bobcats are generally nocturnal (most active at night), but are most active at dawn and dusk. They spend the day in their den (a cave, hollow log or rock crevice). They are very good climbers and swimmers. Bobcats are eaten by cougars, coyotes, wolves, and bears. Bobcats usually live from 10 to 14 years. Bobcats and lynxes are closely related.\n"
]
}
],
"source": [
"search(query = \"How long do cats live?\")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "QIkvohZ8bxMu",
"outputId": "5722f299-395b-45e2-9476-0bb28f437318"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Input question: How many people live in Toronto?\n",
"Top-3 lexical search (BM25) hits\n",
"\t15.978\tMarkham, Ontario is a city in Regional Municipality of York, in the Greater Toronto Area of Southern Ontario, Canada. There are twice as many people there as in 1990. 261,573 people live in Markham. It is the 4th largest town in the Greater Toronto Area, after Toronto, Mississauga, and Brampton.\n",
"\t11.299\tThe Toronto Zoo is a zoo in Toronto, Ontario, Canada. With , the Toronto Zoo is the largest zoo in Canada.\n",
"\t10.679\tDenzil Minnan-Wong (; born ) is a Canadian politician. He is a Toronto city councillor. He is the person that represents Ward 16, an area of Toronto. He is a chairperson of the Employee and Labour Relations Committee in Toronto's municipal government and is also the deputy mayor of Toronto. He is also part of the board of the Toronto Transit Commission and the Toronto Hydro.\n",
"\n",
"-------------------------\n",
"\n",
"Top-3 Bi-Encoder Retrieval hits\n",
"\t0.604\tVaughan is a city in Ontario, Canada, 335,000 people live there .\n",
"\t0.595\tToronto is a city in Ohio in the United States.\n",
"\t0.594\tToronto is a city in Kansas, United States.\n",
"\n",
"-------------------------\n",
"\n",
"Top-3 Cross-Encoder Re-ranker hits\n",
"\t4.448\tIt has about 110,000 people living there.\n",
"\t3.632\tThe 2010 census counted 23,302 people living in the city. (2005 showed 24,709). The city was founded on January 1, 1955. Much of the city was heavily damaged by the 2011 Tōhoku earthquake and tsunami.\n",
"\t3.372\tAs of January 1, 2010, about 164,294 people lived there. That was 295.83 people per km². The total area is 555.35 km².\n"
]
}
],
"source": [
"search(query = \"How many people live in Toronto?\")"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "MpWjJtL_iFBG",
"outputId": "08d85c4c-5278-47db-c122-110df3e1cc48"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Input question: Oldest US president\n",
"Top-3 lexical search (BM25) hits\n",
"\t11.010\tGlafcos Ioannou Clerides (; 24 April 1919 – 15 November 2013) was a Greek-Cypriot politician. He was the fourth President of Cyprus. He was the oldest living former President of the Republic of Cyprus.\n",
"\t9.237\tJosé Celso de Mello Filho (Tatuí, November 1, 1945), is a Brazilian jurist. He is the oldest member of the Supreme Federal Court of Brazil. He was nominated by President José Sarney in 1989.\n",
"\t8.872\tUSS \"Constitution\" is a wooden, three-masted heavy frigate of the United States Navy. Named by President George Washington after the Constitution of the United States of America, she is the world's oldest commissioned naval vessel afloat.\n",
"\n",
"-------------------------\n",
"\n",
"Top-3 Bi-Encoder Retrieval hits\n",
"\t0.645\tWilliam Henry Harrison (February 9, 1773 – April 4, 1841) was the 9th President of the United States. His nickname was \"Old Tippecanoe \" and he was a well-respected war veteran. Harrison served the shortest term of any United States President. His term lasted for exactly one month.\n",
"\t0.624\tRichard Arvin Overton (May 11, 1906 – December 27, 2018) was an American supercentenarian. He was the oldest verified surviving American World War II veteran, as well as the oldest American man. In 2013 he was honored by President Barack Obama. On that same Memorial Day, Overton met with Texas Governor Rick Perry. He was born in Bastrop County, Texas.\n",
"\t0.620\tAlben William Barkley (November 24, 1877 – April 30, 1956) was a Democratic member of the U.S. House of Representatives and the United States Senate from Paducah, Kentucky, majority leader of the Senate, and the thirty-fifth Vice President of the United States. He was the oldest Vice President of the United States at the age of . He ran for president in 1952, however the Democratic Party primaries had nominated Adlai Stevenson II instead and labor union leaders rejected him to run because of old age (74).\n",
"\n",
"-------------------------\n",
"\n",
"Top-3 Cross-Encoder Re-ranker hits\n",
"\t6.089\tAlben William Barkley (November 24, 1877 – April 30, 1956) was a Democratic member of the U.S. House of Representatives and the United States Senate from Paducah, Kentucky, majority leader of the Senate, and the thirty-fifth Vice President of the United States. He was the oldest Vice President of the United States at the age of . He ran for president in 1952, however the Democratic Party primaries had nominated Adlai Stevenson II instead and labor union leaders rejected him to run because of old age (74).\n",
"\t5.538\tRichard Arvin Overton (May 11, 1906 – December 27, 2018) was an American supercentenarian. He was the oldest verified surviving American World War II veteran, as well as the oldest American man. In 2013 he was honored by President Barack Obama. On that same Memorial Day, Overton met with Texas Governor Rick Perry. He was born in Bastrop County, Texas.\n",
"\t4.702\tJohn Nance Garner IV nicknamed \"Cactus Jack\" (November 22, 1868 – November 7, 1967) was the forty-fourth Speaker of the United States House of Representatives (1931-33) and the thirty-second Vice President of the United States (1933-41). Garner once described the Vice-Presidency as being \"not worth a bucket of warm spit.\" Also, he lived to be 98 years old. That made him the oldest former Vice President of the United States.\n"
]
}
],
"source": [
"search(query = \"Oldest US president\")"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "zo3NOayXiQME",
"outputId": "f385e4b8-b99a-401a-ae5e-8e6c11d96523"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Input question: Coldest place earth\n",
"Top-3 lexical search (BM25) hits\n",
"\t24.891\tEast Antarctica, also called Greater Antarctica, is the largest part (two-thirds) of the Antarctic continent. It is on the Indian Ocean side of the Transantarctic Mountains. It is the coldest, windiest, and driest part of Earth. East Antarctica holds the record as the coldest place on earth.\n",
"\t12.650\tEarth Day is a day that is supposed to inspire more awareness and appreciation for the Earth's natural environment. It takes place each year on April 22. It now takes place in more than 193 countries around the world. During Earth Day, the world encourages everyone to turn off all unwanted lights.\n",
"\t12.172\tHeinrich events occurred during the coldest point of \"Bond Cycles\" in which many icebergs were discharged into the North Atlantic and melted.\n",
"\n",
"-------------------------\n",
"\n",
"Top-3 Bi-Encoder Retrieval hits\n",
"\t0.633\tEast Antarctica, also called Greater Antarctica, is the largest part (two-thirds) of the Antarctic continent. It is on the Indian Ocean side of the Transantarctic Mountains. It is the coldest, windiest, and driest part of Earth. East Antarctica holds the record as the coldest place on earth.\n",
"\t0.556\tThe North Pole is the point that is farthest north on Earth. It is the point on which axis of Earth turns. It is in the Arctic Ocean and it is cold there because the sun does not shine there for about half a year and never rises very high. The ocean around the pole is always very cold and it is covered by a thick sheet of ice.\n",
"\t0.516\tPlaces with a subarctic climate (also called boreal climate) have long, usually very cold winters, and short, warm summers. It is found on large landmasses, away from oceans, usually at latitudes from 50° to 70°N. Because there are no large landmasses at such latitudes in the Southern Hemisphere, it is only found at high \"altitudes \"(heights) in the Andes and the mountains of Australia and New Zealand's South Island. These climates are in groups \"Dfc\", \"Dwc\", \"Dfd\" and \"Dwd\" in the Köppen climate classification\n",
"\n",
"-------------------------\n",
"\n",
"Top-3 Cross-Encoder Re-ranker hits\n",
"\t6.129\tEast Antarctica, also called Greater Antarctica, is the largest part (two-thirds) of the Antarctic continent. It is on the Indian Ocean side of the Transantarctic Mountains. It is the coldest, windiest, and driest part of Earth. East Antarctica holds the record as the coldest place on earth.\n",
"\t0.594\tThe Arctic is the area around the Earth's North Pole. The Arctic includes parts of Russia, Alaska, Canada, Greenland, Lapland and Svalbard as well as the Arctic Ocean. It is an ocean, mostly covered with ice. Most scientists call the area north of the treeline Arctic. Trees will not grow when the temperatures get too cold. The forests of the continents stop when they get too far north or too high up a mountain. (Higher places are colder, too.) The place where in the trees stop is called the tree line.\n",
"\t0.215\tThe North Pole is the point that is farthest north on Earth. It is the point on which axis of Earth turns. It is in the Arctic Ocean and it is cold there because the sun does not shine there for about half a year and never rises very high. The ocean around the pole is always very cold and it is covered by a thick sheet of ice.\n"
]
}
],
"source": [
"search(query = \"Coldest place earth\")"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "EJ3OqA32ie_s",
"outputId": "e28a27b7-e71b-4bbe-a6d6-583bc7e82e4f"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Input question: Elon Musk year birth\n",
"Top-3 lexical search (BM25) hits\n",
"\t23.364\tTesla, Inc. is a company based in Palo Alto, California which makes electric cars. It was started in 2003 by Martin Eberhard, Dylan Stott, and Elon Musk (who also co-founded PayPal and SpaceX and is the CEO of SpaceX). Eberhard no longer works there. Today, Elon Musk is the Chief Executive Officer (CEO). It started selling its first car, the Roadster in 2008.\n",
"\t19.943\tThe Boring Company is a tunnel boring company founded by Elon Musk, who earlier started SpaceX. It aims to reduce traffic congestion in urban areas. It is involved in the building of the Hyperloop in Los Angeles.\n",
"\t18.392\tElon Reeve Musk (born June 28, 1971) is a businessman and philanthropist. He was born in South Africa. He moved to Canada and later became an American citizen. Musk is the current CEO & Chief Product Architect of Tesla Motors, a company that makes electric vehicles. He is also the CEO of Solar City, a company that makes solar panels, and the CEO & CTO of SpaceX, an aerospace company. In August 2020, Bloomberg ranked Musk third among the richest people on the planet with net worth to be $115.4 billion.\n",
"\n",
"-------------------------\n",
"\n",
"Top-3 Bi-Encoder Retrieval hits\n",
"\t0.574\tElon Reeve Musk (born June 28, 1971) is a businessman and philanthropist. He was born in South Africa. He moved to Canada and later became an American citizen. Musk is the current CEO & Chief Product Architect of Tesla Motors, a company that makes electric vehicles. He is also the CEO of Solar City, a company that makes solar panels, and the CEO & CTO of SpaceX, an aerospace company. In August 2020, Bloomberg ranked Musk third among the richest people on the planet with net worth to be $115.4 billion.\n",
"\t0.471\tMuraoka was born on June 21, 1893, in Kofu, Yamanashi Prefecture. Her birth name was Hana Annaka. Her parents were Methodists. She was raised as a Christian. She studied at the Tokyo Eiwa Jogakuin. She began writing children's stories when she was encouraged by translator Hiroko Katayama. She graduated from school in 1913.\n",
"\t0.467\tJose Alves dos Santos Júnior (born July 29, 1969) is a former Brazilian football player.\n",
"\n",
"-------------------------\n",
"\n",
"Top-3 Cross-Encoder Re-ranker hits\n",
"\t7.449\tElon Reeve Musk (born June 28, 1971) is a businessman and philanthropist. He was born in South Africa. He moved to Canada and later became an American citizen. Musk is the current CEO & Chief Product Architect of Tesla Motors, a company that makes electric vehicles. He is also the CEO of Solar City, a company that makes solar panels, and the CEO & CTO of SpaceX, an aerospace company. In August 2020, Bloomberg ranked Musk third among the richest people on the planet with net worth to be $115.4 billion.\n",
"\t-2.518\tNour El-Sherif (; 28 April 1946 – 11 August 2015) was a Egyptian actor. His birth name is Mohamad Geber Mohamad ِAbd Allah (Arabic: محمد جابر محمد عبد الله). El-Sherif was born in the working-class neighbourhood of Sayeda Zainab in Cairo (). He was known for his conspiracy theories about the Holocaust and the September 11 attacks.\n",
"\t-4.171\tOlly Murs (born 14 May 1984) is an English singer. He comes from Witham, Essex. He became famous after he finished second in The X Factor in 2009.\n"
]
}
],
"source": [
"search(query = \"Elon Musk year birth\")"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "13h8bMHKk4UX",
"outputId": "b464adbc-6f01-41b1-f482-3d5a8330b20d"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Input question: Paris eiffel tower\n",
"Top-3 lexical search (BM25) hits\n",
"\t27.300\tThe Eiffel Tower (French: La Tour Eiffel, ], IPA pronunciation: \"EYE-full\" English; \"eh-FEHL\" French) is a landmark in Paris. It was built between 1887 and 1889 for the Exposition Universelle (World Fair). The Tower was the Exposition's main attraction.\n",
"\t25.263\tParis is a city in the U.S. state of Texas. It is in Lamar County, Texas. It had a population of 25,171 in 2010. It has been called the \"Second Largest Paris in the World\". It has a replica of the Eiffel Tower.\n",
"\t24.059\tParis is a city in the U.S. state of Tennessee. It had a population of 25,171 in 2010. It has been called the \"World's Biggest Fish Fry\". It has a 70-foot replica of the Eiffel Tower.\n",
"\n",
"-------------------------\n",
"\n",
"Top-3 Bi-Encoder Retrieval hits\n",
"\t0.812\tThe Eiffel Tower (French: La Tour Eiffel, ], IPA pronunciation: \"EYE-full\" English; \"eh-FEHL\" French) is a landmark in Paris. It was built between 1887 and 1889 for the Exposition Universelle (World Fair). The Tower was the Exposition's main attraction.\n",
"\t0.626\tAlexandre Gustave Eiffel (December 15, 1832 – December 27, 1923; , ) was a French structural engineer and architect. He is known for designing the Eiffel Tower. He also designed the armature (supporting framework) for the Statue of Liberty, New York Harbor, United States.\n",
"\t0.538\tThe Élysée Palace (, ) is the official residence of the French president. It is in Paris, in the 8th \"arrondissement\", near the Champs-Élysées. The building is under protection of a unit of the famous Republican Guard. It was built between 1718 and 1722.\n",
"\n",
"-------------------------\n",
"\n",
"Top-3 Cross-Encoder Re-ranker hits\n",
"\t10.574\tThe Eiffel Tower (French: La Tour Eiffel, ], IPA pronunciation: \"EYE-full\" English; \"eh-FEHL\" French) is a landmark in Paris. It was built between 1887 and 1889 for the Exposition Universelle (World Fair). The Tower was the Exposition's main attraction.\n",
"\t4.362\tAlexandre Gustave Eiffel (December 15, 1832 – December 27, 1923; , ) was a French structural engineer and architect. He is known for designing the Eiffel Tower. He also designed the armature (supporting framework) for the Statue of Liberty, New York Harbor, United States.\n",
"\t3.756\tThe current tower is the second tower on the site. The original tower was built in 1912. The design was based on the Eiffel Tower. The tower had an cable car that connected it to a nearby amusement park called Luna Park. The original was 64 meters tall. It was the second tallest building in Asia at that time. It quickly became one of the most popular locations in the city. Visitors came from all over.\n"
]
}
],
"source": [
"search(query = \"Paris eiffel tower\")"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "hwTRd5Jqw-09",
"outputId": "d3c0286c-b703-4f06-e601-3446bdd01aab"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Input question: Which US president was killed?\n",
"Top-3 lexical search (BM25) hits\n",
"\t10.179\tLyndon Baines Johnson (August 27, 1908 – January 22, 1973) was a member of the Democratic Party and the 36th president of the United States serving from 1963 to 1969. Johnson took over as president when President Kennedy was killed in November 1963. He was then re-elected in the 1964 election.\n",
"\t10.091\tLech Kaczyński, the fourth President of the Republic of Poland, died on 10 April 2010. He died in a plane crash outside of Smolensk, Russia. The plane was a Tu-154 belonging to the Polish Air Force. The crash killed all 96 on board. His wife, Maria Kaczyńska, was also among those killed.\n",
"\t9.791\tJacobo Majluta Azar (October 9, 1934 – March 2, 1996) was a Dominican politician. He was Vice President of the Dominican Republic during the Antonio Guzmán Fernández presidency between 1978 to 1982. He became President of the Dominican Republic after Guzmán Fernández killed himself in 1982. He was president for a month between July to August 1982.\n",
"\n",
"-------------------------\n",
"\n",
"Top-3 Bi-Encoder Retrieval hits\n",
"\t0.686\tJohn F. Kennedy was the 35th President of the United States. He was assassinated (murdered) in Dealey Plaza, Dallas, Texas, on Friday, November 22, 1963. This happened while he was traveling in a Presidential motorcade with his wife Jacqueline, the Governor of Texas John Connally, and the governor's wife Nellie.\n",
"\t0.656\tWilliam McKinley, the 25th President of the United States, was assassinated on September 6, 1901, inside the Temple of Music on the grounds of the Pan-American Exposition in Buffalo, New York.\n",
"\t0.655\tJames Abram Garfield (November 19, 1831 - September 19, 1881) was the 20th (1881) President of the United States and the 2nd President to be assassinated (killed while in office). President Garfield was in office from March to September of 1881. He was in office for a total of six months and fifteen days. For almost half that time he was bedridden as a result of an attempt to kill him. He was shot on July 2 and finally died in September the same year he got into office.\n",
"\n",
"-------------------------\n",
"\n",
"Top-3 Cross-Encoder Re-ranker hits\n",
"\t9.278\tWilliam McKinley, the 25th President of the United States, was assassinated on September 6, 1901, inside the Temple of Music on the grounds of the Pan-American Exposition in Buffalo, New York.\n",
"\t8.403\tJohn F. Kennedy was the 35th President of the United States. He was assassinated (murdered) in Dealey Plaza, Dallas, Texas, on Friday, November 22, 1963. This happened while he was traveling in a Presidential motorcade with his wife Jacqueline, the Governor of Texas John Connally, and the governor's wife Nellie.\n",
"\t7.914\tOn December 26, 2006, Gerald Ford, the 38th President of the United States, died at his home in Rancho Mirage, California at 6:45 p.m. local time (02:45, December 27, UTC).\n"
]
}
],
"source": [
"search(query = \"Which US president was killed?\")"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "TFkPfYGXIz0Y",
"outputId": "dead42f8-53b5-454a-8b95-bd9a6c838006"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Input question: When is Chinese New Year\n",
"Top-3 lexical search (BM25) hits\n",
"\t18.743\tChinese New Year, known in China as the SpringFestival and in Singapore as the LunarNewYear, is a holiday on and around the new moon on the first day of the year in the traditional Chinese calendar. This calendar is based on the changes in the moon and is only sometimes changed to fit the seasons of the year based on how the Earth moves around the sun. Because of this, Chinese New Year is never on January1. It moves around between January21 and February20.\n",
"\t18.527\tNew Year in Japan is one of the most important festivals. Unlike the Chinese New Year, it is held on January 1.\n",
"\t15.789\tThe CCTV New Year's Gala (Simplified Chinese: 中国中央电视台春节联欢晚会; Traditional Chinese: 中國中央電視台春節聯歡晚會; Pinyin: \"Zhōngguó zhōngyāng diànshìtái chūnjié liánhuān wǎnhuì\") is a Chinese New Year special produced by China Central Television. It was presented by Zhao Zhongxiang.\n",
"\n",
"-------------------------\n",
"\n",
"Top-3 Bi-Encoder Retrieval hits\n",
"\t0.782\tChinese New Year, known in China as the SpringFestival and in Singapore as the LunarNewYear, is a holiday on and around the new moon on the first day of the year in the traditional Chinese calendar. This calendar is based on the changes in the moon and is only sometimes changed to fit the seasons of the year based on how the Earth moves around the sun. Because of this, Chinese New Year is never on January1. It moves around between January21 and February20.\n",
"\t0.648\tChinese National Day is the national day of China. It may mean:\n",
"\t0.642\tNational Day is a yearly holiday in the People's Republic of China. It celebrates the beginning of its new government on October1, 1949. It is one of two Golden Weeks in the country, along with the Chinese New Year.\n",
"\n",
"-------------------------\n",
"\n",
"Top-3 Cross-Encoder Re-ranker hits\n",
"\t10.896\tChinese New Year, known in China as the SpringFestival and in Singapore as the LunarNewYear, is a holiday on and around the new moon on the first day of the year in the traditional Chinese calendar. This calendar is based on the changes in the moon and is only sometimes changed to fit the seasons of the year based on how the Earth moves around the sun. Because of this, Chinese New Year is never on January1. It moves around between January21 and February20.\n",
"\t6.147\tNew Year in Japan is one of the most important festivals. Unlike the Chinese New Year, it is held on January 1.\n",
"\t5.081\tNational Day is a yearly holiday in the People's Republic of China. It celebrates the beginning of its new government on October1, 1949. It is one of two Golden Weeks in the country, along with the Chinese New Year.\n"
]
}
],
"source": [
"search(query=\"When is Chinese New Year\")"
]
},
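{
"cell_type": "markdown",
"metadata": {},
"source": [
"The three score types above live on different scales: BM25 scores are unbounded, the bi-encoder scores are cosine similarities, and the cross-encoder returns raw scores that can also be negative. If a fixed range is more convenient, for example to pick a cut-off, a monotone mapping such as the sigmoid can be applied; the next cell is a small sketch that rescales a few of the cross-encoder scores printed above. The ranking of the hits is not changed by this."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"\n",
"# Sketch: map a few of the cross-encoder scores printed above into (0, 1) with a sigmoid.\n",
"# The sigmoid is monotone, so the relative order of the hits stays the same.\n",
"example_scores = torch.tensor([10.574, 4.362, -2.518, -4.171])\n",
"print(torch.sigmoid(example_scores))"
]
},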
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "soZ1nH4_I4Zi"
},
"outputs": [],
"source": []
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"collapsed_sections": [],
"name": "retrieve_rerank_simple_wikipedia.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"0bee2cf221bf4497b527369b3905553e": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"10c12f39a3b340a98f6db972e9d37921": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"11c91ed9d2f049c9aec2124c6c926f88": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HBoxModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_a39d7b521ea540cd8af250584340e427",
"IPY_MODEL_7ea6f545038f4422850e4a2d75345e65"
],
"layout": "IPY_MODEL_0bee2cf221bf4497b527369b3905553e"
}
},
"1390bf4686774d6581348f87f908a400": {
"model_module": "@jupyter-widgets/controls",
"model_name": "FloatProgressModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "100%",
"description_tooltip": null,
"layout": "IPY_MODEL_64eb0085f9a84129937350004868ad54",
"max": 50223724,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_e6f53f85bbbb4e6caf992e913d9cce88",
"value": 50223724
}
},
"1655064d0c164b778220e9a4da96d2ea": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"16d9436e5caf448a883bbdac3dc930dd": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"1c198b0a475f42fe836c19813f838100": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HBoxModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_4fd3399b50224953806f5a646d1ffd89",
"IPY_MODEL_fe98ad30ffaa4bcfa09c5c01accdd9b8"
],
"layout": "IPY_MODEL_1655064d0c164b778220e9a4da96d2ea"
}
},
"258161ce772d470fbc004f7736cea24e": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"27d20db7a7f346ea81e3abe88868cfcd": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"297a74178198439ab2847d90f2ee56e0": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"33b91b75207249a7b885c317c95503dd": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_362ab62385884739ba1de446dddf1b87",
"placeholder": "​",
"style": "IPY_MODEL_614511ff768642299c1b9c19abfb56a6",
"value": " 612/612 [00:00&lt;00:00, 641B/s]"
}
},
"33bd233b64664e639dea85c60f0372dd": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"362ab62385884739ba1de446dddf1b87": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"399bf19c69ed4a1490c6609284bacf27": {
"model_module": "@jupyter-widgets/controls",
"model_name": "FloatProgressModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "100%",
"description_tooltip": null,
"layout": "IPY_MODEL_e5595b23a62643aab844c875127a0a0a",
"max": 509663,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_79e0e838f33f4c51b24bdfd12c7d0a00",
"value": 509663
}
},
"3b01947b6c50476db6c9bcb21b03491b": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"3fd97a1c9ef74e80b999aba18dd20075": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"442d3ee2958b4db999c5e1624986d201": {
"model_module": "@jupyter-widgets/controls",
"model_name": "FloatProgressModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "100%",
"description_tooltip": null,
"layout": "IPY_MODEL_c119469e196f4ceca7b1368f2a38d18b",
"max": 782843128,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_ec1353a6b4e8450f84b4302a50e2fba6",
"value": 782843128
}
},
"44d63b5526824e0f81d62dfba320bdc1": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"4c308b4ed1644ae790c878e190642c68": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"4d2cd27435f44a4f8bdbdd3e17c8c86a": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"4fd3399b50224953806f5a646d1ffd89": {
"model_module": "@jupyter-widgets/controls",
"model_name": "FloatProgressModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "Downloading: 100%",
"description_tooltip": null,
"layout": "IPY_MODEL_88a71f6953af4259accf096fda7356b1",
"max": 231508,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_cf475534803f42718e6a87e645585357",
"value": 231508
}
},
"55fd02cec9a74a92a7c0bf4cf713bb28": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"566d714778c040908ce443cc51c926bb": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"581eef89f62245a0aaae36179e688843": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"5d1b0a2aa5da48f3ad8e428d7b4ead27": {
"model_module": "@jupyter-widgets/controls",
"model_name": "ProgressStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": "initial"
}
},
"614511ff768642299c1b9c19abfb56a6": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"64eb0085f9a84129937350004868ad54": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"67346148e616426c83f10f203be5da8d": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"6bceb6fffeb84566a07aeb340d7c920f": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"703fe20c0249435eb847fe29a5a23c52": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HBoxModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_1390bf4686774d6581348f87f908a400",
"IPY_MODEL_f3ba3908cf1d4000b0da73107cee1526"
],
"layout": "IPY_MODEL_6bceb6fffeb84566a07aeb340d7c920f"
}
},
"79e0e838f33f4c51b24bdfd12c7d0a00": {
"model_module": "@jupyter-widgets/controls",
"model_name": "ProgressStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": "initial"
}
},
"7d0105cc6432438581e71fad681ae2d8": {
"model_module": "@jupyter-widgets/controls",
"model_name": "ProgressStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": "initial"
}
},
"7d4f6eb85c0f4b06bf332c278da7f225": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"7e15806e1ab74d6893f9c5f4ede9b5af": {
"model_module": "@jupyter-widgets/controls",
"model_name": "FloatProgressModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "100%",
"description_tooltip": null,
"layout": "IPY_MODEL_4d2cd27435f44a4f8bdbdd3e17c8c86a",
"max": 244719814,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_89d89e63246f4e1d80b8a2a05db28a19",
"value": 244719814
}
},
"7ea6f545038f4422850e4a2d75345e65": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_581eef89f62245a0aaae36179e688843",
"placeholder": "​",
"style": "IPY_MODEL_3fd97a1c9ef74e80b999aba18dd20075",
"value": " 112/112 [00:00&lt;00:00, 747B/s]"
}
},
"85a56dd0bd8549e790759f3f4ad0df7f": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"877f416c38944e3db52917a6497b3525": {
"model_module": "@jupyter-widgets/controls",
"model_name": "FloatProgressModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "Downloading: 100%",
"description_tooltip": null,
"layout": "IPY_MODEL_33bd233b64664e639dea85c60f0372dd",
"max": 541,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_e120277c4a7141fca20e0115486a0be1",
"value": 541
}
},
"88a71f6953af4259accf096fda7356b1": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"89533b9769bd4fe5be0068b381a7b554": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HBoxModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_7e15806e1ab74d6893f9c5f4ede9b5af",
"IPY_MODEL_f9a01234d31e48cbb2b02f1b34acbbd1"
],
"layout": "IPY_MODEL_95bcab1f8336422aafa453f8412d9eed"
}
},
"89d89e63246f4e1d80b8a2a05db28a19": {
"model_module": "@jupyter-widgets/controls",
"model_name": "ProgressStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": "initial"
}
},
"938510ed19db4fcbadbe730584ca5e58": {
"model_module": "@jupyter-widgets/controls",
"model_name": "FloatProgressModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "Downloading: 100%",
"description_tooltip": null,
"layout": "IPY_MODEL_16d9436e5caf448a883bbdac3dc930dd",
"max": 612,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_7d0105cc6432438581e71fad681ae2d8",
"value": 612
}
},
"95bcab1f8336422aafa453f8412d9eed": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"9c35672b3e224a34b7070099c5067426": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_55fd02cec9a74a92a7c0bf4cf713bb28",
"placeholder": "​",
"style": "IPY_MODEL_4c308b4ed1644ae790c878e190642c68",
"value": " 541/541 [00:00&lt;00:00, 8.37kB/s]"
}
},
"a39d7b521ea540cd8af250584340e427": {
"model_module": "@jupyter-widgets/controls",
"model_name": "FloatProgressModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "Downloading: 100%",
"description_tooltip": null,
"layout": "IPY_MODEL_297a74178198439ab2847d90f2ee56e0",
"max": 112,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_5d1b0a2aa5da48f3ad8e428d7b4ead27",
"value": 112
}
},
"a6c3c35f17394bc4a66d460f91cd6833": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"ab19ef7e9763402589b15b4ccd1b5d6b": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HBoxModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_399bf19c69ed4a1490c6609284bacf27",
"IPY_MODEL_c25ffe4acbac42ca9755248338ed864b"
],
"layout": "IPY_MODEL_e5672d2910174d87be7434aa7bbe78c0"
}
},
"aec7bd589fb04ecb999ef9a60dd60afa": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_10c12f39a3b340a98f6db972e9d37921",
"placeholder": "​",
"style": "IPY_MODEL_f10d1c0166f941bcba75f9470db86651",
"value": " 268M/268M [00:07&lt;00:00, 36.9MB/s]"
}
},
"bbfae35891b7490f8da4a60627fa5a16": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"be0fac5d7a974f47ac5b8a5e6ed9e4d6": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"bf2f3698d10f47fdb65dc82e54a9e566": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"bf43844c257c4ee99d5f2cf181179c81": {
"model_module": "@jupyter-widgets/controls",
"model_name": "FloatProgressModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "Downloading: 100%",
"description_tooltip": null,
"layout": "IPY_MODEL_a6c3c35f17394bc4a66d460f91cd6833",
"max": 267871721,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_fc984d27599d41c2be86c13dcc66a171",
"value": 267871721
}
},
"c119469e196f4ceca7b1368f2a38d18b": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"c25ffe4acbac42ca9755248338ed864b": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_7d4f6eb85c0f4b06bf332c278da7f225",
"placeholder": "​",
"style": "IPY_MODEL_dee30a855454476aaa572af57e6b7405",
"value": " 509663/509663 [00:21&lt;00:00, 23222.74it/s]"
}
},
"cc09077698c2445c8c7a7be99a5eb931": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_566d714778c040908ce443cc51c926bb",
"placeholder": "​",
"style": "IPY_MODEL_67346148e616426c83f10f203be5da8d",
"value": " 783M/783M [00:44&lt;00:00, 17.6MB/s]"
}
},
"cf475534803f42718e6a87e645585357": {
"model_module": "@jupyter-widgets/controls",
"model_name": "ProgressStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": "initial"
}
},
"dd4f3ca182c54678887ff51f75877997": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HBoxModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_bf43844c257c4ee99d5f2cf181179c81",
"IPY_MODEL_aec7bd589fb04ecb999ef9a60dd60afa"
],
"layout": "IPY_MODEL_bbfae35891b7490f8da4a60627fa5a16"
}
},
"dee30a855454476aaa572af57e6b7405": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"e00bfbac209b4caa92ac3c679381a5a5": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HBoxModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_442d3ee2958b4db999c5e1624986d201",
"IPY_MODEL_cc09077698c2445c8c7a7be99a5eb931"
],
"layout": "IPY_MODEL_44d63b5526824e0f81d62dfba320bdc1"
}
},
"e120277c4a7141fca20e0115486a0be1": {
"model_module": "@jupyter-widgets/controls",
"model_name": "ProgressStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": "initial"
}
},
"e45327599f404ae09e5452978695d4e8": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HBoxModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_877f416c38944e3db52917a6497b3525",
"IPY_MODEL_9c35672b3e224a34b7070099c5067426"
],
"layout": "IPY_MODEL_258161ce772d470fbc004f7736cea24e"
}
},
"e47e734b2bbc42d1b9a32e1b1dca6451": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"e5595b23a62643aab844c875127a0a0a": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"e5672d2910174d87be7434aa7bbe78c0": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"e6f53f85bbbb4e6caf992e913d9cce88": {
"model_module": "@jupyter-widgets/controls",
"model_name": "ProgressStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": "initial"
}
},
"ec1353a6b4e8450f84b4302a50e2fba6": {
"model_module": "@jupyter-widgets/controls",
"model_name": "ProgressStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": "initial"
}
},
"f0dca993eee34ed8b9c0a9ead6cbbf68": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"f10d1c0166f941bcba75f9470db86651": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"f3ba3908cf1d4000b0da73107cee1526": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_bf2f3698d10f47fdb65dc82e54a9e566",
"placeholder": "​",
"style": "IPY_MODEL_3b01947b6c50476db6c9bcb21b03491b",
"value": " 50.2M/50.2M [00:36&lt;00:00, 1.38MB/s]"
}
},
"f9a01234d31e48cbb2b02f1b34acbbd1": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_e47e734b2bbc42d1b9a32e1b1dca6451",
"placeholder": "​",
"style": "IPY_MODEL_27d20db7a7f346ea81e3abe88868cfcd",
"value": " 245M/245M [00:13&lt;00:00, 18.7MB/s]"
}
},
"fae4387adeac4391ba73c21643e6034f": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HBoxModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_938510ed19db4fcbadbe730584ca5e58",
"IPY_MODEL_33b91b75207249a7b885c317c95503dd"
],
"layout": "IPY_MODEL_be0fac5d7a974f47ac5b8a5e6ed9e4d6"
}
},
"fc984d27599d41c2be86c13dcc66a171": {
"model_module": "@jupyter-widgets/controls",
"model_name": "ProgressStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": "initial"
}
},
"fe98ad30ffaa4bcfa09c5c01accdd9b8": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_85a56dd0bd8549e790759f3f4ad0df7f",
"placeholder": "​",
"style": "IPY_MODEL_f0dca993eee34ed8b9c0a9ead6cbbf68",
"value": " 232k/232k [00:00&lt;00:00, 845kB/s]"
}
}
}
}
},
"nbformat": 4,
"nbformat_minor": 1
}
\ No newline at end of file
# Semantic Search
Semantic search seeks to improve search accuracy by understanding the content of the search query. In contrast to traditional search engines which only find documents based on lexical matches, semantic search can also find synonyms.
## Background
The idea behind semantic search is to embed all entries in your corpus, whether they be sentences, paragraphs, or documents, into a vector space.
At search time, the query is embedded into the same vector space and the closest embeddings from your corpus are found. These entries should have a high semantic overlap with the query.
![SemanticSearch](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SemanticSearch.png)
## Symmetric vs. Asymmetric Semantic Search
A **critical distinction** for your setup is *symmetric* vs. *asymmetric semantic search*:
- For **symmetric semantic search** your query and the entries in your corpus are of about the same length and have the same amount of content. An example would be searching for similar questions: Your query could for example be *"How to learn Python online?"* and you want to find an entry like *"How to learn Python on the web?"*. For symmetric tasks, you could potentially flip the query and the entries in your corpus.
- For **asymmetric semantic search**, you usually have a **short query** (like a question or some keywords) and you want to find a longer paragraph answering the query. An example would be a query like *"What is Python"* and you want to find the paragraph *"Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy ..."*. For asymmetric tasks, flipping the query and the entries in your corpus usually does not make sense.
It is critical **that you choose the right model** for your type of task.
Suitable models for **symmetric semantic search**: [Pre-Trained Sentence Embedding Models](https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models)
Suitable models for **asymmetric semantic search**: [Pre-Trained MS MARCO Models](https://www.sbert.net/docs/pretrained-models/msmarco-v3.html)
## Python
For small corpora (up to about 1 million entries) we can compute the cosine-similarity between the query and all entries in the corpus.
In the following example, we define a small corpus with a few example sentences and compute the embeddings for the corpus as well as for our query.
We then use the [util.cos_sim()](../../../docs/usage/semantic_textual_similarity.md) function to compute the cosine similarity between the query and all corpus entries.
For large corpora, sorting all scores would take too much time. Hence, we use [torch.topk](https://pytorch.org/docs/stable/generated/torch.topk.html) to only get the top k entries.
For a simple example, see [semantic_search.py](semantic_search.py):
```eval_rst
.. literalinclude:: semantic_search.py
```
## util.semantic_search
Instead of implementing semantic search by yourself, you can use the *util.semantic_search* function.
The function accepts the following parameters:
```eval_rst
.. autofunction:: sentence_transformers.util.semantic_search
```
By default, up to 100 queries are processed in parallel. Further, the corpus is chunked into sets of up to 500k entries. You can increase *query_chunk_size* and *corpus_chunk_size*, which speeds up search for large corpora but also increases the memory requirement.
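For illustration, a minimal call could look like the following sketch (the model name, corpus, and chunk sizes are only example values):
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = ["A man is eating food.", "A woman is playing violin.", "A monkey is playing drums."]
queries = ["Someone is making music.", "A person is having lunch."]

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embeddings = model.encode(queries, convert_to_tensor=True)

# Larger chunk sizes speed up search on big corpora, but require more memory
hits = util.semantic_search(
    query_embeddings,
    corpus_embeddings,
    top_k=3,
    query_chunk_size=100,
    corpus_chunk_size=500000,
)

# hits[i] is a list of dicts with 'corpus_id' and 'score' for query i
for query, query_hits in zip(queries, hits):
    print(query)
    for hit in query_hits:
        print("  {:.4f}  {}".format(hit["score"], corpus[hit["corpus_id"]]))
```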
## Speed Optimization
To get optimal speed for the `util.semantic_search` method, it is advisable to have the `query_embeddings` as well as the `corpus_embeddings` on the same GPU device. This significantly boosts performance.
Further, we can normalize the corpus embeddings so that each corpus embedding has length 1. In that case, we can use the dot-product to compute the scores.
```python
corpus_embeddings = corpus_embeddings.to("cuda")
corpus_embeddings = util.normalize_embeddings(corpus_embeddings)
query_embeddings = query_embeddings.to("cuda")
query_embeddings = util.normalize_embeddings(query_embeddings)
hits = util.semantic_search(query_embeddings, corpus_embeddings, score_function=util.dot_score)
```
## Elasticsearch
[Elasticsearch](https://www.elastic.co/elasticsearch/) has the possibility to [index dense vectors](https://www.elastic.co/what-is/vector-search) and to use them for document scoring. We can easily index embedding vectors, store other data alongside our vectors and, most importantly, efficiently retrieve relevant entries using [approximate nearest neighbor search](https://www.elastic.co/blog/introducing-approximate-nearest-neighbor-search-in-elasticsearch-8-0) (HNSW, see also below) on the embeddings.
For further details, see [semantic_search_quora_elasticsearch.py](semantic_search_quora_elasticsearch.py).
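As a rough sketch (assuming a locally running Elasticsearch 8.x instance with its official Python client; the index name, field names, and model are illustrative), indexing embeddings as `dense_vector` fields and querying them with approximate kNN could look like this:
```python
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings
es = Elasticsearch("http://localhost:9200")

# Create an index with a dense_vector field that supports approximate kNN search
es.indices.create(
    index="my-corpus",
    mappings={"properties": {
        "text": {"type": "text"},
        "embedding": {"type": "dense_vector", "dims": 384, "index": True, "similarity": "cosine"},
    }},
)

# Index a few example sentences together with their embeddings
sentences = ["Python is a programming language.", "London is the capital of England."]
for i, sentence in enumerate(sentences):
    es.index(index="my-corpus", id=i, document={"text": sentence, "embedding": model.encode(sentence).tolist()})
es.indices.refresh(index="my-corpus")

# Approximate nearest neighbor (kNN) query on the embedding field
query_vector = model.encode("What is Python?").tolist()
response = es.search(
    index="my-corpus",
    knn={"field": "embedding", "query_vector": query_vector, "k": 2, "num_candidates": 10},
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["text"])
```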
## Approximate Nearest Neighbor
Searching a large corpus with millions of embeddings can be time-consuming if exact nearest neighbor search is used (like it is used by *util.semantic_search*).
In that case, Approximate Nearest Neighbor (ANN) can be helpful. Here, the data is partitioned into smaller fractions of similar embeddings. This index can be searched efficiently and the embeddings with the highest similarity (the nearest neighbors) can be retrieved within milliseconds, even if you have millions of vectors.
However, the results are not necessarily exact. It is possible that some vectors with high similarity will be missed. That's the reason why it is called approximate nearest neighbor.
For all ANN methods, there are usually one or more parameters to tune that determine the recall-speed trade-off. If you want the highest speed, you have a high chance of missing hits. If you want high recall, the search speed decreases.
Three popular libraries for approximate nearest neighbor are [Annoy](https://github.com/spotify/annoy), [FAISS](https://github.com/facebookresearch/faiss), and [hnswlib](https://github.com/nmslib/hnswlib/). Personally, I find hnswlib the most suitable library: it is easy to use, offers great performance, and includes features that are important for real applications.
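To give a feel for how such an index is used, here is a minimal hnswlib sketch (the corpus, model, and parameter values are only illustrative); the complete scripts are listed below:
```python
import hnswlib
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus = ["A man is eating food.", "A woman is playing violin.", "A monkey is playing drums."]
corpus_embeddings = model.encode(corpus, convert_to_numpy=True)

# Build the HNSW index in cosine space; ef_construction and M trade index quality against build time
index = hnswlib.Index(space="cosine", dim=corpus_embeddings.shape[1])
index.init_index(max_elements=len(corpus), ef_construction=200, M=16)
index.add_items(corpus_embeddings, np.arange(len(corpus)))
index.set_ef(50)  # ef controls the recall/speed trade-off at query time

# Query the index; hnswlib returns cosine distances (1 - cosine similarity)
query_embedding = model.encode("Someone is playing music.", convert_to_numpy=True)
labels, distances = index.knn_query(query_embedding, k=2)
for idx, dist in zip(labels[0], distances[0]):
    print(corpus[idx], "(Score: {:.4f})".format(1 - dist))
```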
Examples:
- [semantic_search_quora_hnswlib.py](semantic_search_quora_hnswlib.py)
- [semantic_search_quora_annoy.py](semantic_search_quora_annoy.py)
- [semantic_search_quora_faiss.py](semantic_search_quora_faiss.py)
## Retrieve & Re-Rank
For complex semantic search scenarios, a retrieve & re-rank pipeline is advisable:
![InformationRetrieval](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/InformationRetrieval.png)
For further details, see [Retrieve & Re-rank](../retrieve_rerank/README.md).
## Examples
In the following we list examples for different use-cases.
### Similar Questions Retrieval
[semantic_search_quora_pytorch.py](semantic_search_quora_pytorch.py) [ [Colab version](https://colab.research.google.com/drive/12cn5Oo0v3HfQQ8Tv6-ukgxXSmT3zl35A?usp=sharing) ] shows an example based on the [Quora duplicate questions](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) dataset. The user can enter a question, and the code retrieves the most similar questions from the dataset using the *util.semantic_search* method. As model, we use *distilbert-multilingual-nli-stsb-quora-ranking*, which was trained to identify similar questions and supports 50+ languages. Hence, the user can input the question in any of the 50+ languages. This is a **symmetric search task**, as the search queries have the same length and content as the questions in the corpus.
### Similar Publication Retrieval
[semantic_search_publications.py](semantic_search_publications.py) [ [Colab version](https://colab.research.google.com/drive/12hfBveGHRsxhPIUMmJYrll2lFU4fOX06?usp=sharing) ] shows an example of how to find similar scientific publications. As corpus, we use all publications that have been presented at the EMNLP 2016 - 2018 conferences. As search query, we input the title and abstract of more recent publications and find related publications from our corpus. We use the [SPECTER](https://arxiv.org/abs/2004.07180) model. This is a **symmetric search task**, as the papers in the corpus consist of title & abstract and we also search with title & abstract.
### Question & Answer Retrieval
[semantic_search_wikipedia_qa.py](semantic_search_wikipedia_qa.py) [ [Colab Version](https://colab.research.google.com/drive/11GunvCqJuebfeTlgbJWkIMT0xJH6PWF1?usp=sharing) ]: This example uses a model that was trained on the [Natural Questions dataset](https://ai.google.com/research/NaturalQuestions/). It consists of about 100k real Google search queries, together with an annotated passage from Wikipedia that provides the answer. It is an example of an **asymmetric search task**. As corpus, we use the smaller [Simple English Wikipedia](https://simple.wikipedia.org/wiki/Main_Page) so that it fits easily into memory.
[retrieve_rerank_simple_wikipedia.ipynb](../retrieve_rerank/retrieve_rerank_simple_wikipedia.ipynb) [ [Colab Version](https://colab.research.google.com/github/UKPLab/sentence-transformers/blob/master/examples/applications/retrieve_rerank/retrieve_rerank_simple_wikipedia.ipynb) ]: This script uses the [Retrieve & Re-rank](../retrieve_rerank/README.md) strategy and is an example for an **asymmetric search task**. We split all Wikipedia articles into paragraphs and encode them with a bi-encoder. If a new query / question is entered, it is encoded by the same bi-encoder and the paragraphs with the highest cosine-similarity are retrieved (see [semantic search](../semantic-search/README.md)). Next, the retrieved candidates are scored by a Cross-Encoder re-ranker and the 5 passages with the highest score from the Cross-Encoder are presented to the user. We use models that were trained on the [MS Marco Passage Reranking](https://github.com/microsoft/MSMARCO-Passage-Ranking/) dataset, a dataset with about 500k real queries from Bing search.
"""
This is a simple application for sentence embeddings: semantic search
We have a corpus with various sentences. Then, for a given query sentence,
we want to find the most similar sentence in this corpus.
This script outputs for various queries the top 5 most similar sentences in the corpus.
"""
from sentence_transformers import SentenceTransformer, util
import torch
embedder = SentenceTransformer("all-MiniLM-L6-v2")
# Corpus with example sentences
corpus = [
"A man is eating food.",
"A man is eating a piece of bread.",
"The girl is carrying a baby.",
"A man is riding a horse.",
"A woman is playing violin.",
"Two men pushed carts through the woods.",
"A man is riding a white horse on an enclosed ground.",
"A monkey is playing drums.",
"A cheetah is running behind its prey.",
]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)
# Query sentences:
queries = [
"A man is eating pasta.",
"Someone in a gorilla costume is playing a set of drums.",
"A cheetah chases prey on across a field.",
]
# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 5 scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))
"""
# Alternatively, we can also use util.semantic_search to perform cosine similarty + topk
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=5)
hits = hits[0] #Get the hits for the first query
for hit in hits:
print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))
"""
"""
This example demonstrates how we can perform semantic search for scientific publications.
As model, we use SPECTER (https://github.com/allenai/specter), which encodes paper titles and abstracts
into a vector space.
We can then use util.semantic_search() to find the most similar papers.
Colab example: https://colab.research.google.com/drive/12hfBveGHRsxhPIUMmJYrll2lFU4fOX06
"""
import json
import os
from sentence_transformers import SentenceTransformer, util
# First, we load the papers dataset (with title and abstract information)
dataset_file = "emnlp2016-2018.json"
if not os.path.exists(dataset_file):
    util.http_get("https://sbert.net/datasets/emnlp2016-2018.json", dataset_file)

with open(dataset_file) as fIn:
    papers = json.load(fIn)
# We then load the allenai-specter model with SentenceTransformers
model = SentenceTransformer("allenai-specter")
# To encode the papers, we must combine the title and the abstracts to a single string
paper_texts = [paper["title"] + "[SEP]" + paper["abstract"] for paper in papers]
# Compute embeddings for all papers
corpus_embeddings = model.encode(paper_texts, convert_to_tensor=True)
# We define a function that, given title & abstract, searches our corpus for relevant (similar) papers
def search_papers(title, abstract):
    query_embedding = model.encode(title + "[SEP]" + abstract, convert_to_tensor=True)

    search_hits = util.semantic_search(query_embedding, corpus_embeddings)
    search_hits = search_hits[0]  # Get the hits for the first query

    print("\n\nPaper:", title)
    print("Most similar papers:")
    for hit in search_hits:
        related_paper = papers[hit["corpus_id"]]
        print(
            "{:.2f}\t{}\t{} {}".format(
                hit["score"], related_paper["title"], related_paper["venue"], related_paper["year"]
            )
        )
# This paper was the EMNLP 2019 Best Paper
search_papers(
title="Specializing Word Embeddings (for Parsing) by Information Bottleneck",
abstract="Pre-trained word embeddings like ELMo and BERT contain rich syntactic and semantic information, resulting in state-of-the-art performance on various tasks. We propose a very fast variational information bottleneck (VIB) method to nonlinearly compress these embeddings, keeping only the information that helps a discriminative parser. We compress each word embedding to either a discrete tag or a continuous vector. In the discrete version, our automatically compressed tags form an alternative tag set: we show experimentally that our tags capture most of the information in traditional POS tag annotations, but our tag sequences can be parsed more accurately at the same level of tag granularity. In the continuous version, we show experimentally that moderately compressing the word embeddings by our method yields a more accurate parser in 8 of 9 languages, unlike simple dimensionality reduction.",
)
# This paper was the EMNLP 2020 Best Paper
search_papers(
title="Digital Voicing of Silent Speech",
abstract="In this paper, we consider the task of digitally voicing silent speech, where silently mouthed words are converted to audible speech based on electromyography (EMG) sensor measurements that capture muscle impulses. While prior work has focused on training speech synthesis models from EMG collected during vocalized speech, we are the first to train from EMG collected during silently articulated speech. We introduce a method of training on silent EMG by transferring audio targets from vocalized to silent signals. Our method greatly improves intelligibility of audio generated from silent EMG compared to a baseline that only trains with vocalized data, decreasing transcription word error rate from 64% to 4% in one data condition and 88% to 68% in another. To spur further development on this task, we share our new dataset of silent and vocalized facial EMG measurements.",
)
# This paper was an EMNLP 2020 Honourable Mention Paper
search_papers(
title="If beam search is the answer, what was the question?",
abstract="Quite surprisingly, exact maximum a posteriori (MAP) decoding of neural language generators frequently leads to low-quality results. Rather, most state-of-the-art results on language generation tasks are attained using beam search despite its overwhelmingly high search error rate. This implies that the MAP objective alone does not express the properties we desire in text, which merits the question: if beam search is the answer, what was the question? We frame beam search as the exact solution to a different decoding objective in order to gain insights into why high probability under a model alone may not indicate adequacy. We find that beam search enforces uniform information density in text, a property motivated by cognitive science. We suggest a set of decoding objectives that explicitly enforce this property and find that exact decoding with these objectives alleviates the problems encountered when decoding poorly calibrated language generation models. Additionally, we analyze the text produced using various decoding strategies and see that, in our neural machine translation experiments, the extent to which this property is adhered to strongly correlates with BLEU.",
)
# This paper was an EMNLP 2020 Honourable Mention Paper
search_papers(
title="Spot The Bot: A Robust and Efficient Framework for the Evaluation of Conversational Dialogue Systems",
abstract="The lack of time efficient and reliable evalu-ation methods is hampering the development of conversational dialogue systems (chat bots). Evaluations that require humans to converse with chat bots are time and cost intensive, put high cognitive demands on the human judges, and tend to yield low quality results. In this work, we introduce Spot The Bot, a cost-efficient and robust evaluation framework that replaces human-bot conversations with conversations between bots. Human judges then only annotate for each entity in a conversation whether they think it is human or not (assuming there are humans participants in these conversations). These annotations then allow us to rank chat bots regarding their ability to mimic conversational behaviour of humans. Since we expect that all bots are eventually recognized as such, we incorporate a metric that measures which chat bot is able to uphold human-like be-havior the longest, i.e.Survival Analysis. This metric has the ability to correlate a bot’s performance to certain of its characteristics (e.g.fluency or sensibleness), yielding interpretable results. The comparably low cost of our frame-work allows for frequent evaluations of chatbots during their evaluation cycle. We empirically validate our claims by applying Spot The Bot to three domains, evaluating several state-of-the-art chat bots, and drawing comparisonsto related work. The framework is released asa ready-to-use tool.",
)
# EMNLP 2020 paper on making Sentence-BERT multilingual
search_papers(
title="Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
abstract="We present an easy and efficient method to extend existing sentence embedding models to new languages. This allows to create multilingual versions from previously monolingual models. The training is based on the idea that a translated sentence should be mapped to the same location in the vector space as the original sentence. We use the original (monolingual) model to generate sentence embeddings for the source language and then train a new system on translated sentences to mimic the original model. Compared to other methods for training multilingual sentence embeddings, this approach has several advantages: It is easy to extend existing models with relatively few samples to new languages, it is easier to ensure desired properties for the vector space, and the hardware requirements for training is lower. We demonstrate the effectiveness of our approach for 50+ languages from various language families. Code to extend sentence embeddings models to more than 400 languages is publicly available.",
)
"""
This example uses Approximate Nearest Neighbor Search (ANN) with Annoy (https://github.com/spotify/annoy).
Searching a large corpus with millions of embeddings can be time-consuming. To speed this up,
ANN can index the existing vectors. For a new query vector, this index can be used to find the nearest neighbors.
This nearest neighbor search is not perfect, i.e., it might not perfectly find all top-k nearest neighbors.
In this example, we use Annoy. It builds a forest of trees that partitions the embeddings into smaller sections. For a query embedding,
we can efficiently check which sections match and only search those sections for nearest neighbors.
Selecting the n_trees parameter is quite important. With more trees, we get a better recall, but a worse run-time.
This script will compare the result from ANN with exact nearest neighbor search and output a Recall@k value
as well as the missing results in the top-k hits list.
See the Annoy repository for instructions on how to install Annoy.
For details on how Annoy works, see: https://erikbern.com/2015/10/01/nearest-neighbors-and-vector-models-part-2-how-to-search-in-high-dimensional-spaces.html
As dataset, we use the Quora Duplicate Questions dataset, which contains about 500k questions:
https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs
As embedding model, we use the SBERT model 'quora-distilbert-multilingual',
which is aligned across 100 languages. I.e., you can type in a question in various languages and it will
return the closest questions in the corpus (questions in the corpus are mainly in English).
"""
from sentence_transformers import SentenceTransformer, util
import os
import csv
import pickle
import time
import torch
from annoy import AnnoyIndex
model_name = "quora-distilbert-multilingual"
model = SentenceTransformer(model_name)
url = "http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"
dataset_path = "quora_duplicate_questions.tsv"
max_corpus_size = 100000
n_trees = 256 # Number of trees used for Annoy. More trees => better recall, worse run-time
embedding_size = 768 # Size of embeddings
top_k_hits = 10 # Output k hits
annoy_index_path = "quora-embeddings-{}-size-{}-annoy_index-trees-{}.ann".format(
model_name.replace("/", "_"), max_corpus_size, n_trees
)
embedding_cache_path = "quora-embeddings-{}-size-{}.pkl".format(model_name.replace("/", "_"), max_corpus_size)
# Check if embedding cache path exists
if not os.path.exists(embedding_cache_path):
# Check if the dataset exists. If not, download and extract
# Download dataset if needed
if not os.path.exists(dataset_path):
print("Download dataset")
util.http_get(url, dataset_path)
# Get all unique sentences from the file
corpus_sentences = set()
with open(dataset_path, encoding="utf8") as fIn:
reader = csv.DictReader(fIn, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
for row in reader:
corpus_sentences.add(row["question1"])
if len(corpus_sentences) >= max_corpus_size:
break
corpus_sentences.add(row["question2"])
if len(corpus_sentences) >= max_corpus_size:
break
corpus_sentences = list(corpus_sentences)
print("Encode the corpus. This might take a while")
corpus_embeddings = model.encode(corpus_sentences, show_progress_bar=True, convert_to_numpy=True)
print("Store file on disc")
with open(embedding_cache_path, "wb") as fOut:
pickle.dump({"sentences": corpus_sentences, "embeddings": corpus_embeddings}, fOut)
else:
print("Load pre-computed embeddings from disc")
with open(embedding_cache_path, "rb") as fIn:
cache_data = pickle.load(fIn)
corpus_sentences = cache_data["sentences"]
corpus_embeddings = cache_data["embeddings"]
if not os.path.exists(annoy_index_path):
# Create Annoy Index
print("Create Annoy index with {} trees. This can take some time.".format(n_trees))
annoy_index = AnnoyIndex(embedding_size, "angular")
for i in range(len(corpus_embeddings)):
annoy_index.add_item(i, corpus_embeddings[i])
annoy_index.build(n_trees)
annoy_index.save(annoy_index_path)
else:
# Load Annoy Index from disc
annoy_index = AnnoyIndex(embedding_size, "angular")
annoy_index.load(annoy_index_path)
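# Convert the numpy embeddings to a torch tensor for the exact nearest-neighbor comparison below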
corpus_embeddings = torch.from_numpy(corpus_embeddings)
######### Search in the index ###########
print("Corpus loaded with {} sentences / embeddings".format(len(corpus_sentences)))
while True:
inp_question = input("Please enter a question: ")
start_time = time.time()
question_embedding = model.encode(inp_question)
corpus_ids, scores = annoy_index.get_nns_by_vector(question_embedding, top_k_hits, include_distances=True)
hits = []
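# Annoy's "angular" distance is sqrt(2 - 2 * cos_sim), so we convert it back to a cosine similarity via 1 - d**2 / 2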
for id, score in zip(corpus_ids, scores):
hits.append({"corpus_id": id, "score": 1 - ((score**2) / 2)})
end_time = time.time()
print("Input question:", inp_question)
print("Results (after {:.3f} seconds):".format(end_time - start_time))
for hit in hits[0:top_k_hits]:
print("\t{:.3f}\t{}".format(hit["score"], corpus_sentences[hit["corpus_id"]]))
# Approximate Nearest Neighbor (ANN) is not exact, it might miss entries with high cosine similarity
# Here, we compute the recall of ANN compared to the exact results
correct_hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k_hits)[0]
correct_hits_ids = set([hit["corpus_id"] for hit in correct_hits])
# Compute recall
ann_corpus_ids = set(corpus_ids)
if len(ann_corpus_ids) != len(correct_hits_ids):
print("Approximate Nearest Neighbor returned a different number of results than expected")
recall = len(ann_corpus_ids.intersection(correct_hits_ids)) / len(correct_hits_ids)
print("\nApproximate Nearest Neighbor Recall@{}: {:.2f}".format(top_k_hits, recall * 100))
if recall < 1:
print("Missing results:")
for hit in correct_hits[0:top_k_hits]:
if hit["corpus_id"] not in ann_corpus_ids:
print("\t{:.3f}\t{}".format(hit["score"], corpus_sentences[hit["corpus_id"]]))
print("\n\n========\n")
"""
This script contains an example of how to perform semantic search with Elasticsearch.
As dataset, we use the Quora Duplicate Questions dataset, which contains about 500k questions:
https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs
Questions are indexed to Elasticsearch together with their respective sentence
embeddings.
The script shows results from BM25 as well as from semantic search with
cosine similarity.
You need Elasticsearch up and running, for example using Docker
(https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html).
Further, you need the Python Elasticsearch Client installed: https://elasticsearch-py.readthedocs.io/
As embedding model, we use the SBERT model 'quora-distilbert-multilingual',
which is aligned for 100 languages. I.e., you can type in a question in various languages and it will
return the closest questions in the corpus (questions in the corpus are mainly in English).
"""
from sentence_transformers import SentenceTransformer, util
import os
from elasticsearch import Elasticsearch, helpers
from ssl import create_default_context
import csv
import time
import tqdm.autonotebook
es = Elasticsearch(
hosts=["https://localhost:9200"],
basic_auth=("elastic", os.environ["ELASTIC_PASSWORD"]), # displayed at ES server startup
ssl_context=create_default_context(cafile="http_ca.crt"), # copied from inside ES container
)
model = SentenceTransformer("quora-distilbert-multilingual")
url = "http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"
dataset_path = "quora_duplicate_questions.tsv"
max_corpus_size = 100000
# Download dataset if needed
if not os.path.exists(dataset_path):
print("Download dataset")
util.http_get(url, dataset_path)
# Get all unique sentences from the file
all_questions = {}
with open(dataset_path, encoding="utf8") as fIn:
reader = csv.DictReader(fIn, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
for row in reader:
all_questions[row["qid1"]] = row["question1"]
if len(all_questions) >= max_corpus_size:
break
all_questions[row["qid2"]] = row["question2"]
if len(all_questions) >= max_corpus_size:
break
qids = list(all_questions.keys())
questions = [all_questions[qid] for qid in qids]
# Index data, if the index does not exists
if not es.indices.exists(index="quora"):
try:
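# A dense_vector field with "index": True and a "similarity" metric enables Elasticsearch's approximate kNN search (Elasticsearch 8.x)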
es_index = {
"mappings": {
"properties": {
"question": {"type": "text"},
"question_vector": {"type": "dense_vector", "dims": 768, "index": True, "similarity": "cosine"},
}
}
}
es.indices.create(index="quora", body=es_index)
chunk_size = 500
print("Index data (you can stop it by pressing Ctrl+C once):")
with tqdm.tqdm(total=len(qids)) as pbar:
for start_idx in range(0, len(qids), chunk_size):
end_idx = start_idx + chunk_size
embeddings = model.encode(questions[start_idx:end_idx], show_progress_bar=False)
bulk_data = []
for qid, question, embedding in zip(qids[start_idx:end_idx], questions[start_idx:end_idx], embeddings):
bulk_data.append(
{
"_index": "quora",
"_id": qid,
"_source": {"question": question, "question_vector": embedding},
}
)
helpers.bulk(es, bulk_data)
pbar.update(chunk_size)
except Exception:
print("During index an exception occurred. Continue\n\n")
# Interactive search queries
while True:
inp_question = input("Please enter a question: ")
encode_start_time = time.time()
question_embedding = model.encode(inp_question)
encode_end_time = time.time()
# Lexical search
bm25 = es.search(index="quora", body={"query": {"match": {"question": inp_question}}})
# Semantic search
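# "k" is the number of nearest neighbors to return; "num_candidates" is the number of candidates examined per shard (higher values improve recall at the cost of speed)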
sem_search = es.search(
index="quora",
knn={"field": "question_vector", "query_vector": question_embedding, "k": 10, "num_candidates": 100},
)
print("Input question:", inp_question)
print(
"Computing the embedding took {:.3f} seconds, BM25 search took {:.3f} seconds, semantic search with ES took {:.3f} seconds".format(
encode_end_time - encode_start_time, bm25["took"] / 1000, sem_search["took"] / 1000
)
)
print("BM25 results:")
for hit in bm25["hits"]["hits"][0:5]:
print("\t{}".format(hit["_source"]["question"]))
print("\nSemantic Search results:")
for hit in sem_search["hits"]["hits"][0:5]:
print("\t{}".format(hit["_source"]["question"]))
print("\n\n========\n")
"""
This example uses Approximate Nearest Neighbor Search (ANN) with FAISS (https://github.com/facebookresearch/faiss).
Searching a large corpus with millions of embeddings can be time-consuming. To speed this up,
ANN can index the existing vectors. For a new query vector, this index can be used to find the nearest neighbors.
This nearest neighbor search is not perfect, i.e., it might not perfectly find all top-k nearest neighbors.
In this example, we use FAISS with an inverted file index (IndexIVFFlat). It learns to partition the corpus embeddings
into different clusters (the number is defined by n_clusters). At search time, the matching clusters for a query are found and only the vectors
in these clusters are searched for nearest neighbors.
This script will compare the result from ANN with exact nearest neighbor search and output a Recall@k value
as well as the missing results in the top-k hits list.
See the FAISS repository for how to install FAISS.
As dataset, we use the Quora Duplicate Questions dataset, which contains about 500k questions (only 100k are used):
https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs.
As embedding model, we use the SBERT model 'quora-distilbert-multilingual',
which is aligned for 100 languages. I.e., you can type in a question in various languages and it will
return the closest questions in the corpus (questions in the corpus are mainly in English).
"""
from sentence_transformers import SentenceTransformer, util
import os
import csv
import pickle
import time
import faiss
import numpy as np
model_name = "quora-distilbert-multilingual"
model = SentenceTransformer(model_name)
url = "http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"
dataset_path = "quora_duplicate_questions.tsv"
max_corpus_size = 100000
embedding_cache_path = "quora-embeddings-{}-size-{}.pkl".format(model_name.replace("/", "_"), max_corpus_size)
embedding_size = 768 # Size of embeddings
top_k_hits = 10 # Output k hits
# Defining our FAISS index
# Number of clusters used for faiss. Select a value 4*sqrt(N) to 16*sqrt(N) - https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
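# With max_corpus_size = 100k, sqrt(N) is roughly 316, so the guideline suggests on the order of 1,300 to 5,000 clusters; 1024 is a close, common power-of-two choice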
n_clusters = 1024
# We use Inner Product (dot product) as the index metric. We normalize our vectors to unit length, so the inner product is equal to the cosine similarity
quantizer = faiss.IndexFlatIP(embedding_size)
index = faiss.IndexIVFFlat(quantizer, embedding_size, n_clusters, faiss.METRIC_INNER_PRODUCT)
# Number of clusters to explore at search time. We will search for nearest neighbors in 3 clusters.
index.nprobe = 3
# Check if embedding cache path exists
if not os.path.exists(embedding_cache_path):
# Check if the dataset exists. If not, download and extract
# Download dataset if needed
if not os.path.exists(dataset_path):
print("Download dataset")
util.http_get(url, dataset_path)
# Get all unique sentences from the file
corpus_sentences = set()
with open(dataset_path, encoding="utf8") as fIn:
reader = csv.DictReader(fIn, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
for row in reader:
corpus_sentences.add(row["question1"])
if len(corpus_sentences) >= max_corpus_size:
break
corpus_sentences.add(row["question2"])
if len(corpus_sentences) >= max_corpus_size:
break
corpus_sentences = list(corpus_sentences)
print("Encode the corpus. This might take a while")
corpus_embeddings = model.encode(corpus_sentences, show_progress_bar=True, convert_to_numpy=True)
print("Store file on disc")
with open(embedding_cache_path, "wb") as fOut:
pickle.dump({"sentences": corpus_sentences, "embeddings": corpus_embeddings}, fOut)
else:
print("Load pre-computed embeddings from disc")
with open(embedding_cache_path, "rb") as fIn:
cache_data = pickle.load(fIn)
corpus_sentences = cache_data["sentences"]
corpus_embeddings = cache_data["embeddings"]
### Create the FAISS index
print("Start creating FAISS index")
# First, we need to normalize vectors to unit length
corpus_embeddings = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1)[:, None]
# Then we train the index to find a suitable clustering
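# IndexIVFFlat runs k-means on the embeddings to compute the n_clusters centroids; the index must be trained before vectors can be added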
index.train(corpus_embeddings)
# Finally we add all embeddings to the index
index.add(corpus_embeddings)
######### Search in the index ###########
print("Corpus loaded with {} sentences / embeddings".format(len(corpus_sentences)))
while True:
inp_question = input("Please enter a question: ")
start_time = time.time()
question_embedding = model.encode(inp_question)
# FAISS works with inner product (dot product). When we normalize vectors to unit length, inner product is equal to cosine similarity
question_embedding = question_embedding / np.linalg.norm(question_embedding)
question_embedding = np.expand_dims(question_embedding, axis=0)
# Search in FAISS. It returns a matrix with distances and corpus ids.
distances, corpus_ids = index.search(question_embedding, top_k_hits)
# We extract corpus ids and scores for the first query
hits = [{"corpus_id": id, "score": score} for id, score in zip(corpus_ids[0], distances[0])]
hits = sorted(hits, key=lambda x: x["score"], reverse=True)
end_time = time.time()
print("Input question:", inp_question)
print("Results (after {:.3f} seconds):".format(end_time - start_time))
for hit in hits[0:top_k_hits]:
print("\t{:.3f}\t{}".format(hit["score"], corpus_sentences[hit["corpus_id"]]))
# Approximate Nearest Neighbor (ANN) is not exact, it might miss entries with high cosine similarity
# Here, we compute the recall of ANN compared to the exact results
correct_hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k_hits)[0]
correct_hits_ids = set([hit["corpus_id"] for hit in correct_hits])
ann_corpus_ids = set([hit["corpus_id"] for hit in hits])
if len(ann_corpus_ids) != len(correct_hits_ids):
print("Approximate Nearest Neighbor returned a different number of results than expected")
recall = len(ann_corpus_ids.intersection(correct_hits_ids)) / len(correct_hits_ids)
print("\nApproximate Nearest Neighbor Recall@{}: {:.2f}".format(top_k_hits, recall * 100))
if recall < 1:
print("Missing results:")
for hit in correct_hits[0:top_k_hits]:
if hit["corpus_id"] not in ann_corpus_ids:
print("\t{:.3f}\t{}".format(hit["score"], corpus_sentences[hit["corpus_id"]]))
print("\n\n========\n")
"""
This example uses Approximate Nearest Neighbor Search (ANN) with Hnswlib (https://github.com/nmslib/hnswlib/).
Searching a large corpus with millions of embeddings can be time-consuming. To speed this up,
ANN can index the existing vectors. For a new query vector, this index can be used to find the nearest neighbors.
This nearest neighbor search is not perfect, i.e., it might not perfectly find all top-k nearest neighbors.
In this example, we use Hnswlib: a fast and easy-to-use library with excellent results on common benchmarks.
Usually you can install Hnswlib by running:
pip install hnswlib
For more details, see https://github.com/nmslib/hnswlib/
As dataset, we use the Quora Duplicate Questions dataset, which contains about 500k questions (we only use 100k in this example):
https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs
As embedding model, we use the SBERT model 'quora-distilbert-multilingual',
which is aligned for 100 languages. I.e., you can type in a question in various languages and it will
return the closest questions in the corpus (questions in the corpus are mainly in English).
"""
from sentence_transformers import SentenceTransformer, util
import os
import csv
import pickle
import time
import hnswlib
model_name = "quora-distilbert-multilingual"
model = SentenceTransformer(model_name)
url = "http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"
dataset_path = "quora_duplicate_questions.tsv"
max_corpus_size = 100000
embedding_cache_path = "quora-embeddings-{}-size-{}.pkl".format(model_name.replace("/", "_"), max_corpus_size)
embedding_size = 768 # Size of embeddings
top_k_hits = 10 # Output k hits
# Check if embedding cache path exists
if not os.path.exists(embedding_cache_path):
# Check if the dataset exists. If not, download and extract
# Download dataset if needed
if not os.path.exists(dataset_path):
print("Download dataset")
util.http_get(url, dataset_path)
# Get all unique sentences from the file
corpus_sentences = set()
with open(dataset_path, encoding="utf8") as fIn:
reader = csv.DictReader(fIn, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
for row in reader:
corpus_sentences.add(row["question1"])
if len(corpus_sentences) >= max_corpus_size:
break
corpus_sentences.add(row["question2"])
if len(corpus_sentences) >= max_corpus_size:
break
corpus_sentences = list(corpus_sentences)
print("Encode the corpus. This might take a while")
corpus_embeddings = model.encode(corpus_sentences, show_progress_bar=True, convert_to_numpy=True)
print("Store file on disc")
with open(embedding_cache_path, "wb") as fOut:
pickle.dump({"sentences": corpus_sentences, "embeddings": corpus_embeddings}, fOut)
else:
print("Load pre-computed embeddings from disc")
with open(embedding_cache_path, "rb") as fIn:
cache_data = pickle.load(fIn)
corpus_sentences = cache_data["sentences"]
corpus_embeddings = cache_data["embeddings"]
# Defining our hnswlib index
index_path = "./hnswlib.index"
# We use the 'cosine' space, so hnswlib computes cosine distances directly and no manual normalization of the vectors is required
index = hnswlib.Index(space="cosine", dim=embedding_size)
if os.path.exists(index_path):
print("Loading index...")
index.load_index(index_path)
else:
### Create the HNSWLIB index
print("Start creating HNSWLIB index")
index.init_index(max_elements=len(corpus_embeddings), ef_construction=400, M=64)
# Add all embeddings to the index; hnswlib builds the HNSW graph incrementally, so no separate training step is needed
index.add_items(corpus_embeddings, list(range(len(corpus_embeddings))))
print("Saving index to:", index_path)
index.save_index(index_path)
# Controlling the recall by setting ef:
index.set_ef(50) # ef should always be > top_k_hits
######### Search in the index ###########
print("Corpus loaded with {} sentences / embeddings".format(len(corpus_sentences)))
while True:
inp_question = input("Please enter a question: ")
start_time = time.time()
question_embedding = model.encode(inp_question)
# We use hnswlib knn_query method to find the top_k_hits
corpus_ids, distances = index.knn_query(question_embedding, k=top_k_hits)
# We extract corpus ids and scores for the first query
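# hnswlib's 'cosine' space returns distance = 1 - cosine similarity, so we convert back with 1 - distance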
hits = [{"corpus_id": id, "score": 1 - score} for id, score in zip(corpus_ids[0], distances[0])]
hits = sorted(hits, key=lambda x: x["score"], reverse=True)
end_time = time.time()
print("Input question:", inp_question)
print("Results (after {:.3f} seconds):".format(end_time - start_time))
for hit in hits[0:top_k_hits]:
print("\t{:.3f}\t{}".format(hit["score"], corpus_sentences[hit["corpus_id"]]))
# Approximate Nearest Neighbor (ANN) is not exact, it might miss entries with high cosine similarity
# Here, we compute the recall of ANN compared to the exact results
correct_hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k_hits)[0]
correct_hits_ids = set([hit["corpus_id"] for hit in correct_hits])
ann_corpus_ids = set([hit["corpus_id"] for hit in hits])
if len(ann_corpus_ids) != len(correct_hits_ids):
print("Approximate Nearest Neighbor returned a different number of results than expected")
recall = len(ann_corpus_ids.intersection(correct_hits_ids)) / len(correct_hits_ids)
print("\nApproximate Nearest Neighbor Recall@{}: {:.2f}".format(top_k_hits, recall * 100))
if recall < 1:
print("Missing results:")
for hit in correct_hits[0:top_k_hits]:
if hit["corpus_id"] not in ann_corpus_ids:
print("\t{:.3f}\t{}".format(hit["score"], corpus_sentences[hit["corpus_id"]]))
print("\n\n========\n")
"""
This script contains an example of how to perform semantic search with PyTorch. It performs exact nearest neighbor search.
As dataset, we use the Quora Duplicate Questions dataset, which contains about 500k questions (we only use about 100k):
https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs
As embedding model, we use the SBERT model 'quora-distilbert-multilingual',
which is aligned for 100 languages. I.e., you can type in a question in various languages and it will
return the closest questions in the corpus (questions in the corpus are mainly in English).
Google Colab example: https://colab.research.google.com/drive/12cn5Oo0v3HfQQ8Tv6-ukgxXSmT3zl35A?usp=sharing
"""
from sentence_transformers import SentenceTransformer, util
import os
import csv
import pickle
import time
model_name = "quora-distilbert-multilingual"
model = SentenceTransformer(model_name)
url = "http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"
dataset_path = "quora_duplicate_questions.tsv"
max_corpus_size = 100000
embedding_cache_path = "quora-embeddings-{}-size-{}.pkl".format(model_name.replace("/", "_"), max_corpus_size)
# Check if embedding cache path exists
if not os.path.exists(embedding_cache_path):
# Check if the dataset exists. If not, download and extract
# Download dataset if needed
if not os.path.exists(dataset_path):
print("Download dataset")
util.http_get(url, dataset_path)
# Get all unique sentences from the file
corpus_sentences = set()
with open(dataset_path, encoding="utf8") as fIn:
reader = csv.DictReader(fIn, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
for row in reader:
corpus_sentences.add(row["question1"])
if len(corpus_sentences) >= max_corpus_size:
break
corpus_sentences.add(row["question2"])
if len(corpus_sentences) >= max_corpus_size:
break
corpus_sentences = list(corpus_sentences)
print("Encode the corpus. This might take a while")
corpus_embeddings = model.encode(corpus_sentences, show_progress_bar=True, convert_to_tensor=True)
print("Store file on disc")
with open(embedding_cache_path, "wb") as fOut:
pickle.dump({"sentences": corpus_sentences, "embeddings": corpus_embeddings}, fOut)
else:
print("Load pre-computed embeddings from disc")
with open(embedding_cache_path, "rb") as fIn:
cache_data = pickle.load(fIn)
corpus_sentences = cache_data["sentences"][0:max_corpus_size]
corpus_embeddings = cache_data["embeddings"][0:max_corpus_size]
###############################
print("Corpus loaded with {} sentences / embeddings".format(len(corpus_sentences)))
# Move embeddings to the target device of the model
corpus_embeddings = corpus_embeddings.to(model.device)
while True:
inp_question = input("Please enter a question: ")
start_time = time.time()
question_embedding = model.encode(inp_question, convert_to_tensor=True)
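# util.semantic_search computes the cosine similarity between the query embedding and all corpus embeddings (exact search) and returns the top hits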
hits = util.semantic_search(question_embedding, corpus_embeddings)
end_time = time.time()
hits = hits[0] # Get the hits for the first query
print("Input question:", inp_question)
print("Results (after {:.3f} seconds):".format(end_time - start_time))
for hit in hits[0:5]:
print("\t{:.3f}\t{}".format(hit["score"], corpus_sentences[hit["corpus_id"]]))
print("\n\n========\n")