First add

Commit 0fccd232 authored May 27, 2024 by Rayyyyy
20 changed files
--- a/.gitignore
+++ b/.gitignore
+.idea
+.vscode
+*.pyc
+*.gz
+*.tsv
+tmp_*.py
+/examples/**/output/*
+examples/datasets/*/
+sentence_transformers.egg-info
+dist/
+nr_*/
+/examples/datasets/
+/examples/embeddings/
+/pretrained-models/
+/cheatsheet.txt
+/testsuite.txt
+/TODO.txt
+/docs/_build/
+/docs/make.bat
+/examples/training/quora_duplicate_questions/quora-IR-dataset/
+build
+htmlcov
+.coverage
+.venv
\ No newline at end of file
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
+repos:
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    # Ruff version.
+    rev: v0.3.5
+    hooks:
+      # These hooks are equivalent to running `make quality`
+      - id: ruff
+      - id: ruff-format
+        args: [ --check ]
--- a/LICENSE
+++ b/LICENSE
+                                Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+   1. Definitions.
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+   END OF TERMS AND CONDITIONS
+   APPENDIX: How to apply the Apache License to your work.
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "{}"
+      replaced with your own identifying information. (Don't include
+      the brackets!)  The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+   Copyright 2019 Nils Reimers
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+       http://www.apache.org/licenses/LICENSE-2.0
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
--- a/Makefile
+++ b/Makefile
+quality:
+	ruff check
+	ruff format --check
+style:
+	ruff check --fix
+	ruff format
--- a/ModelZooStd.md
+++ b/ModelZooStd.md
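+# Repository directory structure
+* Apart from the pretrained models, keep the total size of all other files under 50 MB where possible.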
+```
+    Project
+    ├── dataset
+    │   ├── label_1
+    │             ├── xxx.png
+    │             ├── xxx.png
+    │             └── ...
+    │   └── label_2
+    │             ├── xxx.png
+    │             ├── xxx.png
+    │             └── ...  
+    ├── model
+    │   ├── xxx.pth # pretrained model
+    │   ├── xxx.onnx # corresponding ONNX model
+    │   └── xxx.mxr # corresponding MIGraphX offline inference model
+    ├── doc
+    │   ├── icon.png
+    │   ├── xxx.png
+    │   └── xxx.png
+    ├── README.md
+    ├── requirement.txt
+    ├── model.properties
+    ├── code_file1.py
+    ├── code_file2.py
+    ├── code_file3.py
+    ├── dirs
+    │   ├── code_file4.py
+    │   ├── code_file5.py
+    └── └── code_file6.py
+```
+* icon.png: the model's icon file. Icons can be found on [iconfont](https://www.iconfont.cn/?spm=a313x.7781069.1998910419.d4d0a486a).
+![img](./doc/icon.png)
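+* README.md: follow the figure below. The `twelve main headings` are mandatory; headings and content below the second level can be added or removed to suit your project.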
+![img](./doc/readme.png)
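+* requirement.txt: list all model dependencies in this file. Comment out the deep-learning libraries so that the NVIDIA builds are not installed by mistake.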
+```
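+Note: data is generally stored on the company network drive and made available through a download URL or direct read access. The dataset description is provided by the Supercomputing Internet marketplace. When no internal copy exists, give the official download address.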
+```
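+* Provide a mini dataset so that users can get started with the project quickly.
+* model.properties: the fixed template for the `five attributes` is as follows: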
+```
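+    # Unique model identifier
+    modelCode=Project ID
+    # Model name
+    modelName=model name (same as the project name: model name_deep learning framework)
+    # Model description
+    modelDescription=brief description of the model (ideally within 50 characters)
+    # Application scenarios
+    appScenario=inference,training,OCR,government,transportation,retail,finance,healthcare (describe inference/training first, then the algorithm category, then the application industry; separate multiple tags with ASCII commas)
+    # Framework type
+    frameType=paddle (the algorithm framework used; separate multiple tags with ASCII commas)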
+```
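+* Add a LICENSE (required). If the upstream GitHub repository has no LICENSE, write `None LICENSE Currently` in the LICENSE file. Provide CONTRIBUTORS.md only if the upstream GitHub repository has one (optional).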
\ No newline at end of file
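The `model.properties` template in ModelZooStd.md is a plain `key=value` file with `#` comment lines. A minimal sketch of reading such a file, assuming that format holds and that tag fields are comma-separated; the helper name `parse_properties` and the sample values are hypothetical and not part of this repository:

```python
def parse_properties(text):
    """Parse key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line; ignore it
        props[key.strip()] = value.strip()
    return props


# Hypothetical sample values, for illustration only.
SAMPLE = """
# Unique model identifier
modelCode=123
modelName=sentence-bert_pytorch
appScenario=inference,training,NLP
frameType=pytorch
"""

props = parse_properties(SAMPLE)
scenarios = props["appScenario"].split(",")
print(props["modelName"], scenarios)
```

Splitting on the first `=` only (via `partition`) keeps any later `=` characters inside the value intact.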
--- a/NOTICE.txt
+++ b/NOTICE.txt
+-------------------------------------------------------------------------------
+Copyright 2019
+Ubiquitous Knowledge Processing (UKP) Lab
+Technische Universität Darmstadt
+-------------------------------------------------------------------------------
\ No newline at end of file
--- a/README.md
+++ b/README.md
+# Sentence-BERT
+## Paper
+`Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks`
+- https://arxiv.org/pdf/1908.10084.pdf
+## Model architecture
+<div align=center>
+    <img src="./doc/model.png"/>
+</div>
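+## Algorithm
+For each sentence pair, sentence A and sentence B are passed through the network to obtain the embeddings u and v. The similarity of the two embeddings is computed with cosine similarity, and the result is compared with the gold similarity score. This allows the network to be fine-tuned to recognize sentence similarity.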
+<div align=center>
+    <img src="./doc/infer.png"/>
+</div>
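The objective described in this section can be sketched in a few lines. This is a minimal, dependency-free illustration that uses made-up 3-dimensional vectors in place of the real embeddings u and v. The squared-error comparison against the gold score is an assumption (the README only says the result is "compared" with it), and the function names are hypothetical:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def squared_error(u, v, gold_score):
    """Squared difference between the predicted and the gold similarity."""
    return (cosine_similarity(u, v) - gold_score) ** 2


# Toy vectors standing in for the sentence embeddings u and v.
u = [1.0, 0.0, 0.0]
v = [1.0, 1.0, 0.0]

sim = cosine_similarity(u, v)  # 1 / sqrt(2), about 0.7071
loss = squared_error(u, v, gold_score=0.9)
print(round(sim, 4), round(loss, 4))
```

During fine-tuning, a loss of this kind would be averaged over a batch and minimized by gradient descent, which is what pulls the embeddings of similar sentences together.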
+## Environment setup
+1. Adjust the `-v` paths, `docker_name`, and `imageID` to match your environment.
+2. Change line 37 of `transformers/trainer_pt_utils.py` to:
+```python
+try:
+    from torch.optim.lr_scheduler import _LRScheduler as LRScheduler
+except ImportError:
+    from torch.optim.lr_scheduler import LRScheduler as LRScheduler
+```
+<div align=center>
+    <img src="./doc/example.png"/>
+</div>
+### Docker (option 1)
+```bash
+docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-centos7.6-dtk24.04-py310
+docker run -it -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro --shm-size=32G --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name docker_name imageID bash
+cd /your_code_path/sentence-bert_pytorch
+pip install -r requirements.txt
+pip install -U sentence-transformers
+pip install -e .
+```
+### Dockerfile (option 2)
+```bash
+cd ./docker
+cp ../requirements.txt requirements.txt
+docker build --no-cache -t sbert:latest .
+docker run -it -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro --shm-size=32G --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name docker_name imageID bash
+cd /your_code_path/sentence-bert_pytorch
+pip install -r requirements.txt
+pip install -U sentence-transformers
+pip install -e .
+```
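+### Anaconda (option 3)
+1. The special deep-learning libraries that this project needs for DCU cards can be downloaded and installed from the Guanghe developer community (光合开发者社区): https://developer.hpccube.com/tool/
+```
+DTK software stack: dtk24.04
+python: python3.10
+torch: 2.1.0
+```
+Tip: the versions of the DCU-related tools listed above (DTK software stack, python, torch) must match each other exactly.
+2. Install the remaining, non-special libraries directly from requirements.txt: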
+```bash
+cd /your_code_path/sentence-bert_pytorch
+pip install -r requirements.txt
+pip install -U sentence-transformers
+pip install -e .
+```
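+## Dataset
+The model is fine-tuned on a combination of multiple datasets that together contain more than 1 billion sentence pairs. Each dataset is sampled with a weighted probability, which is detailed in the data_config.json file.
+Because the full data is large, only the [Simple Wikipedia Version 1.0](https://cs.pomona.edu/~dkauchak/simplification/) dataset is used here for demonstration. It is already provided in datasets/simple_wikipedia_v1.
+For details on the data, see the model card of [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).
+The dataset directory is structured as follows: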
+```
+├── datasets
+│   ├──tmp.txt
+│   ├──simple_wikipedia_v1
+│       ├──simple_wiki_pair.txt # generated
+│       ├──wiki.simple
+│       └──wiki.unsimplified
+```
+## Training
+Training starts from the pretrained model [MiniLM-L6-H384-uncased](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased); see its model card for details on the pretraining procedure.
+### Single node, multiple cards
+```bash
+bash finetune.sh
+```
+### Single node, single card
+```bash
+python finetune.py
+```
+## Inference
+Download a pretrained model from [pretrained models](https://www.sbert.net/docs/pretrained_models.html).
+```bash
+python infer.py --data_path ./datasets/tmp.txt
+```
+## Result
+<div align=center>
+    <img src="./doc/results.png"/>
+</div>
+### Accuracy
+Not available yet.
+## Application scenarios
+### Algorithm category
+NLP
+### Key application industries
+Education, cybersecurity, government
+## Source repository and issue feedback
+- https://developer.hpccube.com/codes/modelzoo/sentence-bert_pytorch
+## References
+- https://github.com/UKPLab/sentence-transformers
--- a/README_ori.md
+++ b/README_ori.md
+<!--- BADGES: START --->
+[![GitHub - License](https://img.shields.io/github/license/UKPLab/sentence-transformers?logo=github&style=flat&color=green)][#github-license]
+[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/sentence-transformers?logo=pypi&style=flat&color=blue)][#pypi-package]
+[![PyPI - Package Version](https://img.shields.io/pypi/v/sentence-transformers?logo=pypi&style=flat&color=orange)][#pypi-package]
+[![Conda - Platform](https://img.shields.io/conda/pn/conda-forge/sentence-transformers?logo=anaconda&style=flat)][#conda-forge-package]
+[![Conda (channel only)](https://img.shields.io/conda/vn/conda-forge/sentence-transformers?logo=anaconda&style=flat&color=orange)][#conda-forge-package]
+[![Docs - GitHub.io](https://img.shields.io/static/v1?logo=github&style=flat&color=pink&label=docs&message=sentence-transformers)][#docs-package]
+<!---
+[![PyPI - Downloads](https://img.shields.io/pypi/dm/sentence-transformers?logo=pypi&style=flat&color=green)][#pypi-package]
+[![Conda](https://img.shields.io/conda/dn/conda-forge/sentence-transformers?logo=anaconda)][#conda-forge-package]
+--->
+[#github-license]: https://github.com/UKPLab/sentence-transformers/blob/master/LICENSE
+[#pypi-package]: https://pypi.org/project/sentence-transformers/
+[#conda-forge-package]: https://anaconda.org/conda-forge/sentence-transformers
+[#docs-package]: https://www.sbert.net/
+<!--- BADGES: END --->
+# Sentence Transformers: Multilingual Sentence, Paragraph, and Image Embeddings using BERT & Co.
+This framework provides an easy method to compute dense vector representations for **sentences**, **paragraphs**, and **images**. The models are based on transformer networks like BERT / RoBERTa / XLM-RoBERTa etc. and achieve state-of-the-art performance in various tasks. Text is embedded in vector space such that similar texts are close to each other and can be found efficiently using cosine similarity.
+We provide an increasing number of **[state-of-the-art pretrained models](https://www.sbert.net/docs/pretrained_models.html)** for more than 100 languages, fine-tuned for various use-cases.
+Further, this framework allows easy **[fine-tuning of custom embedding models](https://www.sbert.net/docs/training/overview.html)** to achieve maximal performance on your specific task.
+For the **full documentation**, see **[www.SBERT.net](https://www.sbert.net)**.
+The following publications are integrated in this framework:
+- [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084) (EMNLP 2019)
+- [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813) (EMNLP 2020)
+- [Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks](https://arxiv.org/abs/2010.08240) (NAACL 2021)
+- [The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes](https://arxiv.org/abs/2012.14210) (arXiv 2020)
+- [TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning](https://arxiv.org/abs/2104.06979) (arXiv 2021)
+- [BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/abs/2104.08663) (arXiv 2021)
+- [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147) (arXiv 2022)
+## Installation
+We recommend **Python 3.8** or higher, **[PyTorch 1.11.0](https://pytorch.org/get-started/locally/)** or higher and **[transformers v4.32.0](https://github.com/huggingface/transformers)** or higher. The code does **not** work with Python 2.7.
+**Install with pip**
+Install the *sentence-transformers* with `pip`:
+```
+pip install -U sentence-transformers
+```
+**Install with conda**
+You can install the *sentence-transformers* with `conda`:
+```
+conda install -c conda-forge sentence-transformers
+```
+**Install from sources**
+Alternatively, you can also clone the latest version from the [repository](https://github.com/UKPLab/sentence-transformers) and install it directly from the source code:
+````
+pip install -e .
+````
+**PyTorch with CUDA**
+If you want to use a GPU / CUDA, you must install PyTorch with the matching CUDA version. Follow
+[PyTorch - Get Started](https://pytorch.org/get-started/locally/) for further details on how to install PyTorch.
+## Getting Started
+See [Quickstart](https://www.sbert.net/docs/quickstart.html) in our documentation.
+[This example](https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications/computing-embeddings/computing_embeddings.py) shows you how to use an already trained Sentence Transformer model to embed sentences for another task.
+First download a pretrained model.
+````python
+from sentence_transformers import SentenceTransformer
+model = SentenceTransformer("all-MiniLM-L6-v2")
+````
+Then provide some sentences to the model.
+````python
+sentences = [
+    "This framework generates embeddings for each input sentence",
+    "Sentences are passed as a list of string.",
+    "The quick brown fox jumps over the lazy dog.",
+]
+sentence_embeddings = model.encode(sentences)
+````
+And that's it already. We now have a list of numpy arrays with the embeddings.
+````python
+for sentence, embedding in zip(sentences, sentence_embeddings):
+    print("Sentence:", sentence)
+    print("Embedding:", embedding)
+    print("")
+````
+## Pre-Trained Models
+We provide a large list of [Pretrained Models](https://www.sbert.net/docs/pretrained_models.html) for more than 100 languages. Some models are general purpose models, while others produce embeddings for specific use cases. Pre-trained models can be loaded by just passing the model name: `SentenceTransformer('model_name')`.
+[»  Full list of pretrained models](https://www.sbert.net/docs/pretrained_models.html)
+## Training
+This framework allows you to fine-tune your own sentence embedding methods, so that you get task-specific sentence embeddings. You have various options to choose from in order to get perfect sentence embeddings for your specific task.
+See [Training Overview](https://www.sbert.net/docs/training/overview.html) for an introduction on how to train your own embedding models. We provide [various examples](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training) of how to train models on various datasets.
+Some highlights are:
+- Support of various transformer networks including BERT, RoBERTa, XLM-R, DistilBERT, Electra, BART, ...
+- Multi-Lingual and multi-task learning
+- Evaluation during training to find optimal model
+- [20+ loss-functions](https://www.sbert.net/docs/package_reference/losses.html) allowing to tune models specifically for semantic search, paraphrase mining, semantic similarity comparison, clustering, triplet loss, contrastive loss.
+## Performance
+Our models are evaluated extensively on 15+ datasets including challenging domains like Tweets, Reddit, and emails. They achieve by far the **best performance** from all available sentence embedding methods. Further, we provide several **smaller models** that are **optimized for speed**.
+[» Full list of pretrained models](https://www.sbert.net/docs/pretrained_models.html)
+## Application Examples
+You can use this framework for:
+- [Computing Sentence Embeddings](https://www.sbert.net/examples/applications/computing-embeddings/README.html)
+- [Semantic Textual Similarity](https://www.sbert.net/docs/usage/semantic_textual_similarity.html)
+- [Clustering](https://www.sbert.net/examples/applications/clustering/README.html)
+- [Paraphrase Mining](https://www.sbert.net/examples/applications/paraphrase-mining/README.html)
+- [Translated Sentence Mining](https://www.sbert.net/examples/applications/parallel-sentence-mining/README.html)
+- [Semantic Search](https://www.sbert.net/examples/applications/semantic-search/README.html)
+- [Retrieve & Re-Rank](https://www.sbert.net/examples/applications/retrieve_rerank/README.html)
+- [Text Summarization](https://www.sbert.net/examples/applications/text-summarization/README.html)
+- [Multilingual Image Search, Clustering & Duplicate Detection](https://www.sbert.net/examples/applications/image-search/README.html)
+and many more use-cases.
+For all examples, see [examples/applications](https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications).
+## Development setup
+After cloning the repo (or a fork) to your machine, in a virtual environment, run:
+```
+python -m pip install -e ".[dev]"
+pre-commit install
+```
+To test your changes, run:
+```
+pytest
+```
+## Citing & Authors
+If you find this repository helpful, feel free to cite our publication [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084):
+```bibtex
+@inproceedings{reimers-2019-sentence-bert,
+    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+    author = "Reimers, Nils and Gurevych, Iryna",
+    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+    month = "11",
+    year = "2019",
+    publisher = "Association for Computational Linguistics",
+    url = "https://arxiv.org/abs/1908.10084",
+}
+```
+If you use one of the multilingual models, feel free to cite our publication [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813):
+```bibtex
+@inproceedings{reimers-2020-multilingual-sentence-bert,
+    title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
+    author = "Reimers, Nils and Gurevych, Iryna",
+    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
+    month = "11",
+    year = "2020",
+    publisher = "Association for Computational Linguistics",
+    url = "https://arxiv.org/abs/2004.09813",
+}
+```
+Please have a look at [Publications](https://www.sbert.net/docs/publications.html) for our different publications that are integrated into SentenceTransformers.
+Contact person: Tom Aarsen, [tom.aarsen@huggingface.co](mailto:tom.aarsen@huggingface.co)
+https://www.ukp.tu-darmstadt.de/
+Don't hesitate to open an issue if something is broken (and it shouldn't be) or if you have further questions.
+> This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
--- a/datasets/simple_wikipedia_v1/README
+++ b/datasets/simple_wikipedia_v1/README
+The included data set contains 137,362 aligned sentences extracted by pairing Simple English Wikipedia with English Wikipedia.  A complete description of the extraction process can be found in "Simple English Wikipedia: A New Simplification Task", William Coster and David Kauchak (2011).  In Proceedings of ACL (short paper).  The data set contains those sentences with a similarity above 0.50.  Higher precision alignments may be obtained by TF-IDF thresholding at higher levels.
+Two files are included: wiki.normal and wiki.simple.  Each file contains 137,362 lines, and each line corresponds to a sentence.  The nth line/sentence in wiki.normal corresponds to the nth line/sentence in wiki.simple.  Some minimal tokenization has been done to treat most punctuation characters as separate words/tokens.
+For questions regarding the data set, contact David Kauchak at Pomona College.
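Because the nth line of one file aligns with the nth line of the other, sentence pairs can be built by reading both files in lockstep. A minimal sketch, using in-memory streams with made-up sentences in place of the real files. The function name `read_pairs` is hypothetical, and this is not necessarily how `simple_wiki_pair.txt` was generated:

```python
import io


def read_pairs(normal_file, simple_file):
    """Pair the nth line of one stream with the nth line of the other."""
    pairs = []
    for normal, simple in zip(normal_file, simple_file):
        normal, simple = normal.strip(), simple.strip()
        # Skip a pair if either side is empty; alignment is unaffected
        # because both streams still advance together.
        if normal and simple:
            pairs.append((normal, simple))
    return pairs


# Made-up stand-ins for the two aligned files.
normal_lines = io.StringIO("The cat sat upon the mat .\nIt subsequently departed .\n")
simple_lines = io.StringIO("The cat sat on the mat .\nThen it left .\n")

pairs = read_pairs(normal_lines, simple_lines)
print(len(pairs), pairs[0])
```

With real data, the two `io.StringIO` objects would be replaced by file handles opened with `open(path, encoding="utf-8")`, assuming the files are UTF-8 text.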
--- a/datasets/simple_wikipedia_v1/simple_wiki_pair.txt
+++ b/datasets/simple_wikipedia_v1/simple_wiki_pair.txt
--- a/datasets/simple_wikipedia_v1/wiki.simple
+++ b/datasets/simple_wikipedia_v1/wiki.simple
--- a/datasets/simple_wikipedia_v1/wiki.unsimplified
+++ b/datasets/simple_wikipedia_v1/wiki.unsimplified
--- a/datasets/tmp.txt
+++ b/datasets/tmp.txt
+{"sentence1": "不能,这是属于个人所有的固定资产。", "sentence2": "不可以,这是个人固定资产,不能买卖。", "score": 0.96}
+{"sentence1": "不可以,这属于个人固定资产,不能交易。", "sentence2": "不可以,这属于个人固定资产。", "score": 0.99}
+{"sentence1": "活动前一周内是推荐的提交时间段。", "sentence2": "通常建议在活动开始前的一周内提交。", "score": 0.99}
+{"sentence1": "请一直向参观者强调“不要拍照”。", "sentence2": "请提醒参观者“禁止携带相机拍照”。", "score": 0.85}
+{"sentence1": "可以自己选购所需物资。", "sentence2": "可以自行选购,没有限制。", "score": 0.85}
\ No newline at end of file
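Each line of `datasets/tmp.txt` is a standalone JSON object with `sentence1`, `sentence2`, and `score` fields. A minimal sketch of loading such a file, shown on an in-memory stand-in with made-up English sentences. `load_pairs` is a hypothetical helper; how `infer.py` actually reads the file is not shown here:

```python
import io
import json


def load_pairs(lines):
    """Read JSON Lines records into (sentence1, sentence2, score) tuples."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        item = json.loads(line)
        records.append((item["sentence1"], item["sentence2"], float(item["score"])))
    return records


# Made-up stand-in for the file contents: one JSON object per line.
sample = io.StringIO(
    '{"sentence1": "You can buy supplies yourself.", '
    '"sentence2": "You are free to buy them.", "score": 0.85}\n'
    '{"sentence1": "Please submit a week ahead.", '
    '"sentence2": "Submit within the week before.", "score": 0.99}\n'
)

records = load_pairs(sample)
print(records[0][2], len(records))
```

Parsing line by line rather than with a single `json.load` call matters here, because the file as a whole is not one valid JSON document.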
--- a/doc/example.png
+++ b/doc/example.png
--- a/doc/image.png
+++ b/doc/image.png
--- a/doc/infer.png
+++ b/doc/infer.png
--- a/doc/model.png
+++ b/doc/model.png
--- a/doc/results.png
+++ b/doc/results.png
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
+FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-centos7.6-dtk24.04-py310
\ No newline at end of file
--- a/docs/Makefile
+++ b/docs/Makefile
+docs:
+	sphinx-build -c . -a -E .. _build
\ No newline at end of file