# Chinese Massive Text Embedding Benchmark


Installation | Evaluation | Leaderboard | Tasks | Acknowledgement

## Installation

C-MTEB is developed based on [MTEB](https://github.com/embeddings-benchmark/mteb).

```
pip install -U C_MTEB
```

Or clone this repo and install it as editable:

```
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding/C_MTEB
pip install -e .
```

## Evaluation

### Evaluate a reranker

```bash
python eval_cross_encoder.py --model_name_or_path BAAI/bge-reranker-base
```

### Evaluate an embedding model

* **With our scripts**

  You can **reproduce the results of `baai-general-embedding (bge)`** using the provided Python script (see [eval_C-MTEB.py](./eval_C-MTEB.py)):

  ```bash
  python eval_C-MTEB.py --model_name_or_path BAAI/bge-large-zh

  # for the MTEB leaderboard
  python eval_MTEB.py --model_name_or_path BAAI/bge-large-en
  ```

* **With sentence-transformers**

  You can use C-MTEB in the same way as [MTEB](https://github.com/embeddings-benchmark/mteb). Note that original sentence-transformers models do not support instructions, so this method cannot measure the performance of `bge-*` models.

  ```python
  from mteb import MTEB
  from C_MTEB import *
  from sentence_transformers import SentenceTransformer

  # Define the sentence-transformers model name
  model_name = "bert-base-uncased"

  model = SentenceTransformer(model_name)
  evaluation = MTEB(task_langs=['zh'])
  results = evaluation.run(model, output_folder=f"zh_results/{model_name}")
  ```

* **Using a custom model**

  To evaluate a new model, you can load it via sentence-transformers if it is supported there. Otherwise, implement a class with an `encode` method that takes a list of sentences as input and returns a list of embeddings (each embedding can be an `np.ndarray`, a `torch.Tensor`, etc.):

  ```python
  class MyModel():
      def encode(self, sentences, batch_size=32, **kwargs):
          """Returns a list of embeddings for the given sentences.

          Args:
              sentences (`List[str]`): List of sentences to encode
              batch_size (`int`): Batch size for the encoding

          Returns:
              `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences
          """
          pass

  model = MyModel()
  evaluation = MTEB(tasks=["T2Retrieval"])
  evaluation.run(model)
  ```

## Acknowledgement

We thank [Massive Text Embedding Benchmark](https://github.com/embeddings-benchmark/mteb) for the great tool, and the Chinese NLP community for the open-source datasets.

## Citation

If you find this repository useful, please consider citing:

```
@misc{c-pack,
      title={C-Pack: Packaged Resources To Advance General Chinese Embedding},
      author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff},
      year={2023},
      eprint={2309.07597},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
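As a minimal, runnable sketch of the custom-model `encode` interface described under Evaluation, the toy class below maps each sentence to a deterministic pseudo-random vector. The `HashingModel` name and its dummy embeddings are invented for illustration only — a real model would return learned embeddings — but the call signature and return type match what the benchmark expects.

```python
import numpy as np


class HashingModel:
    """Toy stand-in for an embedding model: returns a deterministic
    pseudo-random vector per sentence. It only demonstrates the
    `encode` interface; the embeddings carry no semantic meaning."""

    def __init__(self, dim=64):
        self.dim = dim

    def encode(self, sentences, batch_size=32, **kwargs):
        # Returns a list of np.float32 vectors, one per input sentence.
        embeddings = []
        for sent in sentences:
            # Seed a generator from the sentence so the same input
            # always yields the same vector within a process.
            rng = np.random.default_rng(abs(hash(sent)) % (2**32))
            embeddings.append(rng.standard_normal(self.dim).astype(np.float32))
        return embeddings
```

Swapping `HashingModel` for a real encoder (e.g. one wrapping a transformer) requires no change to the evaluation call: `MTEB(...).run(model)` only relies on `encode` accepting a list of strings and returning a list of vectors.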