<h1 align="center">Chinese Massive Text Embedding Benchmark</h1>
<p align="center">
    <a href="https://www.python.org/">
            <img alt="Build" src="https://img.shields.io/badge/Contribution-Welcome-blue">
    </a>
    <a href="https://huggingface.co/C-MTEB">
        <img alt="Build" src="https://img.shields.io/badge/C_MTEB-🤗-yellow">
    </a>
    <a href="https://www.python.org/">
        <img alt="Build" src="https://img.shields.io/badge/Made with-Python-red">
    </a>
</p>

<h4 align="center">
    <p>
        <a href="#installation">Installation</a> |
        <a href="#evaluation">Evaluation</a> |
        <a href="#leaderboard">Leaderboard</a> |
        <a href="#tasks">Tasks</a> |
        <a href="#acknowledgement">Acknowledgement</a>
    </p>
</h4>


## Installation
C-MTEB is developed on top of [MTEB](https://github.com/embeddings-benchmark/mteb). Install it from PyPI:
```bash
pip install -U C_MTEB
```
Or clone this repo and install it as an editable package:
```bash
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding/C_MTEB
pip install -e .
```

## Evaluation

### Evaluate reranker
```bash
python eval_cross_encoder.py --model_name_or_path BAAI/bge-reranker-base
```

### Evaluate embedding model
* **With our scripts**

You can **reproduce the results of the `baai-general-embedding (bge)` models** using the provided Python script (see [eval_C-MTEB.py](./eval_C-MTEB.py)):
```bash
python eval_C-MTEB.py --model_name_or_path BAAI/bge-large-zh

# for MTEB leaderboard
python eval_MTEB.py --model_name_or_path BAAI/bge-large-en

```

* **With sentence-transformers**

You can use C-MTEB easily in the same way as [MTEB](https://github.com/embeddings-benchmark/mteb).

Note that original sentence-transformers models do not support instruction prefixes,
so this method cannot reproduce the performance of `bge-*` models.

```python
from mteb import MTEB
from C_MTEB import *
from sentence_transformers import SentenceTransformer

# Define the sentence-transformers model name
model_name = "bert-base-uncased"

model = SentenceTransformer(model_name)
evaluation = MTEB(task_langs=['zh'])
results = evaluation.run(model, output_folder=f"zh_results/{model_name}")
```
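The instruction limitation noted above can be worked around with a thin wrapper that prepends the instruction to every input before delegating to the underlying encoder. Below is a minimal, purely illustrative sketch: the `InstructedEncoder` and `DummyEncoder` names and the `"query: "` instruction string are assumptions for demonstration, not part of C-MTEB; for real `bge-*` evaluation, prefer the provided scripts.

```python
class InstructedEncoder:
    """Wrap any object exposing encode(sentences, **kwargs) so that a fixed
    instruction string is prepended to each sentence before encoding."""

    def __init__(self, base_model, instruction):
        self.base_model = base_model
        self.instruction = instruction

    def encode(self, sentences, **kwargs):
        # Prepend the instruction, then delegate to the wrapped encoder.
        return self.base_model.encode(
            [self.instruction + s for s in sentences], **kwargs
        )

# Stand-in encoder for illustration; in practice this would be a
# SentenceTransformer instance. It "embeds" a sentence as its length.
class DummyEncoder:
    def encode(self, sentences, **kwargs):
        return [[float(len(s))] for s in sentences]

wrapped = InstructedEncoder(DummyEncoder(), instruction="query: ")
print(wrapped.encode(["你好"]))  # the prefix is included before encoding
```

In practice, queries and passages may need different treatment (e.g. only queries get the instruction), so a real wrapper would likely distinguish the two.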


* **Using a custom model**

To evaluate a new model, you can load it via sentence-transformers if it is supported there.
Otherwise, implement the model as shown below: an `encode` function that takes a list of sentences as input and returns a list of embeddings (each embedding can be an `np.ndarray`, a `torch.Tensor`, etc.):

```python
class MyModel:
    def encode(self, sentences, batch_size=32, **kwargs):
        """ Returns a list of embeddings for the given sentences.
        Args:
            sentences (`List[str]`): List of sentences to encode
            batch_size (`int`): Batch size for the encoding

        Returns:
            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences
        """
        pass

model = MyModel()
evaluation = MTEB(tasks=["T2Retrieval"])
evaluation.run(model)
```
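The `encode` interface above can be filled in with a concrete toy implementation. The sketch below uses deterministic hashed bag-of-characters vectors purely to show a self-contained, runnable model satisfying the interface; the `HashEmbedder` name and the hashing scheme are illustrative assumptions, and a real evaluation would use an actual embedding model.

```python
import hashlib

import numpy as np

class HashEmbedder:
    """Toy custom model: deterministic hashed bag-of-characters embeddings.
    Purely illustrative -- replace with a real encoder for meaningful scores."""

    def __init__(self, dim=64):
        self.dim = dim

    def encode(self, sentences, batch_size=32, **kwargs):
        embeddings = []
        for text in sentences:
            vec = np.zeros(self.dim, dtype=np.float32)
            for ch in text:
                # Hash each character into one of `dim` buckets and count it.
                idx = int(hashlib.md5(ch.encode("utf-8")).hexdigest(), 16) % self.dim
                vec[idx] += 1.0
            # L2-normalize so cosine similarity reduces to a dot product.
            norm = np.linalg.norm(vec)
            embeddings.append(vec / norm if norm > 0 else vec)
        return embeddings

model = HashEmbedder()
embs = model.encode(["深度学习", "机器学习"])
print(len(embs), embs[0].shape)
```

An instance of this class can be passed directly to `MTEB(...).run(model)` in place of `MyModel` above.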

## Acknowledgement

We thank the [Massive Text Embedding Benchmark](https://github.com/embeddings-benchmark/mteb) for its great toolkit and the Chinese NLP community for the open-source datasets.


## Citation

If you find this repository useful, please consider citing it:

```
@misc{c-pack,
      title={C-Pack: Packaged Resources To Advance General Chinese Embedding},
      author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff},
      year={2023},
      eprint={2309.07597},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```