The Needle In A Haystack test (inspired by [NeedleInAHaystack](https://github.com/gkamradt/LLMTest_NeedleInAHaystack/blob/main/LLMNeedleHaystackTester.py)) is an evaluation method that randomly inserts key information into long texts to form prompts for large language models (LLMs). The test aims to detect whether large models can extract such key information from extensive texts, thereby assessing the models' capabilities in processing and understanding long documents.
## Task Overview

Within the `NeedleBench` framework of `OpenCompass`, we have designed a series of increasingly challenging test scenarios to comprehensively evaluate the models' abilities in long-text information extraction and reasoning:

- **Single-Needle Retrieval Task (S-RT)**: Assesses an LLM's ability to extract a single key piece of information from a long text, testing its precision in recalling specific details within broad narratives. This corresponds to the **original Needle In A Haystack test** setup.
- **Multi-Needle Retrieval Task (M-RT)**: Explores an LLM's capability to retrieve multiple related pieces of information from long texts, simulating real-world scenarios of complex queries on comprehensive documents.
- **Multi-Needle Reasoning Task (M-RS)**: Evaluates an LLM's long-text abilities by extracting and utilizing multiple key pieces of information, requiring the model to have a comprehensive understanding of each key information fragment.
- **Ancestral Trace Challenge (ATC)**: Uses "relational needles" to test an LLM's ability to handle multi-layer logical challenges in real long texts. In the ATC task, a series of logical reasoning questions are used to test the model's memory and analytical skills for every detail in the text. For this task, we remove the irrelevant-text (haystack) setting: all of the text is critical information, and the LLM must use all of the content and reason over it accurately to answer the questions.

## Dataset Introduction

The `Skywork/ChineseDomainModelingEval` dataset includes high-quality Chinese articles published between September and October 2023, covering multiple domains. These articles ensure a fair and challenging benchmark for testing.

## File Description

The dataset includes files specific to various domains:

- `zh_finance.jsonl` - Finance
- `zh_game.jsonl` - Gaming
- `zh_government.jsonl` - Government
- `zh_movie.jsonl` - Movies
- `zh_tech.jsonl` - Technology
- `zh_general.jsonl` - General

These files are used to assess the LLM's understanding of different specific domains.
### Evaluation Steps
1. Download the dataset from [Skywork/ChineseDomainModelingEval](https://huggingface.co/datasets/Skywork/ChineseDomainModelingEval/tree/main).
2. Place the downloaded files in `opencompass/data/CDME/`. The expected file structure in the `CDME` directory is as follows:
```
opencompass/
├── configs
├── docs
├── data
│   └── CDME
│       ├── processed
│       ├── README.md
│       ├── zh_finance.jsonl
│       ├── zh_game.jsonl
│       ├── zh_general.jsonl
│       ├── zh_government.jsonl
│       ├── zh_movie.jsonl
│       └── zh_tech.jsonl
├── LICENSE
├── opencompass
├── outputs
├── run.py
├── more...
```
### Environment Setup
1. Download the dataset from [here](https://github.com/open-compass/opencompass/files/14741330/needlebench.zip).
2. Place the downloaded files in the `opencompass/data/needlebench/` directory.
In the latest version, datasets are no longer generated by running scripts but dynamically defined and loaded through configuration files. Users need to specify dataset parameters in the configuration file according to their needs, offering greater flexibility and customization options.
#### Dataset Configuration Example
Here is an example of dataset configuration, showing how to define a dataset in the `configs/datasets/cdme/cdme8k.py` configuration file. This example demonstrates a Chinese dataset configuration with a length of 8000 tokens:
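The sketch below is a minimal illustration of what such a configuration entry can look like, built only from the parameters listed afterwards; the concrete values (needle text, retrieval question, tokenizer, repeat count, and buffer size) are placeholders rather than the exact contents of `configs/datasets/cdme/cdme8k.py`.

```python
# Hypothetical sketch of a single dataset entry in configs/datasets/cdme/cdme8k.py.
# Field names follow the parameter list below; all values are illustrative placeholders.
cdme_8k_example = dict(
    abbr='CDME_8k',                      # abbreviation of the dataset
    type='CDMEDataset',                  # dataset type (the real config references the dataset class itself)
    path='./data/CDME',                  # path to the dataset files
    length=8000,                         # context length in tokens
    depth=50,                            # depth percentage at which the needle is inserted
    tokenizer_model='gpt-4',             # tokenizer used to measure lengths (assumption)
    file_list=['zh_finance.jsonl'],      # data source files
    num_repeats_per_file=10,             # number of repeats per file (assumption)
    length_buffer=200,                   # length buffer (assumption)
    guide=True,                          # whether it is a guided dataset
    language='Chinese',                  # language of the dataset
    needle='(the key sentence hidden in the context)',
    retrieval_question='(the question that asks the model to recall the needle)',
    reader_cfg=dict(),                   # reading configuration (omitted here)
    infer_cfg=dict(),                    # inference configuration (omitted here)
    eval_cfg=dict(),                     # evaluation configuration (omitted here)
)

cdme_datasets = [cdme_8k_example]
```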
In this configuration, the main parameters include:
- `abbr`: Abbreviation of the dataset.
- `type`: Dataset type.
- `path`: Path to the dataset files.
- `length`: Context length in tokens.
- `depth`: Depth percentage of the document.
- `tokenizer_model`: Tokenizer model used.
- `file_list`: List of data source files.
- `num_repeats_per_file`: Number of repeats per file.
- `length_buffer`: Length buffer.
- `guide`: Whether it is a guided dataset.
- `language`: Language of the dataset.
- `needle`: Specific text to find in the dataset (the "needle").
- `retrieval_question`: Question used to prompt the model for retrieval.
- `reader_cfg`, `infer_cfg`, `eval_cfg`: Configurations for reading, inference, and evaluation, respectively.
We have pre-configured datasets for common context lengths (4k, 8k, 32k, 128k, 200k, 1000k) in `configs/datasets/needlebench`, and by defining the related parameters in the configuration files you can flexibly create datasets that meet your needs. Configuration files therefore offer a highly customizable and scalable way to manage the generation and use of datasets.
### Multi-Needle Needle In A Haystack Test

The latest version introduces the multi-needle Needle In A Haystack test, allowing multiple different needles (text snippets) to be inserted into the same dataset. These needles are inserted in sequence according to a given depth parameter. Compared to the single-needle test, the multi-needle test provides a more complex data processing scenario.

#### Multi-Needle Dataset Configuration Example

Here is an example of configuring a multi-needle dataset, showing how to define a multi-needle dataset in the `configs/datasets/cdme/multi_needle/cdme8k_cot3_italy.py` configuration file. This example demonstrates a dataset configuration with three needles:
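The sketch below only illustrates the shape of such a configuration: the `needles`, `diff`, and `keyword` fields are the ones documented after the block, while the needle strings, question, and remaining values are placeholders rather than the actual contents of `cdme8k_cot3_italy.py`.

```python
# Hypothetical sketch of a three-needle dataset entry; all values are placeholders.
cdme_multi_needle_example = dict(
    abbr='CDME_8k_cot3',
    path='./data/CDME',
    length=8000,
    depth=20,                  # insertion depth (percent) of the first needle
    language='Chinese',
    needles=[                  # the three needles, inserted in sequence
        'Needle 1: first key fact.',
        'Needle 2: second key fact that builds on the first.',
        'Needle 3: third key fact that completes the chain.',
    ],
    diff=10,                   # depth increment of each later needle relative to the first
    keyword='key fact',        # keyword used for score correction during evaluation
    retrieval_question='(a question whose answer requires combining all three needles)',
)
```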
In this configuration, in addition to the standard parameters, the main new parameters include:

- `needles`: A list containing multiple strings, each representing a needle to be inserted.
- `diff`: Defines the depth increment for subsequent needles relative to the first needle.
- `keyword`: A keyword used for score correction during the evaluation process.

#### Change in Scoring Mechanism

In the source code of `opencompass/datasets/cdme/cdme_multi.py`, the scoring mechanism for multi-needle datasets differs. The following code segment has been added to adjust the scores based on the `keyword` in the predictions:
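A sketch of the added logic, reconstructed from the description in the next paragraph (the wrapper function name is ours, and the exact variable names and any logging in `cdme_multi.py` may differ):

```python
def adjust_score(score: float, prediction: str, keyword: str) -> float:
    """Keyword-based score correction for multi-needle evaluation (reconstructed sketch)."""
    if keyword in prediction:
        return 100          # keyword found: award the full score
    return 0.2 * score      # keyword missing: keep only 20% of the edit-distance-based score
```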
This code means that if the keyword is present in the prediction, it will be awarded a high score (e.g., 100). If not, the score will be significantly reduced (20% of the original score). This scoring mechanism places more emphasis on the accuracy of keywords, supplementing the traditional scoring methods.

### Evaluation

#### Evaluating with the `internlm` Model

For example, to evaluate using the `internlm` model, the following command can be used:
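A command along these lines can be used; the configuration file name `configs/eval_needleinahaystack.py` is an assumption based on the naming of the other evaluation configs in this document, and the Slurm-related flags are the ones explained in the next paragraph:

```bash
python run.py configs/eval_needleinahaystack.py --slurm -p partition_name -q auto --max-num-workers 32
```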
This command initiates the evaluation process, where the model attempts to find the specified "needle" in the generated dataset. The parameters `-p partition_name -q auto` and `--max-num-workers 32` specify the Slurm queue and the maximum number of worker processes, respectively.

#### Large-Scale Text Evaluation with `LMDeploy`

When evaluating especially long texts (e.g., 200k tokens), conventional methods might lead to memory overload. In such cases, quantized models can be used for evaluation, which can be achieved with the [LMDeploy](https://github.com/InternLM/lmdeploy) tool.

Be sure to install the `LMDeploy` tool before starting the evaluation:

```bash
pip install lmdeploy
```

Detailed information about installing and configuring `LMDeploy` can be found on its GitHub page. Once installed, the `TurboMindModel` defined in the `configs/eval_needleinahaystack_turbomind.py` configuration file can be used for evaluation.

Below is an example configuration in the `configs/eval_needleinahaystack_turbomind.py` file:
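A simplified sketch of what such a configuration can look like. The `TurboMindModel` type comes from this document, while the import paths, the dataset config name, the model path, and the remaining field values are illustrative assumptions:

```python
# Hypothetical sketch of configs/eval_needleinahaystack_turbomind.py.
from mmengine.config import read_base
from opencompass.models import TurboMindModel

with read_base():
    # Long-text dataset definitions (config name is an assumption).
    from .datasets.cdme.cdme200k import cdme_datasets

datasets = [*cdme_datasets]

models = [
    dict(
        type=TurboMindModel,
        abbr='internlm-chat-20b-turbomind',   # model alias (assumption)
        path='./turbomind',                   # path to the converted TurboMind model (assumption)
        max_seq_len=201000,                   # large enough to hold a ~200k-token context
        max_out_len=100,
        batch_size=8,
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]
```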
In this configuration, the `TurboMindModel` combines the functionality of `LMDeploy`, making it suitable for handling large-scale text datasets and effectively reducing memory usage.

### Example of Evaluation

#### Evaluating using the `InternLM2-7B` model deployed with `LMDeploy`

For instance, to evaluate all tasks in NeedleBench-4K using the `InternLM2-7B` model deployed with `LMDeploy`, use the following command, which calls the predefined model and dataset configuration files without needing to write additional configuration files:
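A command along these lines starts the full NeedleBench-4K run. The model alias and the summarizer name below are assumptions based on the naming conventions used elsewhere in this document, so check the predefined configs for the exact names:

```bash
python run.py \
    --datasets needlebench_4k \
    --models lmdeploy_internlm2_chat_7b \
    --summarizer needlebench/needlebench_4k_summarizer \
    --slurm -p partition_name -q auto --max-num-workers 32
```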
This command initiates the evaluation process, with the parameters `-p partition_name -q auto` and `--max-num-workers 32` used to specify the Slurm partition name and the maximum number of worker processes.

If you only want to test the original Needle In A Haystack task setup, you can change the dataset parameter to `needlebench_single_4k`, such as:
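For example, keeping the same placeholder model and summarizer names as above:

```bash
python run.py \
    --datasets needlebench_single_4k \
    --models lmdeploy_internlm2_chat_7b \
    --summarizer needlebench/needlebench_4k_summarizer \
    --slurm -p partition_name -q auto --max-num-workers 32
```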
You can also choose sub-datasets, for example by changing the `--datasets` parameter to `needlebench_single_4k/needlebench_zh_datasets` to test only the Chinese version of the single-needle task; the part after the `/` selects the sub-dataset. The available sub-dataset variables can be found in `configs/datasets/needlebench/needlebench_4k/needlebench_single_4k.py`, such as:
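A hypothetical sketch of how such sub-dataset variables are typically laid out in that file; only `needlebench_zh_datasets` is named in this document, and the English counterpart, the loop over depths, and the dict fields are assumptions:

```python
# Hypothetical layout of the sub-dataset variables; names and fields are illustrative.
needlebench_zh_datasets = []   # Chinese single-needle dataset configs
needlebench_en_datasets = []   # English single-needle dataset configs (assumed counterpart)

for depth_percent in range(0, 101, 10):   # needle placed at 0%, 10%, ..., 100% depth
    needlebench_zh_datasets.append(
        dict(abbr=f'needlebench_zh_4k_depth{depth_percent}', depth=depth_percent))
    needlebench_en_datasets.append(
        dict(abbr=f'needlebench_en_4k_depth{depth_percent}', depth=depth_percent))
```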
#### Evaluating Other `Huggingface` Models

For other models, we recommend writing an additional configuration file to modify the model's `max_seq_len` and `max_out_len` parameters so that the model can receive the complete long text content, as we have prepared in the `configs/eval_needlebench.py` file:
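The sketch below outlines such a configuration. The commented alternatives and the `max_seq_len`/`max_out_len` overrides come from this document, while the `read_base` scaffolding, the import paths, the model alias, and the `work_dir` value are assumptions; the shipped `configs/eval_needlebench.py` may differ in these details.

```python
from mmengine.config import read_base

with read_base():
    # Evaluate needlebench_4k, adjust the configuration to use 8k, 32k, 128k, 200k, or 1000k if necessary.
    from .datasets.needlebench.needlebench_4k.needlebench_4k import needlebench_datasets
    from .summarizers.needlebench import needlebench_4k_summarizer as summarizer

    # only eval original "needle in a haystack test" in needlebench_4k
    # (the corresponding single-needle dataset import is not reproduced in this sketch)

    # or eval the Ancestral Trace Challenge (ATC):
    # from .datasets.needlebench.atc.atc_choice_50 import needlebench_datasets
    # from .summarizers.needlebench import atc_summarizer_50 as summarizer

    # Model to evaluate; the alias below is an assumption.
    from .models.hf_internlm.lmdeploy_internlm2_chat_7b import models as internlm2_chat_7b

datasets = needlebench_datasets
models = internlm2_chat_7b

for m in models:
    m['max_seq_len'] = 32768  # Ensure InternLM2-7B can receive the full long text; adjust for other models based on their supported maximum sequence length.
    m['max_out_len'] = 2000   # Ensure the model can return a complete response in the multi-needle retrieval tasks.

work_dir = './outputs/needlebench'  # output directory (assumption)
```

The evaluation is then launched by pointing `run.py` at this file, for example `python run.py configs/eval_needlebench.py --slurm -p partition_name -q auto --max-num-workers 32`.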
Note that, at this point, we do not need to pass in the `--datasets`, `--models`, or `--summarizer` parameters, as these configurations are already defined in the config file. You can manually adjust the `--max-partition-size` setting to achieve the best task slicing strategy and improve evaluation efficiency.

### Score Calculation Method

In the `CDMEEvaluator` class, we use two main methods to calculate scores: `levenshtein_distance` and `score`. Here are detailed explanations and implementations of these methods.

#### Levenshtein Distance

Levenshtein distance is a measure of the difference between two strings. It represents the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other.
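For reference, a standard dynamic-programming implementation of this measure looks like the following (the version inside `CDMEEvaluator` may differ in details):

```python
def levenshtein_distance(s1: str, s2: str) -> int:
    """Minimum number of single-character edits needed to turn s1 into s2."""
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1)
    if len(s2) == 0:
        return len(s1)
    previous_row = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1       # insert c2 into s1
            deletions = current_row[j] + 1             # delete c1 from s1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]
```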
The `score` calculation method accepts two lists of predictions and references and calculates the edit distance and score for each pair of prediction and reference.

```python
def score(self, predictions, references):
    if len(predictions) != len(references):
        return {"error": "predictions and references have different lengths"}
    # ... the remainder strips whitespace, computes the Levenshtein distance,
    # and derives the per-prediction and average scores described below.
```

This scoring method first removes all whitespace characters from both predictions and references and then calculates the Levenshtein distance between them. The score is calculated as 100 minus the percentage loss based on edit distance. Finally, it returns detailed scores for each prediction and the average score overall.

### Visualization

The `tools_needleinahaystack.py` script can be used to visualize CSV files. This script supports specifying one or more CSV file paths through the `--path` parameter and can use the `--dataset_length` parameter to specify the length of the dataset.

To specify the dataset length for visualization, which is used for generating titles in the visualization charts:
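For example, assuming the script lives under `tools/` and substituting the path of your own summary CSV:

```bash
python tools/tools_needleinahaystack.py \
    --path outputs/default/<timestamp>/summary/summary.csv \
    --dataset_length 8K
```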
Currently, this approach only supports the CDME dataset, and we welcome community contributions for more datasets.

In the latest code version, result visualization is built into the `summarizer` implementation. You can find the corresponding visualizations in the `plots` directory of the respective output folder, eliminating the need to manually visualize scores across various depths and lengths.
If you use this method, please cite as follows:
```bibtex
@misc{2023opencompass,
title={OpenCompass: A Universal Evaluation Platform for Foundation Models},