The Needle In A Haystack test (inspired by [NeedleInAHaystack](https://github.com/gkamradt/LLMTest_NeedleInAHaystack/blob/main/LLMNeedleHaystackTester.py)) is an evaluation method that randomly inserts key information into long texts to form prompts for large language models (LLMs). The test aims to detect whether large models can extract such key information from extensive texts, thereby assessing the models' capabilities in processing and understanding long documents.
## Task Overview

Within the `NeedleBench` framework of `OpenCompass`, we have designed a series of increasingly challenging test scenarios to comprehensively evaluate the models' abilities in long-text information extraction and reasoning:

- **Single-Needle Retrieval Task (S-RT)**: Assesses an LLM's ability to extract a single key piece of information from a long text, testing its precision in recalling specific details within broad narratives. This corresponds to the **original Needle In A Haystack test** setup.
- **Multi-Needle Retrieval Task (M-RT)**: Explores an LLM's capability to retrieve multiple related pieces of information from long texts, simulating real-world scenarios of complex queries on comprehensive documents.
- **Multi-Needle Reasoning Task (M-RS)**: Evaluates an LLM's long-text abilities by extracting and utilizing multiple key pieces of information, requiring the model to have a comprehensive understanding of each key information fragment.
- **Ancestral Trace Challenge (ATC)**: Uses "relational needles" to test an LLM's ability to handle multi-layer logical challenges in real long texts. In the ATC task, a series of logical reasoning questions are used to test the model's memory and analytical skills for every detail in the text. For this task, we remove the irrelevant-text (haystack) setting: all of the text is critical information, and the LLM must use all of the content and reason over it accurately to answer the questions.

## Dataset Introduction

The `Skywork/ChineseDomainModelingEval` dataset includes high-quality Chinese articles published between September and October 2023, covering multiple domains. These articles ensure a fair and challenging benchmark for testing.

## File Description

The dataset includes files specific to various domains:

- `zh_finance.jsonl` - Finance
- `zh_game.jsonl` - Gaming
- `zh_government.jsonl` - Government
- `zh_movie.jsonl` - Movies
- `zh_tech.jsonl` - Technology
- `zh_general.jsonl` - General

These files are used to assess the LLM's understanding of different specific domains.
### Evaluation Steps
1. Download the dataset from [Skywork/ChineseDomainModelingEval](https://huggingface.co/datasets/Skywork/ChineseDomainModelingEval/tree/main).
2. Place the downloaded files in `opencompass/data/CDME/`. The expected file structure in the `CDME` directory is as follows:
```
opencompass/
├── configs
├── docs
├── data
│   └── CDME
│       ├── processed
│       ├── README.md
│       ├── zh_finance.jsonl
│       ├── zh_game.jsonl
│       ├── zh_general.jsonl
│       ├── zh_government.jsonl
│       ├── zh_movie.jsonl
│       └── zh_tech.jsonl
├── LICENSE
├── opencompass
├── outputs
├── run.py
├── more...
```
### Environment Setup
1. Download the dataset from [here](https://github.com/open-compass/opencompass/files/14741330/needlebench.zip).
2. Place the downloaded files in the `opencompass/data/needlebench/` directory.
In the latest version, datasets are no longer generated by running scripts but dynamically defined and loaded through configuration files. Users need to specify dataset parameters in the configuration file according to their needs, offering greater flexibility and customization options.
#### Dataset Configuration Example
Here is an example of dataset configuration, showing how to define a dataset in the `configs/datasets/cdme/cdme8k.py` configuration file. This example demonstrates a Chinese dataset configuration with a length of 8000 tokens:
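The sketch below is a minimal illustration of what such a configuration entry can look like, built only from the parameters listed afterwards; the concrete values (needle text, retrieval question, tokenizer, repeat count, and buffer size) are placeholders rather than the exact contents of `configs/datasets/cdme/cdme8k.py`.

```python
# Hypothetical sketch of a single dataset entry in configs/datasets/cdme/cdme8k.py.
# Field names follow the parameter list below; all values are illustrative placeholders.
cdme_8k_example = dict(
    abbr='CDME_8k',                      # abbreviation of the dataset
    type='CDMEDataset',                  # dataset type (the real config references the dataset class itself)
    path='./data/CDME',                  # path to the dataset files
    length=8000,                         # context length in tokens
    depth=50,                            # depth percentage at which the needle is inserted
    tokenizer_model='gpt-4',             # tokenizer used to measure lengths (assumption)
    file_list=['zh_finance.jsonl'],      # data source files
    num_repeats_per_file=10,             # number of repeats per file (assumption)
    length_buffer=200,                   # length buffer (assumption)
    guide=True,                          # whether it is a guided dataset
    language='Chinese',                  # language of the dataset
    needle='(the key sentence hidden in the context)',
    retrieval_question='(the question that asks the model to recall the needle)',
    reader_cfg=dict(),                   # reading configuration (omitted here)
    infer_cfg=dict(),                    # inference configuration (omitted here)
    eval_cfg=dict(),                     # evaluation configuration (omitted here)
)

cdme_datasets = [cdme_8k_example]
```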
In this configuration, the main parameters include:
- `abbr`: Abbreviation of the dataset.
- `type`: Dataset type.
- `path`: Path to the dataset files.
- `length`: Context length in tokens.
- `depth`: Depth percentage of the document.
- `tokenizer_model`: Tokenizer model used.
- `file_list`: List of data source files.
- `num_repeats_per_file`: Number of repeats per file.
- `length_buffer`: Length buffer.
- `guide`: Whether it is a guided dataset.
- `language`: Language of the dataset.
- `needle`: Specific text to find in the dataset (the "needle").
- `retrieval_question`: Question used to prompt the model for retrieval.
- `reader_cfg`, `infer_cfg`, `eval_cfg`: Configurations for reading, inference, and evaluation, respectively.
We have pre-configured datasets for common context lengths (4k, 8k, 32k, 128k, 200k, 1000k) in `configs/datasets/needlebench`, and by defining the related parameters in the configuration files you can flexibly create datasets that meet your needs. Configuration files therefore offer a highly customizable and scalable way to manage the generation and use of datasets.
### Multi-Needle Needle In A Haystack Test

The latest version introduces the multi-needle Needle In A Haystack test, allowing multiple different needles (text snippets) to be inserted into the same dataset. These needles are inserted in sequence according to a given depth parameter. Compared to the single-needle test, the multi-needle test provides a more complex data processing scenario.

#### Multi-Needle Dataset Configuration Example

Here is an example of configuring a multi-needle dataset, showing how to define a multi-needle dataset in the `configs/datasets/cdme/multi_needle/cdme8k_cot3_italy.py` configuration file. This example demonstrates a dataset configuration with three needles:
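The sketch below only illustrates the shape of such a configuration: the `needles`, `diff`, and `keyword` fields are the ones documented after the block, while the needle strings, question, and remaining values are placeholders rather than the actual contents of `cdme8k_cot3_italy.py`.

```python
# Hypothetical sketch of a three-needle dataset entry; all values are placeholders.
cdme_multi_needle_example = dict(
    abbr='CDME_8k_cot3',
    path='./data/CDME',
    length=8000,
    depth=20,                  # insertion depth (percent) of the first needle
    language='Chinese',
    needles=[                  # the three needles, inserted in sequence
        'Needle 1: first key fact.',
        'Needle 2: second key fact that builds on the first.',
        'Needle 3: third key fact that completes the chain.',
    ],
    diff=10,                   # depth increment of each later needle relative to the first
    keyword='key fact',        # keyword used for score correction during evaluation
    retrieval_question='(a question whose answer requires combining all three needles)',
)
```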
In this configuration, in addition to the standard parameters, the main new parameters include:

- `needles`: A list containing multiple strings, each representing a needle to be inserted.
- `diff`: Defines the depth increment for subsequent needles relative to the first needle.
- `keyword`: A keyword used for score correction during the evaluation process.

#### Change in Scoring Mechanism

In the source code of `opencompass/datasets/cdme/cdme_multi.py`, the scoring mechanism for multi-needle datasets differs. The following code segment has been added to adjust the scores based on the `keyword` in the predictions:
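A sketch of the added logic, reconstructed from the description in the next paragraph (the wrapper function name is ours, and the exact variable names and any logging in `cdme_multi.py` may differ):

```python
def adjust_score(score: float, prediction: str, keyword: str) -> float:
    """Keyword-based score correction for multi-needle evaluation (reconstructed sketch)."""
    if keyword in prediction:
        return 100          # keyword found: award the full score
    return 0.2 * score      # keyword missing: keep only 20% of the edit-distance-based score
```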
This code means that if the keyword is present in the prediction, it will be awarded a high score (e.g., 100). If not, the score will be significantly reduced (20% of the original score). This scoring mechanism places more emphasis on the accuracy of keywords, supplementing the traditional scoring methods.

### Evaluation

#### Evaluating with the `internlm` Model

For example, to evaluate using the `internlm` model, the following command can be used:
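A command along these lines can be used; the configuration file name `configs/eval_needleinahaystack.py` is an assumption based on the naming of the other evaluation configs in this document, and the Slurm-related flags are the ones explained in the next paragraph:

```bash
python run.py configs/eval_needleinahaystack.py --slurm -p partition_name -q auto --max-num-workers 32
```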
This command initiates the evaluation process, where the model attempts to find the specified "needle" in the generated dataset. The parameters `-p partition_name -q auto` and `--max-num-workers 32` specify the Slurm queue and the maximum number of worker processes, respectively.

#### Large-Scale Text Evaluation with `LMDeploy`

When evaluating especially long texts (e.g., 200k tokens), conventional methods might lead to memory overload. In such cases, quantized models can be used for evaluation, which can be achieved with the [LMDeploy](https://github.com/InternLM/lmdeploy) tool.

Be sure to install the `LMDeploy` tool before starting the evaluation:

```bash
pip install lmdeploy
```

Detailed information about installing and configuring `LMDeploy` can be found on its GitHub page. Once installed, the `TurboMindModel` defined in the `configs/eval_needleinahaystack_turbomind.py` configuration file can be used for evaluation.

Below is an example configuration in the `configs/eval_needleinahaystack_turbomind.py` file:
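A simplified sketch of what such a configuration can look like. The `TurboMindModel` type comes from this document, while the import paths, the dataset config name, the model path, and the remaining field values are illustrative assumptions:

```python
# Hypothetical sketch of configs/eval_needleinahaystack_turbomind.py.
from mmengine.config import read_base
from opencompass.models import TurboMindModel

with read_base():
    # Long-text dataset definitions (config name is an assumption).
    from .datasets.cdme.cdme200k import cdme_datasets

datasets = [*cdme_datasets]

models = [
    dict(
        type=TurboMindModel,
        abbr='internlm-chat-20b-turbomind',   # model alias (assumption)
        path='./turbomind',                   # path to the converted TurboMind model (assumption)
        max_seq_len=201000,                   # large enough to hold a ~200k-token context
        max_out_len=100,
        batch_size=8,
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]
```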
In this configuration, the `TurboMindModel` combines the functionality of `LMDeploy`, making it suitable for handling large-scale text datasets and effectively reducing memory usage.

### Example of Evaluation

#### Evaluating using the `InternLM2-7B` model deployed with `LMDeploy`

For instance, to evaluate all tasks in NeedleBench-4K using the `InternLM2-7B` model deployed with `LMDeploy`, use the following command, which calls the predefined model and dataset configuration files without needing to write additional configuration files:
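A command along these lines starts the full NeedleBench-4K run. The model alias and the summarizer name below are assumptions based on the naming conventions used elsewhere in this document, so check the predefined configs for the exact names:

```bash
python run.py \
    --datasets needlebench_4k \
    --models lmdeploy_internlm2_chat_7b \
    --summarizer needlebench/needlebench_4k_summarizer \
    --slurm -p partition_name -q auto --max-num-workers 32
```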
This command initiates the evaluation process, with the parameters `-p partition_name -q auto` and `--max-num-workers 32` used to specify the Slurm partition name and the maximum number of worker processes.

If you only want to test the original Needle In A Haystack task setup, you can change the dataset parameter to `needlebench_single_4k`, such as:
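For example, keeping the same placeholder model and summarizer names as above:

```bash
python run.py \
    --datasets needlebench_single_4k \
    --models lmdeploy_internlm2_chat_7b \
    --summarizer needlebench/needlebench_4k_summarizer \
    --slurm -p partition_name -q auto --max-num-workers 32
```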
You can also choose sub-datasets, for example by changing the `--datasets` parameter to `needlebench_single_4k/needlebench_zh_datasets` to test only the Chinese version of the single-needle task; the part after the `/` selects the sub-dataset. The available sub-dataset variables can be found in `configs/datasets/needlebench/needlebench_4k/needlebench_single_4k.py`, such as:
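A hypothetical sketch of how such sub-dataset variables are typically laid out in that file; only `needlebench_zh_datasets` is named in this document, and the English counterpart, the loop over depths, and the dict fields are assumptions:

```python
# Hypothetical layout of the sub-dataset variables; names and fields are illustrative.
needlebench_zh_datasets = []   # Chinese single-needle dataset configs
needlebench_en_datasets = []   # English single-needle dataset configs (assumed counterpart)

for depth_percent in range(0, 101, 10):   # needle placed at 0%, 10%, ..., 100% depth
    needlebench_zh_datasets.append(
        dict(abbr=f'needlebench_zh_4k_depth{depth_percent}', depth=depth_percent))
    needlebench_en_datasets.append(
        dict(abbr=f'needlebench_en_4k_depth{depth_percent}', depth=depth_percent))
```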
#### Evaluating Other `Huggingface` Models

For other models, we recommend writing an additional configuration file to modify the model's `max_seq_len` and `max_out_len` parameters so that the model can receive the complete long text content, as we have prepared in the `configs/eval_needlebench.py` file:
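The sketch below outlines such a configuration. The commented alternatives and the `max_seq_len`/`max_out_len` overrides come from this document, while the `read_base` scaffolding, the import paths, the model alias, and the `work_dir` value are assumptions; the shipped `configs/eval_needlebench.py` may differ in these details.

```python
from mmengine.config import read_base

with read_base():
    # Evaluate needlebench_4k, adjust the configuration to use 8k, 32k, 128k, 200k, or 1000k if necessary.
    from .datasets.needlebench.needlebench_4k.needlebench_4k import needlebench_datasets
    from .summarizers.needlebench import needlebench_4k_summarizer as summarizer

    # only eval original "needle in a haystack test" in needlebench_4k
    # (the corresponding single-needle dataset import is not reproduced in this sketch)

    # or eval the Ancestral Trace Challenge (ATC):
    # from .datasets.needlebench.atc.atc_choice_50 import needlebench_datasets
    # from .summarizers.needlebench import atc_summarizer_50 as summarizer

    # Model to evaluate; the alias below is an assumption.
    from .models.hf_internlm.lmdeploy_internlm2_chat_7b import models as internlm2_chat_7b

datasets = needlebench_datasets
models = internlm2_chat_7b

for m in models:
    m['max_seq_len'] = 32768  # Ensure InternLM2-7B can receive the full long text; adjust for other models based on their supported maximum sequence length.
    m['max_out_len'] = 2000   # Ensure the model can return a complete response in the multi-needle retrieval tasks.

work_dir = './outputs/needlebench'  # output directory (assumption)
```

The evaluation is then launched by pointing `run.py` at this file, for example `python run.py configs/eval_needlebench.py --slurm -p partition_name -q auto --max-num-workers 32`.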
Note that, at this point, we do not need to pass in the `--datasets`, `--models`, or `--summarizer` parameters, as these configurations are already defined in the config file. You can manually adjust the `--max-partition-size` setting to achieve the best task slicing strategy and improve evaluation efficiency.

### Score Calculation Method

In the `CDMEEvaluator` class, we use two main methods to calculate scores: `levenshtein_distance` and `score`. Here are detailed explanations and implementations of these methods.

#### Levenshtein Distance

Levenshtein distance is a measure of the difference between two strings. It represents the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other.
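For reference, a standard dynamic-programming implementation of this measure looks like the following (the version inside `CDMEEvaluator` may differ in details):

```python
def levenshtein_distance(s1: str, s2: str) -> int:
    """Minimum number of single-character edits needed to turn s1 into s2."""
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1)
    if len(s2) == 0:
        return len(s1)
    previous_row = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1       # insert c2 into s1
            deletions = current_row[j] + 1             # delete c1 from s1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]
```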
The `score` calculation method accepts two lists of predictions and references and calculates the edit distance and score for each pair of prediction and reference.

```python
def score(self, predictions, references):
    if len(predictions) != len(references):
        return {"error": "predictions and references have different lengths"}
    # ... the remainder strips whitespace, computes the Levenshtein distance,
    # and derives the per-prediction and average scores described below.
```

This scoring method first removes all whitespace characters from both predictions and references and then calculates the Levenshtein distance between them. The score is calculated as 100 minus the percentage loss based on edit distance. Finally, it returns detailed scores for each prediction and the average score overall.

### Visualization

The `tools_needleinahaystack.py` script can be used to visualize CSV files. This script supports specifying one or more CSV file paths through the `--path` parameter and can use the `--dataset_length` parameter to specify the length of the dataset.

To specify the dataset length for visualization, which is used for generating titles in the visualization charts:
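For example, assuming the script lives under `tools/` and substituting the path of your own summary CSV:

```bash
python tools/tools_needleinahaystack.py \
    --path outputs/default/<timestamp>/summary/summary.csv \
    --dataset_length 8K
```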
Currently, this approach only supports the CDME dataset, and we welcome community contributions for more datasets.

In the latest code version, result visualization is built into the `summarizer` implementation. You can find the corresponding visualizations in the `plots` directory of the respective output folder, eliminating the need to manually visualize scores across various depths and lengths.
If you use this method, please cite as follows:
```bibtex
@misc{2023opencompass,
title={OpenCompass: A Universal Evaluation Platform for Foundation Models},