# Needle In A Haystack Experiment Evaluation

## Introduction to the Needle In A Haystack Test

The Needle In A Haystack test, inspired by [NeedleInAHaystack](https://github.com/gkamradt/LLMTest_NeedleInAHaystack/blob/main/LLMNeedleHaystackTester.py), is a method to evaluate the long-text information extraction ability of Large Language Models (LLMs). It involves randomly inserting key information at various points in a long text to form a prompt for LLMs. This test assesses the fundamental ability of LLMs to understand long texts by extracting critical information from them.
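
As an illustration of the mechanics, the sketch below places a needle at a chosen depth of a haystack and combines it with a retrieval question. It is a character-level sketch with hypothetical names, not the OpenCompass implementation (in practice insertion is performed on tokens via a tokenizer):

```python
# Illustrative sketch only: insert a "needle" sentence into a long context
# at a given depth percentage, then build a retrieval prompt.
def build_haystack_prompt(haystack: str, needle: str, depth_percent: float,
                          retrieval_question: str) -> str:
    # Position the needle at the requested fraction of the context.
    insert_pos = int(len(haystack) * depth_percent / 100)
    context = haystack[:insert_pos] + needle + haystack[insert_pos:]
    return f"{context}\n\n{retrieval_question}"


prompt = build_haystack_prompt(
    haystack="Filler sentence. " * 1000,            # stands in for a long document
    needle="The needle: the secret passcode is 42. ",
    depth_percent=50,                               # insert halfway through the text
    retrieval_question="What is the secret passcode?",
)
```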

## Dataset Introduction

The `Skywork/ChineseDomainModelingEval` dataset includes high-quality Chinese articles published between September and October 2023, covering multiple domains. These articles ensure a fair and challenging benchmark for testing.

## File Description

The dataset includes files specific to various domains:

- `zh_finance.jsonl` - Finance
- `zh_game.jsonl` - Gaming
- `zh_government.jsonl` - Government
- `zh_movie.jsonl` - Movies
- `zh_tech.jsonl` - Technology
- `zh_general.jsonl` - General

These files are used to assess the LLM's understanding of different specific domains.
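
Each file stores one article per line in JSON Lines format. As a quick sanity check, a record can be inspected as below; the exact field names depend on the dataset release, so treat them as an assumption:

```python
import json

# Peek at the first record of one domain file; field names depend on the dataset release.
with open('data/CDME/zh_finance.jsonl', encoding='utf-8') as f:
    first_record = json.loads(f.readline())
print(list(first_record.keys()))  # e.g. the key holding the article text
```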

### Evaluation Steps

1. Download the dataset from [Skywork/ChineseDomainModelingEval](https://huggingface.co/datasets/Skywork/ChineseDomainModelingEval/tree/main); an optional scripted download sketch is shown after the directory tree below.

2. Place the downloaded files in `opencompass/data/CDME/`. The expected file structure in the `CDME` directory is as follows:

   ```
   opencompass/
   ├── configs
   ├── docs
   ├── data
   │   └── CDME
   │       ├── processed
   │       ├── README.md
   │       ├── zh_finance.jsonl
   │       ├── zh_game.jsonl
   │       ├── zh_general.jsonl
   │       ├── zh_government.jsonl
   │       ├── zh_movie.jsonl
   │       └── zh_tech.jsonl
   ├── LICENSE
   ├── opencompass
   ├── outputs
   ├── run.py
   ├── more...
   ```
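
Alternatively, the files can be fetched programmatically rather than downloaded by hand. The following is an optional sketch using the `huggingface_hub` library (assuming it is installed and the Hub is reachable); the target directory matches the structure above:

```python
from huggingface_hub import snapshot_download

# Fetch the evaluation dataset directly into the expected data directory.
snapshot_download(
    repo_id='Skywork/ChineseDomainModelingEval',
    repo_type='dataset',
    local_dir='data/CDME',
)
```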

### Environment Setup

```bash
conda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
conda activate opencompass
git clone https://github.com/open-compass/opencompass opencompass
cd opencompass
pip install -e .
```

### Configuring the Dataset

In the latest version, datasets are no longer generated by running scripts but dynamically defined and loaded through configuration files. Users need to specify dataset parameters in the configuration file according to their needs, offering greater flexibility and customization options.

#### Dataset Configuration Example

Here is an example of dataset configuration, showing how to define a dataset in the `configs/datasets/cdme/cdme8k.py` configuration file. This example demonstrates a Chinese dataset configuration with context lengths of up to 8000 tokens:

```python
for original_context_length in context_lengths:
    for depth_percent in generate_depth_percents(
            document_depth_percent_intervals,
            document_depth_percent_interval_type):
        dataset_dict = {
            'abbr': f'CDME_Length{original_context_length}Depth{int(depth_percent)}',
            'type': CDMEDataset,
            'path': base_path,
            'length': original_context_length,
            'depth': int(depth_percent),
            'tokenizer_model': 'gpt-4',
            'file_list': file_list,
            'num_repeats_per_file': 10,
            'length_buffer': 200,
            'guide': True,
            'language': 'Chinese',
            'needle': '\n小明最喜欢的实习的地点就是上海人工智能实验室。\n',
            'retrieval_question': '小明最喜欢的实习地点是哪里?请按照“小明最喜欢的实习地点就是________。”的格式回答。',
            'reader_cfg': cdme_reader_cfg,
            'infer_cfg': cdme_infer_cfg,
            'eval_cfg': cdme_eval_cfg
        }
        cdme_datasets.append(dataset_dict)
```

In this configuration, the main parameters include:

- `abbr`: Abbreviation of the dataset.
- `type`: Dataset type.
- `path`: Path to the dataset files.
- `length`: Context length in tokens.
- `depth`: Insertion depth of the needle, expressed as a percentage of the document.
- `tokenizer_model`: Tokenizer model used.
- `file_list`: List of data source files.
- `num_repeats_per_file`: Number of repeats per file.
- `length_buffer`: Number of tokens reserved as a buffer when assembling the prompt, so the final context stays within the target length.
- `guide`: Whether it's a guided dataset.
- `language`: Language of the dataset.
- `needle`: Specific text to find in the dataset (the 'needle').
- `retrieval_question`: Question used to prompt the model for retrieval.
- `reader_cfg`, `infer_cfg`, `eval_cfg`: Configurations for reading, inference, and evaluation, respectively.

By defining these parameters in the configuration file, you can flexibly create datasets that suit your needs. Configuration files offer a highly customizable and scalable way to manage the generation and use of datasets.
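
For example, the sweep variables consumed by the loop in the example above might be defined as follows. These values and the helper are purely illustrative; the actual definitions in `cdme8k.py` may differ:

```python
import numpy as np

# Illustrative sweep settings: eight context lengths from 1k to 8k tokens,
# with needle insertion depths sampled at evenly spaced percentages.
context_lengths = list(range(1000, 9000, 1000))
document_depth_percent_intervals = 10
document_depth_percent_interval_type = 'linear'


def generate_depth_percents(intervals, interval_type):
    # Only the linear spacing case is sketched here.
    if interval_type == 'linear':
        return np.linspace(0, 100, num=intervals)
    raise ValueError(f'Unsupported interval type: {interval_type}')
```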

### Multi-Needle Needle In A Haystack Test

The latest version introduces the multi-needle Needle In A Haystack test, allowing multiple different needles (text snippets) to be inserted into the same dataset. These needles are inserted in sequence according to a given depth parameter. Compared to the single-needle test, the multi-needle test provides a more complex data processing scenario.

#### Multi-Needle Dataset Configuration Example

Here is an example of configuring a multi-needle dataset, showing how to define a multi-needle dataset in the `configs/datasets/cdme/multi_needle/cdme8k_cot3_italy.py` configuration file. This example demonstrates a dataset configuration with three needles:

```python
# Basic dataset configuration
base_path = './data/CDME'
file_list = ['zh_finance.jsonl']

# Definition of Needles
needles = [
    '\n意大利的佛罗伦萨有一家名为"La Giostra"的餐馆,是整个佛罗伦萨中排行第一的餐馆。\n',
    '"La Giostra"餐馆的特色菜肴是松露奶酪通心粉。',
    '松露奶酪通心粉是该家餐馆的有着意大利皇室烹饪血统的大厨Jack制作'
]


# Configuration parameters
retrieval_question = ("制作佛罗伦萨中排行第一的餐馆的特色菜肴的人叫什么?"
                      "请按照'制作佛罗伦萨中排行第一的餐馆的特色菜肴的人叫______。'的格式回答。")
answer = "制作佛罗伦萨中排行第一的餐馆的特色菜肴的人叫Jack"
keyword = "Jack"
diff = 25

# Dataset generation loop
for original_context_length in context_lengths:
    for depth_percent in generate_depth_percents(
            document_depth_percent_intervals,
            document_depth_percent_interval_type):
        dataset_dict = {
            # Other configuration items...
            'needles': needles,
            'diff': diff,
            'keyword': keyword,
            # Other configuration items...
        }
        cdme_datasets.append(dataset_dict)
```

In this configuration, in addition to the standard parameters, the main new parameters include:

- `needles`: A list containing multiple strings, each representing a needle to be inserted.
- `diff`: Defines the depth increment for subsequent needles relative to the first needle.
- `keyword`: A keyword used for score correction during the evaluation process.
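
To make the role of `diff` concrete, the sketch below shows one plausible way insertion depths for multiple needles can be derived from the base depth; the actual placement logic inside `CDMEDataset` may differ:

```python
# Illustrative only: the first needle sits at the base depth, and each
# subsequent needle is shifted deeper by `diff` percentage points (capped at 100).
def needle_depths(base_depth_percent, num_needles, diff):
    return [min(base_depth_percent + i * diff, 100) for i in range(num_needles)]


print(needle_depths(base_depth_percent=10, num_needles=3, diff=25))  # [10, 35, 60]
```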

#### Change in Scoring Mechanism

In the source code of `opencompass/datasets/cdme/cdme_multi.py`, the scoring mechanism for multi-needle datasets differs. The following code segment has been added to adjust the scores based on the `keyword` in the predictions:

```python
if keyword in prediction:
    print(f'{keyword} is in {prediction}')
    score = 100
else:
    print(f'{keyword} is not in {prediction}')
    score = 0.2 * score
```

This means that if the keyword appears in the prediction, the sample is awarded the full score of 100; otherwise, the edit-distance score is reduced to 20% of its original value. This scoring mechanism places more emphasis on keyword accuracy and complements the traditional edit-distance-based scoring.
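
For instance, a prediction whose edit-distance score is 80 but which omits the keyword ends up with a score of 16, while any prediction containing the keyword scores 100. A small illustrative computation (the helper name is hypothetical):

```python
# Illustrative only: combining the keyword check with an existing edit-distance score.
def adjust_score(score, prediction, keyword):
    return 100 if keyword in prediction else 0.2 * score


print(adjust_score(80, '特色菜肴由大厨Luca制作', 'Jack'))  # 16.0 (keyword missing)
print(adjust_score(80, '特色菜肴由大厨Jack制作', 'Jack'))  # 100  (keyword present)
```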

### Evaluation

#### Evaluating with the `internlm` Model

For example, to evaluate using the `internlm` model, the following command can be used:

```bash
python run.py configs/eval_needleinahaystack.py --slurm -p partition_name -q auto --max-num-workers 32
```

This command starts the evaluation, in which the model attempts to find the specified "needle" in the generated dataset. The `--slurm -p partition_name -q auto` options submit the evaluation jobs through Slurm to the specified partition, and `--max-num-workers 32` sets the maximum number of parallel worker processes.

#### Large-Scale Text Evaluation with `LMDeploy`

When evaluating especially long texts (e.g., 200k tokens), conventional methods might lead to memory overload. In such cases, quantized models can be used for evaluation. This can be achieved using the `LMDeploy` tool ([LMDeploy](https://github.com/InternLM/lmdeploy)).

Detailed information about installing and configuring `LMDeploy` can be found on its GitHub page. Once installed, the `TurboMindModel` defined in the `configs/eval_needleinahaystack_turbomind.py` configuration file can be used for evaluation.

Below is an example configuration in the `configs/eval_needleinahaystack_turbomind.py` file:

```python
from opencompass.models.turbomind import TurboMindModel
from mmengine.config import read_base

with read_base():
    from .datasets.cdme.cdme200k import cdme_datasets

datasets = [*cdme_datasets]

internlm_meta_template = dict(
    round=[
        dict(role='HUMAN', begin='<|User|>:', end='\n'),
        dict(role='BOT', begin='<|Bot|>:', end='<eoa>\n', generate=True),
    ],
    eos_token_id=103028,
)

models = [
    dict(
        type=TurboMindModel,
        abbr='internlm-chat-20b-turbomind',
        path='./turbomind',
        max_out_len=100,
        max_seq_len=2048,
        batch_size=8,
        concurrency=8,
        meta_template=internlm_meta_template,
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]
```

In this configuration, `TurboMindModel` wraps the TurboMind inference engine from `LMDeploy`, making it suitable for large-scale text datasets while effectively reducing memory usage.

### Score Calculation Method

In the `CDMEEvaluator` class, we use two main methods to calculate scores: `levenshtein_distance` and `score`. Here are detailed explanations and implementations of these methods.

#### Levenshtein Distance

Levenshtein distance is a measure of the difference between two strings. It represents the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other.

```python
def levenshtein_distance(self, s1, s2):
    if len(s1) < len(s2):
        return self.levenshtein_distance(s2, s1)

    if len(s2) == 0:
        return len(s1)

    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row

    return previous_row[-1]
```
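
As a quick sanity check, the classic `kitten` → `sitting` example requires three edits (substitute `k`→`s`, substitute `e`→`i`, append `g`). A minimal usage sketch, assuming `CDMEEvaluator` can be instantiated without arguments:

```python
evaluator = CDMEEvaluator()  # assumption: a no-argument constructor is available
assert evaluator.levenshtein_distance('kitten', 'sitting') == 3
```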

#### Score Calculation

The `score` method accepts a list of predictions and a list of references, and computes the edit distance and a score for each prediction-reference pair.

```python
def score(self, predictions, references):
    if len(predictions) != len(references):
        return {"error": "predictions and references have different lengths"}

    total_score = 0
    details = []
    for prediction, reference in zip(predictions, references):
        prediction = re.sub(r'\s+', '', prediction)
        reference = re.sub(r'\s+', '', reference)
        edit_distance = self.levenshtein_distance(prediction, reference)
        max_len = max(len(prediction), len(reference))
        score = 100 * (1 - edit_distance / max_len) if max_len != 0 else 100

        detail = {
            "pred": prediction,
            "ref": reference,
            "edit_distance": edit_distance,
            "score": score
        }
        total_score += score
        details.append(detail)

    average_score = total_score / len(predictions) if predictions else 0
    result = {"average_score": average_score, "details": details}
    return result
```

This scoring method first removes all whitespace characters from both the prediction and the reference, then computes the Levenshtein distance between them. Each pair is scored as `100 * (1 - edit_distance / max_len)`, where `max_len` is the length of the longer string; for example, an edit distance of 2 against a 20-character reference yields a score of 90. Finally, the method returns the detailed score for each prediction along with the overall average score.

### Visualization

The `tools_needleinahaystack.py` script can be used to visualize CSV files. This script supports specifying one or more CSV file paths through the `--path` parameter and can use the `--dataset_length` parameter to specify the length of the dataset.

#### Usage Examples

To visualize a single CSV file:

```bash
python tools/tools_needleinahaystack.py --path 'outputs/default/20231216_161457/summary/summary_20231216_161457.csv'
```

To visualize multiple CSV files:

```bash
python tools/tools_needleinahaystack.py --path 'path_to_first_csv.csv' 'path_to_second_csv.csv'
```

To specify the dataset length for visualization, which is used for generating titles in the visualization charts:

```bash
python tools/tools_needleinahaystack.py --path 'path_to_csv.csv' --dataset_length 200K
```

Currently, this approach only supports the CDME dataset, and we welcome community contributions for more datasets.

If you use this method, please cite as follows:

```bibtex
@misc{2023opencompass,
    title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
    author={OpenCompass Contributors},
    howpublished={\url{https://github.com/open-compass/opencompass}},
    year={2023}
}

@misc{LLMTest_NeedleInAHaystack,
  title={LLMTest Needle In A Haystack - Pressure Testing LLMs},
  author={gkamradt},
  year={2023},
  howpublished={\url{https://github.com/gkamradt/LLMTest_NeedleInAHaystack}}
}

@misc{wei2023skywork,
      title={Skywork: A More Open Bilingual Foundation Model},
      author={Tianwen Wei and Liang Zhao and Lichang Zhang and Bo Zhu and Lijie Wang and Haihua Yang and Biye Li and Cheng Cheng and Weiwei Lü and Rui Hu and Chenxia Li and Liu Yang and Xilin Luo and Xuejie Wu and Lunan Liu and Wenjun Cheng and Peng Cheng and Jianhao Zhang and Xiaoyu Zhang and Lei Lin and Xiaokun Wang and Yutuan Ma and Chuanhai Dong and Yanqi Sun and Yifu Chen and Yongyi Peng and Xiaojuan Liang and Shuicheng Yan and Han Fang and Yahui Zhou},
      year={2023},
      eprint={2310.19341},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```