# EVAL

Our evaluation process consists of the following steps:
1. Prepare the Environment and Dataset
   - Install required dependencies:
     ```bash
     conda env create -f qwen25vl_environment.yml
     conda activate qwen25vl
     ```
   - Set up your API keys in `secret_t2.env` for GPT-4.1 access (see the loading sketch at the end of this step)
   - Download our dataset `stepfun-ai/GEdit-Bench`:
     ```python
     from datasets import load_dataset
     dataset = load_dataset("stepfun-ai/GEdit-Bench")
     ```
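   - A minimal sketch of loading that key before running the evaluation (the variable name `OPENAI_API_KEY` and the use of `python-dotenv` are assumptions for illustration; check `test_gedit_score.py` for the exact name it reads):
     ```python
     import os
     from dotenv import load_dotenv  # pip install python-dotenv

     # Load the key from secret_t2.env so the evaluation script can read it
     load_dotenv("secret_t2.env")
     assert os.environ.get("OPENAI_API_KEY"), "Put your key in secret_t2.env"
     ```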

2. Generate and Organize Your Images
   - Generate images following the example code in `generate_image_example.py`
   - Organize your generated images in the following directory structure (a sketch of populating this layout follows the tree):
     ```
     results/
     ├── method_name/
     │   └── fullset/
     │       └── edit_task/
     │           ├── cn/  # Chinese instructions
     │           │   ├── key1.png
     │           │   ├── key2.png
     │           │   └── ...
     │           └── en/  # English instructions
     │               ├── key1.png
     │               ├── key2.png
     │               └── ...
     ```
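   - As noted above, a minimal sketch of writing generated images into this layout (the field names `key`, `task_type`, `instruction_language`, `instruction`, `input_image`, the split name, and `run_your_model` are illustrative assumptions; follow `generate_image_example.py` for the actual schema):
     ```python
     import os
     from datasets import load_dataset

     def run_your_model(image, instruction):
         """Placeholder: return the edited PIL image produced by your method."""
         return image

     method_name = "your_method"
     dataset = load_dataset("stepfun-ai/GEdit-Bench", split="train")

     for sample in dataset:
         # results/<method_name>/fullset/<edit_task>/<cn|en>/<key>.png
         out_dir = os.path.join("results", method_name, "fullset",
                                sample["task_type"], sample["instruction_language"])
         os.makedirs(out_dir, exist_ok=True)
         edited = run_your_model(sample["input_image"], sample["instruction"])
         edited.save(os.path.join(out_dir, f"{sample['key']}.png"))
     ```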

3. Evaluate using GPT-4.1 or Qwen2.5-VL-72B-Instruct-AWQ
   - For GPT-4.1 evaluation:
     ```bash
     python test_gedit_score.py --model_name your_method --save_path /path/to/results --backbone gpt4o
     ```
   - For Qwen evaluation:
     ```bash
     python test_gedit_score.py --model_name your_method --save_path /path/to/results --backbone qwen25vl
     ```

4. Analyze your results and obtain scores across all dimensions
   - Run the analysis script to get scores for semantics, quality, and overall performance:
     ```bash
     python calculate_statistics.py --model_name your_method --save_path /path/to/results --backbone gpt4o
     ```
   - This will output scores broken down by edit category and provide aggregate metrics

# Acknowledgements

This project builds upon and adapts code from the following excellent repositories:

- [VIEScore](https://github.com/TIGER-AI-Lab/VIEScore): A visual instruction-guided explainable metric for evaluating conditional image synthesis

We thank the authors of these repositories for making their code publicly available.