This directory contains the implementation for evaluating vision-language models on ODinW-13 (Object Detection in the Wild), a benchmark of 13 datasets, using vLLM for high-speed inference.
## Overview
ODinW is a comprehensive object detection benchmark that consists of 13 diverse datasets spanning various domains. This implementation provides:
- **High-speed inference** using vLLM with automatic batch optimization
- **Unified evaluation** across 13 diverse object detection datasets
- **COCO-style metrics** including mAP, mAP_50, mAP_75, etc.
- **Modular code structure** for easy maintenance and extension
## Project Structure
```
ODinW-13/
├── run_odinw.py # Main script for inference and evaluation
├── dataset_utils.py # Dataset loading and preprocessing utilities
├── eval_utils.py # Evaluation logic and COCO metrics computation
├── infer_instruct.sh # Inference script for instruct models
├── infer_think.sh # Inference script for thinking models
├── eval_instruct.sh # Evaluation script for instruct model results
├── eval_think.sh # Evaluation script for thinking model results
├── requirements.txt # Python dependencies
└── README.md # This file
```
## Requirements
### Python Dependencies
```bash
pip install -r requirements.txt
```
Key dependencies:
- `vllm` - High-speed LLM inference engine
- `transformers` - HuggingFace transformers
- `qwen_vl_utils` - Qwen VL utilities for vision processing
- `pycocotools` - COCO evaluation API
- `pandas`, `numpy` - Data processing
- `Pillow` - Image processing
- `tabulate` - Table formatting (optional, for better output display)
### Data Preparation
The ODinW dataset must be laid out in the directory structure expected by `odinw13_config.py` under `--data-dir`.
**Note:** Thinking models output reasoning steps wrapped in `<think>...</think>` tags. The evaluation automatically extracts the final answer after `</think>`.
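A minimal sketch of that extraction step (the actual logic lives in `eval_utils.py`; the helper name here is illustrative):

```python
def extract_final_answer(text: str) -> str:
    """Return the content after the last </think> tag; if no tag is
    present, the whole text is treated as the answer."""
    marker = "</think>"
    if marker in text:
        return text.rsplit(marker, 1)[1].strip()
    return text.strip()
```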
## Performance Tips
1. **GPU Memory**: Adjust `--gpu-memory-utilization` based on your GPU:
   - 0.9: Recommended for most cases
   - 0.95: For maximum throughput (may cause OOM)
   - 0.7-0.8: If experiencing OOM errors
2. **Batch Size**: vLLM automatically optimizes batch size based on available memory
3. **Tensor Parallelism**: Use `--tensor-parallel-size` for large models:
   - 2B model: 1 GPU recommended
   - 7B model: 1-2 GPUs
   - 14B+ model: 2-4 GPUs
4. **Context Length**: Reduce `--max-model-len` if memory is limited:
   - 128000: Default, works well for most cases
   - 64000: Reduces memory usage by ~40%
5. **Image Processing**: The implementation uses `smart_resize` to automatically adjust image dimensions:
   - Dimensions are made divisible by 32
   - Total pixels are constrained to [min_pixels, max_pixels]
   - Aspect ratio is preserved
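The resizing rules in tip 5 can be sketched as a standalone function. This is a simplified approximation, not the exact `smart_resize` from `qwen_vl_utils`, and the default pixel bounds are illustrative:

```python
import math

def smart_resize(height: int, width: int, factor: int = 32,
                 min_pixels: int = 256 * 256,
                 max_pixels: int = 2048 * 2048) -> tuple[int, int]:
    """Return (new_height, new_width) such that both sides are divisible
    by `factor`, the total pixel count lies in [min_pixels, max_pixels],
    and the aspect ratio is approximately preserved."""
    # Round each side to the nearest multiple of `factor`.
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Too large: scale both sides down by the same ratio, floor to a multiple.
        beta = math.sqrt((height * width) / max_pixels)
        h = max(factor, math.floor(height / beta / factor) * factor)
        w = max(factor, math.floor(width / beta / factor) * factor)
    elif h * w < min_pixels:
        # Too small: scale both sides up by the same ratio, ceil to a multiple.
        beta = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * beta / factor) * factor
        w = math.ceil(width * beta / factor) * factor
    return h, w
```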
## Troubleshooting
### Common Issues
**1. Config file not found**
```
FileNotFoundError: Config file not found: /path/to/odinw13_config.py
```
**Solution**: Ensure `odinw13_config.py` exists in `--data-dir`
**2. CUDA Out of Memory**
```bash
# Reduce GPU memory utilization
--gpu-memory-utilization 0.7
# Or reduce context length
--max-model-len 64000
```
**3. vLLM Multiprocessing Issues**
The code automatically sets `VLLM_WORKER_MULTIPROC_METHOD=spawn`. If you still encounter issues:
```bash
export VLLM_WORKER_MULTIPROC_METHOD=spawn
```
**4. Empty or Invalid JSON Output**
- Check model output format
- Verify prompt clarity
- Try adjusting temperature/top_p
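When debugging output format, a tolerant parser helps distinguish formatting problems from genuinely empty predictions. A sketch (not the project's actual parsing code) that strips markdown fences before decoding:

```python
import json
import re

def parse_model_json(text: str):
    """Try to decode a model response as JSON, tolerating ```json fences
    and surrounding prose. Returns None when nothing parseable is found."""
    # Prefer the contents of a fenced code block if one is present.
    fence = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    candidate = fence.group(1) if fence else text
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Fall back to the first bracketed span, which often holds the box list.
        match = re.search(r"\[.*\]", candidate, re.DOTALL)
        if match:
            try:
                return json.loads(match.group(0))
            except json.JSONDecodeError:
                pass
    return None
```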
**5. Low mAP Scores**
- Verify category names match dataset classes
- Check coordinate format (xyxy vs xywh)
- Ensure model outputs JSON format correctly
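The coordinate-format check matters because COCO evaluation expects boxes as `[x, y, width, height]`, while models typically emit corner format `[x1, y1, x2, y2]`; mixing the two silently destroys mAP. A conversion sketch (hypothetical helper, not code from this repo):

```python
def xyxy_to_xywh(box):
    """Convert a [x1, y1, x2, y2] corner box to COCO's [x, y, w, h]."""
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]
```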
**6. COCO API Errors**
```
IndexError: The testing results of the whole dataset is empty.
```
**Solution**: No valid predictions were generated. Check model outputs.
## Advanced Usage
### Custom Image Resolution
Edit `dataset_utils.py` to modify the resolution parameters used during image preprocessing.
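The names below are illustrative placeholders (check `dataset_utils.py` for the actual variable names); they correspond to the pixel bounds described under Performance Tips:

```python
# Hypothetical parameter names; adjust to match dataset_utils.py.
MIN_PIXELS = 256 * 256    # lower bound on total pixels per image
MAX_PIXELS = 2048 * 2048  # upper bound on total pixels per image
```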
### Evaluating Specific Datasets
To evaluate only specific datasets, edit `generate_odinw_jobs()` in `dataset_utils.py`:
```python
# Only process specific datasets
dataset_filter = ['AerialMaritimeDrone', 'Aquarium']
for data_name, data_config in datasets.items():
    if data_name not in dataset_filter:
        continue
    # ... rest of the code
```
### Custom Prompt Format
Edit the prompt in `dataset_utils.py`:
```python
# Default prompt
prompt = f"Locate every instance that belongs to the following categories: '{obj_names}'. Report bbox coordinates in JSON format."
# Custom prompt example
prompt = f"Find all {obj_names} objects in the image and output their bounding boxes as JSON."
```
## Citation
If you use this code or the ODinW benchmark, please cite:
```bibtex
@inproceedings{li2022grounded,
title={Grounded language-image pre-training},
author={Li, Liunian Harold and Zhang, Pengchuan and Zhang, Haotian and Yang, Jianwei and Li, Chunyuan and Zhong, Yiwu and Wang, Lijuan and Yuan, Lu and Zhang, Lei and Hwang, Jenq-Neng and others},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={10965--10975},
year={2022}
}
```
## License
This code is released under the same license as the Qwen3-VL model.