# OmniSQL Training and Evaluation

## Environment Setup
All experiments were conducted using:
- **Anaconda 3**
- **Python 3.9.5**
- **8 x NVIDIA A800 80GB GPUs**

**Note:** A single A800 80GB GPU is sufficient for inference and evaluation. For training OmniSQL from scratch, 8 x A800 80GB GPUs are recommended.

## Dataset Preparation

### Download
Download the datasets from:
- [ModelScope-OmniSQL-datasets](https://modelscope.cn/datasets/seeklhy/OmniSQL-datasets/summary)
- [HuggingFace-OmniSQL-datasets](https://huggingface.co/datasets/seeklhy/OmniSQL-datasets)

The datasets include BIRD, Spider, ScienceBenchmark, EHRSQL, Spider2-SQLite, Spider-DK, Spider-Realistic, Spider-Syn, and SynSQL-2.5M. Unzip `data.zip` in this folder.
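
For example, one way to fetch and unpack the archive (this assumes `data.zip` is published as a file in the Hugging Face dataset repo; downloading it manually from either link works just as well):

```sh
# Download data.zip from the Hugging Face dataset repo and unzip it in this folder
pip3 install -U "huggingface_hub[cli]"
huggingface-cli download seeklhy/OmniSQL-datasets data.zip --repo-type dataset --local-dir .
unzip data.zip
```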

### Pre-processing
The pre-processed datasets are included in `data.zip` (see the `*.json` files). You can also reproduce the pre-processing steps if needed.

1. **Set Up Environment:**
   ```sh
   conda create -n omnisql_process_data python=3.9.5
   conda activate omnisql_process_data

   apt-get update
   apt-get install -y openjdk-11-jdk
   
   pip3 install func_timeout ijson pyserini==0.22.1 faiss-cpu torch==2.1.0 numpy==1.24.3 nltk==3.8.1
   python3 nltk_downloader.py
   ```

2. **Run Pre-processing Scripts:**
   ```sh
   # Build BM25 index for database values
   python3 build_contents_index.py
   # Prepare input-output sequences
   sh process_dataset.sh
   ```

   **Note:** Processing SynSQL-2.5M may take over 24 hours due to its size (~2.5 million samples).

## Evaluation Reproduction
You can easily reproduce our evaluation results as follows:

1. **Set Up Environment:**
   ```sh
   conda create -n omnisql_eval python=3.9.5
   conda activate omnisql_eval
   pip3 install vllm==0.6.3.post1 func_timeout tqdm matplotlib nltk==3.8.1 sqlparse
   python3 nltk_downloader.py
   ```

2. **Download Evaluation Materials:**
   Download Spider's test-suite databases and evaluation scripts from [test_suite_sql_eval.zip](https://drive.google.com/file/d/1iNa1WgA9tN_OFna08nq_tHZdXx9Lz2vO/view) and unzip `test_suite_sql_eval.zip` in this folder.
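
   For instance, assuming you use the `gdown` utility (not part of the original instructions; downloading through a browser works just as well), the archive can be fetched via its Google Drive link and unzipped here:

   ```sh
   # Download the test-suite evaluation package from Google Drive, then unzip it in this folder
   pip3 install gdown
   gdown "https://drive.google.com/uc?id=1iNa1WgA9tN_OFna08nq_tHZdXx9Lz2vO"
   unzip test_suite_sql_eval.zip
   ```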

3. **Run Evaluation:**
   ```sh
   python3 eval_open_source_models.py
   ```
   Predicted SQL queries are saved in the `results` folder, and evaluation results (e.g., model accuracy) are stored in the `evaluation_results` folder.

## Training OmniSQL from Scratch
To train OmniSQL from scratch:

1. **Set Up Environment:**
   ```sh
   conda create -n omnisql_train python=3.9.5
   conda activate omnisql_train
   pip3 install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 transformers==4.45.1 accelerate==0.34.2 deepspeed==0.10.3 numpy==1.24.3 peft datasets tensorboard ijson
   ```

   To speed up attention computation, install flash-attention:

   ```sh
   # Build from source (not recommended; compilation can be very slow)
   pip3 install flash-attn==2.5.8 --no-build-isolation
   ```

   Instead, it is recommended to download a precompiled flash-attn wheel from [flash-attn-2.5.8](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.5.8). Choose the `.whl` file that matches your environment: `flash_attn-2.5.8+cu{cuda_version}torch{torch_version}cxx11abiFALSE-cp{python_version}-cp{python_version}-linux_x86_64.whl`.

   For example, if your CUDA version is 12.2, PyTorch version is 2.1, and Python version is 3.9.5, download `flash_attn-2.5.8+cu122torch2.1cxx11abiFALSE-cp39-cp39-linux_x86_64.whl` and install it with `pip3 install`.
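
   A sketch of that example, assuming GitHub's standard release-asset URL layout (adjust the wheel name to your CUDA/PyTorch/Python versions):

   ```sh
   # Example for CUDA 12.2, PyTorch 2.1, and Python 3.9
   wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.8/flash_attn-2.5.8+cu122torch2.1cxx11abiFALSE-cp39-cp39-linux_x86_64.whl
   pip3 install flash_attn-2.5.8+cu122torch2.1cxx11abiFALSE-cp39-cp39-linux_x86_64.whl
   ```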

2. **Training Scripts:**
   ```sh
   # train OmniSQL-7B using SynSQL-2.5M
   sh train_omnisql_7b.sh
   # train OmniSQL-14B using SynSQL-2.5M
   sh train_omnisql_14b.sh
   # train OmniSQL-32B using SynSQL-2.5M
   sh train_omnisql_32b.sh
   ```

   To train the full version of OmniSQL, you should manually merge the three training sets (`./data/train_synsql.json`, `./data/train_bird.json`, and `./data/train_spider.json`) and update the `DATASET_DIR` in the scripts. For OmniSQL-32B, you can merge LoRA adapters into the base model using `merge_lora_adapter.py`.
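
   For example, assuming each training file is a flat JSON array of samples and `jq` is installed, the three sets can be concatenated like this (the output file name below is just an example; point `DATASET_DIR` at whatever path you choose):

   ```sh
   # Concatenate the three JSON arrays into a single training file
   jq -s 'add' ./data/train_synsql.json ./data/train_bird.json ./data/train_spider.json > ./data/train_full.json
   ```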

   **Note:** Training OmniSQL from scratch is resource- and time-intensive. As reported in our paper, training OmniSQL-7B/14B/32B takes approximately 6, 12, and 20 days, respectively, on a single machine with 8 NVIDIA A800 80GB GPUs. Please consider carefully whether you really need to retrain from scratch. **We encourage using our open-sourced OmniSQL models directly, or continuing to train your own text-to-SQL model on a smaller dataset starting from OmniSQL.**