
# Q&A Example

A vector database can help LLMs access external knowledge.
You can load baai-general-embedding as the encoder to generate the vectors.
Here is an example of building a bot that answers your questions using knowledge from Chinese Wikipedia.

The Q&A dialogue pipeline, built on flag embedding and a large language model (LLM), works as follows:

1. **Data Preprocessing and Indexing:**
   - Download a Chinese Wikipedia dataset.
   - Encode the Chinese Wikipedia text using flag embedding.
   - Build a BM25 index over the text.
2. **Query Enhancement with Large Language Model (LLM):**
   - Use an LLM to rewrite and enrich the original user query based on the chat history.
   - The LLM can complete or paraphrase the query so that it is more robust and self-contained.
3. **Document Retrieval:**
   - Use BM25 to retrieve the top-n documents from the locally stored Chinese Wikipedia dataset for the enhanced query.
4. **Embedding Retrieval:**
   - Run a brute-force embedding search over the top-n retrieved documents to select the top-k documents.
5. **Answer Generation with Large Language Model (LLM):**
   - Present the question, the top-k retrieved documents, and the chat history to the LLM.
   - The LLM uses the retrieved documents and the conversation context to answer the user's question.
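
The two retrieval stages above (BM25 for top-n, then brute-force embedding search for top-k) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this repository: `bm25_scores`, `cosine`, and `two_stage_search` are hypothetical helpers, the tokenizer is a plain whitespace split (real Chinese text would need a proper segmenter), and the 2-D vectors are hand-made stand-ins for real embeddings.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenizer for the toy corpus only (assumption).
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def two_stage_search(query, query_vec, docs, doc_vecs, top_n, top_k):
    """Return indices of the top_k documents after BM25 + embedding re-ranking."""
    # Stage 1: keep the top_n documents by BM25 score.
    scores = bm25_scores(query, docs)
    candidates = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:top_n]
    # Stage 2: brute-force similarity search over the candidates only.
    reranked = sorted(candidates, key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
    return reranked[:top_k]


docs = [
    "apollo 17 was the last crewed moon landing in 1972",
    "the moon orbits the earth",
    "bm25 is a ranking function used by search engines",
    "apollo 11 landed the first humans on the moon",
]
# Hand-made stand-in vectors; a real system would use model embeddings.
doc_vecs = [[0.9, 0.1], [0.2, 0.8], [0.0, 1.0], [0.7, 0.3]]
print(two_stage_search("last moon landing", [1.0, 0.0], docs, doc_vecs, top_n=3, top_k=2))
```

Restricting the embedding search to the BM25 candidates keeps the brute-force step small, since it compares the query against n vectors instead of the whole corpus.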

Together, these steps let the Q&A system ground its answers in retrieved text: BM25 narrows the corpus to a candidate set, flag embedding selects the most relevant documents from it, and the LLM composes the answer from those documents and the chat history.
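
The two LLM steps (query enhancement and answer generation) mostly come down to assembling prompts. The sketch below shows one possible way to do that. The function names and the prompt wording are hypothetical, not the prompts used by this repository, and no LLM is actually called.

```python
def format_history(history):
    """Render (user, assistant) turns as plain text lines."""
    lines = []
    for user_turn, assistant_turn in history:
        lines.append(f"User: {user_turn}")
        lines.append(f"Assistant: {assistant_turn}")
    return "\n".join(lines) if lines else "(none)"


def build_rewrite_prompt(question, history):
    """Prompt asking the LLM to turn a follow-up into a standalone query."""
    return (
        "Rewrite the final question as a standalone search query, "
        "resolving references to the chat history.\n\n"
        f"Chat history:\n{format_history(history)}\n\n"
        f"Final question: {question}\n"
        "Standalone query:"
    )


def build_answer_prompt(question, documents, history):
    """Prompt asking the LLM to answer from the retrieved documents."""
    numbered = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return (
        "Answer the question using only the documents below.\n\n"
        f"Documents:\n{numbered}\n\n"
        f"Chat history:\n{format_history(history)}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


history = [("Who was the first person on the moon?", "Neil Armstrong, in 1969.")]
retrieved = ["Apollo 17 (December 1972) was the last crewed moon landing."]
print(build_rewrite_prompt("And the last one?", history))
print(build_answer_prompt("When was the last crewed moon landing?", retrieved, history))
```

The output of the first prompt would be used as the retrieval query, and the second prompt would be sent once the top-k documents are available.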

### Installation

```shell
sudo apt install default-jdk
pip install -r requirements.txt
conda install -c anaconda openjdk
```

### Prepare Data

```shell
python pre_process.py --data_path ./data
```

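This script downloads the dataset (Chinese Wikipedia), builds the BM25 index, computes the embeddings, and saves everything under `data_path`. The Quick Start below reads the dataset from `./data/dataset`, the embeddings from `./data/emb/data.npy`, and the BM25 index from `./data/index`.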

## Q&A Usage

### Run Directly

```shell
export OPENAI_API_KEY=...
python run.py --data_path ./data
```

This script runs the Q&A dialogue pipeline described above. Set `OPENAI_API_KEY` to your own key before running it.

### Quick Start

```python
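# encoding=gbk
from tool import LocalDatasetLoader, BMVectorIndex, Agent

# Load the preprocessed dataset and its precomputed embeddings.
loader = LocalDatasetLoader(data_path="./data/dataset",
                            embedding_path="./data/emb/data.npy")
# Build the retrieval index from the BM25 index and the embedding model.
index = BMVectorIndex(model_path="BAAI/bge-large-zh",
                      bm_index_path="./data/index",
                      data_loader=loader)
agent = Agent(index)
# "When was the last time someone landed on the moon?"
question = "上次有人登月是什么时候"
# RANKING and TOP_N presumably set the number of BM25 candidates and the
# number of documents kept after embedding re-ranking.
agent.Answer(question, RANKING=1000, TOP_N=5, verbose=False)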
```