#### Step 2: Split Data
Document data is usually too long to fit into the prompt due to the context length limitation of LLMs. Supporting documents need to be split into short chunks before constructing vector stores. In this demo, we use a neural text splitter for better performance.
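To make this step concrete, here is a minimal sketch using LangChain's generic `RecursiveCharacterTextSplitter`; the neural splitter used in the demo plays the same role, and the chunk sizes below are only illustrative:

```python
# Rule-based splitter as a stand-in for the demo's neural splitter.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # illustrative chunk length (characters)
    chunk_overlap=50,  # small overlap so sentences are not cut off at chunk borders
)
# `documents` is the list of LangChain Document objects collected in Step 1.
chunks = splitter.split_documents(documents)
```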
#### Step 3: Construct Vector Stores
Choose an embedding function and embed your text chunks into high-dimensional vectors. Once you have vectors for your documents, you need to create a vector store. The vector store should efficiently index and retrieve documents based on vector similarity. In this demo, we use [Chroma](https://python.langchain.com/docs/integrations/vectorstores/chroma) and incrementally update the indexes of the vector store. Through incremental updates, one can update and maintain a vector store without recalculating every embedding.
You are free to choose any vector store from the variety of [vector stores](https://python.langchain.com/docs/integrations/vectorstores/) supported by LangChain. However, incremental update only works with LangChain vector stores that support:
- Document addition by id (`add_documents` method with `ids` argument)
- Delete by id (`delete` method with `ids` argument)
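As a rough illustration of incremental updates (not the exact code used in the demo), LangChain's indexing API pairs a record manager with a vector store such as Chroma; the embedding model name and paths below are only examples:

```python
# Sketch of incremental indexing with Chroma; model name and paths are illustrative.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.indexes import SQLRecordManager, index
from langchain.vectorstores import Chroma

embedding = HuggingFaceEmbeddings(model_name="moka-ai/m3e-base")
vectorstore = Chroma(collection_name="docs", embedding_function=embedding)

# The record manager stores document hashes, so re-indexing skips unchanged chunks.
record_manager = SQLRecordManager(
    "chroma/docs", db_url="sqlite:///record_manager_cache.db"
)
record_manager.create_schema()

# `chunks` comes from Step 2; each chunk needs a "source" entry in its metadata.
# Re-running this call only re-embeds chunks that were added or modified.
index(chunks, record_manager, vectorstore, cleanup="incremental", source_id_key="source")
```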
#### Step 4: Retrieve Relevant Text
Upon querying, we first run reference resolution on the user's input; the goal of this step is to remove ambiguous references in the query, such as "this company" or "him". We then embed the query with the same embedding function and query the vector store to retrieve the top-k most similar documents.
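A bare-bones sketch of this step, with the LLM-based reference resolution call omitted and the resolved query hard-coded for illustration, might look like:

```python
# `vectorstore` is the Chroma store built in Step 3.
query = "What products does this company sell?"           # raw user input
resolved_query = "What products does Example Corp sell?"  # after reference resolution (illustrative)

# The store applies the same embedding function to the query and returns
# the top-k most similar chunks.
retrieved_docs = vectorstore.similarity_search(resolved_query, k=3)
context = "\n\n".join(doc.page_content for doc in retrieved_docs)
```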
#### Step 5: Format Prompt
The prompt carries essential information including task description, conversation history, retrieved documents, and user's query for the LLM to generate a response. Please refer to this [README](./colossalqa/prompt/README.md) for more details.
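For illustration only (the demo's actual templates live in the prompt README linked above), such a prompt can be assembled with a LangChain `PromptTemplate` along these lines:

```python
from langchain.prompts import PromptTemplate

# Illustrative template; the demo's real templates are defined in colossalqa/prompt.
qa_template = (
    "Task: answer the question using only the supporting documents.\n"
    "Conversation so far:\n{chat_history}\n"
    "Supporting documents:\n{context}\n"
    "Question: {question}\nAnswer:"
)
prompt_template = PromptTemplate(
    template=qa_template,
    input_variables=["chat_history", "context", "question"],
)
# `context` and `resolved_query` come from Step 4; history comes from the memory module in Step 7.
formatted_prompt = prompt_template.format(
    chat_history="", context=context, question=resolved_query
)
```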
#### Step 6: Inference
Pass the prompt to the LLM with additional generation arguments to get the agent's response. You can control the generation with arguments such as `temperature`, `top_k`, `top_p`, and `max_new_tokens`. You can also define when to stop by passing a stop substring to the retrieval QA chain.
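As a rough sketch (the exact interface depends on the LLM wrapper and the retrieval QA chain you use), the generation arguments and stop substring might be wired up like this:

```python
# Illustrative sampling arguments; tune these for your model.
generation_kwargs = {
    "max_new_tokens": 256,  # cap on the length of the generated answer
    "temperature": 0.7,     # higher values -> more diverse output
    "top_k": 30,
    "top_p": 0.9,
    "do_sample": True,
}
# `llm` is a LangChain-style LLM wrapper (for example, the OpenAI or Pangu
# instance shown further down in this README). Depending on the wrapper,
# sampling kwargs are set at construction time or passed per call; the stop
# substring ends generation early, e.g. before a fabricated next "Human:" turn.
response = llm(formatted_prompt, stop=["Human:"], **generation_kwargs)
```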
#### Step 7: Update Memory
We designed a memory module that automatically summarizes overlength conversations to fit the max context length of the LLM. In this step, we update the memory with the newly generated response. To fit into the context length of a given LLM, we summarize the overlength part of the historical conversation and present the rest in a round-based conversation format. Fig. 2 shows how the memory is updated. Please refer to this [README](./colossalqa/prompt/README.md) for the dialogue format.

<p align="center">
...
...
import os

from langchain.llms import OpenAI

llm = OpenAI(openai_api_key="YOUR_OPENAI_API_KEY")

# For Pangu LLM
# set up your authentication info
from colossalqa.local.pangu_llm import Pangu

os.environ["URL"] = ""
os.environ["URLNAME"] = ""
...
...
Read comments under `./colossalqa/data_loader` for more details regarding supported data formats.
### Run The Script
We provide a simple Web UI demo of ColossalQA, enabling you to upload your files as a knowledge base and interact with them through a chat interface in your browser. More details can be found [here](examples/webui_demo/README.md).
We also provide scripts for a Chinese document-retrieval-based conversation system, an English document-retrieval-based conversation system, a bilingual document-retrieval-based conversation system, and an experimental AI agent with document retrieval and SQL query functionality. The bilingual one is a high-level wrapper for the other two classes. We write different scripts for different languages because retrieval QA requires different embedding models, LLMs, and prompts for each language setting. For now, we use LLaMa2 for English retrieval QA and ChatGLM2 for Chinese retrieval QA for better performance.
After running the script, it will ask you to provide the path to your data during execution. You can also pass a glob path (e.g., `./data/*.txt`) to load multiple files at once; please read this [guide](https://docs.python.org/3/library/glob.html) on how to define a glob path. Follow the instructions, provide all files for your retrieval conversation system, then type "ESC" to finish loading documents. If csv files are provided, please use "," as the delimiter and "\"" as the quotation mark. For json and jsonl files, the default format is
_EN_RETRIEVAL_QA_PROMPT = """[INST] <<SYS>>Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist content.
If the answer cannot be inferred based on the given context, please say "I cannot answer the question based on the information given.".<</SYS>>
...
...
sentence: {input}
disambiguated sentence:"""
# Prompt templates
# English retrieval prompt, the model generates the answer based on this prompt
# English disambiguation prompt, which replaces any ambiguous references in the user's input with the specific names or entities mentioned in the chat history
# Chinese disambiguation prompt, which replaces any ambiguous references in the user's input with the specific names or entities mentioned in the chat history
```sh
cd ColossalAI/applications/ColossalQA/
pip install -e .
```
Install the dependencies for the ColossalQA webui demo:
```sh
pip install -r requirements.txt
```
## Configure the RAG Chain
Customize the RAG Chain settings, such as the embedding model (default: moka-ai/m3e), the language model, and the prompts, in `config.py`. Please refer to [`Prepare configuration file`](#prepare-configuration-file) for the details of `config.py`.
For API-based language models (like ChatGPT or Huawei Pangu), provide your API key for authentication. For locally-run models, indicate the path to the model's checkpoint file.
### Prepare configuration file
All configs are defined in `ColossalQA/examples/webui_demo/config.py`. You can primarily modify the **bolded** sections in the config to switch the embedding model and the large model loaded by the backend. Other parameters can be left as default or adjusted based on your specific requirements.
- `embed`:
  - **`embed_name`**: the embedding model name
  - **`embed_model_name_or_path`**: path to the embedding model; can be a local path or a Hugging Face path
  - `embed_model_device`: device to load the embedding model
- `model`:
  - **`mode`**: "local" for loading models, "api" for using a model API
  - **`model_name`**: "chatgpt_api", "pangu_api", or your local model name
  - **`model_path`**: path to the model; can be a local path or a Hugging Face path. Not needed if mode="api"
  - `device`: device to load the LLM
- `splitter`:
  - `name`: text splitter class name; the class should be imported at the beginning of `config.py`
- `retrieval`:
  - `retri_top_k`: number of retrieved text chunks which will be provided to the model
  - `retri_kb_file_path`: path to store database files
  - `verbose`: Boolean, controls the level of detail in program output
- `chain`:
  - `mem_summary_prompt`: summary prompt template
  - `mem_human_prefix`: human prefix for the prompt
  - `mem_ai_prefix`: AI assistant prefix for the prompt
  - `mem_max_tokens`: max tokens for history information
  - `mem_llm_kwargs`: model's generation kwargs for summarizing history
    - `max_new_tokens`: int
    - `temperature`: int
    - `do_sample`: bool
  - `disambig_prompt`: disambiguation prompt template
  - `disambig_llm_kwargs`: model's generation kwargs for disambiguating the user's input
    - `max_new_tokens`: int
    - `temperature`: int
    - `do_sample`: bool
  - `gen_llm_kwargs`: model's generation kwargs
    - `max_new_tokens`: int
    - `temperature`: int
    - `do_sample`: bool
  - `gen_qa_prompt`: generation prompt template
  - `verbose`: Boolean, controls the level of detail in program output
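For orientation, a trimmed and purely hypothetical `config.py` following the structure above might look like the sketch below; the dict name, splitter class, and every value are placeholders, and the real file under `examples/webui_demo` is the source of truth:

```python
# Hypothetical, heavily trimmed config.py sketch; every value is a placeholder.
from langchain.text_splitter import RecursiveCharacterTextSplitter  # splitter class must be importable here

ALL_CONFIG = {
    "embed": {
        "embed_name": "m3e",
        "embed_model_name_or_path": "moka-ai/m3e-base",  # local path or Hugging Face path
        "embed_model_device": "cpu",
    },
    "model": {
        "mode": "api",                # "local" to load a checkpoint, "api" to call a hosted model
        "model_name": "chatgpt_api",  # or "pangu_api", or your local model name
        "model_path": "",             # only needed when mode == "local"
        "device": "cuda",
    },
    "splitter": {"name": RecursiveCharacterTextSplitter},  # illustrative splitter class
    "retrieval": {"retri_top_k": 3, "retri_kb_file_path": "./", "verbose": True},
    "chain": {
        "mem_max_tokens": 2000,
        "disambig_llm_kwargs": {"max_new_tokens": 30, "temperature": 1, "do_sample": True},
        "gen_llm_kwargs": {"max_new_tokens": 100, "temperature": 1, "do_sample": True},
        "verbose": True,
    },
}
```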
## Run WebUI Demo
Execute the following command to start the demo:
1. If you want to use a local model as the backend model, you need to specify the model name and model path in `config.py` and run the following commands.
2. If you want to use the ChatGPT API as the backend model, you need to change the model mode to "api", change the model name to "chatgpt_api" in `config.py`, and run the following commands.
3. If you want to use the Pangu API as the backend model, you need to change the model mode to "api", change the model name to "pangu_api" in `config.py`, and run the following commands.