Unverified Commit e53e729d authored by YeAnbang's avatar YeAnbang Committed by GitHub
Browse files

[Feature] Add document retrieval QA (#5020)



* add langchain

* add langchain

* Add files via upload

* add langchain

* fix style

* fix style: remove extra space

* add pytest; modified retriever

* add pytest; modified retriever

* add tests to build_on_pr.yml

* fix build_on_pr.yml

* fix build on pr; fix environ vars

* seperate unit tests for colossalqa from build from pr

* fix container setting; fix environ vars

* commented dev code

* add incremental update

* remove stale code

* fix style

* change to sha3 224

* fix retriever; fix style; add unit test for document loader

* fix ci workflow config

* fix ci workflow config

* add set cuda visible device script in ci

* fix doc string

* fix style; update readme; refactored

* add force log info

* change build on pr, ignore colossalqa

* fix docstring, captitalize all initial letters

* fix indexing; fix text-splitter

* remove debug code, update reference

* reset previous commit

* update LICENSE update README add key-value mode, fix bugs

* add files back

* revert force push

* remove junk file

* add test files

* fix retriever bug, add intent classification

* change conversation chain design

* rewrite prompt and conversation chain

* add ui v1

* ui v1

* fix atavar

* add header

* Refactor the RAG Code and support Pangu

* Refactor the ColossalQA chain to Object-Oriented Programming and the UI demo.

* resolved conversation. tested scripts under examples. web demo still buggy

* fix ci tests

* Some modifications to add ChatGPT api

* modify llm.py and remove unnecessary files

* Delete applications/ColossalQA/examples/ui/test_frontend_input.json

* Remove OpenAI api key

* add colossalqa

* move files

* move files

* move files

* move files

* fix style

* Add Readme and fix some bugs.

* Add something to readme and modify some code

* modify a directory name for clarity

* remove redundant directory

* Correct a type in  llm.py

* fix AI prefix

* fix test_memory.py

* fix conversation

* fix some erros and typos

* Fix a missing import in RAG_ChatBot.py

* add colossalcloud LLM wrapper, correct issues in code review

---------
Co-authored-by: default avatarYeAnbang <anbangy2@outlook.com>
Co-authored-by: default avatarOrion-Zheng <zheng_zian@u.nus.edu>
Co-authored-by: default avatarZian(Andy) Zheng <62330719+Orion-Zheng@users.noreply.github.com>
Co-authored-by: default avatarOrion-Zheng <zhengzian@u.nus.edu>
parent 3acbf6d4
Index,Organization Id,Name,Website,Country,Description,Founded,Industry,Number of employees
1,FAB0d41d5b5d22c,Ferrell LLC,https://price.net/,Papua New Guinea,Horizontal empowering knowledgebase,1990,Plastics,3498
2,6A7EdDEA9FaDC52,"Mckinney, Riley and Day",http://www.hall-buchanan.info/,Finland,User-centric system-worthy leverage,2015,Glass / Ceramics / Concrete,4952
3,0bFED1ADAE4bcC1,Hester Ltd,http://sullivan-reed.com/,China,Switchable scalable moratorium,1971,Public Safety,5287
4,2bFC1Be8a4ce42f,Holder-Sellers,https://becker.com/,Turkmenistan,De-engineered systemic artificial intelligence,2004,Automotive,921
5,9eE8A6a4Eb96C24,Mayer Group,http://www.brewer.com/,Mauritius,Synchronized needs-based challenge,1991,Transportation,7870
6,cC757116fe1C085,Henry-Thompson,http://morse.net/,Bahamas,Face-to-face well-modulated customer loyalty,1992,Primary / Secondary Education,4914
7,219233e8aFF1BC3,Hansen-Everett,https://www.kidd.org/,Pakistan,Seamless disintermediate collaboration,2018,Publishing Industry,7832
8,ccc93DCF81a31CD,Mcintosh-Mora,https://www.brooks.com/,Heard Island and McDonald Islands,Centralized attitude-oriented capability,1970,Import / Export,4389
9,0B4F93aA06ED03e,Carr Inc,http://ross.com/,Kuwait,Distributed impactful customer loyalty,1996,Plastics,8167
10,738b5aDe6B1C6A5,Gaines Inc,http://sandoval-hooper.com/,Uzbekistan,Multi-lateral scalable protocol,1997,Outsourcing / Offshoring,9698
11,AE61b8Ffebbc476,Kidd Group,http://www.lyons.com/,Bouvet Island (Bouvetoya),Proactive foreground paradigm,2001,Primary / Secondary Education,7473
12,eb3B7D06cCdD609,Crane-Clarke,https://www.sandoval.com/,Denmark,Front-line clear-thinking encryption,2014,Food / Beverages,9011
13,8D0c29189C9798B,"Keller, Campos and Black",https://www.garner.info/,Liberia,Ameliorated directional emulation,2020,Museums / Institutions,2862
14,D2c91cc03CA394c,Glover-Pope,http://www.silva.biz/,United Arab Emirates,Persevering contextually-based approach,2013,Medical Practice,9079
15,C8AC1eaf9C036F4,Pacheco-Spears,https://aguilar.com/,Sweden,Secured logistical synergy,1984,Maritime,769
16,b5D10A14f7a8AfE,Hodge-Ayers,http://www.archer-elliott.com/,Honduras,Future-proofed radical implementation,1990,Facilities Services,8508
17,68139b5C4De03B4,"Bowers, Guerra and Krause",http://www.carrillo-nicholson.com/,Uganda,De-engineered transitional strategy,1972,Primary / Secondary Education,6986
18,5c2EffEfdba2BdF,Mckenzie-Melton,http://montoya-thompson.com/,Hong Kong,Reverse-engineered heuristic alliance,1998,Investment Management / Hedge Fund / Private Equity,4589
19,ba179F19F7925f5,Branch-Mann,http://www.lozano.com/,Botswana,Adaptive intangible frame,1999,Architecture / Planning,7961
20,c1Ce9B350BAc66b,Weiss and Sons,https://barrett.com/,Korea,Sharable optimal functionalities,2011,Plastics,5984
21,8de40AC4e6EaCa4,"Velez, Payne and Coffey",http://burton.com/,Luxembourg,Mandatory coherent synergy,1986,Wholesale,5010
22,Aad86a4F0385F2d,Harrell LLC,http://www.frey-rosario.com/,Guadeloupe,Reverse-engineered mission-critical moratorium,2018,Construction,2185
23,22aC3FFd64fD703,"Eaton, Reynolds and Vargas",http://www.freeman.biz/,Monaco,Self-enabling multi-tasking process improvement,2014,Luxury Goods / Jewelry,8987
24,5Ec4C272bCf085c,Robbins-Cummings,http://donaldson-wilkins.com/,Belgium,Organic non-volatile hierarchy,1991,Pharmaceuticals,5038
25,5fDBeA8BB91a000,Jenkins Inc,http://www.kirk.biz/,South Africa,Front-line systematic help-desk,2002,Insurance,1215
26,dFfD6a6F9AC2d9C,"Greene, Benjamin and Novak",http://www.kent.net/,Romania,Centralized leadingedge moratorium,2012,Museums / Institutions,4941
27,4B217cC5a0674C5,"Dickson, Richmond and Clay",http://everett.com/,Czech Republic,Team-oriented tangible complexity,1980,Real Estate / Mortgage,3122
28,88b1f1cDcf59a37,Prince-David,http://thompson.com/,Christmas Island,Virtual holistic methodology,1970,Banking / Mortgage,1046
29,f9F7bBCAEeC360F,Ayala LLC,http://www.zhang.com/,Philippines,Open-source zero administration hierarchy,2021,Legal Services,7664
30,7Cb3AeFcE4Ba31e,Rivas Group,https://hebert.org/,Australia,Open-architected well-modulated capacity,1998,Logistics / Procurement,4155
31,ccBcC32adcbc530,"Sloan, Mays and Whitehead",http://lawson.com/,Chad,Face-to-face high-level conglomeration,1997,Civil Engineering,365
32,f5afd686b3d05F5,"Durham, Allen and Barnes",http://chan-stafford.org/,Zimbabwe,Synergistic web-enabled framework,1993,Mechanical or Industrial Engineering,6135
33,38C6cfC5074Fa5e,Fritz-Franklin,http://www.lambert.com/,Nepal,Automated 4thgeneration website,1972,Hospitality,4516
34,5Cd7efccCcba38f,Burch-Ewing,http://cline.net/,Taiwan,User-centric 4thgeneration system engine,1981,Venture Capital / VC,7443
35,9E6Acb51e3F9d6F,"Glass, Barrera and Turner",https://dunlap.com/,Kyrgyz Republic,Multi-channeled 3rdgeneration open system,2020,Utilities,2610
36,4D4d7E18321eaeC,Pineda-Cox,http://aguilar.org/,Bolivia,Fundamental asynchronous capability,2010,Human Resources / HR,1312
37,485f5d06B938F2b,"Baker, Mccann and Macdonald",http://www.anderson-barker.com/,Kenya,Cross-group user-facing focus group,2013,Legislative Office,1638
38,19E3a5Bf6dBDc4F,Cuevas-Moss,https://dodson-castaneda.net/,Guatemala,Extended human-resource intranet,1994,Music,9995
39,6883A965c7b68F7,Hahn PLC,http://newman.com/,Belarus,Organic logistical leverage,2012,Electrical / Electronic Manufacturing,3715
40,AC5B7AA74Aa4A2E,"Valentine, Ferguson and Kramer",http://stuart.net/,Jersey,Centralized secondary time-frame,1997,Non - Profit / Volunteering,3585
41,decab0D5027CA6a,Arroyo Inc,https://www.turner.com/,Grenada,Managed demand-driven website,2006,Writing / Editing,9067
42,dF084FbBb613eea,Walls LLC,http://www.reese-vasquez.biz/,Cape Verde,Self-enabling fresh-thinking installation,1989,Investment Management / Hedge Fund / Private Equity,1678
43,A2D89Ab9bCcAd4e,"Mitchell, Warren and Schneider",https://fox.biz/,Trinidad and Tobago,Enhanced intangible time-frame,2021,Capital Markets / Hedge Fund / Private Equity,3816
44,77aDc905434a49f,Prince PLC,https://www.watts.com/,Sweden,Profit-focused coherent installation,2016,Individual / Family Services,7645
45,235fdEFE2cfDa5F,Brock-Blackwell,http://www.small.com/,Benin,Secured foreground emulation,1986,Online Publishing,7034
46,1eD64cFe986BBbE,Walton-Barnett,https://ashley-schaefer.com/,Western Sahara,Right-sized clear-thinking flexibility,2001,Luxury Goods / Jewelry,1746
47,CbBbFcdd0eaE2cF,Bartlett-Arroyo,https://cruz.com/,Northern Mariana Islands,Realigned didactic function,1976,Civic / Social Organization,3987
48,49aECbDaE6aBD53,"Wallace, Madden and Morris",http://www.blevins-fernandez.biz/,Germany,Persistent real-time customer loyalty,2016,Pharmaceuticals,9443
49,7b3fe6e7E72bFa4,Berg-Sparks,https://cisneros-love.com/,Canada,Stand-alone static implementation,1974,Arts / Crafts,2073
50,c6DedA82A8aef7E,Gonzales Ltd,http://bird.com/,Tonga,Managed human-resource policy,1988,Consumer Goods,9069
51,7D9FBF85cdC3871,Lawson and Sons,https://www.wong.com/,French Southern Territories,Compatible analyzing intranet,2021,Arts / Crafts,3527
52,7dd18Fb7cB07b65,"Mcguire, Mcconnell and Olsen",https://melton-briggs.com/,Korea,Profound client-server frame,1988,Printing,8445
53,EF5B55FadccB8Fe,Charles-Phillips,https://bowman.com/,Cote d'Ivoire,Monitored client-server implementation,2012,Mental Health Care,3450
54,f8D4B99e11fAF5D,Odom Ltd,https://www.humphrey-hess.com/,Cote d'Ivoire,Advanced static process improvement,2012,Management Consulting,1825
55,e24D21BFd3bF1E5,Richard PLC,https://holden-coleman.net/,Mayotte,Object-based optimizing model,1971,Broadcast Media,4942
56,B9BdfEB6D3Ca44E,Sampson Ltd,https://blevins.com/,Cayman Islands,Intuitive local adapter,2005,Farming,1418
57,2a74D6f3D3B268e,"Cherry, Le and Callahan",https://waller-delacruz.biz/,Nigeria,Universal human-resource collaboration,2017,Entertainment / Movie Production,7202
58,Bf3F3f62c8aBC33,Cherry PLC,https://www.avila.info/,Marshall Islands,Persistent tertiary website,1980,Plastics,8245
59,aeBe26B80a7a23c,Melton-Nichols,https://kennedy.com/,Palau,User-friendly clear-thinking productivity,2021,Legislative Office,8741
60,aAeb29ad43886C6,Potter-Walsh,http://thomas-french.org/,Turkey,Optional non-volatile open system,2008,Human Resources / HR,6923
61,bD1bc6bB6d1FeD3,Freeman-Chen,https://mathis.com/,Timor-Leste,Phased next generation adapter,1973,International Trade / Development,346
62,EB9f456e8b7022a,Soto Group,https://norris.info/,Vietnam,Enterprise-wide executive installation,1988,Business Supplies / Equipment,9097
63,Dfef38C51D8DAe3,"Poole, Cruz and Whitney",https://reed.info/,Reunion,Balanced analyzing groupware,1978,Marketing / Advertising / Sales,2992
64,055ffEfB2Dd95B0,Riley Ltd,http://wiley.com/,Brazil,Optional exuding superstructure,1986,Textiles,9315
65,cBfe4dbAE1699da,"Erickson, Andrews and Bailey",https://www.hobbs-grant.com/,Eritrea,Vision-oriented secondary project,2014,Consumer Electronics,7829
66,fdFbecbadcdCdf1,"Wilkinson, Charles and Arroyo",http://hunter-mcfarland.com/,United States Virgin Islands,Assimilated 24/7 archive,1996,Building Materials,602
67,5DCb8A5a5ca03c0,Floyd Ltd,http://www.whitney.com/,Falkland Islands (Malvinas),Function-based fault-tolerant concept,2017,Public Relations / PR,2911
68,ce57DCbcFD6d618,Newman-Galloway,https://www.scott.com/,Luxembourg,Enhanced foreground collaboration,1987,Information Technology / IT,3934
69,5aaD187dc929371,Frazier-Butler,https://www.daugherty-farley.info/,Northern Mariana Islands,Persistent interactive circuit,1972,Outsourcing / Offshoring,5130
70,902D7Ac8b6d476b,Newton Inc,https://www.richmond-manning.info/,Netherlands Antilles,Fundamental stable info-mediaries,1976,Military Industry,563
71,32BB9Ff4d939788,Duffy-Levy,https://www.potter.com/,Guernsey,Diverse exuding installation,1982,Wireless,6146
72,adcB0afbE58bAe3,Wagner LLC,https://decker-esparza.com/,Uruguay,Reactive attitude-oriented toolset,1987,International Affairs,6874
73,dfcA1c84AdB61Ac,Mccall-Holmes,http://www.dean.com/,Benin,Object-based value-added database,2009,Legal Services,696
74,208044AC2fe52F3,Massey LLC,https://frazier.biz/,Suriname,Configurable zero administration Graphical User Interface,1986,Accounting,5004
75,f3C365f0c1A0623,Hicks LLC,http://alvarez.biz/,Pakistan,Quality-focused client-server Graphical User Interface,1970,Computer Software / Engineering,8480
76,ec5Bdd3CBAfaB93,"Cole, Russell and Avery",http://www.blankenship.com/,Mongolia,De-engineered fault-tolerant challenge,2000,Law Enforcement,7012
77,DDB19Be7eeB56B4,Cummings-Rojas,https://simon-pearson.com/,Svalbard & Jan Mayen Islands,User-centric modular customer loyalty,2012,Financial Services,7529
78,dd6CA3d0bc3cAfc,"Beasley, Greene and Mahoney",http://www.petersen-lawrence.com/,Togo,Extended content-based methodology,1976,Religious Institutions,869
79,A0B9d56e61070e3,"Beasley, Sims and Allison",http://burke.info/,Latvia,Secured zero tolerance hub,1972,Facilities Services,6182
80,cBa7EFe5D05Adaf,Crawford-Rivera,https://black-ramirez.org/,Cuba,Persevering exuding budgetary management,1999,Online Publishing,7805
81,Ea3f6D52Ec73563,Montes-Hensley,https://krueger.org/,Liechtenstein,Multi-tiered secondary productivity,2009,Printing,8433
82,bC0CEd48A8000E0,Velazquez-Odom,https://stokes.com/,Djibouti,Streamlined 6thgeneration function,2002,Alternative Dispute Resolution,4044
83,c89b9b59BC4baa1,Eaton-Morales,https://www.reeves-graham.com/,Micronesia,Customer-focused explicit frame,1990,Capital Markets / Hedge Fund / Private Equity,7013
84,FEC51bce8421a7b,"Roberson, Pennington and Palmer",http://www.keith-fisher.com/,Cameroon,Adaptive bi-directional hierarchy,1993,Telecommunications,5571
85,e0E8e27eAc9CAd5,"George, Russo and Guerra",https://drake.com/,Sweden,Centralized non-volatile capability,1989,Military Industry,2880
86,B97a6CF9bf5983C,Davila Inc,https://mcconnell.info/,Cocos (Keeling) Islands,Profit-focused dedicated frame,2017,Consumer Electronics,2215
87,a0a6f9b3DbcBEb5,Mays-Preston,http://www.browning-key.com/,Mali,User-centric heuristic focus group,2006,Military Industry,5786
88,8cC1bDa330a5871,Pineda-Morton,https://www.carr.com/,United States Virgin Islands,Grass-roots methodical info-mediaries,1991,Printing,6168
89,ED889CB2FE9cbd3,Huang and Sons,https://www.bolton.com/,Eritrea,Re-contextualized dynamic hierarchy,1981,Semiconductors,7484
90,F4Dc1417BC6cb8f,Gilbert-Simon,https://www.bradford.biz/,Burundi,Grass-roots radical parallelism,1973,Newspapers / Journalism,1927
91,7ABc3c7ecA03B34,Sampson-Griffith,http://hendricks.org/,Benin,Multi-layered composite paradigm,1972,Textiles,3881
92,4e0719FBE38e0aB,Miles-Dominguez,http://www.turner.com/,Gibraltar,Organized empowering forecast,1996,Civic / Social Organization,897
93,dEbDAAeDfaed00A,Rowe and Sons,https://www.simpson.org/,El Salvador,Balanced multimedia knowledgebase,1978,Facilities Services,8172
94,61BDeCfeFD0cEF5,"Valenzuela, Holmes and Rowland",https://www.dorsey.net/,Taiwan,Persistent tertiary focus group,1999,Transportation,1483
95,4e91eD25f486110,"Best, Wade and Shepard",https://zimmerman.com/,Zimbabwe,Innovative background definition,1991,Gambling / Casinos,4873
96,0a0bfFbBbB8eC7c,Holmes Group,https://mcdowell.org/,Ethiopia,Right-sized zero tolerance focus group,1975,Photography,2988
97,BA6Cd9Dae2Efd62,Good Ltd,http://duffy.com/,Anguilla,Reverse-engineered composite moratorium,1971,Consumer Services,4292
98,E7df80C60Abd7f9,Clements-Espinoza,http://www.flowers.net/,Falkland Islands (Malvinas),Progressive modular hub,1991,Broadcast Media,236
99,AFc285dbE2fEd24,Mendez Inc,https://www.burke.net/,Kyrgyz Republic,User-friendly exuding migration,1993,Education Management,339
100,e9eB5A60Cef8354,Watkins-Kaiser,http://www.herring.com/,Togo,Synergistic background access,2009,Financial Services,2785
This diff is collapsed.
# README Format File for Testing
![Alt text](./examples/diagram.png?raw=true "Fig.1. design of the document retrieval conversation system")
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Install](#install)
- [How to Use](#how-to-use)
- Examples
- [Local Chinese Retrieval QA + Chat](examples/retrieval_conversation_zh.py)
- [Local English Retrieval QA + Chat](examples/retrieval_conversation_en.py)
- [Local Bi-lingual Retrieval QA + Chat](examples/retrieval_conversation_universal.py)
- [Experimental AI Agent Based on Chatgpt + Chat](examples/conversation_agent_chatgpt.py)
**As Colossal-AI is undergoing some major updates, this project will be actively maintained to stay in line with the Colossal-AI project.**
## Install
Install colossalqa
```bash
# python==3.8.17
cd ColossalAI/applications/ColossalQA
pip install -e .
```
To use the vllm server, please refer to the official guide [here](https://github.com/vllm-project/vllm/tree/main) for installation instruction. Simply run the following command from another terminal.
```bash
cd ./vllm/entrypoints
python api_server.py --host localhost --port $PORT_NUMBER --model $PATH_TO_MODEL --swap-space $SWAP_SPACE_IN_GB
```
## How to use
### Collect your data
For ChatGPT based Agent we support document retrieval and simple sql search.
If you want to run the demo locally, we provided document retrieval based conversation system built upon langchain. It accept a wide range of documents.
Read comments under ./colossalqa/data_loader for more detail
### Serving
Currently use vllm will replace with colossal inference when ready. Please refer class VllmLLM.
### Run the script
We provided scripts for Chinese document retrieval based conversation system, English document retrieval based conversation system, Bi-lingual document retrieval based conversation system and an experimental AI agent with document retrieval and SQL query functionality.
To run the bi-lingual scripts, set the following environmental variables before running the script.
```bash
export ZH_MODEL_PATH=XXX
export ZH_MODEL_NAME: chatglm2
export EN_MODEL_PATH: XXX
export EN_MODEL_NAME: llama
python retrieval_conversation_universal.py
```
To run retrieval_conversation_en.py. set the following environmental variables.
```bash
export EN_MODEL_PATH=XXX
export EN_MODEL_NAME: llama
python retrieval_conversation_en.py
```
To run retrieval_conversation_zh.py. set the following environmental variables.
```bash
export ZH_MODEL_PATH=XXX
export ZH_MODEL_NAME: chatglm2
python retrieval_conversation_en.py
```
It will ask you to provide the path to your data during the execution of the script. You can also pass a glob path to load multiple files at once. If csv files are provided, please use ',' as delimiter and '"' as quotation mark. There are no other formatting constraints for loading documents type files. For loading table type files, we use pandas, please refer to [Pandas-Input/Output](https://pandas.pydata.org/pandas-docs/stable/reference/io.html) for file format details.
## The Plan
- [x] build document retrieval QA tool
- [x] Add long + short term memory
- [x] Add demo for AI agent with SQL query
- [x] Add customer retriever for fast construction and retrieving (with incremental mode)
Your Name
Lorem ipsum dolor sit amet, consectetuer adipiscing elit
123 Your Street
Your City, ST 12345
(123) 456-7890
no_reply@example.com
EXPERIENCE
Company, Location — Job Title
MONTH 20XX - PRESENT
Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh.
Company, Location — Job Title
MONTH 20XX - MONTH 20XX
Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh.
Company, Location — Job Title
MONTH 20XX - MONTH 20XX
Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh.
EDUCATION
School Name, Location — Degree
MONTH 20XX - MONTH 20XX
Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore.
School Name, Location — Degree
MONTH 20XX - MONTH 20XX
Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam.
PROJECTS
Project Name — Detail
Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
SKILLS
* Lorem ipsum dolor sit amet.
* Consectetuer adipiscing elit.
* Sed diam nonummy nibh euismod tincidunt.
* L​​​‌​aoreet dolore magna aliquam erat volutpat.
AWARDS
Lorem ipsum dolor sit amet Consectetuer adipiscing elit, Sed diam nonummy
Nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.
Lorem ipsum dolor sit amet Consectetuer adipiscing elit, Sed diam nonummy
Nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.
LANGUAGES
Lorem ipsum, Dolor sit amet, Consectetuer
\ No newline at end of file
"""
Script for the multilingual conversation based experimental AI agent
We used ChatGPT as the language model
You need openai api key to run this script
"""
import argparse
import os
from colossalqa.data_loader.document_loader import DocumentLoader
from colossalqa.data_loader.table_dataloader import TableLoader
from langchain import LLMChain, OpenAI
from langchain.agents import Tool, ZeroShotAgent
from langchain.agents.agent import AgentExecutor
from langchain.agents.agent_toolkits import create_retriever_tool
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.memory import ChatMessageHistory, ConversationBufferMemory
from langchain.memory.chat_memory import ChatMessageHistory
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.utilities import SQLDatabase
from langchain.vectorstores import Chroma
from langchain_experimental.sql import SQLDatabaseChain
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Experimental AI agent powered by ChatGPT")
parser.add_argument("--open_ai_key_path", type=str, default=None, help="path to the plain text open_ai_key file")
args = parser.parse_args()
# Setup openai key
# Set env var OPENAI_API_KEY or load from a file
openai_key = open(args.open_ai_key_path).read()
os.environ["OPENAI_API_KEY"] = openai_key
# Load data served on sql
print("Select files for constructing sql database")
tools = []
llm = OpenAI(temperature=0.0)
while True:
file = input("Select a file to load or press Enter to exit:")
if file == "":
break
data_name = input("Enter a short description of the data:")
table_loader = TableLoader(
[[file, data_name.replace(" ", "_")]], sql_path=f"sqlite:///{data_name.replace(' ', '_')}.db"
)
sql_path = table_loader.get_sql_path()
# Create sql database
db = SQLDatabase.from_uri(sql_path)
print(db.get_table_info())
db_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)
name = f"Query the SQL database regarding {data_name}"
description = (
f"useful for when you need to answer questions based on data stored on a SQL database regarding {data_name}"
)
tools.append(
Tool(
name=name,
func=db_chain.run,
description=description,
)
)
print(f"Added sql dataset\n\tname={name}\n\tdescription:{description}")
# VectorDB
embedding = OpenAIEmbeddings()
# Load data serve on sql
print("Select files for constructing retriever")
while True:
file = input("Select a file to load or press Enter to exit:")
if file == "":
break
data_name = input("Enter a short description of the data:")
retriever_data = DocumentLoader([[file, data_name.replace(" ", "_")]]).all_data
# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=0)
splits = text_splitter.split_documents(retriever_data)
# Create vector store
vectordb = Chroma.from_documents(documents=splits, embedding=embedding)
# Create retriever
retriever = vectordb.as_retriever(
search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.5, "k": 5}
)
# Add to tool chain
name = f"Searches and returns documents regarding {data_name}."
tools.append(create_retriever_tool(retriever, data_name, name))
prefix = """Have a conversation with a human, answering the following questions as best you can. You have access to the following tools. If none of the tools can be used to answer the question. Do not share uncertain answer unless you think answering the question doesn't need any background information. In that case, try to answer the question directly."""
suffix = """You are provided with the following background knowledge:
Begin!"
{chat_history}
Question: {input}
{agent_scratchpad}"""
prompt = ZeroShotAgent.create_prompt(
tools,
prefix=prefix,
suffix=suffix,
input_variables=["input", "chat_history", "agent_scratchpad"],
)
memory = ConversationBufferMemory(memory_key="chat_history", chat_memory=ChatMessageHistory())
llm_chain = LLMChain(llm=OpenAI(temperature=0.7), prompt=prompt)
agent = ZeroShotAgent(llm_chain=llm_chain, tools=tools, verbose=True)
agent_chain = AgentExecutor.from_agent_and_tools(agent=agent, tools=tools, verbose=True, memory=memory)
while True:
user_input = input("User: ")
if " end " in user_input:
print("Agent: Happy to chat with you :)")
break
agent_response = agent_chain.run(user_input)
print(f"Agent: {agent_response}")
table_loader.sql_engine.dispose()
"""
Multilingual retrieval based conversation system backed by ChatGPT
"""
import argparse
import os
from colossalqa.data_loader.document_loader import DocumentLoader
from colossalqa.memory import ConversationBufferWithSummary
from colossalqa.retriever import CustomRetriever
from langchain import LLMChain
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import OpenAI
from langchain.prompts.prompt import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Multilingual retrieval based conversation system backed by ChatGPT")
parser.add_argument("--open_ai_key_path", type=str, default=None, help="path to the model")
parser.add_argument(
"--sql_file_path", type=str, default=None, help="path to the a empty folder for storing sql files for indexing"
)
args = parser.parse_args()
if not os.path.exists(args.sql_file_path):
os.makedirs(args.sql_file_path)
# Setup openai key
# Set env var OPENAI_API_KEY or load from a file
openai_key = open(args.open_ai_key_path).read()
os.environ["OPENAI_API_KEY"] = openai_key
llm = OpenAI(temperature=0.6)
information_retriever = CustomRetriever(k=3, sql_file_path=args.sql_file_path, verbose=True)
# VectorDB
embedding = HuggingFaceEmbeddings(
model_name="moka-ai/m3e-base", model_kwargs={"device": "cpu"}, encode_kwargs={"normalize_embeddings": False}
)
# Define memory with summarization ability
memory = ConversationBufferWithSummary(llm=llm)
# Load data to vector store
print("Select files for constructing retriever")
documents = []
while True:
file = input("Enter a file path or press Enter directory without input to exit:").strip()
if file == "":
break
data_name = input("Enter a short description of the data:")
retriever_data = DocumentLoader([[file, data_name.replace(" ", "_")]]).all_data
# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=0)
splits = text_splitter.split_documents(retriever_data)
documents.extend(splits)
# Create retriever
information_retriever.add_documents(docs=documents, cleanup="incremental", mode="by_source", embedding=embedding)
prompt_template = """Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If the answer cannot be infered based on the given context, please don't share false information.
Use the context and chat history to respond to the human's input at the end or carry on the conversation. You should generate one response only. No following up is needed.
context:
{context}
chat history
{chat_history}
Human: {question}
Assistant:"""
prompt_template_disambiguate = """You are a helpful, respectful and honest assistant. You always follow the instruction.
Please replace any ambiguous references in the given sentence with the specific names or entities mentioned in the chat history or just output the original sentence if no chat history is provided or if the sentence doesn't contain ambiguous references. Your output should be the disambiguated sentence itself (in the same line as "disambiguated sentence:") and contain nothing else.
Here is an example:
Chat history:
Human: I have a friend, Mike. Do you know him?
Assistant: Yes, I know a person named Mike
sentence: What's his favorite food?
disambiguated sentence: What's Mike's favorite food?
END OF EXAMPLE
Chat history:
{chat_history}
sentence: {input}
disambiguated sentence:"""
PROMPT = PromptTemplate(template=prompt_template, input_variables=["question", "chat_history", "context"])
memory.initiate_document_retrieval_chain(
llm,
PROMPT,
information_retriever,
chain_type_kwargs={
"chat_history": "",
},
)
PROMPT_DISAMBIGUATE = PromptTemplate(
template=prompt_template_disambiguate, input_variables=["chat_history", "input"]
)
llm_chain = RetrievalQA.from_chain_type(
llm=llm,
verbose=False,
chain_type="stuff",
retriever=information_retriever,
chain_type_kwargs={"prompt": PROMPT, "memory": memory},
)
llm_chain_disambiguate = LLMChain(llm=llm, prompt=PROMPT_DISAMBIGUATE)
def disambiguity(input):
out = llm_chain_disambiguate.run({"input": input, "chat_history": memory.buffer})
return out.split("\n")[0]
information_retriever.set_rephrase_handler(disambiguity)
while True:
user_input = input("User: ")
if " end " in user_input:
print("Agent: Happy to chat with you :)")
break
agent_response = llm_chain.run(user_input)
agent_response = agent_response.split("\n")[0]
print(f"Agent: {agent_response}")
"""
Script for English retrieval based conversation system backed by LLaMa2
"""
import argparse
import os
from colossalqa.chain.retrieval_qa.base import RetrievalQA
from colossalqa.data_loader.document_loader import DocumentLoader
from colossalqa.local.llm import ColossalAPI, ColossalLLM
from colossalqa.memory import ConversationBufferWithSummary
from colossalqa.prompt.prompt import (
EN_RETRIEVAL_QA_REJECTION_ANSWER,
EN_RETRIEVAL_QA_TRIGGER_KEYWORDS,
PROMPT_DISAMBIGUATE_EN,
PROMPT_RETRIEVAL_QA_EN,
)
from colossalqa.retriever import CustomRetriever
from langchain import LLMChain
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
if __name__ == "__main__":
# Parse arguments
parser = argparse.ArgumentParser(description="English retrieval based conversation system backed by LLaMa2")
parser.add_argument("--model_path", type=str, default=None, help="path to the model")
parser.add_argument("--model_name", type=str, default=None, help="name of the model")
parser.add_argument(
"--sql_file_path", type=str, default=None, help="path to the a empty folder for storing sql files for indexing"
)
args = parser.parse_args()
if not os.path.exists(args.sql_file_path):
os.makedirs(args.sql_file_path)
colossal_api = ColossalAPI.get_api(args.model_name, args.model_path)
llm = ColossalLLM(n=1, api=colossal_api)
# Define the retriever
information_retriever = CustomRetriever(k=3, sql_file_path=args.sql_file_path, verbose=True)
# Setup embedding model locally
embedding = HuggingFaceEmbeddings(
model_name="moka-ai/m3e-base", model_kwargs={"device": "cpu"}, encode_kwargs={"normalize_embeddings": False}
)
# Define memory with summarization ability
memory = ConversationBufferWithSummary(
llm=llm, max_tokens=2000, llm_kwargs={"max_new_tokens": 50, "temperature": 0.6, "do_sample": True}
)
# Define the chain to preprocess the input
# Disambiguate the input. e.g. "What is the capital of that country?" -> "What is the capital of France?"
llm_chain_disambiguate = LLMChain(
llm=llm, prompt=PROMPT_DISAMBIGUATE_EN, llm_kwargs={"max_new_tokens": 30, "temperature": 0.6, "do_sample": True}
)
def disambiguity(input):
out = llm_chain_disambiguate.run(input=input, chat_history=memory.buffer, stop=["\n"])
return out.split("\n")[0]
# Load data to vector store
print("Select files for constructing retriever")
documents = []
while True:
file = input("Enter a file path or press Enter directory without input to exit:").strip()
if file == "":
break
data_name = input("Enter a short description of the data:")
separator = input(
"Enter a separator to force separating text into chunks, if no separator is given, the defaut separator is '\\n\\n'. Note that"
+ "we use neural text spliter to split texts into chunks, the seperator only serves as a delimiter to force split long passage into"
+ " chunks before passing to the neural network. Press ENTER directly to skip:"
)
separator = separator if separator != "" else "\n\n"
retriever_data = DocumentLoader([[file, data_name.replace(" ", "_")]]).all_data
# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
splits = text_splitter.split_documents(retriever_data)
documents.extend(splits)
# Create retriever
information_retriever.add_documents(docs=documents, cleanup="incremental", mode="by_source", embedding=embedding)
# Set document retrieval chain, we need this chain to calculate prompt length
memory.initiate_document_retrieval_chain(
llm,
PROMPT_RETRIEVAL_QA_EN,
information_retriever,
chain_type_kwargs={
"chat_history": "",
},
)
# Define retrieval chain
retrieval_chain = RetrievalQA.from_chain_type(
llm=llm,
verbose=False,
chain_type="stuff",
retriever=information_retriever,
chain_type_kwargs={"prompt": PROMPT_RETRIEVAL_QA_EN, "memory": memory},
llm_kwargs={"max_new_tokens": 50, "temperature": 0.75, "do_sample": True},
)
# Set disambiguity handler
information_retriever.set_rephrase_handler(disambiguity)
# Start conversation
while True:
user_input = input("User: ")
if "END" == user_input:
print("Agent: Happy to chat with you :)")
break
agent_response = retrieval_chain.run(
query=user_input,
stop=["Human: "],
rejection_trigger_keywrods=EN_RETRIEVAL_QA_TRIGGER_KEYWORDS,
rejection_answer=EN_RETRIEVAL_QA_REJECTION_ANSWER,
)
agent_response = agent_response.split("\n")[0]
print(f"Agent: {agent_response}")
"""
Script for English retrieval based conversation system backed by LLaMa2
"""
import argparse
import json
import os
from colossalqa.chain.retrieval_qa.base import RetrievalQA
from colossalqa.data_loader.document_loader import DocumentLoader
from colossalqa.local.llm import ColossalAPI, ColossalLLM
from colossalqa.memory import ConversationBufferWithSummary
from colossalqa.prompt.prompt import (
EN_RETRIEVAL_QA_REJECTION_ANSWER,
EN_RETRIEVAL_QA_TRIGGER_KEYWORDS,
PROMPT_DISAMBIGUATE_EN,
PROMPT_RETRIEVAL_QA_EN,
)
from colossalqa.retriever import CustomRetriever
from langchain import LLMChain
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
if __name__ == "__main__":
# Parse arguments
parser = argparse.ArgumentParser(description="English retrieval based conversation system backed by LLaMa2")
parser.add_argument("--model_path", type=str, default=None, help="path to the model")
parser.add_argument("--model_name", type=str, default=None, help="name of the model")
parser.add_argument(
"--sql_file_path", type=str, default=None, help="path to the a empty folder for storing sql files for indexing"
)
args = parser.parse_args()
if not os.path.exists(args.sql_file_path):
os.makedirs(args.sql_file_path)
colossal_api = ColossalAPI.get_api(args.model_name, args.model_path)
llm = ColossalLLM(n=1, api=colossal_api)
# Define the retriever
information_retriever = CustomRetriever(k=3, sql_file_path=args.sql_file_path, verbose=True)
# Setup embedding model locally
embedding = HuggingFaceEmbeddings(
model_name="moka-ai/m3e-base", model_kwargs={"device": "cpu"}, encode_kwargs={"normalize_embeddings": False}
)
# Define memory with summarization ability
memory = ConversationBufferWithSummary(
llm=llm, max_tokens=2000, llm_kwargs={"max_new_tokens": 50, "temperature": 0.6, "do_sample": True}
)
# Define the chain to preprocess the input
# Disambiguate the input. e.g. "What is the capital of that country?" -> "What is the capital of France?"
llm_chain_disambiguate = LLMChain(
llm=llm, prompt=PROMPT_DISAMBIGUATE_EN, llm_kwargs={"max_new_tokens": 30, "temperature": 0.6, "do_sample": True}
)
def disambiguity(input):
out = llm_chain_disambiguate.run(input=input, chat_history=memory.buffer, stop=["\n"])
return out.split("\n")[0]
# Load data to vector store
print("Select files for constructing retriever")
documents = []
# preprocess data
if not os.path.exists("../data/data_sample/custom_service_preprocessed.json"):
if not os.path.exists("../data/data_sample/custom_service.json"):
raise ValueError(
"custom_service.json not found, please download the data from HuggingFace Datasets: qgyd2021/e_commerce_customer_service"
)
data = json.load(open("../data/data_sample/custom_service.json", "r", encoding="utf8"))
preprocessed = []
for row in data["rows"]:
preprocessed.append({"key": row["row"]["query"], "value": row["row"]["response"]})
data = {}
data["data"] = preprocessed
with open("../data/data_sample/custom_service_preprocessed.json", "w", encoding="utf8") as f:
json.dump(data, f, ensure_ascii=False)
# define metadata function which is used to format the prompt with value in metadata instead of key,
# the later is langchain's default behavior
def metadata_func(data_sample, additional_fields):
"""
metadata_func (Callable[Dict, Dict]): A function that takes in the JSON
object extracted by the jq_schema and the default metadata and returns
a dict of the updated metadata.
To use key-value format, the metadata_func should be defined as follows:
metadata = {'value': 'a string to be used to format the prompt', 'is_key_value_mapping': True}
"""
metadata = {}
metadata["value"] = f"Question: {data_sample['key']}\nAnswer:{data_sample['value']}"
metadata["is_key_value_mapping"] = True
assert "value" not in additional_fields
assert "is_key_value_mapping" not in additional_fields
metadata.update(additional_fields)
return metadata
retriever_data = DocumentLoader(
[["../data/data_sample/custom_service_preprocessed.json", "CustomerServiceDemo"]],
content_key="key",
metadata_func=metadata_func,
).all_data
# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
splits = text_splitter.split_documents(retriever_data)
documents.extend(splits)
# Create retriever
information_retriever.add_documents(docs=documents, cleanup="incremental", mode="by_source", embedding=embedding)
# Set document retrieval chain, we need this chain to calculate prompt length
memory.initiate_document_retrieval_chain(
llm,
PROMPT_RETRIEVAL_QA_EN,
information_retriever,
chain_type_kwargs={
"chat_history": "",
},
)
# Define retrieval chain
retrieval_chain = RetrievalQA.from_chain_type(
llm=llm,
verbose=False,
chain_type="stuff",
retriever=information_retriever,
chain_type_kwargs={"prompt": PROMPT_RETRIEVAL_QA_EN, "memory": memory},
llm_kwargs={"max_new_tokens": 50, "temperature": 0.75, "do_sample": True},
)
# Set disambiguity handler
information_retriever.set_rephrase_handler(disambiguity)
# Start conversation
while True:
user_input = input("User: ")
if "END" == user_input:
print("Agent: Happy to chat with you :)")
break
agent_response = retrieval_chain.run(
query=user_input,
stop=["Human: "],
rejection_trigger_keywrods=EN_RETRIEVAL_QA_TRIGGER_KEYWORDS,
rejection_answer=EN_RETRIEVAL_QA_REJECTION_ANSWER,
)
agent_response = agent_response.split("\n")[0]
print(f"Agent: {agent_response}")
import argparse
from colossalqa.retrieval_conversation_universal import UniversalRetrievalConversation
if __name__ == '__main__':
# Parse arguments
parser = argparse.ArgumentParser()
parser.add_argument('--en_model_path', type=str, default=None)
parser.add_argument('--zh_model_path', type=str, default=None)
parser.add_argument('--zh_model_name', type=str, default=None)
parser.add_argument('--en_model_name', type=str, default=None)
parser.add_argument('--sql_file_path', type=str, default=None, help='path to the a empty folder for storing sql files for indexing')
args = parser.parse_args()
# Will ask for documents path in runnning time
session = UniversalRetrievalConversation(files_en=None,
files_zh=None,
zh_model_path=args.zh_model_path, en_model_path=args.en_model_path,
zh_model_name=args.zh_model_name, en_model_name=args.en_model_name,
sql_file_path=args.sql_file_path
)
session.start_test_session()
\ No newline at end of file
"""
Script for Chinese retrieval based conversation system backed by ChatGLM
"""
import argparse
import os
from colossalqa.chain.retrieval_qa.base import RetrievalQA
from colossalqa.data_loader.document_loader import DocumentLoader
from colossalqa.local.llm import ColossalAPI, ColossalLLM
from colossalqa.memory import ConversationBufferWithSummary
from colossalqa.prompt.prompt import (
PROMPT_DISAMBIGUATE_ZH,
PROMPT_RETRIEVAL_QA_ZH,
SUMMARY_PROMPT_ZH,
ZH_RETRIEVAL_QA_REJECTION_ANSWER,
ZH_RETRIEVAL_QA_TRIGGER_KEYWORDS,
)
from colossalqa.retriever import CustomRetriever
from colossalqa.text_splitter import ChineseTextSplitter
from langchain import LLMChain
from langchain.embeddings import HuggingFaceEmbeddings
if __name__ == "__main__":
# Parse arguments
parser = argparse.ArgumentParser(description="Chinese retrieval based conversation system backed by ChatGLM2")
parser.add_argument("--model_path", type=str, default=None, help="path to the model")
parser.add_argument("--model_name", type=str, default=None, help="name of the model")
parser.add_argument(
"--sql_file_path", type=str, default=None, help="path to the a empty folder for storing sql files for indexing"
)
args = parser.parse_args()
if not os.path.exists(args.sql_file_path):
os.makedirs(args.sql_file_path)
colossal_api = ColossalAPI.get_api(args.model_name, args.model_path)
llm = ColossalLLM(n=1, api=colossal_api)
# Setup embedding model locally
embedding = HuggingFaceEmbeddings(
model_name="moka-ai/m3e-base", model_kwargs={"device": "cpu"}, encode_kwargs={"normalize_embeddings": False}
)
# Define the retriever
information_retriever = CustomRetriever(k=3, sql_file_path=args.sql_file_path, verbose=True)
# Define memory with summarization ability
memory = ConversationBufferWithSummary(
llm=llm,
prompt=SUMMARY_PROMPT_ZH,
human_prefix="用户",
ai_prefix="Assistant",
max_tokens=2000,
llm_kwargs={"max_new_tokens": 50, "temperature": 0.6, "do_sample": True},
)
# Define the chain to preprocess the input
# Disambiguate the input. e.g. "What is the capital of that country?" -> "What is the capital of France?"
llm_chain_disambiguate = LLMChain(
llm=llm, prompt=PROMPT_DISAMBIGUATE_ZH, llm_kwargs={"max_new_tokens": 30, "temperature": 0.6, "do_sample": True}
)
def disambiguity(input: str):
out = llm_chain_disambiguate.run(input=input, chat_history=memory.buffer, stop=["\n"])
return out.split("\n")[0]
# Load data to vector store
print("Select files for constructing retriever")
documents = []
while True:
file = input("Enter a file path or press Enter directory without input to exit:").strip()
if file == "":
break
data_name = input("Enter a short description of the data:")
retriever_data = DocumentLoader([[file, data_name.replace(" ", "_")]]).all_data
# Split
text_splitter = ChineseTextSplitter()
splits = text_splitter.split_documents(retriever_data)
documents.extend(splits)
# Create retriever
information_retriever.add_documents(docs=documents, cleanup="incremental", mode="by_source", embedding=embedding)
# Set document retrieval chain, we need this chain to calculate prompt length
memory.initiate_document_retrieval_chain(llm, PROMPT_RETRIEVAL_QA_ZH, information_retriever)
# Define retrieval chain
llm_chain = RetrievalQA.from_chain_type(
llm=llm,
verbose=False,
chain_type="stuff",
retriever=information_retriever,
chain_type_kwargs={"prompt": PROMPT_RETRIEVAL_QA_ZH, "memory": memory},
llm_kwargs={"max_new_tokens": 150, "temperature": 0.6, "do_sample": True},
)
# Set disambiguity handler
information_retriever.set_rephrase_handler(disambiguity)
# Start conversation
while True:
user_input = input("User: ")
if "END" == user_input:
print("Agent: Happy to chat with you :)")
break
agent_response = llm_chain.run(
query=user_input,
stop=["</答案>"],
doc_prefix="支持文档",
rejection_trigger_keywrods=ZH_RETRIEVAL_QA_TRIGGER_KEYWORDS,
rejection_answer=ZH_RETRIEVAL_QA_REJECTION_ANSWER,
)
print(f"Agent: {agent_response}")
"""
Script for English retrieval based conversation system backed by LLaMa2
"""
import argparse
import os
from colossalqa.chain.retrieval_qa.base import RetrievalQA
from colossalqa.data_loader.document_loader import DocumentLoader
from colossalqa.local.llm import ColossalAPI, ColossalLLM
from colossalqa.prompt.prompt import PROMPT_RETRIEVAL_CLASSIFICATION_USE_CASE_ZH
from colossalqa.retriever import CustomRetriever
from colossalqa.text_splitter import ChineseTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
if __name__ == "__main__":
# Parse arguments
parser = argparse.ArgumentParser(description="English retrieval based conversation system backed by LLaMa2")
parser.add_argument("--model_path", type=str, default=None, help="path to the model")
parser.add_argument("--model_name", type=str, default=None, help="name of the model")
parser.add_argument(
"--sql_file_path", type=str, default=None, help="path to the a empty folder for storing sql files for indexing"
)
args = parser.parse_args()
if not os.path.exists(args.sql_file_path):
os.makedirs(args.sql_file_path)
colossal_api = ColossalAPI.get_api(args.model_name, args.model_path)
llm = ColossalLLM(n=1, api=colossal_api)
# Define the retriever
information_retriever = CustomRetriever(k=2, sql_file_path=args.sql_file_path, verbose=True)
# Setup embedding model locally
embedding = HuggingFaceEmbeddings(
model_name="moka-ai/m3e-base", model_kwargs={"device": "cpu"}, encode_kwargs={"normalize_embeddings": False}
)
# Load data to vector store
print("Select files for constructing retriever")
documents = []
# define metadata function which is used to format the prompt with value in metadata instead of key,
# the later is langchain's default behavior
def metadata_func(data_sample, additional_fields):
"""
metadata_func (Callable[Dict, Dict]): A function that takes in the JSON
object extracted by the jq_schema and the default metadata and returns
a dict of the updated metadata.
To use key-value format, the metadata_func should be defined as follows:
metadata = {'value': 'a string to be used to format the prompt', 'is_key_value_mapping': True}
"""
metadata = {}
metadata["value"] = f"Question: {data_sample['key']}\nAnswer:{data_sample['value']}"
metadata["is_key_value_mapping"] = True
assert "value" not in additional_fields
assert "is_key_value_mapping" not in additional_fields
metadata.update(additional_fields)
return metadata
retriever_data = DocumentLoader(
[["../data/data_sample/custom_service_classification.json", "CustomerServiceDemo"]],
content_key="key",
metadata_func=metadata_func,
).all_data
# Split
text_splitter = ChineseTextSplitter()
splits = text_splitter.split_documents(retriever_data)
documents.extend(splits)
# Create retriever
information_retriever.add_documents(docs=documents, cleanup="incremental", mode="by_source", embedding=embedding)
# Define retrieval chain
retrieval_chain = RetrievalQA.from_chain_type(
llm=llm,
verbose=True,
chain_type="stuff",
retriever=information_retriever,
chain_type_kwargs={"prompt": PROMPT_RETRIEVAL_CLASSIFICATION_USE_CASE_ZH},
llm_kwargs={"max_new_tokens": 50, "temperature": 0.75, "do_sample": True},
)
# Set disambiguity handler
# Start conversation
while True:
user_input = input("User: ")
if "END" == user_input:
print("Agent: Happy to chat with you :)")
break
# 要使用和custom_service_classification.json 里的key 类似的句子做输入
agent_response = retrieval_chain.run(query=user_input, stop=["Human: "])
agent_response = agent_response.split("\n")[0]
print(f"Agent: {agent_response}")
from typing import Dict, Tuple
from colossalqa.chain.retrieval_qa.base import RetrievalQA
from colossalqa.data_loader.document_loader import DocumentLoader
from colossalqa.memory import ConversationBufferWithSummary
from colossalqa.mylogging import get_logger
from colossalqa.prompt.prompt import (
PROMPT_DISAMBIGUATE_ZH,
PROMPT_RETRIEVAL_QA_ZH,
SUMMARY_PROMPT_ZH,
ZH_RETRIEVAL_QA_REJECTION_ANSWER,
ZH_RETRIEVAL_QA_TRIGGER_KEYWORDS,
)
from colossalqa.retriever import CustomRetriever
from colossalqa.text_splitter import ChineseTextSplitter
from langchain import LLMChain
from langchain.embeddings import HuggingFaceEmbeddings
logger = get_logger()
DEFAULT_RAG_CFG = {
"retri_top_k": 3,
"retri_kb_file_path": "./",
"verbose": True,
"mem_summary_prompt": SUMMARY_PROMPT_ZH,
"mem_human_prefix": "用户",
"mem_ai_prefix": "Assistant",
"mem_max_tokens": 2000,
"mem_llm_kwargs": {"max_new_tokens": 50, "temperature": 1, "do_sample": True},
"disambig_prompt": PROMPT_DISAMBIGUATE_ZH,
"disambig_llm_kwargs": {"max_new_tokens": 30, "temperature": 1, "do_sample": True},
"embed_model_name_or_path": "moka-ai/m3e-base",
"embed_model_device": {"device": "cpu"},
"gen_llm_kwargs": {"max_new_tokens": 100, "temperature": 1, "do_sample": True},
"gen_qa_prompt": PROMPT_RETRIEVAL_QA_ZH,
}
class RAG_ChatBot:
def __init__(
self,
llm,
rag_config,
) -> None:
self.llm = llm
self.rag_config = rag_config
self.set_embed_model(**self.rag_config)
self.set_text_splitter(**self.rag_config)
self.set_memory(**self.rag_config)
self.set_info_retriever(**self.rag_config)
self.set_rag_chain(**self.rag_config)
if self.rag_config.get("disambig_prompt", None):
self.set_disambig_retriv(**self.rag_config)
def set_embed_model(self, **kwargs):
self.embed_model = HuggingFaceEmbeddings(
model_name=kwargs["embed_model_name_or_path"],
model_kwargs=kwargs["embed_model_device"],
encode_kwargs={"normalize_embeddings": False},
)
def set_text_splitter(self, **kwargs):
# Initialize text_splitter
self.text_splitter = ChineseTextSplitter()
def set_memory(self, **kwargs):
params = {"llm_kwargs": kwargs["mem_llm_kwargs"]} if kwargs.get("mem_llm_kwargs", None) else {}
# Initialize memory with summarization ability
self.memory = ConversationBufferWithSummary(
llm=self.llm,
prompt=kwargs["mem_summary_prompt"],
human_prefix=kwargs["mem_human_prefix"],
ai_prefix=kwargs["mem_ai_prefix"],
max_tokens=kwargs["mem_max_tokens"],
**params,
)
def set_info_retriever(self, **kwargs):
self.info_retriever = CustomRetriever(
k=kwargs["retri_top_k"], sql_file_path=kwargs["retri_kb_file_path"], verbose=kwargs["verbose"]
)
def set_rag_chain(self, **kwargs):
params = {"llm_kwargs": kwargs["gen_llm_kwargs"]} if kwargs.get("gen_llm_kwargs", None) else {}
self.rag_chain = RetrievalQA.from_chain_type(
llm=self.llm,
verbose=kwargs["verbose"],
chain_type="stuff",
retriever=self.info_retriever,
chain_type_kwargs={"prompt": kwargs["gen_qa_prompt"], "memory": self.memory},
**params,
)
def split_docs(self, documents):
doc_splits = self.text_splitter.split_documents(documents)
return doc_splits
def set_disambig_retriv(self, **kwargs):
params = {"llm_kwargs": kwargs["disambig_llm_kwargs"]} if kwargs.get("disambig_llm_kwargs", None) else {}
self.llm_chain_disambiguate = LLMChain(llm=self.llm, prompt=kwargs["disambig_prompt"], **params)
def disambiguity(input: str):
out = self.llm_chain_disambiguate.run(input=input, chat_history=self.memory.buffer, stop=["\n"])
return out.split("\n")[0]
self.info_retriever.set_rephrase_handler(disambiguity)
def load_doc_from_console(self, json_parse_args: Dict = {}):
documents = []
print("Select files for constructing Chinese retriever")
while True:
file = input("Enter a file path or press Enter directly without input to exit:").strip()
if file == "":
break
data_name = input("Enter a short description of the data:")
docs = DocumentLoader([[file, data_name.replace(" ", "_")]], **json_parse_args).all_data
documents.extend(docs)
self.documents = documents
self.split_docs_and_add_to_mem(**self.rag_config)
def load_doc_from_files(self, files, data_name="default_kb", json_parse_args: Dict = {}):
documents = []
for file in files:
docs = DocumentLoader([[file, data_name.replace(" ", "_")]], **json_parse_args).all_data
documents.extend(docs)
self.documents = documents
self.split_docs_and_add_to_mem(**self.rag_config)
def split_docs_and_add_to_mem(self, **kwargs):
self.doc_splits = self.split_docs(self.documents)
self.info_retriever.add_documents(
docs=self.doc_splits, cleanup="incremental", mode="by_source", embedding=self.embed_model
)
self.memory.initiate_document_retrieval_chain(self.llm, kwargs["gen_qa_prompt"], self.info_retriever)
def reset_config(self, rag_config):
self.rag_config = rag_config
self.set_embed_model(**self.rag_config)
self.set_text_splitter(**self.rag_config)
self.set_memory(**self.rag_config)
self.set_info_retriever(**self.rag_config)
self.set_rag_chain(**self.rag_config)
if self.rag_config.get("disambig_prompt", None):
self.set_disambig_retriv(**self.rag_config)
def run(self, user_input: str, memory: ConversationBufferWithSummary) -> Tuple[str, ConversationBufferWithSummary]:
if memory:
memory.buffered_history.messages = memory.buffered_history.messages
memory.summarized_history_temp.messages = memory.summarized_history_temp.messages
result = self.rag_chain.run(
query=user_input,
stop=[memory.human_prefix + ": "],
rejection_trigger_keywrods=ZH_RETRIEVAL_QA_TRIGGER_KEYWORDS,
rejection_answer=ZH_RETRIEVAL_QA_REJECTION_ANSWER,
)
return result.split("\n")[0], memory
def start_test_session(self):
"""
Simple session for testing purpose
"""
while True:
user_input = input("User: ")
if "END" == user_input:
print("Agent: Happy to chat with you :)")
break
agent_response, self.memory = self.run(user_input, self.memory)
print(f"Agent: {agent_response}")
if __name__ == "__main__":
# Initialize an Langchain LLM(here we use ChatGPT as an example)
from langchain.llms import OpenAI
llm = OpenAI(openai_api_key="YOUR_OPENAI_API_KEY")
# chatgpt cannot control temperature, do_sample, etc.
DEFAULT_RAG_CFG["mem_llm_kwargs"] = None
DEFAULT_RAG_CFG["disambig_llm_kwargs"] = None
DEFAULT_RAG_CFG["gen_llm_kwargs"] = None
rag = RAG_ChatBot(llm, DEFAULT_RAG_CFG)
rag.load_doc_from_console()
rag.start_test_session()
# ColossalQA WebUI Demo
This demo provides a simple WebUI for ColossalQA, enabling you to upload your files as a knowledge base and interact with them through a chat interface in your browser.
The `server.py` initializes the backend RAG chain that can be backed by various language models (e.g., ChatGPT, Huawei Pangu, ChatGLM2). Meanwhile, `webui.py` launches a Gradio-supported chatbot interface.
# Usage
## Installation
First, install the necessary dependencies for ColossalQA:
```sh
git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI/applications/ColossalQA/
pip install -e .
```
## Configure the RAG Chain
Customize the RAG Chain settings, such as the embedding model (default: moka-ai/m3e) and the language model, in the `start_colossal_qa.sh` script.
For API-based language models (like ChatGPT or Huawei Pangu), provide your API key for authentication. For locally-run models, indicate the path to the model's checkpoint file.
If you want to customize prompts in the RAG Chain, you can have a look at the `RAG_ChatBot.py` file to modify them.
## Run WebUI Demo
Execute the following command to start the demo:
```sh
bash start_colossal_qa.sh
```
After launching the script, you can upload files and engage with the chatbot through your web browser.
![ColossalQA Demo](https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/colossalqa/img/qa_demo.png)
\ No newline at end of file
import argparse
import copy
import json
import os
import random
import string
from http.server import BaseHTTPRequestHandler, HTTPServer
from colossalqa.local.llm import ColossalAPI, ColossalLLM
from colossalqa.data_loader.document_loader import DocumentLoader
from colossalqa.retrieval_conversation_zh import ChineseRetrievalConversation
from colossalqa.retriever import CustomRetriever
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from RAG_ChatBot import RAG_ChatBot, DEFAULT_RAG_CFG
# Define the mapping between embed_model_name(passed from Front End) and the actual path on the back end server
EMBED_MODEL_DICT = {
"m3e": os.environ.get("EMB_MODEL_PATH", DEFAULT_RAG_CFG["embed_model_name_or_path"])
}
# Define the mapping between LLM_name(passed from Front End) and the actual path on the back end server
LLM_DICT = {
"chatglm2": os.environ.get("CHAT_LLM_PATH", "THUDM/chatglm-6b"),
"pangu": "Pangu_API",
"chatgpt": "OpenAI_API"
}
def randomword(length):
letters = string.ascii_lowercase
return "".join(random.choice(letters) for i in range(length))
class ColossalQAServerRequestHandler(BaseHTTPRequestHandler):
chatbot = None
def _set_response(self):
"""
set http header for response
"""
self.send_response(200)
self.send_header("Content-type", "application/json")
self.end_headers()
def do_POST(self):
content_length = int(self.headers["Content-Length"])
post_data = self.rfile.read(content_length)
received_json = json.loads(post_data.decode("utf-8"))
print(received_json)
# conversation_ready is False(user's first request): Need to upload files and initialize the RAG chain
if received_json["conversation_ready"] is False:
self.rag_config = DEFAULT_RAG_CFG.copy()
try:
assert received_json["embed_model_name"] in EMBED_MODEL_DICT
assert received_json["llm_name"] in LLM_DICT
self.docs_files = received_json["docs"]
embed_model_name, llm_name = received_json["embed_model_name"], received_json["llm_name"]
# Find the embed_model/llm ckpt path on the back end server.
embed_model_path, llm_path = EMBED_MODEL_DICT[embed_model_name], LLM_DICT[llm_name]
self.rag_config["embed_model_name_or_path"] = embed_model_path
# Create the storage path for knowledge base files
self.rag_config["retri_kb_file_path"] = os.path.join(os.environ["TMP"], "colossalqa_kb/"+randomword(20))
if not os.path.exists(self.rag_config["retri_kb_file_path"]):
os.makedirs(self.rag_config["retri_kb_file_path"])
if (embed_model_path is not None) and (llm_path is not None):
# ---- Intialize LLM, QA_chatbot here ----
print("Initializing LLM...")
if llm_path == "Pangu_API":
from colossalqa.local.pangu_llm import Pangu
self.llm = Pangu(id=1)
self.llm.set_auth_config() # verify user's auth info here
self.rag_config["mem_llm_kwargs"] = None
self.rag_config["disambig_llm_kwargs"] = None
self.rag_config["gen_llm_kwargs"] = None
elif llm_path == "OpenAI_API":
from langchain.llms import OpenAI
self.llm = OpenAI()
self.rag_config["mem_llm_kwargs"] = None
self.rag_config["disambig_llm_kwargs"] = None
self.rag_config["gen_llm_kwargs"] = None
else:
# ** (For Testing Only) **
# In practice, all LLMs will run on the cloud platform and accessed by API, instead of running locally.
# initialize model from model_path by using ColossalLLM
self.rag_config["mem_llm_kwargs"] = {"max_new_tokens": 50, "temperature": 1, "do_sample": True}
self.rag_config["disambig_llm_kwargs"] = {"max_new_tokens": 30, "temperature": 1, "do_sample": True}
self.rag_config["gen_llm_kwargs"] = {"max_new_tokens": 100, "temperature": 1, "do_sample": True}
self.colossal_api = ColossalAPI(llm_name, llm_path)
self.llm = ColossalLLM(n=1, api=self.colossal_api)
print(f"Initializing RAG Chain...")
print("RAG_CONFIG: ", self.rag_config)
self.__class__.chatbot = RAG_ChatBot(self.llm, self.rag_config)
print("Loading Files....\n", self.docs_files)
self.__class__.chatbot.load_doc_from_files(self.docs_files)
# -----------------------------------------------------------------------------------
res = {"response": f"文件上传完成,模型初始化完成,让我们开始对话吧!(后端模型:{llm_name})", "error": "", "conversation_ready": True}
except Exception as e:
res = {"response": "文件上传或模型初始化有误,无法开始对话。",
"error": f"Error in File Uploading and/or RAG initialization. Error details: {e}",
"conversation_ready": False}
# conversation_ready is True: Chatbot and docs are all set. Ready to chat.
else:
user_input = received_json["user_input"]
chatbot_response, self.__class__.chatbot.memory = self.__class__.chatbot.run(user_input, self.__class__.chatbot.memory)
res = {"response": chatbot_response, "error": "", "conversation_ready": True}
self._set_response()
self.wfile.write(json.dumps(res).encode("utf-8"))
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Chinese retrieval based conversation system")
parser.add_argument("--port", type=int, default=13666, help="port on localhost to start the server")
args = parser.parse_args()
server_address = ("localhost", args.port)
httpd = HTTPServer(server_address, ColossalQAServerRequestHandler)
print(f"Starting server on port {args.port}...")
httpd.serve_forever()
#!/bin/bash
cleanup() {
echo "Caught Signal ... cleaning up."
pkill -P $$ # kill all subprocess of this script
exit 1 # exit script
}
# 'cleanup' is trigered when receive SIGINT(Ctrl+C) OR SIGTERM(kill) signal
trap cleanup INT TERM
# Disable your proxy
# unset HTTP_PROXY HTTPS_PROXY http_proxy https_proxy
# Path to store knowledge base(Home Directory by default)
export TMP=$HOME
# Use m3e as embedding model
export EMB_MODEL="m3e" # moka-ai/m3e-base model will be download automatically
# export EMB_MODEL_PATH="PATH_TO_LOCAL_CHECKPOINT/m3e-base" # you can also specify the local path to embedding model
# Choose a backend LLM
# - ChatGLM2
# export CHAT_LLM="chatglm2"
# export CHAT_LLM_PATH="PATH_TO_LOCAL_CHECKPOINT/chatglm2-6b"
# - ChatGPT
export CHAT_LLM="chatgpt"
# Auth info for OpenAI API
export OPENAI_API_KEY="YOUR_OPENAI_API_KEY"
# - Pangu
# export CHAT_LLM="pangu"
# # Auth info for Pangu API
# export URL=""
# export USERNAME=""
# export PASSWORD=""
# export DOMAIN_NAME=""
# Run server.py and colossalqa_webui.py in the background
python server.py &
python webui.py &
# Wait for all processes to finish
wait
import json
import os
import gradio as gr
import requests
RAG_STATE = {"conversation_ready": False, # Conversation is not ready until files are uploaded and RAG chain is initialized
"embed_model_name": os.environ.get("EMB_MODEL", "m3e"),
"llm_name": os.environ.get("CHAT_LLM", "chatgpt")}
URL = "http://localhost:13666"
def get_response(client_data, URL):
headers = {"Content-type": "application/json"}
print(f"Sending request to server url: {URL}")
response = requests.post(URL, data=json.dumps(client_data), headers=headers)
response = json.loads(response.content)
return response
def add_text(history, text):
history = history + [(text, None)]
return history, gr.update(value=None, interactive=True)
def add_file(history, files):
global RAG_STATE
RAG_STATE["conversation_ready"] = False # after adding new files, reset the ChatBot
RAG_STATE["upload_files"]=[file.name for file in files]
files_string = "\n".join([os.path.basename(path) for path in RAG_STATE["upload_files"]])
print(files_string)
history = history + [(files_string, None)]
return history
def bot(history):
print(history)
global RAG_STATE
if not RAG_STATE["conversation_ready"]:
# Upload files and initialize models
client_data = {
"docs": RAG_STATE["upload_files"],
"embed_model_name": RAG_STATE["embed_model_name"], # Select embedding model name here
"llm_name": RAG_STATE["llm_name"], # Select LLM model name here. ["pangu", "chatglm2"]
"conversation_ready": RAG_STATE["conversation_ready"]
}
else:
client_data = {}
client_data["conversation_ready"] = RAG_STATE["conversation_ready"]
client_data["user_input"] = history[-1][0].strip()
response = get_response(client_data, URL) # TODO: async request, to avoid users waiting the model initialization too long
print(response)
if response["error"] != "":
raise gr.Error(response["error"])
RAG_STATE["conversation_ready"] = response["conversation_ready"]
history[-1][1] = response["response"]
yield history
CSS = """
.contain { display: flex; flex-direction: column; height: 100vh }
#component-0 { height: 100%; }
#chatbot { flex-grow: 1; }
"""
header_html = """
<div style="background: linear-gradient(to right, #2a0cf4, #7100ed, #9800e6, #b600df, #ce00d9, #dc0cd1, #e81bca, #f229c3, #f738ba, #f946b2, #fb53ab, #fb5fa5); padding: 20px; text-align: left;">
<h1 style="color: white;">ColossalQA</h1>
<h4 style="color: white;">ColossalQA</h4>
</div>
"""
with gr.Blocks(css=CSS) as demo:
html = gr.HTML(header_html)
chatbot = gr.Chatbot(
[],
elem_id="chatbot",
bubble_full_width=False,
avatar_images=(
(os.path.join(os.path.dirname(__file__), "img/avatar_user.png")),
(os.path.join(os.path.dirname(__file__), "img/avatar_ai.png")),
),
)
with gr.Row():
txt = gr.Textbox(
scale=4,
show_label=False,
placeholder="Enter text and press enter, or upload an image",
container=True,
autofocus=True,
)
btn = gr.UploadButton("📁", file_types=["file"], file_count="multiple")
txt_msg = txt.submit(add_text, [chatbot, txt], [chatbot, txt], queue=False).then(bot, chatbot, chatbot)
# Clear the original textbox
txt_msg.then(lambda: gr.update(value=None, interactive=True), None, [txt], queue=False)
# Click Upload Button: 1. upload files 2. send config to backend, initalize model 3. get response "conversation_ready" = True/False
file_msg = btn.upload(add_file, [chatbot, btn], [chatbot], queue=False).then(bot, chatbot, chatbot)
if __name__ == "__main__":
demo.queue()
demo.launch(share=True) # share=True will release a public link of the demo
[pytest]
markers =
dist: tests which are run in a multi-GPU or multi-machine environment (at least 4 GPUs)
largedist: tests which are run in a multi-GPU or multi-machine environment (at least 8 GPUs)
\ No newline at end of file
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment