# CMLM-ZhongJing(中医大语言模型-仲景)
[English](https://github.com/pariskang/CMLM-ZhongJing/blob/main/README-EN.md) | [中文](https://github.com/pariskang/CMLM-ZhongJing/blob/main/README.md)
A Traditional Chinese Medicine large language model, inspired by the wisdom of Zhang Zhongjing, an eminent representative of ancient Chinese medical scholars.

This model aims to illuminate the profound knowledge of Traditional Chinese Medicine, bridging the gap between ancient wisdom and modern technology, and providing a reliable and professional tool for the Traditional Chinese Medicine field. However, all generated results are for reference only; diagnoses and treatment recommendations should be provided by experienced professionals.
<p align="center"> <img src="https://raw.githubusercontent.com/pariskang/CMLM-ZhongJing/main/logo.png" alt="logo" title="logo" width="50%"> </p>
<p align="center"><b>Fig 1. A logo of CMLM-ZhongJing generated by Bing's drawing output combined with human creative prompts.</b></p>
## 1. Instruction Data Construction
While many works such as Alpaca and Belle are based on the self-instruct approach, which effectively harnesses the knowledge of large language models to generate diverse and creative instructions and can quickly construct massive instruction sets for general question-answering scenarios, hallucinated outputs may introduce noise into the instruction data. This degrades model accuracy in fields where professional knowledge has a low tolerance for errors, such as medical and legal scenarios: improper diagnoses or prescription suggestions can endanger a patient's life, and factually incorrect citations of statutes or legal doctrine can cause a party to lose a case. Therefore, how to quickly invoke the OpenAI API without sacrificing the professionalism of the instruction data has become an important research direction for instruction data construction and annotation. Below, we briefly describe our preliminary experimental exploration.
<p align="center"> <img src="https://raw.githubusercontent.com/pariskang/CMLM-ZhongJing/main/logo_image/Strategy.jpeg" alt="strategy" title="strategy" width="100%"> </p>
<p align="center"><b>Fig 2. A Multi-task Therapeutic Behavior Decomposition Instruction Construction Strategy in the Loop of Human Physicians.</b></p>
#### 1.1 Multi-task Therapeutic Behavior Decomposition Instruction Construction Strategy
Human memory and understanding require the construction of various scenarios and stories to implicitly encode knowledge information. The clarity of a memory depends on the duration and richness of the learning process; interleaved learning, spaced practice, and diversified learning can enhance the consolidation of knowledge, thereby forming a deep understanding of domain knowledge.

Our approach draws on this process of human memory: using professional tables and the language representation capabilities of large language models, we strictly set specific prompt templates so that the model generates 15 scenario types from tabular data on Chinese medicine gynecology prescriptions: patient therapeutic story, diagnostic analysis, diagnosis treatment expected result, formula function, interactive story, narrative medicine, tongue & pulse, therapeutic template making, critical thinking, follow-up, prescription, herb dosage, case study, real-world problem, and disease mechanism. This promotes the model's reasoning ability over prescription data and diagnostic thinking logic.
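The template-driven generation step above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the scenario list matches the 15 types named above, but the template wording, field names, and the example table row are all hypothetical.

```python
# Illustrative sketch: fill one strict prompt template per scenario type
# from a single row of a prescription table. The template text and the
# row fields (formula, herbs, indications) are assumptions for this demo.

SCENARIOS = [
    "patient therapeutic story", "diagnostic analysis",
    "diagnosis treatment expected result", "formula function",
    "interactive story", "narrative medicine", "tongue & pulse",
    "therapeutic template making", "critical thinking", "follow-up",
    "prescription", "herb dosage", "case study",
    "real-world problem", "disease mechanism",
]

TEMPLATE = (
    "You are a senior TCM physician. Using ONLY the table fields below, "
    "write a '{scenario}' instruction-response pair.\n"
    "Formula: {formula}\nHerbs: {herbs}\nIndications: {indications}\n"
    "Do not introduce facts that are absent from the fields above."
)

def build_prompts(row):
    """Return one strictly templated prompt per scenario for a table row."""
    return [TEMPLATE.format(scenario=s, **row) for s in SCENARIOS]

row = {"formula": "三元汤", "herbs": "当归, 川芎, 白芍", "indications": "妇科血虚诸证"}
prompts = build_prompts(row)
print(len(prompts))  # one prompt per scenario type
```

Each prompt would then be sent to the OpenAI API, with physician review in the loop before the generated pair enters the instruction set.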
```
{
    "instruction": "我对三元汤的全过程很好奇,能否从简介、病历、症状、诊断和治疗,以及结果讨论等方面给我详细介绍?",
#### 1.2 Regular TCM Instruction Data Construction Strategy
In addition, we added instructions based on the content of Chinese medicine ancient books, term explanations, symptom synonyms and antonyms, syndromes, symptoms, treatment methods, and so on. To form a controlled comparison, we use only one instruction template to represent this part of the data, which comprises about 80,000 entries, significantly more than the number of instructions constructed by the strategy above. The specific instruction and token counts are given below.

Data Source and Instruction Quantity Table:
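The contrast with Section 1.1 is that every entry in this part shares a single fixed instruction string. A minimal sketch of that wrapping step (the helper name and the two example term/definition pairs are illustrative, not from the actual dataset):

```python
# Illustrative sketch of the single-template strategy: every raw
# (term, definition) pair is wrapped with ONE fixed instruction string,
# unlike the 15-scenario strategy, where the instruction varies per scenario.

import json

FIXED_INSTRUCTION = "请回答以下有关于中医疾病名词解释的相关问题:"

def to_record(term, definition):
    # The instruction field is identical for all entries; only input/output vary.
    return {"instruction": FIXED_INSTRUCTION, "input": term, "output": definition}

pairs = [
    ("崩漏", "妇科病名,指经血非时暴下不止或淋漓不尽。"),
    ("带下", "泛指妇女阴道分泌物异常的病证。"),
]
records = [to_record(t, d) for t, d in pairs]
print(json.dumps(records[0], ensure_ascii=False))
```

Applied to the roughly 80,000 source entries, this yields the low-diversity control set used against the multi-scenario data.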
```
{
    "instruction": "请回答以下有关于中医疾病名词解释的相关问题:",
## 2. Model Performance Comparison
Our test data are based on real medical cases from highly skilled traditional Chinese medicine doctors, typically case reports at the level of provincial renowned senior practitioners or national medical masters, to ensure a degree of professionalism. Such data are strictly out-of-distribution, both in subject matter and relative to the training dataset distribution, and are distinct from conventional training and validation splits. In preliminary comparisons with large language models such as Wenxin Yiyan and Spark, we found that our model exhibits good generalization on the diversified therapeutic decomposition instruction dataset constructed from 300 Traditional Chinese Medicine prescription records. This perhaps offers initial confirmation that, like humans, large language models learn implicit knowledge and logic more readily from text represented in diverse forms.
| | | | | | |
|-|-|-|-|-|-|
Our preliminary tests reveal that the ZhongJing large language model demonstrates a certain degree of diagnostic and prescription capability not only in gynecology but also in other clinical specialties of traditional Chinese medicine, indicating its potential for generalization. This finding is significant: it suggests that our approach of combining a multi-task therapeutic decomposition strategy with a domain-specific million-level instruction dataset is effective in enhancing the model's reasoning over prescription data and diagnostic thinking logic. It also indicates the potential of large language models at the 7B-parameter level in fields where professional knowledge has a low tolerance for errors, such as medical and legal scenarios.
## To Do List
- [ ] Adopt a multi-task therapeutic decomposition strategy, based on multidisciplinary data such as internal medicine, gynecology, pediatrics, and orthopedics, to fine-tune the model with domain-specific million-level instruction data.
- [ ] Continuously iterate and update. Subsequent releases will include Li Shizhen, Wang Shuhe, Huangfu Mi, Sun Simiao, Ge Hong, and Qihuang versions of the Traditional Chinese Medicine large language model.
- [ ] Explore efficient domain fine-tuning strategies.
## Acknowledgements
The LoRA fine-tuning part of this project draws on the ideas of alpaca-lora and Chinese-Vicuna. We thank the members of the relevant research teams.
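For readers unfamiliar with LoRA, the core idea behind the fine-tuning approach borrowed from alpaca-lora can be shown in a few lines. This is a toy numeric sketch, not the project's training code: LoRA freezes the base weight matrix W and learns only a low-rank update B·A, so the effective weight is W + (alpha / r)·B·A; all matrices and the rank here are toy values.

```python
# Toy illustration of the LoRA low-rank update (not the project's training code).
# A frozen weight W (d_out x d_in) gets a learned update B @ A of rank r,
# scaled by alpha / r, giving the effective weight W + (alpha / r) * B @ A.

def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_effective_weight(W, A, B, alpha, r):
    BA = matmul(B, A)          # (d_out x r) @ (r x d_in) -> d_out x d_in
    scale = alpha / r          # standard LoRA scaling factor
    return [[w + scale * ba for w, ba in zip(w_row, ba_row)]
            for w_row, ba_row in zip(W, BA)]

# Rank-1 update on a 2x2 frozen weight: only B and A would be trained.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [0.0]]
A = [[0.0, 2.0]]
W_eff = lora_effective_weight(W, A, B, alpha=2, r=1)
print(W_eff)  # [[1.0, 4.0], [0.0, 1.0]]
```

At realistic sizes the saving is what matters: for a d x d weight, B and A hold only 2·d·r parameters instead of d², which is why a 7B model can be fine-tuned on modest hardware.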
## Disclaimer
This research is for academic use only; commercial use is not allowed without permission, and it must not be used in medical scenarios, or scenarios with potential medical intent, for clinical practice. This Traditional Chinese Medicine large language model is still in the laboratory testing stage. Its emerging syndrome classification and prescription generation capabilities are still rudimentary, and it does not yet offer highly reliable clinical diagnostic and therapeutic capability for gynecology or other clinical specialties. Output results are for internal reference and testing only. Real medical diagnosis and decision-making must still be issued by experienced physicians through a strictly regulated diagnostic and therapeutic process.
## Collaboration
Data processing and annotation are among the most important steps in training the model. We sincerely welcome Traditional Chinese Medicine practitioners with strong TCM thinking and an innovative spirit to join us, and we will declare corresponding data contributions. We look forward to the day when we can achieve a reliable General Artificial Intelligence for Traditional Chinese Medicine, allowing this ancient discipline to blend with modern technology and shine anew; this is the ultimate mission of this project. If interested, please send an email to 21110860035@m.fudan.edu.cn.
## Team Introduction
This project is jointly guided by Professor Zhang Wenqiang from Fudan University and Professor Wang Haofen from Tongji University. It is completed by Kang Yanlan, [Chang Yang](https://github.com/techlead-krischang), and Fu Jiyuan, members of the [ROI Lab](https://www.fudanroilab.com/) at Fudan University.
## Citation
If you find this work useful in your research, please cite our repository: