Commit 4f88fcaa authored by myhloli

feat(ocr): add new Chinese OCR model and update language support

- Add new Chinese OCR model (ch_PP-OCRv4_rec_server_doc_infer) for server-side use
- Update language support in app.py to include new Chinese model
- Modify models_config.yml to add new model configuration
parent 3cf1ea1f
...@@ -48,6 +48,12 @@ Easier to use: Just grab MinerU Desktop. No coding, no login, just a simple inte ...@@ -48,6 +48,12 @@ Easier to use: Just grab MinerU Desktop. No coding, no login, just a simple inte
</div> </div>
# Changelog # Changelog
- Released on 2025/04/23, version 1.3.8
- The default `ocr` model (`ch`) has been updated to `PP-OCRv4_server_rec_doc` (model update required)
- `PP-OCRv4_server_rec_doc` builds on `PP-OCRv4_server_rec` and is trained on a mix of additional Chinese document data and PP-OCR training data. It adds recognition of some Traditional Chinese characters, Japanese, and special characters, and supports more than 15,000 characters. Beyond document text, it also improves general text recognition.
- [Performance comparison between PP-OCRv4_server_rec_doc, PP-OCRv4_server_rec, and PP-OCRv4_mobile_rec](https://paddlepaddle.github.io/PaddleX/latest/en/module_usage/tutorials/ocr_modules/text_recognition.html#ii-supported-model-list)
- Verified results show that the `PP-OCRv4_server_rec_doc` model significantly improves accuracy in both single-language (`Chinese`, `English`, `Japanese`, `Traditional Chinese`) and mixed-language scenarios, with speed comparable to `PP-OCRv4_server_rec`, making it suitable for most use cases.
- In a small number of pure-English scenarios, the `PP-OCRv4_server_rec_doc` model may run adjacent words together, whereas `PP-OCRv4_server_rec` performs better in such cases. We have therefore retained the `PP-OCRv4_server_rec` model, which users can select by passing `lang='ch_server'` (Python API) or `--lang ch_server` (CLI).
- 2025/04/22 1.3.7 Released - 2025/04/22 1.3.7 Released
- Fixed the issue where the `lang` parameter was ineffective during table parsing model initialization. - Fixed the issue where the `lang` parameter was ineffective during table parsing model initialization.
- Fixed the significant slowdown in OCR and table parsing speed in `cpu` mode. - Fixed the significant slowdown in OCR and table parsing speed in `cpu` mode.
......
...@@ -47,6 +47,12 @@ ...@@ -47,6 +47,12 @@
</div> </div>
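# Changelog # Changelog
- 2025/04/23 1.3.8 Released
- The default `ocr` model (`ch`) has been updated to `PP-OCRv4_server_rec_doc` (model update required)
- `PP-OCRv4_server_rec_doc` builds on `PP-OCRv4_server_rec` and is trained on a mix of additional Chinese document data and PP-OCR training data. It adds recognition of some Traditional Chinese characters, Japanese, and special characters, and supports more than 15,000 characters. Beyond document text, it also improves general text recognition.
- [Performance comparison of PP-OCRv4_server_rec_doc / PP-OCRv4_server_rec / PP-OCRv4_mobile_rec](https://paddlepaddle.github.io/PaddleX/latest/module_usage/tutorials/ocr_modules/text_recognition.html#_3)
- In our verification, the `PP-OCRv4_server_rec_doc` model shows a clear accuracy improvement in single-language (Chinese, English, Japanese, Traditional Chinese) and mixed-language scenarios, with speed comparable to `PP-OCRv4_server_rec`, making it suitable for the vast majority of use cases.
- In a small number of pure-English scenarios, `PP-OCRv4_server_rec_doc` may run adjacent words together, whereas `PP-OCRv4_server_rec` performs better in such cases. We have therefore retained the `PP-OCRv4_server_rec` model, which users can select by passing `lang='ch_server'` (Python API) or `--lang ch_server` (command line).
- 2025/04/22 1.3.7 Released - 2025/04/22 1.3.7 Released
- Fixed the issue where the `lang` parameter was ineffective during table parsing model initialization - Fixed the issue where the `lang` parameter was ineffective during table parsing model initialization
- Fixed the significant slowdown in OCR and table parsing speed in `cpu` mode - Fixed the significant slowdown in OCR and table parsing speed in `cpu` mode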
......
...@@ -55,7 +55,8 @@ class PytorchPaddleOCR(TextSystem): ...@@ -55,7 +55,8 @@ class PytorchPaddleOCR(TextSystem):
self.lang = kwargs.get('lang', 'ch') self.lang = kwargs.get('lang', 'ch')
device = get_device() device = get_device()
if device == 'cpu' and self.lang == 'ch': if device == 'cpu' and self.lang in ['ch', 'ch_server']:
logger.warning("The current device in use is CPU. To ensure the speed of parsing, the language is automatically switched to ch_lite.")
self.lang = 'ch_lite' self.lang = 'ch_lite'
if self.lang in latin_lang: if self.lang in latin_lang:
......
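The CPU fallback in the hunk above can be read in isolation as a small pure function. The following is a minimal sketch, not the project's code: `resolve_ocr_lang` and `CPU_FALLBACK_LANGS` are hypothetical names introduced for illustration, and the device string is passed in rather than obtained from `get_device()`.

```python
import logging

logger = logging.getLogger(__name__)

# Language keys that are swapped for the lighter model on CPU,
# taken from the new condition in the diff. The constant name is hypothetical.
CPU_FALLBACK_LANGS = ('ch', 'ch_server')


def resolve_ocr_lang(device: str, lang: str = 'ch') -> str:
    """Return the language key that would be loaded (hypothetical helper)."""
    if device == 'cpu' and lang in CPU_FALLBACK_LANGS:
        # Same message as the warning added in this commit.
        logger.warning(
            'The current device in use is CPU. To ensure the speed of parsing, '
            'the language is automatically switched to ch_lite.'
        )
        return 'ch_lite'
    return lang
```

Under this reading, requesting `ch_server` on a CPU-only machine yields `ch_lite`, so the retained `PP-OCRv4_server_rec` model would only take effect when a non-CPU device is detected.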
...@@ -171,6 +171,31 @@ ch_PP-OCRv4_rec_server_infer: ...@@ -171,6 +171,31 @@ ch_PP-OCRv4_rec_server_infer:
nrtr_dim: 384 nrtr_dim: 384
max_text_length: 25 max_text_length: 25
ch_PP-OCRv4_rec_server_doc_infer:
model_type: rec
algorithm: SVTR_HGNet
Transform:
Backbone:
name: PPHGNet_small
Head:
name: MultiHead
out_channels_list:
CTCLabelDecode: 15631
head_list:
- CTCHead:
Neck:
name: svtr
dims: 120
depth: 2
hidden_dims: 120
kernel_size: [ 1, 3 ]
use_guide: True
Head:
fc_decay: 0.00001
- NRTRHead:
nrtr_dim: 384
max_text_length: 25
chinese_cht_PP-OCRv3_rec_infer: chinese_cht_PP-OCRv3_rec_infer:
model_type: rec model_type: rec
algorithm: SVTR algorithm: SVTR
......
...@@ -3,10 +3,14 @@ lang: ...@@ -3,10 +3,14 @@ lang:
det: ch_PP-OCRv3_det_infer.pth det: ch_PP-OCRv3_det_infer.pth
rec: ch_PP-OCRv4_rec_infer.pth rec: ch_PP-OCRv4_rec_infer.pth
dict: ppocr_keys_v1.txt dict: ppocr_keys_v1.txt
ch: ch_server:
det: ch_PP-OCRv3_det_infer.pth det: ch_PP-OCRv3_det_infer.pth
rec: ch_PP-OCRv4_rec_server_infer.pth rec: ch_PP-OCRv4_rec_server_infer.pth
dict: ppocr_keys_v1.txt dict: ppocr_keys_v1.txt
ch:
det: ch_PP-OCRv3_det_infer.pth
rec: ch_PP-OCRv4_rec_server_doc_infer.pth
dict: ppocrv4_doc_dict.txt
en: en:
det: en_PP-OCRv3_det_infer.pth det: en_PP-OCRv3_det_infer.pth
rec: en_PP-OCRv4_rec_infer.pth rec: en_PP-OCRv4_rec_infer.pth
......
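To make the remapping in this hunk concrete, here is a small sketch of how a language key resolves to model files after the change. It assumes a plain dictionary that mirrors only the two fully visible entries; `LANG_MODELS` and `model_files` are hypothetical names, and the entry whose key is cut off at the top of the hunk is deliberately left out.

```python
# Mirror of the `ch_server` and `ch` entries after this commit.
# Hypothetical structure for illustration; the real file is YAML.
LANG_MODELS = {
    'ch_server': {
        'det': 'ch_PP-OCRv3_det_infer.pth',
        'rec': 'ch_PP-OCRv4_rec_server_infer.pth',
        'dict': 'ppocr_keys_v1.txt',
    },
    'ch': {
        'det': 'ch_PP-OCRv3_det_infer.pth',
        'rec': 'ch_PP-OCRv4_rec_server_doc_infer.pth',
        'dict': 'ppocrv4_doc_dict.txt',
    },
}


def model_files(lang: str) -> dict:
    """Look up the det/rec/dict file names for a language key."""
    try:
        return LANG_MODELS[lang]
    except KeyError:
        raise ValueError(f'unknown OCR language key: {lang!r}') from None
```

The point of the sketch is that `ch` now needs both a new recognition weight file and a new dictionary file, while `ch_server` keeps the files that `ch` used before this commit.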
...@@ -158,7 +158,7 @@ devanagari_lang = [ ...@@ -158,7 +158,7 @@ devanagari_lang = [
'hi', 'mr', 'ne', 'bh', 'mai', 'ang', 'bho', 'mah', 'sck', 'new', 'gom', # noqa: E126 'hi', 'mr', 'ne', 'bh', 'mai', 'ang', 'bho', 'mah', 'sck', 'new', 'gom', # noqa: E126
'sa', 'bgc' 'sa', 'bgc'
] ]
other_lang = ['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka'] other_lang = ['ch', 'ch_lite', 'ch_server', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka']
add_lang = ['latin', 'arabic', 'cyrillic', 'devanagari'] add_lang = ['latin', 'arabic', 'cyrillic', 'devanagari']
# all_lang = ['', 'auto'] # all_lang = ['', 'auto']
......
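A quick way to see the effect of the `other_lang` change is to check which keys become selectable. The sketch below copies the two lists visible in the hunk. How `app.py` actually combines them is not shown in this diff, so `selectable_langs` is a hypothetical helper that simply concatenates them.

```python
# Lists as they appear on the new side of the hunk.
other_lang = ['ch', 'ch_lite', 'ch_server', 'en', 'korean', 'japan',
              'chinese_cht', 'ta', 'te', 'ka']
add_lang = ['latin', 'arabic', 'cyrillic', 'devanagari']


def selectable_langs() -> list:
    """Concatenate the visible lists (assumed combination, for illustration)."""
    return [*other_lang, *add_lang]


def is_selectable(lang: str) -> bool:
    """Return True if the key appears in either visible list."""
    return lang in selectable_langs()
```

With the old list, `ch_lite` and `ch_server` would not have appeared among these keys; with the new one, both do.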