Commits · 8a0aa7a479e9c8d5bc63589024be11ba5fdebf3f · wangsen / MinerU

06 Jan, 2025 5 commits

Merge branch 'dev' into dev · 8a0aa7a4
Xiaomeng Zhao authored Jan 06, 2025

8a0aa7a4

build(docker): update Dockerfiles for China and Huawei NPU versions · 2e1bf881

myhloli authored Jan 06, 2025

- Update package sources to use Aliyun mirrors for faster downloads
- Upgrade pip and install Python packages in virtual environment
- Add python3.10-dev package to Huawei NPU Dockerfile
- Update requirements file URLs to master branch
- Install specific version of torch_npu in Huawei NPU Dockerfile
- Update magic-pdf installation method
- Improve modelscope installation process
- Optimize model download and configuration update steps

2e1bf881

build(docker): update Dockerfiles and download scripts · 36c3ad6f

myhloli authored Jan 06, 2025

- Update Dockerfiles in china, global, and huawei_npu directories
- Improve wget commands by specifying output file names
- Update READMEs to reflect new Dockerfile locations

36c3ad6f

Merge remote-tracking branch 'origin/dev' into dev · 0f1dff1e
myhloli authored Jan 06, 2025

0f1dff1e

build(docker): add Dockerfiles for global and Huawei NPU setups · ad099808

myhloli authored Jan 06, 2025

- Add Dockerfile for global setup with Ubuntu base image
- Add Dockerfile for Huawei NPU setup with Ascend base image
- Update requirements file structure:
  - Rename requirements-docker.txt to docker/china/requirements.txt
  - Add new requirements files for global and Huawei NPU setups
- Install necessary packages and dependencies in both Dockerfiles
- Set up virtual environment and install Python packages
- Download models and configure magic-pdf for both setups

ad099808

05 Jan, 2025 4 commits

docs(README): update documentation for NPU support · 2e8601ab

myhloli authored Jan 05, 2025

- Add section for using NPU acceleration in both English and Chinese README files
- Update system requirements to include CANN environment for NPU support
- Enhance the "Quick Start" guide with NPU-related information
- Modify hardware requirements to specify "Ascend 910b" for NPU acceleration

2e8601ab

feat(tools): add character bounding box drawing functionality · f911a102

myhloli authored Jan 05, 2025

- Add `draw_char_bbox` function to `draw_bbox.py` for drawing character bounding boxes
- Integrate `draw_char_bbox` into `common.py` for use in PDF processing pipeline
- Include option to draw character bounding boxes in debug mode

f911a102

style(pdf_parse_union_core_v2): remove unnecessary spaces and improve code... · 9951a170

myhloli authored Jan 05, 2025

style(pdf_parse_union_core_v2): remove unnecessary spaces and improve code formatting

- Remove extra space in conditional statement for character spacing logic
- Adjust spacing in trigonometric checks for line direction
- Improve overall code readability and consistency

9951a170

fix(magic-pdf): update OCR model selection logic · 16a0a350

myhloli authored Jan 05, 2025

- Add missing 'else' statement in OCR model selection logic
- Ensure consistent formatting of 'if' statements for better readability
- Remove unnecessary empty line in the 'app.py' file

16a0a350

03 Jan, 2025 4 commits
- refactor(ocr): comment out unnecessary log statement · 04febf52
  myhloli authored Jan 03, 2025
```
- Remove logger.info() call for additional_ocr_params to reduce log verbosity
```
  04febf52
- feat(model): add onnxruntime support for paddleocr on cpu · 512adb67
  myhloli authored Jan 03, 2025
```
- Implement ONNXModelSingleton to manage ONNX models
- Modify ModifiedPaddleOCR to use ONNX models on ARM CPUs without CUDA
- Update RapidTableModel to use RapidOCR with ONNXRuntime on CPU
- Add rapidocr_onnxruntime dependency in setup.py
```
  512adb67
- Merge pull request #1398 from yzztin/dev · ad9abc32
  Xiaomeng Zhao authored Jan 03, 2025
```
fix(web_api): Modify the import path of InferenceResult
```
  ad9abc32
- fix(web_api): Modify the import path of InferenceResult · 05109c36
  yzz authored Jan 03, 2025
  
  05109c36
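
The `ONNXModelSingleton` mentioned in commit 512adb67 above suggests a load-once model cache. The sketch below illustrates that general pattern only; it is not the code from the commit. The class name `ModelSingletonSketch`, the `get_model` method, and the key/loader arguments are all hypothetical names chosen for this example.

```python
import threading
from typing import Any, Callable, Dict, Hashable


class ModelSingletonSketch:
    """Process-wide cache that builds each model at most once per key."""

    _instance = None
    _lock = threading.RLock()  # re-entrant, so a loader may touch the cache

    def __new__(cls):
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
                cls._instance._models = {}  # type: Dict[Hashable, Any]
        return cls._instance

    def get_model(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        """Return the cached model for `key`, calling `loader` on first use."""
        with self._lock:
            if key not in self._models:
                self._models[key] = loader()
            return self._models[key]
```

A caller would pass a loader that builds the real object, for example a function that creates an `onnxruntime.InferenceSession` from a model path of its choosing. Keying the cache (here by an arbitrary hashable such as a `(task, language)` tuple) lets several models coexist while each is still loaded only once.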
02 Jan, 2025 3 commits

Merge pull request #1386 from myhloli/fix-char-without-space · 26f8cbac
Xiaomeng Zhao authored Jan 02, 2025
```
refactor(pdf_parse): improve character spacing handling in PDF text extraction
```
26f8cbac

refactor(pdf_parse): improve character spacing handling in PDF text extraction · c93950dc

myhloli authored Jan 02, 2025

- Update the logic for inserting spaces between characters
- Consider the next character's position instead of the previous one
- Adjust the spacing threshold to 25% of the average character width
- Ignore spaces at the end of lines to prevent double spaces

c93950dc
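
The spacing rule described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in the repository: the `Char` type, the `join_chars` helper, and the `(x0, y0, x1, y1)` bbox layout are assumptions made for the example, and "ignore spaces at the end of lines" is interpreted here as dropping trailing space characters.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Char:
    c: str
    bbox: Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def join_chars(chars: List[Char], ratio: float = 0.25) -> str:
    """Join the characters of one line, adding a space where the gap is wide."""
    # Drop trailing space characters so the line never ends in a space.
    while chars and chars[-1].c == " ":
        chars = chars[:-1]
    if not chars:
        return ""
    avg_width = sum(ch.bbox[2] - ch.bbox[0] for ch in chars) / len(chars)
    out = []
    for i, ch in enumerate(chars):
        out.append(ch.c)
        if i + 1 < len(chars):
            nxt = chars[i + 1]
            # Measure the gap to the NEXT character, not the previous one.
            gap = nxt.bbox[0] - ch.bbox[2]
            # Skip insertion next to an existing space to avoid double spaces.
            if gap > avg_width * ratio and ch.c != " " and nxt.c != " ":
                out.append(" ")
    return "".join(out)
```

With three 10-unit-wide characters where only the last gap is 5 units, the threshold is 2.5 units, so a space is inserted only before the third character.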

refactor(pdf_parse): improve character spacing handling in PDF text extraction · 7c5cdcd4

myhloli authored Jan 02, 2025

- Update the logic for inserting spaces between characters
- Consider the next character's position instead of the previous one
- Adjust the spacing threshold to 25% of the average character width
- Ignore spaces at the end of lines to prevent double spaces

7c5cdcd4

30 Dec, 2024 3 commits

refactor(magic_pdf): comment out npu-related code · 88b909e2

myhloli authored Dec 30, 2024

- Remove use_npu variable initialization
- Comment out device assignment and npu check
- Comment out use_npu parameter in ModifiedPaddleOCR constructor

88b909e2

fix(npu): correct module name for NPU operations · 2684e775

myhloli authored Dec 30, 2024

- Update `clean_memory.py` to use `torch_npu.npu` instead of `torch.npu`
- Update `model_utils.py` to use `torch_npu.npu` instead of `torch.npu`
- Simplify NPU availability check and bfloat16 support in `pdf_parse_union_core_v2.py`

2684e775

build(deps): update pydantic to latest version · 2e87e649

myhloli authored Dec 30, 2024

- Remove upper version limit for pydantic dependency
- This change allows for the use of the latest pydantic version

2e87e649

27 Dec, 2024 3 commits
- Merge pull request #1370 from icecraft/fix/path_delimiter · e72709cc
  Xiaomeng Zhao authored Dec 27, 2024
```
fix: s3 path join method
```
  e72709cc
- fix: s3 path join method · d637dab3
  icecraft authored Dec 27, 2024
  
  d637dab3
- build: add openai to requirements-docker.txt · dc0d30f5
  myhloli authored Dec 27, 2024
```
- Add openai package to requirements-docker.txt
```
  dc0d30f5
26 Dec, 2024 4 commits

refactor(device): optimize memory cleaning and device selection · 50f48417

myhloli authored Dec 26, 2024

- Update clean_memory function to support both CUDA and NPU devices
- Implement get_device function to centralize device selection logic
- Modify model initialization and memory cleaning to use the selected device
- Update RapidTableModel to support both RapidOCR and PaddleOCR engines

50f48417
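
Centralized device selection of the kind this commit describes might look like the sketch below. It is an illustration under assumptions, not the repository's implementation: the function bodies are written for this example, and the NPU calls assume `torch_npu.npu` mirrors the `torch.cuda` interface (`is_available`, `empty_cache`), which matches the module name used in the later fix 2684e775 but should be checked against the installed `torch_npu` version.

```python
import gc
import importlib.util


def get_device() -> str:
    """Return "cuda", "npu", or "cpu" depending on what is usable."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    except Exception:
        return "cpu"
    if importlib.util.find_spec("torch_npu") is not None:
        try:
            import torch_npu  # assumed to expose a torch.cuda-like `npu` module
            if torch_npu.npu.is_available():
                return "npu"
        except Exception:
            pass  # e.g. the CANN runtime is missing; fall through to CPU
    return "cpu"


def clean_memory(device: str) -> None:
    """Release cached accelerator memory for the given device, then run GC."""
    if device == "cuda":
        import torch
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()
    elif device == "npu":
        import torch_npu
        torch_npu.npu.empty_cache()
    gc.collect()
```

Selecting the device once and passing the resulting string to both model initialization and `clean_memory` keeps the CUDA/NPU branching in one place.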

feat(model): add npu support and optimize table model · 7990e7df

myhloli authored Dec 26, 2024

- Add NPU support for memory cleaning and model initialization
- Optimize table model initialization and prediction process
- Update memory utils to support NPU
- Add language parameter for table model

7990e7df

Merge pull request #1365 from myhloli/dev · 667b1a39
Xiaomeng Zhao authored Dec 26, 2024
```
build(deps): upgrade unimernet to 0.2.3
```
667b1a39

build(deps): upgrade unimernet to 0.2.3 · 96f8da2a

myhloli authored Dec 26, 2024

- Update unimernet from 0.2.2 to 0.2.3 in requirements-docker.txt and setup.py
- Remove torchtext/eva-decord dependency

96f8da2a

25 Dec, 2024 3 commits

Merge pull request #1362 from myhloli/dev · f3ae9fd8
Xiaomeng Zhao authored Dec 25, 2024
```
feat(llm_aided): add title optimization feature
```
f3ae9fd8

refactor(magic_pdf): remove unnecessary logging statements · 192047a1

myhloli authored Dec 25, 2024

- Comment out logging statements for title list, title completion, and length comparison
- Improve code readability and reduce clutter by removing unused debug information

192047a1

feat(llm_aided): add title optimization feature · 0a468eca

myhloli authored Dec 25, 2024

- Implement llm_aided_title function to optimize document titles using LLM
- Update pdf_parse_union_core_v2.py to include title optimization
- Modify ocr_mkcontent.py to use optimized title levels
- Add openai SDK dependency in setup.py

0a468eca

24 Dec, 2024 2 commits

Merge pull request #1352 from myhloli/add-llm-aided · da3257a6
Xiaomeng Zhao authored Dec 24, 2024
```
feat(llm): add LLM-aided formula and text correction
```
da3257a6

feat(llm): add LLM-aided formula and text correction · c660fdc8

myhloli authored Dec 24, 2024

- Add LLM-aided formula and text correction functionality
- Update config reader to include LLM-aided settings
- Create new LLM-aided processing module
- Update main processing script to incorporate LLM-aided corrections
- Modify download scripts to check for new config version

c660fdc8

20 Dec, 2024 5 commits
- Merge pull request #1338 from myhloli/dev · 0281048d
  Xiaomeng Zhao authored Dec 20, 2024
```
refactor(pre_proc): improve character overlap handling in spans 
```
  0281048d
- Merge remote-tracking branch 'origin/dev' into dev · 24dfd1a0
  myhloli authored Dec 20, 2024
  
  24dfd1a0
- refactor(pre_proc): improve character overlap handling in spans · 15e87667
  myhloli authored Dec 20, 2024
```
- Remove remove_overlaps_chars function
- Add check_chars_is_overlap_in_span function
- Update span processing logic to handle character overlaps
- Improve efficiency and readability of overlap detection
```
  15e87667
- Merge pull request #1336 from icecraft/docs/add_more_method_usage · 58b2e78d
  Xiaomeng Zhao authored Dec 20, 2024
```
docs: add more method description
```
  58b2e78d
- docs: add more method description · 24ee9c41
  xu rui authored Dec 20, 2024
  
  24ee9c41
19 Dec, 2024 4 commits

Merge pull request #1330 from myhloli/dev · a9dea5f0
Xiaomeng Zhao authored Dec 19, 2024
```
feat(demo): add demo script for PDF processing
```
a9dea5f0

feat(demo): add demo script for PDF processing · d6a29162

myhloli authored Dec 19, 2024

- Create demo.py script for PDF file processing
- Implement PDF reading, classification, and inference using Opendatalab's magic_pdf library
- Add pipelines for OCR and text modes
- Include result visualization and markdown export

d6a29162

Merge pull request #1329 from myhloli/dev · 5eb9feee
Xiaomeng Zhao authored Dec 19, 2024
```
feat(pre_proc): add function to remove overlapping characters in spans
```
5eb9feee

feat(pre_proc): add function to remove overlapping characters in spans · 2f4d4b0c

myhloli authored Dec 19, 2024

- Implement remove_overlaps_chars function to detect and remove overlapping characters within spans
- Integrate remove_overlaps_chars function into the PDF parsing process
- Improve character-level processing and reduce redundancy in OCR results

2f4d4b0c
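
Detecting and dropping overlapping characters inside a span could be done along the lines below. This is a hedged sketch, not the `remove_overlaps_chars` function from the commit: the helper names, the dict layout with a `"bbox"` key, and the 0.5 overlap threshold are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def overlap_ratio(a: BBox, b: BBox) -> float:
    """Intersection area divided by the area of the smaller box."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    smaller = min(area_a, area_b)
    return (w * h) / smaller if smaller > 0 else 0.0


def remove_overlapping_chars_sketch(
    chars: List[Dict], threshold: float = 0.5
) -> List[Dict]:
    """Keep a character only if it does not heavily overlap one already kept."""
    kept: List[Dict] = []
    for ch in chars:
        if all(overlap_ratio(ch["bbox"], k["bbox"]) <= threshold for k in kept):
            kept.append(ch)
    return kept
```

The first character in reading order wins when two boxes overlap, which is one simple policy; the pairwise check is quadratic in span length, which is usually acceptable for the short spans found in a single text line.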