Commits · af27c0cc81e76199cfbfb5f1ca4cf1a360802fe4 · wangsen / MinerU

20 Mar, 2025 1 commit

refactor(magic_pdf): support mps device and optimize image processing · af27c0cc

myhloli authored Mar 20, 2025

- Add support for Apple M1 chips (mps device)
- Refactor image processing for better performance and compatibility
- Update model loading and inference for various devices
- Adjust batch processing and memory management
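
The device handling described above can be sketched as a simple preference order. This is a minimal illustration, not the code from this commit: `pick_device` and `detect_device` are hypothetical names, and the CUDA, then MPS, then CPU ordering is an assumption. The only PyTorch calls used are `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
from typing import Optional


def pick_device(cuda_available: bool, mps_available: bool,
                requested: Optional[str] = None) -> str:
    """Return a device string, preferring CUDA, then MPS, then CPU."""
    available = {"cuda": cuda_available, "mps": mps_available, "cpu": True}
    if requested is not None:
        # Honor an explicit request only if that backend is usable.
        return requested if available.get(requested, False) else "cpu"
    for name in ("cuda", "mps"):
        if available[name]:
            return name
    return "cpu"


def detect_device() -> str:
    """Probe PyTorch if it is installed; fall back to CPU otherwise."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds have no MPS backend at all.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)
```

Keeping the decision in a pure function makes the fallback order easy to check without any particular hardware.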

af27c0cc

13 Feb, 2025 2 commits
- build(deps): update PaddlePaddle dependency versions · 0e9a2518
  myhloli authored Feb 13, 2025
```
- Update PaddlePaddle to 3.0.0rc1 for Linux and macOS
```
  0e9a2518
- build(deps): update rapidocr-paddle and rapidocr_onnxruntime dependencies · 19b4f8d4
  myhloli authored Feb 13, 2025
```
- Update rapidocr-paddle dependency to version >=1.4.5, <2.0.0
- Update rapidocr_onnxruntime dependency to version >=1.4.4, <2.0.0
```
  19b4f8d4
16 Jan, 2025 1 commit

feat(table): upgrade RapidTable to 1.0.3 and add sub-model support · 79c8a5c8

myhloli authored Jan 16, 2025

- Update RapidTable dependency to version 1.0.3
- Add support for sub-models in RapidTable
- Update magic-pdf configuration to include table sub-model
- Modify table model initialization to support sub-models
- Update table prediction logic to handle new output format

79c8a5c8

14 Jan, 2025 1 commit

feat(layout): improve title block handling and layout detection · c20e9a1e

myhloli authored Jan 14, 2025

- Merge title blocks that are close to each other horizontally
- Adjust line insertion logic for title blocks
- Increase image size and decrease confidence threshold for layout detection
- Update DocLayoutYOLO model weights
- Refactor drawing of bounding boxes for different block types
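
One way to picture the horizontal title merge from the first bullet is the sketch below. It is an illustration under stated assumptions, not the project's implementation: the `(x0, y0, x1, y1)` box layout, the `merge_close_titles` name, and the `max_gap` threshold are all hypothetical.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def _same_row(a: BBox, b: BBox) -> bool:
    """True if the vertical extents of the two boxes overlap."""
    return min(a[3], b[3]) > max(a[1], b[1])


def merge_close_titles(blocks: List[BBox], max_gap: float = 10.0) -> List[BBox]:
    """Single-pass greedy merge of same-row boxes whose gap is <= max_gap."""
    merged: List[BBox] = []
    for box in sorted(blocks, key=lambda b: (b[1], b[0])):
        for i, prev in enumerate(merged):
            # Horizontal gap between the boxes; negative when they overlap.
            gap = max(box[0], prev[0]) - min(box[2], prev[2])
            if _same_row(prev, box) and gap <= max_gap:
                # Replace the earlier box with the union of both.
                merged[i] = (min(prev[0], box[0]), min(prev[1], box[1]),
                             max(prev[2], box[2]), max(prev[3], box[3]))
                break
        else:
            merged.append(box)
    return merged
```

For example, `merge_close_titles([(0, 0, 50, 20), (55, 2, 120, 22)])` joins the two boxes into `(0, 0, 120, 22)` because they share a row and sit 5 units apart. A single pass can miss chains of merges, so a production version would likely repeat until nothing changes.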

c20e9a1e

09 Jan, 2025 1 commit
- build(deps): specify version for rapid_table dependency · a935c33f
  myhloli authored Jan 09, 2025
```
- Update rapid_table dependency to version 0.3.0 in setup.py
```
  a935c33f
03 Jan, 2025 1 commit

feat(model): add onnxruntime support for paddleocr on cpu · 512adb67

myhloli authored Jan 03, 2025

- Implement ONNXModelSingleton to manage ONNX models
- Modify ModifiedPaddleOCR to use ONNX models on ARM CPUs without CUDA
- Update RapidTableModel to use RapidOCR with ONNXRuntime on CPU
- Add rapidocr_onnxruntime dependency in setup.py
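
The singleton named in the first bullet is not shown in this log, so the following is only a generic sketch of the pattern: one shared object that loads each model once and hands back the cached instance afterwards. `ModelSingleton`, `get_model`, and the `loader` callback are hypothetical names chosen for illustration.

```python
import threading
from typing import Any, Callable, Dict, Hashable


class ModelSingleton:
    """Process-wide cache: each key is loaded at most once, then reused."""

    _instance = None
    _lock = threading.RLock()  # re-entrant so a loader may touch the cache

    def __new__(cls):
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
                cls._instance._models = {}  # key -> loaded model
        return cls._instance

    def get_model(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        """Return the cached model for key, calling loader() on first use."""
        with self._lock:
            if key not in self._models:
                self._models[key] = loader()
            return self._models[key]
```

With ONNX Runtime, the loader could be something like `lambda: onnxruntime.InferenceSession(path)`, where `path` is whatever model file the caller has resolved. Passing the loader in keeps the cache independent of any one backend.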

512adb67

26 Dec, 2024 1 commit

build(deps): upgrade unimernet to 0.2.3 · 96f8da2a

myhloli authored Dec 26, 2024

- Update unimernet from 0.2.2 to 0.2.3 in requirements-docker.txt and setup.py
- Remove torchtext/eva-decord dependency

96f8da2a

25 Dec, 2024 1 commit

feat(llm_aided): add title optimization feature · 0a468eca

myhloli authored Dec 25, 2024

- Implement llm_aided_title function to optimize document titles using LLM
- Update pdf_parse_union_core_v2.py to include title optimization
- Modify ocr_mkcontent.py to use optimized title levels
- Add openai SDK dependency in setup.py

0a468eca

11 Dec, 2024 1 commit

build(deps): update torch and torchvision version requirements · 9a96362d

myhloli authored Dec 11, 2024

- Specify torch==2.3.1 and torchvision==0.18.1 for Windows CUDA installation
- Add torch and torchvision version constraints in setup.py:
  - torch>=2.2.2,<=2.3.1
  - torchvision>=0.17.2,<=0.18.1
- Update installation instructions in both English and Chinese README files
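
To see what the ranges above admit, here is a small checker for comma-separated specifiers. It is a simplified sketch using only the standard library: `satisfies` is a hypothetical helper that handles plain numeric versions only, so it does not cover pre-releases or local suffixes such as `+cu118`. Real installers follow the full version-specifier rules.

```python
import operator
from typing import Tuple

_OPS = {
    ">=": operator.ge, "<=": operator.le, "==": operator.eq,
    ">": operator.gt, "<": operator.lt,
}


def _parse(version: str) -> Tuple[int, ...]:
    """Turn '2.3.1' into (2, 3, 1), padding short versions with zeros."""
    parts = tuple(int(p) for p in version.strip().split("."))
    return parts + (0,) * (3 - len(parts))


def satisfies(version: str, spec: str) -> bool:
    """Check a version against a spec such as '>=2.2.2,<=2.3.1'."""
    for clause in spec.split(","):
        clause = clause.strip()
        # Try two-character operators first so '>=' is not read as '>'.
        for symbol in (">=", "<=", "==", ">", "<"):
            if clause.startswith(symbol):
                bound = _parse(clause[len(symbol):])
                if not _OPS[symbol](_parse(version), bound):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True
```

Under these rules `satisfies("2.3.1", ">=2.2.2,<=2.3.1")` holds, while `"2.4.0"` and `"2.2.1"` fall outside the torch range.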

9a96362d

09 Dec, 2024 2 commits

refactor(magic_pdf): optimize environment setup and dependencies · a296ea41

myhloli authored Dec 09, 2024

- Add environment variables to disable albumentations and yolo updates
- Import torchtext and disable deprecation warnings
- Update unimernet to 0.2.2
- Specify ultralytics version as >=8.3.48
- Remove upper version limit for torch
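
The first bullet can be sketched as follows. The commit message does not name the variables it sets, so this is an assumption: `NO_ALBUMENTATIONS_UPDATE` is the opt-out flag that albumentations checks, and the ultralytics/YOLO switch is left out because its name is not given here. `disable_update_checks` is a hypothetical helper.

```python
import os
from typing import MutableMapping

# Opt-out flag read by albumentations (assumed to be the one this commit sets).
UPDATE_CHECK_FLAGS = {"NO_ALBUMENTATIONS_UPDATE": "1"}


def disable_update_checks(env: MutableMapping[str, str] = os.environ) -> None:
    """Set the opt-out flags without overriding values already present."""
    for name, value in UPDATE_CHECK_FLAGS.items():
        env.setdefault(name, value)


# Call this before importing the libraries that read these variables.
disable_update_checks()
```

Using `setdefault` leaves any value the user exported beforehand untouched.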

a296ea41

build(deps): update dependency versions · 2ae10394
myhloli authored Dec 09, 2024
```
- Update ultralytics to >=8.3.47
```
2ae10394

06 Dec, 2024 1 commit

build(deps): specify minimum version for ultralytics · 1f1335c2

myhloli authored Dec 06, 2024

- Update `ultralytics` dependency to version >= 8.3.43
- This change ensures compatibility with yolov8 for formula detection

1f1335c2

18 Nov, 2024 1 commit

build(setup): add old_linux specific dependencies · d0f633e2

myhloli authored Nov 18, 2024

- Add albumentations package with version <=1.4.20 for old_linux
- This version is compatible with Linux systems from 2019 and earlier
- Version 1.4.21 and above introduced simsimd which is not supported on older Linux systems

d0f633e2

15 Nov, 2024 1 commit
- refactor(model): rename and restructure model modules · 08f46125
  myhloli authored Nov 15, 2024
  
  08f46125
08 Nov, 2024 2 commits

feat(table): add RapidOCR support for RapidTable model · fe2c2c0d

myhloli authored Nov 09, 2024

- Integrate RapidOCR with RapidTable model for table recognition
- Improve memory management for devices with <= 8GB VRAM
- Update table recognition process to use RapidOCR for RapidTable
- Add rapidocr-paddle dependency in setup.py

fe2c2c0d

feat(table): integrate RapidTable model for table recognition · 240fe99e

myhloli authored Nov 08, 2024

- Add RapidTable model support for table recognition
- Update table model configuration and initialization
- Modify table recognition process to use RapidTable when specified
- Add RapidTable dependency to setup.py

240fe99e

04 Nov, 2024 1 commit

feat(table): upgrade StructEqTable model and integrate into PDF Extract Kit · 11f23843

myhloli authored Nov 04, 2024

- Update StructTableModel to use the latest struct-eqtable library
- Add support for HTML table extraction in PDF Extract Kit
- Improve error handling and model initialization
- Update dependencies in setup.py for struct-eqtable

11f23843

23 Oct, 2024 1 commit
- build(setup): add doclayout_yolo dependency · 73fe8914
  myhloli authored Oct 23, 2024
```
- Add doclayout_yolo==0.0.2 to the list of dependencies in setup.py
```
  73fe8914
10 Sep, 2024 2 commits

Update setup.py · 20212a37
Xiaomeng Zhao authored Sep 10, 2024
```
update UniMERNet to 0.2.1
```
20212a37

refactor(pdf_extract_kit): update model config and weight paths for UniMERNet-0.2.0 · 3e9bc7a4

myhloli authored Sep 10, 2024

Update the paths to model weights and configuration files for the UniMERNet architecture
in both the demo.yaml and model_configs.yaml files. Adjust the mfr_model_init function to
reflect the new weight and configuration paths. The changes include specifying more detailed
paths to the unimernet_base directory and changing the weight file extension to .pth.

3e9bc7a4

04 Aug, 2024 2 commits

fix(setup): allow latest matplotlib versions on non-Windows platforms · 25213909

myhloli authored Aug 04, 2024

The restriction on the matplotlib version has been updated to only apply on Windows
platforms, where precompiled packages are not available starting from version 3.9.1.
This change enables users on Linux and macOS to install newer versions of matplotlib,
addressing compatibility issues with recent bug fixes.
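
A platform-specific pin like this is usually written with environment markers in the requirement strings. The sketch below shows the general shape only: the exact bound used in `setup.py` is not quoted in this log, so `<3.9.1` is an assumption derived from the description, and `matplotlib_requirements` is a hypothetical helper.

```python
from typing import List


def matplotlib_requirements() -> List[str]:
    """Requirement strings that use environment markers to split by platform."""
    return [
        # Windows: stay below 3.9.1, where the message says prebuilt
        # packages stop being available.
        'matplotlib<3.9.1; platform_system == "Windows"',
        # Linux and macOS: no upper bound.
        'matplotlib; platform_system != "Windows"',
    ]
```

The returned list could be merged into `install_requires`; the installer then evaluates each marker on the target machine and ignores the entry that does not apply.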

25213909

fix(dependencies): remove unnecessary pypandoc and struct-eqtable packages;fix... · 9ececf3a

myhloli authored Aug 04, 2024

fix(dependencies): remove unnecessary pypandoc and struct-eqtable packages; fix matplotlib>=3.9.1 not being supported on Windows systems without a compilation environment.

9ececf3a

01 Aug, 2024 1 commit

Feat/impl cli (#264) · 40e0827e

icecraft authored Aug 01, 2024



* feat: refactor cli command

* feat: add docs to describe the output files of cli

* feat: resolve review comments

* feat: update docs about middle.json

---------
Co-authored-by: shenguanlin <shenguanlin@pjlab.org.cn>

40e0827e

30 Jul, 2024 1 commit
- fix(setup): pin unimernet version to 0.1.6 for compatibility · 2c09109e
  myhloli authored Jul 30, 2024
  
  2c09109e
28 Jul, 2024 1 commit
- fix(setup): update PyMuPDF and paddlepaddle dependencies · 46d75499
  myhloli authored Jul 28, 2024
  
  46d75499
23 Jul, 2024 1 commit

feat(setup.py): restructure extras_require options for clarity · 5c963168

myhloli authored Jul 23, 2024

Refactor the `extras_require` section in `setup.py` to simplify and clarify
the available options. Consolidate CPU and GPU requirements into single
"lite" and "full" options to streamline installation for users.

5c963168

12 Jul, 2024 2 commits
- fix(setup): specify paddleocr version to fix compatibility issue · 61fab96e
  myhloli authored Jul 12, 2024
  
  61fab96e
- feat(setup.py): include package data for magic_pdf.resources · d458b705
  myhloli authored Jul 12, 2024
```
Update the setup.py file to explicitly include the package data for the
magic_pdf.resources directory. This ensures that all files within this
directory are packaged and available for use with the magic_pdf package.
```
  d458b705
11 Jul, 2024 1 commit

feat(model): add model mode selection for PDF analysis · bc0f6932

myhloli authored Jul 11, 2024

Introduce a new feature that allows users to choose between a "lite" and a "full"
model mode for PDF document analysis. The "lite" mode uses a faster, less
accurate model, while the "full" mode employs a higher-precision model at the
cost of speed. This selection can be made through the CLI or API, providing
flexibility for different use cases.

bc0f6932

08 Jul, 2024 1 commit
- update: Update the homepage link · 1cedf457
  myhloli authored Jul 08, 2024
  
  1cedf457
25 Jun, 2024 1 commit
- update requirements and setup · 3aa8ccdc
  赵小蒙 authored Jun 25, 2024
  
  3aa8ccdc
20 Jun, 2024 2 commits
- update setup config · 129288aa
  赵小蒙 authored Jun 20, 2024
  
  129288aa
- update: · 756792a3
  赵小蒙 authored Jun 20, 2024
```
add entry points that can be executed from the shell
```
  756792a3
18 Jun, 2024 1 commit
- update requirements · 9dc5033c
  赵小蒙 authored Jun 18, 2024
  
  9dc5033c
05 Jun, 2024 1 commit
- fix: change garbled_rate 0.1 -> 0.02 · 9b5b1163
  赵小蒙 authored Jun 05, 2024
  
  9b5b1163
04 Jun, 2024 2 commits
- change update version logic · 07f6c497
  赵小蒙 authored Jun 04, 2024
  
  07f6c497
- add version_name to middle json · 1de37e4c
  赵小蒙 authored Jun 04, 2024
  
  1de37e4c
03 Jun, 2024 1 commit
- add version_name to middle json · bd183428
  赵小蒙 authored Jun 03, 2024
  
  bd183428
30 May, 2024 1 commit
- update setup · 75478eda
  赵小蒙 authored May 30, 2024
  
  75478eda