Commits · b04867f90a0c6bc35380ecd488261cf1da95e79f · wangsen / MinerU

11 Dec, 2024 12 commits
- docs: check links in doc · b04867f9
  xu rui authored Dec 11, 2024
  
  b04867f9
- feat: support ms-office and images file in command line tools · cece8f53
  xu rui authored Dec 11, 2024
  
  cece8f53
- docs: add quick_start example · 7dc3b0a9
  xu rui authored Dec 10, 2024
  
  7dc3b0a9
- fix: not create empty directory · 1d32722f
  xu rui authored Dec 10, 2024
  
  1d32722f
- feat: support convert ppt/pptx/doc/docx · f6af67eb
  xu rui authored Dec 10, 2024
  
  f6af67eb
- fix: read_api list files · f3ceebc4
  xu rui authored Dec 10, 2024
  
  f3ceebc4
- feat: rewrite code snippet · 3cd51d49
  xu rui authored Dec 09, 2024
  
  3cd51d49
- docs: rewrite install and usage docs · 6ca86bea
  xu rui authored Dec 09, 2024
  
  6ca86bea
- Merge pull request #1258 from icecraft/fix/dup_classify · fd2f3c58
  Xiaomeng Zhao authored Dec 11, 2024
```
fix: dup classify pdf type
```
  fd2f3c58
- fix: dup classify pdf type · 4e7511fb
  icecraft authored Dec 11, 2024
  
  4e7511fb
- Merge pull request #1255 from myhloli/dev · 168a1115
  Xiaomeng Zhao authored Dec 11, 2024
```
build(deps): update torch and torchvision version requirements
```
  168a1115
- build(deps): update torch and torchvision version requirements · 9a96362d
  myhloli authored Dec 11, 2024
```
- Specify torch==2.3.1 and torchvision==0.18.1 for Windows CUDA installation
- Add torch and torchvision version constraints in setup.py:
  - torch>=2.2.2,<=2.3.1
  - torchvision>=0.17.2,<=0.18.1
- Update installation instructions in both English and Chinese README files
```
  9a96362d
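
The version bounds listed in commit 9a96362d can be written as standard requirement specifiers. The snippet below is a minimal sketch, not the project's actual setup.py: `INSTALL_REQUIRES`, `parse_version`, and `in_range` are hypothetical names used only for illustration, and the range check is a simplified stand-in for a real version parser.

```python
# Hypothetical excerpt; the real setup.py contains more entries than shown here.
INSTALL_REQUIRES = [
    "torch>=2.2.2,<=2.3.1",
    "torchvision>=0.17.2,<=0.18.1",
]


def parse_version(version: str) -> tuple:
    """Turn '2.3.1' into (2, 3, 1); a local tag such as '+cu118' is ignored."""
    return tuple(int(part) for part in version.split("+")[0].split("."))


def in_range(version: str, low: str, high: str) -> bool:
    """Check low <= version <= high using plain numeric tuple comparison."""
    return parse_version(low) <= parse_version(version) <= parse_version(high)
```

Under these bounds the Windows CUDA pins (torch 2.3.1, torchvision 0.18.1) sit at the upper edge of the allowed range, so `in_range("2.3.1", "2.2.2", "2.3.1")` holds while a 2.4.x release would be rejected.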
10 Dec, 2024 9 commits

Merge pull request #1252 from myhloli/dev · fdf15a45
Xiaomeng Zhao authored Dec 11, 2024
```
fix(detect_invalid_chars): fix the stack error caused by multiple memory releases in PyMuPDF
```
fdf15a45

build: enable pdfminer.six dependency · 023ed9c8

myhloli authored Dec 11, 2024

- Uncomment pdfminer.six in requirements.txt
- Specify version 20231228 for pdfminer.six

023ed9c8

refactor(model): update import paths for PaddleOCR modules · 061c03a0

myhloli authored Dec 11, 2024

- Change import paths from paddleocr.ppocr to ppocr for utility functions
- Update import paths for logging and utility modules in ppocr_273_mod.py
- Modify import paths for tablemaster_paddle.py to use ppstructure instead of paddleocr.ppstructure

061c03a0

refactor(magic_pdf): switch to pdfminer for invalid character detection · e1be7da6

myhloli authored Dec 11, 2024

- Replace MuPDF with pdfminer for detecting invalid characters in PDFs
- Uncomment and update the detect_invalid_chars function to use pdfminer
- Update the check_invalid_chars function in pdf_meta_scan.py to use the new implementation

e1be7da6
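
Commit e1be7da6 does not show how the pdfminer-based check works, so the following is only a sketch of one plausible approach. pdfminer emits `(cid:N)` placeholders for glyphs it cannot map to Unicode, and the share of such placeholders can serve as a signal for garbled text. The helper names `cid_ratio` and `has_invalid_chars` and the 0.05 threshold are assumptions, not taken from the repository.

```python
import re

# pdfminer writes "(cid:123)" when a glyph has no Unicode mapping.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def cid_ratio(text: str) -> float:
    """Fraction of characters that were extracted as (cid:N) placeholders."""
    cid_count = len(CID_PATTERN.findall(text))
    # Count each placeholder as a single character when measuring length.
    text_len = len(CID_PATTERN.sub("?", text))
    return cid_count / text_len if text_len else 0.0


def has_invalid_chars(pdf_path: str, threshold: float = 0.05) -> bool:
    """Assumed rule: flag the PDF when placeholders exceed the threshold."""
    # Imported lazily so cid_ratio works without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    text = extract_text(pdf_path).replace("\n", "")
    return cid_ratio(text) > threshold
```

Keeping the ratio computation separate from the extraction call makes the rule easy to test on plain strings without needing a PDF.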

refactor(tablemaster): update import paths for TableSystem and init_args · 01cd633d

myhloli authored Dec 11, 2024

- Change import path for TableSystem from 'ppstructure.table.predict_table' to 'paddleocr.ppstructure.table.predict_table'
- Change import path for init_args from 'ppstructure.utility' to 'paddleocr.ppstructure.utility'

01cd633d

refactor(magic_pdf): update paddleocr module import paths · 56fad23d

myhloli authored Dec 11, 2024

- Modify import paths for paddleocr utilities in ocr_utils.py and ppocr_273_mod.py
- Change from `ppocr.utils.utility` to `paddleocr.ppocr.utils.utility`
- Update related import statements in two files to reflect the new path

56fad23d
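
Commits 56fad23d, 01cd633d, and 061c03a0 move the same imports back and forth between the `paddleocr.ppocr...` and bare `ppocr...` layouts. As an illustration only (the repository picks one path per commit rather than doing this), a small fallback helper can tolerate either layout. `import_first_available` is a hypothetical name.

```python
import importlib
from types import ModuleType
from typing import Iterable


def import_first_available(candidates: Iterable[str]) -> ModuleType:
    """Return the first module in `candidates` that imports successfully."""
    names = list(candidates)
    last_error = None
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # also covers ModuleNotFoundError
            last_error = exc
    raise ImportError(f"none of {names} could be imported") from last_error


# Example with the two paths named in the commit messages:
# utility = import_first_available(
#     ["paddleocr.ppocr.utils.utility", "ppocr.utils.utility"]
# )
```

The trade-off is that a fallback hides which layout is actually installed, which may be why the commits pin a single path instead.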

refactor(magic_pdf): remove unnecessary comment · 52dfdd53

myhloli authored Dec 10, 2024

- Remove commented-out call to clean_memory() function
- This change simplifies the code by eliminating an unused code snippet

52dfdd53

fix(magic_pdf): disable PaddlePaddle signal handler · dd7f6781

myhloli authored Dec 10, 2024

- Import paddle module and disable its signal handler to prevent interference with other components
- This change addresses potential conflicts between PaddlePaddle and other libraries or system signals

dd7f6781
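
A minimal sketch of what commit dd7f6781 describes, assuming it relies on `paddle.disable_signal_handler()`, which PaddlePaddle provides for this purpose. The wrapper name `disable_paddle_signal_handler` is hypothetical, and the guard for a missing installation is an addition for illustration.

```python
def disable_paddle_signal_handler() -> bool:
    """Return True if Paddle was found and its signal handler was disabled."""
    try:
        import paddle
    except ImportError:
        # Paddle is not installed, so there is nothing to disable.
        return False
    # Stops Paddle from installing its own handlers for signals such as SIGSEGV.
    paddle.disable_signal_handler()
    return True
```

Calling this once at start-up, before other libraries register their own handlers, keeps Paddle from intercepting signals they expect to receive.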

refactor: comment out clean_memory function call · 2b6e9442

myhloli authored Dec 10, 2024

- Remove the call to clean_memory() function from pdf_parse_union_core_v2.py
- This change may affect memory usage and needs to be tested to ensure proper functionality

2b6e9442

09 Dec, 2024 10 commits
- Merge pull request #1239 from myhloli/dev · 8dbfea6d
  Xiaomeng Zhao authored Dec 09, 2024
```
docs(windows): update CUDA installation guide
```
  8dbfea6d
- docs(windows): update CUDA installation guide · ede7d361
  myhloli authored Dec 09, 2024
```
- Remove specific version requirements for torch and torchvision
- Simplify installation command in both English and Chinese guides
- Delete important note about version compatibility
```
  ede7d361
- Merge pull request #1238 from myhloli/dev · 4355b6e0
  Xiaomeng Zhao authored Dec 09, 2024
```
refactor(magic_pdf): optimize environment setup and dependencies
```
  4355b6e0
- refactor(magic_pdf): optimize environment setup and dependencies · a296ea41
  myhloli authored Dec 09, 2024
```
- Add environment variables to disable albumentations and yolo updates
- Import torchtext and disable deprecation warnings
- Update unimernet to 0.2.2
- Specify ultralytics version as >=8.3.48
- Remove upper version limit for torch
```
  a296ea41
- Merge pull request #1232 from myhloli/dev · 5c3bf21e
  Xiaomeng Zhao authored Dec 09, 2024
```
build(deps): update dependency versions
```
  5c3bf21e
- build(deps): update dependency versions · 2ae10394
  myhloli authored Dec 09, 2024
```
- Update ultralytics to >=8.3.47
```
  2ae10394
- Merge pull request #1231 from icecraft/fix/unicode_write · faad1664
  Xiaomeng Zhao authored Dec 09, 2024
```
fix: unicode decode error
```
  faad1664
- fix: unicode decode error · 11344890
  icecraft authored Dec 09, 2024
  
  11344890
- Merge pull request #1228 from icecraft/fix/pipe_result · c5a4150e
  Xiaomeng Zhao authored Dec 09, 2024
```
fix: add parse_pdf_type and version
```
  c5a4150e
- fix: add parse_pdf_type and version · 57f9f9dc
  icecraft authored Dec 09, 2024
  
  57f9f9dc
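
The environment setup described in commit a296ea41 above might look roughly like the sketch below. `NO_ALBUMENTATIONS_UPDATE` is the variable albumentations reads to skip its update check, and `disable_torchtext_deprecation_warning` exists in recent torchtext releases. The commit message does not name the variable used for the YOLO update check, so it is left out here. `configure_environment` is a hypothetical function name.

```python
import os


def configure_environment() -> None:
    """Apply start-up settings before the heavy libraries are imported."""
    # Ask albumentations to skip its online version check.
    os.environ["NO_ALBUMENTATIONS_UPDATE"] = "1"

    try:
        import torchtext
    except (ImportError, OSError):
        # torchtext is missing or failed to load its native extension.
        return

    # Only present in torchtext releases that ship the deprecation notice.
    silence = getattr(torchtext, "disable_torchtext_deprecation_warning", None)
    if callable(silence):
        silence()
```

The environment variable only has an effect if it is set before albumentations is first imported, so the call belongs at the very top of the entry point.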
07 Dec, 2024 4 commits
- Merge pull request #1224 from icecraft/fix/new_api · 8f266869
  Xiaomeng Zhao authored Dec 07, 2024
  
  8f266869
- fix: 1. ocr txt mode error 2. lose pdf_parse_type field · 87af738a
  sawmice authored Dec 07, 2024
  
  87af738a
- Merge pull request #1222 from myhloli/dev · f58a7a7d
  Xiaomeng Zhao authored Dec 07, 2024
```
fix(dict2md): add space for inline equations in CJK contexts
```
  f58a7a7d
- fix(dict2md): add space for inline equations in CJK contexts · 74ee428b
  myhloli authored Dec 07, 2024
```
- In Chinese, Japanese, and Korean (CJK) languages, no space is needed for line breaks within paragraphs.
- However, if an inline equation is at the end of a line, a space should be added to separate it from the following text.
- This change improves the formatting of documents containing both CJK text and inline equations.
```
  74ee428b
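
The rule described in commit 74ee428b can be sketched as follows. This is an illustration under assumed data shapes, not the actual dict2md code: each line is a list of `(span_type, content)` pairs, and the span type names `"text"` and `"inline_equation"` as well as the function name `join_cjk_lines` are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[str, str]  # (span_type, content); hypothetical shape


def join_cjk_lines(lines: List[List[Span]]) -> str:
    """Join the lines of a CJK paragraph into one string."""
    out = []
    for line in lines:
        for index, (span_type, content) in enumerate(line):
            at_line_end = index == len(line) - 1
            if span_type == "inline_equation":
                # An equation always gets a trailing space, even at a line
                # end, so it does not run into the text of the next line.
                out.append(f"${content}$ ")
            elif at_line_end:
                # CJK text needs no space where a line break used to be.
                out.append(content)
            else:
                # Assumption: text inside a line is separated from what follows.
                out.append(content + " ")
    return "".join(out).rstrip()
```

For example, a first line ending in the equation `E=mc^2` followed by two plain-text lines yields the equation separated by a space from the next line, while the two text lines are joined with nothing between them.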
06 Dec, 2024 5 commits

Merge pull request #1178 from icecraft/refactor/add_user_api · fa113b57
Xiaomeng Zhao authored Dec 06, 2024
```
Refactor/add user api
```
fa113b57

Merge pull request #1218 from myhloli/dev · 1c10dc55
Xiaomeng Zhao authored Dec 06, 2024
```
refactor(magic-pdf): optimize model initialization and concurrency control
```
1c10dc55

refactor(magic-pdf): optimize model initialization and concurrency control · 012a46e0

myhloli authored Dec 06, 2024

- Remove concurrency limit logic from app.py
- Update model initialization process in various modules
- Remove unused VRAM check for concurrency limit
- Refactor OCR model initialization in pdf_extract_kit.py
- Update txt_spans_extract_v2 function to use lang parameter instead of ocr_model

012a46e0

Merge pull request #1215 from myhloli/dev · ef5cffcb
Xiaomeng Zhao authored Dec 06, 2024
```
refactor(ocr): replace AtomModelSingleton with ocr_model_init for OCR model instantiation
```
ef5cffcb

refactor(ocr): replace AtomModelSingleton with ocr_model_init for OCR model instantiation · 47a83d28

myhloli authored Dec 06, 2024

- Remove usage of AtomModelSingleton for OCR model creation
- Add ocr_model_init function to initialize OCR model
- Update OCR model initialization in pdf_extract_kit.py and pdf_parse_union_core_v2.py
- Modify txt_spans_extract_v2 function to accept ocr_model as a parameter
- Update parse_page_core function to use ocr_model instead of lang for OCR processing

47a83d28