Commits · cece8f53759170bc486cc0b12f75ca652b2743ff · wangsen / MinerU

11 Dec, 2024 7 commits
- feat: support ms-office and images file in command line tools · cece8f53
  xu rui authored Dec 11, 2024
  
  cece8f53
- docs: add quick_start example · 7dc3b0a9
  xu rui authored Dec 10, 2024
  
  7dc3b0a9
- fix: not create empty directory · 1d32722f
  xu rui authored Dec 10, 2024
  
  1d32722f
- feat: support convert ppt/pptx/doc/docx · f6af67eb
  xu rui authored Dec 10, 2024
  
  f6af67eb
- fix: read_api list files · f3ceebc4
  xu rui authored Dec 10, 2024
  
  f3ceebc4
- docs: rewrite install and usage docs · 6ca86bea
  xu rui authored Dec 09, 2024
  
  6ca86bea
- fix: dup classify pdf type · 4e7511fb
  icecraft authored Dec 11, 2024
  
  4e7511fb
10 Dec, 2024 7 commits

refactor(model): update import paths for PaddleOCR modules · 061c03a0

myhloli authored Dec 11, 2024

- Change import paths from paddleocr.ppocr to ppocr for utility functions
- Update import paths for logging and utility modules in ppocr_273_mod.py
- Modify import paths for tablemaster_paddle.py to use ppstructure instead of paddleocr.ppstructure

061c03a0

refactor(magic_pdf): switch to pdfminer for invalid character detection · e1be7da6

myhloli authored Dec 11, 2024

- Replace MuPDF with pdfminer for detecting invalid characters in PDFs
- Uncomment and update the detect_invalid_chars function to use pdfminer
- Update the check_invalid_chars function in pdf_meta_scan.py to use the new implementation

e1be7da6

refactor(tablemaster): update import paths for TableSystem and init_args · 01cd633d

myhloli authored Dec 11, 2024

- Change import path for TableSystem from 'ppstructure.table.predict_table' to 'paddleocr.ppstructure.table.predict_table'
- Change import path for init_args from 'ppstructure.utility' to 'paddleocr.ppstructure.utility'

01cd633d

refactor(magic_pdf): update paddleocr module import paths · 56fad23d

myhloli authored Dec 11, 2024

- Modify import paths for paddleocr utilities in ocr_utils.py and ppocr_273_mod.py
- Change from `ppocr.utils.utility` to `paddleocr.ppocr.utils.utility`
- Update related import statements in two files to reflect the new path

56fad23d

refactor(magic_pdf): remove unnecessary comment · 52dfdd53

myhloli authored Dec 10, 2024

- Remove commented-out call to clean_memory() function
- This change simplifies the code by eliminating an unused code snippet

52dfdd53

fix(magic_pdf): disable PaddlePaddle signal handler · dd7f6781

myhloli authored Dec 10, 2024

- Import paddle module and disable its signal handler to prevent interference with other components
- This change addresses potential conflicts between PaddlePaddle and other libraries or system signals

dd7f6781

refactor: comment out clean_memory function call · 2b6e9442

myhloli authored Dec 10, 2024

- Remove the call to clean_memory() function from pdf_parse_union_core_v2.py
- This change may affect memory usage and needs to be tested to ensure proper functionality

2b6e9442

09 Dec, 2024 3 commits
- refactor(magic_pdf): optimize environment setup and dependencies · a296ea41
  myhloli authored Dec 09, 2024
```
- Add environment variables to disable albumentations and yolo updates
- Import torchtext and disable deprecation warnings
- Update unimernet to 0.2.2
- Specify ultralytics version as >=8.3.48
- Remove upper version limit for torch
```
  a296ea41
- fix: unicode decode error · 11344890
  icecraft authored Dec 09, 2024
  
  11344890
- fix: add parse_pdf_type and version · 57f9f9dc
  icecraft authored Dec 09, 2024
  
  57f9f9dc
07 Dec, 2024 2 commits

fix: 1. ocr txt mode error 2. lose pdf_parse_type field · 87af738a
sawmice authored Dec 07, 2024

87af738a

fix(dict2md): add space for inline equations in CJK contexts · 74ee428b

myhloli authored Dec 07, 2024

- In Chinese, Japanese, and Korean (CJK) languages, no space is needed for line breaks within paragraphs.
- However, if an inline equation is at the end of a line, a space should be added to separate it from the following text.
- This change improves the formatting of documents containing both CJK text and inline equations.

74ee428b
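
The spacing rule described in 74ee428b can be sketched as below. This is a minimal illustration, not the project's dict2md code: the `join_lines` helper, the `(kind, content)` span shape, and the character ranges used to detect CJK text are all assumptions made for the example.

```python
import re

# Rough CJK detection (assumed ranges): CJK ideographs, kana, Hangul syllables.
_CJK = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")


def is_cjk(text: str) -> bool:
    return bool(_CJK.search(text))


def join_lines(lines):
    """Join the lines of one paragraph into a single string.

    Each line is a non-empty list of (kind, content) spans, where kind is
    "text" or "inline_equation" (hypothetical shape for this sketch).
    """
    out = []
    for i, line in enumerate(lines):
        for kind, content in line:
            out.append(f"${content}$" if kind == "inline_equation" else content)
        if i == len(lines) - 1:
            continue  # nothing follows the last line
        last_kind, last_content = line[-1]
        if last_kind == "inline_equation":
            # An equation at the end of a line always gets a separating space.
            out.append(" ")
        elif not is_cjk(last_content):
            # Non-CJK text needs a space where the line break used to be.
            out.append(" ")
        # CJK text ending a line is joined to the next line with no space.
    return "".join(out)
```

With this sketch, two CJK text lines are concatenated directly, while a line ending in an inline equation is followed by a single space before the next line's text.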

06 Dec, 2024 10 commits

refactor(magic-pdf): optimize model initialization and concurrency control · 012a46e0

myhloli authored Dec 06, 2024

- Remove concurrency limit logic from app.py
- Update model initialization process in various modules
- Remove unused VRAM check for concurrency limit
- Refactor OCR model initialization in pdf_extract_kit.py
- Update txt_spans_extract_v2 function to use lang parameter instead of ocr_model

012a46e0

refactor(ocr): replace AtomModelSingleton with ocr_model_init for OCR model instantiation · 47a83d28

myhloli authored Dec 06, 2024

- Remove usage of AtomModelSingleton for OCR model creation
- Add ocr_model_init function to initialize OCR model
- Update OCR model initialization in pdf_extract_kit.py and pdf_parse_union_core_v2.py
- Modify txt_spans_extract_v2 function to accept ocr_model as a parameter
- Update parse_page_core function to use ocr_model instead of lang for OCR processing

47a83d28

refactor(model): implement thread-safe OCR model initialization · f2a92d57

myhloli authored Dec 06, 2024

- Add threading support for OCR model initialization
- Modify AtomModelSingleton to handle thread-specific instances
- Update PDFExtractKit and PDFParseUnionCoreV2 to use new thread-safe OCR initialization

f2a92d57

refactor(magic_pdf): remove unused threading lock and model initialization code · a1744b77

myhloli authored Dec 06, 2024

- Remove threading.Lock import and usage
- Delete unused model initialization comments and code
- Simplify OCR model initialization in both pdf_extract_kit.py and pdf_parse_union_core_v2.py

a1744b77

refactor(magic_pdf): replace AtomModelSingleton with ocr_model_init for OCR model instantiation · 30220233

myhloli authored Dec 06, 2024

- Remove usage of AtomModelSingleton for OCR model initialization
- Use ocr_model_init function for creating OCR model instance
- Update import statement to include ocr_model_init
- Comment out old OCR model initialization code

30220233

refactor(model): replace AtomModelSingleton with ocr_model_init for OCR model initialization · 488660dd

myhloli authored Dec 06, 2024

- Remove usage of AtomModelSingleton for OCR model initialization
- Add import of ocr_model_init from model_init module
- Update OCR model initialization process to use ocr_model_init function
- Remove lock for OCR processing as it's no longer needed

488660dd

refactor(model): replace ModelSingleton with direct model initialization and improve threading · 6f636b6e

myhloli authored Dec 06, 2024

- Remove usage of ModelSingleton class
- Initialize model directly using custom_model_init function
- Add self._lock attribute to PDFExtractKit class for thread safety
- Replace local lock with self._lock for OCR processing

6f636b6e

fix(model): simplify model initialization logic · a9723c61
myhloli authored Dec 06, 2024

a9723c61

refactor(magic_pdf): optimize model initialization and threading · 878f3de0

赵小蒙 authored Dec 06, 2024

- Remove unnecessary threading.Lock in AtomModelSingleton
- Add threading.Lock to CustomPEKModel for OCR processing
- Simplify model initialization logic in AtomModelSingleton

878f3de0

perf(model): optimize model initialization · ce592f8b

myhloli authored Dec 06, 2024

- Add condition to return existing model if already initialized
- Improve efficiency by avoiding redundant model creation

ce592f8b

05 Dec, 2024 1 commit

perf(model): add threading lock for OCR model initialization · 04478095

myhloli authored Dec 05, 2024

- Introduce a lock to synchronize access to OCR model initialization
- This change improves thread safety when multiple threads access the OCR model concurrently
- The lock ensures that the OCR model is initialized only once, even in multi-threaded scenarios

04478095
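
The once-only initialization described in 04478095 (and the "return existing model if already initialized" check from ce592f8b) can be approximated with a lock-guarded lazy holder. This is a generic sketch of the pattern, not the code in the commit: the `LazyModel` class and its `factory` argument are hypothetical names introduced for the example.

```python
import threading


class LazyModel:
    """Build an expensive object at most once, even under concurrent access."""

    def __init__(self, factory):
        self._factory = factory      # hypothetical: callable that builds the model
        self._lock = threading.Lock()
        self._model = None

    def get(self):
        # Fast path: return the existing model without taking the lock.
        if self._model is None:
            with self._lock:
                # Re-check inside the lock; another thread may have built it.
                if self._model is None:
                    self._model = self._factory()
        return self._model
```

Every thread that calls `get()` receives the same instance, and `factory` runs only once. The later commits on 06 Dec move away from this shared-singleton approach, so the sketch reflects only the state described here.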

03 Dec, 2024 7 commits
- fix(vram): improve VRAM checking logic · 104273cc
  myhloli authored Dec 03, 2024
```
- Update VRAM checking logic in app.py and model_utils.py
- Add None and type checks for VRAM values
- Adjust concurrency limit calculation in app.py
- Modify clean_vram function to handle cases with no VRAM information
```
  104273cc
- feat: add zh_cn docs · 11994506
  xu rui authored Dec 02, 2024
  
  11994506
- docs: add dataset method description · f6bd47de
  xu rui authored Dec 02, 2024
  
  f6bd47de
- refactor: add docs · d44e7a28
  xu rui authored Nov 29, 2024
  
  d44e7a28
- feat: add function definitions · 4a82d6a0
  icecraft authored Nov 28, 2024
  
  4a82d6a0
- refactor: isolate inference and pipeline · a3a720ea
  icecraft authored Nov 27, 2024
  
  a3a720ea
- feat(gradio_app): implement dynamic concurrency limit based on VRAM · b1fe9d4f
  myhloli authored Dec 03, 2024
```
- Add get_concurrency_limit function to calculate concurrency limit based on VRAM
- Update clean_vram function and rename to get_vram for better clarity
- Apply concurrency limit to the to_markdown function in the Gradio app
```
  b1fe9d4f
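
One way to derive a concurrency limit from VRAM, as b1fe9d4f describes, while tolerating missing or malformed VRAM values, as 104273cc adds, is sketched below. The function name `concurrency_limit_for`, the 8 GB-per-worker budget, and the cap of 4 workers are assumptions for illustration; the commits do not state the actual thresholds.

```python
from typing import Optional


def concurrency_limit_for(
    vram_gb: Optional[float],
    gb_per_worker: float = 8.0,  # assumed budget per concurrent job
    max_workers: int = 4,        # assumed upper bound
) -> int:
    """Return how many jobs may run at once for the given amount of VRAM."""
    # Fall back to one worker when VRAM is unknown, non-numeric, or non-positive.
    if isinstance(vram_gb, bool) or not isinstance(vram_gb, (int, float)):
        return 1
    if vram_gb <= 0:
        return 1
    # At least one worker, at most max_workers.
    return max(1, min(max_workers, int(vram_gb // gb_per_worker)))
```

The result could then be passed to whatever mechanism limits parallel requests, for example the `concurrency_limit` argument that Gradio 4 event listeners accept. Whether the app wires it up this way is not stated in the commit text.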
02 Dec, 2024 3 commits
- Update version.py with new version · b9f3435c
  myhloli authored Dec 02, 2024
  
  b9f3435c
- fix: reduce maximum image size · b0529b6f
  myhloli authored Dec 02, 2024
```
- Decrease the maximum width and height from 9000 to 4500 pixels
- This change aims to prevent excessive resource usage when rendering PDFs
```
  b0529b6f
- fix(pre_proc): prevent errors when imageWriter is None · 7f8dc353
  myhloli authored Dec 02, 2024
```
- Updated cut_image.py to check for NoneType imageWriter
- Prevents AttributeError when imageWriter is not provided
```
  7f8dc353
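
The guard described in 7f8dc353 amounts to checking the writer before using it. The sketch below shows the idea with hypothetical names (`save_crop`, `MemoryWriter`, and a `write(path, data)` method); it is not the code in cut_image.py.

```python
def save_crop(image_bytes, path, image_writer=None):
    """Write a cropped image if a writer is available; otherwise skip it."""
    if image_writer is None:
        # No writer supplied: avoid the AttributeError and report no output file.
        return None
    image_writer.write(path, image_bytes)
    return path


class MemoryWriter:
    """Stand-in writer for the example; keeps written bytes in a dict."""

    def __init__(self):
        self.files = {}

    def write(self, path, data):
        self.files[path] = data
```

Calling `save_crop(data, "a.jpg")` without a writer returns `None` instead of raising, and passing a `MemoryWriter` stores the bytes under the given path.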