Commits · c660fdc8f0be3f7932616b5e757ec00a7543a430 · wangsen / MinerU

24 Dec, 2024 1 commit

feat(llm): add LLM-aided formula and text correction · c660fdc8

myhloli authored Dec 24, 2024

- Add LLM-aided formula and text correction functionality
- Update config reader to include LLM-aided settings
- Create new LLM-aided processing module
- Update main processing script to incorporate LLM-aided corrections
- Modify download scripts to check for new config version

c660fdc8

13 Dec, 2024 1 commit

fix(pdf): improve ligature handling and text extraction · c638fc5d

myhloli authored Dec 13, 2024

- Move ligature replacement function to pdf_parse_union_core_v2.py
- Optimize ligature replacement using a more efficient approach
- Modify text extraction flags to preserve ligatures in PDF content
- Remove unnecessary function from ocr_mkcontent.py

c638fc5d
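
The "more efficient approach" to ligature replacement is not spelled out in the commit message. One plausible reading is a single-pass translation table instead of repeated `str.replace` calls. The sketch below assumes that reading; `replace_ligatures` and `_LIGATURES` are hypothetical names, not the identifiers used in pdf_parse_union_core_v2.py.

```python
# Hypothetical sketch: map common Unicode ligature code points to their
# plain-letter expansions in one pass over the string.
_LIGATURES = str.maketrans({
    "\ufb00": "ff",   # LATIN SMALL LIGATURE FF
    "\ufb01": "fi",   # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",   # LATIN SMALL LIGATURE FL
    "\ufb03": "ffi",  # LATIN SMALL LIGATURE FFI
    "\ufb04": "ffl",  # LATIN SMALL LIGATURE FFL
    "\ufb06": "st",   # LATIN SMALL LIGATURE ST
})


def replace_ligatures(text: str) -> str:
    """Return text with the ligatures above expanded to plain letters."""
    return text.translate(_LIGATURES)
```

A translation table is built once at import time, so each call costs a single scan of the input regardless of how many ligatures are listed.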

12 Dec, 2024 1 commit

perf(layout): optimize layout detection for PDF extraction · 6a75d7dc

myhloli authored Dec 12, 2024

- Add initial setup for layout detection
- Implement conditional cropping for tall images
- Skip cropping for wide images to improve performance
- Reuse Image object across layout detection steps

6a75d7dc

11 Dec, 2024 14 commits
- fix: classif pdf type · 712d7d4a
  xu rui authored Dec 11, 2024
  
  712d7d4a
- Update version.py with new version · 391a9986
  myhloli authored Dec 11, 2024
  
  391a9986
- refactor(draw_bbox): remove redundant '_line_sort' suffix from output filename · ef78819a
  myhloli authored Dec 11, 2024
```
- Updated the filename generation logic in the draw_bbox function
- Removed the unnecessary '_line_sort' suffix from the output PDF filename
```
  ef78819a
- refactor(magic_pdf): remove unused import in pdf_parse_union_core_v2.py · 9efc35ec
  myhloli authored Dec 11, 2024
```
- Remove unused import of ocr_model_init from magic_pdf.model.sub_modules.model_init
- Keep existing functionality and structure intact
```
  9efc35ec
- feat(layout): improve layout detection for DocLayout_YOLO model · f5d812b3
  myhloli authored Dec 11, 2024
```
- Implement image cropping and pasting technique to enhance layout detection
- Adjust detected polygons to original image coordinates
- Add comments for better code readability
```
  f5d812b3
- feat: remove pipe_auto_mode · 302a6950
  xu rui authored Dec 11, 2024
  
  302a6950
- docs: check links in doc · b04867f9
  xu rui authored Dec 11, 2024
  
  b04867f9
- feat: support ms-office and images file in command line tools · cece8f53
  xu rui authored Dec 11, 2024
  
  cece8f53
- docs: add quick_start example · 7dc3b0a9
  xu rui authored Dec 10, 2024
  
  7dc3b0a9
- fix: not create empty directory · 1d32722f
  xu rui authored Dec 10, 2024
  
  1d32722f
- feat: support convert ppt/pptx/doc/docx · f6af67eb
  xu rui authored Dec 10, 2024
  
  f6af67eb
- fix: read_api list files · f3ceebc4
  xu rui authored Dec 10, 2024
  
  f3ceebc4
- docs: rewrite install and usage docs · 6ca86bea
  xu rui authored Dec 09, 2024
  
  6ca86bea
- fix: dup classify pdf type · 4e7511fb
  icecraft authored Dec 11, 2024
  
  4e7511fb
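
Commit f5d812b3 above mentions adjusting detected polygons back to original image coordinates after cropping. The commit does not show how this is done; a minimal sketch, assuming the crop is a simple axis-aligned window whose top-left corner is known, is to add that corner's offset to every vertex. `to_original_coords`, `crop_left`, and `crop_top` are hypothetical names.

```python
def to_original_coords(polygon, crop_left, crop_top):
    """Shift polygon vertices from crop space back to full-image space.

    polygon   -- iterable of (x, y) points detected inside the cropped image
    crop_left -- x coordinate of the crop's left edge in the original image
    crop_top  -- y coordinate of the crop's top edge in the original image
    """
    return [(x + crop_left, y + crop_top) for x, y in polygon]
```

For example, a box detected at `[(0, 0), (10, 5)]` inside a crop that starts at `(100, 200)` maps to `[(100, 200), (110, 205)]` in the original page image.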
10 Dec, 2024 7 commits

refactor(model): update import paths for PaddleOCR modules · 061c03a0

myhloli authored Dec 11, 2024

- Change import paths from paddleocr.ppocr to ppocr for utility functions
- Update import paths for logging and utility modules in ppocr_273_mod.py
- Modify import paths for tablemaster_paddle.py to use ppstructure instead of paddleocr.ppstructure

061c03a0

refactor(magic_pdf): switch to pdfminer for invalid character detection · e1be7da6

myhloli authored Dec 11, 2024

- Replace MuPDF with pdfminer for detecting invalid characters in PDFs
- Uncomment and update the detect_invalid_chars function to use pdfminer
- Update the check_invalid_chars function in pdf_meta_scan.py to use the new implementation

e1be7da6

refactor(tablemaster): update import paths for TableSystem and init_args · 01cd633d

myhloli authored Dec 11, 2024

- Change import path for TableSystem from 'ppstructure.table.predict_table' to 'paddleocr.ppstructure.table.predict_table'
- Change import path for init_args from 'ppstructure.utility' to 'paddleocr.ppstructure.utility'

01cd633d

refactor(magic_pdf): update paddleocr module import paths · 56fad23d

myhloli authored Dec 11, 2024

- Modify import paths for paddleocr utilities in ocr_utils.py and ppocr_273_mod.py
- Change from `ppocr.utils.utility` to `paddleocr.ppocr.utils.utility`
- Update related import statements in two files to reflect the new path

56fad23d
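
The three import-path commits above switch back and forth between `ppocr.*` and `paddleocr.ppocr.*`. The commits themselves simply rewrite the import statements. As an illustration only, code that has to tolerate both layouts could try each candidate path in turn; `import_first` is a hypothetical helper and is not part of MinerU or PaddleOCR.

```python
import importlib


def import_first(candidates):
    """Import and return the first module in candidates that is available."""
    last_error = None
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # also covers ModuleNotFoundError
            last_error = exc
    raise ImportError(f"none of {list(candidates)} could be imported") from last_error


# Possible usage with the two paths named in the commits (requires PaddleOCR):
# utility = import_first(["ppocr.utils.utility", "paddleocr.ppocr.utils.utility"])
```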

refactor(magic_pdf): remove unnecessary comment · 52dfdd53

myhloli authored Dec 10, 2024

- Remove commented-out call to clean_memory() function
- This change simplifies the code by eliminating an unused code snippet

52dfdd53

fix(magic_pdf): disable PaddlePaddle signal handler · dd7f6781

myhloli authored Dec 10, 2024

- Import paddle module and disable its signal handler to prevent interference with other components
- This change addresses potential conflicts between PaddlePaddle and other libraries or system signals

dd7f6781
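
The commit message does not quote the call it adds. Paddle exposes `paddle.disable_signal_handler()` for this purpose, so the sketch below assumes that is the call in question. The wrapper function and its guarded import are illustrative additions and are not taken from the commit.

```python
def disable_paddle_signal_handler() -> bool:
    """Turn off Paddle's own signal handlers if Paddle is installed.

    Returns True when the handler was disabled, False when Paddle is absent.
    """
    try:
        import paddle
    except ImportError:
        return False
    # Stops Paddle from intercepting signals such as SIGSEGV, which can
    # otherwise interfere with other libraries in the same process.
    paddle.disable_signal_handler()
    return True
```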

refactor: comment out clean_memory function call · 2b6e9442

myhloli authored Dec 10, 2024

- Remove the call to clean_memory() function from pdf_parse_union_core_v2.py
- This change may affect memory usage and needs to be tested to ensure proper functionality

2b6e9442

09 Dec, 2024 3 commits
- refactor(magic_pdf): optimize environment setup and dependencies · a296ea41
  myhloli authored Dec 09, 2024
```
- Add environment variables to disable albumentations and yolo updates
- Import torchtext and disable deprecation warnings
- Update unimernet to 0.2.2
- Specify ultralytics version as >=8.3.48
- Remove upper version limit for torch
```
  a296ea41
- fix: unicode decode error · 11344890
  icecraft authored Dec 09, 2024
  
  11344890
- fix: add parse_pdf_type and version · 57f9f9dc
  icecraft authored Dec 09, 2024
  
  57f9f9dc
07 Dec, 2024 2 commits

fix: 1. ocr txt mode error 2. lose pdf_parse_type field · 87af738a
sawmice authored Dec 07, 2024

87af738a

fix(dict2md): add space for inline equations in CJK contexts · 74ee428b

myhloli authored Dec 07, 2024

- In Chinese, Japanese, and Korean (CJK) languages, no space is needed for line breaks within paragraphs.
- However, if an inline equation is at the end of a line, a space should be added to separate it from the following text.
- This change improves the formatting of documents containing both CJK text and inline equations.

74ee428b
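
The rule described in 74ee428b can be sketched as follows, assuming inline equations are delimited by single `$` characters and that paragraph lines arrive as a list of strings. `join_cjk_lines` is a hypothetical name; the real logic lives in dict2md and may differ.

```python
import re

# Matches an inline equation such as "$E=mc^2$" sitting at the end of a line.
_INLINE_EQ_AT_END = re.compile(r"\$[^$]+\$$")


def join_cjk_lines(lines):
    """Join the lines of a CJK paragraph without spaces, except after a
    line that ends with an inline equation, where one space is inserted."""
    parts = []
    for index, raw in enumerate(lines):
        line = raw.strip()
        parts.append(line)
        is_last = index == len(lines) - 1
        if not is_last and _INLINE_EQ_AT_END.search(line):
            parts.append(" ")
    return "".join(parts)
```

With this sketch, `["你好", "世界"]` joins to `你好世界`, while a first line ending in `$E=mc^2$` is followed by a single space before the next line's text.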

06 Dec, 2024 10 commits

refactor(magic-pdf): optimize model initialization and concurrency control · 012a46e0

myhloli authored Dec 06, 2024

- Remove concurrency limit logic from app.py
- Update model initialization process in various modules
- Remove unused VRAM check for concurrency limit
- Refactor OCR model initialization in pdf_extract_kit.py
- Update txt_spans_extract_v2 function to use lang parameter instead of ocr_model

012a46e0

refactor(ocr): replace AtomModelSingleton with ocr_model_init for OCR model instantiation · 47a83d28

myhloli authored Dec 06, 2024

- Remove usage of AtomModelSingleton for OCR model creation
- Add ocr_model_init function to initialize OCR model
- Update OCR model initialization in pdf_extract_kit.py and pdf_parse_union_core_v2.py
- Modify txt_spans_extract_v2 function to accept ocr_model as a parameter
- Update parse_page_core function to use ocr_model instead of lang for OCR processing

47a83d28

refactor(model): implement thread-safe OCR model initialization · f2a92d57

myhloli authored Dec 06, 2024

- Add threading support for OCR model initialization
- Modify AtomModelSingleton to handle thread-specific instances
- Update PDFExtractKit and PDFParseUnionCoreV2 to use new thread-safe OCR initialization

f2a92d57
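
"Thread-specific instances" in f2a92d57 suggests one model object per thread. The commit does not show the mechanism; a minimal sketch using the standard library's `threading.local` is given below. `ThreadLocalModels` and `factory` are hypothetical names, not the actual AtomModelSingleton interface.

```python
import threading


class ThreadLocalModels:
    """Create one model per thread on first use and reuse it afterwards."""

    def __init__(self, factory):
        self._factory = factory          # callable that builds a new model
        self._local = threading.local()  # attribute storage private to each thread

    def get(self):
        model = getattr(self._local, "model", None)
        if model is None:
            model = self._factory()
            self._local.model = model
        return model
```

Because each thread only ever touches its own instance, no lock is needed around inference, at the cost of holding one model copy per thread in memory.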

refactor(magic_pdf): remove unused threading lock and model initialization code · a1744b77

myhloli authored Dec 06, 2024

- Remove threading.Lock import and usage
- Delete unused model initialization comments and code
- Simplify OCR model initialization in both pdf_extract_kit.py and pdf_parse_union_core_v2.py

a1744b77

refactor(magic_pdf): replace AtomModelSingleton with ocr_model_init for OCR model instantiation · 30220233

myhloli authored Dec 06, 2024

- Remove usage of AtomModelSingleton for OCR model initialization
- Use ocr_model_init function for creating OCR model instance
- Update import statement to include ocr_model_init
- Comment out old OCR model initialization code

30220233

refactor(model): replace AtomModelSingleton with ocr_model_init for OCR model initialization · 488660dd

myhloli authored Dec 06, 2024

- Remove usage of AtomModelSingleton for OCR model initialization
- Add import of ocr_model_init from model_init module
- Update OCR model initialization process to use ocr_model_init function
- Remove lock for OCR processing as it's no longer needed

488660dd

refactor(model): replace ModelSingleton with direct model initialization and improve threading · 6f636b6e

myhloli authored Dec 06, 2024

- Remove usage of ModelSingleton class
- Initialize model directly using custom_model_init function
- Add self._lock attribute to PDFExtractKit class for thread safety
- Replace local lock with self._lock for OCR processing

6f636b6e

fix(model): simplify model initialization logic · a9723c61
myhloli authored Dec 06, 2024

a9723c61

refactor(magic_pdf): optimize model initialization and threading · 878f3de0

赵小蒙 authored Dec 06, 2024

- Remove unnecessary threading.Lock in AtomModelSingleton
- Add threading.Lock to CustomPEKModel for OCR processing
- Simplify model initialization logic in AtomModelSingleton

878f3de0

perf(model): optimize model initialization · ce592f8b

myhloli authored Dec 06, 2024

- Add condition to return existing model if already initialized
- Improve efficiency by avoiding redundant model creation

ce592f8b

05 Dec, 2024 1 commit

perf(model): add threading lock for OCR model initialization · 04478095

myhloli authored Dec 05, 2024

- Introduce a lock to synchronize access to OCR model initialization
- This change improves thread safety when multiple threads access the OCR model concurrently
- The lock ensures that the OCR model is initialized only once, even in multi-threaded scenarios

04478095
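
The once-only initialization described in 04478095 is commonly written as double-checked locking. The sketch below shows that pattern under the assumption of a single shared model; `get_ocr_model`, `_model`, and `_model_lock` are hypothetical names and the commit's actual code may be structured differently.

```python
import threading

_model = None
_model_lock = threading.Lock()


def get_ocr_model(factory):
    """Return the shared OCR model, building it at most once."""
    global _model
    if _model is None:              # fast path: skip the lock once initialized
        with _model_lock:
            if _model is None:      # re-check: another thread may have won the race
                _model = factory()
    return _model
```

The early return for an already-initialized model is the same idea as the check added in ce592f8b; the lock only matters during the first concurrent calls.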