Commits · 1c10dc55064c04f765a46739d3ca1bea8aee4da4 · wangsen / MinerU

06 Dec, 2024 10 commits

refactor(magic-pdf): optimize model initialization and concurrency control · 012a46e0

myhloli authored Dec 06, 2024

- Remove concurrency limit logic from app.py
- Update model initialization process in various modules
- Remove unused VRAM check for concurrency limit
- Refactor OCR model initialization in pdf_extract_kit.py
- Update txt_spans_extract_v2 function to use lang parameter instead of ocr_model

012a46e0

refactor(ocr): replace AtomModelSingleton with ocr_model_init for OCR model instantiation · 47a83d28

myhloli authored Dec 06, 2024

- Remove usage of AtomModelSingleton for OCR model creation
- Add ocr_model_init function to initialize OCR model
- Update OCR model initialization in pdf_extract_kit.py and pdf_parse_union_core_v2.py
- Modify txt_spans_extract_v2 function to accept ocr_model as a parameter
- Update parse_page_core function to use ocr_model instead of lang for OCR processing

47a83d28

refactor(model): implement thread-safe OCR model initialization · f2a92d57

myhloli authored Dec 06, 2024

- Add threading support for OCR model initialization
- Modify AtomModelSingleton to handle thread-specific instances
- Update PDFExtractKit and PDFParseUnionCoreV2 to use new thread-safe OCR initialization

f2a92d57

refactor(magic_pdf): remove unused threading lock and model initialization code · a1744b77

myhloli authored Dec 06, 2024

- Remove threading.Lock import and usage
- Delete unused model initialization comments and code
- Simplify OCR model initialization in both pdf_extract_kit.py and pdf_parse_union_core_v2.py

a1744b77

refactor(magic_pdf): replace AtomModelSingleton with ocr_model_init for OCR model instantiation · 30220233

myhloli authored Dec 06, 2024

- Remove usage of AtomModelSingleton for OCR model initialization
- Use ocr_model_init function for creating OCR model instance
- Update import statement to include ocr_model_init
- Comment out old OCR model initialization code

30220233

refactor(model): replace AtomModelSingleton with ocr_model_init for OCR model initialization · 488660dd

myhloli authored Dec 06, 2024

- Remove usage of AtomModelSingleton for OCR model initialization
- Add import of ocr_model_init from model_init module
- Update OCR model initialization process to use ocr_model_init function
- Remove lock for OCR processing as it's no longer needed

488660dd

refactor(model): replace ModelSingleton with direct model initialization and improve threading · 6f636b6e

myhloli authored Dec 06, 2024

- Remove usage of ModelSingleton class
- Initialize model directly using custom_model_init function
- Add self._lock attribute to PDFExtractKit class for thread safety
- Replace local lock with self._lock for OCR processing

6f636b6e

fix(model): simplify model initialization logic · a9723c61
myhloli authored Dec 06, 2024

a9723c61

refactor(magic_pdf): optimize model initialization and threading · 878f3de0

赵小蒙 authored Dec 06, 2024

- Remove unnecessary threading.Lock in AtomModelSingleton
- Add threading.Lock to CustomPEKModel for OCR processing
- Simplify model initialization logic in AtomModelSingleton

878f3de0

perf(model): optimize model initialization · ce592f8b

myhloli authored Dec 06, 2024

- Add condition to return existing model if already initialized
- Improve efficiency by avoiding redundant model creation

ce592f8b

05 Dec, 2024 1 commit

perf(model): add threading lock for OCR model initialization · 04478095

myhloli authored Dec 05, 2024

- Introduce a lock to synchronize access to OCR model initialization
- This change improves thread safety when multiple threads access the OCR model concurrently
- The lock ensures that the OCR model is initialized only once, even in multi-threaded scenarios

04478095

03 Dec, 2024 2 commits

fix(vram): improve VRAM checking logic · 104273cc

myhloli authored Dec 03, 2024

- Update VRAM checking logic in app.py and model_utils.py
- Add None and type checks for VRAM values
- Adjust concurrency limit calculation in app.py
- Modify clean_vram function to handle cases with no VRAM information

104273cc

feat(gradio_app): implement dynamic concurrency limit based on VRAM · b1fe9d4f

myhloli authored Dec 03, 2024

- Add get_concurrency_limit function to calculate concurrency limit based on VRAM
- Update clean_vram function and rename to get_vram for better clarity
- Apply concurrency limit to the to_markdown function in the Gradio app

b1fe9d4f
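
A rough sketch of how a VRAM-based concurrency limit could be computed. Only the function name `get_concurrency_limit` comes from the commit message; the signature, the `gb_per_worker` and `max_workers` parameters, and their default values are assumptions made for illustration.

```python
from typing import Optional


def get_concurrency_limit(vram_gb: Optional[float],
                          gb_per_worker: float = 8.0,   # assumed budget per job
                          max_workers: int = 4) -> int:  # assumed upper bound
    """Map available VRAM (in GB) to a number of concurrent jobs.

    Falls back to 1 when VRAM is unknown or not a positive number.
    """
    if not isinstance(vram_gb, (int, float)) or vram_gb <= 0:
        return 1
    workers = int(vram_gb // gb_per_worker)
    return max(1, min(max_workers, workers))
```

In a Gradio app, a value like this would typically be passed as the `concurrency_limit` argument of the event listener that triggers the conversion.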

02 Dec, 2024 3 commits

Update version.py with new version · b9f3435c
myhloli authored Dec 02, 2024

b9f3435c

fix: reduce maximum image size · b0529b6f

myhloli authored Dec 02, 2024

- Decrease the maximum width and height from 9000 to 4500 pixels
- This change aims to prevent excessive resource usage when rendering PDFs

b0529b6f

fix(pre_proc): prevent errors when imageWriter is None · 7f8dc353

myhloli authored Dec 02, 2024

- Updated cut_image.py to check for NoneType imageWriter
- Prevents AttributeError when imageWriter is not provided

7f8dc353

30 Nov, 2024 1 commit

refactor(para): adjust line height multiplier for block splitting · 41545a13

myhloli authored Dec 01, 2024

- Decrease the line height multiplier from 0.8 to 0.7 for both left and right sides
- This modification aims to improve the accuracy of paragraph splitting

41545a13

29 Nov, 2024 8 commits

Update version.py with new version · f8828be7
myhloli authored Nov 29, 2024

f8828be7

refactor: modify bbox processing for layout separation · b3127233

myhloli authored Nov 30, 2024

- Remove overlap between bboxes for block separation
- Sort bboxes by combined x and y coordinates for better layout handling
- Comment out previous overlap removal function

b3127233

refactor(mkcontent): optimize paragraph text merging and language detection · b80befe9

myhloli authored Nov 30, 2024

- Extract language detection to block level instead of line level
- Improve logic for handling Chinese, Japanese, and Korean languages
- Refactor code for better readability and performance
- Optimize handling of hyphenated words at line ends

b80befe9

feat(ocr_mkcontent): add language detection for line spacing · c8cabb3c

myhloli authored Nov 30, 2024

- Introduce language detection to determine line spacing based on language context
- Implement different spacing rules for Chinese/Japanese/Korean and Western texts
- Adjust span content handling based on detected language and span type

c8cabb3c
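
The language-dependent spacing rule described in this commit can be illustrated as follows. This is a simplified sketch under stated assumptions: the helper names `is_cjk_text` and `join_lines` are hypothetical, and detection here is a plain Unicode-range majority check rather than whatever language detector the project uses.

```python
import re

# Hiragana/Katakana, CJK ideographs (incl. Extension A), Hangul syllables
_CJK = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]")


def is_cjk_text(text: str) -> bool:
    """True when more than half of the non-space characters are CJK."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    cjk = sum(1 for c in chars if _CJK.match(c))
    return cjk * 2 > len(chars)


def join_lines(lines):
    """Merge lines: no separator for CJK text, a single space otherwise."""
    stripped = [ln.strip() for ln in lines if ln.strip()]
    sep = "" if is_cjk_text("".join(stripped)) else " "
    return sep.join(stripped)
```

Chinese, Japanese, and Korean text does not use spaces between words, so inserting one at every line break would corrupt the merged paragraph, while Western text needs the space to keep words apart.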

Update version.py with new version · d19911f1
myhloli authored Nov 29, 2024

d19911f1

refactor(ocr): Fix the error of paddleocr failing to initialize in a multi-threaded environment · 7f2f2c0f
myhloli authored Nov 29, 2024

7f2f2c0f

refactor(pdf_parse): adjust character-axis alignment algorithm · d4345b6e

myhloli authored Nov 29, 2024

- Introduce `span_height_radio` parameter to calculate_char_in_span function
- Replace fixed ratio with dynamic ratio for character and span axis alignment
- Improve flexibility and accuracy of character placement within spans

d4345b6e

fix(ocr_mkcontent): handle empty paragraphs on pages · 782e6571

myhloli authored Nov 29, 2024

- Add empty paragraph handling for pages with no content
- Append an empty markdown object when a page has no paragraphs
- Increment page number even if no content is present

782e6571

28 Nov, 2024 7 commits

feat(pdf_parse): add line start flag detection and optimize line stop flag logic · 949d0867

myhloli authored Nov 28, 2024

- Add LINE_START_FLAG tuple to identify starting flags of a line
- Modify calculate_char_in_span function to handle both line start and stop flags
- Remove redundant char_is_line_stop_flag variable and simplify logic
- Improve line flag detection to enhance text extraction accuracy

949d0867

refactor(pdf_check): improve character detection using PyMuPDF · ac888156

myhloli authored Nov 28, 2024

- Replace pdfminer with PyMuPDF for character detection
- Implement new method detect_invalid_chars_by_pymupdf
- Update check_invalid_chars in pdf_meta_scan.py to use new method
- Add __replace_0xfffd function in pdf_parse_union_core_v2.py to handle special characters
- Remove unused imports and update requirements.txt

ac888156

refactor(ocr): improve text processing and span handling · 88c0854a

myhloli authored Nov 28, 2024

- Remove unused language detection code
- Simplify text content processing logic
- Update span sorting and text extraction in pdf_parse_union_core_v2.py

88c0854a

feat(pdf_parse): filter out skewed text lines · 37da8c44

myhloli authored Nov 28, 2024

- Add direction filtering to ignore highly skewed text lines
- Improve text extraction accuracy by focusing on non-skewed content

37da8c44
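
One way to filter skewed lines, sketched under assumptions: PyMuPDF's `page.get_text("dict")` output gives each line a `dir` entry, a `(cos, sin)` writing-direction vector, and the function below keeps only lines whose direction is horizontal or vertical. The helper names and the tolerance value are hypothetical, and the commit may apply a different criterion.

```python
def is_axis_aligned(direction, tol=1e-3):
    """A unit direction vector is axis-aligned when one component is ~0."""
    dx, dy = direction
    return min(abs(dx), abs(dy)) <= tol


def drop_skewed_lines(lines, tol=1e-3):
    """Keep only line dicts whose 'dir' vector is horizontal or vertical."""
    return [ln for ln in lines if is_axis_aligned(ln["dir"], tol)]


# Example input shaped like PyMuPDF line dicts (only 'dir' is used here)
sample = [
    {"dir": (1.0, 0.0)},         # normal horizontal text
    {"dir": (0.7071, 0.7071)},   # rotated 45 degrees, e.g. a watermark
    {"dir": (0.0, -1.0)},        # vertical text
]
kept = drop_skewed_lines(sample)
```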

refactor(para): improve language detection and block splitting · f674b8d4

myhloli authored Nov 28, 2024

- Add language detection for each block of text
- Implement language-specific logic for right margin alignment
- Introduce logging for debugging purposes

f674b8d4

fix(Hybrid OCR): Enable Hybrid OCR for Empty Spans That Contain a Certain Number of Placeholders but No Actual Text · 08392d63
myhloli authored Nov 28, 2024

08392d63

fix(lite_model): Adapt lite Mode to the Hybrid OCR Mode in Version 0.10 · 9b4d77dc
myhloli authored Nov 28, 2024

9b4d77dc

27 Nov, 2024 5 commits

Update version.py with new version · 52ef1bc7
myhloli authored Nov 27, 2024

52ef1bc7

refactor(pdf_parse_union_core_v2): optimize page processing time logging · 1d2eb70a
myhloli authored Nov 27, 2024

1d2eb70a

refactor(ocr): remove unused functions and optimize OCR processing loop · 5f4410b4

myhloli authored Nov 27, 2024

- Remove unused function `calculate_angle_degrees`
- Refactor `calculate_is_angle` to use directly in OCR processing
- Eliminate unnecessary loop index `idx` in OCR processing loops

5f4410b4

refactor(pre_proc): clean up OCR processing code · a46b12e9

myhloli authored Nov 27, 2024

- Remove commented-out code in ocr_dict_merge.py
- Improve imports and code organization in ocr_detect_all_bboxes.py
- Delete unnecessary empty lines and improve code readability

a46b12e9

refactor(libs): remove unused imports and functions · 2db3c263

myhloli authored Nov 27, 2024

- Remove unused imports from commons.py
- Delete unused functions related to AWS and S3 operations
- Update import statements in other modules to reflect changes in commons.py
- Remove redundant code and improve code readability

2db3c263

26 Nov, 2024 3 commits

perf(image_processing): reduce maximum image size for analysis · b3644157

myhloli authored Nov 26, 2024

- Decrease the maximum image size threshold from 9000 to 4500 pixels
- This change aims to improve performance and reduce memory usage
- Affects the custom model document analysis process

b3644157

refactor: remove deprecated markdown_utils function · ce202d92
myhloli authored Nov 26, 2024

ce202d92

refactor(pre_proc): remove unused functions and simplify code · 21fa7819

myhloli authored Nov 26, 2024

- Remove unused imports and functions across multiple files
- Simplify code by deleting unnecessary comments and empty lines
- Update function signatures to match actual usage
- Replace redundant code with more efficient alternatives

21fa7819