Commits · c5a4150e82d411925cdeded793487f54f8ec4e61 · wangsen / MinerU

09 Dec, 2024 1 commit
- fix: add parse_pdf_type and version · 57f9f9dc
  icecraft authored Dec 09, 2024
  
  57f9f9dc
07 Dec, 2024 1 commit
- fix: 1. OCR txt mode error 2. lost pdf_parse_type field · 87af738a
  sawmice authored Dec 07, 2024
  
  87af738a
06 Dec, 2024 9 commits

refactor(magic-pdf): optimize model initialization and concurrency control · 012a46e0

myhloli authored Dec 06, 2024

- Remove concurrency limit logic from app.py
- Update model initialization process in various modules
- Remove unused VRAM check for concurrency limit
- Refactor OCR model initialization in pdf_extract_kit.py
- Update txt_spans_extract_v2 function to use lang parameter instead of ocr_model

012a46e0

refactor(ocr): replace AtomModelSingleton with ocr_model_init for OCR model instantiation · 47a83d28

myhloli authored Dec 06, 2024

- Remove usage of AtomModelSingleton for OCR model creation
- Add ocr_model_init function to initialize OCR model
- Update OCR model initialization in pdf_extract_kit.py and pdf_parse_union_core_v2.py
- Modify txt_spans_extract_v2 function to accept ocr_model as a parameter
- Update parse_page_core function to use ocr_model instead of lang for OCR processing

47a83d28

refactor(model): implement thread-safe OCR model initialization · f2a92d57

myhloli authored Dec 06, 2024

- Add threading support for OCR model initialization
- Modify AtomModelSingleton to handle thread-specific instances
- Update PDFExtractKit and PDFParseUnionCoreV2 to use new thread-safe OCR initialization

f2a92d57

refactor(magic_pdf): remove unused threading lock and model initialization code · a1744b77

myhloli authored Dec 06, 2024

- Remove threading.Lock import and usage
- Delete unused model initialization comments and code
- Simplify OCR model initialization in both pdf_extract_kit.py and pdf_parse_union_core_v2.py

a1744b77

refactor(model): replace AtomModelSingleton with ocr_model_init for OCR model initialization · 488660dd

myhloli authored Dec 06, 2024

- Remove usage of AtomModelSingleton for OCR model initialization
- Add import of ocr_model_init from model_init module
- Update OCR model initialization process to use ocr_model_init function
- Remove lock for OCR processing as it's no longer needed

488660dd

refactor(model): replace ModelSingleton with direct model initialization and improve threading · 6f636b6e

myhloli authored Dec 06, 2024

- Remove usage of ModelSingleton class
- Initialize model directly using custom_model_init function
- Add self._lock attribute to PDFExtractKit class for thread safety
- Replace local lock with self._lock for OCR processing

6f636b6e

fix(model): simplify model initialization logic · a9723c61
myhloli authored Dec 06, 2024

a9723c61

refactor(magic_pdf): optimize model initialization and threading · 878f3de0

赵小蒙 authored Dec 06, 2024

- Remove unnecessary threading.Lock in AtomModelSingleton
- Add threading.Lock to CustomPEKModel for OCR processing
- Simplify model initialization logic in AtomModelSingleton

878f3de0

perf(model): optimize model initialization · ce592f8b

myhloli authored Dec 06, 2024

- Add condition to return existing model if already initialized
- Improve efficiency by avoiding redundant model creation

ce592f8b

05 Dec, 2024 1 commit

perf(model): add threading lock for OCR model initialization · 04478095

myhloli authored Dec 05, 2024

- Introduce a lock to synchronize access to OCR model initialization
- This change improves thread safety when multiple threads access the OCR model concurrently
- The lock ensures that the OCR model is initialized only once, even in multi-threaded scenarios

04478095
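
The pattern this commit describes, a lock that lets the OCR model be initialized only once even when several threads request it, can be sketched roughly as below. This is a minimal illustration and not the MinerU code: `get_ocr_model`, `build_ocr_model`, `demo`, and the module-level cache are hypothetical names, and the builder is a stand-in that returns a plain object instead of a real OCR model.

```python
import threading

_ocr_model = None                    # hypothetical module-level cache
_ocr_model_lock = threading.Lock()   # guards the first initialization


def build_ocr_model():
    """Stand-in for an expensive OCR model constructor (hypothetical)."""
    return object()


def get_ocr_model():
    """Return the shared model, building it at most once across threads."""
    global _ocr_model
    if _ocr_model is None:               # fast path, no lock needed
        with _ocr_model_lock:
            if _ocr_model is None:       # re-check after acquiring the lock
                _ocr_model = build_ocr_model()
    return _ocr_model


def demo(num_threads=8):
    """Request the model from several threads and collect what each one got."""
    results = []
    threads = [
        threading.Thread(target=lambda: results.append(get_ocr_model()))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Every thread in `demo()` should receive the same object, because the second check inside the lock prevents a late thread from building a second instance.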

03 Dec, 2024 5 commits

fix(vram): improve VRAM checking logic · 104273cc

myhloli authored Dec 03, 2024

- Update VRAM checking logic in app.py and model_utils.py
- Add None and type checks for VRAM values
- Adjust concurrency limit calculation in app.py
- Modify clean_vram function to handle cases with no VRAM information

104273cc

refactor: add docs · d44e7a28
xu rui authored Nov 29, 2024

d44e7a28
feat: add function definitions · 4a82d6a0
icecraft authored Nov 28, 2024

4a82d6a0
refactor: isolate inference and pipeline · a3a720ea
icecraft authored Nov 27, 2024

a3a720ea

feat(gradio_app): implement dynamic concurrency limit based on VRAM · b1fe9d4f

myhloli authored Dec 03, 2024

- Add get_concurrency_limit function to calculate concurrency limit based on VRAM
- Update clean_vram function and rename to get_vram for better clarity
- Apply concurrency limit to the to_markdown function in the Gradio app

b1fe9d4f
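
A concurrency limit derived from VRAM could look roughly like the sketch below. The commit does not state the formula, so everything here is an assumption: the function takes the total VRAM in GB as an argument rather than querying the device, and `vram_per_task_gb` with its default of 8.0 is a hypothetical per-task budget, not a value taken from the project.

```python
def get_concurrency_limit(total_vram_gb, vram_per_task_gb=8.0):
    """Estimate how many tasks may run at once for a given amount of VRAM.

    Both parameters are illustrative assumptions; the real project may
    use a different budget and a different way of reading VRAM.
    """
    # Fall back to a single task when VRAM is unknown or not a number.
    if isinstance(total_vram_gb, bool) or not isinstance(total_vram_gb, (int, float)):
        return 1
    if total_vram_gb <= 0 or vram_per_task_gb <= 0:
        return 1
    # One task per full budget of VRAM, but never fewer than one.
    return max(1, int(total_vram_gb // vram_per_task_gb))
```

With these assumed defaults, a 24 GB device would allow three concurrent tasks, while a 6 GB device or an unknown VRAM value falls back to one.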

29 Nov, 2024 1 commit
- refactor(ocr): Fix the error of paddleocr failing to initialize in a multi-threaded environment · 7f2f2c0f
  myhloli authored Nov 29, 2024
  
  7f2f2c0f
28 Nov, 2024 1 commit
- fix(lite_model): Adapt Lite Mode to the Hybrid OCR Mode in Version 0.10 · 9b4d77dc
  myhloli authored Nov 28, 2024
  
  9b4d77dc
27 Nov, 2024 2 commits

refactor(ocr): remove unused functions and optimize OCR processing loop · 5f4410b4

myhloli authored Nov 27, 2024

- Remove unused function `calculate_angle_degrees`
- Refactor `calculate_is_angle` to use directly in OCR processing
- Eliminate unnecessary loop index `idx` in OCR processing loops

5f4410b4

refactor(libs): remove unused imports and functions · 2db3c263

myhloli authored Nov 27, 2024

- Remove unused imports from commons.py
- Delete unused functions related to AWS and S3 operations
- Update import statements in other modules to reflect changes in commons.py
- Remove redundant code and improve code readability

2db3c263

26 Nov, 2024 2 commits

perf(image_processing): reduce maximum image size for analysis · b3644157

myhloli authored Nov 26, 2024

- Decrease the maximum image size threshold from 9000 to 4500 pixels
- This change aims to improve performance and reduce memory usage
- Affects the custom model document analysis process

b3644157

feat(ocr): filter out low confidence ocr results · eb45a0e8

myhloli authored Nov 26, 2024

- Add confidence score threshold to filter out low confidence OCR results
- Improve OCR accuracy by ignoring less certain detections

eb45a0e8
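
Filtering by a confidence threshold can be sketched as follows. The result layout (a list of `(text, score)` pairs), the function name `filter_low_confidence`, and the default threshold of 0.5 are assumptions made for illustration; the commit does not say which threshold or data structure the project uses.

```python
def filter_low_confidence(ocr_results, min_score=0.5):
    """Keep only OCR results whose confidence is at least min_score.

    ocr_results is assumed to be a list of (text, score) pairs.
    """
    return [(text, score) for text, score in ocr_results if score >= min_score]


sample = [("Invoice", 0.98), ("l|I1", 0.21), ("Total", 0.87)]
kept = filter_low_confidence(sample, min_score=0.5)
```

In this sample the low-scoring `"l|I1"` entry is dropped and the other two results are kept in their original order.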

24 Nov, 2024 2 commits
- fix: remove unused file · e9ace3eb
  icecraft authored Nov 24, 2024
  
  e9ace3eb
- fix: rewrite projects/ and demos with new data api · b1adde8e
  icecraft authored Nov 24, 2024
  
  b1adde8e
22 Nov, 2024 2 commits

refactor(model): move page total time logging to custom model analysis · f1e2f084

myhloli authored Nov 22, 2024

- Move page total time logging to doc_analyze_by_custom_model.py
- Remove page total time logging from pdf_extract_kit.py
- Add page_start timing variable to custom model analysis
- Update logger output format for page total time

f1e2f084

fix(table): add null check for OCR result in rapid table prediction · 18aa1a20

myhloli authored Nov 22, 2024

- Add a null check for OCR result in the predict method
- Return None values if OCR result is None to prevent further processing

18aa1a20

21 Nov, 2024 2 commits

refactor(txt_parse): improve text extraction accuracy with new algorithm · 309be741

myhloli authored Nov 21, 2024

- Implement new text extraction method (txt_spans_extract_v2) to enhance accuracy
- Add character filling in spans for better text reconstruction
- Introduce empty span handling using OCR for missed text
- Optimize span filtering and overlap removal

309be741

feat(ocr): improve text detection and OCR accuracy · b2e37a2d

myhloli authored Nov 21, 2024

- Update OCR utils to handle different box formats and improve angle calculation
- Modify PDF extraction kit to support OCR option and optimize processing flow
- Enhance PPOCR model to sort and filter detection boxes, improving text splitting accuracy

b2e37a2d

19 Nov, 2024 1 commit
- refactor: move some constants or enums defs to config folder · b492c19c
  icecraft authored Nov 19, 2024
  
  b492c19c
18 Nov, 2024 2 commits

feat(ocr): improve handling of angled text boxes · 4fd966eb

myhloli authored Nov 18, 2024

- Add calculate_is_angle function to detect angled text boxes
- Update update_det_boxes and merge_det_boxes functions to handle angled text boxes
- Modify angle detection logic in various parts of the code

4fd966eb

fix: replace old rw api with new data api · 6a481320
icecraft authored Nov 18, 2024

6a481320

15 Nov, 2024 1 commit
- refactor(model): rename and restructure model modules · 08f46125
  myhloli authored Nov 15, 2024
  
  08f46125
08 Nov, 2024 2 commits

feat(table): add RapidOCR support for RapidTable model · fe2c2c0d

myhloli authored Nov 09, 2024

- Integrate RapidOCR with RapidTable model for table recognition
- Improve memory management for devices with <= 8GB VRAM
- Update table recognition process to use RapidOCR for RapidTable
- Add rapidocr-paddle dependency in setup.py

fe2c2c0d

feat(table): integrate RapidTable model for table recognition · 240fe99e

myhloli authored Nov 08, 2024

- Add RapidTable model support for table recognition
- Update table model configuration and initialization
- Modify table recognition process to use RapidTable when specified
- Add RapidTable dependency to setup.py

240fe99e

07 Nov, 2024 1 commit

feat(model): add xycut algorithm for block sorting · 7d5850e3

myhloli authored Nov 08, 2024

- Implement xycut algorithm to sort blocks when layoutreader fails
- Add recursive_xy_cut function to perform the xycut algorithm
- Update pdf_parse_union_core_v2.py to use xycut when layoutreader fails
- Modify draw_bbox.py to handle cases where layoutreader fails to sort blocks

7d5850e3
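
For readers unfamiliar with XY-cut, the general idea is to split the page along whitespace gaps, alternating between horizontal and vertical cuts, and to read the resulting regions in order. The sketch below is a simplified pure-Python version of that idea. It is not the project's `recursive_xy_cut`; the name `xy_cut_order`, the helper `_split_by_gaps`, and the `(x0, y0, x1, y1)` box layout are assumptions made for illustration.

```python
def _split_by_gaps(indices, boxes, lo, hi):
    """Group box indices into runs separated by empty gaps along one axis.

    lo/hi are the tuple positions of the interval start and end
    (1 and 3 for the y axis, 0 and 2 for the x axis).
    """
    ordered = sorted(indices, key=lambda i: boxes[i][lo])
    groups, current, reach = [], [], None
    for i in ordered:
        if current and boxes[i][lo] >= reach:   # gap found: close the run
            groups.append(current)
            current, reach = [], None
        current.append(i)
        end = boxes[i][hi]
        reach = end if reach is None else max(reach, end)
    if current:
        groups.append(current)
    return groups


def xy_cut_order(boxes, indices=None, cut_y=True):
    """Return box indices in reading order using recursive XY cuts."""
    if indices is None:
        indices = list(range(len(boxes)))
    if len(indices) <= 1:
        return list(indices)
    lo, hi = (1, 3) if cut_y else (0, 2)
    groups = _split_by_gaps(indices, boxes, lo, hi)
    if len(groups) == 1:
        # No gap on this axis: try the other axis before giving up.
        cut_y = not cut_y
        lo, hi = (1, 3) if cut_y else (0, 2)
        groups = _split_by_gaps(indices, boxes, lo, hi)
        if len(groups) == 1:
            # Boxes overlap on both axes: fall back to top-to-bottom,
            # left-to-right ordering.
            return sorted(indices, key=lambda i: (boxes[i][1], boxes[i][0]))
    order = []
    for group in groups:
        order.extend(xy_cut_order(boxes, group, not cut_y))
    return order


# A title spanning the page, followed by a two-column body (shuffled input).
page = [
    (55, 20, 100, 50),   # 0: right column, first block
    (0, 45, 45, 80),     # 1: left column, second block
    (0, 0, 100, 10),     # 2: title
    (55, 55, 100, 80),   # 3: right column, second block
    (0, 20, 45, 40),     # 4: left column, first block
]
reading_order = xy_cut_order(page)
```

On this sample page the title comes first, then the left column top to bottom, then the right column, which is the ordering a plain top-to-bottom sort would get wrong.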

06 Nov, 2024 1 commit

refactor(model): remove unused code and simplify OCR model initialization · 4b0f1176

myhloli authored Nov 06, 2024

- Remove unused code for copying detection and recognition models
- Simplify OCR model initialization using atom_model_manager
- Delete unnecessary comments and empty lines

4b0f1176

05 Nov, 2024 1 commit

fix(table): improve table image processing · 401dfa4e

myhloli authored Nov 05, 2024

- Replace np.array with np.asarray for better performance
- Add image color conversion from RGB to BGR using OpenCV

401dfa4e

04 Nov, 2024 2 commits

feat(model): add HTML minification to StructTableModel · b5117e72

myhloli authored Nov 04, 2024

- Import 're' module for regular expression operations
- Implement HTML minification for 'output_format=html'
- Add 'minify_html' method to remove unnecessary whitespace and format HTML

b5117e72
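
Regex-based HTML minification of the kind this commit describes might look like the sketch below. It is an assumption about the general approach, not the project's `minify_html` method: the exact patterns used in StructTableModel are not given in the commit message, and this version is written as a standalone function.

```python
import re


def minify_html(html):
    """Collapse whitespace in an HTML fragment (illustrative sketch)."""
    html = re.sub(r"\s+", " ", html)      # collapse runs of whitespace
    html = re.sub(r">\s+<", "><", html)   # drop whitespace between tags
    return html.strip()


table = "<table>\n  <tr>\n    <td> a  b </td>\n  </tr>\n</table>\n"
compact = minify_html(table)
```

Whitespace inside a cell is collapsed to single spaces but not removed, so the visible text of the table is preserved.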

refactor(model): comment out unused code in ppTableModel · 5ee02a99

myhloli authored Nov 04, 2024

- Comment out an unused code block in the ppTableModel.py file
- Improve code readability and maintainability by removing unnecessary code

5ee02a99