- 28 Aug, 2024 3 commits
-
-
Xiaomeng Zhao authored
Previously, small blocks that overlapped with larger ones were merely removed. This fix changes the approach to merge smaller blocks into the larger block instead, ensuring that no information is lost and the larger block encompasses all the text content fully.
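A minimal sketch of the merge-instead-of-remove idea described in this commit. The function names, the block layout (`bbox` plus `lines`), and the overlap threshold are illustrative assumptions, not the project's actual code.

```python
def _area(box):
    """Area of an (x0, y0, x1, y1) box; zero for degenerate boxes."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def overlap_ratio_of_smaller(a, b):
    """Intersection area divided by the area of the smaller box."""
    inter = _area((max(a[0], b[0]), max(a[1], b[1]),
                   min(a[2], b[2]), min(a[3], b[3])))
    smaller = min(_area(a), _area(b))
    return inter / smaller if smaller > 0 else 0.0


def merge_overlapping_blocks(blocks, threshold=0.8):
    """Fold small blocks into the larger block they overlap (hypothetical helper).

    `blocks` is a list of dicts with a 'bbox' tuple and a 'lines' list.
    The threshold value is an assumption chosen for illustration.
    """
    # Visit the largest blocks first so each small block meets its host.
    ordered = sorted(blocks, key=lambda b: _area(b["bbox"]), reverse=True)
    kept = []
    for block in ordered:
        for big in kept:
            if overlap_ratio_of_smaller(block["bbox"], big["bbox"]) >= threshold:
                bx, sx = big["bbox"], block["bbox"]
                # Grow the larger block so it covers the smaller one entirely,
                # and keep the smaller block's text instead of dropping it.
                big["bbox"] = (min(bx[0], sx[0]), min(bx[1], sx[1]),
                               max(bx[2], sx[2]), max(bx[3], sx[3]))
                big["lines"].extend(block["lines"])
                break
        else:
            # Copy so the caller's input is left untouched.
            kept.append({"bbox": tuple(block["bbox"]),
                         "lines": list(block["lines"])})
    return kept
```

Sorting by area first means a small block is always compared against blocks at least as large as itself, so content only ever flows into the larger block.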
-
Xiaomeng Zhao authored
Reduce the span threshold used in fill_spans_in_blocks from 0.6 to 0.3 to improve the accuracy of block filling based on layout analysis.
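To illustrate what lowering the threshold from 0.6 to 0.3 can change, here is a small sketch that assigns a span to a block when the fraction of the span's area inside the block exceeds the threshold. `fill_spans_sketch` and its overlap measure are assumptions for illustration; the real `fill_spans_in_blocks` may compute overlap differently.

```python
def span_fraction_inside(span, block):
    """Fraction of the span's area that lies inside the block (boxes are x0, y0, x1, y1)."""
    w = min(span[2], block[2]) - max(span[0], block[0])
    h = min(span[3], block[3]) - max(span[1], block[1])
    span_area = (span[2] - span[0]) * (span[3] - span[1])
    if w <= 0 or h <= 0 or span_area <= 0:
        return 0.0
    return (w * h) / span_area


def fill_spans_sketch(blocks, spans, threshold=0.3):
    """Assign each span to the first block it overlaps by more than `threshold`.

    Returns (assigned, leftover): a dict of block index -> spans, and the
    spans that matched no block.
    """
    assigned = {i: [] for i in range(len(blocks))}
    leftover = []
    for span in spans:
        for i, block in enumerate(blocks):
            if span_fraction_inside(span, block) > threshold:
                assigned[i].append(span)
                break
        else:
            leftover.append(span)
    return assigned, leftover
```

In this sketch, a span that sits half inside a block (fraction 0.5) is left out at a threshold of 0.6 but is picked up at 0.3.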
-
icecraft authored
Co-authored-by: icecraft <xurui1@pjlab.org.cn>
-
- 20 Aug, 2024 6 commits
-
-
Xiaomeng Zhao authored
* fix(ocr_mkcontent): revise table caption output
  - Ensure that table captions are properly included in the output.
  - Remove the redundant `table_caption` variable.
* Update cla.yml
* Update bug_report.yml
* feat(cli): add debug option for detailed error handling
  Enable users to invoke the CLI command with a new debug flag to get detailed debugging information.
* fix(pdf-extract-kit): adjust crop_paste parameters for better accuracy
  The crop_paste_x and crop_paste_y values in pdf_extract_kit.py have been modified to improve the accuracy and consistency of OCR processing. The new values are set to 25 to ensure more precise image cropping and pasting, which leads to better OCR recognition results.
* Update README_zh-CN.md (#404)
  correct FAQ url
* Update README_zh-CN.md (#404) (#409) (#410)
  correct FAQ url
  Co-authored-by: sfk <18810651050@163.com>
* Update FAQ_zh_cn.md
  add new issue
* Update FAQ_en_us.md
* Update README_Windows_CUDA_Acceleration_zh_CN.md
* Update README_zh-CN.md
* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418
* fix(pdf-extract-kit): increase crop_paste margin for OCR processing
  Double the crop_paste margin from 25 to 50 to ensure better OCR accuracy and handling of border cases. This change will help improve the overall quality of OCR'ed text by providing more context around the detected text areas.
* fix(common): deep copy model list before drawing model bbox
  Use a deep copy of the original model list in `drow_model_bbox` to avoid potential modifications to the source data. This ensures the integrity of the original models is maintained while generating the model bounding boxes visualization.
---------
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
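The deep-copy fix in the last item above can be sketched as follows. The helper name and the `layout_dets`/`bbox` layout are assumptions for illustration; only the use of `copy.deepcopy` reflects what the commit message describes.

```python
import copy


def prepare_boxes_for_drawing(model_list):
    """Return drawable boxes without touching the caller's data (hypothetical helper)."""
    # Work on a deep copy: nested dicts and lists are duplicated too, so the
    # in-place edits below cannot leak back into the original model output.
    working = copy.deepcopy(model_list)
    for page in working:
        for det in page["layout_dets"]:
            # Example of an in-place edit a drawing routine might make.
            det["bbox"] = [round(v) for v in det["bbox"]]
    return working
```

A shallow copy (`list(model_list)`) would not be enough here, because the inner dictionaries would still be shared with the original.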
-
Kaiwen Liu authored
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
-
icecraft authored
* feat: rename the file generated by command line tools
* feat: add pdf filename as prefix to {span,layout,model}.pdf
---------
Co-authored-by: icecraft <tmortred@gmail.com>
Co-authored-by: icecraft <xurui1@pjlab.org.cn>
-
Xiaomeng Zhao authored
Tuned the detection box threshold parameter in the OCR model initialization to improve the accuracy of text extraction from images. The threshold was modified from 0.6 to 0.3 to filter out smaller detection boxes, which is expected to enhance the quality of the extracted text by reducing noise and false positives in the OCR process.
-
Xiaomeng Zhao authored
Merge adjacent and overlapping detection boxes to optimize text region detection in the document. Post processing of text boxes is enhanced by consolidating them into larger text lines, taking into account their vertical and horizontal alignment. This improvement reduces fragmentation and improves the readability of detected text blocks.
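A rough sketch of merging detection boxes into text lines based on vertical overlap and horizontal gap. The function names and both tolerances are assumptions for illustration and are not taken from the project.

```python
def same_line(a, b, min_v_overlap=0.5, max_gap=10):
    """True if two (x0, y0, x1, y1) boxes plausibly belong to one text line."""
    v_overlap = min(a[3], b[3]) - max(a[1], b[1])
    min_height = min(a[3] - a[1], b[3] - b[1])
    if min_height <= 0 or v_overlap / min_height < min_v_overlap:
        return False
    # Horizontal gap between the boxes; negative when they overlap.
    gap = max(a[0], b[0]) - min(a[2], b[2])
    return gap <= max_gap


def merge_boxes_into_lines(boxes, min_v_overlap=0.5, max_gap=10):
    """Greedily merge adjacent or overlapping boxes into larger line boxes."""
    lines = []
    # Process top to bottom, then left to right.
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        for i, line in enumerate(lines):
            if same_line(line, box, min_v_overlap, max_gap):
                # Replace the line with the union of the line and the new box.
                lines[i] = (min(line[0], box[0]), min(line[1], box[1]),
                            max(line[2], box[2]), max(line[3], box[3]))
                break
        else:
            lines.append(tuple(box))
    return lines
```

This is a single greedy pass, so two already-built lines that only become mergeable after later growth are not re-checked; a production version would likely iterate until nothing changes.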
-
Xiaomeng Zhao authored
Optimize the language detection logic to enhance content formatting. This change addresses issues with long word segmentation. Language detection now uses a threshold to determine the language of a text based on the proportion of English characters. Formatting rules for content have been updated to consider a list of languages (initially including Chinese, Japanese, and Korean) where no space is added between content segments for inline equations and text spans, improving the handling of Asian languages. The impact of these changes includes improved accuracy in language detection, better segmentation of long words, and more appropriate spacing in content formatting for multiple languages.
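The two ideas in this commit, a threshold on the proportion of English characters and a list of languages that take no space between segments, can be sketched like this. All names, the language codes, and the 0.5 threshold are illustrative assumptions rather than the project's implementation.

```python
import string

ASCII_LETTERS = set(string.ascii_letters)

# Languages whose text is written without spaces between segments
# (assumed codes for Chinese, Japanese, and Korean).
NO_SPACE_LANGS = {"zh", "ja", "ko"}


def english_ratio(text):
    """Share of non-whitespace characters that are ASCII letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in ASCII_LETTERS for c in chars) / len(chars)


def is_mostly_english(text, threshold=0.5):
    """Treat the text as English when enough of its characters are ASCII letters."""
    return english_ratio(text) >= threshold


def join_segments(segments, lang):
    """Join text spans and inline equations, adding spaces only where the language uses them."""
    separator = "" if lang in NO_SPACE_LANGS else " "
    return separator.join(segments)
```

For example, `join_segments(["where", "$x$", "holds"], "en")` keeps the spaces, while the same call with `"zh"` and Chinese segments concatenates them directly.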
-
- 09 Aug, 2024 6 commits
-
-
myhloli authored
-
myhloli authored
Implement the feature to draw bounding boxes for model elements in the PDF. This includes adding new drawing functions and modifying existing ones to accommodate the new feature. Also, updates are made to CLI tools and common utilities to support the model bbox drawing.
-
myhloli authored
Previously, images were not enlarged if their width or height exceeded 3000 pixels. This threshold has been increased to 9000 pixels to better handle high-resolution scans and improve the analysis of documents with larger dimensions.
-
myhloli authored
fix(pdf-extract-kit): ensure table extraction success with additional ending condition
Add an additional condition to determine the success of table extraction by checking if the latex_code ends with 'end{table}'. This extends the validation to cover table environments that may not strictly end with 'end{tabular}', thus improving the robustness of table recognition processing.
-
myhloli authored
refactor(pdf_extract_kit): optimize image processing and table recognition logic
Refactor the image processing logic for OCR and table recognition to ensure consistency and improve performance. Remove redundant initialization of PIL images, unify image cropping logic, and streamline the handling of formula detection results. Also, adjust the table recognition process to improve integration with the updated image processing logic and enhance overall efficiency.
-
icecraft authored
Co-authored-by: shenguanlin <shenguanlin@pjlab.org.cn>
-
- 07 Aug, 2024 2 commits
- 06 Aug, 2024 1 commit
-
-
myhloli authored
-
- 05 Aug, 2024 2 commits
- 04 Aug, 2024 3 commits
-
-
myhloli authored
-
myhloli authored
If 'table-config' is not present in the configuration file, the table recognition feature will default to being disabled to ensure consistent behavior. This change adds a warning log and sets a default configuration for table recognition when the expected config is missing.
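A minimal sketch of the fallback behaviour described above: warn and return a disabled default when `table-config` is absent. The helper name and the `enabled` key inside the default are hypothetical; only the `table-config` key comes from the commit message.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical default: table recognition switched off.
DEFAULT_TABLE_CONFIG = {"enabled": False}


def get_table_config(config):
    """Return the table recognition settings, falling back to a disabled default."""
    table_config = config.get("table-config")
    if table_config is None:
        logger.warning(
            "'table-config' not found in configuration; "
            "table recognition is disabled by default"
        )
        # Return a copy so callers cannot mutate the shared default.
        return dict(DEFAULT_TABLE_CONFIG)
    return table_config
```

Returning an explicit default, rather than raising, keeps older configuration files working while making the missing section visible in the logs.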
-
myhloli authored
Ensure proper formatting of inline equations by adding spaces outside the equation delimiters to prevent markdown from interpreting the equation content as part of a link. This addresses the issue where inline OCR equations appear without the correct markdown formatting.
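The spacing rule can be sketched as below. The span representation (a list of `(kind, content)` pairs) and the function name are assumptions for illustration.

```python
def render_spans(spans):
    """Concatenate spans into markdown, padding inline equations with spaces.

    `spans` is a list of (kind, content) pairs, where kind is either
    "text" or "inline_equation" (an assumed, simplified representation).
    """
    parts = []
    for kind, content in spans:
        if kind == "inline_equation":
            # Spaces go outside the $ delimiters so neighbouring brackets
            # or parentheses are not read together with the equation.
            parts.append(f" ${content.strip()}$ ")
        else:
            parts.append(content)
    return "".join(parts)
```

With this sketch, `[("text", "see [a]"), ("inline_equation", "x^2"), ("text", "(b)")]` renders as `see [a] $x^2$ (b)` instead of running the equation into the surrounding brackets.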
-
- 02 Aug, 2024 3 commits
-
-
xuchao authored
-
xuchao authored
-
Kaiwen Liu authored
* # add table recognition using struct-eqtable
  ## Changelog 31/07/2024
  - Support table recognition. Table images will be converted into html.
  ### how to use the new feature:
  set the attribute 'table-mode' to 'true' in magic-pdf.json
  ### caution:
  it takes 200s to 500s to convert a single table image using cpu
* # add table recognition using struct-eqtable
  ## Changelog 31/07/2024
  - Support table recognition. Table images will be converted into LaTeX.
  ### how to use the new feature:
  set the attribute 'table-mode' to 'true' in magic-pdf.json
  ### caution:
  it takes 200s to 500s to convert a single table image using cpu
* # feat(model inference): add table recognition and conversion to LaTeX
  # What's Changed
  ### New Features
  - Add table content recognition; we use the weights of [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy) to convert table images to LaTeX.
  ### Instruction
  - pip install pypandoc struct-eqtable==0.1.0
  - Download [StructEqTable weights](https://huggingface.co/wanderkid/PDF-Extract-Kit/tree/main/models/TabRec) and put them under the models/ directory.
  - Edit the 'table-mode' value to turn on the table recognition function, which is turned off by default.
  - If you did not download any models before, refer to [how to download models](docs/how_to_download_models_zh_cn.md).
* add table recognition and conversion to LaTeX
* add table recognition and conversion to LaTeX
* add table recognition and conversion to LaTeX
* add table recognition and conversion to LaTeX
---------
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
-
- 01 Aug, 2024 8 commits
-
-
icecraft authored
* feat: remove dummy code, magic_pdf/cli, magic_pdf/train_utils
* feat: expose version in command line
---------
Co-authored-by: shenguanlin <shenguanlin@pjlab.org.cn>
-
xuchao authored
-
icecraft authored
* feat: refactor cli command
* feat: add docs to describe the output files of cli
* feat: resolve review comments
* feat: update docs about middle.json
---------
Co-authored-by: shenguanlin <shenguanlin@pjlab.org.cn>
-
liukaiwen authored
-
liukaiwen authored
-
liukaiwen authored
-
liukaiwen authored
-
liukaiwen authored
# What's Changed
### New Features
- Add table content recognition; we use the weights of [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy) to convert table images to LaTeX.
### Instruction
- pip install pypandoc struct-eqtable==0.1.0
- Download [StructEqTable weights](https://huggingface.co/wanderkid/PDF-Extract-Kit/tree/main/models/TabRec) and put them under the models/ directory.
- Edit the 'table-mode' value to turn on the table recognition function, which is turned off by default.
- If you did not download any models before, refer to [how to download models](docs/how_to_download_models_zh_cn.md).
-
- 31 Jul, 2024 3 commits
-
-
liukaiwen authored
## Changelog 31/07/2024
- Support table recognition. Table images will be converted into LaTeX.
### how to use the new feature:
set the attribute 'table-mode' to 'true' in magic-pdf.json
### caution:
it takes 200s to 500s to convert a single table image using cpu
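One way to flip the switch mentioned above from a script, as a hedged sketch. The helper name is hypothetical, and it assumes `table-mode` is a top-level key that takes a JSON boolean; the commit message writes the value as 'true' without saying whether a boolean or a string is expected.

```python
import json
from pathlib import Path


def enable_table_mode(config_path):
    """Set 'table-mode' to true in a magic-pdf.json style file (hypothetical helper)."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    # Assumption: the option is a top-level JSON boolean.
    config["table-mode"] = True
    path.write_text(
        json.dumps(config, indent=4, ensure_ascii=False),
        encoding="utf-8",
    )
    return config
```

All other keys in the file are preserved; only `table-mode` is added or overwritten.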
-
myhloli authored
-
liukaiwen authored
## Changelog 31/07/2024
- Support table recognition. Table images will be converted into html.
### how to use the new feature:
set the attribute 'table-mode' to 'true' in magic-pdf.json
### caution:
it takes 200s to 500s to convert a single table image using cpu
-
- 30 Jul, 2024 2 commits
- 29 Jul, 2024 1 commit
-
-
myhloli authored
-