- 08 Oct, 2024 5 commits
-
-
myhloli authored
- Add function to get local LayoutReader model directory - Check and use local model directory if available - Fall back to online model if local directory not found - Update model initialization to support local path - Refactor model loading in singleton class
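The local-then-online fallback described in this commit could look roughly like the sketch below. The function name `resolve_layoutreader_path` and its argument are hypothetical and not taken from the commit; the model ID `hantian/layoutreader` is the one named in the 27 Sep entry of this log.

```python
import os

# Online model ID named elsewhere in this log; used when no local copy exists.
ONLINE_MODEL_ID = "hantian/layoutreader"


def resolve_layoutreader_path(local_dir):
    """Return local_dir if it is an existing directory, else the online model ID.

    Hypothetical helper for illustration only; the real project may read the
    directory from a config file and name things differently.
    """
    if local_dir and os.path.isdir(local_dir):
        return local_dir
    return ONLINE_MODEL_ID
```

The returned string can then be handed to whatever loader the project uses, so the loading code does not need to know which source was chosen.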
-
icecraft authored
-
icecraft authored
-
myhloli authored
- Introduce a conditional memory cleanup step in the PDF extraction process - Assess available GPU memory before deciding to perform memory cleanup - Log the time taken for garbage collection when it occurs - This optimization helps to balance performance and resource utilization
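A minimal sketch of a conditional cleanup step follows. It assumes cleanup runs only when free GPU memory is below a threshold; the commit does not state the direction of the check, the threshold value, or how memory is measured, so `maybe_clean_memory`, `free_gpu_gb`, and `threshold_gb` are all hypothetical. The free-memory figure is passed in as a plain number so the example runs without a GPU library.

```python
import gc
import time


def maybe_clean_memory(free_gpu_gb, threshold_gb=8.0):
    """Run garbage collection only when free GPU memory is below threshold_gb.

    Returns the seconds spent in gc.collect(), or None if cleanup was skipped.
    Assumption: low free memory triggers cleanup; the commit does not say so.
    """
    if free_gpu_gb >= threshold_gb:
        return None  # enough memory, skip the cost of a collection pass
    start = time.perf_counter()
    gc.collect()
    elapsed = time.perf_counter() - start
    print(f"gc time: {elapsed:.2f}s")  # log how long the cleanup took
    return elapsed
```

Skipping the collection when memory is plentiful avoids paying its cost on every page, which matches the balance between performance and resource use that the commit mentions.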
-
myhloli authored
feat: add arXiv paper link to header and adjust PDF parsing logic
- Add arXiv paper link to the header template for easy access to the latest research paper. - Modify the PDF parsing logic to handle edge cases more accurately, particularly in determining the number of lines in a block based on its height.
-
- 06 Oct, 2024 1 commit
-
-
myhloli authored
- Enhance timing output precision to two decimal places for better readability - Calculate and log document analysis speed in pages per second - Optimize logging for YOLO and table recognition processes - Remove unnecessary comments and improve code efficiency
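The pages-per-second figure is a simple ratio of page count to elapsed time. The sketch below shows one way to compute and format it to two decimal places; the function name `analysis_speed` and the log wording are hypothetical.

```python
def analysis_speed(page_count, elapsed_seconds):
    """Return pages processed per second, rounded to two decimal places."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return round(page_count / elapsed_seconds, 2)


# Example: 10 pages analysed in 4 seconds.
print(f"doc analyze speed: {analysis_speed(10, 4.0):.2f} pages/second")
```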
-
- 30 Sep, 2024 1 commit
-
-
myhloli authored
-
- 29 Sep, 2024 2 commits
-
-
myhloli authored
- Insert lines into blocks based on median line height - Calculate block index using line indices median - Remove virtual line information for table and image blocks - Enhance line sorting algorithm for different block types - Add line height calculation function
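The median-based steps in this commit can be sketched as below. All function names are hypothetical, bounding boxes are assumed to be `(x0, y0, x1, y1)` tuples, and the rule for splitting a block into virtual lines (block height divided by the median line height, rounded down) is an illustrative assumption rather than the commit's actual logic.

```python
from statistics import median


def line_height(bbox):
    """Height of a line given its (x0, y0, x1, y1) bounding box."""
    _, y0, _, y1 = bbox
    return y1 - y0


def median_line_height(line_bboxes):
    """Median height across all line bounding boxes on a page."""
    return median(line_height(b) for b in line_bboxes)


def split_into_virtual_lines(block_bbox, typical_height):
    """Split a block without lines into equal-height virtual lines (assumed rule)."""
    x0, y0, x1, y1 = block_bbox
    count = max(1, int((y1 - y0) // typical_height))
    step = (y1 - y0) / count
    return [(x0, y0 + i * step, x1, y0 + (i + 1) * step) for i in range(count)]


def block_index(line_indices):
    """Order key for a block: the median of its lines' sorted indices."""
    return median(line_indices)
```

Using the median rather than the mean keeps a single unusually tall or misplaced line from shifting the result for the whole block.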
-
myhloli authored
The clean_memory function has been removed from pdf_parse_union_core_v2.py because it was unused. This change streamlines the code and prevents potential confusion regarding its purpose.
-
- 28 Sep, 2024 3 commits
-
-
myhloli authored
Update import statements in `pdf_parse_union_core_v2.py` to directly import `prepare_inputs`, `boxes2inputs`, and `parse_logits` from `magic_pdf.model.v3.helpers` instead of from `magic_pdf.model.v3`. This change streamlines the imports, making the code more readable and maintaining a cleaner approach to modular design.
-
myhloli authored
Adapt import statements in `pdf_parse_union_core_v2.py` to reflect the updated package structure, changing from the `magic_pdf.v3.helpers` module to the `magic_pdf.model.v3` module. This ensures compatibility with the revised directory layout.
-
myhloli authored
Blocks without lines are now correctly indexed even when they contain textual content rendered as images. The sorting logic has been updated to accommodate this scenario. Additionally, the LayoutLMv3 model initialization has been enhanced to utilize bfloat16 precision on devices that support it, offering potential performance benefits on supported hardware.
-
- 27 Sep, 2024 9 commits
-
-
myhloli authored
Removed redundant sorting of lines by model and optimized calculation of block indexes by using a single pass through the sorted lines. This change simplifies the code and potentially improves performance by reducing the number of sorting operations and unnecessary iterations over blocks without lines.
-
myhloli authored
refactor(draw_bbox): remove commented-out code and streamline bbox drawing
Removed legacy commented-out code related to layout_bbox_list from draw_bbox.py, which was used for diagnostic purposes and was no longer necessary. This change streamlines the codebase and clarifies the drawing process of bounding boxes on PDF pages. The update also adjusts the order of operations slightly for improved readability without altering the functionality.
-
myhloli authored
refactor(pdf_parse_union_core_v2): implement model initialization within class
Refactored model initialization to be handled by a singleton class to ensure that model instances are reused across calls, avoiding redundant initializations. Removed logger information that was commented out and ensured consistency in logging behavior.
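One common way to reuse model instances across calls is a singleton that caches loaded models by name. The sketch below illustrates that pattern only; the class name `ModelSingleton`, the `get_model` method, and the `loader` callback are hypothetical and may not match the project's code.

```python
class ModelSingleton:
    """Process-wide cache so each model is initialized at most once."""

    _instance = None
    _models = {}

    def __new__(cls):
        # Always hand back the same instance.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def get_model(self, name, loader):
        """Return the cached model for name, calling loader() only on first use."""
        if name not in self._models:
            self._models[name] = loader()
        return self._models[name]
```

Callers would write something like `ModelSingleton().get_model("layoutreader", load_fn)`, where `load_fn` is whatever function builds the model; repeated calls with the same name return the already loaded object.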
-
myhloli authored
Refactor the draw bbox functions by removing unused imports and simplifying the code logic for drawing layout and line sorting bounding boxes. Adjust the debug configuration to enable content list dumping and disable markdown making mode.
-
myhloli authored
Introduce an additional argument `draw_bbox` in the `draw_bbox_with_number` function to enable toggling the drawing of bounding boxes on or off. When set to `False`, no bounding box will be drawn, allowing for situations where only text
-
myhloli authored
Remove debug code related to layout bbox visualization and adjust drawing functions to support optional line sorting bboxes. This change includes the removal of `draw_layout_bbox` function and updates to `draw_bbox_with_number` to support variable line width for bbox drawing.
-
myhloli authored
Add a new function `draw_line_sort_bbox` to visualize the sorting of lines on each page. This includes indexing lines and handling both text and non-text elements such as tables and images for better content organization. Also, comment out GPU-related code for flexibility and remove overlaps in bounding box detection, which improves the accuracy of layout splitting.
-
myhloli authored
refactor(pdf_parse_union): integrate LayoutLMv3 for block ordering
Replace the heuristic-based block ordering algorithm with LayoutLMv3 model predictions to improve the accuracy of block ordering on PDF pages. Additionally, refactor the span handling during block filling to ensure spans are correctly assigned. - Introduce LayoutLMv3ForTokenClassification from 'hantian/layoutreader' to predict block order. - Implement span replacement strategy to use pymu spans for non-OCR content. - Enhance cleanup process to free GPU memory more effectively after model use. - Adjust block ordering logic to use median line index for text, title, and interline equation blocks. - Refactor page parsing core logic for better maintainability. BREAKING CHANGE: The integration of LayoutLMv3 changes the internal block handling and ordering mechanism, which may affect downstream systems relying on the previous implementation. Test thoroughly before deployment.
-
myhloli authored
- Added CUDA cache clearing after layoutreader prediction to free up GPU memory. - Modified the bbox sorting logic to sort text and title blocks separately. - Adjusted drawing colors for better distinction in debug visualizations.
-
- 26 Sep, 2024 2 commits
-
-
myhloli authored
- Added CUDA cache clearing after layoutreader prediction to free up GPU memory. - Modified the bbox sorting logic to sort text and title blocks separately. - Adjusted drawing colors for better distinction in debug visualizations.
-
myhloli authored
Implement a new function `draw_layout_sort_bbox` in `draw_bbox.py` to visualize the layout sorting results using the `LayoutLMv3ForTokenClassification` model. This function predicts the order of layout elements and draws them in the sorted sequence on the PDF pages.
-
- 25 Sep, 2024 1 commit
-
-
myhloli authored
Implement a new function `draw_layout_sort_bbox` in `draw_bbox.py` to visualize the layout sorting results using the `LayoutLMv3ForTokenClassification` model. This function predicts the order of layout elements and draws them in the sorted sequence on the PDF pages.
-
- 20 Sep, 2024 1 commit
-
-
myhloli authored
-
- 19 Sep, 2024 2 commits
- 18 Sep, 2024 4 commits
-
-
myhloli authored
-
myhloli authored
-
myhloli authored
-
myhloli authored
feat(ocr_mkcontent): support drop reason in none_with_reason mode
Enable the `NONE_WITH_REASON` drop mode in `para_to_standard_format_v2` by updating the function signature to include the `drop_reason` parameter and handling it within the function logic. This enhancement allows the function to convey the reason for dropping content in the output.
-
- 12 Sep, 2024 7 commits
-
-
Xiaomeng Zhao authored
-
myhloli authored
-
icecraft authored
-
icecraft authored
fix: 1. resolve incorrect pair relation of figure and footnote, 2. resolve incorrect pair relation of table and caption #590
-
quyuan authored
-
quyuan authored
-
myhloli authored
The pipeline now supports passing the language parameter to parsing functions and during markdown conversion to optimize processing based on the specified language. This enhancement allows for more accurate parsing and markdown generation, particularly when dealing with non-English content.
-
- 10 Sep, 2024 2 commits
-
-
myhloli authored
-
drunkpig authored
* release: release 0.7.1 version (#526) * Update README_zh-CN.md (#404) (#409) correct FAQ url Co-authored-by:
sfk <18810651050@163.com> * add dockerfile (#189) Co-authored-by:
drunkpig <60862764+drunkpig@users.noreply.github.com> * Update cla.yml * Update cla.yml * feat<table model>: add tablemaster with paddleocr to detect and recognize table (#493) * Update cla.yml * Update bug_report.yml * Update README_zh-CN.md (#404) correct FAQ url * Update README_zh-CN.md (#404) (#409) (#410) correct FAQ url Co-authored-by:
sfk <18810651050@163.com> * Update FAQ_zh_cn.md add new issue * Update FAQ_en_us.md * Update README_Windows_CUDA_Acceleration_zh_CN.md * Update README_zh-CN.md * @Thepathakarpit has signed the CLA in opendatalab/MinerU#418 * Update cla.yml * feat: add tablemaster_paddle (#463) * Update README_zh-CN.md (#404) (#409) correct FAQ url Co-authored-by:
sfk <18810651050@163.com> * add dockerfile (#189) Co-authored-by:
drunkpig <60862764+drunkpig@users.noreply.github.com> * Update cla.yml * Update cla.yml --------- Co-authored-by:
drunkpig <60862764+drunkpig@users.noreply.github.com> Co-authored-by:
sfk <18810651050@163.com> Co-authored-by:
Aoyang Fang <222010547@link.cuhk.edu.cn> Co-authored-by:
Xiaomeng Zhao <moe@myhloli.com> * <fix>(para_split_v2): index out of range issue of span_text first char (#396) Co-authored-by:
liukaiwen <liukaiwen@pjlab.org.cn> * @Matthijz98 has signed the CLA in opendatalab/MinerU#467 * Create download_models.py * Create requirements-docker.txt * feat<table model>: add tablemaster with paddleocr to detect and recognize table * @strongerfly has signed the CLA in opendatalab/MinerU#487 * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table --------- Co-authored-by:
Xiaomeng Zhao <moe@myhloli.com> Co-authored-by:
sfk <18810651050@163.com> Co-authored-by:
drunkpig <60862764+drunkpig@users.noreply.github.com> Co-authored-by:
github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by:
Aoyang Fang <222010547@link.cuhk.edu.cn> Co-authored-by:
liukaiwen <liukaiwen@pjlab.org.cn> * feat<table model>: add tablemaster with paddleocr to detect and recognize table (#508) * Update cla.yml * Update bug_report.yml * Update README_zh-CN.md (#404) correct FAQ url * Update README_zh-CN.md (#404) (#409) (#410) correct FAQ url Co-authored-by:
sfk <18810651050@163.com> * Update FAQ_zh_cn.md add new issue * Update FAQ_en_us.md * Update README_Windows_CUDA_Acceleration_zh_CN.md * Update README_zh-CN.md * @Thepathakarpit has signed the CLA in opendatalab/MinerU#418 * Update cla.yml * feat: add tablemaster_paddle (#463) * Update README_zh-CN.md (#404) (#409) correct FAQ url Co-authored-by:
sfk <18810651050@163.com> * add dockerfile (#189) Co-authored-by:
drunkpig <60862764+drunkpig@users.noreply.github.com> * Update cla.yml * Update cla.yml --------- Co-authored-by:
drunkpig <60862764+drunkpig@users.noreply.github.com> Co-authored-by:
sfk <18810651050@163.com> Co-authored-by:
Aoyang Fang <222010547@link.cuhk.edu.cn> Co-authored-by:
Xiaomeng Zhao <moe@myhloli.com> * <fix>(para_split_v2): index out of range issue of span_text first char (#396) Co-authored-by:
liukaiwen <liukaiwen@pjlab.org.cn> * @Matthijz98 has signed the CLA in opendatalab/MinerU#467 * Create download_models.py * Create requirements-docker.txt * feat<table model>: add tablemaster with paddleocr to detect and recognize table * @strongerfly has signed the CLA in opendatalab/MinerU#487 * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table * Update cla.yml * Delete .github/workflows/gpu-ci.yml * Update Huggingface and ModelScope links to organization account * feat<table model>: add tablemaster with paddleocr to detect and recognize table --------- Co-authored-by:
Xiaomeng Zhao <moe@myhloli.com> Co-authored-by:
sfk <18810651050@163.com> Co-authored-by:
drunkpig <60862764+drunkpig@users.noreply.github.com> Co-authored-by:
github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by:
Aoyang Fang <222010547@link.cuhk.edu.cn> Co-authored-by:
liukaiwen <liukaiwen@pjlab.org.cn> Co-authored-by:
yyy <102640628+dt-yy@users.noreply.github.com> Co-authored-by:
wangbinDL <wangbin_research@163.com> * feat<table model>: add tablemaster with paddleocr to detect and recognize table (#511) * Update cla.yml * Update bug_report.yml * Update README_zh-CN.md (#404) correct FAQ url * Update README_zh-CN.md (#404) (#409) (#410) correct FAQ url Co-authored-by:
sfk <18810651050@163.com> * Update FAQ_zh_cn.md add new issue * Update FAQ_en_us.md * Update README_Windows_CUDA_Acceleration_zh_CN.md * Update README_zh-CN.md * @Thepathakarpit has signed the CLA in opendatalab/MinerU#418 * Update cla.yml * feat: add tablemaster_paddle (#463) * Update README_zh-CN.md (#404) (#409) correct FAQ url Co-authored-by:
sfk <18810651050@163.com> * add dockerfile (#189) Co-authored-by:
drunkpig <60862764+drunkpig@users.noreply.github.com> * Update cla.yml * Update cla.yml --------- Co-authored-by:
drunkpig <60862764+drunkpig@users.noreply.github.com> Co-authored-by:
sfk <18810651050@163.com> Co-authored-by:
Aoyang Fang <222010547@link.cuhk.edu.cn> Co-authored-by:
Xiaomeng Zhao <moe@myhloli.com> * <fix>(para_split_v2): index out of range issue of span_text first char (#396) Co-authored-by:
liukaiwen <liukaiwen@pjlab.org.cn> * @Matthijz98 has signed the CLA in opendatalab/MinerU#467 * Create download_models.py * Create requirements-docker.txt * feat<table model>: add tablemaster with paddleocr to detect and recognize table * @strongerfly has signed the CLA in opendatalab/MinerU#487 * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table * Update cla.yml * Delete .github/workflows/gpu-ci.yml * Update Huggingface and ModelScope links to organization account * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table --------- Co-authored-by:
Xiaomeng Zhao <moe@myhloli.com> Co-authored-by:
sfk <18810651050@163.com> Co-authored-by:
drunkpig <60862764+drunkpig@users.noreply.github.com> Co-authored-by:
github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by:
Aoyang Fang <222010547@link.cuhk.edu.cn> Co-authored-by:
liukaiwen <liukaiwen@pjlab.org.cn> Co-authored-by:
yyy <102640628+dt-yy@users.noreply.github.com> Co-authored-by:
wangbinDL <wangbin_research@163.com> --------- Co-authored-by:
drunkpig <60862764+drunkpig@users.noreply.github.com> Co-authored-by:
sfk <18810651050@163.com> Co-authored-by:
Aoyang Fang <222010547@link.cuhk.edu.cn> Co-authored-by:
Xiaomeng Zhao <moe@myhloli.com> Co-authored-by:
Kaiwen Liu <lkw_buaa@163.com> Co-authored-by:
github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by:
liukaiwen <liukaiwen@pjlab.org.cn> Co-authored-by:
wangbinDL <wangbin_research@163.com> * Hotfix readme 0.7.1 (#528) * Update README.md * Update README_zh-CN.md * Update README_zh-CN.md * Update README.md * Update README_zh-CN.md * Update README_zh-CN.md add HF, modelscope, colab url * Update README.md * Update README.md * Update README.md * Update README.md * Update README_zh-CN.md * Rename README.md to README_zh-CN.md * Create readme.md * Rename readme.md to README.md * Rename README.md to README_zh-CN.md * Update README_zh-CN.md * Create README.md * Update README.md * Update README.md * Update README.md * Update README_zh-CN.md * Create download_models_hf.py * Update README.md * Update README_zh-CN.md * Update README_zh-CN.md * Update README.md * Update README_zh-CN.md * Update FAQ_zh_cn.md * Update FAQ_en_us.md * Update FAQ_zh_cn.md * fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 (#573) * fix: resolve inaccuracy of drawing layout box caused by paragraphs combination * fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 * fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 * fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 * fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 * Update README_zh-CN.md * Update README.md * Update README.md * Update README.md * Update README_zh-CN.md * add rag data api * Update README_zh-CN.md update rag api image * Update README.md docs: remove RAG related release notes * Update README_zh-CN.md docs: remove RAG related release notes * Update README_zh-CN.md update changelog --------- Co-authored-by:
yyy <102640628+dt-yy@users.noreply.github.com> Co-authored-by:
sfk <18810651050@163.com> Co-authored-by:
Aoyang Fang <222010547@link.cuhk.edu.cn> Co-authored-by:
Xiaomeng Zhao <moe@myhloli.com> Co-authored-by:
Kaiwen Liu <lkw_buaa@163.com> Co-authored-by:
github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by:
liukaiwen <liukaiwen@pjlab.org.cn> Co-authored-by:
wangbinDL <wangbin_research@163.com>
-