  1. 24 Dec, 2024 1 commit
    • feat(llm): add LLM-aided formula and text correction · c660fdc8
      myhloli authored
      - Add LLM-aided formula and text correction functionality
      - Update config reader to include LLM-aided settings
      - Create new LLM-aided processing module
      - Update main processing script to incorporate LLM-aided corrections
      - Modify download scripts to check for new config version
      c660fdc8
  2. 11 Dec, 2024 2 commits
  3. 10 Dec, 2024 1 commit
  4. 03 Dec, 2024 2 commits
  5. 02 Dec, 2024 1 commit
  6. 29 Nov, 2024 2 commits
  7. 28 Nov, 2024 1 commit
    • refactor(pdf_check): improve character detection using PyMuPDF · ac888156
      myhloli authored
      - Replace pdfminer with PyMuPDF for character detection
      - Implement new method detect_invalid_chars_by_pymupdf
      - Update check_invalid_chars in pdf_meta_scan.py to use new method
      - Add __replace_0xfffd function in pdf_parse_union_core_v2.py to handle special characters
      - Remove unused imports and update requirements.txt
      ac888156
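      The character check described in this commit can be illustrated with a minimal sketch. It assumes "invalid" means the Unicode replacement character U+FFFD and that the page text has already been extracted (for example with PyMuPDF's `page.get_text()`). The names `invalid_char_ratio`, `looks_garbled`, and `replace_0xfffd`, the 5% threshold, and the choice of a space as substitute are all hypothetical and are not taken from the repository.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, emitted when a glyph cannot be decoded


def invalid_char_ratio(text: str) -> float:
    """Return the share of non-whitespace characters that are U+FFFD."""
    visible = [c for c in text if not c.isspace()]
    if not visible:
        return 0.0
    return sum(c == REPLACEMENT_CHAR for c in visible) / len(visible)


def looks_garbled(text: str, threshold: float = 0.05) -> bool:
    """Flag text whose U+FFFD share exceeds the (assumed) threshold."""
    return invalid_char_ratio(text) > threshold


def replace_0xfffd(text: str, substitute: str = " ") -> str:
    """Swap every U+FFFD for a substitute (a space is an assumption)."""
    return text.replace(REPLACEMENT_CHAR, substitute)
```

      A document flagged this way would typically be routed to OCR instead of direct text extraction; whether the project does exactly that is not stated in the commit message.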
  8. 27 Nov, 2024 2 commits
  9. 26 Nov, 2024 3 commits
  10. 25 Nov, 2024 1 commit
  11. 22 Nov, 2024 1 commit
  12. 21 Nov, 2024 1 commit
  13. 19 Nov, 2024 1 commit
  14. 18 Nov, 2024 1 commit
  15. 15 Nov, 2024 1 commit
  16. 08 Nov, 2024 2 commits
  17. 07 Nov, 2024 1 commit
    • feat(model): add xycut algorithm for block sorting · 7d5850e3
      myhloli authored
      - Implement xycut algorithm to sort blocks when layoutreader fails
      - Add recursive_xy_cut function to perform the xycut algorithm
      - Update pdf_parse_union_core_v2.py to use xycut when layoutreader fails
      - Modify draw_bbox.py to handle cases where layoutreader fails to sort blocks
      7d5850e3
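      For readers unfamiliar with XY cut, the idea behind the fallback sorting can be sketched as follows. This is a simplified variant written for illustration, not the repository's `recursive_xy_cut`: it assumes boxes are `(x0, y0, x1, y1)` tuples with the origin at the top left, tries horizontal cuts before vertical ones, and falls back to a plain top-to-bottom, left-to-right sort when no cut exists. The names `xy_cut_order` and `_split_by_gaps` are hypothetical.

```python
from typing import List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), origin top-left


def _split_by_gaps(boxes: Sequence[BBox], indices: List[int], axis: int) -> List[List[int]]:
    """Group box indices into runs separated by empty gaps along one axis.

    axis 0 projects onto x (vertical cuts), axis 1 onto y (horizontal cuts).
    """
    ordered = sorted(indices, key=lambda i: boxes[i][axis])
    groups = [[ordered[0]]]
    reach = boxes[ordered[0]][axis + 2]  # farthest edge covered so far
    for i in ordered[1:]:
        if boxes[i][axis] >= reach:  # empty gap found: start a new group
            groups.append([i])
        else:
            groups[-1].append(i)
        reach = max(reach, boxes[i][axis + 2])
    return groups


def xy_cut_order(boxes: Sequence[BBox]) -> List[int]:
    """Return box indices in reading order using recursive XY cuts."""

    def recurse(indices: List[int]) -> List[int]:
        if len(indices) <= 1:
            return list(indices)
        for axis in (1, 0):  # horizontal cuts first, then vertical cuts
            groups = _split_by_gaps(boxes, indices, axis)
            if len(groups) > 1:
                return [i for group in groups for i in recurse(group)]
        # No cut separates the boxes: fall back to a simple positional sort.
        return sorted(indices, key=lambda i: (boxes[i][1], boxes[i][0]))

    return recurse(list(range(len(boxes))))
```

      On a page with a full-width title above two columns, the first horizontal cut isolates the title, a vertical cut then separates the columns, and each column is read top to bottom.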
  18. 06 Nov, 2024 2 commits
  19. 01 Nov, 2024 1 commit
    • feat(pdf_parse): improve span filtering and add new block types · 149132d6
      myhloli authored
      - Refactor remove_outside_spans function to filter spans more accurately
      - Add image_footnote, index, and list block types to output file documentation
      - Update draw_span_bbox to use preproc_blocks instead of para_blocks
      - Bump version to 0.9.0
      149132d6
  20. 28 Oct, 2024 1 commit
  21. 26 Oct, 2024 1 commit
    • feat(draw_bbox): update bounding box drawing for tables and images · 0e8d5893
      myhloli authored
      - Add support for drawing bounding boxes of table and image sub-blocks
      - Implement sorting of table blocks based on type order
      - Update bounding box drawing for text and title blocks
      - Refactor code to handle different block types and their sub-blocks
      0e8d5893
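      Sorting sub-blocks "based on type order" can be sketched with a rank table and a stable sort. The type names, their ranking (caption, then body, then footnote), the dictionary shape of a block, and the function name `sort_table_sub_blocks` are assumptions made for illustration; the commit message does not specify them.

```python
from typing import Dict, List

# Assumed ranking of table sub-block types; unknown types sort last.
TYPE_ORDER: Dict[str, int] = {
    "table_caption": 0,
    "table_body": 1,
    "table_footnote": 2,
}


def sort_table_sub_blocks(blocks: List[dict]) -> List[dict]:
    """Return sub-blocks ordered by type rank, keeping ties in input order."""
    return sorted(blocks, key=lambda b: TYPE_ORDER.get(b["type"], len(TYPE_ORDER)))
```

      Because `sorted` is stable, two footnotes keep the relative order they had in the input.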
  22. 24 Oct, 2024 1 commit
  23. 23 Oct, 2024 1 commit
    • feat(model): add support for DocLayout-YOLO model · 1279f2cd
      myhloli authored
      - Add new layout model option: DocLayout-YOLO
      - Implement model initialization and prediction for DocLayout-YOLO
      - Update configuration options to include new model
      - Modify existing code to support both LayoutLMv3 and DocLayout-YOLO models
      - Update Gradio app to support more custom switches
      1279f2cd
  24. 17 Oct, 2024 1 commit
  25. 14 Oct, 2024 2 commits
    • fix(magic_pdf): include List and Index block types in processing · 0a9a6d3e
      myhloli authored
      Add List and Index to the list of block types being processed in the draw_bbox.py file. This inclusion ensures that these block types are handled similarly to other text-containing blocks, improving the overall document processing accuracy and consistency.
      0a9a6d3e
    • feat(list&index block): detect and merge list and index blocks · 1f1dd353
      myhloli authored
      - Add detection for list and index blocks in OCR processing
      - Implement merging of list and index blocks across pages
      - Update block types to include list and index categories
      - Adjust text merging logic to handle new block types
      - Modify layout drawing to distinguish list and index blocks
      1f1dd353
  26. 08 Oct, 2024 1 commit
  27. 29 Sep, 2024 2 commits
  28. 27 Sep, 2024 3 commits
    • refactor(draw_bbox): remove commented-out code and streamline bbox... · 83c07387
      myhloli authored
      refactor(draw_bbox): remove commented-out code and streamline bbox drawing

      Removed legacy commented-out code related to layout_bbox_list from draw_bbox.py, which
      was used for diagnostic purposes and was no longer necessary. This change streamlines
      the codebase and clarifies the drawing process of bounding boxes on PDF pages. The update
      also adjusts the order of operations slightly for improved readability without altering
      the functionality.
      83c07387
    • refactor(drawing): simplify draw bbox functions and adjust debug config · b2790f6f
      myhloli authored
      Refactor the draw bbox functions by removing unused imports and simplifying the
      code logic for drawing layout and line sorting bounding boxes. Adjust the debug
      configuration to enable content list dumping and disable markdown making mode.
      b2790f6f
    • feat(draw_bbox): add option to toggle bounding box drawing · 43a57d56
      myhloli authored
      Introduce an additional argument `draw_bbox` in the `draw_bbox_with_number` function to
      enable toggling the drawing of bounding boxes on or off. When set to `False`, no bounding
      box will be drawn, allowing for situations where only text
      43a57d56
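      The toggle described in this commit can be illustrated with a small, library-free sketch. It does not reproduce `draw_bbox_with_number`; the function name `plan_draw_ops`, the operation tuples, and the choice to keep emitting the number label when the rectangle is suppressed are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
DrawOp = Tuple[str, int, BBox]            # (kind, block number, bbox)


def plan_draw_ops(bboxes: Sequence[BBox], draw_bbox: bool = True) -> List[DrawOp]:
    """List the drawing operations for numbered boxes.

    When draw_bbox is False, rectangle operations are skipped and only the
    number labels remain (an assumed behaviour for this sketch).
    """
    ops: List[DrawOp] = []
    for number, bbox in enumerate(bboxes, start=1):
        if draw_bbox:
            ops.append(("rect", number, bbox))
        ops.append(("label", number, bbox))
    return ops
```

      Returning a plan rather than drawing directly keeps the sketch independent of any PDF library; a real implementation would issue the corresponding drawing calls instead.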