Commits · d01acab4be2c3a2c1ed5d0517cce15ccf2fc1724 · wangsen / MinerU

25 Jul, 2024 1 commit

fix(pdf_extract_kit): specify utf-8 encoding when reading model config · 20499ec3

myhloli authored Jul 25, 2024

Ensure the model configuration file is read with utf-8 encoding to support
non-ASCII characters and prevent potential encoding errors.
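As a minimal sketch of the kind of fix this commit describes (not the actual MinerU code), the snippet below reads a configuration file with an explicit utf-8 encoding. The helper name `read_config` and the use of JSON are assumptions made for illustration; the real model config may use a different format and loader.

```python
import json
import os
import tempfile


def read_config(path):
    """Read a JSON config file, decoding it explicitly as UTF-8.

    Hypothetical helper for illustration; not part of MinerU.
    """
    # Without encoding="utf-8", open() typically falls back to the platform's
    # locale encoding, which can fail on non-ASCII text on some systems.
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    # Write a config containing non-ASCII characters, then read it back.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "config.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"models-dir": "/data/模型"}, f, ensure_ascii=False)
        print(read_config(path)["models-dir"])
```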

24 Jul, 2024 5 commits
- fix(config_reader): add utf-8 encoding when reading config file · d244a1c1
  myhloli authored Jul 25, 2024
```
Specify utf-8 encoding when opening the configuration file to ensure
compatibility with files containing non-ASCII characters, avoiding potential
encoding errors.
```
- feat(magic-pdf): add conditional application of formula detection and recognition · 4c39bcd3
  赵小蒙 authored Jul 24, 2024
  
- fix(magic-pdf): add config file name constant and improve error messages · bba53839
  myhloli authored Jul 24, 2024
  
- fix(magic-pdf): add default values and improve warning logs for config options · 30ac6f22
  myhloli authored Jul 24, 2024
```
fix(magic-pdf): add default values and improve warning logs for config options

Ensure that 'temp-output-dir', 'models-dir', and 'device-mode' have sensible default
values in case they are not specified in the config file.
```
- fix(magic-pdf): prevent Albumentations update check · 4bf58088
  myhloli authored Jul 24, 2024
  
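
The default-value fallback described in commit 30ac6f22 above could look roughly like the sketch below. The three option keys come from the commit message; the default values, the `DEFAULTS` table, and the `get_option` helper are assumptions for illustration and are not taken from the MinerU source.

```python
import logging

logger = logging.getLogger(__name__)

# Keys come from the commit message; the default values are assumptions.
DEFAULTS = {
    "temp-output-dir": "/tmp",
    "models-dir": "/tmp/models",
    "device-mode": "cpu",
}


def get_option(config, key):
    """Return config[key], or a default with a warning if the key is missing."""
    if key in config:
        return config[key]
    default = DEFAULTS[key]
    logger.warning("'%s' not found in config, using default: %s", key, default)
    return default


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    config = {"device-mode": "cuda"}
    print(get_option(config, "device-mode"))  # value taken from the config
    print(get_option(config, "models-dir"))   # falls back to the default
```
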
23 Jul, 2024 6 commits
- feat(magic_pdf): update installation commands for simplified dependency options · f6f1d00d
  myhloli authored Jul 24, 2024
  
- fix(magic_pdf): use interline_equations instead of interline_equation_blocks · e831df80
  myhloli authored Jul 23, 2024
  
- fix(magic_pdf): prevent division by zero in citationmarker removal · 8411c910
  myhloli authored Jul 23, 2024
  
- refactor(magic_pdf): replace math module with local_math · 12bec17e
  myhloli authored Jul 23, 2024
  
- feat(language): add FT LANG cache directory setup · 57380cbe
  myhloli authored Jul 23, 2024
  
- fix(magic_pdf): filter out formulas outside image bounds during cropped_img · ee81b339
  myhloli authored Jul 23, 2024
  
22 Jul, 2024 3 commits
- fix(magic_pdf): correct color channel conversion for OCR in PDF extract · c9059987
  myhloli authored Jul 22, 2024
  
- fix(magic_pdf): optimize formula area selection for OCR · e7ce3051
  myhloli authored Jul 22, 2024
  
- fix(magic_pdf): prevent removal of low-confidence spans already dropped · 5f992de4
  myhloli authored Jul 22, 2024
  
19 Jul, 2024 3 commits
- fix: remove personal info · 81260a22
  myhloli authored Jul 19, 2024
  
- add gpu ci · 305c77cd
  quyuan authored Jul 19, 2024
  
- add gpu ci · f7120c82
  quyuan authored Jul 19, 2024
  
18 Jul, 2024 1 commit
- fix(magic_pdf): handle import errors with exception logging · 46cacbc0
  myhloli authored Jul 18, 2024
  
17 Jul, 2024 4 commits
- docs(cli_help): update Chinese PDF path description · 63b3cfeb
  myhloli authored Jul 17, 2024
  
- fix: object cluster algorithm · ddff4b42
  blue authored Jul 17, 2024
  
- feat(magic_pdf): enable inside model usage by default · 6d65855c
  myhloli authored Jul 17, 2024
  
- feat(magicpdf): set default value for inside_model to True · 1e3c1ef5
  myhloli authored Jul 17, 2024
  
15 Jul, 2024 1 commit
- refactor(layoutlmv3): remove outdated COCO instances registration · 724001df
  myhloli authored Jul 15, 2024
  
14 Jul, 2024 2 commits

refactor(magic_pdf): optimize model loading and support list file input · 13788ca1

myhloli authored Jul 14, 2024

Improve the model loading mechanism in magic_pdf by implementing a Singleton
pattern to reduce redundant model instantiation. Additionally, enhance the
command-line interface to support input from list files, allowing batch
processing of multiple PDF documents.


refactor(magic_pdf): implement model singleton pattern for custom models · 054abe33

myhloli authored Jul 14, 2024

Introduce a Singleton pattern to manage custom models in the magic_pdf module.
This change improves efficiency by ensuring that a single instance of the
custom model is created and reused, thereby reducing the overhead of multiple
instantiation calls for the same model configuration.

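The Singleton approach described in the two 14 Jul commits can be illustrated with a small sketch. It assumes a cache keyed by model configuration and a caller-supplied factory; the class name `ModelSingleton`, the `get_model` method, and the key format are hypothetical and are not claimed to match the MinerU implementation.

```python
class ModelSingleton:
    """Process-wide cache that builds each model configuration only once.

    Hypothetical sketch; names are not taken from MinerU.
    """

    _instance = None

    def __new__(cls):
        # Always hand back the same object, creating it on first use.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._models = {}
        return cls._instance

    def get_model(self, key, factory):
        """Return the cached model for `key`, building it via `factory` if absent."""
        if key not in self._models:
            self._models[key] = factory()
        return self._models[key]


if __name__ == "__main__":
    calls = []

    def load_model():
        # Stand-in for an expensive model load.
        calls.append(1)
        return object()

    a = ModelSingleton().get_model(("full", "cpu"), load_model)
    b = ModelSingleton().get_model(("full", "cpu"), load_model)
    print(a is b, len(calls))  # same object both times; the factory ran once
```

Keying the cache by configuration means that two different configurations still get separate model instances, while repeated requests for the same one reuse the first.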

13 Jul, 2024 2 commits
- Update version.py with new version · 2c9e69a5
  myhloli authored Jul 13, 2024
  
- fix(mkmarkdown): add 2 spaces after image and table URLs · ff13c8e1
  myhloli authored Jul 13, 2024
  
12 Jul, 2024 7 commits
- update error output · 278ba2f6
  myhloli authored Jul 13, 2024
  
- Update version.py with new version · c89af637
  myhloli authored Jul 12, 2024
  
- refactor(model): update init methods and improve model loading logic · 4101c357
  zhaoxiaomeng authored Jul 12, 2024
  
- remove useless files · 9515c2aa
  myhloli authored Jul 12, 2024
  
- feat(cli): set "full" as default model_mode for better accuracy · b6df9b18
  myhloli authored Jul 12, 2024
  
- feat(config-reader): add models-dir and device-mode configurations · 695b3579
  myhloli authored Jul 12, 2024
```
Add new configuration options for custom model directories and device mode
selection. This allows users to specify the directory where models are stored
and choose between CPU and GPU modes for model inference. The configurations
are read from a JSON file and can be easily extended to support additional
options in the future.
```
- feat(model-config): Unify all device selections through a single YAML file · 45e7fbd2
  myhloli authored Jul 12, 2024
  
11 Jul, 2024 5 commits

feat(model): add model mode selection for PDF analysis · bc0f6932

myhloli authored Jul 11, 2024

Introduce a new feature that allows users to choose between a "lite" and a "full"
model mode for PDF document analysis. The "lite" mode uses a faster, less
accurate model, while the "full" mode employs a higher-precision model at the
cost of speed. This selection can be made through the CLI or API, providing
flexibility for different use cases.

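One way to picture the "lite"/"full" selection is a small dispatch table, sketched below. The `model_mode` parameter name and the "full" default follow the commit titles in this log (b6df9b18); the analyzer functions, their return values, and the `analyze` entry point are placeholders invented for illustration, not MinerU APIs.

```python
MODEL_MODE_LITE = "lite"
MODEL_MODE_FULL = "full"


def analyze_lite(pdf_bytes):
    # Placeholder for the faster, less accurate pipeline.
    return {"mode": MODEL_MODE_LITE, "size": len(pdf_bytes)}


def analyze_full(pdf_bytes):
    # Placeholder for the slower, higher-precision pipeline.
    return {"mode": MODEL_MODE_FULL, "size": len(pdf_bytes)}


_ANALYZERS = {
    MODEL_MODE_LITE: analyze_lite,
    MODEL_MODE_FULL: analyze_full,
}


def analyze(pdf_bytes, model_mode=MODEL_MODE_FULL):
    """Run the analyzer selected by model_mode; reject unknown modes."""
    try:
        analyzer = _ANALYZERS[model_mode]
    except KeyError:
        raise ValueError(f"unknown model_mode: {model_mode!r}") from None
    return analyzer(pdf_bytes)


if __name__ == "__main__":
    print(analyze(b"%PDF-1.7", model_mode="lite"))
    print(analyze(b"%PDF-1.7"))  # uses the "full" default
```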

update:Add md make mode config in do_parse. You can control whether the... · f8f6ba6f

myhloli authored Jul 11, 2024

update:Add md make mode config in do_parse. You can control whether the produced md is for NLP or MM by changing the value of f_make_md_mode

update:add PEK model download readme · c5f939c5

myhloli authored Jul 11, 2024

update:remove useless file · 7a61afb9

myhloli authored Jul 11, 2024

update:Modify the PEK module to parse page by page. · 2b8db660

myhloli authored Jul 11, 2024