wangsen / MinerU
Commit 3379f3b3
Authored Apr 03, 2025 by icecraft

fix: support non-pdf file in batch mode

Parent: e38efb97

Showing 1 changed file with 4 additions and 0 deletions

magic_pdf/tools/cli.py (+4 -0)
@@ -137,6 +137,10 @@ def cli(path, output_dir, method, lang, debug_able, start_page_id, end_page_id):
         doc_paths = []
         for doc_path in Path(path).glob('*'):
             if doc_path.suffix in pdf_suffixes + image_suffixes + ms_office_suffixes:
+                if doc_path.suffix in ms_office_suffixes:
+                    basename = Path(doc_path).stem
+                    convert_file_to_pdf(str(doc_path), temp_dir)
+                    doc_path = Path(os.path.join(temp_dir, f'{basename}.pdf'))
                 doc_paths.append(doc_path)
         datasets = batch_build_dataset(doc_paths, 4, lang)
         batch_do_parse(output_dir, [str(doc_path.stem) for doc_path in doc_paths], datasets, method, debug_able, lang=lang)
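
The routing that this hunk adds to batch mode can be sketched in isolation as follows. This is a minimal illustration, not the project's code: `collect_doc_paths` is a hypothetical helper, the suffix lists are assumed values, and `convert` is a stand-in for `convert_file_to_pdf`, assumed to write `<stem>.pdf` into the given output directory. It also assumes the intent is to convert only Office documents and pass PDFs and images through unchanged.

```python
from pathlib import Path
from typing import Callable, Iterable, List

# Assumed suffix sets, for illustration only; the real values are defined in MinerU.
PDF_SUFFIXES = ['.pdf']
IMAGE_SUFFIXES = ['.png', '.jpeg', '.jpg']
MS_OFFICE_SUFFIXES = ['.ppt', '.pptx', '.doc', '.docx']


def collect_doc_paths(
    paths: Iterable[Path],
    temp_dir: str,
    convert: Callable[[str, str], None],
) -> List[Path]:
    """Hypothetical helper: pick supported files, swapping Office files for PDFs.

    `convert(src, out_dir)` stands in for convert_file_to_pdf and is assumed
    to write `<stem>.pdf` into `out_dir`.
    """
    supported = PDF_SUFFIXES + IMAGE_SUFFIXES + MS_OFFICE_SUFFIXES
    doc_paths = []
    for doc_path in paths:
        if doc_path.suffix not in supported:
            continue  # unsupported files are skipped, as in the batch loop
        if doc_path.suffix in MS_OFFICE_SUFFIXES:
            # Convert first, then point at the converted PDF in temp_dir.
            convert(str(doc_path), temp_dir)
            doc_path = Path(temp_dir) / f'{doc_path.stem}.pdf'
        doc_paths.append(doc_path)
    return doc_paths


if __name__ == '__main__':
    # A recording fake replaces the real converter so the sketch runs anywhere.
    converted = []
    inputs = [Path('in/a.pdf'), Path('in/b.docx'), Path('in/c.txt')]
    result = collect_doc_paths(inputs, 'tmp', lambda src, out: converted.append(src))
    print(result, converted)
```

Passing the converter in as a parameter keeps the selection logic testable without LibreOffice or any files on disk. In the commit itself the call is made directly inside the loop.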