Commit 869cf0a6 authored by myhloli

Merge remote-tracking branch 'origin/dev' into dev

parents 29681c4f cc859604
......@@ -42,13 +42,15 @@
</div>
# Changelog
- 2025/01/06 1.0.0 released. This is our first official release, where we have introduced a completely new API interface and enhanced compatibility through extensive refactoring:
- 2025/01/06 1.0.0 released. This is our first official release, where we have introduced a completely new API interface and enhanced compatibility through extensive refactoring, as well as a brand new automatic language identification feature:
- New API Interface
- For the data-side API, we have introduced the Dataset class, designed to provide a robust and flexible data processing framework. This framework currently supports a variety of document formats, including images (.jpg and .png), PDFs, Word documents (.doc and .docx), and PowerPoint presentations (.ppt and .pptx). It ensures effective support for data processing tasks ranging from simple to complex.
- For the user-side API, we have meticulously designed the MinerU processing workflow as a series of composable Stages. Each Stage represents a specific processing step, allowing users to define new Stages according to their needs and creatively combine these stages to customize their data processing workflows.
- Enhanced Compatibility
- By optimizing the dependency environment and configuration items, we ensure stable and efficient operation on ARM architecture Linux systems.
- We have deeply integrated with Huawei Ascend NPU acceleration, providing autonomous and controllable high-performance computing capabilities. This supports the localization and development of AI application platforms in China.
- We have deeply integrated with Huawei Ascend NPU acceleration, providing autonomous and controllable high-performance computing capabilities. This supports the localization and development of AI application platforms in China. [Ascend NPU Acceleration](docs/README_Ascend_NPU_Acceleration_zh_CN.md)
- Automatic Language Identification
- By introducing a new language recognition model, setting the `lang` configuration to `auto` during document parsing will automatically select the appropriate OCR language model, improving the accuracy of scanned document parsing.
- 2024/11/22 0.10.0 released. Introducing hybrid OCR text extraction capabilities:
- Significantly improved parsing performance in complex text distribution scenarios such as dense formulas, irregular span regions, and text represented by images.
- Combines the dual advantages of accurate content extraction and faster speed in text mode, and more precise span/line region recognition in OCR mode.
......
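The `lang` handling described in the changelog mirrors the dispatch later in this diff's `PymuDocDataset`: when the caller passes `auto`, the detected language wins; otherwise the explicit setting is used. A minimal sketch of that dispatch, with `detect_lang` as a hypothetical stand-in for the YOLOv11-based classifier:

```python
def resolve_lang(lang, detect_lang):
    """Pick the OCR language: auto-detect only when requested.

    `detect_lang` is an illustrative callable standing in for the
    YOLOv11 language classifier used by `auto_detect_lang`.
    """
    if lang == "auto":
        return detect_lang()
    return lang

print(resolve_lang("auto", lambda: "ch"))  # detected language is used
print(resolve_lang("en", lambda: "ch"))    # explicit setting wins
```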
......@@ -42,13 +42,15 @@
</div>
# Changelog
- 2025/01/06 1.0.0 released. This is our first official release; through extensive refactoring it brings a brand-new API interface and broader compatibility:
- 2025/01/06 1.0.0 released. This is our first official release; through extensive refactoring it brings a brand-new API interface, broader compatibility, and a brand-new automatic language identification feature:
- New API Interface
- For the data-side API, we introduced the Dataset class, designed to provide a powerful and flexible data processing framework. The framework currently supports a variety of document formats, including images (.jpg and .png), PDFs, Word documents (.doc and .docx), and PowerPoint presentations (.ppt and .pptx), ensuring effective support for data processing tasks from simple to complex.
- For the user-side API, we carefully designed the MinerU processing flow as a series of composable Stages. Each Stage represents a specific processing step; users can define new Stages to suit their needs and creatively combine them to customize their own data processing workflows.
- Broader Compatibility
- By optimizing the dependency environment and configuration items, we ensure stable and efficient operation on ARM-architecture Linux systems.
- Deep adaptation for Huawei Ascend NPU acceleration provides autonomous and controllable high-performance computing, supporting the localization and development of AI application platforms in China.
- Deep adaptation for Huawei Ascend NPU acceleration provides autonomous and controllable high-performance computing, supporting the localization and development of AI application platforms in China. [Ascend NPU Acceleration Tutorial](docs/README_Ascend_NPU_Acceleration_zh_CN.md)
- Automatic Language Identification
- With a new language recognition model, setting the `lang` configuration to `auto` during document parsing automatically selects the appropriate OCR language model, improving the accuracy of scanned document parsing.
- 2024/11/22 0.10.0 released. Introducing hybrid OCR text extraction capabilities:
- Significantly improved parsing in complex text layouts such as dense formulas, irregular span regions, and text rendered as images.
- Combines the dual advantages of accurate, faster content extraction in text mode and more precise span/line region recognition in OCR mode.
......
......@@ -51,6 +51,7 @@ magic-pdf --help
## Known Issues
- paddleocr uses an embedded ONNX model and only supports Chinese and English OCR, not other languages
- paddleocr uses an embedded ONNX model; only under the default language configuration can it recognize Chinese and English at high speed
- When a custom `lang` parameter is set, paddleocr slows down noticeably
- The layoutlmv3 layout model crashes intermittently; the doclayout_yolo model in the default configuration is recommended
- Table parsing is only adapted for the rapid_table model; other models may not work
\ No newline at end of file
......@@ -153,6 +153,7 @@ class PymuDocDataset(Dataset):
logger.info(f"lang: {lang}, detect_lang: {self._lang}")
else:
self._lang = lang
logger.info(f"lang: {lang}")
def __len__(self) -> int:
"""The page number of the pdf."""
return len(self._records)
......
......@@ -9,3 +9,4 @@ class AtomicModel:
MFR = "mfr"
OCR = "ocr"
Table = "table"
LangDetect = "langdetect"
......@@ -12,7 +12,6 @@ from magic_pdf.data.utils import load_images_from_pdf
from magic_pdf.libs.config_reader import get_local_models_dir, get_device
from magic_pdf.libs.pdf_check import extract_pages
from magic_pdf.model.model_list import AtomicModel
from magic_pdf.model.sub_modules.language_detection.yolov11.YOLOv11 import YOLOv11LangDetModel
from magic_pdf.model.sub_modules.model_init import AtomModelSingleton
......@@ -25,11 +24,11 @@ def get_model_config():
config_path = os.path.join(model_config_dir, 'model_configs.yaml')
with open(config_path, 'r', encoding='utf-8') as f:
configs = yaml.load(f, Loader=yaml.FullLoader)
return local_models_dir, device, configs
return root_dir, local_models_dir, device, configs
def get_text_images(simple_images):
local_models_dir, device, configs = get_model_config()
_, local_models_dir, device, configs = get_model_config()
atom_model_manager = AtomModelSingleton()
temp_layout_model = atom_model_manager.get_atom_model(
atom_model_name=AtomicModel.Layout,
......@@ -59,15 +58,25 @@ def get_text_images(simple_images):
def auto_detect_lang(pdf_bytes: bytes):
sample_docs = extract_pages(pdf_bytes)
sample_pdf_bytes = sample_docs.tobytes()
simple_images = load_images_from_pdf(sample_pdf_bytes, dpi=96)
simple_images = load_images_from_pdf(sample_pdf_bytes, dpi=200)
text_images = get_text_images(simple_images)
local_models_dir, device, configs = get_model_config()
# Use yolo11 for language classification
langdetect_model_weights = str(
os.path.join(
local_models_dir, configs['weights'][MODEL_NAME.YOLO_V11_LangDetect]
)
)
langdetect_model = YOLOv11LangDetModel(langdetect_model_weights, device)
langdetect_model = model_init(MODEL_NAME.YOLO_V11_LangDetect)
lang = langdetect_model.do_detect(text_images)
return lang
\ No newline at end of file
return lang
def model_init(model_name: str):
atom_model_manager = AtomModelSingleton()
if model_name == MODEL_NAME.YOLO_V11_LangDetect:
root_dir, _, device, _ = get_model_config()
model = atom_model_manager.get_atom_model(
atom_model_name=AtomicModel.LangDetect,
langdetect_model_name=MODEL_NAME.YOLO_V11_LangDetect,
langdetect_model_weight=str(os.path.join(root_dir, 'resources', 'yolov11-langdetect', 'yolo_v11_ft.pt')),
device=device,
)
else:
raise ValueError(f"model_name {model_name} not found")
return model
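`model_init` above delegates to `AtomModelSingleton.get_atom_model`, which caches initialized models so repeated calls reuse one instance instead of reloading weights. A simplified sketch of that caching pattern (class and method names here are illustrative, not the library's API):

```python
class ModelCache:
    """Cache initialized models by name so each is built only once."""
    _instance = None

    def __new__(cls):
        # Classic singleton: all callers share one cache instance.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._models = {}
        return cls._instance

    def get_model(self, name, init_fn):
        # Build the model lazily on first request, then reuse it.
        if name not in self._models:
            self._models[name] = init_fn()
        return self._models[name]

cache = ModelCache()
m1 = cache.get_model("langdetect", lambda: object())
m2 = cache.get_model("langdetect", lambda: object())
print(m1 is m2)  # True: the second call reuses the cached instance
```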
......@@ -2,6 +2,7 @@
from collections import Counter
from uuid import uuid4
import torch
from PIL import Image
from loguru import logger
from ultralytics import YOLO
......@@ -83,10 +84,14 @@ def resize_images_to_224(image):
class YOLOv11LangDetModel(object):
def __init__(self, weight, device):
self.model = YOLO(weight)
self.device = device
def __init__(self, langdetect_model_weight, device):
self.model = YOLO(langdetect_model_weight)
if str(device).startswith("npu"):
self.device = torch.device(device)
else:
self.device = device
def do_detect(self, images: list):
all_images = []
for image in images:
......@@ -99,7 +104,7 @@ class YOLOv11LangDetModel(object):
all_images.append(resize_images_to_224(temp_image))
images_lang_res = self.batch_predict(all_images, batch_size=8)
logger.info(f"images_lang_res: {images_lang_res}")
# logger.info(f"images_lang_res: {images_lang_res}")
if len(images_lang_res) > 0:
count_dict = Counter(images_lang_res)
language = max(count_dict, key=count_dict.get)
......@@ -107,7 +112,6 @@ class YOLOv11LangDetModel(object):
language = None
return language
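`do_detect` above settles on a single document language by majority vote over the per-image predictions, using `collections.Counter`. The core of that vote, as a standalone sketch:

```python
from collections import Counter

def majority_language(per_image_langs):
    """Return the most frequent per-image prediction, or None if empty."""
    if not per_image_langs:
        return None
    counts = Counter(per_image_langs)
    return max(counts, key=counts.get)

print(majority_language(["ch", "en", "ch", "ch"]))  # ch
print(majority_language([]))                        # None
```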
def predict(self, image):
results = self.model.predict(image, verbose=False, device=self.device)
predicted_class_id = int(results[0].probs.top1)
......@@ -117,6 +121,7 @@ class YOLOv11LangDetModel(object):
def batch_predict(self, images: list, batch_size: int) -> list:
images_lang_res = []
for index in range(0, len(images), batch_size):
lang_res = [
image_res.cpu()
......
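`batch_predict` above walks the image list in fixed-size strides (`range(0, len(images), batch_size)`) so inference runs on manageable chunks. The slicing pattern in isolation:

```python
def iter_batches(items, batch_size):
    """Yield successive fixed-size slices; the last batch may be smaller."""
    for index in range(0, len(items), batch_size):
        yield items[index:index + batch_size]

batches = list(iter_batches(list(range(10)), 8))
print([len(b) for b in batches])  # [8, 2]
```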
......@@ -2,8 +2,8 @@ import torch
from loguru import logger
from magic_pdf.config.constants import MODEL_NAME
from magic_pdf.libs.config_reader import get_device
from magic_pdf.model.model_list import AtomicModel
from magic_pdf.model.sub_modules.language_detection.yolov11.YOLOv11 import YOLOv11LangDetModel
from magic_pdf.model.sub_modules.layout.doclayout_yolo.DocLayoutYOLO import \
DocLayoutYOLOModel
from magic_pdf.model.sub_modules.layout.layoutlmv3.model_init import \
......@@ -63,6 +63,13 @@ def doclayout_yolo_model_init(weight, device='cpu'):
return model
def langdetect_model_init(langdetect_model_weight, device='cpu'):
if str(device).startswith("npu"):
device = torch.device(device)
model = YOLOv11LangDetModel(langdetect_model_weight, device)
return model
def ocr_model_init(show_log: bool = False,
det_db_box_thresh=0.3,
lang=None,
......@@ -130,6 +137,9 @@ def atom_model_init(model_name: str, **kwargs):
kwargs.get('doclayout_yolo_weights'),
kwargs.get('device')
)
else:
logger.error('layout model name not allow')
exit(1)
elif model_name == AtomicModel.MFD:
atom_model = mfd_model_init(
kwargs.get('mfd_weights'),
......@@ -155,6 +165,15 @@ def atom_model_init(model_name: str, **kwargs):
kwargs.get('device'),
kwargs.get('ocr_engine')
)
elif model_name == AtomicModel.LangDetect:
if kwargs.get('langdetect_model_name') == MODEL_NAME.YOLO_V11_LangDetect:
atom_model = langdetect_model_init(
kwargs.get('langdetect_model_weight'),
kwargs.get('device')
)
else:
logger.error('langdetect model name not allow')
exit(1)
else:
logger.error('model name not allow')
exit(1)
......
......@@ -21,7 +21,7 @@ class ModifiedPaddleOCR(PaddleOCR):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.lang = kwargs.get('lang', 'ch')
# Fall back to ONNX when running on an ARM CPU without CUDA support
if not torch.cuda.is_available() and platform.machine() in ['arm64', 'aarch64']:
self.use_onnx = True
......@@ -94,7 +94,7 @@ class ModifiedPaddleOCR(PaddleOCR):
ocr_res = []
for img in imgs:
img = preprocess_image(img)
if self.use_onnx:
if self.lang in ['ch'] and self.use_onnx:
dt_boxes, elapse = self.additional_ocr.text_detector(img)
else:
dt_boxes, elapse = self.text_detector(img)
......@@ -124,7 +124,7 @@ class ModifiedPaddleOCR(PaddleOCR):
img, cls_res_tmp, elapse = self.text_classifier(img)
if not rec:
cls_res.append(cls_res_tmp)
if self.use_onnx:
if self.lang in ['ch'] and self.use_onnx:
rec_res, elapse = self.additional_ocr.text_recognizer(img)
else:
rec_res, elapse = self.text_recognizer(img)
......@@ -142,7 +142,7 @@ class ModifiedPaddleOCR(PaddleOCR):
start = time.time()
ori_im = img.copy()
if self.use_onnx:
if self.lang in ['ch'] and self.use_onnx:
dt_boxes, elapse = self.additional_ocr.text_detector(img)
else:
dt_boxes, elapse = self.text_detector(img)
......@@ -183,7 +183,7 @@ class ModifiedPaddleOCR(PaddleOCR):
time_dict['cls'] = elapse
logger.debug("cls num : {}, elapsed : {}".format(
len(img_crop_list), elapse))
if self.use_onnx:
if self.lang in ['ch'] and self.use_onnx:
rec_res, elapse = self.additional_ocr.text_recognizer(img_crop_list)
else:
rec_res, elapse = self.text_recognizer(img_crop_list)
......
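The repeated `self.lang in ['ch'] and self.use_onnx` guard in this hunk routes detection and recognition to the embedded ONNX models only for the default Chinese/English configuration; any other `lang` falls back to the standard PaddleOCR path (hence the slowdown noted in the known issues). The decision isolated as a predicate (a sketch, not the class's real method):

```python
def use_embedded_onnx(lang, onnx_available):
    """Embedded ONNX models are used only for the default 'ch' config."""
    return lang in ["ch"] and onnx_available

print(use_embedded_onnx("ch", True))     # True: fast embedded path
print(use_embedded_onnx("japan", True))  # False: standard PaddleOCR path
print(use_embedded_onnx("ch", False))    # False: ONNX not available
```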
......@@ -8,17 +8,25 @@ from rapid_table import RapidTable
class RapidTableModel(object):
def __init__(self, ocr_engine):
self.table_model = RapidTable()
if ocr_engine is None:
self.ocr_model_name = "RapidOCR"
if torch.cuda.is_available():
from rapidocr_paddle import RapidOCR
self.ocr_engine = RapidOCR(det_use_cuda=True, cls_use_cuda=True, rec_use_cuda=True)
else:
from rapidocr_onnxruntime import RapidOCR
self.ocr_engine = RapidOCR()
# if ocr_engine is None:
# self.ocr_model_name = "RapidOCR"
# if torch.cuda.is_available():
# from rapidocr_paddle import RapidOCR
# self.ocr_engine = RapidOCR(det_use_cuda=True, cls_use_cuda=True, rec_use_cuda=True)
# else:
# from rapidocr_onnxruntime import RapidOCR
# self.ocr_engine = RapidOCR()
# else:
# self.ocr_model_name = "PaddleOCR"
# self.ocr_engine = ocr_engine
self.ocr_model_name = "RapidOCR"
if torch.cuda.is_available():
from rapidocr_paddle import RapidOCR
self.ocr_engine = RapidOCR(det_use_cuda=True, cls_use_cuda=True, rec_use_cuda=True)
else:
self.ocr_model_name = "PaddleOCR"
self.ocr_engine = ocr_engine
from rapidocr_onnxruntime import RapidOCR
self.ocr_engine = RapidOCR()
def predict(self, image):
......
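After this change, `RapidTableModel` always uses RapidOCR and picks its backend by GPU availability: `rapidocr_paddle` with CUDA, `rapidocr_onnxruntime` otherwise. The selection logic as a pure function (the boolean parameter stands in for `torch.cuda.is_available()`):

```python
def pick_rapidocr_backend(cuda_available):
    """Choose the RapidOCR backend module name by GPU availability."""
    if cuda_available:
        return "rapidocr_paddle"   # CUDA-accelerated det/cls/rec
    return "rapidocr_onnxruntime"  # CPU fallback

print(pick_rapidocr_backend(True))   # rapidocr_paddle
print(pick_rapidocr_backend(False))  # rapidocr_onnxruntime
```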
......@@ -373,6 +373,8 @@ def cal_block_index(fix_blocks, sorted_bboxes):
# Sort with the XY-cut algorithm
block_bboxes = []
for block in fix_blocks:
# Clamp any negative value in block['bbox'] to 0
block['bbox'] = [max(0, x) for x in block['bbox']]
block_bboxes.append(block['bbox'])
# Remove the virtual line info from figure/table body blocks and backfill it with real_lines
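The clamp added above guards the XY-cut sort against negative coordinates, which can appear when a detected box spills past the page edge. As a one-liner it is just:

```python
def clamp_bbox(bbox):
    """Clamp negative coordinates to 0 before XY-cut sorting."""
    return [max(0, coord) for coord in bbox]

print(clamp_bbox([-3, 10, 250, -1]))  # [0, 10, 250, 0]
```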
......@@ -766,6 +768,11 @@ def parse_page_core(
"""重排block"""
sorted_blocks = sorted(fix_blocks, key=lambda b: b['index'])
"""block内重排(img和table的block内多个caption或footnote的排序)"""
for block in sorted_blocks:
if block['type'] in [BlockType.Image, BlockType.Table]:
block['blocks'] = sorted(block['blocks'], key=lambda b: b['index'])
"""获取QA需要外置的list"""
images, tables, interline_equations = get_qa_need_list_v2(sorted_blocks)
......
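The hunk above adds a second, nested sort: after ordering top-level blocks by `index`, the children of image and table blocks (captions, footnotes, bodies) are sorted by their own `index` too. A self-contained sketch of the two-level sort (string type values here are illustrative; the source uses `BlockType.Image` / `BlockType.Table`):

```python
def sort_blocks(blocks, nested_types=("image", "table")):
    """Sort top-level blocks by index, then sort children of nested types."""
    ordered = sorted(blocks, key=lambda b: b["index"])
    for block in ordered:
        if block["type"] in nested_types:
            block["blocks"] = sorted(block["blocks"], key=lambda b: b["index"])
    return ordered

blocks = [
    {"type": "table", "index": 2,
     "blocks": [{"index": 5, "type": "table_footnote"},
                {"index": 4, "type": "table_body"}]},
    {"type": "text", "index": 1, "blocks": []},
]
ordered = sort_blocks(blocks)
print([b["index"] for b in ordered])               # [1, 2]
print([b["index"] for b in ordered[1]["blocks"]])  # [4, 5]
```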
......@@ -5,5 +5,4 @@ weights:
unimernet_small: MFR/unimernet_small
struct_eqtable: TabRec/StructEqTable
tablemaster: TabRec/TableMaster
rapid_table: TabRec/RapidTable
yolo_v11n_langdetect: LangDetect/YOLO/yolo_v11_cls_ft.pt
\ No newline at end of file
rapid_table: TabRec/RapidTable
\ No newline at end of file
......@@ -9,7 +9,4 @@ Want to learn about the usage methods under different scenarios ? This page give
quick_start/convert_pdf
quick_start/convert_image
quick_start/convert_ppt
quick_start/convert_pptx
quick_start/convert_doc
quick_start/convert_docx
quick_start/convert_ms_office
Convert DocX
=============
.. admonition:: Warning
:class: tip
When processing MS-Office files, we first use third-party software to convert the MS-Office files to PDF.
For certain MS-Office files, the quality of the converted PDF files may not be very high, which can affect the quality of the final output.
Command Line
^^^^^^^^^^^^^
.. code:: shell
# make sure the file has the correct suffix
magic-pdf -p a.docx -o output -m auto
API
^^^^^
.. code:: python
import os
from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
from magic_pdf.data.read_api import read_local_office
# prepare env
local_image_dir, local_md_dir = "output/images", "output"
image_dir = str(os.path.basename(local_image_dir))
os.makedirs(local_image_dir, exist_ok=True)
image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
local_md_dir
)
# proc
## Create Dataset Instance
input_file = "some_docx.docx" # replace with real ms-office file
input_file_name = input_file.split(".")[0]
ds = read_local_office(input_file)[0]
# ocr mode
ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
md_writer, f"{input_file_name}.md", image_dir
)
# txt mode
ds.apply(doc_analyze, ocr=True).pipe_txt_mode(image_writer).dump_md(
md_writer, f"{input_file_name}.md", image_dir
)
......@@ -45,8 +45,3 @@ API
ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
md_writer, f"{input_file_name}.md", image_dir
)
# txt mode
ds.apply(doc_analyze, ocr=True).pipe_txt_mode(image_writer).dump_md(
md_writer, f"{input_file_name}.md", image_dir
)
......@@ -17,7 +17,7 @@ Command Line
.. code:: shell
# make sure the file has the correct suffix
# replace with a real MS-Office file; we currently support MS-DOC, MS-DOCX, MS-PPT, and MS-PPTX
magic-pdf -p a.doc -o output -m auto
......@@ -30,6 +30,8 @@ API
from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
from magic_pdf.data.read_api import read_local_office
from magic_pdf.config.enums import SupportedPdfParseMethod
# prepare env
local_image_dir, local_md_dir = "output/images", "output"
......@@ -43,17 +45,16 @@ API
# proc
## Create Dataset Instance
input_file = "some_doc.doc" # replace with real ms-office file
input_file = "some_doc.doc" # replace with real ms-office file, we support MS-DOC, MS-DOCX, MS-PPT, MS-PPTX now
input_file_name = input_file.split(".")[0]
ds = read_local_office(input_file)[0]
# ocr mode
ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
md_writer, f"{input_file_name}.md", image_dir
)
# txt mode
ds.apply(doc_analyze, ocr=True).pipe_txt_mode(image_writer).dump_md(
md_writer, f"{input_file_name}.md", image_dir
)
## inference
if ds.classify() == SupportedPdfParseMethod.OCR:
ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
md_writer, f"{input_file_name}.md", image_dir)
else:
ds.apply(doc_analyze, ocr=False).pipe_txt_mode(image_writer).dump_md(
md_writer, f"{input_file_name}.md", image_dir)
......@@ -44,12 +44,13 @@ API
## Create Dataset Instance
ds = PymuDocDataset(pdf_bytes)
# ocr mode
ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
## inference
if ds.classify() == SupportedPdfParseMethod.OCR:
ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
md_writer, f"{name_without_suff}.md", image_dir
)
# txt mode
ds.apply(doc_analyze, ocr=True).pipe_txt_mode(image_writer).dump_md(
else:
ds.apply(doc_analyze, ocr=False).pipe_txt_mode(image_writer).dump_md(
md_writer, f"{name_without_suff}.md", image_dir
)
Convert PPT
============
.. admonition:: Warning
:class: tip
When processing MS-Office files, we first use third-party software to convert the MS-Office files to PDF.
For certain MS-Office files, the quality of the converted PDF files may not be very high, which can affect the quality of the final output.
Command Line
^^^^^^^^^^^^^
.. code:: shell
# make sure the file has the correct suffix
magic-pdf -p a.ppt -o output -m auto
API
^^^^^
.. code:: python
import os
from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
from magic_pdf.data.read_api import read_local_office
# prepare env
local_image_dir, local_md_dir = "output/images", "output"
image_dir = str(os.path.basename(local_image_dir))
os.makedirs(local_image_dir, exist_ok=True)
image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
local_md_dir
)
# proc
## Create Dataset Instance
input_file = "some_ppt.ppt" # replace with real ms-office file
input_file_name = input_file.split(".")[0]
ds = read_local_office(input_file)[0]
# ocr mode
ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
md_writer, f"{input_file_name}.md", image_dir
)
# txt mode
ds.apply(doc_analyze, ocr=True).pipe_txt_mode(image_writer).dump_md(
md_writer, f"{input_file_name}.md", image_dir
)
Convert PPTX
=================
.. admonition:: Warning
:class: tip
When processing MS-Office files, we first use third-party software to convert the MS-Office files to PDF.
For certain MS-Office files, the quality of the converted PDF files may not be very high, which can affect the quality of the final output.
Command Line
^^^^^^^^^^^^^
.. code:: shell
# make sure the file has the correct suffix
magic-pdf -p a.pptx -o output -m auto
API
^^^^^^
.. code:: python
import os
from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
from magic_pdf.data.read_api import read_local_office
# prepare env
local_image_dir, local_md_dir = "output/images", "output"
image_dir = str(os.path.basename(local_image_dir))
os.makedirs(local_image_dir, exist_ok=True)
image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
local_md_dir
)
# proc
## Create Dataset Instance
input_file = "some_pptx.pptx" # replace with real ms-office file
input_file_name = input_file.split(".")[0]
ds = read_local_office(input_file)[0]
# ocr mode
ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
md_writer, f"{input_file_name}.md", image_dir
)
# txt mode
ds.apply(doc_analyze, ocr=True).pipe_txt_mode(image_writer).dump_md(
md_writer, f"{input_file_name}.md", image_dir
)