is_not_end_with_ending_puncs# not end with ending punctuation marks
andis_not_only_no_meaning_symbols# not only have no meaning symbols
andis_title_by_len# is a title by length, default max length is 200
andnotis_equation# an interline equation should never be a title
andis_potential_title_font# is a potential title font, which is bold or larger than the document average font size or not the same font type as the document average font type
This function processes the paragraphs, including:
1. Read raw input json file into pdf_dic
2. Detect and replace equations
3. Combine spans into a natural line
4. Check if the paragraphs are inside bboxes passed from "layout_bboxes" key
5. Compute statistics for each block
6. Detect titles in the document
7. Detect paragraphs inside each block
8. Divide the level of the titles
9. Detect and combine paragraphs from different blocks into one paragraph
10. Check whether the final results after checking headings, dividing paragraphs within blocks, and merging paragraphs between blocks are plausible and reasonable.
11. Draw annotations on the pdf file
Parameters
----------
pdf_dic_json_fpath : str
path to the pdf dictionary json file.
Notice: data noises, including overlap blocks, header, footer, watermark, vertical margin note have been removed already.
input_pdf_doc : str
path to the input pdf file
output_pdf_path : str
path to the output pdf file
Returns
-------
pdf_dict : dict
result dictionary
"""
error_info=None
output_json_file=""
output_dir=""
ifinput_pdf_pathisnotNone:
input_pdf_path=os.path.abspath(input_pdf_path)
# print_green_on_red(f">>>>>>>>>>>>>>>>>>> Process the paragraphs of {input_pdf_path}")