"git@developer.sourcefind.cn:OpenDAS/ollama.git" did not exist on "f93ffb969537aec4e48ab5b61af60850e618bb4d"
Commit 7c5cdcd4 authored by myhloli

refactor(pdf_parse): improve character spacing handling in PDF text extraction

- Update the logic for inserting spaces between characters
- Consider the next character's position instead of the previous one
- Adjust the spacing threshold to 25% of the average character width
- Ignore spaces at the end of lines to prevent double spaces
parent 88b909e2
@@ -92,9 +92,12 @@ def chars_to_content(span):
     content = ''
     for char in span['chars']:
         # If the gap between the next char's x0 and the previous char's x1 exceeds one character width, insert a space between them
-        if char['bbox'][0] - span['chars'][span['chars'].index(char) - 1]['bbox'][2] > char_avg_width:
-            content += ' '
-        content += char['c']
+        char1 = char
+        char2 = span['chars'][span['chars'].index(char) + 1] if span['chars'].index(char) + 1 < len(span['chars']) else None
+        if char2 and char2['bbox'][0] - char1['bbox'][2] > char_avg_width * 0.25 and char['c'] != ' ' and char2['c'] != ' ':
+            content += f"{char['c']} "
+        else:
+            content += char['c']
     content = __replace_ligatures(content)
     span['content'] = __replace_0xfffd(content)
......
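
For illustration only, the spacing rule introduced by this commit can be sketched as a standalone function. This is not the code in the commit: `join_chars_with_spaces` and `gap_ratio` are hypothetical names, and the way `char_avg_width` is computed here (mean of each character's bbox width) is an assumption, since that part of the file is not shown in the hunk. The sketch walks the list with `enumerate` rather than `list.index`, which returns the first equal element and so could pick the wrong neighbour if two character dicts compare equal.

```python
def join_chars_with_spaces(chars, gap_ratio=0.25):
    """Hypothetical helper: join characters, adding a space after a character
    when the gap to the next character exceeds gap_ratio * average width."""
    if not chars:
        return ''
    # Assumption: average width is the mean of (x1 - x0) over all characters.
    char_avg_width = sum(c['bbox'][2] - c['bbox'][0] for c in chars) / len(chars)
    parts = []
    for i, char in enumerate(chars):
        nxt = chars[i + 1] if i + 1 < len(chars) else None
        parts.append(char['c'])
        # Skip the extra space when either side is already a space character.
        if (nxt is not None
                and nxt['bbox'][0] - char['bbox'][2] > char_avg_width * gap_ratio
                and char['c'] != ' ' and nxt['c'] != ' '):
            parts.append(' ')
    return ''.join(parts)


# Each bbox is (x0, y0, x1, y1); the average width here is 10, so the threshold is 2.5.
sample = [
    {'c': 'a', 'bbox': (0, 0, 10, 10)},
    {'c': 'b', 'bbox': (11, 0, 21, 10)},  # gap of 1 after 'a': no space
    {'c': 'c', 'bbox': (30, 0, 40, 10)},  # gap of 9 after 'b': space inserted
]
print(join_chars_with_spaces(sample))  # prints "ab c"
```

Collecting the pieces in a list and joining once also avoids repeated string concatenation inside the loop.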