{"transcriptions":["There is no clear relationship between the barking and the music, as they seem to be independent of each other.","(B) To indicate that language cannot express clearly, satirizing the inversion of black and white in the world"],"token_ids":[[3862,374,902,2797,5025,1948,279,293,33452,323,279,4627,11,438,807,2803,311,387,9489,315,1817,1008,13,151645],[5349,8,2014,13216,429,4128,4157,3158,9355,11,7578,404,4849,279,46488,315,3691,323,4158,304,279,1879,151645,151671]]}
{"transcriptions":["The content of the input audio is 'you can ask why over and over and over again forever even if one day we explain every physical interaction and scientific law and hope and dream and regret with a single elegant equation'."],"token_ids":[[785,2213,315,279,1946,7699,374,364,9330,646,2548,3170,916,323,916,323,916,1549,15683,1496,421,825,1899,582,10339,1449,6961,16230,323,12344,2329,323,3900,323,7904,323,22231,448,264,3175,25777,23606,4427,151645]]}
DOTS_OCR_PROMPT="""Please output the layout information from the PDF image, including each layout element's bbox, its category, and the corresponding text content within the bbox.
1. Bbox format: [x1, y1, x2, y2]
2. Layout Categories: The possible categories are ['Caption', 'Footnote', 'Formula', 'List-item', 'Page-footer', 'Page-header', 'Picture', 'Section-header', 'Table', 'Text', 'Title'].
3. Text Extraction & Formatting Rules:
- Picture: For the 'Picture' category, the text field should be omitted.
- Formula: Format its text as LaTeX.
- Table: Format its text as HTML.
- All Others (Text, Title, etc.): Format their text as Markdown.
4. Constraints:
- The output text must be the original text from the image, with no translation.
- All layout elements must be sorted according to human reading order.
5. Final Output: The entire output must be a single JSON object.