Place the input images in the `./images` directory; the results will be stored in the `./outputs` directory.

Follow https://github.com/salesforce/LAVIS, https://github.com/facebookresearch/segment-anything, https://github.com/JialianW/GRiT, and https://github.com/PaddlePaddle/PaddleOCR.git to prepare the environment. Download the GRiT weight (Dense Captioning on VG Dataset) and place it under `./grit/model_weight`. Download the SAM weight and place it under `./model_weight`.

Generation Steps:

1. Generate a global description for each image (see the BLIP-2 sketch after the list).

   ```
   python blip2.py
   ```

2. Use the GRiT model to generate dense captions for each image.

   ```
   python grit.py
   ```

3. Generate segmentation masks for each image with the SAM model and save them in the `./masks` directory (see the SAM sketch after the list).

   ```
   python amg.py --checkpoint ./model_weight/ --model-type --input ./images --output ./data_gen/masks --convert-to-rle
   ```

4. Generate a description for each segmented region (see the region-captioning sketch after the list).

   ```
   python sam_blip.py
   ```

5. Compute similarity scores for the region captions (see the matching sketch after the list).

   ```
   python image_text_matching.py --ann_path ./outputs/sam_blip2.json --output_path ./outputs/sam_blip2_score.json
   ```

6. Compute similarity scores for the dense captions.

   ```
   python image_text_matching.py --ann_path ./outputs/grit.json --output_path ./outputs/grit_score.json
   ```

7. Use PPOCR to detect text in the images (see the OCR sketch after the list).

   ```
   python ocr_ppocr.py
   ```

8. Integrate the generated annotations into `ann_all.json`.

   ```
   python add_all_json.py
   ```

9. Use the ChatGPT API to generate the final detailed description and save it in `./outputs/ann_all.json` (see the API sketch after the list).

   ```
   python chatgpt.py
   ```
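For orientation, step 1 could look roughly like the sketch below, using the LAVIS captioning API. The BLIP-2 variant (`blip2_opt` / `caption_coco_opt2.7b`) and the output file `./outputs/blip2.json` are assumptions; the repo's `blip2.py` may differ.

```python
# Minimal sketch of step 1: global image captioning with LAVIS BLIP-2.
# Model variant, file layout, and output format are assumptions, not the repo's exact script.
import json
import os

import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load a BLIP-2 captioning model from LAVIS (the variant is an assumption).
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="caption_coco_opt2.7b", is_eval=True, device=device
)

captions = {}
for fname in sorted(os.listdir("./images")):
    image = Image.open(os.path.join("./images", fname)).convert("RGB")
    image = vis_processors["eval"](image).unsqueeze(0).to(device)
    # generate() returns a list of caption strings; keep the first one.
    captions[fname] = model.generate({"image": image})[0]

os.makedirs("./outputs", exist_ok=True)
with open("./outputs/blip2.json", "w") as f:
    json.dump(captions, f, indent=2)
```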
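Step 3 uses the `amg.py` script shipped with segment-anything; under the hood it runs `SamAutomaticMaskGenerator`, roughly as sketched below. The checkpoint file name and model type (`vit_h`) are assumptions here — pass the values matching the weight you downloaded.

```python
# Minimal sketch of what SAM automatic mask generation does in step 3.
# Checkpoint file name and model type ("vit_h") are assumptions.
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="./model_weight/sam_vit_h_4b8939.pth")
sam.to("cuda")

# output_mode="coco_rle" corresponds to the --convert-to-rle flag of amg.py.
mask_generator = SamAutomaticMaskGenerator(sam, output_mode="coco_rle")

image = cv2.imread("./images/example.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
# Each entry has "segmentation" (RLE), "bbox" (XYWH), "area", and quality scores.
masks = mask_generator.generate(image)
```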
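Step 4 presumably crops each SAM region and captions it with BLIP-2. A minimal sketch follows; the mask file layout, cropping by bounding box, the image naming convention, and the output file `./outputs/sam_blip2.json` (the file step 5 reads) are all assumptions.

```python
# Minimal sketch of step 4: caption each SAM region with BLIP-2.
# File layout, cropping strategy, and output format are assumptions.
import json
import os

import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="caption_coco_opt2.7b", is_eval=True, device=device
)

results = {}
for mask_file in sorted(os.listdir("./data_gen/masks")):
    if not mask_file.endswith(".json"):
        continue
    image_name = os.path.splitext(mask_file)[0] + ".jpg"  # naming convention is an assumption
    image = Image.open(os.path.join("./images", image_name)).convert("RGB")
    with open(os.path.join("./data_gen/masks", mask_file)) as f:
        regions = json.load(f)

    captions = []
    for region in regions:
        x, y, w, h = region["bbox"]  # XYWH box of the mask, as written by amg.py
        crop = vis_processors["eval"](image.crop((x, y, x + w, y + h))).unsqueeze(0).to(device)
        captions.append({"bbox": region["bbox"], "caption": model.generate({"image": crop})[0]})
    results[image_name] = captions

with open("./outputs/sam_blip2.json", "w") as f:
    json.dump(results, f, indent=2)
```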
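Steps 5 and 6 score how well each caption matches its image. A minimal sketch using the BLIP image-text matching head from LAVIS is shown below; `image_text_matching.py` may use a different matching model or score definition.

```python
# Minimal sketch of steps 5-6: score a caption against its image.
# The matching model (BLIP ITM from LAVIS) and the score definition are assumptions.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, text_processors = load_model_and_preprocess(
    name="blip_image_text_matching", model_type="large", is_eval=True, device=device
)

def matching_score(image_path: str, caption: str) -> float:
    """Probability that the caption matches the image, from the ITM head."""
    image = Image.open(image_path).convert("RGB")
    img = vis_processors["eval"](image).unsqueeze(0).to(device)
    txt = text_processors["eval"](caption)
    itm_logits = model({"image": img, "text_input": txt}, match_head="itm")
    return torch.softmax(itm_logits, dim=1)[:, 1].item()

print(matching_score("./images/example.jpg", "a dog lying on the grass"))
```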
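Step 7's OCR can be reproduced with the PaddleOCR Python API, roughly as below; the language setting and output handling are assumptions, and `ocr_ppocr.py` may store results differently.

```python
# Minimal sketch of step 7: text detection and recognition with PaddleOCR 2.x.
# Language setting and output handling are assumptions.
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="en")  # downloads detector/recognizer weights on first run
result = ocr.ocr("./images/example.jpg", cls=True)

# For a single image, result[0] is a list of [box, (text, confidence)] entries.
for box, (text, confidence) in result[0]:
    print(text, confidence)
```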
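Step 9 sends the merged annotations to the ChatGPT API. A minimal sketch with the `openai` (>= 1.0) Python client is below; the model name, prompt template, and the structure of `ann_all.json` are assumptions.

```python
# Minimal sketch of step 9: summarize merged annotations into a detailed description.
# Model name, prompt template, and the JSON structure are assumptions.
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("./outputs/ann_all.json") as f:
    ann_all = json.load(f)  # assumed to be a dict keyed by image name

for image_name, ann in ann_all.items():
    prompt = (
        "Write a detailed description of the image based on these annotations:\n"
        + json.dumps(ann)
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    ann["detailed_description"] = response.choices[0].message.content

with open("./outputs/ann_all.json", "w") as f:
    json.dump(ann_all, f, indent=2)
```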