README.md

# Vision-Language Web Demo

A chatbot demo with image input.

## Supported Models

- [InternLM/InternLM-XComposer](https://github.com/InternLM/InternLM-XComposer/tree/main)
- [Qwen/Qwen-VL-Chat](https://huggingface.co/Qwen/Qwen-VL-Chat)

## Quick Start

### internlm/internlm-xcomposer-7b

- extract llm model from huggingface model
  ```python
  python extract_xcomposer_llm.py
  # the llm part will saved to internlm_model folder.
  ```
- lanuch the demo
  ```python
  python app.py --model-name internlm-xcomposer-7b --llm-ckpt internlm_model
  ```

### Qwen-VL-Chat

- lanuch the dmeo
  ```python
  python app.py --model-name qwen-vl-chat --hf-ckpt Qwen/Qwen-VL-Chat
  ```

## Limitations

- this demo uses the code in their repo to extract image features that might not very efficiency.
- this demo only contains the chat function. If you want to use localization ability in Qwen-VL-Chat or article generation function in InternLM-XComposer, you need implement these pre/post processes. The difference compared to chat is how to build prompts and use the output of model.