Introduction of InternVL-Chat-V1-1
We released 🤗 InternVL-Chat-V1-1, featuring a structure similar to LLaVA, including a ViT, an MLP projector, and an LLM. As shown in the figure below, we connected our InternViT-6B to LLaMA2-13B through a simple MLP projector. Note that the LLaMA2-13B used here is not the original model but an internal chat version obtained by incrementally pre-training and fine-tuning the LLaMA2-13B base model for Chinese language tasks. Overall, our model has a total of 19 billion parameters.
In this version, we explored increasing the resolution to 448 × 448, enhancing OCR capabilities, and improving support for Chinese conversations. Since the 448 × 448 input image generates 1024 visual tokens after passing through the ViT, leading to a significant computational burden, we use a pixel shuffle (unshuffle) operation to reduce the 1024 tokens to 256 tokens.
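The token arithmetic above can be illustrated with a minimal sketch. It assumes a ViT patch size of 14 (inferred from 448 × 448 yielding 1024 tokens, i.e. a 32 × 32 grid) and a downsampling factor of 2; the function names `num_visual_tokens` and `pixel_unshuffle` are hypothetical, and the channel ordering here is illustrative, not necessarily the one used in the released model.

```python
def num_visual_tokens(image_size, patch_size=14, downsample=1):
    """Number of visual tokens for a square image.

    patch_size=14 is an assumption inferred from the 1024-token figure.
    """
    grid = image_size // patch_size  # tokens per side after patchification
    if grid % downsample != 0:
        raise ValueError("grid side must be divisible by the downsample factor")
    return (grid // downsample) ** 2


def pixel_unshuffle(grid, r=2):
    """Space-to-depth on an H x W grid of feature vectors (lists of floats).

    Each r x r neighbourhood of tokens is merged into one token whose
    feature vector is the concatenation of the r*r originals, so the token
    count drops by r*r while the channel count grows by r*r.
    """
    h, w = len(grid), len(grid[0])
    if h % r or w % r:
        raise ValueError("grid dimensions must be divisible by r")
    out = []
    for i in range(0, h, r):
        row = []
        for j in range(0, w, r):
            merged = []
            for di in range(r):
                for dj in range(r):
                    merged.extend(grid[i + di][j + dj])
            row.append(merged)
        out.append(row)
    return out


# A 32 x 32 grid of toy 4-dimensional features (real features are far wider).
tokens = [[[0.0] * 4 for _ in range(32)] for _ in range(32)]
reduced = pixel_unshuffle(tokens, r=2)
```

With these assumptions, `num_visual_tokens(448)` gives 1024 and `num_visual_tokens(448, downsample=2)` gives 256, while `reduced` is a 16 × 16 grid whose vectors are four times wider. No information is discarded by the operation itself; the sequence the LLM has to process is simply a quarter as long.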
For more detailed information about this model, please read our blog.
Performance
| model | LLaVA-1.5 | InternVL-Chat-V1-0 | InternVL-Chat-V1-0 | InternVL-Chat-V1-1 |
|---|---|---|---|---|
| resolution | 336 | 336 | 448 | 448 |
| vision encoder | CLIP-L-336px | InternViT-6B-224px | InternViT-6B-448px | InternViT-6B-448px |
| language model | Vicuna-13B | Vicuna-13B | Vicuna-13B | LLaMA2-13B |
| VQAv2 (testdev) | 80.0 | 80.2 | 82.0 | 80.9 |
| GQA (testdev) | 63.3 | 63.9 | 64.1 | 62.5 |
| VizWiz (test) | 53.6 | 54.6 | 60.1 | 57.3 |
| SQA (test) | 71.6 | 70.1 | 71.6 | 90.1 |
| TextVQA (val, w/o OCR) | - | - | - | 64.2 |
| TextVQA (val, w/ OCR) | 61.3 | 58.7 | 64.8 | 68.6 |
| POPE | 85.9 | 87.1 | 87.2 | 87.1 |
| MME (perception) | 1531.3 | 1546.9 | 1579.0 | 1659.8 |
| MMB-EN (test) | 67.7 | 66.5 | 68.2 | 75.4 |
| MMB-CN (test) | 63.6 | 61.9 | 64.0 | 70.3 |
| MMVet (GPT-4-0613) | 35.4 | 33.7 | 36.7 | 46.7 |
Note that we use the official evaluation server to test the MMVet scores, with GPT-4-0613 serving as the judge model. Using different versions of GPT-4 as the judge can result in significant score variations.
Here, we have conducted only a simple performance comparison. For more detailed performance information and additional evaluation metrics, please refer to our performance summary table.
Citation
If you find this project useful in your research, please consider citing:
@inproceedings{chen2024internvl,
title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={24185--24198},
year={2024}
}