Introduction of InternVL-Chat-V1-1

We released 🤗 InternVL-Chat-V1-1, featuring a structure similar to LLaVA, including a ViT, an MLP projector, and an LLM. As shown in the figure below, we connected our InternViT-6B to LLaMA2-13B through a simple MLP projector. Note that the LLaMA2-13B used here is not the original model but an internal chat version obtained by incrementally pre-training and fine-tuning the LLaMA2-13B base model for Chinese language tasks. Overall, our model has a total of 19 billion parameters.

In this version, we explored increasing the resolution to 448 × 448, enhancing OCR capabilities, and improving support for Chinese conversations. Since the 448 × 448 input image generates 1024 visual tokens after passing through the ViT, leading to a significant computational burden, we use a pixel shuffle (unshuffle) operation to reduce the 1024 tokens to 256 tokens.
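The token reduction described above can be illustrated with a minimal sketch. It assumes a ViT patch size of 14 (which matches the 1024 tokens quoted for a 448 × 448 input) and a 2 × 2 merge factor; the function names and the channel ordering inside each merged token are hypothetical and may differ from the released implementation.

```python
def num_visual_tokens(image_size, patch_size=14):
    """Number of patch tokens for a square image (patch_size=14 is an assumption)."""
    return (image_size // patch_size) ** 2


def pixel_unshuffle_tokens(tokens, grid, factor=2):
    """Merge each factor x factor block of neighbouring tokens into one token.

    tokens: list of grid * grid feature vectors (lists of floats), row-major.
    Returns (grid // factor) ** 2 vectors, each factor ** 2 times longer,
    so spatial resolution is traded for channel depth and no feature is dropped.
    """
    if grid % factor != 0:
        raise ValueError("grid must be divisible by factor")
    if len(tokens) != grid * grid:
        raise ValueError("expected grid * grid tokens")

    out_grid = grid // factor
    merged = []
    for row in range(out_grid):
        for col in range(out_grid):
            vec = []
            # Concatenate the features of the factor x factor neighbourhood.
            for dr in range(factor):
                for dc in range(factor):
                    src = (row * factor + dr) * grid + (col * factor + dc)
                    vec.extend(tokens[src])
            merged.append(vec)
    return merged


# Toy example: one-dimensional features so the grouping is easy to inspect.
n_tokens = num_visual_tokens(448)              # 32 x 32 grid of patches
toy_tokens = [[float(i)] for i in range(n_tokens)]
merged_tokens = pixel_unshuffle_tokens(toy_tokens, grid=32)
print(n_tokens, len(merged_tokens), len(merged_tokens[0]))
```

Because the merge concatenates features rather than averaging them, the sequence the LLM sees is four times shorter while each token becomes four times wider; the MLP projector then maps these wider tokens into the LLM embedding space.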

For more detailed information about this model, please read our blog.

Performance

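| model                  | LLaVA-1.5    | InternVL-Chat-V1-0 | InternVL-Chat-V1-0 | InternVL-Chat-V1-1 |
|------------------------|--------------|--------------------|--------------------|--------------------|
| resolution             | 336          | 336                | 448                | 448                |
| vision encoder         | CLIP-L-336px | InternViT-6B-224px | InternViT-6B-448px | InternViT-6B-448px |
| language model         | Vicuna-13B   | Vicuna-13B         | Vicuna-13B         | LLaMA2-13B         |
| VQAv2 (test-dev)       | 80.0         | 80.2               | 82.0               | 80.9               |
| GQA (test-dev)         | 63.3         | 63.9               | 64.1               | 62.5               |
| VizWiz (test)          | 53.6         | 54.6               | 60.1               | 57.3               |
| SQA (test)             | 71.6         | 70.1               | 71.6               | 90.1               |
| TextVQA (val, w/o OCR) | -            | -                  | -                  | 64.2               |
| TextVQA (val, w/ OCR)  | 61.3         | 58.7               | 64.8               | 68.6               |
| POPE                   | 85.9         | 87.1               | 87.2               | 87.1               |
| MME (perception)       | 1531.3       | 1546.9             | 1579.0             | 1659.8             |
| MMB-EN (test)          | 67.7         | 66.5               | 68.2               | 75.4               |
| MMB-CN (test)          | 63.6         | 61.9               | 64.0               | 70.3               |
| MMVet (GPT-4-0613)     | 35.4         | 33.7               | 36.7               | 46.7               |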

  • Note that we use the official evaluation server to test the MMVet scores, with GPT-4-0613 serving as the judge model. Using different versions of GPT-4 as the judge can result in significant score variations.

Here, we have conducted only a simple performance comparison. For more detailed performance information and additional evaluation metrics, please refer to our performance summary table.

Citation

If you find this project useful in your research, please consider citing:

@inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}