Introduction of InternVL-Chat-V1-2

We are excited to introduce 🤗 InternVL-Chat-V1-2. Inspired by LLaVA-NeXT-34B, we have also adopted Nous-Hermes-2-Yi-34B as the language model. Below is the pipeline.

From the experimental results, we’ve observed that a stronger language model (34B) can better leverage the powerful capabilities of our vision foundation model.

For better training reproducibility, we follow a minimalist, data-efficient design similar to LLaVA-NeXT. To reduce training costs, we provide a pre-trained MLP projector and use only about 1.2 million visual instruction tuning samples for SFT. Our model has a total of 40 billion parameters and can be trained within 1.5 days using 32 A100 GPUs. The code, data, and model have been made publicly available.
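To put the stated training budget in perspective, the figures above can be turned into a rough GPU-hour estimate. This is only a back-of-the-envelope sketch: it treats 1.5 days as the full wall-clock time and assumes a single pass over the 1.2M SFT samples, neither of which is stated in the text. The variable names are illustrative.

```python
# Figures quoted in the paragraph above.
NUM_GPUS = 32              # A100 GPUs
TRAINING_DAYS = 1.5        # "within 1.5 days", treated here as an upper bound
SFT_SAMPLES = 1_200_000    # "around 1.2 million" SFT samples

# Total compute budget in GPU-hours (upper bound).
gpu_hours = NUM_GPUS * TRAINING_DAYS * 24

# Rough throughput, ASSUMING one epoch over the SFT data (not stated in the text).
samples_per_gpu_hour = SFT_SAMPLES / gpu_hours

print(f"GPU-hours (upper bound): {gpu_hours:.0f}")
print(f"Samples per GPU-hour (one-epoch assumption): {samples_per_gpu_hour:.0f}")
```

Under these assumptions the budget comes to at most 1,152 A100 GPU-hours.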

Additionally, 🤗 InternVL-Chat-V1-2-Plus uses the same model architecture as InternVL-Chat-V1-2 and differs only in the SFT dataset: InternVL-Chat-V1-2 uses an SFT dataset with 1.2M samples, while the Plus version uses an SFT dataset with 12M samples.

Performance

*: Proprietary Model. †: Training Set Observed.

name	image size	MMMU (val)	MMMU (test)	MathVista (testmini)	MMB (test)	MMB-CN (test)	MMVP	MME	ScienceQA (image)	POPE	TextVQA (val)	SEEDv1 (image)	VizWiz (test)	GQA (test)
GPT-4V*	unknown	56.8	55.7	49.9	77.0	74.4	38.7	1409/517	-	-	78.0	71.6	-	-
Gemini Ultra*	unknown	59.4	-	53.0	-	-	-	-	-	-	82.3	-	-	-
Gemini Pro*	unknown	47.9	-	45.2	73.6	74.3	40.7	1497/437	-	-	74.6	70.7	-	-
Qwen-VL-Plus*	unknown	45.2	40.8	43.3	67.0	70.7	-	1681/502	-	-	78.9	65.7	-	-
Qwen-VL-Max*	unknown	51.4	46.8	51.0	77.6	75.7	-	-	-	-	79.5	-	-	-
LLaVA-NeXT-34B	672x672	51.1	44.7	46.5	79.3	79.0	-	1631/397	81.8	87.7	69.5	75.9	63.8	67.1†
InternVL-Chat-V1-2	448x448	51.6	46.2	47.7	82.2	81.2	56.7	1687/489	83.3	88.0	72.5	75.6	60.0	64.0†
InternVL-Chat-V1-2-Plus	448x448	50.3	45.6	59.9	83.8	82.0	58.7	1625/553	98.1†	88.7	74.1†	76.4	-	66.9†
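Because the table mixes plain scores, missing entries ("-"), dagger-marked entries, and paired MME scores, a small helper can make row-to-row comparisons less error-prone. The sketch below compares the InternVL-Chat-V1-2 and LLaVA-NeXT-34B rows copied from the table. It assumes the MME pair is perception/cognition and collapses it to a sum, a convention this page does not state. The function names are illustrative.

```python
# Benchmark columns and two rows copied from the table above.
BENCHMARKS = [
    "MMMU (val)", "MMMU (test)", "MathVista (testmini)", "MMB (test)",
    "MMB-CN (test)", "MMVP", "MME", "ScienceQA (image)", "POPE",
    "TextVQA (val)", "SEEDv1 (image)", "VizWiz (test)", "GQA (test)",
]
LLAVA_NEXT_34B = "51.1 44.7 46.5 79.3 79.0 - 1631/397 81.8 87.7 69.5 75.9 63.8 67.1†".split()
INTERNVL_V1_2 = "51.6 46.2 47.7 82.2 81.2 56.7 1687/489 83.3 88.0 72.5 75.6 60.0 64.0†".split()


def parse_cell(cell):
    """Return a float score, or None for a missing ('-') entry.

    A trailing dagger (training set observed) is dropped, and an MME pair
    'a/b' is collapsed to a + b (ASSUMED to be perception + cognition).
    """
    cell = cell.rstrip("†")
    if cell == "-":
        return None
    return sum(float(part) for part in cell.split("/"))


def compare(row_a, row_b):
    """Map each benchmark to score_a - score_b, skipping missing cells."""
    deltas = {}
    for name, a, b in zip(BENCHMARKS, row_a, row_b):
        score_a, score_b = parse_cell(a), parse_cell(b)
        if score_a is not None and score_b is not None:
            deltas[name] = round(score_a - score_b, 1)
    return deltas


deltas = compare(INTERNVL_V1_2, LLAVA_NEXT_34B)
wins = [name for name, delta in deltas.items() if delta > 0]
print(f"InternVL-Chat-V1-2 scores higher on {len(wins)} of {len(deltas)} shared benchmarks")
```

Dagger-marked scores are compared like any other here; a stricter comparison might exclude them, since the training set was observed for those benchmarks.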

Note that we use the official evaluation server to test the MMVet scores, with GPT-4-0613 serving as the judge model. Using different versions of GPT-4 as the judge can result in significant score variations.

Here, we have conducted only a simple performance comparison. For more detailed performance information and additional evaluation metrics, please refer to our performance summary table.

Citation

If you find this project useful in your research, please consider citing:

@article{chen2024far,
  title={How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={Science China Information Sciences},
  volume={67},
  number={12},
  pages={220101},
  year={2024},
  publisher={Springer}
}
@inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}
