Reproduce InternVL-Chat-V1-2#
Here, we provide all the necessary code, data, and models to reproduce InternVL-Chat-V1-2. Please follow the guidelines below for preparation.
Model Preparation#
| model name | type | param | download | size |
|---|---|---|---|---|
| InternViT-6B-448px-V1-2 | ViT | 5.5B | 🤗 HF link | 11.1 GB |
| Nous-Hermes-2-Yi-34B | LLM | 34.4B | 🤗 HF link | 65.0 GB |
If you want to replicate the training of InternVL-Chat-V1-2, please follow the commands below to download InternViT-6B-448px-V1-2 and Nous-Hermes-2-Yi-34B.
cd pretrained/
# pip install -U huggingface_hub
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternViT-6B-448px-V1-2 --local-dir InternViT-6B-448px-V1-2
huggingface-cli download --resume-download --local-dir-use-symlinks False NousResearch/Nous-Hermes-2-Yi-34B --local-dir Nous-Hermes-2-Yi-34B
The directory structure is:
pretrained
├── InternViT-6B-448px-V1-2
└── Nous-Hermes-2-Yi-34B
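Before starting the download, it can help to confirm that the target volume has room for both checkpoints and, afterwards, that both directories exist. The following is a minimal sketch, not part of the official tooling: the sizes are taken from the table above, treating 1 GB as 10^9 bytes is an assumption, and the helper names are hypothetical.

```python
import shutil
from pathlib import Path

# Checkpoint sizes in GB, copied from the model table (1 GB = 1e9 bytes is an assumption).
MODEL_SIZES_GB = {
    "InternViT-6B-448px-V1-2": 11.1,
    "Nous-Hermes-2-Yi-34B": 65.0,
}


def required_bytes(sizes_gb=MODEL_SIZES_GB):
    """Total bytes needed to hold every checkpoint listed in sizes_gb."""
    return round(sum(sizes_gb.values()) * 1e9)


def has_enough_space(path="."):
    """True if the filesystem containing path has enough free space for all checkpoints."""
    return shutil.disk_usage(path).free >= required_bytes()


def missing_models(root="pretrained"):
    """Names of the expected checkpoint directories that are absent under root."""
    root = Path(root)
    return [name for name in MODEL_SIZES_GB if not (root / name).is_dir()]
```

Calling `missing_models("pretrained")` after the download should return an empty list. It only checks that the directories exist, not that the files inside them are complete.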
Training Datasets Preparation#
Inspired by LLaVA-NeXT, we adopted a data-efficient SFT strategy to train InternVL-Chat-V1-2, using approximately 1.2M visual instruction tuning samples in total, all of which are fully open-source. At a high level, we build upon ShareGPT-4V and additionally integrate LLaVA-ZH, DVQA, ChartQA, AI2D, DocVQA, GeoQA+, and SynthDoG-EN. Most of the data remains consistent with LLaVA-NeXT.
Preferred Method: Download from HuggingFace#
To simplify the dataset preparation, we recommend downloading the complete dataset directly from HuggingFace. This method is straightforward and ensures you have all the necessary data in one place.
Download the entire dataset: InternVL-Chat-V1-2-SFT-Data
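One possible way to fetch the dataset from Python is `snapshot_download` from `huggingface_hub`. This is a sketch under assumptions: the `OpenGVLab/` organization prefix and the `playground` target directory are not stated on this page, and the downloaded files may still need to be extracted and arranged by hand.

```python
from pathlib import Path

# Assumption: the dataset is hosted under the OpenGVLab organization on the Hub.
DATASET_REPO = "OpenGVLab/InternVL-Chat-V1-2-SFT-Data"


def download_sft_data(target="playground", repo_id=DATASET_REPO):
    """Download the dataset repository into target and return the local path."""
    # Imported lazily so the module loads without the package installed
    # (pip install -U huggingface_hub).
    from huggingface_hub import snapshot_download

    Path(target).mkdir(parents=True, exist_ok=True)
    return snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=target)

# Example (requires network access and sufficient disk space):
# local_path = download_sft_data()
```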
Alternative Method: Manual Download#
If you prefer, you can manually download the annotation files and images as detailed below.
First, download the annotation files and place them in the playground/opensource/ folder.
Second, download all the images we used.
AI2D: ai2d_images (provided by InternLM-XComposer)
ChartQA: ChartQA Dataset
COCO: train2017
DVQA: images
GQA: images
LLaVA-Pretrain: images
OCR-VQA: download script. We save all files as .jpg
SAM: We only use 000000~000050.tar for now. You can quickly download 9K images from here.
TextVQA: trainvalimages
SynthDoG-EN: We only use 00000~00004 parquet files for now, with a total of 30K images. We provide the converted images.
WebData: images. Only for academic usage.
GeoQA+: images. We have converted the data format and redistributed it.
⚠️ Warning: Note that in the sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.jsonl file, the format of the RefCOCO data is consistent with LLaVA 1.5, which is [x1, y1, x2, y2] with coordinates ranging from 0-1. During the training of InternVL-Chat-V1-2, we did not apply any special processing to this format. However, for the training of InternVL-Chat-V1-2-Plus, we converted the coordinate format to <box>[[x1, y1, x2, y2]]</box> and adjusted the coordinate range to 0-1000.
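To illustrate the two coordinate formats described in the warning, here is a minimal sketch of one way such a conversion could be written. It is not the script used for InternVL-Chat-V1-2-Plus: the regular expression, the rounding and clamping behavior, and the function names are all assumptions.

```python
import re

# One normalized coordinate in [0, 1], e.g. "0", "0.25", "1.0".
_NUM = r"(0(?:\.\d+)?|1(?:\.0+)?)"
# A LLaVA-1.5-style box such as "[0.1, 0.25, 0.5, 1.0]".
BOX_PATTERN = re.compile(
    r"\[\s*" + r"\s*,\s*".join([_NUM] * 4) + r"\s*\]"
)


def to_box_tag(coords, scale=1000):
    """Format four normalized coordinates as <box>[[x1, y1, x2, y2]]</box> in 0-scale."""
    scaled = [min(scale, max(0, round(float(c) * scale))) for c in coords]
    return "<box>[[{}, {}, {}, {}]]</box>".format(*scaled)


def convert_text(text):
    """Replace every normalized box found in text with its <box> form."""
    return BOX_PATTERN.sub(lambda m: to_box_tag(m.groups()), text)


print(convert_text("the cat at [0.1, 0.25, 0.5, 1.0]"))
# prints: the cat at <box>[[100, 250, 500, 1000]]</box>
```

Because the pattern only matches values between 0 and 1, running `convert_text` on already converted text leaves it unchanged.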
Then, organize the annotations and images under playground/ as follows:
playground/
├── opensource
│ ├── ai2d_train_12k.jsonl
│ ├── chartqa_train_18k.jsonl
│ ├── docvqa_train_10k.jsonl
│ ├── dvqa_train_200k.jsonl
│ ├── geoqa+.jsonl
│ ├── llava_instruct_150k_zh.jsonl
│ ├── sharegpt4v_instruct_gpt4-vision_cap100k.jsonl
│ ├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.jsonl
│ └── synthdog_en.jsonl
├── data
│ ├── ai2d
│ │ ├── abc_images
│ │ └── images
│ ├── chartqa
│ │ ├── test
│ │ ├── train
│ │ └── val
│ ├── coco
│ │ └── train2017
│ ├── docvqa
│ │ ├── test
│ │ ├── train
│ │ └── val
│ ├── dvqa
│ │ └── images
│ ├── gqa
│ │ └── images
│ ├── llava
│ │ └── llava_pretrain
│ │   └── images
│ ├── ocr_vqa
│ │ └── images
│ ├── sam
│ │ └── images
│ ├── share_textvqa
│ │ └── images
│ ├── synthdog-en
│ │ └── images
│ ├── textvqa
│ │ └── train_images
│ ├── vg
│ │ ├── VG_100K
│ │ └── VG_100K_2
│ ├── web-celebrity
│ │ └── images
│ ├── web-landmark
│ │ └── images
│ ├── wikiart
│ │ └── images
│ ├── geoqa+
│ │ └── images
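The layout above can be checked with a short script before launching training. This is a minimal sketch: the file and directory names are copied from the tree, the function name is hypothetical, and the check only confirms that paths exist, not that their contents are complete.

```python
from pathlib import Path

# Annotation files expected under playground/opensource/.
ANNOTATION_FILES = [
    "ai2d_train_12k.jsonl",
    "chartqa_train_18k.jsonl",
    "docvqa_train_10k.jsonl",
    "dvqa_train_200k.jsonl",
    "geoqa+.jsonl",
    "llava_instruct_150k_zh.jsonl",
    "sharegpt4v_instruct_gpt4-vision_cap100k.jsonl",
    "sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.jsonl",
    "synthdog_en.jsonl",
]

# Image directories expected under playground/data/.
IMAGE_DIRS = [
    "ai2d/abc_images", "ai2d/images",
    "chartqa/test", "chartqa/train", "chartqa/val",
    "coco/train2017",
    "docvqa/test", "docvqa/train", "docvqa/val",
    "dvqa/images", "gqa/images",
    "llava/llava_pretrain/images",
    "ocr_vqa/images", "sam/images", "share_textvqa/images",
    "synthdog-en/images", "textvqa/train_images",
    "vg/VG_100K", "vg/VG_100K_2",
    "web-celebrity/images", "web-landmark/images",
    "wikiart/images", "geoqa+/images",
]


def missing_entries(root="playground"):
    """Return the expected files and directories that are absent under root."""
    root = Path(root)
    missing = [
        f"opensource/{name}"
        for name in ANNOTATION_FILES
        if not (root / "opensource" / name).is_file()
    ]
    missing += [
        f"data/{rel}" for rel in IMAGE_DIRS if not (root / "data" / rel).is_dir()
    ]
    return missing
```

An empty list from `missing_entries("playground")` means every path in the tree is present.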
Start Training#
We provide Slurm scripts for multi-node, multi-GPU training. You can train this model on either 32 or 64 GPUs; with 64 GPUs, training takes approximately 18 hours.
If you encounter an OOM error, you can decrease PER_DEVICE_BATCH_SIZE, for example, set PER_DEVICE_BATCH_SIZE=4.
# using 32 GPUs
PARTITION='your partition' GPUS=32 PER_DEVICE_BATCH_SIZE=8 sh shell/internvl1.2/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_res_finetune.sh
# using 64 GPUs
PARTITION='your partition' GPUS=64 PER_DEVICE_BATCH_SIZE=8 sh shell/internvl1.2/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_res_finetune.sh
The hyperparameters used for fine-tuning are listed in the following table. You can also view the training logs in TensorBoard here.
| Hyperparameter | Trainable param | Global batch size | Learning rate | Epoch | Max length | Weight decay |
|---|---|---|---|---|---|---|
| InternVL-Chat- | 40B | 512 | 1e-5 | 1 | 2048 | 0.05 |
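The global batch size of 512 in the table relates to GPUS and PER_DEVICE_BATCH_SIZE through gradient accumulation. The sketch below shows that arithmetic under the assumption that the global batch size is held fixed and the accumulation steps absorb any change. Whether the provided launch script derives the value exactly this way is not stated here, and the function name is hypothetical.

```python
def gradient_accumulation_steps(global_batch_size, gpus, per_device_batch_size):
    """Accumulation steps needed so that
    gpus * per_device_batch_size * steps == global_batch_size."""
    samples_per_step = gpus * per_device_batch_size
    if samples_per_step <= 0 or global_batch_size % samples_per_step != 0:
        raise ValueError(
            "global batch size must be a positive multiple of "
            "gpus * per_device_batch_size"
        )
    return global_batch_size // samples_per_step


# The two launch configurations shown above, plus the OOM fallback on 64 GPUs.
for gpus, per_device in [(32, 8), (64, 8), (64, 4)]:
    steps = gradient_accumulation_steps(512, gpus, per_device)
    print(f"GPUS={gpus} PER_DEVICE_BATCH_SIZE={per_device} -> {steps} step(s)")
# prints:
# GPUS=32 PER_DEVICE_BATCH_SIZE=8 -> 2 step(s)
# GPUS=64 PER_DEVICE_BATCH_SIZE=8 -> 1 step(s)
# GPUS=64 PER_DEVICE_BATCH_SIZE=4 -> 2 step(s)
```

Under this assumption, lowering PER_DEVICE_BATCH_SIZE to avoid an OOM error trades memory for more accumulation steps per optimizer update and leaves the effective batch size unchanged.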
Citation#
If you find this project useful in your research, please consider citing:
@article{chen2024far,
title={How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites},
author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
journal={Science China Information Sciences},
volume={67},
number={12},
pages={220101},
year={2024},
publisher={Springer}
}
@inproceedings{chen2024internvl,
title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={24185--24198},
year={2024}
}