Domain Adaptation#
Multi-View Image-Based Autonomous Driving#
Data Preparation#
Prepare InternVL-Chat-V1-2-SFT-Data. See Document.
Download drivelm_train.jsonl and drivelm_val.jsonl from InternVL-Domain-Adaptation-Data. drivelm_train.jsonl and drivelm_val.jsonl are the data after format conversion.
Download the images from DriveLM and process the images using tools/images_stitching.py:
python tools/images_stitching.py --data-root InternVL-Domain-Adaptation-Data/images/drivelm --ann-file path/to/v1_1_val_nus_q_only.json
Download the autonomous driving subset of MME-RealWorld.
Organize the files according to the following structure.
path/to/internvl_chat/InternVL-Domain-Adaptation-Data
├── train_data
│   └── drivelm_train.jsonl
├── images
│   ├── MME-RealWorld
│   │   └── data/AutonomousDriving/
│   └── drivelm
│       ├── nuscenes/
│       └── stitch/
├── train_meta
│   └── internvl_1_2_finetune_drivelm.json
└── val
    ├── MME_RealWorld.json
    └── drivelm_val.jsonl
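Before launching training, it can help to confirm that the files are where the directory structure above expects them. The following is a minimal sketch and not part of the InternVL tooling: the helper name `find_missing` and the list `EXPECTED_DRIVELM` are illustrative, and the root path is the placeholder used in the structure above.

```python
from pathlib import Path

# Relative paths copied from the directory structure above.
EXPECTED_DRIVELM = [
    "train_data/drivelm_train.jsonl",
    "images/MME-RealWorld/data/AutonomousDriving",
    "images/drivelm/nuscenes",
    "images/drivelm/stitch",
    "train_meta/internvl_1_2_finetune_drivelm.json",
    "val/MME_RealWorld.json",
    "val/drivelm_val.jsonl",
]


def find_missing(root, expected=EXPECTED_DRIVELM):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # Placeholder root; replace it with your own location.
    root = "path/to/internvl_chat/InternVL-Domain-Adaptation-Data"
    for rel in find_missing(root):
        print(f"missing: {rel}")
```

An empty result means every listed file and directory was found. The check only tests for existence and does not validate file contents.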
Finetune#
After downloading the pre-trained model and preparing the training data, you can adapt the model using the following scripts.
Before fine-tuning, set --model_name_or_path to the path of the pre-trained model.
In the default settings, we conduct full-parameter fine-tuning, but you can optionally freeze the visual encoder depending on your computational resources.
GPUS=8 PER_DEVICE_BATCH_SIZE=1 sh shell/mini_internvl/domain_adaptation/internvl2_1b_qwen2_0_5b_dynamic_res_finetune_drivelm.sh
GPUS=8 PER_DEVICE_BATCH_SIZE=1 sh shell/mini_internvl/domain_adaptation/internvl2_2b_internlm2_1_8b_dynamic_res_finetune_drivelm.sh
GPUS=8 PER_DEVICE_BATCH_SIZE=1 sh shell/mini_internvl/domain_adaptation/internvl2_4b_phi3_3_8b_dynamic_res_finetune_drivelm.sh
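The three scripts above differ only in the model-specific part of the file name, so the command line can be assembled programmatically when several runs are scripted. The sketch below is illustrative only: the size labels (`"1b"`, `"2b"`, `"4b"`) and the helper `finetune_command` are assumptions introduced here, and whether the scripts behave correctly with values other than `GPUS=8` and `PER_DEVICE_BATCH_SIZE=1` is not stated in this document.

```python
import shlex

# Model-specific file name stems, copied from the script names above.
# The dictionary keys are illustrative labels, not official identifiers.
SCRIPT_STEMS = {
    "1b": "internvl2_1b_qwen2_0_5b",
    "2b": "internvl2_2b_internlm2_1_8b",
    "4b": "internvl2_4b_phi3_3_8b",
}
SCRIPT_DIR = "shell/mini_internvl/domain_adaptation"


def finetune_command(size, domain="drivelm", gpus=8, per_device_batch_size=1):
    """Build the shell command line for one of the fine-tuning scripts."""
    stem = SCRIPT_STEMS[size]  # raises KeyError for an unknown size label
    script = f"{SCRIPT_DIR}/{stem}_dynamic_res_finetune_{domain}.sh"
    return (
        f"GPUS={int(gpus)} PER_DEVICE_BATCH_SIZE={int(per_device_batch_size)} "
        f"sh {shlex.quote(script)}"
    )


# Print the three commands for the DriveLM domain.
for size in SCRIPT_STEMS:
    print(finetune_command(size))
```

The function only builds strings and does not launch anything. The printed commands are meant to be run from the directory that contains the shell/ folder.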
Evaluation#
The DriveLM dataset contains data for perception, prediction, and planning, providing a comprehensive view of autonomous driving scenarios. We have already pre-processed the images and annotations needed to test the fine-tuned model on the DriveLM Challenge, so you can run the test directly on 8 GPUs with the following command:
GPUS=8 sh evaluate.sh ${checkpoint} drivelm
MME-RealWorld-AD
MME-RealWorld contains a subset of autonomous driving scenes, on which we assess the model’s performance on perception and reasoning tasks. Please use the following command to perform the test with 8 GPUs:
GPUS=8 sh evaluate.sh ${checkpoint} mme—realworld --dynamic --max-num 12 --subtask Autonomous_Driving
Medical Images#
Data Preparation#
Prepare InternVL-Chat-V1-2-SFT-Data. See Document.
Download the following files from InternVL-Domain-Adaptation-Data, extract the images, and organize them into the following directory structure.
path/to/internvl_chat/InternVL-Domain-Adaptation-Data
├── train_data
│   └── medical_sft_sample500k.jsonl
├── images
│   └── medical_images
└── train_meta
    └── internvl_1_2_finetune_medical.json
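After downloading, a quick look at the first few records of a .jsonl file is a simple way to confirm that it was not truncated and that each line parses as JSON. The sketch below assumes only the JSON Lines convention (one JSON value per line). The helper name `peek_jsonl` is illustrative, and no particular record schema is assumed.

```python
import json
from itertools import islice


def peek_jsonl(path, n=3):
    """Return the first n records of a JSON Lines file as Python objects."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        # Skip blank lines so a trailing newline does not cause a parse error.
        non_empty = (line for line in f if line.strip())
        for line in islice(non_empty, n):
            records.append(json.loads(line))
    return records
```

For example, calling `peek_jsonl` on train_data/medical_sft_sample500k.jsonl under your data root returns its first three records. A `json.JSONDecodeError` indicates a malformed line.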
Finetune#
Please finetune the model using the following scripts:
GPUS=8 PER_DEVICE_BATCH_SIZE=1 sh shell/mini_internvl/domain_adaptation/internvl2_1b_qwen2_0_5b_dynamic_res_finetune_medical.sh
GPUS=8 PER_DEVICE_BATCH_SIZE=1 sh shell/mini_internvl/domain_adaptation/internvl2_2b_internlm2_1_8b_dynamic_res_finetune_medical.sh
GPUS=8 PER_DEVICE_BATCH_SIZE=1 sh shell/mini_internvl/domain_adaptation/internvl2_4b_phi3_3_8b_dynamic_res_finetune_medical.sh
Evaluation#
We test our model on a comprehensive medical AI benchmark, GMAI-MMBench. The evaluation is conducted using the VLMEvalKit framework.
Please refer to Document for testing.
Importantly, before testing, please add the model to the internvl_series in config_file:
'Mini-InternVL-DA-1B': partial(InternVLChat, model_path='path/to/your/checkpoints', version='V2.0'),
'Mini-InternVL-DA-2B': partial(InternVLChat, model_path='path/to/your/checkpoints', version='V2.0'),
'Mini-InternVL-DA-4B': partial(InternVLChat, model_path='path/to/your/checkpoints', version='V2.0')
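Each entry above maps a model name to a `functools.partial` object, which stores the constructor arguments without building the model yet. The sketch below illustrates that pattern with a stand-in class. `StubChatModel` and `registry` are hypothetical names introduced here and are not VLMEvalKit's actual `InternVLChat` class or `internvl_series` dictionary.

```python
from functools import partial


class StubChatModel:
    """Stand-in for a model wrapper class; NOT the real VLMEvalKit class."""

    def __init__(self, model_path, version="V2.0"):
        self.model_path = model_path
        self.version = version


# A registry in the same shape as the entries above: name -> deferred constructor.
registry = {
    "Mini-InternVL-DA-1B": partial(
        StubChatModel, model_path="path/to/your/checkpoints", version="V2.0"
    ),
}

# Nothing is constructed until the stored partial is called.
model = registry["Mini-InternVL-DA-1B"]()
print(model.model_path, model.version)
```

Because construction is deferred, a placeholder `model_path` goes unnoticed until the model is actually selected for a run, so it is worth replacing path/to/your/checkpoints in every entry you add.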
Remote Sensing#
Data Preparation#
Prepare InternVL-Chat-V1-2-SFT-Data. See Document.
Please download the corresponding files in train_data, train_meta, and val directories from InternVL-Domain-Adaptation-Data, following the directory tree structure below.
Download the images from GeoChat, FIT-RS, RSVQA and DIOR-RSVG. Extract the files and place them in the corresponding locations within the directory structure below.
path/to/internvl_chat/InternVL-Domain-Adaptation-Data
├── train_data
│   ├── dior_rsvg_instruct_26k.jsonl
│   ├── fit_rs_vqa_100k.jsonl
│   ├── rsvqa_hr_train_instruct_100k.jsonl
│   └── geochat_instruct.jsonl
├── images
│   ├── RSVQA_L
│   │   └── Images_LR
│   ├── RSVQA-H
│   │   └── Data
│   ├── DIOR-RSVG
│   │   └── JPEGImages
│   ├── FIT-RS
│   │   └── imgv2_split_512_100_vaild
│   └── GeoChat
│       └── images
│           └── final_images_llava
├── train_meta
│   └── internvl_1_2_finetune_remote.json
└── val
    ├── dior_rsvg_test.json
    ├── rsvqa_h_test_1_instruct.json
    ├── rsvqa_h_test_2_instruct.json
    └── rsvqa_l_test_instruct.json
Finetune#
Please finetune the model using the following scripts:
GPUS=8 PER_DEVICE_BATCH_SIZE=1 sh shell/mini_internvl/domain_adaptation/internvl2_1b_qwen2_0_5b_dynamic_res_finetune_remote.sh
GPUS=8 PER_DEVICE_BATCH_SIZE=1 sh shell/mini_internvl/domain_adaptation/internvl2_2b_internlm2_1_8b_dynamic_res_finetune_remote.sh
GPUS=8 PER_DEVICE_BATCH_SIZE=1 sh shell/mini_internvl/domain_adaptation/internvl2_4b_phi3_3_8b_dynamic_res_finetune_remote.sh
Evaluation#
We assess the performance of our transferred model using the RSVQA dataset for the VQA task and the DIOR-RSVG dataset for the visual grounding task.
RS-VQA
We chose the Presence, Comparison, and Rural/Urban subsets of the RSVQA-LR and RSVQA-HR datasets for assessment.
You can now directly use the following command to run the test with 8 GPUs:
# RSVQA-LR
GPUS=8 sh evaluate.sh ${checkpoint} rsvqa-lr --dynamic --max-num 12
# RSVQA-HR-test1
GPUS=8 sh evaluate.sh ${checkpoint} rsvqa-hr-test1 --dynamic --max-num 12
# RSVQA-HR-test2
GPUS=8 sh evaluate.sh ${checkpoint} rsvqa-hr-test2 --dynamic --max-num 12
DIOR-RSVG
You can now directly use the following command to run the test with 8 GPUs:
GPUS=8 sh evaluate.sh ${checkpoint} dior-rsvg --dynamic --max-num 12
Autonomous Driving with Temporal Information#
Coming soon…
Citation#
If you find this project useful in your research, please consider citing:
@article{gao2024mini,
title={Mini-InternVL: a flexible-transfer pocket multi-modal model with 5\% parameters and 90\% performance},
author={Gao, Zhangwei and Chen, Zhe and Cui, Erfei and Ren, Yiming and Wang, Weiyun and Zhu, Jinguo and Tian, Hao and Ye, Shenglong and He, Junjun and Zhu, Xizhou and others},
journal={Visual Intelligence},
volume={2},
number={1},
pages={1--17},
year={2024},
publisher={Springer}
}