Domain Adaptation

Multi-View Image-Based Autonomous Driving

Data Preparation

  • Prepare InternVL-Chat-V1-2-SFT-Data; see Document.

  • Download drivelm_train.jsonl and drivelm_val.jsonl from InternVL-Domain-Adaptation-Data. These files contain the data after format conversion.

  • Download the images from DriveLM and process the images using tools/images_stitching.py:

python tools/images_stitching.py --data-root InternVL-Domain-Adaptation-Data/images/drivelm --ann-file path/to/v1_1_val_nus_q_only.json
  • Download the autonomous driving subset of MME-RealWorld.

  • Organize the files according to the following structure.

    path/to/internvl_chat/InternVL-Domain-Adaptation-Data 
    ├── train_data
    │   └── drivelm_train.jsonl
    ├── images
    │   ├── MME-RealWorld
    │   │   └── data/AutonomousDriving/
    │   └── drivelm
    │       ├── nuscenes/
    │       └── stitch/
    ├── train_meta
    │   └── internvl_1_2_finetune_drivelm.json
    └── val
        ├── MME_RealWorld.json
        └── drivelm_val.jsonl
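
Before launching a run, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper name `missing_paths` and the data root used in the example are assumptions, while the relative paths are copied from the tree.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED_DRIVELM_PATHS = [
    "train_data/drivelm_train.jsonl",
    "images/MME-RealWorld/data/AutonomousDriving",
    "images/drivelm/nuscenes",
    "images/drivelm/stitch",
    "train_meta/internvl_1_2_finetune_drivelm.json",
    "val/MME_RealWorld.json",
    "val/drivelm_val.jsonl",
]


def missing_paths(data_root, expected=EXPECTED_DRIVELM_PATHS):
    """Return the expected relative paths that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # Hypothetical location; replace it with your own data root.
    missing = missing_paths("InternVL-Domain-Adaptation-Data")
    if missing:
        print("Missing entries:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected entries are present.")
```

The same helper can be reused for the other domains by passing a different `expected` list.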
    

Finetune

After downloading the pre-trained model and preparing the training data, you can adapt the model using the following script.

Before fine-tuning, set --model_name_or_path to the path of the pre-trained model.

In the default settings, we conduct full-parameter fine-tuning, but you can optionally freeze the visual encoder depending on your computational resources.

GPUS=8 PER_DEVICE_BATCH_SIZE=1 sh shell/mini_internvl/domain_adaptation/internvl2_1b_qwen2_0_5b_dynamic_res_finetune_drivelm.sh
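
If you prefer to start the run from Python (for example, from a job scheduler), the command above can be reproduced with `subprocess`. This is only a sketch: the helper names `build_launch` and `launch` are hypothetical, and the only inputs assumed are the two environment variables and the script path shown in the command.

```python
import os
import subprocess

SCRIPT = (
    "shell/mini_internvl/domain_adaptation/"
    "internvl2_1b_qwen2_0_5b_dynamic_res_finetune_drivelm.sh"
)


def build_launch(script, gpus=8, per_device_batch_size=1):
    """Return (argv, env) equivalent to `GPUS=... PER_DEVICE_BATCH_SIZE=... sh script`."""
    env = dict(os.environ)  # copy, so the parent environment is not modified
    env["GPUS"] = str(gpus)
    env["PER_DEVICE_BATCH_SIZE"] = str(per_device_batch_size)
    return ["sh", script], env


def launch(script, **kwargs):
    """Run the fine-tuning script; raises CalledProcessError on a non-zero exit."""
    argv, env = build_launch(script, **kwargs)
    return subprocess.run(argv, env=env, check=True)
```

Calling `launch(SCRIPT)` from the internvl_chat directory should behave like the shell command, since the script is resolved relative to the current working directory.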

Evaluation

  • DriveLM Challenge

The DriveLM dataset covers perception, prediction, and planning, providing a comprehensive view of autonomous driving scenarios. To test the fine-tuned model on the DriveLM Challenge, we have pre-processed the data, including both images and annotations. Run the test with 8 GPUs using the following command:

GPUS=8 sh evaluate.sh ${checkpoint} drivelm
  • MME-Realworld-AD

MME-Realworld contains a subset of autonomous driving scenes, on which we assess the model's performance on perception and reasoning tasks. Use the following command to run the test with 8 GPUs:

GPUS=8 sh evaluate.sh ${checkpoint} mme-realworld --dynamic --max-num 12 --subtask Autonomous_Driving

Medical Images

Data Preparation

  • Prepare InternVL-Chat-V1-2-SFT-Data; see Document.

  • Download the following files from InternVL-Domain-Adaptation-Data, extract the images, and organize them into the following directory structure.

path/to/internvl_chat/InternVL-Domain-Adaptation-Data 
├── train_data
│   └── medical_sft_sample500k.jsonl
├── images
│   └── medical_images
└── train_meta
    └── internvl_1_2_finetune_medical.json
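
A quick sanity check after downloading is to confirm that the annotation file parses as JSON Lines. The sketch below makes no assumption about the fields inside each record; the function name `count_jsonl_records` is hypothetical and only the `.jsonl` extension is taken from the tree above.

```python
import json


def count_jsonl_records(path):
    """Return (valid, malformed) line counts for a JSON Lines file."""
    valid = malformed = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # blank lines are ignored
            try:
                json.loads(line)
                valid += 1
            except json.JSONDecodeError:
                malformed += 1
    return valid, malformed
```

For example, `count_jsonl_records("InternVL-Domain-Adaptation-Data/train_data/medical_sft_sample500k.jsonl")` reports how many records were read and whether any line was truncated during download.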

Finetune

Fine-tune the model using the following script:

GPUS=8 PER_DEVICE_BATCH_SIZE=1 sh shell/mini_internvl/domain_adaptation/internvl2_1b_qwen2_0_5b_dynamic_res_finetune_medical.sh

Evaluation

We test our model on GMAI-MMBench, a comprehensive medical AI benchmark. The evaluation was conducted using the VLMEvalKit framework.

Please refer to Document for testing.

Before testing, add the model to the internvl_series in config_file:

  'Mini-InternVL-DA-1B': partial(InternVLChat, model_path='path/to/your/checkpoints', version='V2.0'),
  'Mini-InternVL-DA-2B': partial(InternVLChat, model_path='path/to/your/checkpoints', version='V2.0'),
  'Mini-InternVL-DA-4B': partial(InternVLChat, model_path='path/to/your/checkpoints', version='V2.0')
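
The entries above rely on `functools.partial`, which stores the constructor arguments and defers building the model until the entry is called. The sketch below illustrates that pattern only: `InternVLChatStub` is a hypothetical stand-in written for this example, not VLMEvalKit's actual `InternVLChat` class, and the dictionary here is a local one rather than the real config.

```python
from functools import partial


class InternVLChatStub:
    """Hypothetical stand-in for the InternVLChat wrapper used in the config."""

    def __init__(self, model_path, version="V2.0"):
        self.model_path = model_path
        self.version = version


# Local illustration of a registry that maps model names to deferred constructors.
internvl_series = {
    "Mini-InternVL-DA-1B": partial(
        InternVLChatStub, model_path="path/to/your/checkpoints", version="V2.0"
    ),
}

# Nothing is constructed until the registry entry is called.
model = internvl_series["Mini-InternVL-DA-1B"]()
print(model.model_path, model.version)
```

Because construction is deferred, registering several checkpoints costs nothing until one of them is selected for evaluation.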

Remote Sensing

Data Preparation

  • Prepare InternVL-Chat-V1-2-SFT-Data; see Document.

  • Please download the corresponding files in train_data, train_meta, and val directories from InternVL-Domain-Adaptation-Data, following the directory tree structure below.

  • Download the images from GeoChat, FIT-RS, RSVQA and DIOR-RSVG. Extract the files and place them in the corresponding locations within the directory structure below.

path/to/internvl_chat/InternVL-Domain-Adaptation-Data 
├── train_data
│   ├── dior_rsvg_instruct_26k.jsonl
│   ├── fit_rs_vqa_100k.jsonl
│   ├── rsvqa_hr_train_instruct_100k.jsonl
│   └── geochat_instruct.jsonl
├── images
│   ├── RSVQA_L
│   │   └── Images_LR
│   ├── RSVQA-H
│   │   └── Data
│   ├── DIOR-RSVG
│   │   └── JPEGImages
│   ├── FIT-RS
│   │   └── imgv2_split_512_100_vaild
│   └── GeoChat
│       └── images
│           └── final_images_llava
├── train_meta
│   └── internvl_1_2_finetune_remote.json
└── val
    ├── dior_rsvg_test.json
    ├── rsvqa_h_test_1_instruct.json
    ├── rsvqa_h_test_2_instruct.json
    └── rsvqa_l_test_instruct.json

Finetune

Fine-tune the model using the following script:

GPUS=8 PER_DEVICE_BATCH_SIZE=1 sh shell/mini_internvl/domain_adaptation/internvl2_1b_qwen2_0_5b_dynamic_res_finetune_remote.sh

Evaluation

We assess the performance of our transferred model using the RSVQA dataset for the VQA task and the DIOR-RSVG dataset for the visual grounding task.

  • RS-VQA

We chose the Presence, Comparison, and Rural/Urban subsets of the RSVQA-LR and RSVQA-HR datasets for assessment.

You can now directly use the following command to run the test with 8 GPUs:

# RSVQA-LR 
GPUS=8 sh evaluate.sh ${checkpoint} rsvqa-lr --dynamic --max-num  12
# RSVQA-HR-test1
GPUS=8 sh evaluate.sh ${checkpoint} rsvqa-hr-test1 --dynamic --max-num  12
# RSVQA-HR-test2
GPUS=8 sh evaluate.sh ${checkpoint} rsvqa-hr-test2 --dynamic --max-num  12
  • DIOR-RSVG

You can now directly use the following command to run the test with 8 GPUs:

GPUS=8 sh evaluate.sh ${checkpoint} dior-rsvg --dynamic --max-num  12
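
To run all four remote sensing benchmarks in sequence, the commands above can be generated in a loop. This is a minimal sketch: the function name `build_eval_commands` and the checkpoint path in the usage note are hypothetical, while the benchmark names and flags are copied from the commands in this section.

```python
import shlex

# Benchmark names as passed to evaluate.sh in the commands above.
REMOTE_SENSING_BENCHMARKS = [
    "rsvqa-lr",
    "rsvqa-hr-test1",
    "rsvqa-hr-test2",
    "dior-rsvg",
]


def build_eval_commands(checkpoint, benchmarks=REMOTE_SENSING_BENCHMARKS,
                        gpus=8, max_num=12):
    """Return one shell command string per benchmark."""
    commands = []
    for name in benchmarks:
        argv = ["sh", "evaluate.sh", checkpoint, name,
                "--dynamic", "--max-num", str(max_num)]
        # shlex.join quotes arguments such as paths that contain spaces.
        commands.append(f"GPUS={gpus} " + shlex.join(argv))
    return commands


if __name__ == "__main__":
    # Hypothetical checkpoint path; replace it with your own.
    for command in build_eval_commands("work_dirs/my_checkpoint"):
        print(command)
```

The printed lines can be pasted into a shell or written to a job script; the sketch only builds the strings and does not start any evaluation itself.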

Autonomous Driving with Temporal Information

Coming soon…

Citation

If you find this project useful in your research, please consider citing:

@article{gao2024mini,
  title={Mini-InternVL: a flexible-transfer pocket multi-modal model with 5\% parameters and 90\% performance},
  author={Gao, Zhangwei and Chen, Zhe and Cui, Erfei and Ren, Yiming and Wang, Weiyun and Zhu, Jinguo and Tian, Hao and Ye, Shenglong and He, Junjun and Zhu, Xizhou and others},
  journal={Visual Intelligence},
  volume={2},
  number={1},
  pages={1--17},
  year={2024},
  publisher={Springer}
}