Mixed Preference Optimization#

Please use trl==0.10.1 to ensure the model works normally.

Model Preparation#

model name	type	param	download	size
InternVL2_5-1B	MLLM	0.9B	🤗 HF link	1.8 GB
InternVL2_5-1B-MPO	MLLM	0.9B	🤗 HF link	1.8 GB
InternVL2_5-2B	MLLM	2.2B	🤗 HF link	4.2 GB
InternVL2_5-2B-MPO	MLLM	2.2B	🤗 HF link	4.2 GB
InternVL2_5-4B	MLLM	4.2B	🤗 HF link	7.8 GB
InternVL2_5-4B-MPO	MLLM	4.2B	🤗 HF link	7.8 GB
InternVL2_5-8B	MLLM	8.1B	🤗 HF link	16 GB
InternVL2_5-8B-MPO	MLLM	8.1B	🤗 HF link	16 GB
InternVL2_5-26B	MLLM	25.5B	🤗 HF link	48 GB
InternVL2_5-26B-MPO	MLLM	25.5B	🤗 HF link	48 GB
InternVL2_5-38B	MLLM	40.1B	🤗 HF link	75 GB
InternVL2_5-38B-MPO	MLLM	40.1B	🤗 HF link	75 GB
InternVL2_5-78B	MLLM	76.3B	🤗 HF link	143 GB
InternVL2_5-78B-MPO	MLLM	76.3B	🤗 HF link	143 GB

Before starting the preference optimization, download the pre-trained model we provide.

cd ckpt/
# pip install -U huggingface_hub
# Download OpenGVLab/InternVL2_5-8B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-8B --local-dir InternVL2_5-8B
# Download OpenGVLab/InternVL2_5-8B-MPO
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-8B-MPO --local-dir InternVL2_5-8B-MPO

The directory structure is:

ckpt
├── InternVL2_5-8B
└── InternVL2_5-8B-MPO

Prepare Our MMPR Dataset#

To prepare the training data, please first download our MMPR dataset and the JSON file.

Our dataset contains approximately 3 million preference pairs, of which only around 400k are used during training. You can adjust the number of active data samples and the data mixture ratio by modifying the repeat_time parameter in the JSON file.
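As a rough illustration of how this parameter changes the number of active samples, the sketch below estimates the effective sample count per dataset as length * repeat_time. The linear scaling (values below 1 subsample, values above 1 repeat the data) is an assumption about how the data loader interprets the field, and the dataset names and sizes are hypothetical.

```python
import json


def effective_samples(meta: dict) -> dict:
    """Estimate active samples per dataset, assuming linear scaling by repeat_time."""
    return {
        name: int(entry["length"] * entry.get("repeat_time", 1))
        for name, entry in meta.items()
    }


# Hypothetical meta file content; names and sizes are made up for illustration.
meta = json.loads("""
{
  "dataset_a": {"length": 100000, "repeat_time": 0.5},
  "dataset_b": {"length": 20000, "repeat_time": 2}
}
""")

counts = effective_samples(meta)
total = sum(counts.values())
for name, n in counts.items():
    # Share of each dataset in the resulting training mixture
    print(f"{name}: {n} samples ({n / total:.1%} of the mixture)")
```

Running this kind of estimate before training makes it easier to see whether a change to one entry shifts the overall mixture more than intended.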

The directory structure is:

MMPR
├── images
└── annotations

Please note that our training data includes instructions collected from the InternVL demo. However, due to privacy concerns, we are unable to release this portion of the data. Therefore, the reproduced results on general VQA (i.e., MMVet, LLaVABench, and MMHal-Bench) may be inferior to those of our released model.

We recommend incorporating additional general VQA data to preserve the general VQA abilities, following our DropoutNTP pipeline.

Prepare Customized Data#

If you want to prepare your customized preference data, please create a JSON file similar to this example.

The format for the JSON file should be:

{
  "your-custom-dataset-1": {
    "root": "path/to/the/image/",
    "annotation": "path/to/the/jsonl/annotation",
    "data_augment": false,
    "max_dynamic_patch": 12,
    "repeat_time": 1,
    "length": "number of samples in the dataset"
  }
}

Example:

{
  "scienceqa_multi_choice_en_20240402_extracted_pairs_vqa_format_rules": {
    "root": "MMPR/images/ScienceQA",
    "annotation": "MMPR/annotations/scienceqa_multi_choice_en_20240402_extracted_pairs_vqa_format_rules.jsonl",
    "data_augment": false,
    "repeat_time": 1,
    "length": 66457
  }
}
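The length field can be filled in by counting the lines of the annotation file. A minimal sketch, assuming one record per non-empty line; the helper name build_meta_entry, the dataset name, and the temporary file are hypothetical and only serve the demo:

```python
import json
import tempfile
from pathlib import Path


def build_meta_entry(root: str, annotation: str, repeat_time: float = 1) -> dict:
    """Create one meta entry; `length` is the number of non-empty JSONL lines."""
    with open(annotation, encoding="utf-8") as f:
        length = sum(1 for line in f if line.strip())
    return {
        "root": root,
        "annotation": annotation,
        "data_augment": False,
        "repeat_time": repeat_time,
        "length": length,
    }


# Demo with a temporary two-line annotation file
with tempfile.TemporaryDirectory() as tmp:
    ann = Path(tmp) / "my_pairs.jsonl"
    ann.write_text(
        '{"image": "1.png", "question": "q1", "chosen": "a", "rejected": "b"}\n'
        '{"image": "2.png", "question": "q2", "chosen": "c", "rejected": "d"}\n',
        encoding="utf-8",
    )
    meta = {"my-custom-dataset": build_meta_entry("path/to/the/image/", str(ann))}
    meta_text = json.dumps(meta, indent=2)
    print(meta_text)
```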

Each JSONL annotation file (whether it holds plain-text, single-image, or multi-image data) should contain one JSON object per line, in the following format:

{"image": "1.png", "question": "xxx", "chosen": "xxx", "rejected": "xxx"}
{"image": "2.png", "question": "xxx", "chosen": "xxx", "rejected": "xxx"}
...
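Before training on a custom file, it can help to check that every line parses as JSON and carries the expected fields. The sketch below is one possible check, not part of the InternVL tooling; the helper validate_line is hypothetical, and treating "image" as optional (to allow plain-text data) is an assumption.

```python
import json

# "image" is treated as optional so that plain-text records pass (assumption).
REQUIRED_KEYS = ("question", "chosen", "rejected")


def validate_line(line: str) -> list:
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = [f"missing or empty field: {k}" for k in REQUIRED_KEYS if not record.get(k)]
    if record.get("chosen") and record.get("chosen") == record.get("rejected"):
        problems.append("chosen and rejected are identical")
    return problems


good = '{"image": "1.png", "question": "xxx", "chosen": "yes", "rejected": "no"}'
trailing_comma = '{"image": "1.png", "question": "xxx", "chosen": "yes", "rejected": "no",}'

for line in (good, trailing_comma):
    print(validate_line(line) or "ok")
```

A trailing comma before the closing brace is rejected by Python's json module, so it is reported as invalid JSON here.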

Our suggestion is to add new domain-specific data on top of MMPR. This will enhance downstream capabilities while retaining the foundational skills. Of course, you can also choose to fine-tune solely on the new data based on your requirements.

Start Preference Optimization#

Commands for preference optimization:

cd internvl_chat
sh shell/internvl2.5_mpo/preference_optimization/internvl2_5_8b_internlm2_5_7b_dynamic_res_mpo.sh

If you encounter any issues, please let us know, and we will update the training guide to enhance its usability.

In addition to the standard InternVL environment, you need to run pip install trl==0.10.1.
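To confirm that the pinned version is the one actually installed, a small check such as the following can be run before launching training. This is only a convenience sketch using the standard library; the helper names trl_status and installed_trl are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED_TRL = "0.10.1"


def trl_status(installed):
    """Describe whether the installed trl version matches the pinned one."""
    if installed is None:
        return f"trl is not installed; run: pip install trl=={REQUIRED_TRL}"
    if installed != REQUIRED_TRL:
        return f"trl {installed} found, expected {REQUIRED_TRL}"
    return "ok"


def installed_trl():
    """Return the installed trl version string, or None if trl is missing."""
    try:
        return version("trl")
    except PackageNotFoundError:
        return None


print(trl_status(installed_trl()))
```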

Evaluation#

We evaluate performance on other benchmarks (e.g., MMVet, LLaVABench, and CRPE) using VLMEvalKit. To activate the CoT prompt, set use_mpo_prompt=True in config.py and set the environment variable USE_COT="1".

Generate Additional Preference Data#

To construct additional open-ended VQA preference data, you can use our DropoutNTP pipeline with the following command:

# --checkpoint:           the model used to generate negative samples
# --prompt-path:          the prompt file (see the format example below)
# --out-dir:              the directory where the resulting data is saved
# --num-return-sequences: the number of negative samples generated per item
# --max-num:              max_tiles when dynamic resolution is enabled
# --start-ratio:          we set it to 0.5 by default
# LOG_PATH:               the file where the log is saved
srun -p ${PARTITION} \
    --gres=gpu:${GPUS_PER_NODE} \
    --nodes=${NODES} \
    --ntasks=${GPUS} \
    --ntasks-per-node=${GPUS_PER_NODE} \
    --cpus-per-task=${CPUS_PER_TASK} \
    --kill-on-bad-exit=1 \
    --quotatype=${QUOTA_TYPE} \
python -u tools/reasoning_data_pipeline/mmpr_data_pipeline_dropout_ntp.py \
    --checkpoint ${model_path} \
    --prompt-path ${dataset} \
    --out-dir ${out_dir} \
    --batch-size 1 \
    --num-workers 8 \
    --num-return-sequences 1 \
    --top-k 50 \
    --temperature 1.0 \
    --dynamic \
    --max-num ${max_num} \
    --sample-max-num 500000 \
    --tp 8 \
    --start-ratio ${START_RATIO} \
2>&1 | tee -a "${LOG_PATH}"

The format for the prompt file should be:

{"image": "1.png", "question": "xxx", "chosen": "xxx", "rejected": null}
{"image": "2.png", "question": "xxx", "chosen": "xxx", "rejected": null}
...
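If you already have image/question/response samples, a prompt file in this format can be produced with a few lines of Python. This is a minimal sketch: the input field name "response", the helper to_dropout_ntp_prompts, and the sample content are hypothetical.

```python
import json


def to_dropout_ntp_prompts(samples):
    """Turn image/question/response samples into prompt-file lines with rejected=null."""
    lines = []
    for s in samples:
        record = {
            "image": s["image"],
            "question": s["question"],
            "chosen": s["response"],
            "rejected": None,  # serialized as null; the negative sample is generated later
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


# Hypothetical input samples
samples = [
    {"image": "1.png", "question": "What is shown?", "response": "A red bicycle."},
    {"image": "2.png", "question": "How many cats?", "response": "Two cats."},
]

prompt_lines = to_dropout_ntp_prompts(samples)
print("\n".join(prompt_lines))
```

Using json.dumps for each record avoids hand-written syntax slips such as trailing commas and guarantees that None is written as null.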

To construct additional CoT reasoning preference data, you can use our correctness-based pipeline with the following command:

# --checkpoint:           the model used to generate reasoning processes
# --prompt-path:          the prompt file (see the format example below)
# --out-dir:              the directory where the resulting data is saved
# --num-return-sequences: the number of reasoning processes generated per item
# --max-num:              max_tiles when dynamic resolution is enabled
# LOG_PATH:               the file where the log is saved
srun -p ${PARTITION} \
    --gres=gpu:${GPUS_PER_NODE} \
    --nodes=${NODES} \
    --ntasks=${GPUS} \
    --ntasks-per-node=${GPUS_PER_NODE} \
    --cpus-per-task=${CPUS_PER_TASK} \
    --kill-on-bad-exit=1 \
    --quotatype=${QUOTA_TYPE} \
python -u tools/reasoning_data_pipeline/mmpr_data_pipeline_correctness.py \
    --checkpoint ${model_path} \
    --prompt-path ${dataset} \
    --out-dir ${out_dir} \
    --batch-size 1 \
    --num-workers 8 \
    --num-return-sequences 32 \
    --top-k 50 \
    --temperature 1.0 \
    --dynamic \
    --max-num ${max_num} \
    --sample-max-num 20000 \
    --tp 8 \
2>&1 | tee -a "${LOG_PATH}"

The format for the prompt file should be:

{"image": "1.png", "question": "xxx", "answer": "xxx"}
{"image": "2.png", "question": "xxx", "answer": "xxx"}
...

After sampling multiple reasoning processes, you can use this command to convert them into preference data based on their correctness:

# --data-dir: should be the same as the ${out_dir} used when sampling reasoning processes
# --save-dir: the directory where the resulting data is saved
python -u tools/mm_reasoning_pipeline/internvl_lmdeploy_cot_postprocess.py \
    --data-dir "${data_dir}" \
    --save-dir "${save_dir}" \
    --answer-fix \
    --force \
    --num-pairs-per-key 15 \
    --max-lines 1200000
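The idea behind a correctness-based conversion can be sketched as follows: responses whose final answer matches the ground truth become chosen candidates, the rest become rejected candidates, and a capped number of pairs is kept per question. This is an illustrative sketch only and not the logic of the script above; the exact-match comparison, the helper names, and the mapping of max_pairs to --num-pairs-per-key are assumptions.

```python
from itertools import islice, product


def normalize(text):
    """Simplistic answer normalization (assumption: case-insensitive exact match)."""
    return text.strip().lower()


def build_pairs(question, answer, responses, max_pairs=15):
    """Pair correct responses (chosen) with incorrect ones (rejected).

    `responses` is a list of (reasoning_text, final_answer) tuples.
    At most `max_pairs` pairs are returned for the question.
    """
    target = normalize(answer)
    correct = [r for r, a in responses if normalize(a) == target]
    wrong = [r for r, a in responses if normalize(a) != target]
    return [
        {"question": question, "chosen": c, "rejected": w}
        for c, w in islice(product(correct, wrong), max_pairs)
    ]


# Hypothetical sampled reasoning processes for one question
responses = [
    ("2 plus 2 equals 4.", "4"),
    ("I count five items.", "5"),
    ("Adding the two pairs gives 4.", " 4 "),
    ("The sum is 3.", "3"),
]

pairs = build_pairs("What is 2 + 2?", "4", responses)
for p in pairs:
    print(p["chosen"], "|", p["rejected"])
```

Questions for which every sampled response is correct (or every one is wrong) yield no pairs under this scheme, which is one reason to sample many reasoning processes per item.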

Citation#

If you find this project useful in your research, please consider citing:

@article{wang2024mpo,
  title={Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization},
  author={Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Zhu, Jinguo and Zhu, Xizhou and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2411.10442},
  year={2024}
}
@article{chen2024expanding,
  title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
  author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
  journal={arXiv preprint arXiv:2412.05271},
  year={2024}
}
