Evaluation of InternVL2.5 Series#
To evaluate the performance of the InternVL2.5 series across various tasks, follow the instructions for each specific dataset. Ensure that the appropriate number of GPUs is allocated as specified.
1⃣️ We mainly use VLMEvalKit repositories for model evaluation.
2⃣️ Please note that evaluating the same model using different testing toolkits like InternVL and VLMEvalKit can result in slight differences, which is normal. Updates to code versions and variations in environment and hardware can also cause minor discrepancies in results.
Model Preparation#
model name |
type |
param |
download |
size |
|---|---|---|---|---|
InternVL2_5-1B |
MLLM |
0.9B |
🤗 HF link |
1.8 GB |
InternVL2_5-1B-MPO |
MLLM |
0.9B |
🤗 HF link |
1.8 GB |
InternVL2_5-2B |
MLLM |
2.2B |
🤗 HF link |
4.2 GB |
InternVL2_5-2B-MPO |
MLLM |
2.2B |
🤗 HF link |
4.2 GB |
InternVL2_5-4B |
MLLM |
4.2B |
🤗 HF link |
7.8 GB |
InternVL2_5-4B-MPO |
MLLM |
4.2B |
🤗 HF link |
7.8 GB |
InternVL2_5-8B |
MLLM |
8.1B |
🤗 HF link |
16 GB |
InternVL2_5-8B-MPO |
MLLM |
8.1B |
🤗 HF link |
16 GB |
InternVL2_5-26B |
MLLM |
25.5B |
🤗 HF link |
48 GB |
InternVL2_5-26B-MPO |
MLLM |
25.5B |
🤗 HF link |
48 GB |
InternVL2_5-38B |
MLLM |
40.1B |
🤗 HF link |
75 GB |
InternVL2_5-38B-MPO |
MLLM |
40.1B |
🤗 HF link |
75 GB |
InternVL2_5-78B |
MLLM |
76.3B |
🤗 HF link |
143 GB |
InternVL2_5-78B-MPO |
MLLM |
76.3B |
🤗 HF link |
143 GB |
Before evaluation, download the trained model we provide.
cd pretrained/
# pip install -U huggingface_hub
# Download OpenGVLab/InternVL2_5-1B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-1B --local-dir InternVL2_5-1B
# Download OpenGVLab/InternVL2_5-1B-MPO
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-1B-MPO --local-dir InternVL2_5-1B-MPO
# Download OpenGVLab/InternVL2_5-2B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-2B --local-dir InternVL2_5-2B
# Download OpenGVLab/InternVL2_5-2B-MPO
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-2B-MPO --local-dir InternVL2_5-2B-MPO
# Download OpenGVLab/InternVL2_5-4B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-4B --local-dir InternVL2_5-4B
# Download OpenGVLab/InternVL2_5-4B-MPO
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-4B-MPO --local-dir InternVL2_5-4B-MPO
# Download OpenGVLab/InternVL2_5-8B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-8B --local-dir InternVL2_5-8B
# Download OpenGVLab/InternVL2_5-8B-MPO
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-8B-MPO --local-dir InternVL2_5-8B-MPO
# Download OpenGVLab/InternVL2_5-26B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-26B --local-dir InternVL2_5-26B
# Download OpenGVLab/InternVL2_5-26B-MPO
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-26B-MPO --local-dir InternVL2_5-26B-MPO
# Download OpenGVLab/InternVL2_5-38B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-38B --local-dir InternVL2_5-38B
# Download OpenGVLab/InternVL2_5-38B-MPO
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-38B-MPO --local-dir InternVL2_5-38B-MPO
# Download OpenGVLab/InternVL2_5-78B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-78B --local-dir InternVL2_5-78B
# Download OpenGVLab/InternVL2_5-78B-MPO
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-78B-MPO --local-dir InternVL2_5-78B-MPO
The directory structure is:
pretrained
├── InternVL2_5-1B
├── InternVL2_5-1B-MPO
├── InternVL2_5-2B
├── InternVL2_5-2B-MPO
├── InternVL2_5-4B
├── InternVL2_5-4B-MPO
├── InternVL2_5-8B
├── InternVL2_5-8B-MPO
├── InternVL2_5-26B
├── InternVL2_5-26B-MPO
├── InternVL2_5-38B
├── InternVL2_5-38B-MPO
├── InternVL2_5-78B
└── InternVL2_5-78B-MPO
Evaluation using VLMEvalKit Codebase#
We evaluate the performance on most benchmarks (e.g., MMVet, LLaVABench, and CRPE) using VLMEvalKit. You need to set and USE_COT="1" in environment variable to activate the CoT prompt.
Data Preparation#
VLMEvalKit will automatically download the data for evaluation, so you do not need to prepare it manually.
Evaluation on Different Benchmarks#
To evaluate our models on different benchmarks, you can refer to the following script:
#!/bin/bash
set -x
PARTITION=${PARTITION:-"Intern5"}
GPUS=${GPUS:-64}
GPUS_PER_NODE=${GPUS_PER_NODE:-8}
GPUS_PER_TASK=${GPUS_PER_TASK:-1}
QUOTA_TYPE=${QUOTA_TYPE:-"reserved"}
declare -a models=( \
"InternVL2-5-1B" \
"InternVL2-5-1B-MPO" \
"InternVL2-5-2B" \
"InternVL2-5-2B-MPO" \
"InternVL2-5-4B" \
"InternVL2-5-4B-MPO" \
"InternVL2-5-8B" \
"InternVL2-5-8B-MPO" \
"InternVL2-5-38B" \
"InternVL2-5-38B-MPO" \
"InternVL2-5-78B" \
"InternVL2-5-78B-MPO" \
)
datasets="MMBench_TEST_EN_V11 MMStar MMMU_DEV_VAL MathVista_MINI HallusionBench AI2D_TEST OCRBench MMVet"
LOG_DIR="logs_eval"
export OPENAI_API_KEY="xxx"
for ((i=0; i<${#models[@]}; i++)); do
model=${models[i]}
if [[ "$model" =~ 38B|78B ]]; then
GPUS_PER_TASK=8
else
GPUS_PER_TASK=1
fi
srun -p ${PARTITION} \
--gres=gpu:${GPUS_PER_NODE} \
--ntasks=$((GPUS / GPUS_PER_TASK)) \
--ntasks-per-node=$((GPUS_PER_NODE / GPUS_PER_TASK)) \
--quotatype=${QUOTA_TYPE} \
--job-name="eval_wwy" \
-o "${LOG_DIR}/${model}/evaluation.log" \
-e "${LOG_DIR}/${model}/evaluation.log" \
--async \
python -u run.py \
--data ${datasets} \
--model ${model} \
--verbose \
done
Note that VLMEvalkit does not officially support launching evaluation tasks with Slurm. You need to modify the run.py script to support the Slurm launcher as follows:
def init_dist():
if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ:
pass
elif 'SLURM_PROCID' in os.environ:
rank = int(os.getenv('SLURM_PROCID', '0'))
world_size = int(os.getenv('SLURM_NTASKS', '1'))
local_rank = rank % torch.cuda.device_count()
os.environ['RANK'] = str(rank)
os.environ['LOCAL_RANK'] = str(local_rank)
os.environ['WORLD_SIZE'] = str(world_size)
if 'MASTER_ADDR' not in os.environ:
node_list = os.environ["SLURM_NODELIST"]
addr = subprocess.getoutput(f"scontrol show hostname {node_list} | head -n1")
os.environ['MASTER_ADDR'] = addr
if 'MASTER_PORT' not in os.environ:
os.environ['MASTER_PORT'] = '22110'
...
if __name__ == '__main__':
load_env()
init_dist()
main()
Please refer to their document for more details.
Citation#
If you find this project useful in your research, please consider citing:
@article{chen2024expanding,
title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
journal={arXiv preprint arXiv:2412.05271},
year={2024}
}
@article{wang2024mpo,
title={Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization},
author={Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Zhu, Jinguo and Zhu, Xizhou and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
journal={arXiv preprint arXiv:2411.10442},
year={2024}
}