Evaluation of InternVL2.5 Series

To evaluate the performance of the InternVL2.5 series across various tasks, follow the instructions for each specific dataset. Ensure that the appropriate number of GPUs is allocated as specified.

1⃣️ We mainly use the VLMEvalKit repository for model evaluation.

2⃣️ Please note that evaluating the same model with different toolkits, such as InternVL and VLMEvalKit, can yield slightly different results, which is normal. Code updates and differences in environment or hardware can also cause minor discrepancies.

Model Preparation

model name	type	param	download	size
InternVL2_5-1B	MLLM	0.9B	🤗 HF link	1.8 GB
InternVL2_5-1B-MPO	MLLM	0.9B	🤗 HF link	1.8 GB
InternVL2_5-2B	MLLM	2.2B	🤗 HF link	4.2 GB
InternVL2_5-2B-MPO	MLLM	2.2B	🤗 HF link	4.2 GB
InternVL2_5-4B	MLLM	4.2B	🤗 HF link	7.8 GB
InternVL2_5-4B-MPO	MLLM	4.2B	🤗 HF link	7.8 GB
InternVL2_5-8B	MLLM	8.1B	🤗 HF link	16 GB
InternVL2_5-8B-MPO	MLLM	8.1B	🤗 HF link	16 GB
InternVL2_5-26B	MLLM	25.5B	🤗 HF link	48 GB
InternVL2_5-26B-MPO	MLLM	25.5B	🤗 HF link	48 GB
InternVL2_5-38B	MLLM	40.1B	🤗 HF link	75 GB
InternVL2_5-38B-MPO	MLLM	40.1B	🤗 HF link	75 GB
InternVL2_5-78B	MLLM	76.3B	🤗 HF link	143 GB
InternVL2_5-78B-MPO	MLLM	76.3B	🤗 HF link	143 GB

Before evaluation, download the trained models we provide:

cd pretrained/
# pip install -U huggingface_hub
# Download OpenGVLab/InternVL2_5-1B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-1B --local-dir InternVL2_5-1B

# Download OpenGVLab/InternVL2_5-1B-MPO
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-1B-MPO --local-dir InternVL2_5-1B-MPO

# Download OpenGVLab/InternVL2_5-2B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-2B --local-dir InternVL2_5-2B

# Download OpenGVLab/InternVL2_5-2B-MPO
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-2B-MPO --local-dir InternVL2_5-2B-MPO

# Download OpenGVLab/InternVL2_5-4B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-4B --local-dir InternVL2_5-4B

# Download OpenGVLab/InternVL2_5-4B-MPO
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-4B-MPO --local-dir InternVL2_5-4B-MPO

# Download OpenGVLab/InternVL2_5-8B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-8B --local-dir InternVL2_5-8B

# Download OpenGVLab/InternVL2_5-8B-MPO
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-8B-MPO --local-dir InternVL2_5-8B-MPO

# Download OpenGVLab/InternVL2_5-26B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-26B --local-dir InternVL2_5-26B

# Download OpenGVLab/InternVL2_5-26B-MPO
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-26B-MPO --local-dir InternVL2_5-26B-MPO

# Download OpenGVLab/InternVL2_5-38B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-38B --local-dir InternVL2_5-38B

# Download OpenGVLab/InternVL2_5-38B-MPO
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-38B-MPO --local-dir InternVL2_5-38B-MPO

# Download OpenGVLab/InternVL2_5-78B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-78B --local-dir InternVL2_5-78B

# Download OpenGVLab/InternVL2_5-78B-MPO
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2_5-78B-MPO --local-dir InternVL2_5-78B-MPO
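
The same downloads can also be scripted in Python. The following is a minimal sketch, not part of the official instructions: it assumes the `huggingface_hub` package is installed, and the helper names `model_repo_ids` and `download_all` are hypothetical. The repo IDs and target directory names are taken from the commands above.

```python
from pathlib import Path

# Model sizes listed in the Model Preparation table.
SIZES = ["1B", "2B", "4B", "8B", "26B", "38B", "78B"]


def model_repo_ids(org: str = "OpenGVLab") -> list[str]:
    """Return the 14 repo IDs: each size followed by its -MPO variant."""
    repo_ids = []
    for size in SIZES:
        repo_ids.append(f"{org}/InternVL2_5-{size}")
        repo_ids.append(f"{org}/InternVL2_5-{size}-MPO")
    return repo_ids


def download_all(target_root: str = "pretrained") -> None:
    """Download every checkpoint into target_root/<model name>."""
    # Imported lazily so model_repo_ids() works without huggingface_hub.
    from huggingface_hub import snapshot_download

    for repo_id in model_repo_ids():
        local_dir = Path(target_root) / repo_id.split("/")[1]
        snapshot_download(repo_id=repo_id, local_dir=str(local_dir))


# download_all()  # uncomment to start the downloads (several hundred GB in total)
```

Restricting `SIZES` to the models you plan to evaluate avoids fetching the larger checkpoints unnecessarily.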

The directory structure is:

pretrained
├── InternVL2_5-1B
├── InternVL2_5-1B-MPO
├── InternVL2_5-2B
├── InternVL2_5-2B-MPO
├── InternVL2_5-4B
├── InternVL2_5-4B-MPO
├── InternVL2_5-8B
├── InternVL2_5-8B-MPO
├── InternVL2_5-26B
├── InternVL2_5-26B-MPO
├── InternVL2_5-38B
├── InternVL2_5-38B-MPO
├── InternVL2_5-78B
└── InternVL2_5-78B-MPO
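
Before launching a long evaluation, it can help to confirm that this layout is in place. The sketch below is one way to do so; the helper name `missing_models` is hypothetical, and it only checks that each directory exists, not that the download completed.

```python
from pathlib import Path

# Directory names expected under pretrained/, as shown in the tree above.
EXPECTED = [
    f"InternVL2_5-{size}{suffix}"
    for size in ("1B", "2B", "4B", "8B", "26B", "38B", "78B")
    for suffix in ("", "-MPO")
]


def missing_models(root: str = "pretrained") -> list[str]:
    """Return the expected model directories that are absent under root."""
    root_path = Path(root)
    return [name for name in EXPECTED if not (root_path / name).is_dir()]


for name in missing_models():
    print(f"missing: pretrained/{name}")
```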

Evaluation using VLMEvalKit Codebase

We evaluate the performance on most benchmarks (e.g., MMVet, LLaVABench, and CRPE) using VLMEvalKit. To activate the CoT prompt, set the environment variable USE_COT="1".
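
As an illustration, the variable can be set for a single run without changing the shell environment. This is a minimal sketch: the helper name `build_eval_invocation` is hypothetical, and it assumes `run.py` is invoked from the VLMEvalKit checkout with the `--data`, `--model`, and `--verbose` flags used in the evaluation script in this section.

```python
import os


def build_eval_invocation(model: str, datasets: list[str], use_cot: bool = True):
    """Return (command, environment) for one VLMEvalKit run."""
    env = dict(os.environ)
    if use_cot:
        env["USE_COT"] = "1"  # the variable named in the text above
    else:
        env.pop("USE_COT", None)
    cmd = ["python", "-u", "run.py", "--data", *datasets, "--model", model, "--verbose"]
    return cmd, env


cmd, env = build_eval_invocation("InternVL2-5-8B-MPO", ["MMVet"])
# To launch for real from the VLMEvalKit checkout:
#   import subprocess; subprocess.run(cmd, env=env, check=True)
```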

Data Preparation

VLMEvalKit will automatically download the data for evaluation, so you do not need to prepare it manually.

Evaluation on Different Benchmarks

To evaluate our models on different benchmarks, you can refer to the following script:

#!/bin/bash
set -x
PARTITION=${PARTITION:-"Intern5"}
GPUS=${GPUS:-64}
GPUS_PER_NODE=${GPUS_PER_NODE:-8}
GPUS_PER_TASK=${GPUS_PER_TASK:-1}
QUOTA_TYPE=${QUOTA_TYPE:-"reserved"}

declare -a models=( \
  "InternVL2-5-1B" \
  "InternVL2-5-1B-MPO" \
  "InternVL2-5-2B" \
  "InternVL2-5-2B-MPO" \
  "InternVL2-5-4B" \
  "InternVL2-5-4B-MPO" \
  "InternVL2-5-8B" \
  "InternVL2-5-8B-MPO" \
  "InternVL2-5-38B" \
  "InternVL2-5-38B-MPO" \
  "InternVL2-5-78B" \
  "InternVL2-5-78B-MPO" \
)

datasets="MMBench_TEST_EN_V11 MMStar MMMU_DEV_VAL MathVista_MINI HallusionBench AI2D_TEST OCRBench MMVet"
LOG_DIR="logs_eval"

export OPENAI_API_KEY="xxx"

for ((i=0; i<${#models[@]}; i++)); do

  model=${models[i]}

  if [[ "$model" =~ 38B|78B ]]; then
      GPUS_PER_TASK=8
  else
      GPUS_PER_TASK=1
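  fi

  # Create the log directory first; Slurm does not create missing output directories.
  mkdir -p "${LOG_DIR}/${model}"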

  srun -p ${PARTITION} \
    --gres=gpu:${GPUS_PER_NODE} \
    --ntasks=$((GPUS / GPUS_PER_TASK)) \
    --ntasks-per-node=$((GPUS_PER_NODE / GPUS_PER_TASK)) \
    --quotatype=${QUOTA_TYPE} \
    --job-name="eval_wwy" \
    -o "${LOG_DIR}/${model}/evaluation.log" \
    -e "${LOG_DIR}/${model}/evaluation.log" \
    --async \
  python -u run.py \
    --data ${datasets} \
    --model ${model} \
    --verbose

done
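
The task layout in the script follows from two integer divisions. The sketch below mirrors that arithmetic in Python to make the resulting Slurm allocation explicit; the function name `slurm_layout` is hypothetical, and the defaults match the script's `GPUS=64` and `GPUS_PER_NODE=8`.

```python
import re


def slurm_layout(model: str, gpus: int = 64, gpus_per_node: int = 8) -> dict:
    """Mirror the script's arithmetic for one model name."""
    # Same rule as the bash test [[ "$model" =~ 38B|78B ]].
    gpus_per_task = 8 if re.search(r"38B|78B", model) else 1
    return {
        "gpus_per_task": gpus_per_task,
        "ntasks": gpus // gpus_per_task,                    # --ntasks
        "ntasks_per_node": gpus_per_node // gpus_per_task,  # --ntasks-per-node
    }


for name in ("InternVL2-5-8B", "InternVL2-5-78B-MPO"):
    print(name, slurm_layout(name))
```

With the defaults, models up to 8B run as 64 single-GPU tasks (8 per node), while the 38B and 78B models run as 8 tasks that each occupy a full 8-GPU node.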

Note that VLMEvalKit does not officially support launching evaluation tasks with Slurm. You need to modify the run.py script to support the Slurm launcher as follows:

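# Imports used by init_dist (skip any that run.py already provides)
import os
import subprocess

import torch


def init_dist():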
    if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ:
        pass
    elif 'SLURM_PROCID' in os.environ:
        rank = int(os.getenv('SLURM_PROCID', '0'))
        world_size = int(os.getenv('SLURM_NTASKS', '1'))
        local_rank = rank % torch.cuda.device_count()

        os.environ['RANK'] = str(rank)
        os.environ['LOCAL_RANK'] = str(local_rank)
        os.environ['WORLD_SIZE'] = str(world_size)

        if 'MASTER_ADDR' not in os.environ:
            node_list = os.environ["SLURM_NODELIST"]
            addr = subprocess.getoutput(f"scontrol show hostname {node_list} | head -n1")
            os.environ['MASTER_ADDR'] = addr
        if 'MASTER_PORT' not in os.environ:
            os.environ['MASTER_PORT'] = '22110'

...

if __name__ == '__main__':
    load_env()
    init_dist()
    main()
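
To see what `init_dist` derives from the Slurm variables, the rank mapping can be isolated into a pure function and tried without a cluster. This is an illustrative sketch only: `slurm_to_dist_env` is a hypothetical name, and its `gpus_visible` parameter stands in for `torch.cuda.device_count()`.

```python
def slurm_to_dist_env(env: dict, gpus_visible: int) -> dict:
    """Derive RANK, LOCAL_RANK and WORLD_SIZE from Slurm variables."""
    if "RANK" in env and "WORLD_SIZE" in env:
        return {}  # another launcher already set them; nothing to do
    if "SLURM_PROCID" not in env:
        return {}  # not running under Slurm
    rank = int(env.get("SLURM_PROCID", "0"))
    world_size = int(env.get("SLURM_NTASKS", "1"))
    return {
        "RANK": str(rank),
        "LOCAL_RANK": str(rank % gpus_visible),
        "WORLD_SIZE": str(world_size),
    }


# Task 11 of 64 on nodes with 8 visible GPUs maps to local rank 3.
print(slurm_to_dist_env({"SLURM_PROCID": "11", "SLURM_NTASKS": "64"}, gpus_visible=8))
```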

Please refer to the VLMEvalKit documentation for more details.

Citation

If you find this project useful in your research, please consider citing:

@article{chen2024expanding,
  title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
  author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
  journal={arXiv preprint arXiv:2412.05271},
  year={2024}
}
@article{wang2024mpo,
  title={Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization},
  author={Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Zhu, Jinguo and Zhu, Xizhou and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2411.10442},
  year={2024}
}
