Evaluation of InternVL2 Series

To evaluate the performance of the InternVL2 series across various tasks, follow the instructions for each specific dataset. Ensure that the appropriate number of GPUs is allocated as specified.

1. We use both the InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results reported for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were obtained with the InternVL repository, while OCRBench, RealWorldQA, HallBench, and MathVista were evaluated with VLMEvalKit.

2. Evaluating the same model with different toolkits, such as InternVL and VLMEvalKit, can produce slightly different results, which is normal. Updates to code versions and variations in environment and hardware can also cause minor discrepancies.

3. The dataset descriptions were generated by GPT-4 and may contain errors.

Model Preparation

model name	type	param	download	size
InternVL2-1B	MLLM	0.9B	🤗 HF link	1.8 GB
InternVL2-2B	MLLM	2.2B	🤗 HF link	4.2 GB
InternVL2-4B	MLLM	4.2B	🤗 HF link	7.8 GB
InternVL2-8B	MLLM	8.1B	🤗 HF link	16 GB
InternVL2-26B	MLLM	25.5B	🤗 HF link	48 GB
InternVL2-40B	MLLM	40.1B	🤗 HF link	75 GB
InternVL2-Llama3-76B	MLLM	76.3B	🤗 HF link	143 GB

Before evaluation, download the trained models we provide:

cd pretrained/
# pip install -U huggingface_hub
# Download OpenGVLab/InternVL2-1B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-1B --local-dir InternVL2-1B
# Download OpenGVLab/InternVL2-2B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-2B --local-dir InternVL2-2B
# Download OpenGVLab/InternVL2-4B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-4B --local-dir InternVL2-4B
# Download OpenGVLab/InternVL2-8B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-8B --local-dir InternVL2-8B
# Download OpenGVLab/InternVL2-26B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-26B --local-dir InternVL2-26B
# Download OpenGVLab/InternVL2-40B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-40B --local-dir InternVL2-40B
# Download OpenGVLab/InternVL2-Llama3-76B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-Llama3-76B --local-dir InternVL2-Llama3-76B

The directory structure is:

pretrained
├── InternVL2-1B
├── InternVL2-2B
├── InternVL2-4B
├── InternVL2-8B
├── InternVL2-26B
├── InternVL2-40B
└── InternVL2-Llama3-76B
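The same downloads can also be scripted from Python. The following is a minimal sketch, not part of the InternVL tooling: the helper names `download_plan` and `download_all` are hypothetical, and it assumes the `huggingface_hub` package is installed and that you run it from the repository root so that `pretrained/` resolves as shown above.

```python
from pathlib import Path

# Model names copied from the table above; all are hosted under OpenGVLab.
MODEL_NAMES = [
    "InternVL2-1B",
    "InternVL2-2B",
    "InternVL2-4B",
    "InternVL2-8B",
    "InternVL2-26B",
    "InternVL2-40B",
    "InternVL2-Llama3-76B",
]


def download_plan(root="pretrained"):
    """Return (repo_id, local_dir) pairs matching the directory layout above."""
    return [(f"OpenGVLab/{name}", Path(root) / name) for name in MODEL_NAMES]


def download_all(root="pretrained"):
    """Download every checkpoint; needs network access and huggingface_hub."""
    # Imported lazily so that download_plan() works without the package.
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in download_plan(root):
        snapshot_download(repo_id=repo_id, local_dir=str(local_dir))


if __name__ == "__main__":
    for repo_id, local_dir in download_plan():
        print(f"{repo_id} -> {local_dir}")
    # Uncomment to start the downloads (roughly 300 GB in total per the table).
    # download_all()
```

Printing the plan first makes it easy to trim `MODEL_NAMES` to the checkpoints you actually need before starting the downloads.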

Evaluation using InternVL Codebase

Data Preparation

Please prepare the evaluation data according to the guidance provided here.

MME

MME is a comprehensive benchmark designed to evaluate Multimodal Large Language Models (MLLMs) on both perception and cognition abilities across 14 different subtasks, ensuring robust and diverse testing of these models.

Please use the following command to perform the test with 1 GPU:

GPUS=1 sh evaluate.sh pretrained/InternVL2-1B mme --dynamic

The expected test results are:

=========== Perception ===========
total score: 1346.1990796318528

         existence  score: 175.0
         count  score: 113.33333333333334
         position  score: 135.0
         color  score: 138.33333333333331
         posters  score: 116.32653061224491
         celebrity  score: 144.70588235294116
         scene  score: 143.25
         landmark  score: 128.5
         artwork  score: 141.75
         OCR  score: 110.0


=========== Cognition ===========
total score: 448.2142857142857

         commonsense_reasoning  score: 95.71428571428571
         numerical_calculation  score: 57.5
         text_translation  score: 177.5
         code_reasoning  score: 117.5

Please use the following command to perform the test with 1 GPU:

GPUS=1 sh evaluate.sh pretrained/InternVL2-2B mme --dynamic

The expected test results are:

=========== Perception ===========
total score: 1439.6688675470189

         existence  score: 200.0
         count  score: 128.33333333333334
         position  score: 145.0
         color  score: 163.33333333333334
         posters  score: 131.97278911564626
         celebrity  score: 118.52941176470588
         scene  score: 157.0
         landmark  score: 154.0
         artwork  score: 146.5
         OCR  score: 95.0


=========== Cognition ===========
total score: 437.1428571428571

         commonsense_reasoning  score: 112.14285714285714
         numerical_calculation  score: 45.0
         text_translation  score: 177.5
         code_reasoning  score: 102.5

Please use the following command to perform the test with 1 GPU:

GPUS=1 sh evaluate.sh pretrained/InternVL2-4B mme --dynamic

The expected test results are:

=========== Perception ===========
total score: 1532.31662665066

         existence  score: 200.0
         count  score: 123.33333333333333
         position  score: 148.33333333333331
         color  score: 165.0
         posters  score: 155.78231292517006
         celebrity  score: 124.11764705882354
         scene  score: 158.75
         landmark  score: 165.0
         artwork  score: 144.5
         OCR  score: 147.5


=========== Cognition ===========
total score: 531.7857142857142

         commonsense_reasoning  score: 129.28571428571428
         numerical_calculation  score: 115.0
         text_translation  score: 170.0
         code_reasoning  score: 117.5

Please use the following command to perform the test with 1 GPU:

GPUS=1 sh evaluate.sh pretrained/InternVL2-8B mme --dynamic

The expected test results are:

=========== Perception ===========
total score: 1648.1331532613044

         existence  score: 190.0
         count  score: 158.33333333333331
         position  score: 163.33333333333334
         color  score: 175.0
         posters  score: 167.68707482993196
         celebrity  score: 148.52941176470586
         scene  score: 152.5
         landmark  score: 176.5
         artwork  score: 153.75
         OCR  score: 162.5


=========== Cognition ===========
total score: 562.1428571428571

         commonsense_reasoning  score: 147.14285714285714
         numerical_calculation  score: 87.5
         text_translation  score: 192.5
         code_reasoning  score: 135.0

Please use the following command to perform the test with 1 GPU:

GPUS=1 sh evaluate.sh pretrained/InternVL2-26B mme --dynamic

The expected test results are:

=========== Perception ===========
total score: 1720.0325130052022

         existence  score: 195.0
         count  score: 170.0
         position  score: 176.66666666666669
         color  score: 168.33333333333331
         posters  score: 176.87074829931973
         celebrity  score: 159.41176470588235
         scene  score: 154.0
         landmark  score: 179.5
         artwork  score: 162.75
         OCR  score: 177.5


=========== Cognition ===========
total score: 540.7142857142858

         commonsense_reasoning  score: 145.71428571428572
         numerical_calculation  score: 95.0
         text_translation  score: 185.0
         code_reasoning  score: 115.0

Please use the following command to perform the test with 8 GPUs:

GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mme --dynamic --auto

The expected test results are:

=========== Perception ===========
total score: 1715.390456182473

         existence  score: 185.0
         count  score: 175.0
         position  score: 158.33333333333331
         color  score: 188.33333333333331
         posters  score: 187.41496598639458
         celebrity  score: 162.05882352941177
         scene  score: 152.5
         landmark  score: 180.25
         artwork  score: 171.5
         OCR  score: 155.0


=========== Cognition ===========
total score: 599.6428571428571

         commonsense_reasoning  score: 152.14285714285714
         numerical_calculation  score: 125.0
         text_translation  score: 177.5
         code_reasoning  score: 145.0

Please use the following command to perform the test with 8 GPUs:

GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mme --dynamic --auto

The expected test results are:

=========== Perception ===========
total score: 1731.095538215286

         existence  score: 200.0
         count  score: 175.0
         position  score: 168.33333333333331
         color  score: 185.0
         posters  score: 186.39455782312925
         celebrity  score: 169.11764705882354
         scene  score: 152.0
         landmark  score: 182.0
         artwork  score: 173.25
         OCR  score: 140.0


=========== Cognition ===========
total score: 683.5714285714286

         commonsense_reasoning  score: 158.57142857142856
         numerical_calculation  score: 185.0
         text_translation  score: 177.5
         code_reasoning  score: 162.5
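To compare your own run against the expected MME output, it can help to parse the log and check that each section total equals the sum of its subtask scores. The sketch below does this for the InternVL2-1B output shown above. The `parse_mme` helper is hypothetical, and adding the perception and cognition totals into one number is an assumption based on how MME is commonly reported; this guide does not state it.

```python
import re

# Log text copied from the InternVL2-1B result above.
MME_LOG = """
=========== Perception ===========
total score: 1346.1990796318528

         existence  score: 175.0
         count  score: 113.33333333333334
         position  score: 135.0
         color  score: 138.33333333333331
         posters  score: 116.32653061224491
         celebrity  score: 144.70588235294116
         scene  score: 143.25
         landmark  score: 128.5
         artwork  score: 141.75
         OCR  score: 110.0


=========== Cognition ===========
total score: 448.2142857142857

         commonsense_reasoning  score: 95.71428571428571
         numerical_calculation  score: 57.5
         text_translation  score: 177.5
         code_reasoning  score: 117.5
"""


def parse_mme(log):
    """Return {section: {"total": float, "subtasks": {name: score}}}."""
    sections = {}
    current = None
    for raw_line in log.splitlines():
        line = raw_line.strip()
        header = re.fullmatch(r"=+\s*(\w+)\s*=+", line)
        if header:
            current = {"total": None, "subtasks": {}}
            sections[header.group(1)] = current
            continue
        score = re.fullmatch(r"(.+?)\s+score:\s*([\d.]+)", line)
        if score and current is not None:
            name, value = score.group(1), float(score.group(2))
            if name == "total":
                current["total"] = value
            else:
                current["subtasks"][name] = value
    return sections


result = parse_mme(MME_LOG)
for name, section in result.items():
    # Each section total should equal the sum of its subtask scores.
    subtotal = sum(section["subtasks"].values())
    assert abs(subtotal - section["total"]) < 1e-6, name

# Assumption: a single MME figure is the perception total plus the cognition total.
overall = result["Perception"]["total"] + result["Cognition"]["total"]
print(f"perception + cognition = {overall:.1f}")
```

Small deviations from the expected totals are normal, as noted at the top of this page; a failed consistency check, however, usually points to a truncated or mis-copied log.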

OKVQA

OKVQA (Outside Knowledge Visual Question Answering) is a dataset designed for visual question answering tasks that require external knowledge beyond what is visible in the image, featuring over 14,000 questions to evaluate the reasoning abilities of AI models.

Please use the following command to perform the test with 8 GPUs:

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-okvqa-val --dynamic

The expected test results are:

okvqa_val 0.48513674197383483

Please use the following command to perform the test with 8 GPUs:

GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-okvqa-val --dynamic

The expected test results are:

okvqa_val 0.5316290130796605

Please use the following command to perform the test with 8 GPUs:

GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-okvqa-val --dynamic

The expected test results are:

okvqa_val 0.6007530717399846

Please use the following command to perform the test with 8 GPUs:

GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-okvqa-val --dynamic

The expected test results are:

okvqa_val 0.6289734443123187

Please use the following command to perform the test with 8 GPUs:

GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-okvqa-val --dynamic

The expected test results are:

okvqa_val 0.6594530321046287

Please use the following command to perform the test with 8 GPUs:

GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-okvqa-val --dynamic --auto

The expected test results are:

okvqa_val 0.664288545382473

Please use the following command to perform the test with 8 GPUs:

GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-okvqa-val --dynamic --auto

The expected test results are:

okvqa_val 0.683432421720166
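The evaluation commands on this page follow one pattern: `GPUS=<n> sh evaluate.sh pretrained/<model> <dataset> --dynamic`, with `--auto` appended for InternVL2-40B and InternVL2-Llama3-76B. The sketch below generates those command lines for every model. The `build_command` and `as_shell` helpers are hypothetical and only mirror the commands shown here; they are not part of the InternVL codebase.

```python
import shlex

MODELS = [
    "InternVL2-1B",
    "InternVL2-2B",
    "InternVL2-4B",
    "InternVL2-8B",
    "InternVL2-26B",
    "InternVL2-40B",
    "InternVL2-Llama3-76B",
]

# Models whose commands on this page carry the extra --auto flag.
AUTO_MODELS = {"InternVL2-40B", "InternVL2-Llama3-76B"}


def build_command(model, dataset, gpus=8, extra_args=()):
    """Return (env, argv) for one evaluate.sh run, mirroring the commands above."""
    argv = ["sh", "evaluate.sh", f"pretrained/{model}", dataset, "--dynamic", *extra_args]
    if model in AUTO_MODELS:
        argv.append("--auto")
    return {"GPUS": str(gpus)}, argv


def as_shell(env, argv):
    """Render the command as a single shell line for logging or copy-pasting."""
    prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    return f"{prefix} {shlex.join(argv)}"


if __name__ == "__main__":
    for model in MODELS:
        print(as_shell(*build_command(model, "vqa-okvqa-val")))
```

To launch a run instead of printing it, the pair can be passed to `subprocess.run(argv, env={**os.environ, **env}, check=True)` from the directory that contains `evaluate.sh`.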

TextVQA

TextVQA is a dataset designed to evaluate visual question answering models by requiring them to read and reason about text present within images, containing 45,336 questions over 28,408 images from the OpenImages dataset.

The TextVQA dataset provides official OCR results, specifically Rosetta OCR tokens. During testing with InstructBLIP and LLaVA 1.5, the OCR results are input to the LLM as a prompt. If you want to input Rosetta OCR tokens, use the following command:

We do not use Rosetta OCR tokens. Run this command:

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-textvqa-val --dynamic

The expected test results are:

textvqa_val 0.7052400000000033

We do not use Rosetta OCR tokens. Run this command:

GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-textvqa-val --dynamic

The expected test results are:

textvqa_val 0.7335600000000035

We do not use Rosetta OCR tokens. Run this command:

GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-textvqa-val --dynamic

The expected test results are:

textvqa_val 0.7437000000000039

We do not use Rosetta OCR tokens. Run this command:

GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-textvqa-val --dynamic

The expected test results are:

textvqa_val 0.773740000000004

We do not use Rosetta OCR tokens. Run this command:

GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-textvqa-val --dynamic

The expected test results are:

textvqa_val 0.8228200000000048

We do not use Rosetta OCR tokens. Run this command:

GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-textvqa-val --dynamic --auto

The expected test results are:

textvqa_val 0.8301600000000046

We do not use Rosetta OCR tokens. Run this command:

GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-textvqa-val --dynamic --auto

The expected test results are:

textvqa_val 0.844100000000004

VizWiz

The VizWiz VQA dataset is a visual question answering dataset created to help answer visual questions posed by blind individuals. It contains over 31,000 visual questions, where users took a picture using a mobile phone and recorded a spoken question about it. Each question comes with 10 crowdsourced answers. This dataset addresses tasks such as predicting the answer to a visual question and determining whether a visual question can be answered.

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-vizwiz-val --dynamic

The expected test results are:

vizwiz_val 0.5306783977772626

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-vizwiz-test --dynamic

For the test set, submit the results to the evaluation server.

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-vizwiz-val --dynamic

The expected test results are:

vizwiz_val 0.47376707571196724

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-vizwiz-test --dynamic

For the test set, submit the results to the evaluation server.

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-vizwiz-val --dynamic

The expected test results are:

vizwiz_val 0.622088446399631

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-vizwiz-test --dynamic

For the test set, submit the results to the evaluation server.

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-vizwiz-val --dynamic

The expected test results are:

vizwiz_val 0.6290808057420708

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-vizwiz-test --dynamic

For the test set, submit the results to the evaluation server.

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-vizwiz-val --dynamic

The expected test results are:

vizwiz_val 0.6839083121092873

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-vizwiz-test --dynamic

For the test set, submit the results to the evaluation server.

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-vizwiz-val --dynamic --auto

The expected test results are:

vizwiz_val 0.6521880064829846

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-vizwiz-test --dynamic --auto

For the test set, submit the results to the evaluation server.

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-vizwiz-val --dynamic --auto

The expected test results are:

vizwiz_val 0.6767075711970381

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-vizwiz-test --dynamic --auto

For the test set, submit the results to the evaluation server.

ChartQA

The ChartQA dataset is a comprehensive benchmark for question answering about charts that involves both visual and logical reasoning. It includes a mix of 9.6K human-written questions and 23.1K machine-generated questions derived from chart summaries. This dataset is designed to evaluate models that can understand and analyze charts by answering complex questions that often require multiple logical and arithmetic operations, as well as referencing visual features of the charts.

The ChartQA dataset includes two test sets: chartqa_test_human and chartqa_test_augmented. The final score for model evaluation is calculated as the average of the scores on these two test sets:

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-chartqa-test --dynamic --max-num 12

The expected test results are:

['chartqa_test_human', {'relaxed_accuracy': 0.5392}]
['chartqa_test_augmented', {'relaxed_accuracy': 0.9184}]

result = (53.92 + 91.84) / 2 = 72.88

The ChartQA dataset includes two test sets: chartqa_test_human and chartqa_test_augmented. The final score for model evaluation is calculated as the average of the scores on these two test sets:

GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-chartqa-test --dynamic --max-num 12

The expected test results are:

['chartqa_test_human', {'relaxed_accuracy': 0.5952}]
['chartqa_test_augmented', {'relaxed_accuracy': 0.9296}]

result = (59.52 + 92.96) / 2 = 76.24

The ChartQA dataset includes two test sets: chartqa_test_human and chartqa_test_augmented. The final score for model evaluation is calculated as the average of the scores on these two test sets:

GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-chartqa-test --dynamic --max-num 12

The expected test results are:

['chartqa_test_human', {'relaxed_accuracy': 0.6992}]
['chartqa_test_augmented', {'relaxed_accuracy': 0.9304}]

result = (69.92 + 93.04) / 2 = 81.48

The ChartQA dataset includes two test sets: chartqa_test_human and chartqa_test_augmented. The final score for model evaluation is calculated as the average of the scores on these two test sets:

GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-chartqa-test --dynamic --max-num 12

The expected test results are:

['chartqa_test_human', {'relaxed_accuracy': 0.7288}]
['chartqa_test_augmented', {'relaxed_accuracy': 0.9368}]

result = (72.88 + 93.68) / 2 = 83.28

The ChartQA dataset includes two test sets: chartqa_test_human and chartqa_test_augmented. The final score for model evaluation is calculated as the average of the scores on these two test sets:

GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-chartqa-test --dynamic --max-num 12

The expected test results are:

['chartqa_test_human', {'relaxed_accuracy': 0.7528}]
['chartqa_test_augmented', {'relaxed_accuracy': 0.9448}]

result = (75.28 + 94.48) / 2 = 84.88

The ChartQA dataset includes two test sets: chartqa_test_human and chartqa_test_augmented. The final score for model evaluation is calculated as the average of the scores on these two test sets:

GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-chartqa-test --dynamic --max-num 12 --auto

The expected test results are:

['chartqa_test_human', {'relaxed_accuracy': 0.772}]
['chartqa_test_augmented', {'relaxed_accuracy': 0.952}]

result = (77.2 + 95.2) / 2 = 86.2

The ChartQA dataset includes two test sets: chartqa_test_human and chartqa_test_augmented. The final score for model evaluation is calculated as the average of the scores on these two test sets:

GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-chartqa-test --dynamic --max-num 12 --auto

The expected test results are:

['chartqa_test_human', {'relaxed_accuracy': 0.816}]
['chartqa_test_augmented', {'relaxed_accuracy': 0.952}]

result = (81.6 + 95.2) / 2 = 88.4
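The averaging used for the final ChartQA score can be reproduced with a few lines of Python. This is a small sketch for checking your own numbers; the `chartqa_score` helper is hypothetical, and the accuracy pairs are copied from the expected outputs above.

```python
# relaxed_accuracy values copied from the outputs above: (human, augmented).
CHARTQA = {
    "InternVL2-1B": (0.5392, 0.9184),
    "InternVL2-2B": (0.5952, 0.9296),
    "InternVL2-4B": (0.6992, 0.9304),
    "InternVL2-8B": (0.7288, 0.9368),
    "InternVL2-26B": (0.7528, 0.9448),
    "InternVL2-40B": (0.772, 0.952),
    "InternVL2-Llama3-76B": (0.816, 0.952),
}


def chartqa_score(human, augmented):
    """Average the two test-set accuracies and express the result in percent."""
    return round((human + augmented) / 2 * 100, 2)


if __name__ == "__main__":
    for model, (human, augmented) in CHARTQA.items():
        print(f"{model}: {chartqa_score(human, augmented)}")
```

Replacing a pair with the values from your own run gives the number to compare against the `result` lines above.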

DocVQA

The DocVQA dataset consists of 50,000 questions on 12,000+ document images. It is designed for visual question answering tasks where questions are answered using text within the document images. The dataset includes OCR transcriptions and ground truth answers, supporting evaluation of models that interpret and extract information from documents.

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-docvqa-val --dynamic --max-num 18

The expected test results are:

Overall ANLS: 0.7999

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-docvqa-test --dynamic --max-num 18

For the test set, submit the results to the evaluation server.

The expected test results are:

Overall ANLS: 0.8170

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-docvqa-val --dynamic --max-num 18

The expected test results are:

Overall ANLS: 0.8590

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-docvqa-test --dynamic --max-num 18

For the test set, submit the results to the evaluation server.

The expected test results are:

Overall ANLS: 0.8690

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-docvqa-val --dynamic --max-num 18

The expected test results are:

Overall ANLS: 0.8809

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-docvqa-test --dynamic --max-num 18

For the test set, submit the results to the evaluation server.

The expected test results are:

Overall ANLS: 0.8920

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-docvqa-val --dynamic --max-num 18

The expected test results are:

Overall ANLS: 0.9081

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-docvqa-test --dynamic --max-num 18

For the test set, submit the results to the evaluation server.

The expected test results are:

Overall ANLS: 0.9160

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-docvqa-val --dynamic --max-num 18

The expected test results are:

Overall ANLS: 0.9212

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-docvqa-test --dynamic --max-num 18

For the test set, submit the results to the evaluation server.

The expected test results are:

Overall ANLS: 0.9290

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-docvqa-val --dynamic --max-num 18 --auto

The expected test results are:

Overall ANLS: 0.9373

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-docvqa-test --dynamic --max-num 18 --auto

For the test set, submit the results to the evaluation server.

The expected test results are:

Overall ANLS: 0.9390

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-docvqa-val --dynamic --max-num 18 --auto

The expected test results are:

Overall ANLS: 0.9417

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-docvqa-test --dynamic --max-num 18 --auto

For the test set, submit the results to the evaluation server.

The expected test results are:

Overall ANLS: 0.9410
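The scores above are reported as ANLS (Average Normalized Levenshtein Similarity). As a rough illustration of what that number measures, here is a minimal sketch of the commonly used definition, in which a prediction scores 1 minus its normalized edit distance to the closest reference answer, or 0 when that distance reaches 0.5. The threshold, the lowercasing, and the whitespace stripping are assumptions of this sketch; the evaluation script and the evaluation server may normalize answers differently.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute cost 1)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def anls(predictions, references, threshold=0.5):
    """Mean over questions of the best similarity against any reference answer."""
    scores = []
    for prediction, answers in zip(predictions, references):
        pred = prediction.strip().lower()
        best = 0.0
        for answer in answers:
            gold = answer.strip().lower()
            longest = max(len(pred), len(gold))
            distance = levenshtein(pred, gold) / longest if longest else 0.0
            # A normalized distance at or above the threshold counts as a miss.
            similarity = 1.0 - distance if distance < threshold else 0.0
            best = max(best, similarity)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    predictions = ["2015", "total amount"]
    references = [["2016"], ["Total Amount", "amount"]]
    print(anls(predictions, references))
```

In this toy example the first answer is off by one character out of four (similarity 0.75) and the second matches a reference exactly, so near misses earn partial credit while unrelated answers earn none.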

AI2D

The AI2D dataset contains over 5,000 grade school science diagrams with extensive annotations and 15,000 multiple-choice questions for research on diagram understanding and question answering.

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-ai2d-test --dynamic

The expected test results are:

ai2diagram_test {'accuracy': 0.6408678756476683}

GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-ai2d-test --dynamic

The expected test results are:

ai2diagram_test {'accuracy': 0.7409326424870466}

GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-ai2d-test --dynamic

The expected test results are:

ai2diagram_test {'accuracy': 0.788860103626943}

GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-ai2d-test --dynamic

The expected test results are:

ai2diagram_test {'accuracy': 0.8377590673575129}

GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-ai2d-test --dynamic

The expected test results are:

ai2diagram_test {'accuracy': 0.844559585492228}

GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-ai2d-test --dynamic --auto

The expected test results are:

ai2diagram_test {'accuracy': 0.8711139896373057}

GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-ai2d-test --dynamic --auto

The expected test results are:

ai2diagram_test {'accuracy': 0.8762953367875648}

InfographicVQA

The InfographicVQA dataset is a collection of infographics accompanied by natural language questions and answers. This dataset includes a diverse range of infographics sourced from thousands of different websites, ensuring a variety of layouts and designs. It comprises 30,035 questions across 5,485 images, split into training, validation, and test sets.

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-infovqa-val --dynamic --max-num 24

The expected test results are:

Overall ANLS: 0.5018

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-infovqa-test --dynamic --max-num 24

For the test set, submit the results to the evaluation server.

The expected test results are:

Overall ANLS: 0.5090

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-infovqa-val --dynamic --max-num 24

The expected test results are:

Overall ANLS: 0.5766

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-infovqa-test --dynamic --max-num 24

For the test set, submit the results to the evaluation server.

The expected test results are:

Overall ANLS: 0.5890

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-infovqa-val --dynamic --max-num 24

The expected test results are:

Overall ANLS: 0.6625

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-infovqa-test --dynamic --max-num 24

For the test set, submit the results to the evaluation server.

The expected test results are:

Overall ANLS: 0.6700

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-infovqa-val --dynamic --max-num 24

The expected test results are:

Overall ANLS: 0.7260

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-infovqa-test --dynamic --max-num 24

For the test set, submit the results to the evaluation server.

The expected test results are:

Overall ANLS: 0.7480

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-infovqa-val --dynamic --max-num 24

The expected test results are:

Overall ANLS: 0.7601

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-infovqa-test --dynamic --max-num 24

For the test set, submit the results to the evaluation server.

The expected test results are:

Overall ANLS: 0.7590

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-infovqa-val --dynamic --max-num 24 --auto

The expected test results are:

Overall ANLS: 0.7851

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-infovqa-test --dynamic --max-num 24 --auto

For the test set, submit the results to the evaluation server.

The expected test results are:

Overall ANLS: 0.7870

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-infovqa-val --dynamic --max-num 24 --auto

The expected test results are:

Overall ANLS: 0.8021

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-infovqa-test --dynamic --max-num 24 --auto

For the test set, submit the results to the evaluation server.

The expected test results are:

Overall ANLS: 0.8200

GQA

The GQA dataset is a large-scale visual question answering dataset designed for real-world visual reasoning and compositional question answering. It contains over 22 million questions grounded in real images, each accompanied by detailed scene graphs that describe objects, their attributes, and relationships within the scene. The dataset includes images from the Visual Genome dataset, with questions that require various reasoning skills such as spatial understanding and multi-step inference.

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-gqa-testdev --dynamic

The expected test results are:

Accuracy: 59.77%

GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-gqa-testdev --dynamic

The expected test results are:

Accuracy: 61.03%

GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-gqa-testdev --dynamic

The expected test results are:

Accuracy: 62.07%

GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-gqa-testdev --dynamic

The expected test results are:

Accuracy: 63.23%

GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-gqa-testdev --dynamic

The expected test results are:

Accuracy: 64.89%

GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-gqa-testdev --dynamic --auto

The expected test results are:

Accuracy: 64.89%

GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-gqa-testdev --dynamic --auto

The expected test results are:

Accuracy: 65.22%

POPE

The POPE (Polling-based Object Probing Evaluation) dataset is designed to evaluate object hallucination in MLLMs. The dataset consists of 3,000 questions related to the captions of 500 images. By treating the MLLMs’ answers to these questions as a binary classification task, the dataset allows researchers to measure accuracy, precision, recall, and F1 scores to determine the extent of hallucination in the models.

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B pope --dynamic

The expected test results are:

Category: random, # samples: 2910
TP      FP      TN      FN
1239    51      1359    261
Accuracy: 0.8927835051546392
Precision: 0.9604651162790697
Recall: 0.826
F1 score: 0.8881720430107527
Yes ratio: 0.44329896907216493
0.888, 0.893, 0.960, 0.826, 0.443
====================================
Category: popular, # samples: 3000
TP      FP      TN      FN
1239    93      1407    261
Accuracy: 0.882
Precision: 0.9301801801801802
Recall: 0.826
F1 score: 0.875
Yes ratio: 0.444
0.875, 0.882, 0.930, 0.826, 0.444
====================================
Category: adversarial, # samples: 3000
TP      FP      TN      FN
1239    151     1349    261
Accuracy: 0.8626666666666667
Precision: 0.8913669064748202
Recall: 0.826
F1 score: 0.8574394463667819
Yes ratio: 0.4633333333333333
0.857, 0.863, 0.891, 0.826, 0.463
====================================

result = (88.8 + 87.5 + 85.7) / 3 = 87.3
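Every line of the POPE output follows from the four confusion-matrix counts, and the final result is the mean of the three F1 scores after rounding each to one decimal in percent. The sketch below recomputes the InternVL2-1B numbers shown above. The `pope_metrics` helper is hypothetical and uses the standard binary-classification formulas, with "yes" as the positive class.

```python
def pope_metrics(tp, fp, tn, fn):
    """Binary-classification metrics, treating a "yes" answer as positive."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "yes_ratio": (tp + fp) / total,  # share of questions answered "yes"
    }


# (TP, FP, TN, FN) copied from the InternVL2-1B output above.
COUNTS = {
    "random": (1239, 51, 1359, 261),
    "popular": (1239, 93, 1407, 261),
    "adversarial": (1239, 151, 1349, 261),
}

f1_percent = []
for category, counts in COUNTS.items():
    metrics = pope_metrics(*counts)
    f1_percent.append(round(metrics["f1"] * 100, 1))
    print(category, {key: round(value, 3) for key, value in metrics.items()})

# The reported result averages the three rounded F1 scores.
final_score = round(sum(f1_percent) / len(f1_percent), 1)
print(final_score)
```

Recall is identical across the three categories because they share the same positive questions; only the negative questions, and therefore FP and TN, change between the random, popular, and adversarial splits.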

GPUS=8 sh evaluate.sh pretrained/InternVL2-2B pope --dynamic

The expected test results are:

Category: random, # samples: 2910
TP      FP      TN      FN
1256    39      1371    244
Accuracy: 0.9027491408934708
Precision: 0.9698841698841699
Recall: 0.8373333333333334
F1 score: 0.898747763864043
Yes ratio: 0.44501718213058417
0.899, 0.903, 0.970, 0.837, 0.445
====================================
Category: popular, # samples: 3000
TP      FP      TN      FN
1256    89      1411    244
Accuracy: 0.889
Precision: 0.9338289962825279
Recall: 0.8373333333333334
F1 score: 0.8829525483304044
Yes ratio: 0.4483333333333333
0.883, 0.889, 0.934, 0.837, 0.448
====================================
Category: adversarial, # samples: 3000
TP      FP      TN      FN
1256    139     1361    244
Accuracy: 0.8723333333333333
Precision: 0.9003584229390681
Recall: 0.8373333333333334
F1 score: 0.8677029360967184
Yes ratio: 0.465
0.868, 0.872, 0.900, 0.837, 0.465
====================================

result = (89.9 + 88.3 + 86.8) / 3 = 88.3

GPUS=8 sh evaluate.sh pretrained/InternVL2-4B pope --dynamic

The expected test results are:

Category: random, # samples: 2910
TP      FP      TN      FN
1247    54      1356    253
Accuracy: 0.8945017182130585
Precision: 0.9584934665641814
Recall: 0.8313333333333334
F1 score: 0.8903962870403428
Yes ratio: 0.4470790378006873
0.890, 0.895, 0.958, 0.831, 0.447
====================================
Category: popular, # samples: 3000
TP      FP      TN      FN
1247    116     1384    253
Accuracy: 0.877
Precision: 0.9148936170212766
Recall: 0.8313333333333334
F1 score: 0.8711142158574922
Yes ratio: 0.4543333333333333
0.871, 0.877, 0.915, 0.831, 0.454
====================================
Category: adversarial, # samples: 3000
TP      FP      TN      FN
1247    175     1325    253
Accuracy: 0.8573333333333333
Precision: 0.8769338959212377
Recall: 0.8313333333333334
F1 score: 0.8535249828884327
Yes ratio: 0.474
0.854, 0.857, 0.877, 0.831, 0.474
====================================

result = (89.0 + 87.1 + 85.4) / 3 = 87.2

GPUS=8 sh evaluate.sh pretrained/InternVL2-8B pope --dynamic

The expected test results are:

Category: random, # samples: 2910
TP      FP      TN      FN
1204    29      1381    296
Accuracy: 0.8883161512027491
Precision: 0.9764801297648013
Recall: 0.8026666666666666
F1 score: 0.8810830589096232
Yes ratio: 0.42371134020618556
0.881, 0.888, 0.976, 0.803, 0.424
====================================
Category: popular, # samples: 3000
TP      FP      TN      FN
1204    67      1433    296
Accuracy: 0.879
Precision: 0.9472856018882769
Recall: 0.8026666666666666
F1 score: 0.8690003608805486
Yes ratio: 0.4236666666666667
0.869, 0.879, 0.947, 0.803, 0.424
====================================
Category: adversarial, # samples: 3000
TP      FP      TN      FN
1204    101     1399    296
Accuracy: 0.8676666666666667
Precision: 0.9226053639846743
Recall: 0.8026666666666666
F1 score: 0.8584670231729055
Yes ratio: 0.435
0.858, 0.868, 0.923, 0.803, 0.435
====================================

result = (88.1 + 86.9 + 85.8) / 3 = 86.9

GPUS=8 sh evaluate.sh pretrained/InternVL2-26B pope --dynamic

The expected test results are:

Category: random, # samples: 2910
TP      FP      TN      FN
1221    25      1385    279
Accuracy: 0.89553264604811
Precision: 0.9799357945425361
Recall: 0.814
F1 score: 0.8892935178441369
Yes ratio: 0.4281786941580756
0.889, 0.896, 0.980, 0.814, 0.428
====================================
Category: popular, # samples: 3000
TP      FP      TN      FN
1221    57      1443    279
Accuracy: 0.888
Precision: 0.9553990610328639
Recall: 0.814
F1 score: 0.8790496760259179
Yes ratio: 0.426
0.879, 0.888, 0.955, 0.814, 0.426
====================================
Category: adversarial, # samples: 3000
TP      FP      TN      FN
1221    84      1416    279
Accuracy: 0.879
Precision: 0.9356321839080459
Recall: 0.814
F1 score: 0.8705882352941177
Yes ratio: 0.435
0.871, 0.879, 0.936, 0.814, 0.435
====================================

result = (88.9 + 87.9 + 87.1) / 3 = 88.0

GPUS=8 sh evaluate.sh pretrained/InternVL2-40B pope --dynamic --auto

The expected test results are:

Category: random, # samples: 2910
TP      FP      TN      FN
1232    16      1394    268
Accuracy: 0.902405498281787
Precision: 0.9871794871794872
Recall: 0.8213333333333334
F1 score: 0.8966521106259098
Yes ratio: 0.4288659793814433
0.897, 0.902, 0.987, 0.821, 0.429
====================================
Category: popular, # samples: 3000
TP      FP      TN      FN
1232    65      1435    268
Accuracy: 0.889
Precision: 0.9498843484965305
Recall: 0.8213333333333334
F1 score: 0.8809438684304614
Yes ratio: 0.43233333333333335
0.881, 0.889, 0.950, 0.821, 0.432
====================================
Category: adversarial, # samples: 3000
TP      FP      TN      FN
1232    87      1413    268
Accuracy: 0.8816666666666667
Precision: 0.934040940106141
Recall: 0.8213333333333334
F1 score: 0.8740688187300462
Yes ratio: 0.43966666666666665
0.874, 0.882, 0.934, 0.821, 0.440
====================================

result = (89.7 + 88.1 + 87.4) / 3 = 88.4

GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B pope --dynamic --auto

The expected test results are:

Category: random, # samples: 2910
TP      FP      TN      FN
1251    26      1384    249
Accuracy: 0.9054982817869416
Precision: 0.9796397807361003
Recall: 0.834
F1 score: 0.9009722722362261
Yes ratio: 0.4388316151202749
0.901, 0.905, 0.980, 0.834, 0.439
====================================
Category: popular, # samples: 3000
TP      FP      TN      FN
1251    62      1438    249
Accuracy: 0.8963333333333333
Precision: 0.9527798933739527
Recall: 0.834
F1 score: 0.8894418769996445
Yes ratio: 0.43766666666666665
0.889, 0.896, 0.953, 0.834, 0.438
====================================
Category: adversarial, # samples: 3000
TP      FP      TN      FN
1251    91      1409    249
Accuracy: 0.8866666666666667
Precision: 0.9321907600596125
Recall: 0.834
F1 score: 0.8803659394792399
Yes ratio: 0.44733333333333336
0.880, 0.887, 0.932, 0.834, 0.447
====================================

result = (90.1 + 88.9 + 88.0) / 3 = 89.0
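
The `result` lines above are the mean of the three per-split F1 scores, expressed as percentages. As a minimal sketch (not the repository's own scoring code), the following recomputes the logged metrics from the confusion-matrix counts of the InternVL2-Llama3-76B run; the function name `pope_metrics` and the dictionary layout are illustrative assumptions.

```python
def pope_metrics(tp, fp, tn, fn):
    """Derive the metrics printed in the POPE log from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Share of questions the model answered "yes" to.
        "yes_ratio": (tp + fp) / total,
    }


# (TP, FP, TN, FN) copied from the InternVL2-Llama3-76B logs above.
splits = {
    "random": (1251, 26, 1384, 249),
    "popular": (1251, 62, 1438, 249),
    "adversarial": (1251, 91, 1409, 249),
}

f1_scores = {name: pope_metrics(*counts)["f1"] for name, counts in splits.items()}
for name, f1 in f1_scores.items():
    print(f"{name}: F1 = {f1:.3f}")

# Round each F1 to one decimal (as a percentage) before averaging,
# mirroring the arithmetic shown in the "result = ..." lines.
result = sum(round(100 * f1, 1) for f1 in f1_scores.values()) / len(f1_scores)
print(f"result = {result:.1f}")
```

The same function can be applied to the counts of any other model in this section to check its reported average.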

Tiny LVLM#

The Tiny LVLM-eHub is a streamlined evaluation benchmark designed to assess the multimodal capabilities of MLLMs, including models like Bard. It focuses on six categories of multimodal abilities: visual perception, visual knowledge acquisition, visual reasoning, visual commonsense, object hallucination, and embodied intelligence.

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B tiny_lvlm --dynamic

The expected test results are:

Visual_Knowledge_Acquisition: 0.6857142857142857
Object_Hallucination: 0.91
Visual_Commonsense: 0.556
Visual_Perception: 0.4875
Visual_Reasoning: 0.6145454545454545
Overall: 3.2537597402597402

GPUS=8 sh evaluate.sh pretrained/InternVL2-2B tiny_lvlm --dynamic

The expected test results are:

Visual_Knowledge_Acquisition: 0.71
Object_Hallucination: 0.91
Visual_Commonsense: 0.558
Visual_Perception: 0.4675
Visual_Reasoning: 0.649090909090909
Overall: 3.294590909090909

GPUS=8 sh evaluate.sh pretrained/InternVL2-4B tiny_lvlm --dynamic

The expected test results are:

Visual_Knowledge_Acquisition: 0.6814285714285714
Object_Hallucination: 0.89
Visual_Commonsense: 0.652
Visual_Perception: 0.4875
Visual_Reasoning: 0.6563636363636364
Overall: 3.3672922077922074

GPUS=8 sh evaluate.sh pretrained/InternVL2-8B tiny_lvlm --dynamic

The expected test results are:

Visual_Knowledge_Acquisition: 0.6985714285714286
Object_Hallucination: 0.8966666666666666
Visual_Commonsense: 0.652
Visual_Perception: 0.485
Visual_Reasoning: 0.6854545454545454
Overall: 3.417692640692641

GPUS=8 sh evaluate.sh pretrained/InternVL2-26B tiny_lvlm --dynamic

The expected test results are:

Visual_Knowledge_Acquisition: 0.7614285714285715
Object_Hallucination: 0.9
Visual_Commonsense: 0.652
Visual_Perception: 0.555
Visual_Reasoning: 0.7109090909090909
Overall: 3.5793376623376627

GPUS=8 sh evaluate.sh pretrained/InternVL2-40B tiny_lvlm --dynamic --auto

The expected test results are:

Visual_Knowledge_Acquisition: 0.75
Object_Hallucination: 0.8966666666666666
Visual_Commonsense: 0.674
Visual_Perception: 0.5325
Visual_Reasoning: 0.730909090909091
Overall: 3.5840757575757576

GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B tiny_lvlm --dynamic --auto

The expected test results are:

Visual_Knowledge_Acquisition: 0.7557142857142857
Object_Hallucination: 0.9166666666666666
Visual_Commonsense: 0.69
Visual_Perception: 0.525
Visual_Reasoning: 0.7418181818181818
Overall: 3.629199134199134
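
Note that the logs above list five category scores, and the `Overall` value appears to be their sum rather than their mean, so it ranges from 0 to 5. The short check below, using the InternVL2-Llama3-76B numbers, is only a sketch of that arithmetic; the variable names are illustrative.

```python
# Category scores copied from the InternVL2-Llama3-76B log above.
scores = {
    "Visual_Knowledge_Acquisition": 0.7557142857142857,
    "Object_Hallucination": 0.9166666666666666,
    "Visual_Commonsense": 0.69,
    "Visual_Perception": 0.525,
    "Visual_Reasoning": 0.7418181818181818,
}

# Summing the categories reproduces the logged "Overall" value.
overall = sum(scores.values())
print(f"Overall (sum of categories): {overall:.4f}")

# Dividing by the number of categories gives a 0-1 score that is
# easier to compare with other benchmarks.
print(f"Mean per category: {overall / len(scores):.4f}")
```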

MMMU#

The MMMU dataset is a comprehensive benchmark designed to evaluate multimodal models on college-level tasks that require domain-specific knowledge and reasoning. It includes 11,500 questions sourced from college exams, quizzes, and textbooks, spanning six disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions cover 30 subjects and feature 30 types of images, such as charts, diagrams, maps, tables, and more.

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmmu-val --dynamic

The expected test results are:

{'Overall-Art and Design': {'num': 120, 'acc': 0.383}, 'Art': {'num': 30, 'acc': 0.4}, 'Art_Theory': {'num': 30, 'acc': 0.4}, 'Design': {'num': 30, 'acc': 0.567}, 'Music': {'num': 30, 'acc': 0.167}, 'Overall-Business': {'num': 150, 'acc': 0.333}, 'Accounting': {'num': 30, 'acc': 0.333}, 'Economics': {'num': 30, 'acc': 0.433}, 'Finance': {'num': 30, 'acc': 0.067}, 'Manage': {'num': 30, 'acc': 0.367}, 'Marketing': {'num': 30, 'acc': 0.467}, 'Overall-Science': {'num': 150, 'acc': 0.3}, 'Biology': {'num': 30, 'acc': 0.267}, 'Chemistry': {'num': 30, 'acc': 0.233}, 'Geography': {'num': 30, 'acc': 0.367}, 'Math': {'num': 30, 'acc': 0.167}, 'Physics': {'num': 30, 'acc': 0.467}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.313}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.433}, 'Clinical_Medicine': {'num': 30, 'acc': 0.233}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.4}, 'Pharmacy': {'num': 30, 'acc': 0.3}, 'Public_Health': {'num': 30, 'acc': 0.2}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.483}, 'History': {'num': 30, 'acc': 0.4}, 'Literature': {'num': 30, 'acc': 0.667}, 'Sociology': {'num': 30, 'acc': 0.467}, 'Psychology': {'num': 30, 'acc': 0.4}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.348}, 'Agriculture': {'num': 30, 'acc': 0.233}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.367}, 'Computer_Science': {'num': 30, 'acc': 0.4}, 'Electronics': {'num': 30, 'acc': 0.4}, 'Energy_and_Power': {'num': 30, 'acc': 0.333}, 'Materials': {'num': 30, 'acc': 0.4}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.3}, 
'Overall': {'num': 900, 'acc': 0.354}}

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmmu-test --dynamic

For the test set, submit the results to the evaluation server.

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-2B mmmu-val --dynamic

The expected test results are:

{'Overall-Art and Design': {'num': 120, 'acc': 0.392}, 'Art': {'num': 30, 'acc': 0.467}, 'Art_Theory': {'num': 30, 'acc': 0.4}, 'Design': {'num': 30, 'acc': 0.5}, 'Music': {'num': 30, 'acc': 0.2}, 'Overall-Business': {'num': 150, 'acc': 0.347}, 'Accounting': {'num': 30, 'acc': 0.367}, 'Economics': {'num': 30, 'acc': 0.333}, 'Finance': {'num': 30, 'acc': 0.333}, 'Manage': {'num': 30, 'acc': 0.367}, 'Marketing': {'num': 30, 'acc': 0.333}, 'Overall-Science': {'num': 150, 'acc': 0.213}, 'Biology': {'num': 30, 'acc': 0.233}, 'Chemistry': {'num': 30, 'acc': 0.1}, 'Geography': {'num': 30, 'acc': 0.167}, 'Math': {'num': 30, 'acc': 0.367}, 'Physics': {'num': 30, 'acc': 0.2}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.373}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.433}, 'Clinical_Medicine': {'num': 30, 'acc': 0.4}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.4}, 'Pharmacy': {'num': 30, 'acc': 0.267}, 'Public_Health': {'num': 30, 'acc': 0.367}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.492}, 'History': {'num': 30, 'acc': 0.4}, 'Literature': {'num': 30, 'acc': 0.767}, 'Sociology': {'num': 30, 'acc': 0.433}, 'Psychology': {'num': 30, 'acc': 0.367}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.3}, 'Agriculture': {'num': 30, 'acc': 0.433}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.233}, 'Computer_Science': {'num': 30, 'acc': 0.233}, 'Electronics': {'num': 30, 'acc': 0.367}, 'Energy_and_Power': {'num': 30, 'acc': 0.233}, 'Materials': {'num': 30, 'acc': 0.4}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.2}, 
'Overall': {'num': 900, 'acc': 0.343}}

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-2B mmmu-test --dynamic

For the test set, submit the results to the evaluation server.

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-4B mmmu-val --dynamic

The expected test results are:

'Overall': {'num': 900, 'acc': 0.470}

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-4B mmmu-test --dynamic

For the test set, submit the results to the evaluation server.

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-8B mmmu-val --dynamic

The expected test results are:

{'Overall-Art and Design': {'num': 120, 'acc': 0.608}, 'Art': {'num': 30, 'acc': 0.733}, 'Art_Theory': {'num': 30, 'acc': 0.7}, 'Design': {'num': 30, 'acc': 0.733}, 'Music': {'num': 30, 'acc': 0.267}, 'Overall-Business': {'num': 150, 'acc': 0.453}, 'Accounting': {'num': 30, 'acc': 0.467}, 'Economics': {'num': 30, 'acc': 0.533}, 'Finance': {'num': 30, 'acc': 0.333}, 'Manage': {'num': 30, 'acc': 0.4}, 'Marketing': {'num': 30, 'acc': 0.533}, 'Overall-Science': {'num': 150, 'acc': 0.393}, 'Biology': {'num': 30, 'acc': 0.467}, 'Chemistry': {'num': 30, 'acc': 0.267}, 'Geography': {'num': 30, 'acc': 0.4}, 'Math': {'num': 30, 'acc': 0.5}, 'Physics': {'num': 30, 'acc': 0.333}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.507}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.567}, 'Clinical_Medicine': {'num': 30, 'acc': 0.667}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.467}, 'Pharmacy': {'num': 30, 'acc': 0.367}, 'Public_Health': {'num': 30, 'acc': 0.467}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.717}, 'History': {'num': 30, 'acc': 0.767}, 'Literature': {'num': 30, 'acc': 0.9}, 'Sociology': {'num': 30, 'acc': 0.7}, 'Psychology': {'num': 30, 'acc': 0.5}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.39}, 'Agriculture': {'num': 30, 'acc': 0.533}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.333}, 'Computer_Science': {'num': 30, 'acc': 0.5}, 'Electronics': {'num': 30, 'acc': 0.467}, 'Energy_and_Power': {'num': 30, 'acc': 0.4}, 'Materials': {'num': 30, 'acc': 0.233}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.267}, 
'Overall': {'num': 900, 'acc': 0.493}}

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-8B mmmu-test --dynamic

For the test set, submit the results to the evaluation server.

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-26B mmmu-val --dynamic

The expected test results are:

{'Overall-Art and Design': {'num': 120, 'acc': 0.7}, 'Art': {'num': 30, 'acc': 0.767}, 'Art_Theory': {'num': 30, 'acc': 0.867}, 'Design': {'num': 30, 'acc': 0.867}, 'Music': {'num': 30, 'acc': 0.3}, 'Overall-Business': {'num': 150, 'acc': 0.407}, 'Accounting': {'num': 30, 'acc': 0.467}, 'Economics': {'num': 30, 'acc': 0.3}, 'Finance': {'num': 30, 'acc': 0.333}, 'Manage': {'num': 30, 'acc': 0.5}, 'Marketing': {'num': 30, 'acc': 0.433}, 'Overall-Science': {'num': 150, 'acc': 0.373}, 'Biology': {'num': 30, 'acc': 0.6}, 'Chemistry': {'num': 30, 'acc': 0.2}, 'Geography': {'num': 30, 'acc': 0.5}, 'Math': {'num': 30, 'acc': 0.233}, 'Physics': {'num': 30, 'acc': 0.333}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.453}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.467}, 'Clinical_Medicine': {'num': 30, 'acc': 0.567}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.367}, 'Pharmacy': {'num': 30, 'acc': 0.367}, 'Public_Health': {'num': 30, 'acc': 0.5}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.7}, 'History': {'num': 30, 'acc': 0.7}, 'Literature': {'num': 30, 'acc': 0.9}, 'Sociology': {'num': 30, 'acc': 0.6}, 'Psychology': {'num': 30, 'acc': 0.6}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.39}, 'Agriculture': {'num': 30, 'acc': 0.467}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.267}, 'Computer_Science': {'num': 30, 'acc': 0.367}, 'Electronics': {'num': 30, 'acc': 0.367}, 'Energy_and_Power': {'num': 30, 'acc': 0.5}, 'Materials': {'num': 30, 'acc': 0.433}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.333}, 
'Overall': {'num': 900, 'acc': 0.483}}

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-26B mmmu-test --dynamic

For the test set, submit the results to the evaluation server.

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mmmu-val --dynamic --auto

The expected test results are:

{'Overall-Art and Design': {'num': 120, 'acc': 0.675}, 'Art': {'num': 30, 'acc': 0.733}, 'Art_Theory': {'num': 30, 'acc': 0.833}, 'Design': {'num': 30, 'acc': 0.767}, 'Music': {'num': 30, 'acc': 0.367}, 'Overall-Business': {'num': 150, 'acc': 0.44}, 'Accounting': {'num': 30, 'acc': 0.467}, 'Economics': {'num': 30, 'acc': 0.567}, 'Finance': {'num': 30, 'acc': 0.333}, 'Manage': {'num': 30, 'acc': 0.367}, 'Marketing': {'num': 30, 'acc': 0.467}, 'Overall-Science': {'num': 150, 'acc': 0.493}, 'Biology': {'num': 30, 'acc': 0.633}, 'Chemistry': {'num': 30, 'acc': 0.3}, 'Geography': {'num': 30, 'acc': 0.5}, 'Math': {'num': 30, 'acc': 0.5}, 'Physics': {'num': 30, 'acc': 0.533}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.593}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.5}, 'Clinical_Medicine': {'num': 30, 'acc': 0.6}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.4}, 'Pharmacy': {'num': 30, 'acc': 0.667}, 'Public_Health': {'num': 30, 'acc': 0.8}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.717}, 'History': {'num': 30, 'acc': 0.767}, 'Literature': {'num': 30, 'acc': 0.833}, 'Sociology': {'num': 30, 'acc': 0.6}, 'Psychology': {'num': 30, 'acc': 0.667}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.424}, 'Agriculture': {'num': 30, 'acc': 0.6}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.333}, 'Computer_Science': {'num': 30, 'acc': 0.467}, 'Electronics': {'num': 30, 'acc': 0.433}, 'Energy_and_Power': {'num': 30, 'acc': 0.467}, 'Materials': {'num': 30, 'acc': 0.3}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.367}, 
'Overall': {'num': 900, 'acc': 0.539}}

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mmmu-test --dynamic --auto

For the test set, submit the results to the evaluation server.

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mmmu-val --dynamic --auto

The expected test results are:

{'Overall-Art and Design': {'num': 120, 'acc': 0.683}, 'Art': {'num': 30, 'acc': 0.767}, 'Art_Theory': {'num': 30, 'acc': 0.933}, 'Design': {'num': 30, 'acc': 0.7}, 'Music': {'num': 30, 'acc': 0.333}, 'Overall-Business': {'num': 150, 'acc': 0.567}, 'Accounting': {'num': 30, 'acc': 0.5}, 'Economics': {'num': 30, 'acc': 0.567}, 'Finance': {'num': 30, 'acc': 0.433}, 'Manage': {'num': 30, 'acc': 0.633}, 'Marketing': {'num': 30, 'acc': 0.7}, 'Overall-Science': {'num': 150, 'acc': 0.413}, 'Biology': {'num': 30, 'acc': 0.467}, 'Chemistry': {'num': 30, 'acc': 0.3}, 'Geography': {'num': 30, 'acc': 0.433}, 'Math': {'num': 30, 'acc': 0.367}, 'Physics': {'num': 30, 'acc': 0.5}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.587}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.533}, 'Clinical_Medicine': {'num': 30, 'acc': 0.667}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.433}, 'Pharmacy': {'num': 30, 'acc': 0.6}, 'Public_Health': {'num': 30, 'acc': 0.7}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.725}, 'History': {'num': 30, 'acc': 0.733}, 'Literature': {'num': 30, 'acc': 0.867}, 'Sociology': {'num': 30, 'acc': 0.633}, 'Psychology': {'num': 30, 'acc': 0.667}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.443}, 'Agriculture': {'num': 30, 'acc': 0.6}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.367}, 'Computer_Science': {'num': 30, 'acc': 0.567}, 'Electronics': {'num': 30, 'acc': 0.433}, 'Energy_and_Power': {'num': 30, 'acc': 0.367}, 'Materials': {'num': 30, 'acc': 0.267}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.5}, 
'Overall': {'num': 900, 'acc': 0.552}}

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mmmu-test --dynamic --auto

For the test set, submit the results to the evaluation server.
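
When reading the validation dictionaries above, it can help to confirm that the `Overall` accuracy is the sample-weighted average of the six discipline-level accuracies. The sketch below does this for the InternVL2-Llama3-76B validation results; because the logged accuracies are already rounded to three decimals, the recomputed value is only expected to match after rounding. The `disciplines` mapping is an illustrative structure, not the format used by the evaluation script.

```python
# (num, acc) per discipline, copied from the InternVL2-Llama3-76B validation log.
disciplines = {
    "Art and Design": (120, 0.683),
    "Business": (150, 0.567),
    "Science": (150, 0.413),
    "Health and Medicine": (150, 0.587),
    "Humanities and Social Science": (120, 0.725),
    "Tech and Engineering": (210, 0.443),
}

total = sum(num for num, _ in disciplines.values())

# Weight each discipline by its number of questions.
overall = sum(num * acc for num, acc in disciplines.values()) / total

print(f"num = {total}, acc = {overall:.3f}")
```

A plain unweighted mean of the six accuracies would give a different number, since Tech and Engineering contributes 210 of the 900 questions.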

MMVet (GPT-4-0613)#

⚠️ Warning: Here, we use GPT-4-0613 as the judge model, whereas VLMEvalKit uses GPT-4-Turbo. Different versions of GPT-4 can produce significantly different scores, so evaluating the same model with the two codebases can yield notably different results.

The MM-Vet dataset is a comprehensive benchmark designed to evaluate the integrated capabilities of MLLMs. It encompasses six core vision-language (VL) capabilities: recognition, knowledge, optical character recognition (OCR), spatial awareness, language generation, and math. The dataset includes 200 images and 218 questions, each requiring one or more of these capabilities to answer. The evaluation uses an open-ended LLM-based approach, allowing assessment across various answer styles and question types.

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmvet --dynamic

Then, submit the results to the evaluation server. The expected test results are:

runs: [37.8]

GPUS=8 sh evaluate.sh pretrained/InternVL2-2B mmvet --dynamic

Then, submit the results to the evaluation server. The expected test results are:

runs: [44.6]

GPUS=8 sh evaluate.sh pretrained/InternVL2-4B mmvet --dynamic

Then, submit the results to the evaluation server. The expected test results are:

runs: [55.7]

GPUS=8 sh evaluate.sh pretrained/InternVL2-8B mmvet --dynamic

Then, submit the results to the evaluation server. The expected test results are:

runs: [60.0]

GPUS=8 sh evaluate.sh pretrained/InternVL2-26B mmvet --dynamic

Then, submit the results to the evaluation server. The expected test results are:

runs: [64.2]

GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mmvet --dynamic --auto

Then, submit the results to the evaluation server. The expected test results are:

runs: [68.5]

GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mmvet --dynamic --auto

Then, submit the results to the evaluation server. The expected test results are:

runs: [69.8]

MMBench#

The MMBench dataset is a comprehensive multi-modality benchmark designed to evaluate the fine-grained abilities of vision-language models. It contains around 3,000 multiple-choice questions covering 20 ability dimensions, structured into a hierarchical taxonomy. These dimensions include perception and reasoning abilities, further broken down into specific skills like coarse and fine-grained perception, attribute reasoning, and logic reasoning.

For the English dev / test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmbench-dev-en --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmbench-test-en --dynamic

Then, submit the results to the evaluation server. The expected test results are:

mmbench-dev-en: -
mmbench-test-en: 65.4

For the Chinese dev / test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmbench-dev-cn --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmbench-test-cn --dynamic

Then, submit the results to the evaluation server. The expected test results are:

mmbench-dev-cn: -
mmbench-test-cn: 60.7

For the English dev / test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-2B mmbench-dev-en --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B mmbench-test-en --dynamic

Then, submit the results to the evaluation server. The expected test results are:

mmbench-dev-en: -
mmbench-test-en: 73.2

For the Chinese dev / test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-2B mmbench-dev-cn --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B mmbench-test-cn --dynamic

Then, submit the results to the evaluation server. The expected test results are:

mmbench-dev-cn: -
mmbench-test-cn: 70.9

For the English dev / test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-4B mmbench-dev-en --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B mmbench-test-en --dynamic

Then, submit the results to the evaluation server. The expected test results are:

mmbench-dev-en: -
mmbench-test-en: 78.6

For the Chinese dev / test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-4B mmbench-dev-cn --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B mmbench-test-cn --dynamic

Then, submit the results to the evaluation server. The expected test results are:

mmbench-dev-cn: -
mmbench-test-cn: 73.9

For the English dev / test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-8B mmbench-dev-en --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B mmbench-test-en --dynamic

Then, submit the results to the evaluation server. The expected test results are:

mmbench-dev-en: -
mmbench-test-en: 81.7

For the Chinese dev / test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-8B mmbench-dev-cn --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B mmbench-test-cn --dynamic

Then, submit the results to the evaluation server. The expected test results are:

mmbench-dev-cn: -
mmbench-test-cn: 81.2

For the English dev / test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-26B mmbench-dev-en --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B mmbench-test-en --dynamic

Then, submit the results to the evaluation server. The expected test results are:

mmbench-dev-en: -
mmbench-test-en: 83.4

For the Chinese dev / test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-26B mmbench-dev-cn --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B mmbench-test-cn --dynamic

Then, submit the results to the evaluation server. The expected test results are:

mmbench-dev-cn: -
mmbench-test-cn: 82.0

For the English dev / test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mmbench-dev-en --dynamic --auto
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mmbench-test-en --dynamic --auto

Then, submit the results to the evaluation server. The expected test results are:

mmbench-dev-en: -
mmbench-test-en: 86.8

For the Chinese dev / test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mmbench-dev-cn --dynamic --auto
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mmbench-test-cn --dynamic --auto

Then, submit the results to the evaluation server. The expected test results are:

mmbench-dev-cn: -
mmbench-test-cn: 86.5

For the English dev / test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mmbench-dev-en --dynamic --auto
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mmbench-test-en --dynamic --auto

Then, submit the results to the evaluation server. The expected test results are:

mmbench-dev-en: -
mmbench-test-en: 86.5

For the Chinese dev / test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mmbench-dev-cn --dynamic --auto
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mmbench-test-cn --dynamic --auto

Then, submit the results to the evaluation server. The expected test results are:

mmbench-dev-cn: -
mmbench-test-cn: 86.3
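
The commands in this section follow a regular pattern: every model is run with `--dynamic`, and the 40B and 76B models additionally take `--auto`. As a convenience sketch (the helper `build_command` and the constants below are hypothetical and not part of the repository), the full list of MMBench commands can be generated rather than typed by hand. The script only prints the commands; it does not execute them.

```python
MODELS = [
    "InternVL2-1B",
    "InternVL2-2B",
    "InternVL2-4B",
    "InternVL2-8B",
    "InternVL2-26B",
    "InternVL2-40B",
    "InternVL2-Llama3-76B",
]

# Models whose commands in this document carry the extra --auto flag.
AUTO_MODELS = {"InternVL2-40B", "InternVL2-Llama3-76B"}

SPLITS = ["mmbench-dev-en", "mmbench-test-en", "mmbench-dev-cn", "mmbench-test-cn"]


def build_command(model, dataset, gpus=8):
    """Build one evaluation command in the form used throughout this page."""
    flags = ["--dynamic"]
    if model in AUTO_MODELS:
        flags.append("--auto")
    return f"GPUS={gpus} sh evaluate.sh pretrained/{model} {dataset} {' '.join(flags)}"


commands = [build_command(model, split) for model in MODELS for split in SPLITS]
for command in commands:
    print(command)
```

Swapping `SPLITS` for another dataset name from this page (for example `ccbench-dev`) yields the commands for that benchmark.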

CCBench#

CCBench is a multimodal benchmark in the domain of Chinese culture, developed by the OpenCompass community as part of the larger MMBench suite. It contains 510 multiple-choice questions that evaluate MLLMs on cultural knowledge and understanding.

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B ccbench-dev --dynamic

Then, submit the results to the evaluation server. The expected test results are:

ccbench-dev: 75.7

GPUS=8 sh evaluate.sh pretrained/InternVL2-2B ccbench-dev --dynamic

Then, submit the results to the evaluation server. The expected test results are:

ccbench-dev: 74.7

GPUS=8 sh evaluate.sh pretrained/InternVL2-4B ccbench-dev --dynamic

Then, submit the results to the evaluation server. The expected test results are:

ccbench-dev: 66.5

GPUS=8 sh evaluate.sh pretrained/InternVL2-8B ccbench-dev --dynamic

Then, submit the results to the evaluation server. The expected test results are:

ccbench-dev: 75.9

GPUS=8 sh evaluate.sh pretrained/InternVL2-26B ccbench-dev --dynamic

Then, submit the results to the evaluation server. The expected test results are:

ccbench-dev: 73.5

GPUS=8 sh evaluate.sh pretrained/InternVL2-40B ccbench-dev --dynamic --auto

Then, submit the results to the evaluation server. The expected test results are:

ccbench-dev: 80.6

GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B ccbench-dev --dynamic --auto

Then, submit the results to the evaluation server. The expected test results are:

ccbench-dev: 81.0

SEED#

SEED-Bench is a multimodal benchmark of multiple-choice questions designed to evaluate MLLMs across 12 evaluation dimensions that span both image and video understanding, including scene understanding, instance identity, instance attributes, spatial relation, text understanding, action recognition, and procedure understanding. The results below report accuracy for each dimension, along with separate image and video accuracy.

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B seed --dynamic

The expected test results are:

Acc@1: 0.6074485825458588
length: 17990
Accuracy for each data type:
Data type Scene Understanding: 73.05%
Data type Instance Identity: 71.16%
Data type Instance Location: 69.23%
Data type Instance Attributes: 58.49%
Data type Instances Counting: 52.55%
Data type Spatial Relation: 43.53%
Data type Instance Interaction: 71.13%
Data type Visual Reasoning: 72.51%
Data type Text Understanding: 68.60%
Data type Action Recognition: 53.55%
Data type Action Prediction: 39.92%
Data type Procedure Understanding: 28.74%
Total accuracy: 60.76%
Image accuracy: 65.62%
Video accuracy: 42.35%

GPUS=8 sh evaluate.sh pretrained/InternVL2-2B seed --dynamic

The expected test results are:

Acc@1: 0.6656475819899944
length: 17990
Accuracy for each data type:
Data type Scene Understanding: 76.92%
Data type Instance Identity: 76.79%
Data type Instance Location: 75.04%
Data type Instance Attributes: 65.44%
Data type Instances Counting: 60.40%
Data type Spatial Relation: 54.03%
Data type Instance Interaction: 72.16%
Data type Visual Reasoning: 76.74%
Data type Text Understanding: 74.42%
Data type Action Recognition: 60.04%
Data type Action Prediction: 43.27%
Data type Procedure Understanding: 34.70%
Total accuracy: 66.56%
Image accuracy: 71.55%
Video accuracy: 47.67%

GPUS=8 sh evaluate.sh pretrained/InternVL2-4B seed --dynamic

The expected test results are:

Acc@1: 0.6934408004446915
length: 17990
Accuracy for each data type:
Data type Scene Understanding: 78.75%
Data type Instance Identity: 76.79%
Data type Instance Location: 77.45%
Data type Instance Attributes: 66.36%
Data type Instances Counting: 64.57%
Data type Spatial Relation: 56.47%
Data type Instance Interaction: 71.13%
Data type Visual Reasoning: 78.25%
Data type Text Understanding: 75.58%
Data type Action Recognition: 60.57%
Data type Action Prediction: 47.84%
Data type Procedure Understanding: 47.80%
Total accuracy: 69.34%
Image accuracy: 73.67%
Video accuracy: 52.94%

GPUS=8 sh evaluate.sh pretrained/InternVL2-8B seed --dynamic

The expected test results are:

Acc@1: 0.7072262367982213
length: 17990
Accuracy for each data type:
Data type Scene Understanding: 79.89%
Data type Instance Identity: 78.97%
Data type Instance Location: 79.50%
Data type Instance Attributes: 69.84%
Data type Instances Counting: 68.08%
Data type Spatial Relation: 64.23%
Data type Instance Interaction: 79.38%
Data type Visual Reasoning: 78.85%
Data type Text Understanding: 75.58%
Data type Action Recognition: 60.70%
Data type Action Prediction: 48.57%
Data type Procedure Understanding: 36.56%
Total accuracy: 70.72%
Image accuracy: 76.15%
Video accuracy: 50.17%

GPUS=8 sh evaluate.sh pretrained/InternVL2-26B seed --dynamic

The expected test results are:

Acc@1: 0.7245136186770428
length: 17990
Accuracy for each data type:
Data type Scene Understanding: 80.30%
Data type Instance Identity: 80.39%
Data type Instance Location: 79.88%
Data type Instance Attributes: 71.78%
Data type Instances Counting: 69.68%
Data type Spatial Relation: 61.95%
Data type Instance Interaction: 75.26%
Data type Visual Reasoning: 79.15%
Data type Text Understanding: 68.60%
Data type Action Recognition: 65.47%
Data type Action Prediction: 54.20%
Data type Procedure Understanding: 44.28%
Total accuracy: 72.45%
Image accuracy: 76.79%
Video accuracy: 56.03%

GPUS=8 sh evaluate.sh pretrained/InternVL2-40B seed --dynamic --auto

The expected test results are:

Acc@1: 0.7464146748193441
length: 17990
Accuracy for each data type:
Data type Scene Understanding: 80.62%
Data type Instance Identity: 82.36%
Data type Instance Location: 80.92%
Data type Instance Attributes: 71.68%
Data type Instances Counting: 72.46%
Data type Spatial Relation: 66.36%
Data type Instance Interaction: 78.35%
Data type Visual Reasoning: 80.06%
Data type Text Understanding: 66.28%
Data type Action Recognition: 67.93%
Data type Action Prediction: 57.47%
Data type Procedure Understanding: 56.40%
Total accuracy: 74.65%
Image accuracy: 78.15%
Video accuracy: 61.38%

GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B seed --dynamic --auto

The expected test results are:

Acc@1: 0.7446359088382435
length: 17990
Accuracy for each data type:
Data type Scene Understanding: 80.40%
Data type Instance Identity: 82.25%
Data type Instance Location: 80.66%
Data type Instance Attributes: 73.31%
Data type Instances Counting: 72.78%
Data type Spatial Relation: 65.14%
Data type Instance Interaction: 79.38%
Data type Visual Reasoning: 79.15%
Data type Text Understanding: 77.91%
Data type Action Recognition: 68.26%
Data type Action Prediction: 55.10%
Data type Procedure Understanding: 55.23%
Total accuracy: 74.46%
Image accuracy: 78.17%
Video accuracy: 60.42%
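
To compare models across the 12 dimensions, the per-type lines in these logs can be parsed into a dictionary. The snippet below is a minimal sketch that assumes the log format shown above (`Data type <name>: <value>%`); the regular expression and variable names are illustrative and not taken from the evaluation code.

```python
import re

# Log text copied from the InternVL2-Llama3-76B run above.
log = """\
Acc@1: 0.7446359088382435
length: 17990
Accuracy for each data type:
Data type Scene Understanding: 80.40%
Data type Instance Identity: 82.25%
Data type Instance Location: 80.66%
Data type Instance Attributes: 73.31%
Data type Instances Counting: 72.78%
Data type Spatial Relation: 65.14%
Data type Instance Interaction: 79.38%
Data type Visual Reasoning: 79.15%
Data type Text Understanding: 77.91%
Data type Action Recognition: 68.26%
Data type Action Prediction: 55.10%
Data type Procedure Understanding: 55.23%
Total accuracy: 74.46%
Image accuracy: 78.17%
Video accuracy: 60.42%
"""

# Matches lines such as "Data type Scene Understanding: 80.40%".
pattern = re.compile(r"^Data type (?P<name>.+): (?P<acc>[\d.]+)%$")

per_type = {}
for line in log.splitlines():
    match = pattern.match(line)
    if match:
        per_type[match.group("name")] = float(match.group("acc"))

weakest = min(per_type, key=per_type.get)
strongest = max(per_type, key=per_type.get)
print(f"{len(per_type)} dimensions parsed")
print(f"weakest: {weakest} ({per_type[weakest]:.2f}%)")
print(f"strongest: {strongest} ({per_type[strongest]:.2f}%)")
```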

MMVP#

The MMVP dataset is designed to benchmark the performance of multimodal large language models (MLLMs) in visual question answering tasks. This dataset focuses on identifying “CLIP-blind pairs,” which are images that appear similar to the CLIP model despite having clear visual differences. The MMVP dataset includes 300 images derived from ImageNet-1k and LAION-Aesthetics, each paired with straightforward questions to evaluate the models’ visual capabilities. It highlights the challenges these systems face, often leading to incorrect responses and hallucinated explanations.

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmvp --dynamic

The expected test results are:

Evaluating MMVP ...
Results saved to results/MMVP_240708020850.jsonl
The accuracy is 0.2

GPUS=8 sh evaluate.sh pretrained/InternVL2-2B mmvp --dynamic

The expected test results are:

Evaluating MMVP ...
Results saved to results/MMVP_240702122300.jsonl
The accuracy is 0.35333333333333333

GPUS=8 sh evaluate.sh pretrained/InternVL2-4B mmvp --dynamic

The expected test results are:

Evaluating MMVP ...
Results saved to results/MMVP_240702144108.jsonl
The accuracy is 0.4066666666666667

GPUS=8 sh evaluate.sh pretrained/InternVL2-8B mmvp --dynamic

The expected test results are:

Evaluating MMVP ...
Results saved to results/MMVP_240703200956.jsonl
The accuracy is 0.5133333333333333

GPUS=8 sh evaluate.sh pretrained/InternVL2-26B mmvp --dynamic

The expected test results are:

Evaluating MMVP ...
Results saved to results/MMVP_240704024433.jsonl
The accuracy is 0.5466666666666666

GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mmvp --dynamic --auto

The expected test results are:

Evaluating MMVP ...
Results saved to results/MMVP_240708045836.jsonl
The accuracy is 0.5866666666666667

GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mmvp --dynamic --auto

The expected test results are:

Evaluating MMVP ...
Results saved to results/MMVP_240718203234.jsonl
The accuracy is 0.5266666666666666

RefCOCO Series#

RefCOCO, RefCOCO+, and RefCOCOg are datasets for referring expression comprehension, segmentation, and generation. They are built on the MSCOCO dataset and are used to evaluate how well a model connects natural-language expressions to the image regions they describe.

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B refcoco --dynamic

GPUS=8 sh evaluate.sh pretrained/InternVL2-2B refcoco --dynamic

GPUS=8 sh evaluate.sh pretrained/InternVL2-4B refcoco --dynamic

GPUS=8 sh evaluate.sh pretrained/InternVL2-8B refcoco --dynamic

GPUS=8 sh evaluate.sh pretrained/InternVL2-26B refcoco --dynamic

GPUS=8 sh evaluate.sh pretrained/InternVL2-40B refcoco --dynamic --auto

GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B refcoco --dynamic --auto

The expected test results are:

Model	avg.	RefCOCO (val)	RefCOCO (testA)	RefCOCO (testB)	RefCOCO+ (val)	RefCOCO+ (testA)	RefCOCO+ (testB)	RefCOCO‑g (val)	RefCOCO‑g (test)
InternVL2‑1B	79.9	83.6	88.7	79.8	76.0	83.6	67.7	80.2	79.9
InternVL2‑2B	77.7	82.3	88.2	75.9	73.5	82.8	63.3	77.6	78.3
InternVL2‑4B	84.4	88.5	91.2	83.9	81.2	87.2	73.8	84.6	84.6
InternVL2‑8B	82.9	87.1	91.1	80.7	79.8	87.9	71.4	82.7	82.7
InternVL2‑26B	88.5	91.2	93.3	87.4	86.8	91.0	81.2	88.5	88.6
InternVL2‑40B	90.3	93.0	94.7	89.2	88.5	92.8	83.6	90.3	90.6
InternVL2‑Llama3‑76B	90.0	92.2	94.8	88.4	88.8	93.1	82.8	89.5	90.3
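
The avg. column appears to be the unweighted mean of the eight split scores, rounded to one decimal place. The following Python sketch checks this against three rows of the table above. The names `rows` and `mean_score` are illustrative and not part of the InternVL codebase.

```python
# Split scores copied from the table above, in column order:
# RefCOCO val/testA/testB, RefCOCO+ val/testA/testB, RefCOCO-g val/test.
# Each entry is (reported avg., [eight split scores]).
rows = {
    "InternVL2-1B": (79.9, [83.6, 88.7, 79.8, 76.0, 83.6, 67.7, 80.2, 79.9]),
    "InternVL2-8B": (82.9, [87.1, 91.1, 80.7, 79.8, 87.9, 71.4, 82.7, 82.7]),
    "InternVL2-40B": (90.3, [93.0, 94.7, 89.2, 88.5, 92.8, 83.6, 90.3, 90.6]),
}


def mean_score(scores):
    """Unweighted mean of the per-split scores, rounded to one decimal."""
    return round(sum(scores) / len(scores), 1)


for model, (reported_avg, scores) in rows.items():
    # Print the recomputed mean next to the value reported in the table.
    print(model, mean_score(scores), reported_avg)
```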

MVBench#

MVBench is a comprehensive multimodal video understanding benchmark developed to evaluate the temporal comprehension capabilities of MLLMs. It includes 20 challenging video tasks that require temporal understanding and cannot be effectively solved using a single frame. The benchmark uses a novel static-to-dynamic method, transforming static tasks into dynamic ones to systematically generate video tasks that demand a wide range of temporal skills, from perception to cognition.

We evaluate our models on MVBench by extracting 16 frames from each video and resizing each frame to 448x448.
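
How the 16 frames are selected is not stated here. One common approach, shown below as an assumption rather than as the evaluation script's actual logic, is to split the video into 16 equal segments and take the frame at the midpoint of each. `sample_frame_indices` is a hypothetical helper. Decoding the frames and resizing them to 448x448 is left to whichever video library is in use.

```python
def sample_frame_indices(num_frames, num_segments=16):
    """Return one frame index per segment, taken at the segment midpoint.

    Assumption: uniform midpoint sampling. The evaluation script may
    select frames differently.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    seg_size = num_frames / num_segments
    # Clamp to the last valid index to guard against rounding at the end.
    return [
        min(int(seg_size / 2 + seg_size * i), num_frames - 1)
        for i in range(num_segments)
    ]


indices = sample_frame_indices(num_frames=160)
print(indices)  # 16 indices, from 5 to 155 in steps of 10
```

For videos with fewer than 16 frames this sketch repeats indices instead of failing, so every sample still yields 16 inputs.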

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mvbench --dynamic --max-num 1

The expected test results are:

57.9

GPUS=8 sh evaluate.sh pretrained/InternVL2-2B mvbench --dynamic --max-num 1

The expected test results are:

60.2

GPUS=8 sh evaluate.sh pretrained/InternVL2-4B mvbench --dynamic --max-num 1

The expected test results are:

63.7

GPUS=8 sh evaluate.sh pretrained/InternVL2-8B mvbench --dynamic --max-num 1

The expected test results are:

66.4

GPUS=8 sh evaluate.sh pretrained/InternVL2-26B mvbench --dynamic --max-num 1

The expected test results are:

67.5

GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mvbench --dynamic --max-num 1 --auto

The expected test results are:

72.5

GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mvbench --dynamic --max-num 1 --auto

The expected test results are:

69.6

Evaluation using VLMEvalKit Codebase#

Data Preparation#

VLMEvalKit will automatically download the data for evaluation, so you do not need to prepare it manually.

MathVista#

The MathVista dataset is a comprehensive benchmark for evaluating mathematical reasoning within visual contexts. Alongside examples drawn from existing multimodal datasets, it includes three newly created datasets (IQTest, FunctionQA, and PaperQA) designed to address logical reasoning on puzzle test figures, algebraic reasoning over functional plots, and scientific reasoning with academic paper figures, respectively.
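
In the result tables for this benchmark, prefetch_rate and acc are the prefetch and hit counts expressed as percentages of tot. The Python sketch below recomputes both columns for two rows copied from the InternVL2-1B results in this section. The helper name `recompute` is illustrative and not part of VLMEvalKit.

```python
import csv
import io

# Header and two rows copied verbatim from the InternVL2-1B results.
RESULT_CSV = '''"Task&Skill","tot","prefetch","hit","prefetch_rate","acc"
"Overall","1000","489","377","48.9","37.7"
"scientific reasoning","122","85","45","69.67213114754098","36.885245901639344"
'''


def recompute(row):
    """Recompute (prefetch_rate, acc) as percentages of the row total."""
    tot = int(row["tot"])
    prefetch_rate = int(row["prefetch"]) / tot * 100
    acc = int(row["hit"]) / tot * 100
    return prefetch_rate, acc


rows = list(csv.DictReader(io.StringIO(RESULT_CSV)))
for row in rows:
    prefetch_rate, acc = recompute(row)
    # Show the recomputed values next to the task or skill name.
    print(row["Task&Skill"], round(prefetch_rate, 2), round(acc, 2))
```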

torchrun --nproc-per-node=8 run.py --data MathVista_MINI --model InternVL2-1B --verbose

The expected test results are:

"Task&Skill","tot","prefetch","hit","prefetch_rate","acc"
"Overall","1000","489","377","48.9","37.7"
"scientific reasoning","122","85","45","69.67213114754098","36.885245901639344"
"textbook question answering","158","92","63","58.22784810126582","39.87341772151899"
"numeric commonsense","144","39","24","27.083333333333332","16.666666666666664"
"arithmetic reasoning","353","102","103","28.89518413597734","29.178470254957507"
"visual question answering","179","92","53","51.39664804469274","29.608938547486037"
"geometry reasoning","239","147","95","61.50627615062761","39.74895397489539"
"algebraic reasoning","281","170","112","60.4982206405694","39.8576512455516"
"geometry problem solving","208","138","85","66.34615384615384","40.86538461538461"
"math word problem","186","26","52","13.978494623655912","27.956989247311824"
"logical reasoning","37","11","5","29.72972972972973","13.513513513513514"
"figure question answering","269","141","124","52.41635687732342","46.09665427509294"
"statistical reasoning","301","144","148","47.840531561461795","49.16943521594684"

torchrun --nproc-per-node=8 run.py --data MathVista_MINI --model InternVL2-2B --verbose

The expected test results are:

"Task&Skill","tot","prefetch","hit","prefetch_rate","acc"
"Overall","1000","476","464","47.599999999999994","46.400000000000006"
"scientific reasoning","122","83","68","68.0327868852459","55.73770491803278"
"textbook question answering","158","95","79","60.12658227848101","50.0"
"numeric commonsense","144","35","37","24.305555555555554","25.694444444444443"
"arithmetic reasoning","353","100","146","28.328611898016998","41.359773371104815"
"visual question answering","179","91","86","50.83798882681564","48.04469273743017"
"geometry reasoning","239","144","103","60.25104602510461","43.09623430962343"
"algebraic reasoning","281","171","117","60.854092526690394","41.637010676156585"
"geometry problem solving","208","136","94","65.38461538461539","45.19230769230769"
"math word problem","186","20","62","10.75268817204301","33.33333333333333"
"logical reasoning","37","11","4","29.72972972972973","10.81081081081081"
"figure question answering","269","134","143","49.814126394052046","53.159851301115246"
"statistical reasoning","301","137","180","45.51495016611295","59.800664451827245"

torchrun --nproc-per-node=8 run.py --data MathVista_MINI --model InternVL2-4B --verbose

The expected test results are:

"Task&Skill","tot","prefetch","hit","prefetch_rate","acc"
"Overall","1000","544","587","54.400000000000006","58.699999999999996"
"scientific reasoning","122","88","73","72.1311475409836","59.83606557377049"
"textbook question answering","158","97","93","61.39240506329114","58.86075949367089"
"numeric commonsense","144","37","43","25.694444444444443","29.86111111111111"
"arithmetic reasoning","353","139","197","39.376770538243626","55.80736543909348"
"visual question answering","179","94","87","52.513966480446925","48.60335195530726"
"geometry reasoning","239","146","133","61.08786610878661","55.64853556485355"
"algebraic reasoning","281","169","156","60.14234875444839","55.51601423487544"
"geometry problem solving","208","137","119","65.86538461538461","57.21153846153846"
"math word problem","186","54","119","29.03225806451613","63.97849462365591"
"logical reasoning","37","19","9","51.35135135135135","24.324324324324326"
"figure question answering","269","162","169","60.223048327137555","62.825278810408925"
"statistical reasoning","301","167","215","55.48172757475083","71.42857142857143"

torchrun --nproc-per-node=8 run.py --data MathVista_MINI --model InternVL2-8B --verbose

The expected test results are:

"Task&Skill","tot","prefetch","hit","prefetch_rate","acc"
"Overall","1000","549","583","54.900000000000006","58.3"
"scientific reasoning","122","89","72","72.95081967213115","59.01639344262295"
"textbook question answering","158","101","97","63.92405063291139","61.39240506329114"
"numeric commonsense","144","39","44","27.083333333333332","30.555555555555557"
"arithmetic reasoning","353","128","199","36.26062322946176","56.37393767705382"
"visual question answering","179","92","89","51.39664804469274","49.72067039106145"
"geometry reasoning","239","160","144","66.94560669456067","60.25104602510461"
"algebraic reasoning","281","185","168","65.83629893238434","59.7864768683274"
"geometry problem solving","208","150","129","72.11538461538461","62.019230769230774"
"math word problem","186","49","110","26.344086021505376","59.13978494623656"
"logical reasoning","37","16","4","43.24324324324324","10.81081081081081"
"figure question answering","269","157","158","58.36431226765799","58.7360594795539"
"statistical reasoning","301","155","207","51.49501661129568","68.77076411960132"

torchrun --nproc-per-node=8 run.py --data MathVista_MINI --model InternVL2-26B --verbose

The expected test results are:

"Task&Skill","tot","prefetch","hit","prefetch_rate","acc"
"Overall","1000","588","594","58.8","59.4"
"scientific reasoning","122","87","73","71.31147540983606","59.83606557377049"
"textbook question answering","158","98","97","62.0253164556962","61.39240506329114"
"numeric commonsense","144","38","49","26.38888888888889","34.02777777777778"
"arithmetic reasoning","353","157","212","44.47592067988669","60.05665722379604"
"visual question answering","179","91","97","50.83798882681564","54.18994413407822"
"geometry reasoning","239","164","139","68.6192468619247","58.15899581589959"
"algebraic reasoning","281","188","159","66.90391459074732","56.58362989323843"
"geometry problem solving","208","154","121","74.03846153846155","58.17307692307693"
"math word problem","186","76","116","40.86021505376344","62.365591397849464"
"logical reasoning","37","17","3","45.94594594594595","8.108108108108109"
"figure question answering","269","169","163","62.825278810408925","60.594795539033456"
"statistical reasoning","301","168","212","55.81395348837209","70.43189368770764"

torchrun --nproc-per-node=8 run.py --data MathVista_MINI --model InternVL2-40B --verbose

The expected test results are:

"Task&Skill","tot","prefetch","hit","prefetch_rate","acc"
"Overall","1000","552","637","55.2","63.7"
"scientific reasoning","122","90","76","73.77049180327869","62.295081967213115"
"textbook question answering","158","101","99","63.92405063291139","62.65822784810127"
"numeric commonsense","144","34","58","23.61111111111111","40.27777777777778"
"arithmetic reasoning","353","147","229","41.64305949008499","64.87252124645893"
"visual question answering","179","92","103","51.39664804469274","57.54189944134078"
"geometry reasoning","239","155","131","64.85355648535564","54.811715481171554"
"algebraic reasoning","281","180","152","64.05693950177937","54.092526690391466"
"geometry problem solving","208","146","114","70.1923076923077","54.807692307692314"
"math word problem","186","65","135","34.946236559139784","72.58064516129032"
"logical reasoning","37","11","10","29.72972972972973","27.027027027027028"
"figure question answering","269","148","186","55.01858736059479","69.14498141263941"
"statistical reasoning","301","150","233","49.83388704318937","77.40863787375415"

torchrun --nproc-per-node=1 run.py --data MathVista_MINI --model InternVL2-76B --verbose

The expected test results are:

"Task&Skill","tot","prefetch","hit","prefetch_rate","acc"
"Overall","1000","534","655","53.400000000000006","65.5"
"scientific reasoning","122","89","77","72.95081967213115","63.114754098360656"
"textbook question answering","158","100","106","63.29113924050633","67.08860759493672"
"numeric commonsense","144","42","64","29.166666666666668","44.44444444444444"
"arithmetic reasoning","353","154","218","43.626062322946176","61.756373937677054"
"visual question answering","179","95","89","53.072625698324025","49.72067039106145"
"geometry reasoning","239","143","160","59.83263598326359","66.94560669456067"
"algebraic reasoning","281","168","187","59.7864768683274","66.54804270462633"
"geometry problem solving","208","134","142","64.42307692307693","68.26923076923077"
"math word problem","186","73","143","39.247311827956985","76.88172043010752"
"logical reasoning","37","7","6","18.91891891891892","16.216216216216218"
"figure question answering","269","132","175","49.07063197026022","65.05576208178438"
"statistical reasoning","301","139","232","46.179401993355484","77.0764119601329"

HallusionBench#

HallusionBench is a comprehensive benchmark designed to evaluate image-context reasoning in MLLMs, focusing on identifying issues related to language hallucination and visual illusion. The dataset consists of 346 images paired with 1,129 questions crafted by human experts. These questions are divided into two categories: Visual Dependent (VD) and Visual Supplement (VS), allowing the benchmark to assess the nuanced understanding and interpretation of visual data by MLLMs.

torchrun --nproc-per-node=8 run.py --data HallusionBench --model InternVL2-1B --verbose

The expected test results are:

"split","aAcc","fAcc","qAcc"
"Overall","54.363827549947416","23.98843930635838","21.978021978021978"
"VS","58.333333333333336","15.517241379310345","28.651685393258425"
"VD","51.945854483925544","28.26086956521739","17.689530685920577"
"VS_map","56.25","9.090909090909092","12.5"
"VD_illusion","48.61111111111111","25.806451612903224","8.333333333333332"
"VD_figure","58.75","36.58536585365854","23.076923076923077"
"VS_ocr","44.44444444444444","23.076923076923077","3.7037037037037033"
"VD_video","51.76470588235295","14.583333333333334","11.594202898550725"
"VD_ocr","78.65168539325843","58.139534883720934","55.81395348837209"
"VS_chart","66.15384615384615","17.5","47.368421052631575"
"VD_math","29.629629629629626","5.555555555555555","3.7037037037037033"
"VS_table","57.14285714285714","10.714285714285714","23.25581395348837"

result = (54.363827549947416 + 23.98843930635838 + 21.978021978021978) / 3 = 33.4
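
The `result` line is the mean of the three Overall metrics (aAcc, fAcc, qAcc), rounded to one decimal place. A minimal Python sketch of that arithmetic, using the Overall row above, is shown here. `hallusion_score` is an illustrative name, not a VLMEvalKit function.

```python
import csv
import io

# Header and Overall row copied from the InternVL2-1B output above.
RESULT_CSV = '''"split","aAcc","fAcc","qAcc"
"Overall","54.363827549947416","23.98843930635838","21.978021978021978"
'''


def hallusion_score(csv_text):
    """Average aAcc, fAcc and qAcc of the Overall row, to one decimal."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["split"] == "Overall":
            metrics = [float(row[key]) for key in ("aAcc", "fAcc", "qAcc")]
            return round(sum(metrics) / len(metrics), 1)
    raise ValueError("no Overall row found")


print(hallusion_score(RESULT_CSV))  # 33.4
```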

torchrun --nproc-per-node=8 run.py --data HallusionBench --model InternVL2-2B --verbose

The expected test results are:

"split","aAcc","fAcc","qAcc"
"Overall","58.359621451104104","26.589595375722542","28.79120879120879"
"VS","65.27777777777779","24.137931034482758","41.57303370786517"
"VD","54.145516074450086","27.82608695652174","20.577617328519857"
"VS_chart","70.0","27.500000000000004","59.210526315789465"
"VD_math","38.88888888888889","2.7777777777777777","11.11111111111111"
"VS_table","65.17857142857143","14.285714285714285","37.2093023255814"
"VD_ocr","71.91011235955057","46.51162790697674","44.18604651162791"
"VD_figure","60.0","39.02439024390244","23.076923076923077"
"VD_illusion","57.638888888888886","32.25806451612903","23.61111111111111"
"VD_video","48.8235294117647","14.583333333333334","8.695652173913043"
"VS_map","64.0625","27.27272727272727","28.125"
"VS_ocr","55.55555555555556","26.923076923076923","14.814814814814813"

result = (58.359621451104104 + 26.589595375722542 + 28.79120879120879) / 3 = 37.9

torchrun --nproc-per-node=8 run.py --data HallusionBench --model InternVL2-4B --verbose

The expected test results are:

"split","aAcc","fAcc","qAcc"
"Overall","61.09358569926393","32.369942196531795","32.30769230769231"
"VD","56.17597292724196","30.0","22.743682310469314"
"VS","69.16666666666667","37.06896551724138","47.19101123595505"
"VS_map","56.25","27.27272727272727","15.625"
"VS_ocr","55.55555555555556","38.46153846153847","18.51851851851852"
"VD_ocr","75.28089887640449","51.162790697674424","51.162790697674424"
"VS_table","75.89285714285714","35.714285714285715","55.81395348837209"
"VD_figure","62.5","39.02439024390244","25.64102564102564"
"VD_illusion","55.55555555555556","33.87096774193548","19.444444444444446"
"VD_video","48.8235294117647","8.333333333333332","7.246376811594203"
"VD_math","48.148148148148145","16.666666666666664","22.22222222222222"
"VS_chart","75.38461538461539","42.5","65.78947368421053"

result = (61.09358569926393 + 32.369942196531795 + 32.30769230769231) / 3 = 41.9

torchrun --nproc-per-node=8 run.py --data HallusionBench --model InternVL2-8B --verbose

The expected test results are:

"split","aAcc","fAcc","qAcc"
"Overall","64.03785488958991","35.83815028901734","35.824175824175825"
"VS","69.16666666666667","36.206896551724135","45.50561797752809"
"VD","60.913705583756354","35.65217391304348","29.602888086642597"
"VS_chart","76.15384615384615","42.5","63.1578947368421"
"VD_ocr","74.15730337078652","51.162790697674424","48.837209302325576"
"VD_figure","67.5","53.65853658536586","35.8974358974359"
"VD_video","51.17647058823529","14.583333333333334","11.594202898550725"
"VD_math","55.55555555555556","16.666666666666664","29.629629629629626"
"VD_illusion","64.58333333333334","40.32258064516129","31.944444444444443"
"VS_map","56.25","31.818181818181817","18.75"
"VS_ocr","53.70370370370371","26.923076923076923","11.11111111111111"
"VS_table","75.89285714285714","39.285714285714285","55.81395348837209"

result = (64.03785488958991 + 35.83815028901734 + 35.824175824175825) / 3 = 45.2

torchrun --nproc-per-node=8 run.py --data HallusionBench --model InternVL2-26B --verbose

The expected test results are:

"split","aAcc","fAcc","qAcc"
"Overall","67.2975814931651","43.641618497109825","41.098901098901095"
"VD","63.45177664974619","42.608695652173914","33.935018050541515"
"VS","73.61111111111111","45.689655172413794","52.24719101123596"
"VD_illusion","65.97222222222221","50.0","33.33333333333333"
"VS_chart","80.0","50.0","68.42105263157895"
"VD_ocr","77.52808988764045","58.139534883720934","55.81395348837209"
"VD_figure","72.5","53.65853658536586","43.58974358974359"
"VS_map","54.6875","22.727272727272727","18.75"
"VD_video","54.70588235294118","25.0","17.391304347826086"
"VS_ocr","51.85185185185185","34.61538461538461","14.814814814814813"
"VD_math","55.55555555555556","22.22222222222222","31.48148148148148"
"VS_table","87.5","67.85714285714286","72.09302325581395"

result = (67.2975814931651 + 43.641618497109825 + 41.098901098901095) / 3 = 50.7

torchrun --nproc-per-node=8 run.py --data HallusionBench --model InternVL2-40B --verbose

The expected test results are:

"split","aAcc","fAcc","qAcc"
"Overall","71.39852786540484","51.73410404624278","47.69230769230769"
"VS","78.88888888888889","56.896551724137936","58.98876404494382"
"VD","66.83587140439933","49.130434782608695","40.43321299638989"
"VD_math","62.03703703703704","36.11111111111111","38.88888888888889"
"VD_ocr","80.89887640449437","62.7906976744186","60.46511627906976"
"VD_figure","85.0","78.04878048780488","69.23076923076923"
"VS_chart","84.61538461538461","60.0","76.31578947368422"
"VS_map","62.5","45.45454545454545","25.0"
"VS_ocr","72.22222222222221","53.84615384615385","44.44444444444444"
"VS_table","84.82142857142857","64.28571428571429","62.7906976744186"
"VD_video","52.94117647058824","20.833333333333336","15.942028985507244"
"VD_illusion","68.05555555555556","50.0","37.5"

result = (71.39852786540484 + 51.73410404624278 + 47.69230769230769) / 3 = 56.9

torchrun --nproc-per-node=1 run.py --data HallusionBench --model InternVL2-76B --verbose

The expected test results are:

"split","aAcc","fAcc","qAcc"
"Overall","71.1882229232387","48.26589595375722","46.15384615384615"
"VS","76.38888888888889","53.44827586206896","56.74157303370787"
"VD","68.02030456852792","45.65217391304348","39.35018050541516"
"VD_ocr","80.89887640449437","65.11627906976744","65.11627906976744"
"VS_chart","81.53846153846153","60.0","73.68421052631578"
"VD_video","60.588235294117645","25.0","20.28985507246377"
"VD_math","64.81481481481481","27.77777777777778","37.03703703703704"
"VD_illusion","62.5","40.32258064516129","29.166666666666668"
"VS_ocr","64.81481481481481","42.30769230769231","29.629629629629626"
"VD_figure","83.75","73.17073170731707","66.66666666666666"
"VS_table","82.14285714285714","60.71428571428571","62.7906976744186"
"VS_map","65.625","45.45454545454545","31.25"

result = (71.1882229232387 + 48.26589595375722 + 46.15384615384615) / 3 = 55.2

MMStar#

The MMStar dataset is an advanced multimodal benchmark designed to evaluate the capabilities of MLLMs. It comprises 1,500 carefully selected samples that are balanced and purified to ensure they exhibit visual dependency and minimal data leakage. The dataset evaluates models across six core capabilities and 18 detailed axes, focusing on complex multimodal tasks that require advanced reasoning and understanding of visual content.
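
In the results reported in this section, the Overall score matches the unweighted mean of the six category scores, which is what a balanced benchmark would produce. The Python check below uses the InternVL2-1B and InternVL2-2B rows. This is an observation about these numbers, not a documented guarantee of VLMEvalKit, and `category_mean` is an illustrative name.

```python
# Rows copied from the InternVL2-1B and InternVL2-2B results in this section.
# Each entry is (Overall, [six category scores in column order]).
rows = {
    "InternVL2-1B": (0.452, [0.588, 0.368, 0.548, 0.352, 0.46, 0.396]),
    "InternVL2-2B": (
        0.5013333333333333,
        [0.644, 0.392, 0.608, 0.44, 0.496, 0.428],
    ),
}


def category_mean(scores):
    """Unweighted mean of the per-category scores."""
    return sum(scores) / len(scores)


for model, (overall, scores) in rows.items():
    # Print the recomputed mean next to the reported Overall score.
    print(model, round(category_mean(scores), 4), round(overall, 4))
```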

torchrun --nproc-per-node=8 run.py --data MMStar --model InternVL2-1B --verbose

The expected test results are:

"split","Overall","coarse perception","fine-grained perception","instance reasoning","logical reasoning","math","science & technology"
"none","0.452","0.588","0.368","0.548","0.352","0.46","0.396"

torchrun --nproc-per-node=8 run.py --data MMStar --model InternVL2-2B --verbose

The expected test results are:

"split","Overall","coarse perception","fine-grained perception","instance reasoning","logical reasoning","math","science & technology"
"none","0.5013333333333333","0.644","0.392","0.608","0.44","0.496","0.428"

torchrun --nproc-per-node=8 run.py --data MMStar --model InternVL2-4B --verbose

The expected test results are:

"split","Overall","coarse perception","fine-grained perception","instance reasoning","logical reasoning","math","science & technology"
"none","0.5426666666666666","0.672","0.384","0.624","0.532","0.588","0.456"

torchrun --nproc-per-node=8 run.py --data MMStar --model InternVL2-8B --verbose

The expected test results are:

"split","Overall","coarse perception","fine-grained perception","instance reasoning","logical reasoning","math","science & technology"
"none","0.62","0.704","0.504","0.68","0.656","0.672","0.504"

torchrun --nproc-per-node=8 run.py --data MMStar --model InternVL2-26B --verbose

The expected test results are:

"split","Overall","coarse perception","fine-grained perception","instance reasoning","logical reasoning","math","science & technology"
"none","0.612","0.716","0.544","0.688","0.6","0.624","0.5"

torchrun --nproc-per-node=8 run.py --data MMStar --model InternVL2-40B --verbose

The expected test results are:

"split","Overall","coarse perception","fine-grained perception","instance reasoning","logical reasoning","math","science & technology"
"none","0.654","0.692","0.528","0.716","0.696","0.72","0.572"

torchrun --nproc-per-node=1 run.py --data MMStar --model InternVL2-76B --verbose

The expected test results are:

"split","Overall","coarse perception","fine-grained perception","instance reasoning","logical reasoning","math","science & technology"
"none","0.674","0.704","0.568","0.728","0.724","0.752","0.568"

OCRBench#

OCRBench is a comprehensive evaluation benchmark designed to assess the OCR capabilities of MLLMs. It includes five components: Text Recognition, Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER). The benchmark encompasses data from 29 datasets, making it one of the most thorough OCR evaluation tools available. OCRBench aims to reveal both the strengths and weaknesses of MLLMs, particularly in handling multilingual text, handwritten text, non-semantic text, and mathematical expressions. The benchmark includes 1,000 question-answer pairs, all manually verified for precision.

torchrun --nproc-per-node=8 run.py --data OCRBench --model InternVL2-1B --verbose

The expected test results are:

{
    "Text Recognition": 243,
    "Scene Text-centric VQA": 165,
    "Doc-oriented VQA": 125,
    "Key Information Extraction": 149,
    "Handwritten Mathematical Expression Recognition": 72,
    "Final Score": 754,
    "Final Score Norm": 75.4
}
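
In this output, Final Score is the sum of the five component scores, and Final Score Norm is that sum divided by 10. A small Python sketch that recomputes both from the JSON above (the variable names are illustrative):

```python
import json

# JSON copied from the InternVL2-1B output above.
RESULT_JSON = '''
{
    "Text Recognition": 243,
    "Scene Text-centric VQA": 165,
    "Doc-oriented VQA": 125,
    "Key Information Extraction": 149,
    "Handwritten Mathematical Expression Recognition": 72,
    "Final Score": 754,
    "Final Score Norm": 75.4
}
'''

result = json.loads(RESULT_JSON)
# Keep only the five component scores, skipping the two summary fields.
components = [
    value for key, value in result.items()
    if not key.startswith("Final Score")
]
final_score = sum(components)
final_score_norm = final_score / 10

print(final_score, final_score_norm)  # 754 75.4
```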

torchrun --nproc-per-node=8 run.py --data OCRBench --model InternVL2-2B --verbose

The expected test results are:

{
    "Text Recognition": 246,
    "Scene Text-centric VQA": 170,
    "Doc-oriented VQA": 133,
    "Key Information Extraction": 167,
    "Handwritten Mathematical Expression Recognition": 68,
    "Final Score": 784,
    "Final Score Norm": 78.4
}

torchrun --nproc-per-node=8 run.py --data OCRBench --model InternVL2-4B --verbose

The expected test results are:

{
    "Text Recognition": 237,
    "Scene Text-centric VQA": 170,
    "Doc-oriented VQA": 154,
    "Key Information Extraction": 159,
    "Handwritten Mathematical Expression Recognition": 68,
    "Final Score": 788,
    "Final Score Norm": 78.8
}

torchrun --nproc-per-node=8 run.py --data OCRBench --model InternVL2-8B --verbose

The expected test results are:

{
    "Text Recognition": 236,
    "Scene Text-centric VQA": 175,
    "Doc-oriented VQA": 156,
    "Key Information Extraction": 162,
    "Handwritten Mathematical Expression Recognition": 65,
    "Final Score": 794,
    "Final Score Norm": 79.4
}

torchrun --nproc-per-node=8 run.py --data OCRBench --model InternVL2-26B --verbose

The expected test results are:

{
    "Text Recognition": 250,
    "Scene Text-centric VQA": 185,
    "Doc-oriented VQA": 154,
    "Key Information Extraction": 168,
    "Handwritten Mathematical Expression Recognition": 68,
    "Final Score": 825,
    "Final Score Norm": 82.5
}

torchrun --nproc-per-node=8 run.py --data OCRBench --model InternVL2-40B --verbose

The expected test results are:

{
    "Text Recognition": 246,
    "Scene Text-centric VQA": 181,
    "Doc-oriented VQA": 160,
    "Key Information Extraction": 175,
    "Handwritten Mathematical Expression Recognition": 75,
    "Final Score": 837,
    "Final Score Norm": 83.7
}

torchrun --nproc-per-node=1 run.py --data OCRBench --model InternVL2-76B --verbose

The expected test results are:

{
    "Text Recognition": 244,
    "Scene Text-centric VQA": 182,
    "Doc-oriented VQA": 165,
    "Key Information Extraction": 176,
    "Handwritten Mathematical Expression Recognition": 72,
    "Final Score": 839,
    "Final Score Norm": 83.9
}

MMMU#

torchrun --nproc-per-node=8 run.py --data MMMU_DEV_VAL --model InternVL2-1B --verbose

The expected test results are:

"split","Overall","Accounting","Agriculture","Architecture_and_Engineering","Art","Art_Theory","Basic_Medical_Science","Biology","Chemistry","Clinical_Medicine","Computer_Science","Design","Diagnostics_and_Laboratory_Medicine","Economics","Electronics","Energy_and_Power","Finance","Geography","History","Literature","Manage","Marketing","Materials","Math","Mechanical_Engineering","Music","Pharmacy","Physics","Psychology","Public_Health","Sociology","Art & Design","Business","Health & Medicine","Humanities & Social Science","Science","Tech & Engineering"
"dev","0.34","0.2","0.0","0.2","0.2","0.4","0.4","0.0","0.4","0.0","0.2","0.4","0.4","0.2","0.0","0.6","0.6","0.4","0.2","0.6","0.6","0.6","0.2","0.2","0.0","0.4","0.4","0.8","0.6","0.2","0.8","0.35","0.44","0.28","0.55","0.36","0.17142857142857143"
"validation","0.3688888888888889","0.2","0.2","0.23333333333333334","0.4666666666666667","0.43333333333333335","0.4666666666666667","0.3333333333333333","0.4","0.3333333333333333","0.3333333333333333","0.5333333333333333","0.4666666666666667","0.36666666666666664","0.4666666666666667","0.4","0.23333333333333334","0.4","0.43333333333333335","0.7666666666666667","0.43333333333333335","0.43333333333333335","0.4","0.16666666666666666","0.26666666666666666","0.26666666666666666","0.2","0.36666666666666664","0.26666666666666666","0.3","0.5","0.425","0.3333333333333333","0.35333333333333333","0.49166666666666664","0.3333333333333333","0.32857142857142857"

torchrun --nproc-per-node=8 run.py --data MMMU_DEV_VAL --model InternVL2-2B --verbose

The expected test results are:

"split","Overall","Accounting","Agriculture","Architecture_and_Engineering","Art","Art_Theory","Basic_Medical_Science","Biology","Chemistry","Clinical_Medicine","Computer_Science","Design","Diagnostics_and_Laboratory_Medicine","Economics","Electronics","Energy_and_Power","Finance","Geography","History","Literature","Manage","Marketing","Materials","Math","Mechanical_Engineering","Music","Pharmacy","Physics","Psychology","Public_Health","Sociology","Art & Design","Business","Health & Medicine","Humanities & Social Science","Science","Tech & Engineering"
"dev","0.3333333333333333","0.4","0.0","0.0","0.2","0.2","0.6","0.2","0.2","0.2","0.4","0.6","0.2","0.8","0.6","0.2","0.6","0.0","0.4","0.8","0.2","0.2","0.2","0.8","0.8","0.0","0.2","0.2","0.2","0.0","0.6","0.25","0.44","0.24","0.5","0.28","0.3142857142857143"
"validation","0.36333333333333334","0.3333333333333333","0.4","0.26666666666666666","0.43333333333333335","0.36666666666666664","0.43333333333333335","0.23333333333333334","0.3","0.4","0.3","0.4666666666666667","0.36666666666666664","0.36666666666666664","0.5","0.26666666666666666","0.4","0.23333333333333334","0.43333333333333335","0.7666666666666667","0.43333333333333335","0.3333333333333333","0.3","0.4","0.23333333333333334","0.3","0.2","0.26666666666666666","0.36666666666666664","0.36666666666666664","0.43333333333333335","0.39166666666666666","0.37333333333333335","0.35333333333333333","0.5","0.2866666666666667","0.3238095238095238"

torchrun --nproc-per-node=8 run.py --data MMMU_DEV_VAL --model InternVL2-4B --verbose

The expected test results are:

"split","Overall","Accounting","Agriculture","Architecture_and_Engineering","Art","Art_Theory","Basic_Medical_Science","Biology","Chemistry","Clinical_Medicine","Computer_Science","Design","Diagnostics_and_Laboratory_Medicine","Economics","Electronics","Energy_and_Power","Finance","Geography","History","Literature","Manage","Marketing","Materials","Math","Mechanical_Engineering","Music","Pharmacy","Physics","Psychology","Public_Health","Sociology","Art & Design","Business","Health & Medicine","Humanities & Social Science","Science","Tech & Engineering"
"validation","0.47888888888888886","0.43333333333333335","0.5333333333333333","0.3","0.6","0.6","0.43333333333333335","0.36666666666666664","0.36666666666666664","0.3333333333333333","0.4","0.9","0.4666666666666667","0.5666666666666667","0.43333333333333335","0.4666666666666667","0.4","0.36666666666666664","0.5666666666666667","0.8333333333333334","0.5666666666666667","0.43333333333333335","0.36666666666666664","0.3333333333333333","0.26666666666666666","0.3333333333333333","0.43333333333333335","0.3333333333333333","0.6666666666666666","0.5666666666666667","0.7","0.6083333333333333","0.48","0.44666666666666666","0.6916666666666667","0.35333333333333333","0.3952380952380952"
"dev","0.4866666666666667","0.2","0.2","0.4","0.6","0.6","0.8","1.0","0.4","0.0","0.4","0.6","0.2","0.6","0.4","0.4","0.4","0.0","1.0","0.8","0.6","0.6","0.2","0.6","0.6","0.4","0.4","0.2","0.8","0.6","0.6","0.55","0.48","0.4","0.8","0.44","0.37142857142857144"

torchrun --nproc-per-node=8 run.py --data MMMU_DEV_VAL --model InternVL2-8B --verbose

The expected test results are:

"split","Overall","Accounting","Agriculture","Architecture_and_Engineering","Art","Art_Theory","Basic_Medical_Science","Biology","Chemistry","Clinical_Medicine","Computer_Science","Design","Diagnostics_and_Laboratory_Medicine","Economics","Electronics","Energy_and_Power","Finance","Geography","History","Literature","Manage","Marketing","Materials","Math","Mechanical_Engineering","Music","Pharmacy","Physics","Psychology","Public_Health","Sociology","Art & Design","Business","Health & Medicine","Humanities & Social Science","Science","Tech & Engineering"
"dev","0.49333333333333335","0.2","0.2","0.4","0.6","0.8","0.6","1.0","0.2","0.2","0.6","0.6","0.4","0.2","0.6","0.4","0.6","0.0","1.0","1.0","0.6","0.6","0.2","0.6","0.4","0.2","0.6","0.4","0.6","0.4","0.6","0.55","0.44","0.44","0.8","0.44","0.4"
"validation","0.5177777777777778","0.5333333333333333","0.5333333333333333","0.3","0.7","0.7","0.4666666666666667","0.5","0.5","0.7","0.6333333333333333","0.7","0.43333333333333335","0.5333333333333333","0.4666666666666667","0.4","0.3333333333333333","0.4666666666666667","0.7","0.9","0.5333333333333333","0.5333333333333333","0.3333333333333333","0.5","0.4","0.36666666666666664","0.3333333333333333","0.26666666666666666","0.6","0.5666666666666667","0.6","0.6166666666666667","0.49333333333333335","0.5","0.7","0.44666666666666666","0.4380952380952381"

torchrun --nproc-per-node=8 run.py --data MMMU_DEV_VAL --model InternVL2-26B --verbose

The expected test results are:

"split","Overall","Accounting","Agriculture","Architecture_and_Engineering","Art","Art_Theory","Basic_Medical_Science","Biology","Chemistry","Clinical_Medicine","Computer_Science","Design","Diagnostics_and_Laboratory_Medicine","Economics","Electronics","Energy_and_Power","Finance","Geography","History","Literature","Manage","Marketing","Materials","Math","Mechanical_Engineering","Music","Pharmacy","Physics","Psychology","Public_Health","Sociology","Art & Design","Business","Health & Medicine","Humanities & Social Science","Science","Tech & Engineering"
"dev","0.5266666666666666","0.4","0.4","0.2","0.8","0.8","0.6","0.4","0.4","0.0","0.6","0.6","0.2","0.2","0.6","0.4","1.0","0.0","1.0","0.8","0.6","0.6","0.4","0.6","0.8","0.6","0.6","0.4","0.8","0.4","0.6","0.7","0.56","0.36","0.8","0.36","0.4857142857142857"
"validation","0.5122222222222222","0.43333333333333335","0.4666666666666667","0.26666666666666666","0.8","0.8666666666666667","0.5666666666666667","0.5666666666666667","0.3333333333333333","0.5666666666666667","0.4666666666666667","0.8333333333333334","0.36666666666666664","0.4","0.5","0.4666666666666667","0.4","0.5333333333333333","0.7","0.9","0.5666666666666667","0.4666666666666667","0.36666666666666664","0.3333333333333333","0.4","0.3","0.3333333333333333","0.3333333333333333","0.6","0.6","0.6333333333333333","0.7","0.4533333333333333","0.4866666666666667","0.7083333333333334","0.42","0.41904761904761906"

torchrun --nproc-per-node=8 run.py --data MMMU_DEV_VAL --model InternVL2-40B --verbose

The expected test results are:

"split","Overall","Accounting","Agriculture","Architecture_and_Engineering","Art","Art_Theory","Basic_Medical_Science","Biology","Chemistry","Clinical_Medicine","Computer_Science","Design","Diagnostics_and_Laboratory_Medicine","Economics","Electronics","Energy_and_Power","Finance","Geography","History","Literature","Manage","Marketing","Materials","Math","Mechanical_Engineering","Music","Pharmacy","Physics","Psychology","Public_Health","Sociology","Art & Design","Business","Health & Medicine","Humanities & Social Science","Science","Tech & Engineering"
"validation","0.5522222222222222","0.4","0.6","0.36666666666666664","0.7","0.8666666666666667","0.5333333333333333","0.5333333333333333","0.4666666666666667","0.6","0.5666666666666667","0.7333333333333333","0.36666666666666664","0.6","0.4666666666666667","0.4666666666666667","0.43333333333333335","0.5333333333333333","0.7666666666666667","0.8333333333333334","0.4666666666666667","0.5666666666666667","0.3333333333333333","0.43333333333333335","0.36666666666666664","0.3","0.7","0.5333333333333333","0.6333333333333333","0.8","0.6","0.65","0.49333333333333335","0.6","0.7083333333333334","0.5","0.4523809523809524"
"dev","0.54","0.2","0.2","0.4","1.0","0.8","0.8","0.6","0.2","0.4","0.6","0.6","0.4","0.2","0.4","0.4","0.8","0.0","1.0","1.0","0.6","0.6","0.4","0.4","0.8","0.4","0.8","0.4","0.8","0.4","0.6","0.7","0.48","0.56","0.85","0.32","0.45714285714285713"

torchrun --nproc-per-node=1 run.py --data MMMU_DEV_VAL --model InternVL2-76B --verbose

The expected test results are:

"split","Overall","Accounting","Agriculture","Architecture_and_Engineering","Art","Art_Theory","Basic_Medical_Science","Biology","Chemistry","Clinical_Medicine","Computer_Science","Design","Diagnostics_and_Laboratory_Medicine","Economics","Electronics","Energy_and_Power","Finance","Geography","History","Literature","Manage","Marketing","Materials","Math","Mechanical_Engineering","Music","Pharmacy","Physics","Psychology","Public_Health","Sociology","Art & Design","Business","Health & Medicine","Humanities & Social Science","Science","Tech & Engineering"
"validation","0.5822222222222222","0.5","0.6333333333333333","0.4666666666666667","0.7666666666666667","0.9666666666666667","0.5333333333333333","0.5","0.5","0.6666666666666666","0.6333333333333333","0.7666666666666667","0.43333333333333335","0.5333333333333333","0.6","0.4","0.6333333333333333","0.4666666666666667","0.7","0.9","0.7333333333333333","0.6","0.3","0.3","0.4666666666666667","0.3333333333333333","0.5666666666666667","0.5333333333333333","0.7","0.7","0.6333333333333333","0.7083333333333334","0.6","0.58","0.7333333333333333","0.46","0.5"
"dev","0.5666666666666667","0.2","0.2","0.4","0.8","0.8","0.8","1.0","0.2","0.4","0.6","0.6","0.6","0.2","0.4","0.4","1.0","0.0","1.0","1.0","0.8","0.4","0.2","0.6","1.0","0.2","0.6","0.4","0.8","0.6","0.8","0.6","0.52","0.6","0.9","0.44","0.45714285714285713"
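
The order of the `dev` and `validation` rows varies between the result files above, so it is safer to look scores up by split name than by row position. A minimal sketch of one way to do that, assuming the result is available as CSV text in the format shown; the `scores_by_split` helper is hypothetical and not part of VLMEvalKit, and the sample is a truncated excerpt (first three columns) of the InternVL2-76B output:

```python
import csv
import io

# Truncated excerpt of the InternVL2-76B output above (first three columns only).
RESULT_CSV = '''"split","Overall","Accounting"
"validation","0.5822222222222222","0.5"
"dev","0.5666666666666667","0.2"
'''


def scores_by_split(text):
    """Map each split name to {column name: float score}."""
    rows = csv.DictReader(io.StringIO(text))
    return {
        row["split"]: {k: float(v) for k, v in row.items() if k != "split"}
        for row in rows
    }


scores = scores_by_split(RESULT_CSV)
# Look up by split name, not by row index, because row order is not fixed.
print(f'validation Overall: {scores["validation"]["Overall"]:.1%}')
print(f'dev Overall: {scores["dev"]["Overall"]:.1%}')
```

The same helper should work on a full result file if its text is read from disk and passed in unchanged.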

RealWorldQA#

The RealWorldQA dataset is a benchmark designed to evaluate the real-world spatial understanding capabilities of multimodal AI models. It consists of over 700 images, each accompanied by a question and a verifiable answer, focusing on various real-world scenarios, including those captured from vehicles. This dataset aims to test how well AI models comprehend physical environments and spatial relations, enhancing their ability to interpret and analyze real-world scenes.

torchrun --nproc-per-node=8 run.py --data RealWorldQA --model InternVL2-1B --verbose

The expected test results are:

"split","Overall"
"none","0.5032679738562091"

torchrun --nproc-per-node=8 run.py --data RealWorldQA --model InternVL2-2B --verbose

The expected test results are:

"split","Overall"
"none","0.5725490196078431"

torchrun --nproc-per-node=8 run.py --data RealWorldQA --model InternVL2-4B --verbose

The expected test results are:

"split","Overall"
"none","0.6065359477124183"

torchrun --nproc-per-node=8 run.py --data RealWorldQA --model InternVL2-8B --verbose

The expected test results are:

"split","Overall"
"none","0.6444444444444445"

torchrun --nproc-per-node=8 run.py --data RealWorldQA --model InternVL2-26B --verbose

The expected test results are:

"split","Overall"
"none","0.6836601307189543"

torchrun --nproc-per-node=8 run.py --data RealWorldQA --model InternVL2-40B --verbose

The expected test results are:

"split","Overall"
"none","0.7176470588235294"

torchrun --nproc-per-node=1 run.py --data RealWorldQA --model InternVL2-76B --verbose

The expected test results are:

"split","Overall"
"none","0.7215686274509804"
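
Because small differences between runs are expected, a reproduced score is better compared with the documented one using a tolerance than by exact equality. The sketch below collects the RealWorldQA scores listed above; the `matches_expected` helper and its default tolerance of 0.01 are illustrative assumptions, not a threshold defined by InternVL or VLMEvalKit:

```python
# Expected "Overall" accuracies copied from the RealWorldQA results above.
EXPECTED = {
    "InternVL2-1B": 0.5032679738562091,
    "InternVL2-2B": 0.5725490196078431,
    "InternVL2-4B": 0.6065359477124183,
    "InternVL2-8B": 0.6444444444444445,
    "InternVL2-26B": 0.6836601307189543,
    "InternVL2-40B": 0.7176470588235294,
    "InternVL2-76B": 0.7215686274509804,
}


def matches_expected(model, measured, tolerance=0.01):
    """Return True if `measured` is within `tolerance` of the documented score.

    The tolerance is an absolute difference on the 0-1 accuracy scale and is
    an assumption chosen for illustration only.
    """
    return abs(measured - EXPECTED[model]) <= tolerance


# Hypothetical measured value from a local run.
print(matches_expected("InternVL2-8B", 0.6431))
```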

MMVet (GPT-4-Turbo)#

torchrun --nproc-per-node=8 run.py --data MMVet --model InternVL2-1B --verbose

The expected test results are:

"Category","tot","acc"
"rec","187","37.27272727272725"
"ocr","108","37.96296296296297"
"know","84","14.76190476190476"
"gen","80","14.624999999999996"
"spat","75","33.733333333333334"
"math","26","22.692307692307693"
"Overall","218","33.25688073394493"

torchrun --nproc-per-node=8 run.py --data MMVet --model InternVL2-2B --verbose

The expected test results are:

"Category","tot","acc"
"rec","187","41.71122994652404"
"ocr","108","44.62962962962963"
"know","84","24.999999999999993"
"gen","80","26.25"
"spat","75","40.800000000000004"
"math","26","30.76923076923077"
"Overall","218","39.541284403669714"

torchrun --nproc-per-node=8 run.py --data MMVet --model InternVL2-4B --verbose

The expected test results are:

"Category","tot","acc"
"rec","187","50.000000000000036"
"ocr","108","58.611111111111114"
"know","84","37.26190476190476"
"gen","80","36.499999999999986"
"spat","75","47.20000000000001"
"math","26","57.30769230769231"
"Overall","218","51.00917431192664"

torchrun --nproc-per-node=8 run.py --data MMVet --model InternVL2-8B --verbose

The expected test results are:

"Category","tot","acc"
"rec","187","51.81818181818184"
"ocr","108","63.42592592592594"
"know","84","36.904761904761905"
"gen","80","35.87499999999999"
"spat","75","61.86666666666667"
"math","26","60.769230769230774"
"Overall","218","54.174311926605526"

torchrun --nproc-per-node=8 run.py --data MMVet --model InternVL2-26B --verbose

The expected test results are:

"Category","tot","acc"
"rec","187","62.67379679144389"
"ocr","108","69.72222222222223"
"know","84","50.119047619047606"
"gen","80","48.62499999999999"
"spat","75","61.066666666666656"
"math","26","61.53846153846154"
"Overall","218","62.1100917431193"

torchrun --nproc-per-node=8 run.py --data MMVet --model InternVL2-40B --verbose

The expected test results are:

"Category","tot","acc"
"rec","187","66.25668449197867"
"ocr","108","70.18518518518522"
"know","84","54.40476190476189"
"gen","80","54.74999999999998"
"spat","75","68.53333333333332"
"math","26","64.23076923076924"
"Overall","218","65.50458715596335"

torchrun --nproc-per-node=1 run.py --data MMVet --model InternVL2-76B --verbose

The expected test results are:

"Category","tot","acc"
"rec","187","65.66844919786104"
"ocr","108","70.09259259259262"
"know","84","58.3333333333333"
"gen","80","58.49999999999997"
"spat","75","60.79999999999999"
"math","26","75.76923076923077"
"Overall","218","65.7339449541285"

Note that the version of GPT-4 used for scoring here differs from the one used by the official server, so scores obtained with VLMEvalKit will differ slightly from the official ones.

LLaVA-Bench (GPT-4-Turbo)#

The LLaVA-Bench-in-the-Wild dataset is designed to evaluate the capabilities of MLLMs in handling more complex and diverse visual tasks. It includes a set of 24 images with 60 associated questions, covering a range of indoor and outdoor scenes, memes, paintings, and sketches. Each image is paired with detailed, manually curated descriptions and questions that test the model’s generalizability to novel domains.

torchrun --nproc-per-node=8 run.py --data LLaVABench --model InternVL2-1B --verbose

The expected test results are:

"split","Relative Score (main)","VLM Score","GPT4 Score"
"overall","51.6","39.5","76.5"
"detail","58.9","37.3","63.3"
"conv","43.0","40.0","92.9"
"complex","54.9","40.4","73.6"

torchrun --nproc-per-node=8 run.py --data LLaVABench --model InternVL2-2B --verbose

The expected test results are:

"split","Relative Score (main)","VLM Score","GPT4 Score"
"overall","62.5","47.8","76.5"
"detail","61.8","42.0","68.0"
"complex","63.5","46.1","72.5"
"conv","61.7","55.9","90.6"

torchrun --nproc-per-node=8 run.py --data LLaVABench --model InternVL2-4B --verbose

The expected test results are:

"split","Relative Score (main)","VLM Score","GPT4 Score"
"overall","68.2","51.0","74.8"
"conv","62.3","55.3","88.8"
"detail","65.3","42.7","65.3"
"complex","74.0","52.9","71.4"

torchrun --nproc-per-node=8 run.py --data LLaVABench --model InternVL2-8B --verbose

The expected test results are:

"split","Relative Score (main)","VLM Score","GPT4 Score"
"overall","73.2","53.3","72.8"
"complex","86.1","61.8","71.8"
"conv","61.6","54.7","88.8"
"detail","63.5","36.0","56.7"

torchrun --nproc-per-node=8 run.py --data LLaVABench --model InternVL2-26B --verbose

The expected test results are:

"split","Relative Score (main)","VLM Score","GPT4 Score"
"overall","92.3","68.0","73.7"
"detail","85.6","51.3","60.0"
"complex","99.0","73.6","74.3"
"conv","86.8","73.5","84.7"

torchrun --nproc-per-node=8 run.py --data LLaVABench --model InternVL2-40B --verbose

The expected test results are:

"split","Relative Score (main)","VLM Score","GPT4 Score"
"overall","100.5","72.7","72.3"
"detail","90.4","56.7","62.7"
"complex","104.4","76.1","72.9"
"conv","101.5","81.2","80.0"

torchrun --nproc-per-node=1 run.py --data LLaVABench --model InternVL2-76B --verbose

The expected test results are:

"split","Relative Score (main)","VLM Score","GPT4 Score"
"overall","99.3","71.7","72.2"
"detail","92.1","54.7","59.3"
"complex","107.7","79.6","73.9"
"conv","91.2","73.5","80.6"
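
The three columns in these tables appear to be related: the reported values are consistent with the relative score being 100 × VLM Score / GPT4 Score. This relation is inferred from the numbers rather than stated in the source, and because the table values are rounded to one decimal, a recomputed score can be off by about 0.1. A small sketch that checks it against the InternVL2-1B table in this section:

```python
# (split, relative score, VLM score, GPT4 score) for InternVL2-1B,
# copied from the first LLaVA-Bench table in this section.
ROWS = [
    ("overall", 51.6, 39.5, 76.5),
    ("detail", 58.9, 37.3, 63.3),
    ("conv", 43.0, 40.0, 92.9),
    ("complex", 54.9, 40.4, 73.6),
]


def relative_score(vlm_score, gpt4_score):
    """Model score as a percentage of the GPT-4 reference score (inferred relation)."""
    return 100.0 * vlm_score / gpt4_score


for split, reported, vlm, gpt4 in ROWS:
    recomputed = relative_score(vlm, gpt4)
    print(f"{split:8s} reported={reported:5.1f} recomputed={recomputed:5.1f}")
```

Under this reading, a relative score above 100, as in some InternVL2-40B and InternVL2-76B rows, means the judge scored the model's answers higher than the GPT-4 reference answers for that split.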

VideoMME#

The Video-MME dataset is a comprehensive benchmark designed to evaluate the capabilities of MLLMs in video analysis. It is the first benchmark specifically tailored for this purpose, focusing on a high-quality assessment of models’ performance in processing sequential visual data.

When testing without subtitles:

torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-1B --verbose --nframe 16

The expected test results are:

{
    "short": {
        "overall": "0.5289",
        "domain": {
            "Knowledge": "0.5481",
            "Film & Television": "0.6167",
            "Sports Competition": "0.4667",
            "Artistic Performance": "0.5333",
            "Life Record": "0.5143",
            "Multilingual": "0.4000"
        },
        "sub_category": {
            "Humanity & History": "0.3333",
            "Literature & Art": "0.4000",
            "Biology & Medicine": "0.7000",
            "Finance & Commerce": "0.6333",
            "Astronomy": "0.5667",
            "Geography": "0.5333",
            "Law": "0.6000",
            "Life Tip": "0.5333",
            "Technology": "0.6333",
            "Animation": "0.6000",
            "Movie & TV Show": "0.7333",
            "Documentary": "0.5333",
            "News Report": "0.6000",
            "Esports": "0.3667",
            "Basketball": "0.3667",
            "Football": "0.5333",
            "Athletics": "0.5333",
            "Other Sports": "0.5333",
            "Stage Play": "0.7333",
            "Magic Show": "0.3333",
            "Variety Show": "0.6333",
            "Acrobatics": "0.4333",
            "Handicraft": "0.4667",
            "Food": "0.5000",
            "Fashion": "0.6333",
            "Daily Life": "0.4000",
            "Travel": "0.6333",
            "Pet & Animal": "0.6667",
            "Exercise": "0.3000",
            "Multilingual": "0.4000"
        },
        "task_type": {
            "Temporal Perception": "0.6667",
            "Spatial Perception": "0.6000",
            "Attribute Perception": "0.6721",
            "Action Recognition": "0.4427",
            "Object Recognition": "0.4821",
            "OCR Problems": "0.6316",
            "Counting Problem": "0.3040",
            "Temporal Reasoning": "0.6154",
            "Spatial Reasoning": "0.6667",
            "Action Reasoning": "0.6170",
            "Object Reasoning": "0.4750",
            "Information Synopsis": "0.7073"
        }
    },
    "medium": {
        "overall": "0.4144",
        "domain": {
            "Knowledge": "0.3630",
            "Film & Television": "0.5250",
            "Sports Competition": "0.3933",
            "Artistic Performance": "0.4750",
            "Life Record": "0.3952",
            "Multilingual": "0.4333"
        },
        "sub_category": {
            "Humanity & History": "0.2000",
            "Literature & Art": "0.4000",
            "Biology & Medicine": "0.5000",
            "Finance & Commerce": "0.4333",
            "Astronomy": "0.4333",
            "Geography": "0.2333",
            "Law": "0.4000",
            "Life Tip": "0.4333",
            "Technology": "0.2333",
            "Animation": "0.3333",
            "Movie & TV Show": "0.5333",
            "Documentary": "0.6000",
            "News Report": "0.6333",
            "Esports": "0.5000",
            "Basketball": "0.1333",
            "Football": "0.4333",
            "Athletics": "0.3333",
            "Other Sports": "0.5667",
            "Stage Play": "0.5667",
            "Magic Show": "0.3333",
            "Variety Show": "0.5000",
            "Acrobatics": "0.5000",
            "Handicraft": "0.4667",
            "Food": "0.3000",
            "Fashion": "0.3667",
            "Daily Life": "0.3333",
            "Travel": "0.4333",
            "Pet & Animal": "0.4000",
            "Exercise": "0.4667",
            "Multilingual": "0.4333"
        },
        "task_type": {
            "Temporal Perception": "0.3871",
            "Spatial Perception": "0.6190",
            "Attribute Perception": "0.4110",
            "Action Recognition": "0.3613",
            "Object Recognition": "0.5000",
            "OCR Problems": "0.4706",
            "Counting Problem": "0.2526",
            "Temporal Reasoning": "0.2740",
            "Spatial Reasoning": "0.6667",
            "Action Reasoning": "0.3276",
            "Object Reasoning": "0.4179",
            "Information Synopsis": "0.5897"
        }
    },
    "long": {
        "overall": "0.3333",
        "domain": {
            "Knowledge": "0.3259",
            "Film & Television": "0.3250",
            "Sports Competition": "0.3000",
            "Artistic Performance": "0.3167",
            "Life Record": "0.3762",
            "Multilingual": "0.3667"
        },
        "sub_category": {
            "Humanity & History": "0.3333",
            "Literature & Art": "0.3667",
            "Biology & Medicine": "0.3333",
            "Finance & Commerce": "0.4667",
            "Astronomy": "0.2000",
            "Geography": "0.3000",
            "Law": "0.2667",
            "Life Tip": "0.3000",
            "Technology": "0.3667",
            "Animation": "0.2000",
            "Movie & TV Show": "0.4667",
            "Documentary": "0.4000",
            "News Report": "0.2333",
            "Esports": "0.4000",
            "Basketball": "0.3333",
            "Football": "0.2333",
            "Athletics": "0.1333",
            "Other Sports": "0.4000",
            "Stage Play": "0.4000",
            "Magic Show": "0.2667",
            "Variety Show": "0.1333",
            "Acrobatics": "0.4667",
            "Handicraft": "0.5333",
            "Food": "0.4333",
            "Fashion": "0.3333",
            "Daily Life": "0.3667",
            "Travel": "0.2000",
            "Pet & Animal": "0.4333",
            "Exercise": "0.3333",
            "Multilingual": "0.3667"
        },
        "task_type": {
            "Temporal Perception": "0.3333",
            "Spatial Perception": "0.0000",
            "Attribute Perception": "0.5185",
            "Action Recognition": "0.3016",
            "Object Recognition": "0.2963",
            "OCR Problems": "0.5000",
            "Counting Problem": "0.1250",
            "Temporal Reasoning": "0.2857",
            "Spatial Reasoning": "0.6364",
            "Action Reasoning": "0.2556",
            "Object Reasoning": "0.3042",
            "Information Synopsis": "0.5153"
        }
    },
    "overall": {
        "overall": "0.4256",
        "domain": {
            "Knowledge": "0.4123",
            "Film & Television": "0.4889",
            "Sports Competition": "0.3867",
            "Artistic Performance": "0.4417",
            "Life Record": "0.4286",
            "Multilingual": "0.4000"
        },
        "sub_category": {
            "Humanity & History": "0.2889",
            "Literature & Art": "0.3889",
            "Biology & Medicine": "0.5111",
            "Finance & Commerce": "0.5111",
            "Astronomy": "0.4000",
            "Geography": "0.3556",
            "Law": "0.4222",
            "Life Tip": "0.4222",
            "Technology": "0.4111",
            "Animation": "0.3778",
            "Movie & TV Show": "0.5778",
            "Documentary": "0.5111",
            "News Report": "0.4889",
            "Esports": "0.4222",
            "Basketball": "0.2778",
            "Football": "0.4000",
            "Athletics": "0.3333",
            "Other Sports": "0.5000",
            "Stage Play": "0.5667",
            "Magic Show": "0.3111",
            "Variety Show": "0.4222",
            "Acrobatics": "0.4667",
            "Handicraft": "0.4889",
            "Food": "0.4111",
            "Fashion": "0.4444",
            "Daily Life": "0.3667",
            "Travel": "0.4222",
            "Pet & Animal": "0.5000",
            "Exercise": "0.3667",
            "Multilingual": "0.4000"
        },
        "task_type": {
            "Temporal Perception": "0.4727",
            "Spatial Perception": "0.5741",
            "Attribute Perception": "0.5676",
            "Action Recognition": "0.3834",
            "Object Recognition": "0.4605",
            "OCR Problems": "0.5396",
            "Counting Problem": "0.2537",
            "Temporal Reasoning": "0.3051",
            "Spatial Reasoning": "0.6607",
            "Action Reasoning": "0.3298",
            "Object Reasoning": "0.3678",
            "Information Synopsis": "0.5820"
        }
    }
}
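
The result file stores every score as a string, so values need converting before any arithmetic. A minimal sketch of reading the per-duration scores, using an excerpt of the InternVL2-1B result above with only the `overall` fields kept; the `overall_by_duration` helper is hypothetical. The top-level `overall` score agrees with the plain mean of the short, medium, and long scores to within rounding, which is consistent with the three groups containing the same number of questions:

```python
import json

# Excerpt of the InternVL2-1B result without subtitles; only "overall" fields kept.
RESULT_JSON = """
{
    "short": {"overall": "0.5289"},
    "medium": {"overall": "0.4144"},
    "long": {"overall": "0.3333"},
    "overall": {"overall": "0.4256"}
}
"""


def overall_by_duration(result):
    """Return {group name: float score}; the file stores scores as strings."""
    return {name: float(group["overall"]) for name, group in result.items()}


scores = overall_by_duration(json.loads(RESULT_JSON))
durations = ("short", "medium", "long")
mean_score = sum(scores[d] for d in durations) / len(durations)

for name in durations:
    print(f"{name:6s} {scores[name]:.4f}")
# The mean is computed from already-rounded values, so it may differ from the
# reported overall score in the last digit.
print(f"mean   {mean_score:.4f} (reported overall: {scores['overall']:.4f})")
```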

When testing with subtitles:

torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-1B --verbose --nframe 16 --use-subtitle

The expected test results are:

{
    "short": {
        "overall": "0.5433",
        "domain": {
            "Knowledge": "0.5630",
            "Film & Television": "0.6000",
            "Sports Competition": "0.4933",
            "Artistic Performance": "0.5167",
            "Life Record": "0.5571",
            "Multilingual": "0.4000"
        },
        "sub_category": {
            "Humanity & History": "0.3333",
            "Literature & Art": "0.4000",
            "Biology & Medicine": "0.7667",
            "Finance & Commerce": "0.6000",
            "Astronomy": "0.6000",
            "Geography": "0.5000",
            "Law": "0.6667",
            "Life Tip": "0.6000",
            "Technology": "0.6000",
            "Animation": "0.5667",
            "Movie & TV Show": "0.7333",
            "Documentary": "0.5000",
            "News Report": "0.6000",
            "Esports": "0.4333",
            "Basketball": "0.4000",
            "Football": "0.5000",
            "Athletics": "0.5000",
            "Other Sports": "0.6333",
            "Stage Play": "0.7667",
            "Magic Show": "0.3333",
            "Variety Show": "0.5333",
            "Acrobatics": "0.4333",
            "Handicraft": "0.5000",
            "Food": "0.6000",
            "Fashion": "0.6333",
            "Daily Life": "0.4333",
            "Travel": "0.7333",
            "Pet & Animal": "0.6667",
            "Exercise": "0.3333",
            "Multilingual": "0.4000"
        },
        "task_type": {
            "Temporal Perception": "0.5556",
            "Spatial Perception": "0.5667",
            "Attribute Perception": "0.6557",
            "Action Recognition": "0.4656",
            "Object Recognition": "0.5238",
            "OCR Problems": "0.6667",
            "Counting Problem": "0.3120",
            "Temporal Reasoning": "0.4615",
            "Spatial Reasoning": "0.6296",
            "Action Reasoning": "0.5957",
            "Object Reasoning": "0.5375",
            "Information Synopsis": "0.7561"
        }
    },
    "medium": {
        "overall": "0.4289",
        "domain": {
            "Knowledge": "0.4111",
            "Film & Television": "0.5250",
            "Sports Competition": "0.4000",
            "Artistic Performance": "0.4917",
            "Life Record": "0.3714",
            "Multilingual": "0.5000"
        },
        "sub_category": {
            "Humanity & History": "0.3667",
            "Literature & Art": "0.4333",
            "Biology & Medicine": "0.5667",
            "Finance & Commerce": "0.5000",
            "Astronomy": "0.5333",
            "Geography": "0.3333",
            "Law": "0.3333",
            "Life Tip": "0.4000",
            "Technology": "0.2333",
            "Animation": "0.2667",
            "Movie & TV Show": "0.5000",
            "Documentary": "0.6333",
            "News Report": "0.7000",
            "Esports": "0.5000",
            "Basketball": "0.1667",
            "Football": "0.4333",
            "Athletics": "0.3667",
            "Other Sports": "0.5333",
            "Stage Play": "0.6333",
            "Magic Show": "0.4333",
            "Variety Show": "0.4333",
            "Acrobatics": "0.4667",
            "Handicraft": "0.5000",
            "Food": "0.3333",
            "Fashion": "0.3333",
            "Daily Life": "0.3000",
            "Travel": "0.4000",
            "Pet & Animal": "0.3000",
            "Exercise": "0.4333",
            "Multilingual": "0.5000"
        },
        "task_type": {
            "Temporal Perception": "0.4194",
            "Spatial Perception": "0.6667",
            "Attribute Perception": "0.4658",
            "Action Recognition": "0.3613",
            "Object Recognition": "0.4924",
            "OCR Problems": "0.4265",
            "Counting Problem": "0.2632",
            "Temporal Reasoning": "0.2877",
            "Spatial Reasoning": "0.7222",
            "Action Reasoning": "0.3276",
            "Object Reasoning": "0.4403",
            "Information Synopsis": "0.6538"
        }
    },
    "long": {
        "overall": "0.3689",
        "domain": {
            "Knowledge": "0.3852",
            "Film & Television": "0.3833",
            "Sports Competition": "0.3267",
            "Artistic Performance": "0.3417",
            "Life Record": "0.3905",
            "Multilingual": "0.3333"
        },
        "sub_category": {
            "Humanity & History": "0.2333",
            "Literature & Art": "0.4333",
            "Biology & Medicine": "0.4333",
            "Finance & Commerce": "0.6000",
            "Astronomy": "0.2667",
            "Geography": "0.2667",
            "Law": "0.5000",
            "Life Tip": "0.4333",
            "Technology": "0.3000",
            "Animation": "0.2667",
            "Movie & TV Show": "0.4667",
            "Documentary": "0.5000",
            "News Report": "0.3000",
            "Esports": "0.3667",
            "Basketball": "0.2667",
            "Football": "0.3667",
            "Athletics": "0.2000",
            "Other Sports": "0.4333",
            "Stage Play": "0.4333",
            "Magic Show": "0.2333",
            "Variety Show": "0.2333",
            "Acrobatics": "0.4667",
            "Handicraft": "0.4667",
            "Food": "0.4333",
            "Fashion": "0.3667",
            "Daily Life": "0.4000",
            "Travel": "0.1667",
            "Pet & Animal": "0.5333",
            "Exercise": "0.3667",
            "Multilingual": "0.3333"
        },
        "task_type": {
            "Temporal Perception": "0.3333",
            "Spatial Perception": "0.0000",
            "Attribute Perception": "0.5185",
            "Action Recognition": "0.3016",
            "Object Recognition": "0.3148",
            "OCR Problems": "0.2857",
            "Counting Problem": "0.1875",
            "Temporal Reasoning": "0.2637",
            "Spatial Reasoning": "0.5455",
            "Action Reasoning": "0.3278",
            "Object Reasoning": "0.3667",
            "Information Synopsis": "0.5521"
        }
    },
    "overall": {
        "overall": "0.4470",
        "domain": {
            "Knowledge": "0.4531",
            "Film & Television": "0.5028",
            "Sports Competition": "0.4067",
            "Artistic Performance": "0.4500",
            "Life Record": "0.4397",
            "Multilingual": "0.4111"
        },
        "sub_category": {
            "Humanity & History": "0.3111",
            "Literature & Art": "0.4222",
            "Biology & Medicine": "0.5889",
            "Finance & Commerce": "0.5667",
            "Astronomy": "0.4667",
            "Geography": "0.3667",
            "Law": "0.5000",
            "Life Tip": "0.4778",
            "Technology": "0.3778",
            "Animation": "0.3667",
            "Movie & TV Show": "0.5667",
            "Documentary": "0.5444",
            "News Report": "0.5333",
            "Esports": "0.4333",
            "Basketball": "0.2778",
            "Football": "0.4333",
            "Athletics": "0.3556",
            "Other Sports": "0.5333",
            "Stage Play": "0.6111",
            "Magic Show": "0.3333",
            "Variety Show": "0.4000",
            "Acrobatics": "0.4556",
            "Handicraft": "0.4889",
            "Food": "0.4556",
            "Fashion": "0.4444",
            "Daily Life": "0.3778",
            "Travel": "0.4333",
            "Pet & Animal": "0.5000",
            "Exercise": "0.3778",
            "Multilingual": "0.4111"
        },
        "task_type": {
            "Temporal Perception": "0.4545",
            "Spatial Perception": "0.5741",
            "Attribute Perception": "0.5766",
            "Action Recognition": "0.3930",
            "Object Recognition": "0.4802",
            "OCR Problems": "0.5108",
            "Counting Problem": "0.2724",
            "Temporal Reasoning": "0.2881",
            "Spatial Reasoning": "0.6429",
            "Action Reasoning": "0.3719",
            "Object Reasoning": "0.4185",
            "Information Synopsis": "0.6285"
        }
    }
}

When testing without subtitles:

torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-2B --verbose --nframe 16

The expected test results are:

{
    "short": {
        "overall": "0.5756",
        "domain": {
            "Knowledge": "0.5593",
            "Film & Television": "0.6417",
            "Sports Competition": "0.5800",
            "Artistic Performance": "0.5917",
            "Life Record": "0.5810",
            "Multilingual": "0.3333"
        },
        "sub_category": {
            "Humanity & History": "0.5333",
            "Literature & Art": "0.4333",
            "Biology & Medicine": "0.6667",
            "Finance & Commerce": "0.4667",
            "Astronomy": "0.5333",
            "Geography": "0.6000",
            "Law": "0.5667",
            "Life Tip": "0.6667",
            "Technology": "0.5667",
            "Animation": "0.6000",
            "Movie & TV Show": "0.6000",
            "Documentary": "0.6000",
            "News Report": "0.7667",
            "Esports": "0.5667",
            "Basketball": "0.4667",
            "Football": "0.6333",
            "Athletics": "0.5667",
            "Other Sports": "0.6667",
            "Stage Play": "0.7333",
            "Magic Show": "0.4333",
            "Variety Show": "0.6667",
            "Acrobatics": "0.5333",
            "Handicraft": "0.4000",
            "Food": "0.6000",
            "Fashion": "0.5333",
            "Daily Life": "0.6667",
            "Travel": "0.6000",
            "Pet & Animal": "0.7667",
            "Exercise": "0.5000",
            "Multilingual": "0.3333"
        },
        "task_type": {
            "Temporal Perception": "0.7222",
            "Spatial Perception": "0.7333",
            "Attribute Perception": "0.6967",
            "Action Recognition": "0.5115",
            "Object Recognition": "0.5536",
            "OCR Problems": "0.7368",
            "Counting Problem": "0.3120",
            "Temporal Reasoning": "0.3846",
            "Spatial Reasoning": "0.7407",
            "Action Reasoning": "0.6809",
            "Object Reasoning": "0.5375",
            "Information Synopsis": "0.6951"
        }
    },
    "medium": {
        "overall": "0.4067",
        "domain": {
            "Knowledge": "0.3741",
            "Film & Television": "0.4917",
            "Sports Competition": "0.3333",
            "Artistic Performance": "0.5417",
            "Life Record": "0.3762",
            "Multilingual": "0.4000"
        },
        "sub_category": {
            "Humanity & History": "0.2000",
            "Literature & Art": "0.4333",
            "Biology & Medicine": "0.4000",
            "Finance & Commerce": "0.3667",
            "Astronomy": "0.4000",
            "Geography": "0.3000",
            "Law": "0.5333",
            "Life Tip": "0.5000",
            "Technology": "0.2333",
            "Animation": "0.3000",
            "Movie & TV Show": "0.5667",
            "Documentary": "0.5000",
            "News Report": "0.6000",
            "Esports": "0.3333",
            "Basketball": "0.2000",
            "Football": "0.2667",
            "Athletics": "0.5000",
            "Other Sports": "0.3667",
            "Stage Play": "0.6667",
            "Magic Show": "0.5000",
            "Variety Show": "0.5000",
            "Acrobatics": "0.5000",
            "Handicraft": "0.4333",
            "Food": "0.2000",
            "Fashion": "0.2667",
            "Daily Life": "0.3333",
            "Travel": "0.4333",
            "Pet & Animal": "0.3667",
            "Exercise": "0.6000",
            "Multilingual": "0.4000"
        },
        "task_type": {
            "Temporal Perception": "0.2903",
            "Spatial Perception": "0.5238",
            "Attribute Perception": "0.4932",
            "Action Recognition": "0.3025",
            "Object Recognition": "0.4924",
            "OCR Problems": "0.3676",
            "Counting Problem": "0.2737",
            "Temporal Reasoning": "0.3151",
            "Spatial Reasoning": "0.6667",
            "Action Reasoning": "0.3966",
            "Object Reasoning": "0.4104",
            "Information Synopsis": "0.5769"
        }
    },
    "long": {
        "overall": "0.3689",
        "domain": {
            "Knowledge": "0.3444",
            "Film & Television": "0.3500",
            "Sports Competition": "0.3933",
            "Artistic Performance": "0.3417",
            "Life Record": "0.4000",
            "Multilingual": "0.4333"
        },
        "sub_category": {
            "Humanity & History": "0.3000",
            "Literature & Art": "0.4667",
            "Biology & Medicine": "0.3667",
            "Finance & Commerce": "0.3667",
            "Astronomy": "0.2333",
            "Geography": "0.2333",
            "Law": "0.4667",
            "Life Tip": "0.3000",
            "Technology": "0.3667",
            "Animation": "0.2333",
            "Movie & TV Show": "0.4333",
            "Documentary": "0.4333",
            "News Report": "0.3000",
            "Esports": "0.4333",
            "Basketball": "0.3000",
            "Football": "0.3333",
            "Athletics": "0.3667",
            "Other Sports": "0.5333",
            "Stage Play": "0.3333",
            "Magic Show": "0.3667",
            "Variety Show": "0.1667",
            "Acrobatics": "0.5000",
            "Handicraft": "0.5000",
            "Food": "0.2000",
            "Fashion": "0.3667",
            "Daily Life": "0.4000",
            "Travel": "0.2667",
            "Pet & Animal": "0.6667",
            "Exercise": "0.4000",
            "Multilingual": "0.4333"
        },
        "task_type": {
            "Temporal Perception": "0.0000",
            "Spatial Perception": "0.3333",
            "Attribute Perception": "0.3704",
            "Action Recognition": "0.3968",
            "Object Recognition": "0.4074",
            "OCR Problems": "0.3571",
            "Counting Problem": "0.2292",
            "Temporal Reasoning": "0.3077",
            "Spatial Reasoning": "0.5455",
            "Action Reasoning": "0.3056",
            "Object Reasoning": "0.3375",
            "Information Synopsis": "0.5399"
        }
    },
    "overall": {
        "overall": "0.4504",
        "domain": {
            "Knowledge": "0.4259",
            "Film & Television": "0.4944",
            "Sports Competition": "0.4356",
            "Artistic Performance": "0.4917",
            "Life Record": "0.4524",
            "Multilingual": "0.3889"
        },
        "sub_category": {
            "Humanity & History": "0.3444",
            "Literature & Art": "0.4444",
            "Biology & Medicine": "0.4778",
            "Finance & Commerce": "0.4000",
            "Astronomy": "0.3889",
            "Geography": "0.3778",
            "Law": "0.5222",
            "Life Tip": "0.4889",
            "Technology": "0.3889",
            "Animation": "0.3778",
            "Movie & TV Show": "0.5333",
            "Documentary": "0.5111",
            "News Report": "0.5556",
            "Esports": "0.4444",
            "Basketball": "0.3222",
            "Football": "0.4111",
            "Athletics": "0.4778",
            "Other Sports": "0.5222",
            "Stage Play": "0.5778",
            "Magic Show": "0.4333",
            "Variety Show": "0.4444",
            "Acrobatics": "0.5111",
            "Handicraft": "0.4444",
            "Food": "0.3333",
            "Fashion": "0.3889",
            "Daily Life": "0.4667",
            "Travel": "0.4333",
            "Pet & Animal": "0.6000",
            "Exercise": "0.5000",
            "Multilingual": "0.3889"
        },
        "task_type": {
            "Temporal Perception": "0.4000",
            "Spatial Perception": "0.6296",
            "Attribute Perception": "0.5901",
            "Action Recognition": "0.4089",
            "Object Recognition": "0.5085",
            "OCR Problems": "0.5180",
            "Counting Problem": "0.2836",
            "Temporal Reasoning": "0.3164",
            "Spatial Reasoning": "0.6786",
            "Action Reasoning": "0.3860",
            "Object Reasoning": "0.3943",
            "Information Synopsis": "0.5882"
        }
    }
}
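
Every score in the report above is stored as a string (for example `"0.4504"`), so it needs converting before any arithmetic or comparison. A minimal sketch of that conversion follows, assuming the report has the nested layout shown above. The helper name `to_floats` is hypothetical and not part of VLMEvalKit, and the inline excerpt keeps only three values copied from the report.

```python
import json


def to_floats(node):
    """Recursively convert string scores such as "0.4504" to floats."""
    if isinstance(node, dict):
        return {key: to_floats(value) for key, value in node.items()}
    return float(node)


# Trimmed excerpt of the report above. A real run would load the full
# result file instead; where that file is written depends on your setup.
report = json.loads("""
{
    "medium": {"overall": "0.4067"},
    "long": {"overall": "0.3689"},
    "overall": {"overall": "0.4504"}
}
""")

scores = to_floats(report)
for split, result in scores.items():
    print(f"{split:>8}: {result['overall']:.4f}")
```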

When testing InternVL2-2B with subtitles:

torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-2B --verbose --nframe 16 --use-subtitle

The expected test results are:

{
    "short": {
        "overall": "0.5978",
        "domain": {
            "Knowledge": "0.5926",
            "Film & Television": "0.6583",
            "Sports Competition": "0.5867",
            "Artistic Performance": "0.6083",
            "Life Record": "0.5952",
            "Multilingual": "0.4333"
        },
        "sub_category": {
            "Humanity & History": "0.4667",
            "Literature & Art": "0.5333",
            "Biology & Medicine": "0.8000",
            "Finance & Commerce": "0.5333",
            "Astronomy": "0.5667",
            "Geography": "0.6333",
            "Law": "0.6000",
            "Life Tip": "0.6333",
            "Technology": "0.5667",
            "Animation": "0.5667",
            "Movie & TV Show": "0.6333",
            "Documentary": "0.6333",
            "News Report": "0.8000",
            "Esports": "0.5667",
            "Basketball": "0.4333",
            "Football": "0.6667",
            "Athletics": "0.6333",
            "Other Sports": "0.6333",
            "Stage Play": "0.7000",
            "Magic Show": "0.5000",
            "Variety Show": "0.7000",
            "Acrobatics": "0.5333",
            "Handicraft": "0.4000",
            "Food": "0.6667",
            "Fashion": "0.5333",
            "Daily Life": "0.6667",
            "Travel": "0.5667",
            "Pet & Animal": "0.7333",
            "Exercise": "0.6000",
            "Multilingual": "0.4333"
        },
        "task_type": {
            "Temporal Perception": "0.8333",
            "Spatial Perception": "0.6333",
            "Attribute Perception": "0.7213",
            "Action Recognition": "0.5496",
            "Object Recognition": "0.5536",
            "OCR Problems": "0.7368",
            "Counting Problem": "0.3440",
            "Temporal Reasoning": "0.3077",
            "Spatial Reasoning": "0.8148",
            "Action Reasoning": "0.7021",
            "Object Reasoning": "0.5500",
            "Information Synopsis": "0.7683"
        }
    },
    "medium": {
        "overall": "0.4367",
        "domain": {
            "Knowledge": "0.4444",
            "Film & Television": "0.4833",
            "Sports Competition": "0.3600",
            "Artistic Performance": "0.5833",
            "Life Record": "0.3714",
            "Multilingual": "0.4333"
        },
        "sub_category": {
            "Humanity & History": "0.3000",
            "Literature & Art": "0.5000",
            "Biology & Medicine": "0.5333",
            "Finance & Commerce": "0.5333",
            "Astronomy": "0.4667",
            "Geography": "0.3667",
            "Law": "0.5000",
            "Life Tip": "0.6000",
            "Technology": "0.2000",
            "Animation": "0.3000",
            "Movie & TV Show": "0.5667",
            "Documentary": "0.5333",
            "News Report": "0.5333",
            "Esports": "0.3333",
            "Basketball": "0.2333",
            "Football": "0.3667",
            "Athletics": "0.4667",
            "Other Sports": "0.4000",
            "Stage Play": "0.6667",
            "Magic Show": "0.6000",
            "Variety Show": "0.5667",
            "Acrobatics": "0.5000",
            "Handicraft": "0.5000",
            "Food": "0.2000",
            "Fashion": "0.3000",
            "Daily Life": "0.2667",
            "Travel": "0.4333",
            "Pet & Animal": "0.3333",
            "Exercise": "0.5667",
            "Multilingual": "0.4333"
        },
        "task_type": {
            "Temporal Perception": "0.3226",
            "Spatial Perception": "0.5238",
            "Attribute Perception": "0.5068",
            "Action Recognition": "0.3277",
            "Object Recognition": "0.4924",
            "OCR Problems": "0.4118",
            "Counting Problem": "0.3053",
            "Temporal Reasoning": "0.3288",
            "Spatial Reasoning": "0.6667",
            "Action Reasoning": "0.4655",
            "Object Reasoning": "0.4478",
            "Information Synopsis": "0.6538"
        }
    },
    "long": {
        "overall": "0.3856",
        "domain": {
            "Knowledge": "0.3889",
            "Film & Television": "0.3750",
            "Sports Competition": "0.3867",
            "Artistic Performance": "0.3417",
            "Life Record": "0.4048",
            "Multilingual": "0.4333"
        },
        "sub_category": {
            "Humanity & History": "0.3000",
            "Literature & Art": "0.5000",
            "Biology & Medicine": "0.4333",
            "Finance & Commerce": "0.5000",
            "Astronomy": "0.3000",
            "Geography": "0.3000",
            "Law": "0.4333",
            "Life Tip": "0.3333",
            "Technology": "0.4000",
            "Animation": "0.2333",
            "Movie & TV Show": "0.4667",
            "Documentary": "0.4333",
            "News Report": "0.3667",
            "Esports": "0.4667",
            "Basketball": "0.2667",
            "Football": "0.3000",
            "Athletics": "0.3333",
            "Other Sports": "0.5667",
            "Stage Play": "0.4000",
            "Magic Show": "0.3000",
            "Variety Show": "0.2000",
            "Acrobatics": "0.4667",
            "Handicraft": "0.5000",
            "Food": "0.2000",
            "Fashion": "0.4000",
            "Daily Life": "0.4333",
            "Travel": "0.2333",
            "Pet & Animal": "0.7000",
            "Exercise": "0.3667",
            "Multilingual": "0.4333"
        },
        "task_type": {
            "Temporal Perception": "0.0000",
            "Spatial Perception": "0.3333",
            "Attribute Perception": "0.4444",
            "Action Recognition": "0.4603",
            "Object Recognition": "0.3519",
            "OCR Problems": "0.4286",
            "Counting Problem": "0.2292",
            "Temporal Reasoning": "0.3187",
            "Spatial Reasoning": "0.5455",
            "Action Reasoning": "0.3222",
            "Object Reasoning": "0.3625",
            "Information Synopsis": "0.5460"
        }
    },
    "overall": {
        "overall": "0.4733",
        "domain": {
            "Knowledge": "0.4753",
            "Film & Television": "0.5056",
            "Sports Competition": "0.4444",
            "Artistic Performance": "0.5111",
            "Life Record": "0.4571",
            "Multilingual": "0.4333"
        },
        "sub_category": {
            "Humanity & History": "0.3556",
            "Literature & Art": "0.5111",
            "Biology & Medicine": "0.5889",
            "Finance & Commerce": "0.5222",
            "Astronomy": "0.4444",
            "Geography": "0.4333",
            "Law": "0.5111",
            "Life Tip": "0.5222",
            "Technology": "0.3889",
            "Animation": "0.3667",
            "Movie & TV Show": "0.5556",
            "Documentary": "0.5333",
            "News Report": "0.5667",
            "Esports": "0.4556",
            "Basketball": "0.3111",
            "Football": "0.4444",
            "Athletics": "0.4778",
            "Other Sports": "0.5333",
            "Stage Play": "0.5889",
            "Magic Show": "0.4667",
            "Variety Show": "0.4889",
            "Acrobatics": "0.5000",
            "Handicraft": "0.4667",
            "Food": "0.3556",
            "Fashion": "0.4111",
            "Daily Life": "0.4556",
            "Travel": "0.4111",
            "Pet & Animal": "0.5889",
            "Exercise": "0.5111",
            "Multilingual": "0.4333"
        },
        "task_type": {
            "Temporal Perception": "0.4545",
            "Spatial Perception": "0.5741",
            "Attribute Perception": "0.6171",
            "Action Recognition": "0.4473",
            "Object Recognition": "0.5000",
            "OCR Problems": "0.5468",
            "Counting Problem": "0.3097",
            "Temporal Reasoning": "0.3220",
            "Spatial Reasoning": "0.7143",
            "Action Reasoning": "0.4140",
            "Object Reasoning": "0.4207",
            "Information Synopsis": "0.6285"
        }
    }
}
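
To see how much subtitles help, the `overall` scores of the two reports above can be subtracted split by split. The sketch below does this for the `medium`, `long`, and `overall` entries, with the values copied by hand from the two reports. The variable names are illustrative only.

```python
# "overall" score per split, copied from the two reports above.
without_subs = {"medium": 0.4067, "long": 0.3689, "overall": 0.4504}
with_subs = {"medium": 0.4367, "long": 0.3856, "overall": 0.4733}

# Positive values mean the subtitle run scored higher.
gains = {
    split: round(with_subs[split] - without_subs[split], 4)
    for split in without_subs
}

for split, gain in gains.items():
    print(f"{split:>8}: {gain:+.4f}")
```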

When testing InternVL2-4B without subtitles:

torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-4B --verbose --nframe 16

The expected test results are:

{
    "short": {
        "overall": "0.6289",
        "domain": {
            "Knowledge": "0.6519",
            "Film & Television": "0.7000",
            "Sports Competition": "0.5800",
            "Artistic Performance": "0.6417",
            "Life Record": "0.6095",
            "Multilingual": "0.4667"
        },
        "sub_category": {
            "Humanity & History": "0.5333",
            "Literature & Art": "0.6000",
            "Biology & Medicine": "0.7667",
            "Finance & Commerce": "0.6000",
            "Astronomy": "0.6333",
            "Geography": "0.5667",
            "Law": "0.7333",
            "Life Tip": "0.7667",
            "Technology": "0.6667",
            "Animation": "0.6000",
            "Movie & TV Show": "0.6667",
            "Documentary": "0.6333",
            "News Report": "0.9000",
            "Esports": "0.5333",
            "Basketball": "0.4667",
            "Football": "0.6667",
            "Athletics": "0.6333",
            "Other Sports": "0.6000",
            "Stage Play": "0.8000",
            "Magic Show": "0.6000",
            "Variety Show": "0.5667",
            "Acrobatics": "0.6000",
            "Handicraft": "0.5667",
            "Food": "0.5667",
            "Fashion": "0.5333",
            "Daily Life": "0.6000",
            "Travel": "0.7000",
            "Pet & Animal": "0.7667",
            "Exercise": "0.5333",
            "Multilingual": "0.4667"
        },
        "task_type": {
            "Temporal Perception": "0.8889",
            "Spatial Perception": "0.6333",
            "Attribute Perception": "0.7459",
            "Action Recognition": "0.6183",
            "Object Recognition": "0.6369",
            "OCR Problems": "0.6140",
            "Counting Problem": "0.3200",
            "Temporal Reasoning": "0.4615",
            "Spatial Reasoning": "0.7778",
            "Action Reasoning": "0.7021",
            "Object Reasoning": "0.6250",
            "Information Synopsis": "0.8171"
        }
    },
    "medium": {
        "overall": "0.4678",
        "domain": {
            "Knowledge": "0.4704",
            "Film & Television": "0.5083",
            "Sports Competition": "0.4133",
            "Artistic Performance": "0.5333",
            "Life Record": "0.4381",
            "Multilingual": "0.5000"
        },
        "sub_category": {
            "Humanity & History": "0.2667",
            "Literature & Art": "0.6000",
            "Biology & Medicine": "0.5333",
            "Finance & Commerce": "0.5333",
            "Astronomy": "0.5000",
            "Geography": "0.4000",
            "Law": "0.5000",
            "Life Tip": "0.5333",
            "Technology": "0.3667",
            "Animation": "0.2333",
            "Movie & TV Show": "0.6333",
            "Documentary": "0.6000",
            "News Report": "0.5667",
            "Esports": "0.3667",
            "Basketball": "0.3667",
            "Football": "0.4333",
            "Athletics": "0.4333",
            "Other Sports": "0.4667",
            "Stage Play": "0.6000",
            "Magic Show": "0.4000",
            "Variety Show": "0.5000",
            "Acrobatics": "0.6333",
            "Handicraft": "0.7000",
            "Food": "0.3667",
            "Fashion": "0.3333",
            "Daily Life": "0.3000",
            "Travel": "0.4333",
            "Pet & Animal": "0.3667",
            "Exercise": "0.5667",
            "Multilingual": "0.5000"
        },
        "task_type": {
            "Temporal Perception": "0.4839",
            "Spatial Perception": "0.4762",
            "Attribute Perception": "0.5205",
            "Action Recognition": "0.3866",
            "Object Recognition": "0.5530",
            "OCR Problems": "0.4559",
            "Counting Problem": "0.3053",
            "Temporal Reasoning": "0.3014",
            "Spatial Reasoning": "0.7222",
            "Action Reasoning": "0.5172",
            "Object Reasoning": "0.4925",
            "Information Synopsis": "0.6154"
        }
    },
    "long": {
        "overall": "0.4467",
        "domain": {
            "Knowledge": "0.4815",
            "Film & Television": "0.4333",
            "Sports Competition": "0.4267",
            "Artistic Performance": "0.4250",
            "Life Record": "0.4333",
            "Multilingual": "0.4667"
        },
        "sub_category": {
            "Humanity & History": "0.3333",
            "Literature & Art": "0.5000",
            "Biology & Medicine": "0.5000",
            "Finance & Commerce": "0.5333",
            "Astronomy": "0.5333",
            "Geography": "0.3333",
            "Law": "0.5000",
            "Life Tip": "0.5333",
            "Technology": "0.5667",
            "Animation": "0.2667",
            "Movie & TV Show": "0.5333",
            "Documentary": "0.5000",
            "News Report": "0.4333",
            "Esports": "0.4667",
            "Basketball": "0.4000",
            "Football": "0.4333",
            "Athletics": "0.3667",
            "Other Sports": "0.4667",
            "Stage Play": "0.6000",
            "Magic Show": "0.4333",
            "Variety Show": "0.2667",
            "Acrobatics": "0.4000",
            "Handicraft": "0.5000",
            "Food": "0.3000",
            "Fashion": "0.4667",
            "Daily Life": "0.3000",
            "Travel": "0.3000",
            "Pet & Animal": "0.6667",
            "Exercise": "0.5000",
            "Multilingual": "0.4667"
        },
        "task_type": {
            "Temporal Perception": "0.5000",
            "Spatial Perception": "0.6667",
            "Attribute Perception": "0.5185",
            "Action Recognition": "0.3810",
            "Object Recognition": "0.4815",
            "OCR Problems": "0.3571",
            "Counting Problem": "0.2708",
            "Temporal Reasoning": "0.2637",
            "Spatial Reasoning": "0.5455",
            "Action Reasoning": "0.4556",
            "Object Reasoning": "0.4500",
            "Information Synopsis": "0.5828"
        }
    },
    "overall": {
        "overall": "0.5144",
        "domain": {
            "Knowledge": "0.5346",
            "Film & Television": "0.5472",
            "Sports Competition": "0.4733",
            "Artistic Performance": "0.5333",
            "Life Record": "0.4937",
            "Multilingual": "0.4778"
        },
        "sub_category": {
            "Humanity & History": "0.3778",
            "Literature & Art": "0.5667",
            "Biology & Medicine": "0.6000",
            "Finance & Commerce": "0.5556",
            "Astronomy": "0.5556",
            "Geography": "0.4333",
            "Law": "0.5778",
            "Life Tip": "0.6111",
            "Technology": "0.5333",
            "Animation": "0.3667",
            "Movie & TV Show": "0.6111",
            "Documentary": "0.5778",
            "News Report": "0.6333",
            "Esports": "0.4556",
            "Basketball": "0.4111",
            "Football": "0.5111",
            "Athletics": "0.4778",
            "Other Sports": "0.5111",
            "Stage Play": "0.6667",
            "Magic Show": "0.4778",
            "Variety Show": "0.4444",
            "Acrobatics": "0.5444",
            "Handicraft": "0.5889",
            "Food": "0.4111",
            "Fashion": "0.4444",
            "Daily Life": "0.4000",
            "Travel": "0.4778",
            "Pet & Animal": "0.6000",
            "Exercise": "0.5333",
            "Multilingual": "0.4778"
        },
        "task_type": {
            "Temporal Perception": "0.6182",
            "Spatial Perception": "0.5741",
            "Attribute Perception": "0.6441",
            "Action Recognition": "0.4824",
            "Object Recognition": "0.5819",
            "OCR Problems": "0.5108",
            "Counting Problem": "0.3060",
            "Temporal Reasoning": "0.2938",
            "Spatial Reasoning": "0.7143",
            "Action Reasoning": "0.5088",
            "Object Reasoning": "0.4934",
            "Information Synopsis": "0.6502"
        }
    }
}
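
A quick sanity check on a reproduced report is to compare the top-level `overall` score with the mean of the three duration splits. This sketch assumes the short, medium, and long splits contain the same number of questions; the numbers in the report above are consistent with that, but the assumption is not stated in this document.

```python
import math

# "overall" score per duration split, copied from the report above.
splits = {"short": 0.6289, "medium": 0.4678, "long": 0.4467}
reported_overall = 0.5144

# With equally sized splits, the overall accuracy is the plain mean.
mean_of_splits = sum(splits.values()) / len(splits)

# Every input is rounded to four decimals, so allow a gap of up to 1e-4.
consistent = math.isclose(mean_of_splits, reported_overall, abs_tol=1e-4)

print(f"mean of splits: {mean_of_splits:.6f}")
print(f"reported:       {reported_overall:.6f}")
print(f"consistent:     {consistent}")
```

If the check fails on your own results, the report may be truncated or may mix results from different runs.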

When testing InternVL2-4B with subtitles:

torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-4B --verbose --nframe 16 --use-subtitle

The expected test results are:

{
    "short": {
        "overall": "0.6511",
        "domain": {
            "Knowledge": "0.6852",
            "Film & Television": "0.7083",
            "Sports Competition": "0.5933",
            "Artistic Performance": "0.6750",
            "Life Record": "0.6286",
            "Multilingual": "0.4667"
        },
        "sub_category": {
            "Humanity & History": "0.5333",
            "Literature & Art": "0.5667",
            "Biology & Medicine": "0.8333",
            "Finance & Commerce": "0.6667",
            "Astronomy": "0.7000",
            "Geography": "0.6333",
            "Law": "0.7667",
            "Life Tip": "0.7667",
            "Technology": "0.7000",
            "Animation": "0.4667",
            "Movie & TV Show": "0.7333",
            "Documentary": "0.7000",
            "News Report": "0.9333",
            "Esports": "0.5000",
            "Basketball": "0.5000",
            "Football": "0.6333",
            "Athletics": "0.7000",
            "Other Sports": "0.6333",
            "Stage Play": "0.7667",
            "Magic Show": "0.7000",
            "Variety Show": "0.5667",
            "Acrobatics": "0.6667",
            "Handicraft": "0.6333",
            "Food": "0.6000",
            "Fashion": "0.5333",
            "Daily Life": "0.6667",
            "Travel": "0.7000",
            "Pet & Animal": "0.7333",
            "Exercise": "0.5333",
            "Multilingual": "0.4667"
        },
        "task_type": {
            "Temporal Perception": "0.8333",
            "Spatial Perception": "0.6667",
            "Attribute Perception": "0.7787",
            "Action Recognition": "0.6260",
            "Object Recognition": "0.6429",
            "OCR Problems": "0.6667",
            "Counting Problem": "0.3360",
            "Temporal Reasoning": "0.6154",
            "Spatial Reasoning": "0.8148",
            "Action Reasoning": "0.7234",
            "Object Reasoning": "0.6375",
            "Information Synopsis": "0.8659"
        }
    },
    "medium": {
        "overall": "0.4878",
        "domain": {
            "Knowledge": "0.5148",
            "Film & Television": "0.5417",
            "Sports Competition": "0.4067",
            "Artistic Performance": "0.5417",
            "Life Record": "0.4619",
            "Multilingual": "0.4000"
        },
        "sub_category": {
            "Humanity & History": "0.3667",
            "Literature & Art": "0.5667",
            "Biology & Medicine": "0.5667",
            "Finance & Commerce": "0.5667",
            "Astronomy": "0.7000",
            "Geography": "0.3667",
            "Law": "0.6000",
            "Life Tip": "0.4667",
            "Technology": "0.4333",
            "Animation": "0.2667",
            "Movie & TV Show": "0.6667",
            "Documentary": "0.5667",
            "News Report": "0.6667",
            "Esports": "0.4667",
            "Basketball": "0.2333",
            "Football": "0.4333",
            "Athletics": "0.4333",
            "Other Sports": "0.4667",
            "Stage Play": "0.6333",
            "Magic Show": "0.4333",
            "Variety Show": "0.5000",
            "Acrobatics": "0.6000",
            "Handicraft": "0.7000",
            "Food": "0.3333",
            "Fashion": "0.3667",
            "Daily Life": "0.3667",
            "Travel": "0.5000",
            "Pet & Animal": "0.4000",
            "Exercise": "0.5667",
            "Multilingual": "0.4000"
        },
        "task_type": {
            "Temporal Perception": "0.4194",
            "Spatial Perception": "0.4286",
            "Attribute Perception": "0.5479",
            "Action Recognition": "0.3950",
            "Object Recognition": "0.5606",
            "OCR Problems": "0.4559",
            "Counting Problem": "0.3474",
            "Temporal Reasoning": "0.2877",
            "Spatial Reasoning": "0.8333",
            "Action Reasoning": "0.4655",
            "Object Reasoning": "0.5522",
            "Information Synopsis": "0.7051"
        }
    },
    "long": {
        "overall": "0.4622",
        "domain": {
            "Knowledge": "0.4889",
            "Film & Television": "0.4750",
            "Sports Competition": "0.4267",
            "Artistic Performance": "0.4500",
            "Life Record": "0.4476",
            "Multilingual": "0.5000"
        },
        "sub_category": {
            "Humanity & History": "0.2667",
            "Literature & Art": "0.5667",
            "Biology & Medicine": "0.5333",
            "Finance & Commerce": "0.6000",
            "Astronomy": "0.5333",
            "Geography": "0.3333",
            "Law": "0.6000",
            "Life Tip": "0.5000",
            "Technology": "0.4667",
            "Animation": "0.3333",
            "Movie & TV Show": "0.5000",
            "Documentary": "0.6000",
            "News Report": "0.4667",
            "Esports": "0.5000",
            "Basketball": "0.4000",
            "Football": "0.5333",
            "Athletics": "0.3000",
            "Other Sports": "0.4000",
            "Stage Play": "0.7333",
            "Magic Show": "0.4333",
            "Variety Show": "0.2333",
            "Acrobatics": "0.4000",
            "Handicraft": "0.5667",
            "Food": "0.2667",
            "Fashion": "0.4667",
            "Daily Life": "0.3333",
            "Travel": "0.3000",
            "Pet & Animal": "0.7000",
            "Exercise": "0.5000",
            "Multilingual": "0.5000"
        },
        "task_type": {
            "Temporal Perception": "0.3333",
            "Spatial Perception": "0.3333",
            "Attribute Perception": "0.5185",
            "Action Recognition": "0.4444",
            "Object Recognition": "0.4815",
            "OCR Problems": "0.2857",
            "Counting Problem": "0.2708",
            "Temporal Reasoning": "0.2418",
            "Spatial Reasoning": "0.5455",
            "Action Reasoning": "0.4444",
            "Object Reasoning": "0.4708",
            "Information Synopsis": "0.6564"
        }
    },
    "overall": {
        "overall": "0.5337",
        "domain": {
            "Knowledge": "0.5630",
            "Film & Television": "0.5750",
            "Sports Competition": "0.4756",
            "Artistic Performance": "0.5556",
            "Life Record": "0.5127",
            "Multilingual": "0.4556"
        },
        "sub_category": {
            "Humanity & History": "0.3889",
            "Literature & Art": "0.5667",
            "Biology & Medicine": "0.6444",
            "Finance & Commerce": "0.6111",
            "Astronomy": "0.6444",
            "Geography": "0.4444",
            "Law": "0.6556",
            "Life Tip": "0.5778",
            "Technology": "0.5333",
            "Animation": "0.3556",
            "Movie & TV Show": "0.6333",
            "Documentary": "0.6222",
            "News Report": "0.6889",
            "Esports": "0.4889",
            "Basketball": "0.3778",
            "Football": "0.5333",
            "Athletics": "0.4778",
            "Other Sports": "0.5000",
            "Stage Play": "0.7111",
            "Magic Show": "0.5222",
            "Variety Show": "0.4333",
            "Acrobatics": "0.5556",
            "Handicraft": "0.6333",
            "Food": "0.4000",
            "Fashion": "0.4556",
            "Daily Life": "0.4556",
            "Travel": "0.5000",
            "Pet & Animal": "0.6111",
            "Exercise": "0.5333",
            "Multilingual": "0.4556"
        },
        "task_type": {
            "Temporal Perception": "0.5455",
            "Spatial Perception": "0.5556",
            "Attribute Perception": "0.6712",
            "Action Recognition": "0.5016",
            "Object Recognition": "0.5876",
            "OCR Problems": "0.5252",
            "Counting Problem": "0.3284",
            "Temporal Reasoning": "0.2881",
            "Spatial Reasoning": "0.7679",
            "Action Reasoning": "0.4947",
            "Object Reasoning": "0.5242",
            "Information Synopsis": "0.7214"
        }
    }
}
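
Because small deviations between runs are expected, it can be more practical to compare a reproduced report with the expected one under a tolerance than to diff the files. The following is a minimal sketch of such a comparison. The helpers `flatten` and `find_mismatches` and the `0.01` tolerance are assumptions rather than part of VLMEvalKit, and the `reproduced` values are placeholders, not measured results.

```python
def flatten(node, prefix=""):
    """Flatten a nested report into {"short/overall": 0.6511, ...}."""
    flat = {}
    for key, value in node.items():
        path = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = float(value)
    return flat


def find_mismatches(expected, actual, tolerance=0.01):
    """Return entries that are missing or differ by more than `tolerance`."""
    exp, act = flatten(expected), flatten(actual)
    return {
        path: (score, act.get(path))
        for path, score in exp.items()
        if path not in act or abs(act[path] - score) > tolerance
    }


# Expected "overall" scores, copied from the report above.
expected = {
    "short": {"overall": "0.6511"},
    "medium": {"overall": "0.4878"},
    "long": {"overall": "0.4622"},
    "overall": {"overall": "0.5337"},
}

# Placeholder values standing in for your own run (not real measurements).
reproduced = {
    "short": {"overall": "0.6500"},
    "medium": {"overall": "0.4878"},
    "long": {"overall": "0.4400"},
    "overall": {"overall": "0.5259"},
}

mismatches = find_mismatches(expected, reproduced)
for path, (want, got) in mismatches.items():
    print(f"{path}: expected {want:.4f}, got {got}")
```

The same functions work on the full reports, since `flatten` walks the `domain`, `sub_category`, and `task_type` levels as well.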

When testing InternVL2-8B without subtitles:

torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-8B --verbose --nframe 16

The expected test results are:

{
    "short": {
        "overall": "0.6567",
        "domain": {
            "Knowledge": "0.6704",
            "Film & Television": "0.7083",
            "Sports Competition": "0.5933",
            "Artistic Performance": "0.7000",
            "Life Record": "0.6619",
            "Multilingual": "0.4333"
        },
        "sub_category": {
            "Humanity & History": "0.6000",
            "Literature & Art": "0.6000",
            "Biology & Medicine": "0.7667",
            "Finance & Commerce": "0.7000",
            "Astronomy": "0.6000",
            "Geography": "0.7000",
            "Law": "0.7000",
            "Life Tip": "0.7000",
            "Technology": "0.6667",
            "Animation": "0.8000",
            "Movie & TV Show": "0.6000",
            "Documentary": "0.6333",
            "News Report": "0.8000",
            "Esports": "0.5333",
            "Basketball": "0.3667",
            "Football": "0.7000",
            "Athletics": "0.7333",
            "Other Sports": "0.6333",
            "Stage Play": "0.8333",
            "Magic Show": "0.6667",
            "Variety Show": "0.6333",
            "Acrobatics": "0.6667",
            "Handicraft": "0.7000",
            "Food": "0.6667",
            "Fashion": "0.5333",
            "Daily Life": "0.6667",
            "Travel": "0.7667",
            "Pet & Animal": "0.7667",
            "Exercise": "0.5333",
            "Multilingual": "0.4333"
        },
        "task_type": {
            "Temporal Perception": "0.7222",
            "Spatial Perception": "0.7667",
            "Attribute Perception": "0.7623",
            "Action Recognition": "0.5954",
            "Object Recognition": "0.6845",
            "OCR Problems": "0.7719",
            "Counting Problem": "0.4080",
            "Temporal Reasoning": "0.6154",
            "Spatial Reasoning": "0.8148",
            "Action Reasoning": "0.6596",
            "Object Reasoning": "0.6250",
            "Information Synopsis": "0.7683"
        }
    },
    "medium": {
        "overall": "0.5044",
        "domain": {
            "Knowledge": "0.5148",
            "Film & Television": "0.5750",
            "Sports Competition": "0.4533",
            "Artistic Performance": "0.5917",
            "Life Record": "0.4429",
            "Multilingual": "0.4667"
        },
        "sub_category": {
            "Humanity & History": "0.4333",
            "Literature & Art": "0.6333",
            "Biology & Medicine": "0.5667",
            "Finance & Commerce": "0.6000",
            "Astronomy": "0.4333",
            "Geography": "0.3333",
            "Law": "0.5667",
            "Life Tip": "0.6333",
            "Technology": "0.4333",
            "Animation": "0.4000",
            "Movie & TV Show": "0.6667",
            "Documentary": "0.5667",
            "News Report": "0.6667",
            "Esports": "0.5667",
            "Basketball": "0.2667",
            "Football": "0.4667",
            "Athletics": "0.4333",
            "Other Sports": "0.5333",
            "Stage Play": "0.8000",
            "Magic Show": "0.4667",
            "Variety Show": "0.5667",
            "Acrobatics": "0.5333",
            "Handicraft": "0.5667",
            "Food": "0.4000",
            "Fashion": "0.5000",
            "Daily Life": "0.3333",
            "Travel": "0.4333",
            "Pet & Animal": "0.3667",
            "Exercise": "0.5000",
            "Multilingual": "0.4667"
        },
        "task_type": {
            "Temporal Perception": "0.4516",
            "Spatial Perception": "0.5714",
            "Attribute Perception": "0.4932",
            "Action Recognition": "0.3782",
            "Object Recognition": "0.6212",
            "OCR Problems": "0.4706",
            "Counting Problem": "0.3053",
            "Temporal Reasoning": "0.3836",
            "Spatial Reasoning": "0.6111",
            "Action Reasoning": "0.5172",
            "Object Reasoning": "0.5970",
            "Information Synopsis": "0.7051"
        }
    },
    "long": {
        "overall": "0.4589",
        "domain": {
            "Knowledge": "0.5037",
            "Film & Television": "0.4500",
            "Sports Competition": "0.4733",
            "Artistic Performance": "0.4417",
            "Life Record": "0.4048",
            "Multilingual": "0.4667"
        },
        "sub_category": {
            "Humanity & History": "0.5333",
            "Literature & Art": "0.5000",
            "Biology & Medicine": "0.6000",
            "Finance & Commerce": "0.5000",
            "Astronomy": "0.5000",
            "Geography": "0.3667",
            "Law": "0.5333",
            "Life Tip": "0.5667",
            "Technology": "0.4333",
            "Animation": "0.2667",
            "Movie & TV Show": "0.5667",
            "Documentary": "0.4667",
            "News Report": "0.5000",
            "Esports": "0.5000",
            "Basketball": "0.3667",
            "Football": "0.5000",
            "Athletics": "0.5000",
            "Other Sports": "0.5000",
            "Stage Play": "0.6333",
            "Magic Show": "0.3333",
            "Variety Show": "0.3000",
            "Acrobatics": "0.5000",
            "Handicraft": "0.4667",
            "Food": "0.2667",
            "Fashion": "0.4000",
            "Daily Life": "0.3333",
            "Travel": "0.3667",
            "Pet & Animal": "0.6333",
            "Exercise": "0.3667",
            "Multilingual": "0.4667"
        },
        "task_type": {
            "Temporal Perception": "0.1667",
            "Spatial Perception": "0.0000",
            "Attribute Perception": "0.6296",
            "Action Recognition": "0.4127",
            "Object Recognition": "0.5000",
            "OCR Problems": "0.5000",
            "Counting Problem": "0.3542",
            "Temporal Reasoning": "0.3297",
            "Spatial Reasoning": "0.6364",
            "Action Reasoning": "0.4000",
            "Object Reasoning": "0.4625",
            "Information Synopsis": "0.6012"
        }
    },
    "overall": {
        "overall": "0.5400",
        "domain": {
            "Knowledge": "0.5630",
            "Film & Television": "0.5778",
            "Sports Competition": "0.5067",
            "Artistic Performance": "0.5778",
            "Life Record": "0.5032",
            "Multilingual": "0.4556"
        },
        "sub_category": {
            "Humanity & History": "0.5222",
            "Literature & Art": "0.5778",
            "Biology & Medicine": "0.6444",
            "Finance & Commerce": "0.6000",
            "Astronomy": "0.5111",
            "Geography": "0.4667",
            "Law": "0.6000",
            "Life Tip": "0.6333",
            "Technology": "0.5111",
            "Animation": "0.4889",
            "Movie & TV Show": "0.6111",
            "Documentary": "0.5556",
            "News Report": "0.6556",
            "Esports": "0.5333",
            "Basketball": "0.3333",
            "Football": "0.5556",
            "Athletics": "0.5556",
            "Other Sports": "0.5556",
            "Stage Play": "0.7556",
            "Magic Show": "0.4889",
            "Variety Show": "0.5000",
            "Acrobatics": "0.5667",
            "Handicraft": "0.5778",
            "Food": "0.4444",
            "Fashion": "0.4778",
            "Daily Life": "0.4444",
            "Travel": "0.5222",
            "Pet & Animal": "0.5889",
            "Exercise": "0.4667",
            "Multilingual": "0.4556"
        },
        "task_type": {
            "Temporal Perception": "0.5091",
            "Spatial Perception": "0.6481",
            "Attribute Perception": "0.6577",
            "Action Recognition": "0.4760",
            "Object Recognition": "0.6328",
            "OCR Problems": "0.5971",
            "Counting Problem": "0.3619",
            "Temporal Reasoning": "0.3729",
            "Spatial Reasoning": "0.7143",
            "Action Reasoning": "0.4667",
            "Object Reasoning": "0.5308",
            "Information Synopsis": "0.6687"
        }
    }
}
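
Each `domain` score appears to be the unweighted mean of its sub-category scores. The sketch below checks this for the `Knowledge` domain in the `overall` block above. Treating the first nine sub-categories as `Knowledge` is an assumption inferred from the key order; the result file does not state the grouping.

```python
from statistics import mean

# Values copied from the "overall" block above (InternVL2-8B, no subtitles).
# Scores are stored as strings in the result file, so convert before averaging.
knowledge_sub_scores = {
    "Humanity & History": "0.5222",
    "Literature & Art": "0.5778",
    "Biology & Medicine": "0.6444",
    "Finance & Commerce": "0.6000",
    "Astronomy": "0.5111",
    "Geography": "0.4667",
    "Law": "0.6000",
    "Life Tip": "0.6333",
    "Technology": "0.5111",
}
reported_knowledge = float("0.5630")

# Assumption: every sub-category holds the same number of questions,
# so a plain mean is the right aggregate.
recomputed = mean(float(v) for v in knowledge_sub_scores.values())
gap = abs(recomputed - reported_knowledge)

print(f"recomputed={recomputed:.4f} reported={reported_knowledge:.4f} gap={gap:.5f}")
```

A gap on the order of the four-decimal rounding error suggests the grouping assumption holds for this block; a larger gap would point to unequal sub-category sizes or a different grouping.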

When testing InternVL2-8B with subtitles:

torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-8B --verbose --nframe 16 --use-subtitle

The expected test results are:

{
    "short": {
        "overall": "0.6900",
        "domain": {
            "Knowledge": "0.7148",
            "Film & Television": "0.7500",
            "Sports Competition": "0.5933",
            "Artistic Performance": "0.7250",
            "Life Record": "0.7000",
            "Multilingual": "0.5000"
        },
        "sub_category": {
            "Humanity & History": "0.5667",
            "Literature & Art": "0.6333",
            "Biology & Medicine": "0.8333",
            "Finance & Commerce": "0.8333",
            "Astronomy": "0.6667",
            "Geography": "0.7000",
            "Law": "0.7000",
            "Life Tip": "0.8000",
            "Technology": "0.7000",
            "Animation": "0.7667",
            "Movie & TV Show": "0.6667",
            "Documentary": "0.6667",
            "News Report": "0.9000",
            "Esports": "0.5333",
            "Basketball": "0.4000",
            "Football": "0.6333",
            "Athletics": "0.7667",
            "Other Sports": "0.6333",
            "Stage Play": "0.8000",
            "Magic Show": "0.6667",
            "Variety Show": "0.7667",
            "Acrobatics": "0.6667",
            "Handicraft": "0.6667",
            "Food": "0.7000",
            "Fashion": "0.5667",
            "Daily Life": "0.7000",
            "Travel": "0.8333",
            "Pet & Animal": "0.8333",
            "Exercise": "0.6000",
            "Multilingual": "0.5000"
        },
        "task_type": {
            "Temporal Perception": "0.6667",
            "Spatial Perception": "0.7667",
            "Attribute Perception": "0.7951",
            "Action Recognition": "0.6412",
            "Object Recognition": "0.6964",
            "OCR Problems": "0.7895",
            "Counting Problem": "0.4240",
            "Temporal Reasoning": "0.6923",
            "Spatial Reasoning": "0.8519",
            "Action Reasoning": "0.7021",
            "Object Reasoning": "0.6875",
            "Information Synopsis": "0.8537"
        }
    },
    "medium": {
        "overall": "0.5256",
        "domain": {
            "Knowledge": "0.5593",
            "Film & Television": "0.6167",
            "Sports Competition": "0.4400",
            "Artistic Performance": "0.6167",
            "Life Record": "0.4429",
            "Multilingual": "0.5000"
        },
        "sub_category": {
            "Humanity & History": "0.4667",
            "Literature & Art": "0.6000",
            "Biology & Medicine": "0.5667",
            "Finance & Commerce": "0.6667",
            "Astronomy": "0.5667",
            "Geography": "0.4667",
            "Law": "0.5667",
            "Life Tip": "0.6667",
            "Technology": "0.4667",
            "Animation": "0.3667",
            "Movie & TV Show": "0.6667",
            "Documentary": "0.6667",
            "News Report": "0.7667",
            "Esports": "0.5667",
            "Basketball": "0.2667",
            "Football": "0.4333",
            "Athletics": "0.4333",
            "Other Sports": "0.5000",
            "Stage Play": "0.8333",
            "Magic Show": "0.5333",
            "Variety Show": "0.5667",
            "Acrobatics": "0.5333",
            "Handicraft": "0.5667",
            "Food": "0.3667",
            "Fashion": "0.4000",
            "Daily Life": "0.4333",
            "Travel": "0.4333",
            "Pet & Animal": "0.4000",
            "Exercise": "0.5000",
            "Multilingual": "0.5000"
        },
        "task_type": {
            "Temporal Perception": "0.4516",
            "Spatial Perception": "0.5238",
            "Attribute Perception": "0.5068",
            "Action Recognition": "0.4034",
            "Object Recognition": "0.6515",
            "OCR Problems": "0.4118",
            "Counting Problem": "0.3053",
            "Temporal Reasoning": "0.3973",
            "Spatial Reasoning": "0.7778",
            "Action Reasoning": "0.5517",
            "Object Reasoning": "0.6194",
            "Information Synopsis": "0.7949"
        }
    },
    "long": {
        "overall": "0.4922",
        "domain": {
            "Knowledge": "0.5667",
            "Film & Television": "0.4917",
            "Sports Competition": "0.4800",
            "Artistic Performance": "0.4583",
            "Life Record": "0.4381",
            "Multilingual": "0.4000"
        },
        "sub_category": {
            "Humanity & History": "0.5667",
            "Literature & Art": "0.5667",
            "Biology & Medicine": "0.7333",
            "Finance & Commerce": "0.5333",
            "Astronomy": "0.5667",
            "Geography": "0.4000",
            "Law": "0.6667",
            "Life Tip": "0.6000",
            "Technology": "0.4667",
            "Animation": "0.3333",
            "Movie & TV Show": "0.5000",
            "Documentary": "0.6000",
            "News Report": "0.5333",
            "Esports": "0.4333",
            "Basketball": "0.4000",
            "Football": "0.5333",
            "Athletics": "0.4667",
            "Other Sports": "0.5667",
            "Stage Play": "0.7333",
            "Magic Show": "0.3333",
            "Variety Show": "0.3000",
            "Acrobatics": "0.4667",
            "Handicraft": "0.5667",
            "Food": "0.3333",
            "Fashion": "0.4333",
            "Daily Life": "0.2667",
            "Travel": "0.3667",
            "Pet & Animal": "0.7333",
            "Exercise": "0.3667",
            "Multilingual": "0.4000"
        },
        "task_type": {
            "Temporal Perception": "0.1667",
            "Spatial Perception": "0.0000",
            "Attribute Perception": "0.7037",
            "Action Recognition": "0.4286",
            "Object Recognition": "0.5000",
            "OCR Problems": "0.5714",
            "Counting Problem": "0.2917",
            "Temporal Reasoning": "0.3077",
            "Spatial Reasoning": "0.7273",
            "Action Reasoning": "0.4278",
            "Object Reasoning": "0.4917",
            "Information Synopsis": "0.7117"
        }
    },
    "overall": {
        "overall": "0.5693",
        "domain": {
            "Knowledge": "0.6136",
            "Film & Television": "0.6194",
            "Sports Competition": "0.5044",
            "Artistic Performance": "0.6000",
            "Life Record": "0.5270",
            "Multilingual": "0.4667"
        },
        "sub_category": {
            "Humanity & History": "0.5333",
            "Literature & Art": "0.6000",
            "Biology & Medicine": "0.7111",
            "Finance & Commerce": "0.6778",
            "Astronomy": "0.6000",
            "Geography": "0.5222",
            "Law": "0.6444",
            "Life Tip": "0.6889",
            "Technology": "0.5444",
            "Animation": "0.4889",
            "Movie & TV Show": "0.6111",
            "Documentary": "0.6444",
            "News Report": "0.7333",
            "Esports": "0.5111",
            "Basketball": "0.3556",
            "Football": "0.5333",
            "Athletics": "0.5556",
            "Other Sports": "0.5667",
            "Stage Play": "0.7889",
            "Magic Show": "0.5111",
            "Variety Show": "0.5444",
            "Acrobatics": "0.5556",
            "Handicraft": "0.6000",
            "Food": "0.4667",
            "Fashion": "0.4667",
            "Daily Life": "0.4667",
            "Travel": "0.5444",
            "Pet & Animal": "0.6556",
            "Exercise": "0.4889",
            "Multilingual": "0.4667"
        },
        "task_type": {
            "Temporal Perception": "0.4909",
            "Spatial Perception": "0.6296",
            "Attribute Perception": "0.6892",
            "Action Recognition": "0.5080",
            "Object Recognition": "0.6497",
            "OCR Problems": "0.5827",
            "Counting Problem": "0.3582",
            "Temporal Reasoning": "0.3729",
            "Spatial Reasoning": "0.8036",
            "Action Reasoning": "0.4982",
            "Object Reasoning": "0.5639",
            "Information Synopsis": "0.7678"
        }
    }
}
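
As a quick sanity check when reproducing these numbers, the top-level `overall` accuracy can be compared against the mean of the three duration buckets. This is a minimal sketch that assumes the `short`, `medium`, and `long` buckets contain the same number of questions; the values are copied from the block above.

```python
# Per-duration "overall" accuracies copied from the block above
# (InternVL2-8B, 16 frames, with subtitles).
duration_overall = {"short": "0.6900", "medium": "0.5256", "long": "0.4922"}
reported_overall = float("0.5693")


def mean_accuracy(scores):
    """Average string-encoded accuracies; assumes equally sized buckets."""
    values = [float(s) for s in scores.values()]
    return sum(values) / len(values)


recomputed_overall = mean_accuracy(duration_overall)

# Each input is rounded to four decimals, so allow a small tolerance.
is_consistent = abs(recomputed_overall - reported_overall) < 1e-4
print(f"{recomputed_overall:.4f}", is_consistent)
```

If `is_consistent` is false for a reproduced run, the result file is likely incomplete or was produced on a different subset of the data.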

When testing InternVL2-26B without subtitles:

torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-26B --verbose --nframe 16

The expected test results are:

{
    "short": {
        "overall": "0.6667",
        "domain": {
            "Knowledge": "0.6741",
            "Film & Television": "0.7333",
            "Sports Competition": "0.6133",
            "Artistic Performance": "0.6750",
            "Life Record": "0.6762",
            "Multilingual": "0.5000"
        },
        "sub_category": {
            "Humanity & History": "0.4000",
            "Literature & Art": "0.5667",
            "Biology & Medicine": "0.8667",
            "Finance & Commerce": "0.7000",
            "Astronomy": "0.6667",
            "Geography": "0.6333",
            "Law": "0.8000",
            "Life Tip": "0.8000",
            "Technology": "0.6333",
            "Animation": "0.8000",
            "Movie & TV Show": "0.7000",
            "Documentary": "0.5667",
            "News Report": "0.8667",
            "Esports": "0.5333",
            "Basketball": "0.4667",
            "Football": "0.6333",
            "Athletics": "0.7667",
            "Other Sports": "0.6667",
            "Stage Play": "0.8667",
            "Magic Show": "0.5333",
            "Variety Show": "0.6333",
            "Acrobatics": "0.6667",
            "Handicraft": "0.7000",
            "Food": "0.7667",
            "Fashion": "0.6667",
            "Daily Life": "0.6667",
            "Travel": "0.7667",
            "Pet & Animal": "0.7333",
            "Exercise": "0.4333",
            "Multilingual": "0.5000"
        },
        "task_type": {
            "Temporal Perception": "0.8333",
            "Spatial Perception": "0.7333",
            "Attribute Perception": "0.7541",
            "Action Recognition": "0.6489",
            "Object Recognition": "0.6548",
            "OCR Problems": "0.7719",
            "Counting Problem": "0.4080",
            "Temporal Reasoning": "0.6154",
            "Spatial Reasoning": "0.7778",
            "Action Reasoning": "0.7234",
            "Object Reasoning": "0.6500",
            "Information Synopsis": "0.8049"
        }
    },
    "medium": {
        "overall": "0.5200",
        "domain": {
            "Knowledge": "0.5481",
            "Film & Television": "0.5833",
            "Sports Competition": "0.4267",
            "Artistic Performance": "0.6167",
            "Life Record": "0.4524",
            "Multilingual": "0.5667"
        },
        "sub_category": {
            "Humanity & History": "0.4000",
            "Literature & Art": "0.6000",
            "Biology & Medicine": "0.6667",
            "Finance & Commerce": "0.5667",
            "Astronomy": "0.5333",
            "Geography": "0.4667",
            "Law": "0.6667",
            "Life Tip": "0.5333",
            "Technology": "0.5000",
            "Animation": "0.3667",
            "Movie & TV Show": "0.6000",
            "Documentary": "0.7000",
            "News Report": "0.6667",
            "Esports": "0.4667",
            "Basketball": "0.3000",
            "Football": "0.5000",
            "Athletics": "0.3667",
            "Other Sports": "0.5000",
            "Stage Play": "0.6667",
            "Magic Show": "0.6333",
            "Variety Show": "0.6000",
            "Acrobatics": "0.5667",
            "Handicraft": "0.6667",
            "Food": "0.3000",
            "Fashion": "0.4000",
            "Daily Life": "0.4000",
            "Travel": "0.5333",
            "Pet & Animal": "0.4667",
            "Exercise": "0.4000",
            "Multilingual": "0.5667"
        },
        "task_type": {
            "Temporal Perception": "0.4839",
            "Spatial Perception": "0.5238",
            "Attribute Perception": "0.5890",
            "Action Recognition": "0.4454",
            "Object Recognition": "0.6364",
            "OCR Problems": "0.4412",
            "Counting Problem": "0.3474",
            "Temporal Reasoning": "0.3836",
            "Spatial Reasoning": "0.7222",
            "Action Reasoning": "0.4655",
            "Object Reasoning": "0.5448",
            "Information Synopsis": "0.7436"
        }
    },
    "long": {
        "overall": "0.4578",
        "domain": {
            "Knowledge": "0.4815",
            "Film & Television": "0.4583",
            "Sports Competition": "0.4200",
            "Artistic Performance": "0.4167",
            "Life Record": "0.4857",
            "Multilingual": "0.4000"
        },
        "sub_category": {
            "Humanity & History": "0.5000",
            "Literature & Art": "0.5667",
            "Biology & Medicine": "0.5333",
            "Finance & Commerce": "0.6000",
            "Astronomy": "0.4667",
            "Geography": "0.3000",
            "Law": "0.5000",
            "Life Tip": "0.4667",
            "Technology": "0.4000",
            "Animation": "0.3667",
            "Movie & TV Show": "0.4667",
            "Documentary": "0.5000",
            "News Report": "0.5000",
            "Esports": "0.4667",
            "Basketball": "0.4000",
            "Football": "0.4667",
            "Athletics": "0.4000",
            "Other Sports": "0.3667",
            "Stage Play": "0.5667",
            "Magic Show": "0.4667",
            "Variety Show": "0.1333",
            "Acrobatics": "0.5000",
            "Handicraft": "0.6333",
            "Food": "0.4333",
            "Fashion": "0.3667",
            "Daily Life": "0.5333",
            "Travel": "0.3667",
            "Pet & Animal": "0.6667",
            "Exercise": "0.4000",
            "Multilingual": "0.4000"
        },
        "task_type": {
            "Temporal Perception": "0.0000",
            "Spatial Perception": "0.3333",
            "Attribute Perception": "0.5926",
            "Action Recognition": "0.3968",
            "Object Recognition": "0.5741",
            "OCR Problems": "0.5000",
            "Counting Problem": "0.2917",
            "Temporal Reasoning": "0.2967",
            "Spatial Reasoning": "0.6364",
            "Action Reasoning": "0.4111",
            "Object Reasoning": "0.4583",
            "Information Synopsis": "0.6135"
        }
    },
    "overall": {
        "overall": "0.5481",
        "domain": {
            "Knowledge": "0.5679",
            "Film & Television": "0.5917",
            "Sports Competition": "0.4867",
            "Artistic Performance": "0.5694",
            "Life Record": "0.5381",
            "Multilingual": "0.4889"
        },
        "sub_category": {
            "Humanity & History": "0.4333",
            "Literature & Art": "0.5778",
            "Biology & Medicine": "0.6889",
            "Finance & Commerce": "0.6222",
            "Astronomy": "0.5556",
            "Geography": "0.4667",
            "Law": "0.6556",
            "Life Tip": "0.6000",
            "Technology": "0.5111",
            "Animation": "0.5111",
            "Movie & TV Show": "0.5889",
            "Documentary": "0.5889",
            "News Report": "0.6778",
            "Esports": "0.4889",
            "Basketball": "0.3889",
            "Football": "0.5333",
            "Athletics": "0.5111",
            "Other Sports": "0.5111",
            "Stage Play": "0.7000",
            "Magic Show": "0.5444",
            "Variety Show": "0.4556",
            "Acrobatics": "0.5778",
            "Handicraft": "0.6667",
            "Food": "0.5000",
            "Fashion": "0.4778",
            "Daily Life": "0.5333",
            "Travel": "0.5556",
            "Pet & Animal": "0.6222",
            "Exercise": "0.4111",
            "Multilingual": "0.4889"
        },
        "task_type": {
            "Temporal Perception": "0.5455",
            "Spatial Perception": "0.6296",
            "Attribute Perception": "0.6802",
            "Action Recognition": "0.5208",
            "Object Recognition": "0.6356",
            "OCR Problems": "0.5827",
            "Counting Problem": "0.3657",
            "Temporal Reasoning": "0.3559",
            "Spatial Reasoning": "0.7321",
            "Action Reasoning": "0.4737",
            "Object Reasoning": "0.5176",
            "Information Synopsis": "0.6935"
        }
    }
}
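
Because every score in the result file is a string, it has to be converted to a float before sorting or comparing. The sketch below ranks the `task_type` entries of the `overall` block above to surface the weakest and strongest task types; the variable names are illustrative only.

```python
# "task_type" scores from the "overall" block above
# (InternVL2-26B, 16 frames, no subtitles).
task_type = {
    "Temporal Perception": "0.5455",
    "Spatial Perception": "0.6296",
    "Attribute Perception": "0.6802",
    "Action Recognition": "0.5208",
    "Object Recognition": "0.6356",
    "OCR Problems": "0.5827",
    "Counting Problem": "0.3657",
    "Temporal Reasoning": "0.3559",
    "Spatial Reasoning": "0.7321",
    "Action Reasoning": "0.4737",
    "Object Reasoning": "0.5176",
    "Information Synopsis": "0.6935",
}

# Sort ascending by numeric value, not by the string representation.
ranked = sorted(task_type.items(), key=lambda kv: float(kv[1]))

weakest = [name for name, _ in ranked[:2]]
strongest = ranked[-1][0]

print("weakest:", weakest)
print("strongest:", strongest)
```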

When testing InternVL2-26B with subtitles:

torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-26B --verbose --nframe 16 --use-subtitle

The expected test results are:

{
    "short": {
        "overall": "0.6844",
        "domain": {
            "Knowledge": "0.6889",
            "Film & Television": "0.7250",
            "Sports Competition": "0.6200",
            "Artistic Performance": "0.7167",
            "Life Record": "0.7000",
            "Multilingual": "0.5667"
        },
        "sub_category": {
            "Humanity & History": "0.3667",
            "Literature & Art": "0.6000",
            "Biology & Medicine": "0.9000",
            "Finance & Commerce": "0.7333",
            "Astronomy": "0.7000",
            "Geography": "0.7333",
            "Law": "0.8333",
            "Life Tip": "0.7000",
            "Technology": "0.6333",
            "Animation": "0.7333",
            "Movie & TV Show": "0.7333",
            "Documentary": "0.5667",
            "News Report": "0.8667",
            "Esports": "0.6667",
            "Basketball": "0.4333",
            "Football": "0.6667",
            "Athletics": "0.7333",
            "Other Sports": "0.6000",
            "Stage Play": "0.8333",
            "Magic Show": "0.6000",
            "Variety Show": "0.7667",
            "Acrobatics": "0.6667",
            "Handicraft": "0.6667",
            "Food": "0.8333",
            "Fashion": "0.6667",
            "Daily Life": "0.7667",
            "Travel": "0.7667",
            "Pet & Animal": "0.7333",
            "Exercise": "0.4667",
            "Multilingual": "0.5667"
        },
        "task_type": {
            "Temporal Perception": "0.7778",
            "Spatial Perception": "0.7000",
            "Attribute Perception": "0.7869",
            "Action Recognition": "0.6336",
            "Object Recognition": "0.6905",
            "OCR Problems": "0.8070",
            "Counting Problem": "0.4080",
            "Temporal Reasoning": "0.7692",
            "Spatial Reasoning": "0.8519",
            "Action Reasoning": "0.7021",
            "Object Reasoning": "0.7125",
            "Information Synopsis": "0.8049"
        }
    },
    "medium": {
        "overall": "0.5456",
        "domain": {
            "Knowledge": "0.5852",
            "Film & Television": "0.6167",
            "Sports Competition": "0.4400",
            "Artistic Performance": "0.6333",
            "Life Record": "0.4714",
            "Multilingual": "0.6000"
        },
        "sub_category": {
            "Humanity & History": "0.5333",
            "Literature & Art": "0.5667",
            "Biology & Medicine": "0.6333",
            "Finance & Commerce": "0.6667",
            "Astronomy": "0.6667",
            "Geography": "0.5000",
            "Law": "0.6333",
            "Life Tip": "0.5667",
            "Technology": "0.5000",
            "Animation": "0.3333",
            "Movie & TV Show": "0.6333",
            "Documentary": "0.7000",
            "News Report": "0.8000",
            "Esports": "0.4333",
            "Basketball": "0.2667",
            "Football": "0.6000",
            "Athletics": "0.3667",
            "Other Sports": "0.5333",
            "Stage Play": "0.7667",
            "Magic Show": "0.6000",
            "Variety Show": "0.6000",
            "Acrobatics": "0.5667",
            "Handicraft": "0.6333",
            "Food": "0.3000",
            "Fashion": "0.4333",
            "Daily Life": "0.3667",
            "Travel": "0.6000",
            "Pet & Animal": "0.4667",
            "Exercise": "0.5000",
            "Multilingual": "0.6000"
        },
        "task_type": {
            "Temporal Perception": "0.4839",
            "Spatial Perception": "0.4762",
            "Attribute Perception": "0.5890",
            "Action Recognition": "0.4622",
            "Object Recognition": "0.6591",
            "OCR Problems": "0.4706",
            "Counting Problem": "0.3474",
            "Temporal Reasoning": "0.4247",
            "Spatial Reasoning": "0.8333",
            "Action Reasoning": "0.4310",
            "Object Reasoning": "0.6194",
            "Information Synopsis": "0.7949"
        }
    },
    "long": {
        "overall": "0.4833",
        "domain": {
            "Knowledge": "0.5296",
            "Film & Television": "0.5083",
            "Sports Competition": "0.4333",
            "Artistic Performance": "0.4583",
            "Life Record": "0.4667",
            "Multilingual": "0.4333"
        },
        "sub_category": {
            "Humanity & History": "0.4667",
            "Literature & Art": "0.5000",
            "Biology & Medicine": "0.7000",
            "Finance & Commerce": "0.6667",
            "Astronomy": "0.6000",
            "Geography": "0.3333",
            "Law": "0.5667",
            "Life Tip": "0.5000",
            "Technology": "0.4333",
            "Animation": "0.4000",
            "Movie & TV Show": "0.4667",
            "Documentary": "0.6667",
            "News Report": "0.5000",
            "Esports": "0.4667",
            "Basketball": "0.3667",
            "Football": "0.5667",
            "Athletics": "0.3333",
            "Other Sports": "0.4333",
            "Stage Play": "0.7667",
            "Magic Show": "0.4000",
            "Variety Show": "0.2000",
            "Acrobatics": "0.4667",
            "Handicraft": "0.6333",
            "Food": "0.3333",
            "Fashion": "0.4333",
            "Daily Life": "0.4667",
            "Travel": "0.3000",
            "Pet & Animal": "0.7000",
            "Exercise": "0.4000",
            "Multilingual": "0.4333"
        },
        "task_type": {
            "Temporal Perception": "0.0000",
            "Spatial Perception": "0.3333",
            "Attribute Perception": "0.5556",
            "Action Recognition": "0.4444",
            "Object Recognition": "0.4815",
            "OCR Problems": "0.6429",
            "Counting Problem": "0.3333",
            "Temporal Reasoning": "0.2967",
            "Spatial Reasoning": "0.7273",
            "Action Reasoning": "0.4611",
            "Object Reasoning": "0.4667",
            "Information Synopsis": "0.6748"
        }
    },
    "overall": {
        "overall": "0.5711",
        "domain": {
            "Knowledge": "0.6012",
            "Film & Television": "0.6167",
            "Sports Competition": "0.4978",
            "Artistic Performance": "0.6028",
            "Life Record": "0.5460",
            "Multilingual": "0.5333"
        },
        "sub_category": {
            "Humanity & History": "0.4556",
            "Literature & Art": "0.5556",
            "Biology & Medicine": "0.7444",
            "Finance & Commerce": "0.6889",
            "Astronomy": "0.6556",
            "Geography": "0.5222",
            "Law": "0.6778",
            "Life Tip": "0.5889",
            "Technology": "0.5222",
            "Animation": "0.4889",
            "Movie & TV Show": "0.6111",
            "Documentary": "0.6444",
            "News Report": "0.7222",
            "Esports": "0.5222",
            "Basketball": "0.3556",
            "Football": "0.6111",
            "Athletics": "0.4778",
            "Other Sports": "0.5222",
            "Stage Play": "0.7889",
            "Magic Show": "0.5333",
            "Variety Show": "0.5222",
            "Acrobatics": "0.5667",
            "Handicraft": "0.6444",
            "Food": "0.4889",
            "Fashion": "0.5111",
            "Daily Life": "0.5333",
            "Travel": "0.5556",
            "Pet & Animal": "0.6333",
            "Exercise": "0.4556",
            "Multilingual": "0.5333"
        },
        "task_type": {
            "Temporal Perception": "0.5273",
            "Spatial Perception": "0.5926",
            "Attribute Perception": "0.6937",
            "Action Recognition": "0.5304",
            "Object Recognition": "0.6469",
            "OCR Problems": "0.6259",
            "Counting Problem": "0.3731",
            "Temporal Reasoning": "0.3842",
            "Spatial Reasoning": "0.8214",
            "Action Reasoning": "0.4947",
            "Object Reasoning": "0.5551",
            "Information Synopsis": "0.7368"
        }
    }
}
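
To see how much subtitles contribute, the per-duration `overall` scores of the two InternVL2-26B runs above can be subtracted. This is a small sketch using only the numbers reported in this document; in practice the two dictionaries would be loaded from the result files written by the evaluation run.

```python
# "overall" accuracy per duration bucket for InternVL2-26B (16 frames),
# copied from the two result blocks above.
without_subs = {"short": "0.6667", "medium": "0.5200", "long": "0.4578", "overall": "0.5481"}
with_subs = {"short": "0.6844", "medium": "0.5456", "long": "0.4833", "overall": "0.5711"}

# Absolute accuracy gain from adding subtitles, rounded to the
# four-decimal precision used in the result files.
subtitle_gain = {
    bucket: round(float(with_subs[bucket]) - float(without_subs[bucket]), 4)
    for bucket in without_subs
}

for bucket, gain in subtitle_gain.items():
    print(f"{bucket:<8}{gain:+.4f}")

helps_everywhere = all(gain > 0 for gain in subtitle_gain.values())
```

For these runs the gain is positive in every bucket, with the overall accuracy rising from 0.5481 to 0.5711.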

When testing InternVL2-40B without subtitles:

torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-40B --verbose --nframe 16

The expected test results are:

{
    "short": {
        "overall": "0.7200",
        "domain": {
            "Knowledge": "0.7222",
            "Film & Television": "0.7417",
            "Sports Competition": "0.6667",
            "Artistic Performance": "0.7583",
            "Life Record": "0.7476",
            "Multilingual": "0.5333"
        },
        "sub_category": {
            "Humanity & History": "0.4333",
            "Literature & Art": "0.6667",
            "Biology & Medicine": "0.9667",
            "Finance & Commerce": "0.8000",
            "Astronomy": "0.8000",
            "Geography": "0.6333",
            "Law": "0.7333",
            "Life Tip": "0.7333",
            "Technology": "0.7333",
            "Animation": "0.8000",
            "Movie & TV Show": "0.7333",
            "Documentary": "0.5667",
            "News Report": "0.8667",
            "Esports": "0.6667",
            "Basketball": "0.4333",
            "Football": "0.7667",
            "Athletics": "0.8000",
            "Other Sports": "0.6667",
            "Stage Play": "0.9000",
            "Magic Show": "0.6667",
            "Variety Show": "0.7667",
            "Acrobatics": "0.7000",
            "Handicraft": "0.8667",
            "Food": "0.7333",
            "Fashion": "0.7333",
            "Daily Life": "0.7333",
            "Travel": "0.7667",
            "Pet & Animal": "0.8000",
            "Exercise": "0.6000",
            "Multilingual": "0.5333"
        },
        "task_type": {
            "Temporal Perception": "0.8889",
            "Spatial Perception": "0.7333",
            "Attribute Perception": "0.8033",
            "Action Recognition": "0.6718",
            "Object Recognition": "0.7262",
            "OCR Problems": "0.8596",
            "Counting Problem": "0.4400",
            "Temporal Reasoning": "0.8462",
            "Spatial Reasoning": "0.8889",
            "Action Reasoning": "0.7660",
            "Object Reasoning": "0.7250",
            "Information Synopsis": "0.8415"
        }
    },
    "medium": {
        "overall": "0.5911",
        "domain": {
            "Knowledge": "0.6074",
            "Film & Television": "0.6417",
            "Sports Competition": "0.5067",
            "Artistic Performance": "0.6583",
            "Life Record": "0.5429",
            "Multilingual": "0.7333"
        },
        "sub_category": {
            "Humanity & History": "0.5333",
            "Literature & Art": "0.6333",
            "Biology & Medicine": "0.6000",
            "Finance & Commerce": "0.6000",
            "Astronomy": "0.5667",
            "Geography": "0.5333",
            "Law": "0.8000",
            "Life Tip": "0.5667",
            "Technology": "0.6333",
            "Animation": "0.4000",
            "Movie & TV Show": "0.7000",
            "Documentary": "0.8000",
            "News Report": "0.6667",
            "Esports": "0.6333",
            "Basketball": "0.1667",
            "Football": "0.5333",
            "Athletics": "0.6000",
            "Other Sports": "0.6000",
            "Stage Play": "0.7667",
            "Magic Show": "0.6333",
            "Variety Show": "0.5667",
            "Acrobatics": "0.6667",
            "Handicraft": "0.7000",
            "Food": "0.3667",
            "Fashion": "0.4333",
            "Daily Life": "0.5333",
            "Travel": "0.6333",
            "Pet & Animal": "0.4000",
            "Exercise": "0.7333",
            "Multilingual": "0.7333"
        },
        "task_type": {
            "Temporal Perception": "0.5484",
            "Spatial Perception": "0.6190",
            "Attribute Perception": "0.6712",
            "Action Recognition": "0.5126",
            "Object Recognition": "0.6667",
            "OCR Problems": "0.5000",
            "Counting Problem": "0.3579",
            "Temporal Reasoning": "0.5068",
            "Spatial Reasoning": "0.7778",
            "Action Reasoning": "0.5345",
            "Object Reasoning": "0.6716",
            "Information Synopsis": "0.8205"
        }
    },
    "long": {
        "overall": "0.5256",
        "domain": {
            "Knowledge": "0.5926",
            "Film & Television": "0.4583",
            "Sports Competition": "0.5267",
            "Artistic Performance": "0.5417",
            "Life Record": "0.4762",
            "Multilingual": "0.4667"
        },
        "sub_category": {
            "Humanity & History": "0.5333",
            "Literature & Art": "0.6000",
            "Biology & Medicine": "0.7000",
            "Finance & Commerce": "0.7667",
            "Astronomy": "0.5667",
            "Geography": "0.4333",
            "Law": "0.5000",
            "Life Tip": "0.6333",
            "Technology": "0.6000",
            "Animation": "0.3000",
            "Movie & TV Show": "0.5333",
            "Documentary": "0.5333",
            "News Report": "0.4667",
            "Esports": "0.6667",
            "Basketball": "0.3667",
            "Football": "0.6000",
            "Athletics": "0.4000",
            "Other Sports": "0.6000",
            "Stage Play": "0.7000",
            "Magic Show": "0.5667",
            "Variety Show": "0.3667",
            "Acrobatics": "0.5333",
            "Handicraft": "0.5667",
            "Food": "0.3667",
            "Fashion": "0.4000",
            "Daily Life": "0.4333",
            "Travel": "0.3667",
            "Pet & Animal": "0.6333",
            "Exercise": "0.5667",
            "Multilingual": "0.4667"
        },
        "task_type": {
            "Temporal Perception": "0.3333",
            "Spatial Perception": "0.3333",
            "Attribute Perception": "0.6667",
            "Action Recognition": "0.5397",
            "Object Recognition": "0.5185",
            "OCR Problems": "0.4286",
            "Counting Problem": "0.2917",
            "Temporal Reasoning": "0.3297",
            "Spatial Reasoning": "0.6364",
            "Action Reasoning": "0.5000",
            "Object Reasoning": "0.5292",
            "Information Synopsis": "0.7117"
        }
    },
    "overall": {
        "overall": "0.6122",
        "domain": {
            "Knowledge": "0.6407",
            "Film & Television": "0.6139",
            "Sports Competition": "0.5667",
            "Artistic Performance": "0.6528",
            "Life Record": "0.5889",
            "Multilingual": "0.5778"
        },
        "sub_category": {
            "Humanity & History": "0.5000",
            "Literature & Art": "0.6333",
            "Biology & Medicine": "0.7556",
            "Finance & Commerce": "0.7222",
            "Astronomy": "0.6444",
            "Geography": "0.5333",
            "Law": "0.6778",
            "Life Tip": "0.6444",
            "Technology": "0.6556",
            "Animation": "0.5000",
            "Movie & TV Show": "0.6556",
            "Documentary": "0.6333",
            "News Report": "0.6667",
            "Esports": "0.6556",
            "Basketball": "0.3222",
            "Football": "0.6333",
            "Athletics": "0.6000",
            "Other Sports": "0.6222",
            "Stage Play": "0.7889",
            "Magic Show": "0.6222",
            "Variety Show": "0.5667",
            "Acrobatics": "0.6333",
            "Handicraft": "0.7111",
            "Food": "0.4889",
            "Fashion": "0.5222",
            "Daily Life": "0.5667",
            "Travel": "0.5889",
            "Pet & Animal": "0.6111",
            "Exercise": "0.6333",
            "Multilingual": "0.5778"
        },
        "task_type": {
            "Temporal Perception": "0.6364",
            "Spatial Perception": "0.6667",
            "Attribute Perception": "0.7432",
            "Action Recognition": "0.5847",
            "Object Recognition": "0.6723",
            "OCR Problems": "0.6403",
            "Counting Problem": "0.3843",
            "Temporal Reasoning": "0.4407",
            "Spatial Reasoning": "0.8036",
            "Action Reasoning": "0.5509",
            "Object Reasoning": "0.6057",
            "Information Synopsis": "0.7709"
        }
    }
}
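
As a quick sanity check on the numbers above, the combined "overall" score appears to be the mean of the three per-duration scores. The following is a minimal sketch, assuming every duration bucket (short, medium, long) contains the same number of questions, which is what the unweighted mean relies on. The values are copied from the InternVL2-40B results above; the variable names are illustrative only.

```python
# Per-duration overall accuracies copied from the InternVL2-40B results above
# (16 frames, no subtitles).
per_duration = {"short": 0.7200, "medium": 0.5911, "long": 0.5256}
reported_overall = 0.6122

# Assumption: every duration bucket contains the same number of questions,
# so the combined score is the unweighted mean of the three buckets.
mean_overall = sum(per_duration.values()) / len(per_duration)

print(f"mean of buckets: {mean_overall:.4f}, reported: {reported_overall:.4f}")
```

If a local run reports an overall score that is far from the mean of its own per-duration scores, part of the evaluation may not have completed.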

When testing InternVL2-40B with subtitles:

torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-40B --verbose --nframe 16 --use-subtitle

The expected test results are:

{
    "short": {
        "overall": "0.7278",
        "domain": {
            "Knowledge": "0.7370",
            "Film & Television": "0.7583",
            "Sports Competition": "0.6800",
            "Artistic Performance": "0.7750",
            "Life Record": "0.7286",
            "Multilingual": "0.5667"
        },
        "sub_category": {
            "Humanity & History": "0.4333",
            "Literature & Art": "0.6333",
            "Biology & Medicine": "0.9667",
            "Finance & Commerce": "0.8667",
            "Astronomy": "0.8333",
            "Geography": "0.7000",
            "Law": "0.7667",
            "Life Tip": "0.7000",
            "Technology": "0.7333",
            "Animation": "0.7667",
            "Movie & TV Show": "0.7000",
            "Documentary": "0.6667",
            "News Report": "0.9000",
            "Esports": "0.6667",
            "Basketball": "0.3667",
            "Football": "0.8000",
            "Athletics": "0.8333",
            "Other Sports": "0.7333",
            "Stage Play": "0.8667",
            "Magic Show": "0.7333",
            "Variety Show": "0.8000",
            "Acrobatics": "0.7000",
            "Handicraft": "0.7667",
            "Food": "0.8000",
            "Fashion": "0.6667",
            "Daily Life": "0.7333",
            "Travel": "0.7667",
            "Pet & Animal": "0.8000",
            "Exercise": "0.5667",
            "Multilingual": "0.5667"
        },
        "task_type": {
            "Temporal Perception": "0.8889",
            "Spatial Perception": "0.7333",
            "Attribute Perception": "0.8115",
            "Action Recognition": "0.6870",
            "Object Recognition": "0.7202",
            "OCR Problems": "0.8596",
            "Counting Problem": "0.4640",
            "Temporal Reasoning": "0.6923",
            "Spatial Reasoning": "0.8889",
            "Action Reasoning": "0.7234",
            "Object Reasoning": "0.7625",
            "Information Synopsis": "0.8780"
        }
    },
    "medium": {
        "overall": "0.6133",
        "domain": {
            "Knowledge": "0.6630",
            "Film & Television": "0.6583",
            "Sports Competition": "0.5133",
            "Artistic Performance": "0.6917",
            "Life Record": "0.5333",
            "Multilingual": "0.7333"
        },
        "sub_category": {
            "Humanity & History": "0.6000",
            "Literature & Art": "0.7000",
            "Biology & Medicine": "0.5667",
            "Finance & Commerce": "0.7333",
            "Astronomy": "0.7000",
            "Geography": "0.5667",
            "Law": "0.8333",
            "Life Tip": "0.6667",
            "Technology": "0.6000",
            "Animation": "0.4333",
            "Movie & TV Show": "0.7667",
            "Documentary": "0.7333",
            "News Report": "0.7000",
            "Esports": "0.5667",
            "Basketball": "0.2667",
            "Football": "0.5667",
            "Athletics": "0.5667",
            "Other Sports": "0.6000",
            "Stage Play": "0.8000",
            "Magic Show": "0.6333",
            "Variety Show": "0.6667",
            "Acrobatics": "0.6667",
            "Handicraft": "0.7000",
            "Food": "0.3333",
            "Fashion": "0.4000",
            "Daily Life": "0.5333",
            "Travel": "0.6333",
            "Pet & Animal": "0.4667",
            "Exercise": "0.6667",
            "Multilingual": "0.7333"
        },
        "task_type": {
            "Temporal Perception": "0.5484",
            "Spatial Perception": "0.5238",
            "Attribute Perception": "0.6438",
            "Action Recognition": "0.5798",
            "Object Recognition": "0.7121",
            "OCR Problems": "0.4706",
            "Counting Problem": "0.3684",
            "Temporal Reasoning": "0.5479",
            "Spatial Reasoning": "0.8333",
            "Action Reasoning": "0.6034",
            "Object Reasoning": "0.6791",
            "Information Synopsis": "0.8462"
        }
    },
    "long": {
        "overall": "0.5300",
        "domain": {
            "Knowledge": "0.5889",
            "Film & Television": "0.5000",
            "Sports Competition": "0.5000",
            "Artistic Performance": "0.6000",
            "Life Record": "0.4571",
            "Multilingual": "0.5000"
        },
        "sub_category": {
            "Humanity & History": "0.5333",
            "Literature & Art": "0.6000",
            "Biology & Medicine": "0.6333",
            "Finance & Commerce": "0.6333",
            "Astronomy": "0.6667",
            "Geography": "0.4000",
            "Law": "0.7000",
            "Life Tip": "0.6000",
            "Technology": "0.5333",
            "Animation": "0.3667",
            "Movie & TV Show": "0.5000",
            "Documentary": "0.5333",
            "News Report": "0.6000",
            "Esports": "0.6000",
            "Basketball": "0.3333",
            "Football": "0.6333",
            "Athletics": "0.4000",
            "Other Sports": "0.5333",
            "Stage Play": "0.8667",
            "Magic Show": "0.5667",
            "Variety Show": "0.4333",
            "Acrobatics": "0.5333",
            "Handicraft": "0.5667",
            "Food": "0.3333",
            "Fashion": "0.3667",
            "Daily Life": "0.4333",
            "Travel": "0.4000",
            "Pet & Animal": "0.6667",
            "Exercise": "0.4333",
            "Multilingual": "0.5000"
        },
        "task_type": {
            "Temporal Perception": "0.1667",
            "Spatial Perception": "0.3333",
            "Attribute Perception": "0.6296",
            "Action Recognition": "0.5714",
            "Object Recognition": "0.5185",
            "OCR Problems": "0.5714",
            "Counting Problem": "0.2708",
            "Temporal Reasoning": "0.3187",
            "Spatial Reasoning": "0.6364",
            "Action Reasoning": "0.4889",
            "Object Reasoning": "0.5417",
            "Information Synopsis": "0.7301"
        }
    },
    "overall": {
        "overall": "0.6237",
        "domain": {
            "Knowledge": "0.6630",
            "Film & Television": "0.6389",
            "Sports Competition": "0.5644",
            "Artistic Performance": "0.6889",
            "Life Record": "0.5730",
            "Multilingual": "0.6000"
        },
        "sub_category": {
            "Humanity & History": "0.5222",
            "Literature & Art": "0.6444",
            "Biology & Medicine": "0.7222",
            "Finance & Commerce": "0.7444",
            "Astronomy": "0.7333",
            "Geography": "0.5556",
            "Law": "0.7667",
            "Life Tip": "0.6556",
            "Technology": "0.6222",
            "Animation": "0.5222",
            "Movie & TV Show": "0.6556",
            "Documentary": "0.6444",
            "News Report": "0.7333",
            "Esports": "0.6111",
            "Basketball": "0.3222",
            "Football": "0.6667",
            "Athletics": "0.6000",
            "Other Sports": "0.6222",
            "Stage Play": "0.8444",
            "Magic Show": "0.6444",
            "Variety Show": "0.6333",
            "Acrobatics": "0.6333",
            "Handicraft": "0.6778",
            "Food": "0.4889",
            "Fashion": "0.4778",
            "Daily Life": "0.5667",
            "Travel": "0.6000",
            "Pet & Animal": "0.6444",
            "Exercise": "0.5556",
            "Multilingual": "0.6000"
        },
        "task_type": {
            "Temporal Perception": "0.6182",
            "Spatial Perception": "0.6296",
            "Attribute Perception": "0.7342",
            "Action Recognition": "0.6230",
            "Object Recognition": "0.6864",
            "OCR Problems": "0.6403",
            "Counting Problem": "0.3955",
            "Temporal Reasoning": "0.4407",
            "Spatial Reasoning": "0.8214",
            "Action Reasoning": "0.5509",
            "Object Reasoning": "0.6211",
            "Information Synopsis": "0.7957"
        }
    }
}
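
To see how much subtitles contribute, the two InternVL2-40B runs can be compared bucket by bucket. This is a small sketch using only the "overall" value of each duration bucket, copied from the two result listings above; the dictionary names are illustrative and not part of VLMEvalKit.

```python
# "overall" accuracy per duration bucket for InternVL2-40B (16 frames),
# copied from the results without and with subtitles.
without_subs = {"short": 0.7200, "medium": 0.5911, "long": 0.5256, "overall": 0.6122}
with_subs = {"short": 0.7278, "medium": 0.6133, "long": 0.5300, "overall": 0.6237}

# Absolute gain from adding subtitles, per bucket.
gain = {bucket: with_subs[bucket] - without_subs[bucket] for bucket in without_subs}

for bucket, delta in gain.items():
    print(f"{bucket:>7}: {delta * 100:+.2f} points")
```

In these particular numbers every bucket improves with subtitles, and the medium-length videos gain the most.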

When testing InternVL2-76B without subtitles:

torchrun --nproc-per-node=1 run.py --data Video-MME --model InternVL2-76B --verbose --nframe 16

The expected test results are:

{
    "short": {
        "overall": "0.7222",
        "domain": {
            "Knowledge": "0.7593",
            "Film & Television": "0.7167",
            "Sports Competition": "0.6800",
            "Artistic Performance": "0.7500",
            "Life Record": "0.7143",
            "Multilingual": "0.5667"
        },
        "sub_category": {
            "Humanity & History": "0.5333",
            "Literature & Art": "0.6667",
            "Biology & Medicine": "0.9333",
            "Finance & Commerce": "0.8333",
            "Astronomy": "0.7667",
            "Geography": "0.7333",
            "Law": "0.8000",
            "Life Tip": "0.7667",
            "Technology": "0.8000",
            "Animation": "0.8000",
            "Movie & TV Show": "0.6333",
            "Documentary": "0.5667",
            "News Report": "0.8667",
            "Esports": "0.6667",
            "Basketball": "0.6000",
            "Football": "0.7667",
            "Athletics": "0.7333",
            "Other Sports": "0.6333",
            "Stage Play": "0.8667",
            "Magic Show": "0.6667",
            "Variety Show": "0.7333",
            "Acrobatics": "0.7333",
            "Handicraft": "0.8000",
            "Food": "0.7333",
            "Fashion": "0.6000",
            "Daily Life": "0.7333",
            "Travel": "0.8667",
            "Pet & Animal": "0.7667",
            "Exercise": "0.5000",
            "Multilingual": "0.5667"
        },
        "task_type": {
            "Temporal Perception": "0.9444",
            "Spatial Perception": "0.8333",
            "Attribute Perception": "0.7869",
            "Action Recognition": "0.6870",
            "Object Recognition": "0.6786",
            "OCR Problems": "0.8596",
            "Counting Problem": "0.4400",
            "Temporal Reasoning": "0.6923",
            "Spatial Reasoning": "0.8519",
            "Action Reasoning": "0.8085",
            "Object Reasoning": "0.8000",
            "Information Synopsis": "0.8537"
        }
    },
    "medium": {
        "overall": "0.5800",
        "domain": {
            "Knowledge": "0.5741",
            "Film & Television": "0.6833",
            "Sports Competition": "0.5200",
            "Artistic Performance": "0.6833",
            "Life Record": "0.5095",
            "Multilingual": "0.6000"
        },
        "sub_category": {
            "Humanity & History": "0.5000",
            "Literature & Art": "0.6000",
            "Biology & Medicine": "0.5667",
            "Finance & Commerce": "0.6333",
            "Astronomy": "0.6000",
            "Geography": "0.5000",
            "Law": "0.6333",
            "Life Tip": "0.6000",
            "Technology": "0.5333",
            "Animation": "0.6000",
            "Movie & TV Show": "0.7667",
            "Documentary": "0.7667",
            "News Report": "0.6000",
            "Esports": "0.5000",
            "Basketball": "0.4000",
            "Football": "0.6000",
            "Athletics": "0.4667",
            "Other Sports": "0.6333",
            "Stage Play": "0.8000",
            "Magic Show": "0.6333",
            "Variety Show": "0.6000",
            "Acrobatics": "0.7000",
            "Handicraft": "0.7333",
            "Food": "0.3000",
            "Fashion": "0.4000",
            "Daily Life": "0.3667",
            "Travel": "0.5667",
            "Pet & Animal": "0.6333",
            "Exercise": "0.5667",
            "Multilingual": "0.6000"
        },
        "task_type": {
            "Temporal Perception": "0.5806",
            "Spatial Perception": "0.5238",
            "Attribute Perception": "0.6027",
            "Action Recognition": "0.5546",
            "Object Recognition": "0.6212",
            "OCR Problems": "0.5000",
            "Counting Problem": "0.4000",
            "Temporal Reasoning": "0.3836",
            "Spatial Reasoning": "0.7222",
            "Action Reasoning": "0.6207",
            "Object Reasoning": "0.6642",
            "Information Synopsis": "0.8077"
        }
    },
    "long": {
        "overall": "0.5333",
        "domain": {
            "Knowledge": "0.5926",
            "Film & Television": "0.4667",
            "Sports Competition": "0.5200",
            "Artistic Performance": "0.5750",
            "Life Record": "0.4810",
            "Multilingual": "0.5333"
        },
        "sub_category": {
            "Humanity & History": "0.5333",
            "Literature & Art": "0.6000",
            "Biology & Medicine": "0.5667",
            "Finance & Commerce": "0.6667",
            "Astronomy": "0.7333",
            "Geography": "0.5000",
            "Law": "0.5333",
            "Life Tip": "0.7000",
            "Technology": "0.5000",
            "Animation": "0.4000",
            "Movie & TV Show": "0.4000",
            "Documentary": "0.4667",
            "News Report": "0.6000",
            "Esports": "0.4333",
            "Basketball": "0.5333",
            "Football": "0.5667",
            "Athletics": "0.5000",
            "Other Sports": "0.5667",
            "Stage Play": "0.7333",
            "Magic Show": "0.5667",
            "Variety Show": "0.3333",
            "Acrobatics": "0.6667",
            "Handicraft": "0.5667",
            "Food": "0.3667",
            "Fashion": "0.5000",
            "Daily Life": "0.4667",
            "Travel": "0.3667",
            "Pet & Animal": "0.7000",
            "Exercise": "0.4000",
            "Multilingual": "0.5333"
        },
        "task_type": {
            "Temporal Perception": "0.5000",
            "Spatial Perception": "0.3333",
            "Attribute Perception": "0.5185",
            "Action Recognition": "0.5556",
            "Object Recognition": "0.5741",
            "OCR Problems": "0.3571",
            "Counting Problem": "0.3750",
            "Temporal Reasoning": "0.4835",
            "Spatial Reasoning": "0.6364",
            "Action Reasoning": "0.4778",
            "Object Reasoning": "0.5250",
            "Information Synopsis": "0.6748"
        }
    },
    "overall": {
        "overall": "0.6119",
        "domain": {
            "Knowledge": "0.6420",
            "Film & Television": "0.6222",
            "Sports Competition": "0.5733",
            "Artistic Performance": "0.6694",
            "Life Record": "0.5683",
            "Multilingual": "0.5667"
        },
        "sub_category": {
            "Humanity & History": "0.5222",
            "Literature & Art": "0.6222",
            "Biology & Medicine": "0.6889",
            "Finance & Commerce": "0.7111",
            "Astronomy": "0.7000",
            "Geography": "0.5778",
            "Law": "0.6556",
            "Life Tip": "0.6889",
            "Technology": "0.6111",
            "Animation": "0.6000",
            "Movie & TV Show": "0.6000",
            "Documentary": "0.6000",
            "News Report": "0.6889",
            "Esports": "0.5333",
            "Basketball": "0.5111",
            "Football": "0.6444",
            "Athletics": "0.5667",
            "Other Sports": "0.6111",
            "Stage Play": "0.8000",
            "Magic Show": "0.6222",
            "Variety Show": "0.5556",
            "Acrobatics": "0.7000",
            "Handicraft": "0.7000",
            "Food": "0.4667",
            "Fashion": "0.5000",
            "Daily Life": "0.5222",
            "Travel": "0.6000",
            "Pet & Animal": "0.7000",
            "Exercise": "0.4889",
            "Multilingual": "0.5667"
        },
        "task_type": {
            "Temporal Perception": "0.6909",
            "Spatial Perception": "0.6852",
            "Attribute Perception": "0.6937",
            "Action Recognition": "0.6102",
            "Object Recognition": "0.6412",
            "OCR Problems": "0.6331",
            "Counting Problem": "0.4142",
            "Temporal Reasoning": "0.4576",
            "Spatial Reasoning": "0.7679",
            "Action Reasoning": "0.5614",
            "Object Reasoning": "0.6145",
            "Information Synopsis": "0.7523"
        }
    }
}
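
The per-task breakdown is easier to read once it is sorted. The sketch below ranks the overall "task_type" scores from the InternVL2-76B results above to surface the weakest and strongest task types. The scores are copied verbatim (as strings, matching the output format); the variable names are illustrative.

```python
# Overall task-type accuracies copied from the InternVL2-76B results above
# (16 frames, no subtitles).
task_type = {
    "Temporal Perception": "0.6909",
    "Spatial Perception": "0.6852",
    "Attribute Perception": "0.6937",
    "Action Recognition": "0.6102",
    "Object Recognition": "0.6412",
    "OCR Problems": "0.6331",
    "Counting Problem": "0.4142",
    "Temporal Reasoning": "0.4576",
    "Spatial Reasoning": "0.7679",
    "Action Reasoning": "0.5614",
    "Object Reasoning": "0.6145",
    "Information Synopsis": "0.7523",
}

# Scores are stored as strings, so convert to float before sorting (ascending).
ranked = sorted(task_type.items(), key=lambda item: float(item[1]))
weakest = ranked[:3]
strongest = ranked[-3:]

print("weakest:  ", ", ".join(f"{name} ({score})" for name, score in weakest))
print("strongest:", ", ".join(f"{name} ({score})" for name, score in strongest))
```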

When testing InternVL2-76B with subtitles:

torchrun --nproc-per-node=1 run.py --data Video-MME --model InternVL2-76B --verbose --nframe 16 --use-subtitle

The expected test results are:

{
    "short": {
        "overall": "0.7422",
        "domain": {
            "Knowledge": "0.7667",
            "Film & Television": "0.7583",
            "Sports Competition": "0.7067",
            "Artistic Performance": "0.7833",
            "Life Record": "0.7286",
            "Multilingual": "0.5667"
        },
        "sub_category": {
            "Humanity & History": "0.5000",
            "Literature & Art": "0.6667",
            "Biology & Medicine": "0.9667",
            "Finance & Commerce": "0.8667",
            "Astronomy": "0.8000",
            "Geography": "0.7667",
            "Law": "0.8000",
            "Life Tip": "0.7667",
            "Technology": "0.7667",
            "Animation": "0.7667",
            "Movie & TV Show": "0.7000",
            "Documentary": "0.6667",
            "News Report": "0.9000",
            "Esports": "0.7000",
            "Basketball": "0.5000",
            "Football": "0.7667",
            "Athletics": "0.8333",
            "Other Sports": "0.7333",
            "Stage Play": "0.8333",
            "Magic Show": "0.7667",
            "Variety Show": "0.8000",
            "Acrobatics": "0.7333",
            "Handicraft": "0.8000",
            "Food": "0.8000",
            "Fashion": "0.6333",
            "Daily Life": "0.7333",
            "Travel": "0.8667",
            "Pet & Animal": "0.7333",
            "Exercise": "0.5333",
            "Multilingual": "0.5667"
        },
        "task_type": {
            "Temporal Perception": "0.8889",
            "Spatial Perception": "0.8000",
            "Attribute Perception": "0.8115",
            "Action Recognition": "0.7023",
            "Object Recognition": "0.6964",
            "OCR Problems": "0.9123",
            "Counting Problem": "0.4720",
            "Temporal Reasoning": "0.7692",
            "Spatial Reasoning": "0.8519",
            "Action Reasoning": "0.8511",
            "Object Reasoning": "0.7875",
            "Information Synopsis": "0.8902"
        }
    },
    "medium": {
        "overall": "0.5900",
        "domain": {
            "Knowledge": "0.6111",
            "Film & Television": "0.7083",
            "Sports Competition": "0.4800",
            "Artistic Performance": "0.7083",
            "Life Record": "0.5048",
            "Multilingual": "0.6000"
        },
        "sub_category": {
            "Humanity & History": "0.6000",
            "Literature & Art": "0.6333",
            "Biology & Medicine": "0.5667",
            "Finance & Commerce": "0.6333",
            "Astronomy": "0.6333",
            "Geography": "0.6000",
            "Law": "0.6667",
            "Life Tip": "0.6333",
            "Technology": "0.5333",
            "Animation": "0.5333",
            "Movie & TV Show": "0.8000",
            "Documentary": "0.7667",
            "News Report": "0.7333",
            "Esports": "0.5000",
            "Basketball": "0.3000",
            "Football": "0.5667",
            "Athletics": "0.4667",
            "Other Sports": "0.5667",
            "Stage Play": "0.8333",
            "Magic Show": "0.6667",
            "Variety Show": "0.6000",
            "Acrobatics": "0.7333",
            "Handicraft": "0.7333",
            "Food": "0.3333",
            "Fashion": "0.3333",
            "Daily Life": "0.4333",
            "Travel": "0.5333",
            "Pet & Animal": "0.6333",
            "Exercise": "0.5333",
            "Multilingual": "0.6000"
        },
        "task_type": {
            "Temporal Perception": "0.5161",
            "Spatial Perception": "0.5238",
            "Attribute Perception": "0.6027",
            "Action Recognition": "0.5546",
            "Object Recognition": "0.6439",
            "OCR Problems": "0.5147",
            "Counting Problem": "0.3579",
            "Temporal Reasoning": "0.3973",
            "Spatial Reasoning": "0.8889",
            "Action Reasoning": "0.6207",
            "Object Reasoning": "0.6791",
            "Information Synopsis": "0.8718"
        }
    },
    "long": {
        "overall": "0.5522",
        "domain": {
            "Knowledge": "0.6222",
            "Film & Television": "0.5167",
            "Sports Competition": "0.5267",
            "Artistic Performance": "0.5750",
            "Life Record": "0.4905",
            "Multilingual": "0.5333"
        },
        "sub_category": {
            "Humanity & History": "0.6333",
            "Literature & Art": "0.7000",
            "Biology & Medicine": "0.6000",
            "Finance & Commerce": "0.7667",
            "Astronomy": "0.6000",
            "Geography": "0.5333",
            "Law": "0.6667",
            "Life Tip": "0.6333",
            "Technology": "0.4667",
            "Animation": "0.4667",
            "Movie & TV Show": "0.4333",
            "Documentary": "0.5333",
            "News Report": "0.6333",
            "Esports": "0.5333",
            "Basketball": "0.4333",
            "Football": "0.6333",
            "Athletics": "0.5000",
            "Other Sports": "0.5333",
            "Stage Play": "0.7333",
            "Magic Show": "0.5667",
            "Variety Show": "0.3667",
            "Acrobatics": "0.6333",
            "Handicraft": "0.5667",
            "Food": "0.3667",
            "Fashion": "0.4667",
            "Daily Life": "0.4667",
            "Travel": "0.4333",
            "Pet & Animal": "0.7000",
            "Exercise": "0.4333",
            "Multilingual": "0.5333"
        },
        "task_type": {
            "Temporal Perception": "0.5000",
            "Spatial Perception": "0.6667",
            "Attribute Perception": "0.6667",
            "Action Recognition": "0.5238",
            "Object Recognition": "0.5000",
            "OCR Problems": "0.5714",
            "Counting Problem": "0.2917",
            "Temporal Reasoning": "0.5165",
            "Spatial Reasoning": "0.6364",
            "Action Reasoning": "0.4944",
            "Object Reasoning": "0.5458",
            "Information Synopsis": "0.7239"
        }
    },
    "overall": {
        "overall": "0.6281",
        "domain": {
            "Knowledge": "0.6667",
            "Film & Television": "0.6611",
            "Sports Competition": "0.5711",
            "Artistic Performance": "0.6889",
            "Life Record": "0.5746",
            "Multilingual": "0.5667"
        },
        "sub_category": {
            "Humanity & History": "0.5778",
            "Literature & Art": "0.6667",
            "Biology & Medicine": "0.7111",
            "Finance & Commerce": "0.7556",
            "Astronomy": "0.6778",
            "Geography": "0.6333",
            "Law": "0.7111",
            "Life Tip": "0.6778",
            "Technology": "0.5889",
            "Animation": "0.5889",
            "Movie & TV Show": "0.6444",
            "Documentary": "0.6556",
            "News Report": "0.7556",
            "Esports": "0.5778",
            "Basketball": "0.4111",
            "Football": "0.6556",
            "Athletics": "0.6000",
            "Other Sports": "0.6111",
            "Stage Play": "0.8000",
            "Magic Show": "0.6667",
            "Variety Show": "0.5889",
            "Acrobatics": "0.7000",
            "Handicraft": "0.7000",
            "Food": "0.5000",
            "Fashion": "0.4778",
            "Daily Life": "0.5444",
            "Travel": "0.6111",
            "Pet & Animal": "0.6889",
            "Exercise": "0.5000",
            "Multilingual": "0.5667"
        },
        "task_type": {
            "Temporal Perception": "0.6364",
            "Spatial Perception": "0.6852",
            "Attribute Perception": "0.7252",
            "Action Recognition": "0.6102",
            "Object Recognition": "0.6469",
            "OCR Problems": "0.6835",
            "Counting Problem": "0.3993",
            "Temporal Reasoning": "0.4859",
            "Spatial Reasoning": "0.8214",
            "Action Reasoning": "0.5789",
            "Object Reasoning": "0.6278",
            "Information Synopsis": "0.8019"
        }
    }
}
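
Because small differences between toolkits, code versions, and hardware are expected, comparing a local run against the expected results is best done with a tolerance rather than exact equality. The sketch below flattens two nested score dictionaries and lists the entries that differ by more than a threshold. It assumes the local results use the same nested layout as the listings above; the helper names, the 0.01 tolerance, and the "local_run" values are all hypothetical and chosen only for illustration.

```python
def flatten(scores, prefix=""):
    """Flatten nested score dicts into {"short/overall": 0.7422, ...}."""
    flat = {}
    for key, value in scores.items():
        path = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = float(value)  # scores are stored as strings
    return flat


def find_deviations(expected, actual, tol=0.01):
    """Return {path: (expected, actual)} for entries differing by more than tol."""
    exp_flat, act_flat = flatten(expected), flatten(actual)
    return {
        path: (exp_flat[path], act_flat[path])
        for path in exp_flat
        if path in act_flat and abs(exp_flat[path] - act_flat[path]) > tol
    }


# Excerpt of the expected InternVL2-76B results above (16 frames, with subtitles).
expected = {
    "short": {"overall": "0.7422"},
    "medium": {"overall": "0.5900"},
    "long": {"overall": "0.5522"},
    "overall": {"overall": "0.6281"},
}

# Hypothetical numbers standing in for a local rerun; NOT measured results.
local_run = {
    "short": {"overall": "0.7400"},
    "medium": {"overall": "0.5911"},
    "long": {"overall": "0.5300"},
    "overall": {"overall": "0.6204"},
}

deviations = find_deviations(expected, local_run, tol=0.01)
for path, (exp, act) in deviations.items():
    print(f"{path}: expected {exp:.4f}, got {act:.4f}")
```

With the made-up values here, only the "long/overall" entry exceeds the tolerance. The same helper works on the full nested output, including the "domain", "sub_category", and "task_type" levels.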

MMBench-Video

MMBench-Video is a benchmark designed to evaluate how well MLLMs understand video content. It addresses the limitations of traditional VideoQA benchmarks by using long-form videos sourced from YouTube, which better reflect real-world scenarios. Its free-form questions require temporal reasoning and are human-annotated according to a comprehensive capability taxonomy.

When testing with 8 frames:

torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-1B --verbose --nframe 8

The expected test results are:

{
    "coarse_all": {
        "CP": "1.11",
        "FP-S": "1.00",
        "FP-C": "0.84",
        "HL": "0.27",
        "LR": "0.71",
        "AR": "1.01",
        "RR": "1.17",
        "CSR": "0.77",
        "TR": "0.71",
        "Perception": "0.97",
        "Reasoning": "0.88",
        "Overall": "0.95"
    },
    "coarse_valid": {
        "CP": "1.11",
        "FP-S": "1.00",
        "FP-C": "0.84",
        "HL": "0.27",
        "LR": "0.71",
        "AR": "1.01",
        "RR": "1.17",
        "CSR": "0.77",
        "TR": "0.71",
        "Perception": "0.97",
        "Reasoning": "0.88",
        "Overall": "0.95"
    },
    "fine_all": {
        "Video Topic": "1.05",
        "Video Emotion": "1.27",
        "Video Scene": "0.84",
        "Video Style": "1.38",
        "OCR": "0.87",
        "Object Recognition": "1.07",
        "Attribute Recognition": "1.41",
        "Event Recognition": "0.93",
        "Human Motion": "0.84",
        "Counting": "0.99",
        "Spatial Relationship": "1.16",
        "Human-object Interaction": "0.80",
        "Human Interaction": "0.70",
        "Hallucination": "0.27",
        "Structuralized Image-Text Understanding": "0.97",
        "Mathematical Calculation": "0.31",
        "Physical Property": "0.78",
        "Function Reasoning": "0.95",
        "Identity Reasoning": "1.30",
        "Natural Relation": "1.04",
        "Physical Relation": "0.92",
        "Social Relation": "1.48",
        "Common Sense Reasoning": "0.77",
        "Counterfactual Reasoning": "0.80",
        "Causal Reasoning": "0.67",
        "Future Prediction": "0.77"
    },
    "fine_valid": {
        "Video Topic": "1.05",
        "Video Emotion": "1.27",
        "Video Scene": "0.84",
        "Video Style": "1.38",
        "OCR": "0.87",
        "Object Recognition": "1.07",
        "Attribute Recognition": "1.41",
        "Event Recognition": "0.93",
        "Human Motion": "0.84",
        "Counting": "0.99",
        "Spatial Relationship": "1.16",
        "Human-object Interaction": "0.80",
        "Human Interaction": "0.70",
        "Hallucination": "0.27",
        "Structuralized Image-Text Understanding": "0.97",
        "Mathematical Calculation": "0.31",
        "Physical Property": "0.78",
        "Function Reasoning": "0.95",
        "Identity Reasoning": "1.30",
        "Natural Relation": "1.04",
        "Physical Relation": "0.92",
        "Social Relation": "1.48",
        "Common Sense Reasoning": "0.77",
        "Counterfactual Reasoning": "0.80",
        "Causal Reasoning": "0.67",
        "Future Prediction": "0.77"
    }
}
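
Because minor discrepancies between runs are normal, it can be more useful to compare a reproduced result against the numbers above with a tolerance than to require an exact match. The following is a minimal sketch and not part of VLMEvalKit: the function name `find_mismatches`, the tolerance of 0.05, and the "reproduced" values are assumptions made for illustration. In practice both dictionaries would be loaded from JSON with the same layout as the block above.

```python
def find_mismatches(expected, actual, tol=0.05):
    """Return {section: {metric: (expected, actual)}} for scores that differ by more than tol.

    Both arguments use the layout shown above: section -> metric -> score string.
    A metric missing from `actual` is reported with None as its actual value.
    """
    mismatches = {}
    for section, metrics in expected.items():
        for key, exp_str in metrics.items():
            act_str = actual.get(section, {}).get(key)
            if act_str is None or abs(float(exp_str) - float(act_str)) > tol:
                mismatches.setdefault(section, {})[key] = (exp_str, act_str)
    return mismatches


# Expected values copied from "coarse_all" above (subset only).
expected = {"coarse_all": {"CP": "1.11", "HL": "0.27", "Overall": "0.95"}}
# Hypothetical reproduced values, invented here purely for illustration.
actual = {"coarse_all": {"CP": "1.12", "HL": "0.40", "Overall": "0.95"}}

print(find_mismatches(expected, actual))
```

With these illustrative inputs only `HL` is reported, since `CP` and `Overall` fall within the chosen tolerance.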

When testing with 16 frames:

torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-1B --verbose --nframe 16

The expected test results are:

{
    "coarse_all": {
        "CP": "1.21",
        "FP-S": "1.03",
        "FP-C": "0.85",
        "HL": "0.29",
        "LR": "0.73",
        "AR": "1.00",
        "RR": "1.26",
        "CSR": "0.70",
        "TR": "0.74",
        "Perception": "1.00",
        "Reasoning": "0.90",
        "Overall": "0.98"
    },
    "coarse_valid": {
        "CP": "1.21",
        "FP-S": "1.03",
        "FP-C": "0.85",
        "HL": "0.29",
        "LR": "0.73",
        "AR": "1.00",
        "RR": "1.26",
        "CSR": "0.70",
        "TR": "0.74",
        "Perception": "1.00",
        "Reasoning": "0.90",
        "Overall": "0.98"
    },
    "fine_all": {
        "Video Topic": "1.15",
        "Video Emotion": "1.37",
        "Video Scene": "0.96",
        "Video Style": "1.43",
        "OCR": "0.96",
        "Object Recognition": "1.08",
        "Attribute Recognition": "1.47",
        "Event Recognition": "0.86",
        "Human Motion": "0.77",
        "Counting": "0.94",
        "Spatial Relationship": "1.09",
        "Human-object Interaction": "0.85",
        "Human Interaction": "0.64",
        "Hallucination": "0.29",
        "Structuralized Image-Text Understanding": "0.96",
        "Mathematical Calculation": "0.38",
        "Physical Property": "0.76",
        "Function Reasoning": "0.89",
        "Identity Reasoning": "1.36",
        "Natural Relation": "1.00",
        "Physical Relation": "1.10",
        "Social Relation": "1.54",
        "Common Sense Reasoning": "0.70",
        "Counterfactual Reasoning": "0.88",
        "Causal Reasoning": "0.72",
        "Future Prediction": "0.74"
    },
    "fine_valid": {
        "Video Topic": "1.15",
        "Video Emotion": "1.37",
        "Video Scene": "0.96",
        "Video Style": "1.43",
        "OCR": "0.96",
        "Object Recognition": "1.08",
        "Attribute Recognition": "1.47",
        "Event Recognition": "0.86",
        "Human Motion": "0.77",
        "Counting": "0.94",
        "Spatial Relationship": "1.09",
        "Human-object Interaction": "0.85",
        "Human Interaction": "0.64",
        "Hallucination": "0.29",
        "Structuralized Image-Text Understanding": "0.96",
        "Mathematical Calculation": "0.38",
        "Physical Property": "0.76",
        "Function Reasoning": "0.89",
        "Identity Reasoning": "1.36",
        "Natural Relation": "1.00",
        "Physical Relation": "1.10",
        "Social Relation": "1.54",
        "Common Sense Reasoning": "0.70",
        "Counterfactual Reasoning": "0.88",
        "Causal Reasoning": "0.72",
        "Future Prediction": "0.74"
    }
}
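
To see how the frame count affects individual metrics, the 8-frame and 16-frame results for the same model can be subtracted key by key. The sketch below is illustrative only: `score_delta` is a hypothetical helper, and the inputs are a small subset of the InternVL2-1B `coarse_all` values listed above. `Decimal` is used so that subtracting two-decimal score strings gives exact results.

```python
from decimal import Decimal


def score_delta(before, after):
    """Per-metric change (after - before) for two flat metric -> score-string dicts."""
    return {
        key: Decimal(after[key]) - Decimal(before[key])
        for key in before
        if key in after
    }


# Subsets of "coarse_all" for InternVL2-1B, copied from the two result blocks above.
frames_8 = {"CP": "1.11", "HL": "0.27", "CSR": "0.77", "Overall": "0.95"}
frames_16 = {"CP": "1.21", "HL": "0.29", "CSR": "0.70", "Overall": "0.98"}

for key, delta in score_delta(frames_8, frames_16).items():
    # The "+" format spec prints an explicit sign for gains and losses.
    print(f"{key:8s} {delta:+}")
```

For this subset, `CP`, `HL`, and `Overall` improve with 16 frames while `CSR` drops.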

When testing with 8 frames:

torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-2B --verbose --nframe 8

The expected test results are:

{
    "coarse_all": {
        "CP": "1.16",
        "FP-S": "1.05",
        "FP-C": "0.81",
        "HL": "0.26",
        "LR": "0.50",
        "AR": "1.12",
        "RR": "1.11",
        "CSR": "0.81",
        "TR": "0.83",
        "Perception": "1.00",
        "Reasoning": "0.91",
        "Overall": "0.97"
    },
    "coarse_valid": {
        "CP": "1.16",
        "FP-S": "1.05",
        "FP-C": "0.81",
        "HL": "0.26",
        "LR": "0.50",
        "AR": "1.12",
        "RR": "1.11",
        "CSR": "0.81",
        "TR": "0.83",
        "Perception": "1.00",
        "Reasoning": "0.91",
        "Overall": "0.97"
    },
    "fine_all": {
        "Video Topic": "1.12",
        "Video Emotion": "1.29",
        "Video Scene": "0.99",
        "Video Style": "1.24",
        "OCR": "0.94",
        "Object Recognition": "1.04",
        "Attribute Recognition": "1.46",
        "Event Recognition": "1.02",
        "Human Motion": "0.66",
        "Counting": "1.16",
        "Spatial Relationship": "0.93",
        "Human-object Interaction": "0.77",
        "Human Interaction": "0.77",
        "Hallucination": "0.26",
        "Structuralized Image-Text Understanding": "0.69",
        "Mathematical Calculation": "0.22",
        "Physical Property": "0.94",
        "Function Reasoning": "1.09",
        "Identity Reasoning": "1.32",
        "Natural Relation": "0.93",
        "Physical Relation": "0.98",
        "Social Relation": "1.33",
        "Common Sense Reasoning": "0.81",
        "Counterfactual Reasoning": "1.00",
        "Causal Reasoning": "0.76",
        "Future Prediction": "0.87"
    },
    "fine_valid": {
        "Video Topic": "1.12",
        "Video Emotion": "1.29",
        "Video Scene": "0.99",
        "Video Style": "1.24",
        "OCR": "0.94",
        "Object Recognition": "1.04",
        "Attribute Recognition": "1.46",
        "Event Recognition": "1.02",
        "Human Motion": "0.66",
        "Counting": "1.16",
        "Spatial Relationship": "0.93",
        "Human-object Interaction": "0.77",
        "Human Interaction": "0.77",
        "Hallucination": "0.26",
        "Structuralized Image-Text Understanding": "0.69",
        "Mathematical Calculation": "0.22",
        "Physical Property": "0.94",
        "Function Reasoning": "1.09",
        "Identity Reasoning": "1.32",
        "Natural Relation": "0.93",
        "Physical Relation": "0.98",
        "Social Relation": "1.33",
        "Common Sense Reasoning": "0.81",
        "Counterfactual Reasoning": "1.00",
        "Causal Reasoning": "0.76",
        "Future Prediction": "0.87"
    }
}

When testing with 16 frames:

torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-2B --verbose --nframe 16

The expected test results are:

{
    "coarse_all": {
        "CP": "1.22",
        "FP-S": "1.13",
        "FP-C": "0.80",
        "HL": "0.34",
        "LR": "0.64",
        "AR": "1.01",
        "RR": "1.23",
        "CSR": "0.88",
        "TR": "0.87",
        "Perception": "1.06",
        "Reasoning": "0.95",
        "Overall": "1.03"
    },
    "coarse_valid": {
        "CP": "1.22",
        "FP-S": "1.13",
        "FP-C": "0.80",
        "HL": "0.34",
        "LR": "0.64",
        "AR": "1.01",
        "RR": "1.23",
        "CSR": "0.88",
        "TR": "0.87",
        "Perception": "1.06",
        "Reasoning": "0.95",
        "Overall": "1.03"
    },
    "fine_all": {
        "Video Topic": "1.14",
        "Video Emotion": "1.29",
        "Video Scene": "1.17",
        "Video Style": "1.21",
        "OCR": "1.02",
        "Object Recognition": "1.13",
        "Attribute Recognition": "1.59",
        "Event Recognition": "0.99",
        "Human Motion": "0.72",
        "Counting": "1.24",
        "Spatial Relationship": "1.02",
        "Human-object Interaction": "0.67",
        "Human Interaction": "0.85",
        "Hallucination": "0.34",
        "Structuralized Image-Text Understanding": "0.79",
        "Mathematical Calculation": "0.40",
        "Physical Property": "0.85",
        "Function Reasoning": "1.07",
        "Identity Reasoning": "1.11",
        "Natural Relation": "1.15",
        "Physical Relation": "1.00",
        "Social Relation": "1.48",
        "Common Sense Reasoning": "0.88",
        "Counterfactual Reasoning": "1.10",
        "Causal Reasoning": "0.82",
        "Future Prediction": "0.81"
    },
    "fine_valid": {
        "Video Topic": "1.14",
        "Video Emotion": "1.29",
        "Video Scene": "1.17",
        "Video Style": "1.21",
        "OCR": "1.02",
        "Object Recognition": "1.13",
        "Attribute Recognition": "1.59",
        "Event Recognition": "0.99",
        "Human Motion": "0.72",
        "Counting": "1.24",
        "Spatial Relationship": "1.02",
        "Human-object Interaction": "0.67",
        "Human Interaction": "0.85",
        "Hallucination": "0.34",
        "Structuralized Image-Text Understanding": "0.79",
        "Mathematical Calculation": "0.40",
        "Physical Property": "0.85",
        "Function Reasoning": "1.07",
        "Identity Reasoning": "1.11",
        "Natural Relation": "1.15",
        "Physical Relation": "1.00",
        "Social Relation": "1.48",
        "Common Sense Reasoning": "0.88",
        "Counterfactual Reasoning": "1.10",
        "Causal Reasoning": "0.82",
        "Future Prediction": "0.81"
    }
}
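
The commands in this section differ only in `--model` and `--nframe` (and in `--nproc-per-node`, which is 1 for InternVL2-76B and 8 for the other models). A small script can generate them for a sweep. This is a sketch under that assumption: `build_command` is a hypothetical helper, it only builds the command strings and does not launch anything, and the flags are exactly those shown in the commands on this page.

```python
import shlex


def build_command(model, nframe, nproc=8):
    """Build the torchrun command line used on this page for one model and frame count."""
    args = [
        "torchrun", f"--nproc-per-node={nproc}", "run.py",
        "--data", "MMBench-Video",
        "--model", model,
        "--verbose",
        "--nframe", str(nframe),
    ]
    # shlex.join quotes arguments only where the shell requires it.
    return shlex.join(args)


# Model names as they appear in the commands on this page.
models = ["InternVL2-1B", "InternVL2-2B"]
for model in models:
    for nframe in (8, 16):
        print(build_command(model, nframe))
```

Each printed line can be pasted into a shell or passed to a job scheduler; for InternVL2-76B, call `build_command("InternVL2-76B", 8, nproc=1)`.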

When testing with 8 frames:

torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-4B --verbose --nframe 8

The expected test results are:

{
    "coarse_all": {
        "CP": "1.34",
        "FP-S": "1.16",
        "FP-C": "0.97",
        "HL": "0.13",
        "LR": "0.58",
        "AR": "1.16",
        "RR": "1.26",
        "CSR": "1.02",
        "TR": "0.99",
        "Perception": "1.13",
        "Reasoning": "1.03",
        "Overall": "1.10"
    },
    "coarse_valid": {
        "CP": "1.34",
        "FP-S": "1.16",
        "FP-C": "0.97",
        "HL": "0.13",
        "LR": "0.58",
        "AR": "1.16",
        "RR": "1.26",
        "CSR": "1.02",
        "TR": "0.99",
        "Perception": "1.13",
        "Reasoning": "1.03",
        "Overall": "1.10"
    },
    "fine_all": {
        "Video Topic": "1.30",
        "Video Emotion": "1.43",
        "Video Scene": "1.18",
        "Video Style": "1.62",
        "OCR": "0.98",
        "Object Recognition": "1.24",
        "Attribute Recognition": "1.53",
        "Event Recognition": "1.11",
        "Human Motion": "0.95",
        "Counting": "1.31",
        "Spatial Relationship": "1.07",
        "Human-object Interaction": "0.95",
        "Human Interaction": "0.95",
        "Hallucination": "0.13",
        "Structuralized Image-Text Understanding": "0.75",
        "Mathematical Calculation": "0.33",
        "Physical Property": "1.11",
        "Function Reasoning": "1.07",
        "Identity Reasoning": "1.30",
        "Natural Relation": "0.96",
        "Physical Relation": "1.25",
        "Social Relation": "1.41",
        "Common Sense Reasoning": "1.02",
        "Counterfactual Reasoning": "0.97",
        "Causal Reasoning": "0.98",
        "Future Prediction": "1.02"
    },
    "fine_valid": {
        "Video Topic": "1.30",
        "Video Emotion": "1.43",
        "Video Scene": "1.18",
        "Video Style": "1.62",
        "OCR": "0.98",
        "Object Recognition": "1.24",
        "Attribute Recognition": "1.53",
        "Event Recognition": "1.11",
        "Human Motion": "0.95",
        "Counting": "1.31",
        "Spatial Relationship": "1.07",
        "Human-object Interaction": "0.95",
        "Human Interaction": "0.95",
        "Hallucination": "0.13",
        "Structuralized Image-Text Understanding": "0.75",
        "Mathematical Calculation": "0.33",
        "Physical Property": "1.11",
        "Function Reasoning": "1.07",
        "Identity Reasoning": "1.30",
        "Natural Relation": "0.96",
        "Physical Relation": "1.25",
        "Social Relation": "1.41",
        "Common Sense Reasoning": "1.02",
        "Counterfactual Reasoning": "0.97",
        "Causal Reasoning": "0.98",
        "Future Prediction": "1.02"
    }
}

When testing with 16 frames:

torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-4B --verbose --nframe 16

The expected test results are:

{
    "coarse_all": {
        "CP": "1.38",
        "FP-S": "1.27",
        "FP-C": "1.03",
        "HL": "0.15",
        "LR": "0.73",
        "AR": "1.24",
        "RR": "1.29",
        "CSR": "1.17",
        "TR": "0.99",
        "Perception": "1.22",
        "Reasoning": "1.09",
        "Overall": "1.18"
    },
    "coarse_valid": {
        "CP": "1.38",
        "FP-S": "1.27",
        "FP-C": "1.03",
        "HL": "0.15",
        "LR": "0.73",
        "AR": "1.24",
        "RR": "1.29",
        "CSR": "1.17",
        "TR": "0.99",
        "Perception": "1.22",
        "Reasoning": "1.09",
        "Overall": "1.18"
    },
    "fine_all": {
        "Video Topic": "1.31",
        "Video Emotion": "1.47",
        "Video Scene": "1.22",
        "Video Style": "1.74",
        "OCR": "1.19",
        "Object Recognition": "1.29",
        "Attribute Recognition": "1.62",
        "Event Recognition": "1.13",
        "Human Motion": "1.02",
        "Counting": "1.25",
        "Spatial Relationship": "1.16",
        "Human-object Interaction": "0.99",
        "Human Interaction": "1.00",
        "Hallucination": "0.15",
        "Structuralized Image-Text Understanding": "0.87",
        "Mathematical Calculation": "0.51",
        "Physical Property": "1.17",
        "Function Reasoning": "1.05",
        "Identity Reasoning": "1.49",
        "Natural Relation": "1.00",
        "Physical Relation": "1.25",
        "Social Relation": "1.46",
        "Common Sense Reasoning": "1.17",
        "Counterfactual Reasoning": "1.05",
        "Causal Reasoning": "0.96",
        "Future Prediction": "1.04"
    },
    "fine_valid": {
        "Video Topic": "1.31",
        "Video Emotion": "1.47",
        "Video Scene": "1.22",
        "Video Style": "1.74",
        "OCR": "1.19",
        "Object Recognition": "1.29",
        "Attribute Recognition": "1.62",
        "Event Recognition": "1.13",
        "Human Motion": "1.02",
        "Counting": "1.25",
        "Spatial Relationship": "1.16",
        "Human-object Interaction": "0.99",
        "Human Interaction": "1.00",
        "Hallucination": "0.15",
        "Structuralized Image-Text Understanding": "0.87",
        "Mathematical Calculation": "0.51",
        "Physical Property": "1.17",
        "Function Reasoning": "1.05",
        "Identity Reasoning": "1.49",
        "Natural Relation": "1.00",
        "Physical Relation": "1.25",
        "Social Relation": "1.46",
        "Common Sense Reasoning": "1.17",
        "Counterfactual Reasoning": "1.05",
        "Causal Reasoning": "0.96",
        "Future Prediction": "1.04"
    }
}

When testing with 8 frames:

torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-8B --verbose --nframe 8

The expected test results are:

{
    "coarse_all": {
        "CP": "1.36",
        "FP-S": "1.26",
        "FP-C": "1.07",
        "HL": "0.32",
        "LR": "0.83",
        "AR": "1.19",
        "RR": "1.33",
        "CSR": "1.14",
        "TR": "1.02",
        "Perception": "1.22",
        "Reasoning": "1.12",
        "Overall": "1.19"
    },
    "coarse_valid": {
        "CP": "1.36",
        "FP-S": "1.26",
        "FP-C": "1.07",
        "HL": "0.32",
        "LR": "0.83",
        "AR": "1.19",
        "RR": "1.33",
        "CSR": "1.14",
        "TR": "1.02",
        "Perception": "1.22",
        "Reasoning": "1.12",
        "Overall": "1.19"
    },
    "fine_all": {
        "Video Topic": "1.23",
        "Video Emotion": "1.49",
        "Video Scene": "1.22",
        "Video Style": "1.67",
        "OCR": "1.14",
        "Object Recognition": "1.35",
        "Attribute Recognition": "1.66",
        "Event Recognition": "1.18",
        "Human Motion": "0.90",
        "Counting": "1.31",
        "Spatial Relationship": "1.24",
        "Human-object Interaction": "1.05",
        "Human Interaction": "1.02",
        "Hallucination": "0.32",
        "Structuralized Image-Text Understanding": "1.03",
        "Mathematical Calculation": "0.53",
        "Physical Property": "1.24",
        "Function Reasoning": "1.05",
        "Identity Reasoning": "1.26",
        "Natural Relation": "1.00",
        "Physical Relation": "1.27",
        "Social Relation": "1.56",
        "Common Sense Reasoning": "1.14",
        "Counterfactual Reasoning": "0.95",
        "Causal Reasoning": "1.07",
        "Future Prediction": "0.98"
    },
    "fine_valid": {
        "Video Topic": "1.23",
        "Video Emotion": "1.49",
        "Video Scene": "1.22",
        "Video Style": "1.67",
        "OCR": "1.14",
        "Object Recognition": "1.35",
        "Attribute Recognition": "1.66",
        "Event Recognition": "1.18",
        "Human Motion": "0.90",
        "Counting": "1.31",
        "Spatial Relationship": "1.24",
        "Human-object Interaction": "1.05",
        "Human Interaction": "1.02",
        "Hallucination": "0.32",
        "Structuralized Image-Text Understanding": "1.03",
        "Mathematical Calculation": "0.53",
        "Physical Property": "1.24",
        "Function Reasoning": "1.05",
        "Identity Reasoning": "1.26",
        "Natural Relation": "1.00",
        "Physical Relation": "1.27",
        "Social Relation": "1.56",
        "Common Sense Reasoning": "1.14",
        "Counterfactual Reasoning": "0.95",
        "Causal Reasoning": "1.07",
        "Future Prediction": "0.98"
    }
}

When testing with 16 frames:

torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-8B --verbose --nframe 16

The expected test results are:

{
    "coarse_all": {
        "CP": "1.45",
        "FP-S": "1.40",
        "FP-C": "1.13",
        "HL": "0.18",
        "LR": "0.90",
        "AR": "1.32",
        "RR": "1.45",
        "CSR": "1.19",
        "TR": "1.04",
        "Perception": "1.32",
        "Reasoning": "1.18",
        "Overall": "1.28"
    },
    "coarse_valid": {
        "CP": "1.45",
        "FP-S": "1.40",
        "FP-C": "1.13",
        "HL": "0.18",
        "LR": "0.90",
        "AR": "1.32",
        "RR": "1.45",
        "CSR": "1.19",
        "TR": "1.04",
        "Perception": "1.32",
        "Reasoning": "1.18",
        "Overall": "1.28"
    },
    "fine_all": {
        "Video Topic": "1.38",
        "Video Emotion": "1.57",
        "Video Scene": "1.27",
        "Video Style": "1.69",
        "OCR": "1.32",
        "Object Recognition": "1.40",
        "Attribute Recognition": "1.80",
        "Event Recognition": "1.18",
        "Human Motion": "1.15",
        "Counting": "1.44",
        "Spatial Relationship": "1.22",
        "Human-object Interaction": "1.15",
        "Human Interaction": "1.03",
        "Hallucination": "0.18",
        "Structuralized Image-Text Understanding": "1.13",
        "Mathematical Calculation": "0.56",
        "Physical Property": "1.20",
        "Function Reasoning": "1.05",
        "Identity Reasoning": "1.72",
        "Natural Relation": "0.93",
        "Physical Relation": "1.45",
        "Social Relation": "1.70",
        "Common Sense Reasoning": "1.19",
        "Counterfactual Reasoning": "1.07",
        "Causal Reasoning": "1.04",
        "Future Prediction": "1.06"
    },
    "fine_valid": {
        "Video Topic": "1.38",
        "Video Emotion": "1.57",
        "Video Scene": "1.27",
        "Video Style": "1.69",
        "OCR": "1.32",
        "Object Recognition": "1.40",
        "Attribute Recognition": "1.80",
        "Event Recognition": "1.18",
        "Human Motion": "1.15",
        "Counting": "1.44",
        "Spatial Relationship": "1.22",
        "Human-object Interaction": "1.15",
        "Human Interaction": "1.03",
        "Hallucination": "0.18",
        "Structuralized Image-Text Understanding": "1.13",
        "Mathematical Calculation": "0.56",
        "Physical Property": "1.20",
        "Function Reasoning": "1.05",
        "Identity Reasoning": "1.72",
        "Natural Relation": "0.93",
        "Physical Relation": "1.45",
        "Social Relation": "1.70",
        "Common Sense Reasoning": "1.19",
        "Counterfactual Reasoning": "1.07",
        "Causal Reasoning": "1.04",
        "Future Prediction": "1.06"
    }
}

When testing with 8 frames:

torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-26B --verbose --nframe 8

The expected test results are:

{
    "coarse_all": {
        "CP": "1.47",
        "FP-S": "1.32",
        "FP-C": "1.07",
        "HL": "0.35",
        "LR": "1.04",
        "AR": "1.42",
        "RR": "1.43",
        "CSR": "1.16",
        "TR": "1.04",
        "Perception": "1.28",
        "Reasoning": "1.22",
        "Overall": "1.27"
    },
    "coarse_valid": {
        "CP": "1.47",
        "FP-S": "1.32",
        "FP-C": "1.07",
        "HL": "0.35",
        "LR": "1.04",
        "AR": "1.42",
        "RR": "1.43",
        "CSR": "1.16",
        "TR": "1.04",
        "Perception": "1.28",
        "Reasoning": "1.22",
        "Overall": "1.27"
    },
    "fine_all": {
        "Video Topic": "1.35",
        "Video Emotion": "1.47",
        "Video Scene": "1.51",
        "Video Style": "1.69",
        "OCR": "1.21",
        "Object Recognition": "1.37",
        "Attribute Recognition": "1.82",
        "Event Recognition": "1.16",
        "Human Motion": "0.97",
        "Counting": "1.43",
        "Spatial Relationship": "1.20",
        "Human-object Interaction": "1.05",
        "Human Interaction": "1.02",
        "Hallucination": "0.35",
        "Structuralized Image-Text Understanding": "1.22",
        "Mathematical Calculation": "0.76",
        "Physical Property": "1.43",
        "Function Reasoning": "1.29",
        "Identity Reasoning": "1.55",
        "Natural Relation": "1.33",
        "Physical Relation": "1.12",
        "Social Relation": "1.78",
        "Common Sense Reasoning": "1.16",
        "Counterfactual Reasoning": "1.05",
        "Causal Reasoning": "1.05",
        "Future Prediction": "1.06"
    },
    "fine_valid": {
        "Video Topic": "1.35",
        "Video Emotion": "1.47",
        "Video Scene": "1.51",
        "Video Style": "1.69",
        "OCR": "1.21",
        "Object Recognition": "1.37",
        "Attribute Recognition": "1.82",
        "Event Recognition": "1.16",
        "Human Motion": "0.97",
        "Counting": "1.43",
        "Spatial Relationship": "1.20",
        "Human-object Interaction": "1.05",
        "Human Interaction": "1.02",
        "Hallucination": "0.35",
        "Structuralized Image-Text Understanding": "1.22",
        "Mathematical Calculation": "0.76",
        "Physical Property": "1.43",
        "Function Reasoning": "1.29",
        "Identity Reasoning": "1.55",
        "Natural Relation": "1.33",
        "Physical Relation": "1.12",
        "Social Relation": "1.78",
        "Common Sense Reasoning": "1.16",
        "Counterfactual Reasoning": "1.05",
        "Causal Reasoning": "1.06",
        "Future Prediction": "1.06"
    }
}

When testing with 16 frames:

torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-26B --verbose --nframe 16

The expected test results are:

{
    "coarse_all": {
        "CP": "1.56",
        "FP-S": "1.48",
        "FP-C": "1.23",
        "HL": "0.52",
        "LR": "1.06",
        "AR": "1.61",
        "RR": "1.45",
        "CSR": "1.38",
        "TR": "1.23",
        "Perception": "1.42",
        "Reasoning": "1.35",
        "Overall": "1.41"
    },
    "coarse_valid": {
        "CP": "1.56",
        "FP-S": "1.48",
        "FP-C": "1.23",
        "HL": "0.52",
        "LR": "1.06",
        "AR": "1.61",
        "RR": "1.47",
        "CSR": "1.38",
        "TR": "1.23",
        "Perception": "1.42",
        "Reasoning": "1.35",
        "Overall": "1.41"
    },
    "fine_all": {
        "Video Topic": "1.52",
        "Video Emotion": "1.48",
        "Video Scene": "1.59",
        "Video Style": "1.76",
        "OCR": "1.37",
        "Object Recognition": "1.55",
        "Attribute Recognition": "1.91",
        "Event Recognition": "1.30",
        "Human Motion": "1.15",
        "Counting": "1.46",
        "Spatial Relationship": "1.18",
        "Human-object Interaction": "1.35",
        "Human Interaction": "1.08",
        "Hallucination": "0.52",
        "Structuralized Image-Text Understanding": "1.25",
        "Mathematical Calculation": "0.78",
        "Physical Property": "1.46",
        "Function Reasoning": "1.42",
        "Identity Reasoning": "1.96",
        "Natural Relation": "1.44",
        "Physical Relation": "1.06",
        "Social Relation": "1.83",
        "Common Sense Reasoning": "1.38",
        "Counterfactual Reasoning": "1.25",
        "Causal Reasoning": "1.23",
        "Future Prediction": "1.17"
    },
    "fine_valid": {
        "Video Topic": "1.52",
        "Video Emotion": "1.48",
        "Video Scene": "1.59",
        "Video Style": "1.76",
        "OCR": "1.38",
        "Object Recognition": "1.56",
        "Attribute Recognition": "1.91",
        "Event Recognition": "1.30",
        "Human Motion": "1.15",
        "Counting": "1.46",
        "Spatial Relationship": "1.18",
        "Human-object Interaction": "1.35",
        "Human Interaction": "1.08",
        "Hallucination": "0.52",
        "Structuralized Image-Text Understanding": "1.25",
        "Mathematical Calculation": "0.78",
        "Physical Property": "1.46",
        "Function Reasoning": "1.42",
        "Identity Reasoning": "1.96",
        "Natural Relation": "1.50",
        "Physical Relation": "1.06",
        "Social Relation": "1.83",
        "Common Sense Reasoning": "1.38",
        "Counterfactual Reasoning": "1.25",
        "Causal Reasoning": "1.24",
        "Future Prediction": "1.17"
    }
}
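
In most result blocks in this section the `_all` and `_valid` sections are identical, but in the block above a few entries differ (for example, `RR` is 1.45 in `coarse_all` and 1.47 in `coarse_valid`). This page does not explain the difference; one plausible reading, stated here as an assumption, is that the `_valid` sections average only over samples that received a usable judge score. The sketch below uses a hypothetical helper, `diff_all_vs_valid`, to list the metrics where the two sections disagree.

```python
def diff_all_vs_valid(results):
    """List (level, metric, all_score, valid_score) where *_all and *_valid disagree."""
    differences = []
    for level in ("coarse", "fine"):
        all_scores = results.get(f"{level}_all", {})
        valid_scores = results.get(f"{level}_valid", {})
        for metric, all_score in all_scores.items():
            valid_score = valid_scores.get(metric)
            if valid_score != all_score:
                differences.append((level, metric, all_score, valid_score))
    return differences


# Subset of the InternVL2-26B, 16-frame block above.
results = {
    "coarse_all": {"RR": "1.45", "Overall": "1.41"},
    "coarse_valid": {"RR": "1.47", "Overall": "1.41"},
    "fine_all": {"OCR": "1.37", "Natural Relation": "1.44", "Hallucination": "0.52"},
    "fine_valid": {"OCR": "1.38", "Natural Relation": "1.50", "Hallucination": "0.52"},
}

for row in diff_all_vs_valid(results):
    print(row)
```

Running the same check on a full result file shows quickly whether a reproduced run has such gaps and which dimensions they affect.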

When testing with 8 frames:

torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-40B --verbose --nframe 8

The expected test results are:

{
    "coarse_all": {
        "CP": "1.53",
        "FP-S": "1.39",
        "FP-C": "1.12",
        "HL": "0.32",
        "LR": "0.88",
        "AR": "1.45",
        "RR": "1.52",
        "CSR": "1.15",
        "TR": "1.13",
        "Perception": "1.34",
        "Reasoning": "1.25",
        "Overall": "1.32"
    },
    "coarse_valid": {
        "CP": "1.53",
        "FP-S": "1.39",
        "FP-C": "1.12",
        "HL": "0.32",
        "LR": "0.88",
        "AR": "1.45",
        "RR": "1.52",
        "CSR": "1.15",
        "TR": "1.13",
        "Perception": "1.34",
        "Reasoning": "1.25",
        "Overall": "1.32"
    },
    "fine_all": {
        "Video Topic": "1.57",
        "Video Emotion": "1.65",
        "Video Scene": "1.24",
        "Video Style": "1.81",
        "OCR": "1.29",
        "Object Recognition": "1.40",
        "Attribute Recognition": "1.80",
        "Event Recognition": "1.21",
        "Human Motion": "1.36",
        "Counting": "1.45",
        "Spatial Relationship": "1.22",
        "Human-object Interaction": "1.14",
        "Human Interaction": "1.02",
        "Hallucination": "0.32",
        "Structuralized Image-Text Understanding": "1.04",
        "Mathematical Calculation": "0.62",
        "Physical Property": "1.30",
        "Function Reasoning": "1.33",
        "Identity Reasoning": "1.74",
        "Natural Relation": "1.30",
        "Physical Relation": "1.35",
        "Social Relation": "1.78",
        "Common Sense Reasoning": "1.15",
        "Counterfactual Reasoning": "1.18",
        "Causal Reasoning": "1.14",
        "Future Prediction": "1.13"
    },
    "fine_valid": {
        "Video Topic": "1.57",
        "Video Emotion": "1.65",
        "Video Scene": "1.24",
        "Video Style": "1.81",
        "OCR": "1.29",
        "Object Recognition": "1.40",
        "Attribute Recognition": "1.80",
        "Event Recognition": "1.21",
        "Human Motion": "1.36",
        "Counting": "1.45",
        "Spatial Relationship": "1.22",
        "Human-object Interaction": "1.14",
        "Human Interaction": "1.02",
        "Hallucination": "0.32",
        "Structuralized Image-Text Understanding": "1.04",
        "Mathematical Calculation": "0.62",
        "Physical Property": "1.30",
        "Function Reasoning": "1.33",
        "Identity Reasoning": "1.74",
        "Natural Relation": "1.30",
        "Physical Relation": "1.35",
        "Social Relation": "1.78",
        "Common Sense Reasoning": "1.15",
        "Counterfactual Reasoning": "1.18",
        "Causal Reasoning": "1.14",
        "Future Prediction": "1.13"
    }
}

When testing with 16 frames:

torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-40B --verbose --nframe 16

The expected test results are:

{
    "coarse_all": {
        "CP": "1.58",
        "FP-S": "1.56",
        "FP-C": "1.28",
        "HL": "0.39",
        "LR": "1.10",
        "AR": "1.61",
        "RR": "1.53",
        "CSR": "1.25",
        "TR": "1.20",
        "Perception": "1.48",
        "Reasoning": "1.35",
        "Overall": "1.45"
    },
    "coarse_valid": {
        "CP": "1.58",
        "FP-S": "1.56",
        "FP-C": "1.28",
        "HL": "0.39",
        "LR": "1.10",
        "AR": "1.61",
        "RR": "1.53",
        "CSR": "1.25",
        "TR": "1.20",
        "Perception": "1.48",
        "Reasoning": "1.35",
        "Overall": "1.45"
    },
    "fine_all": {
        "Video Topic": "1.57",
        "Video Emotion": "1.67",
        "Video Scene": "1.39",
        "Video Style": "1.83",
        "OCR": "1.47",
        "Object Recognition": "1.64",
        "Attribute Recognition": "2.03",
        "Event Recognition": "1.32",
        "Human Motion": "1.26",
        "Counting": "1.49",
        "Spatial Relationship": "1.31",
        "Human-object Interaction": "1.30",
        "Human Interaction": "1.26",
        "Hallucination": "0.39",
        "Structuralized Image-Text Understanding": "1.26",
        "Mathematical Calculation": "0.84",
        "Physical Property": "1.43",
        "Function Reasoning": "1.49",
        "Identity Reasoning": "1.92",
        "Natural Relation": "1.56",
        "Physical Relation": "1.27",
        "Social Relation": "1.76",
        "Common Sense Reasoning": "1.25",
        "Counterfactual Reasoning": "1.27",
        "Causal Reasoning": "1.19",
        "Future Prediction": "1.15"
    },
    "fine_valid": {
        "Video Topic": "1.57",
        "Video Emotion": "1.67",
        "Video Scene": "1.39",
        "Video Style": "1.83",
        "OCR": "1.47",
        "Object Recognition": "1.64",
        "Attribute Recognition": "2.03",
        "Event Recognition": "1.32",
        "Human Motion": "1.26",
        "Counting": "1.49",
        "Spatial Relationship": "1.31",
        "Human-object Interaction": "1.30",
        "Human Interaction": "1.26",
        "Hallucination": "0.39",
        "Structuralized Image-Text Understanding": "1.26",
        "Mathematical Calculation": "0.84",
        "Physical Property": "1.43",
        "Function Reasoning": "1.49",
        "Identity Reasoning": "1.92",
        "Natural Relation": "1.56",
        "Physical Relation": "1.27",
        "Social Relation": "1.76",
        "Common Sense Reasoning": "1.25",
        "Counterfactual Reasoning": "1.27",
        "Causal Reasoning": "1.19",
        "Future Prediction": "1.15"
    }
}

When testing with 8 frames:

torchrun --nproc-per-node=1 run.py --data MMBench-Video --model InternVL2-76B --verbose --nframe 8

The expected test results are:

{
    "coarse_all": {
        "CP": "1.59",
        "FP-S": "1.41",
        "FP-C": "1.25",
        "HL": "0.42",
        "LR": "0.98",
        "AR": "1.60",
        "RR": "1.41",
        "CSR": "1.44",
        "TR": "1.27",
        "Perception": "1.38",
        "Reasoning": "1.35",
        "Overall": "1.37"
    },
    "coarse_valid": {
        "CP": "1.59",
        "FP-S": "1.41",
        "FP-C": "1.25",
        "HL": "0.42",
        "LR": "0.98",
        "AR": "1.60",
        "RR": "1.41",
        "CSR": "1.44",
        "TR": "1.27",
        "Perception": "1.38",
        "Reasoning": "1.35",
        "Overall": "1.37"
    },
    "fine_all": {
        "Video Topic": "1.51",
        "Video Emotion": "1.66",
        "Video Scene": "1.46",
        "Video Style": "1.90",
        "OCR": "1.32",
        "Object Recognition": "1.45",
        "Attribute Recognition": "1.78",
        "Event Recognition": "1.30",
        "Human Motion": "1.07",
        "Counting": "1.49",
        "Spatial Relationship": "1.36",
        "Human-object Interaction": "1.27",
        "Human Interaction": "1.21",
        "Hallucination": "0.42",
        "Structuralized Image-Text Understanding": "1.21",
        "Mathematical Calculation": "0.64",
        "Physical Property": "1.57",
        "Function Reasoning": "1.51",
        "Identity Reasoning": "1.72",
        "Natural Relation": "1.33",
        "Physical Relation": "1.33",
        "Social Relation": "1.52",
        "Common Sense Reasoning": "1.44",
        "Counterfactual Reasoning": "1.27",
        "Causal Reasoning": "1.33",
        "Future Prediction": "1.17"
    },
    "fine_valid": {
        "Video Topic": "1.51",
        "Video Emotion": "1.66",
        "Video Scene": "1.46",
        "Video Style": "1.90",
        "OCR": "1.32",
        "Object Recognition": "1.45",
        "Attribute Recognition": "1.78",
        "Event Recognition": "1.30",
        "Human Motion": "1.07",
        "Counting": "1.49",
        "Spatial Relationship": "1.36",
        "Human-object Interaction": "1.27",
        "Human Interaction": "1.21",
        "Hallucination": "0.42",
        "Structuralized Image-Text Understanding": "1.21",
        "Mathematical Calculation": "0.64",
        "Physical Property": "1.57",
        "Function Reasoning": "1.51",
        "Identity Reasoning": "1.72",
        "Natural Relation": "1.33",
        "Physical Relation": "1.33",
        "Social Relation": "1.52",
        "Common Sense Reasoning": "1.44",
        "Counterfactual Reasoning": "1.27",
        "Causal Reasoning": "1.33",
        "Future Prediction": "1.17"
    }
}

When testing with 16 frames:

torchrun --nproc-per-node=1 run.py --data MMBench-Video --model InternVL2-76B --verbose --nframe 16

The expected test results are:

{
    "coarse_all": {
        "CP": "1.69",
        "FP-S": "1.60",
        "FP-C": "1.34",
        "HL": "0.44",
        "LR": "1.19",
        "AR": "1.77",
        "RR": "1.48",
        "CSR": "1.51",
        "TR": "1.36",
        "Perception": "1.54",
        "Reasoning": "1.46",
        "Overall": "1.52"
    },
    "coarse_valid": {
        "CP": "1.69",
        "FP-S": "1.60",
        "FP-C": "1.34",
        "HL": "0.44",
        "LR": "1.19",
        "AR": "1.77",
        "RR": "1.48",
        "CSR": "1.51",
        "TR": "1.36",
        "Perception": "1.54",
        "Reasoning": "1.46",
        "Overall": "1.52"
    },
    "fine_all": {
        "Video Topic": "1.64",
        "Video Emotion": "1.73",
        "Video Scene": "1.60",
        "Video Style": "1.93",
        "OCR": "1.48",
        "Object Recognition": "1.65",
        "Attribute Recognition": "2.06",
        "Event Recognition": "1.42",
        "Human Motion": "1.39",
        "Counting": "1.69",
        "Spatial Relationship": "1.36",
        "Human-object Interaction": "1.44",
        "Human Interaction": "1.20",
        "Hallucination": "0.44",
        "Structuralized Image-Text Understanding": "1.40",
        "Mathematical Calculation": "0.89",
        "Physical Property": "1.65",
        "Function Reasoning": "1.49",
        "Identity Reasoning": "2.17",
        "Natural Relation": "1.30",
        "Physical Relation": "1.47",
        "Social Relation": "1.59",
        "Common Sense Reasoning": "1.51",
        "Counterfactual Reasoning": "1.43",
        "Causal Reasoning": "1.36",
        "Future Prediction": "1.34"
    },
    "fine_valid": {
        "Video Topic": "1.64",
        "Video Emotion": "1.73",
        "Video Scene": "1.60",
        "Video Style": "1.93",
        "OCR": "1.48",
        "Object Recognition": "1.65",
        "Attribute Recognition": "2.06",
        "Event Recognition": "1.42",
        "Human Motion": "1.39",
        "Counting": "1.69",
        "Spatial Relationship": "1.36",
        "Human-object Interaction": "1.44",
        "Human Interaction": "1.20",
        "Hallucination": "0.44",
        "Structuralized Image-Text Understanding": "1.40",
        "Mathematical Calculation": "0.89",
        "Physical Property": "1.65",
        "Function Reasoning": "1.49",
        "Identity Reasoning": "2.17",
        "Natural Relation": "1.30",
        "Physical Relation": "1.47",
        "Social Relation": "1.59",
        "Common Sense Reasoning": "1.51",
        "Counterfactual Reasoning": "1.43",
        "Causal Reasoning": "1.36",
        "Future Prediction": "1.34"
    }
}
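
The scores in these result files are stored as JSON strings rather than numbers, so they need converting before any arithmetic. As a minimal sketch (the helper name `score_deltas` is hypothetical and not part of VLMEvalKit), the following compares three of the InternVL2-76B `coarse_all` scores listed above at 8 and 16 frames:

```python
import json

# A few "coarse_all" scores for InternVL2-76B, copied from the results above.
# The values are kept as strings, exactly as they appear in the result files.
RESULTS_8_FRAMES = json.loads(
    '{"Perception": "1.38", "Reasoning": "1.35", "Overall": "1.37"}'
)
RESULTS_16_FRAMES = json.loads(
    '{"Perception": "1.54", "Reasoning": "1.46", "Overall": "1.52"}'
)


def score_deltas(baseline, candidate):
    """Return candidate - baseline per shared key, converting strings to float."""
    return {
        key: round(float(candidate[key]) - float(baseline[key]), 2)
        for key in baseline
        if key in candidate
    }


if __name__ == "__main__":
    for name, delta in score_deltas(RESULTS_8_FRAMES, RESULTS_16_FRAMES).items():
        print(f"{name}: {delta:+.2f}")
```

Rounding to two decimals matches the precision of the reported scores and hides floating-point noise in the subtraction.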

MathVision#

The MathVision (MATH-V) dataset is a benchmark designed to evaluate the mathematical reasoning capabilities of multimodal large models. It includes 3,040 high-quality mathematical problems with visual contexts, sourced from real math competitions. The problems span 16 mathematical disciplines, including algebra, geometry, topology, and graph theory, and are graded across five levels of difficulty, so the benchmark assesses both the visual perception and the reasoning abilities of models. The commands below use the MathVision_MINI split; the result tables report 304 problems, 19 per discipline.

torchrun --nproc-per-node=8 run.py --model InternVL2-1B --data MathVision_MINI

The expected test results are:

--  ------------------------  ---  ---  --  --------  --------
Overall                   304  100  37  32.8947   12.1711
algebra                    19    5   1  26.3158    5.26316
analytic geometry          19    5   3  26.3158   15.7895
arithmetic                 19    4   2  21.0526   10.5263
combinatorial geometry     19    7   2  36.8421   10.5263
combinatorics              19    1   3   5.26316  15.7895
counting                   19    1   2   5.26316  10.5263
descriptive geometry       19   10   4  52.6316   21.0526
graph theory               19    7   2  36.8421   10.5263
logic                      19    6   3  31.5789   15.7895
metric geometry - angle    19   10   4  52.6316   21.0526
metric geometry - area     19    8   1  42.1053    5.26316
metric geometry - length   19    8   3  42.1053   15.7895
solid geometry             19    6   0  31.5789    0
statistics                 19    6   2  31.5789   10.5263
topology                   19    8   2  42.1053   10.5263
transformation geometry    19    8   3  42.1053   15.7895
--  ------------------------  ---  ---  --  --------  --------
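
The table is printed without a header row, and the column meanings are not stated here. The numbers themselves show that the last two columns are the second and third integer columns expressed as a percentage of the first. The sketch below checks this for the `Overall` row of InternVL2-1B; the variable names are placeholders chosen for illustration, not names taken from VLMEvalKit.

```python
def as_percentage(count, total):
    """Express count as a percentage of total."""
    return count / total * 100


# "Overall" row for InternVL2-1B from the table above: 304, 100, 37.
total, second_col, third_col = 304, 100, 37

if __name__ == "__main__":
    # These reproduce the last two columns of the row (32.8947 and 12.1711).
    print(round(as_percentage(second_col, total), 4))
    print(round(as_percentage(third_col, total), 4))
```

The same relation holds for the per-discipline rows, where the first column is always 19.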

torchrun --nproc-per-node=8 run.py --model InternVL2-2B --data MathVision_MINI

The expected test results are:

--  ------------------------  ---  ---  --  --------  --------
Overall                   304  100  48  32.8947   15.7895
algebra                    19    6   1  31.5789    5.26316
analytic geometry          19    7   2  36.8421   10.5263
arithmetic                 19    4   1  21.0526    5.26316
combinatorial geometry     19    5   5  26.3158   26.3158
combinatorics              19    1   1   5.26316   5.26316
counting                   19    0   2   0        10.5263
descriptive geometry       19    8   4  42.1053   21.0526
graph theory               19    3   4  15.7895   21.0526
logic                      19    9   5  47.3684   26.3158
metric geometry - angle    19   11   4  57.8947   21.0526
metric geometry - area     19    8   3  42.1053   15.7895
metric geometry - length   19   10   4  52.6316   21.0526
solid geometry             19    6   1  31.5789    5.26316
statistics                 19    7   5  36.8421   26.3158
topology                   19    5   1  26.3158    5.26316
transformation geometry    19   10   5  52.6316   26.3158
--  ------------------------  ---  ---  --  --------  --------

torchrun --nproc-per-node=8 run.py --model InternVL2-4B --data MathVision_MINI

The expected test results are:

--  ------------------------  ---  --  --  --------  --------
Overall                   304  89  54  29.2763   17.7632
algebra                    19   4   4  21.0526   21.0526
analytic geometry          19   7   4  36.8421   21.0526
arithmetic                 19   1   4   5.26316  21.0526
combinatorial geometry     19   6   2  31.5789   10.5263
combinatorics              19   1   2   5.26316  10.5263
counting                   19   0   5   0        26.3158
descriptive geometry       19   8   5  42.1053   26.3158
graph theory               19   6   2  31.5789   10.5263
logic                      19   8   2  42.1053   10.5263
metric geometry - angle    19  10   6  52.6316   31.5789
metric geometry - area     19   7   5  36.8421   26.3158
metric geometry - length   19  11   2  57.8947   10.5263
solid geometry             19   7   2  36.8421   10.5263
statistics                 19   4   5  21.0526   26.3158
topology                   19   6   1  31.5789    5.26316
transformation geometry    19   3   3  15.7895   15.7895
--  ------------------------  ---  --  --  --------  --------

torchrun --nproc-per-node=8 run.py --model InternVL2-8B --data MathVision_MINI

The expected test results are:

--  ------------------------  ---  ---  --  --------  -------
Overall                   304  104  62  34.2105   20.3947
algebra                    19    4   4  21.0526   21.0526
analytic geometry          19    4   3  21.0526   15.7895
arithmetic                 19    2   4  10.5263   21.0526
combinatorial geometry     19    9   6  47.3684   31.5789
combinatorics              19    1   3   5.26316  15.7895
counting                   19    2   4  10.5263   21.0526
descriptive geometry       19   11   4  57.8947   21.0526
graph theory               19    6   2  31.5789   10.5263
logic                      19   10   2  52.6316   10.5263
metric geometry - angle    19    7   4  36.8421   21.0526
metric geometry - area     19    7   7  36.8421   36.8421
metric geometry - length   19    7   2  36.8421   10.5263
solid geometry             19    8   4  42.1053   21.0526
statistics                 19    6   4  31.5789   21.0526
topology                   19   11   5  57.8947   26.3158
transformation geometry    19    9   4  47.3684   21.0526
--  ------------------------  ---  ---  --  --------  -------

torchrun --nproc-per-node=8 run.py --model InternVL2-26B --data MathVision_MINI

The expected test results are:

--  ------------------------  ---  ---  --  --------  --------
Overall                   304  105  71  34.5395   23.3553
algebra                    19    6   3  31.5789   15.7895
analytic geometry          19    6   7  31.5789   36.8421
arithmetic                 19    4   4  21.0526   21.0526
combinatorial geometry     19    4   3  21.0526   15.7895
combinatorics              19    4   6  21.0526   31.5789
counting                   19    1   3   5.26316  15.7895
descriptive geometry       19    7   4  36.8421   21.0526
graph theory               19    5   5  26.3158   26.3158
logic                      19   11   7  57.8947   36.8421
metric geometry - angle    19    9   3  47.3684   15.7895
metric geometry - area     19    9   7  47.3684   36.8421
metric geometry - length   19   10   3  52.6316   15.7895
solid geometry             19    6   1  31.5789    5.26316
statistics                 19    8   7  42.1053   36.8421
topology                   19   10   5  52.6316   26.3158
transformation geometry    19    5   3  26.3158   15.7895
--  ------------------------  ---  ---  --  --------  --------

torchrun --nproc-per-node=8 run.py --model InternVL2-40B --data MathVision_MINI

The expected test results are:

--  ------------------------  ---  ---  --  --------  -------
Overall                   304  100  65  32.8947   21.3816
algebra                    19    6   4  31.5789   21.0526
analytic geometry          19    7   5  36.8421   26.3158
arithmetic                 19    4   8  21.0526   42.1053
combinatorial geometry     19    3   6  15.7895   31.5789
combinatorics              19    0   4   0        21.0526
counting                   19    1   2   5.26316  10.5263
descriptive geometry       19    8   2  42.1053   10.5263
graph theory               19    6   3  31.5789   15.7895
logic                      19    8   4  42.1053   21.0526
metric geometry - angle    19   10   5  52.6316   26.3158
metric geometry - area     19    8   2  42.1053   10.5263
metric geometry - length   19   10   3  52.6316   15.7895
solid geometry             19    6   3  31.5789   15.7895
statistics                 19   10   6  52.6316   31.5789
topology                   19    7   4  36.8421   21.0526
transformation geometry    19    6   4  31.5789   21.0526
--  ------------------------  ---  ---  --  --------  -------

torchrun --nproc-per-node=1 run.py --model InternVL2-76B --data MathVision_MINI

The expected test results are:

--  ------------------------  ---  ---  --  --------  -------
Overall                   304  102  72  33.5526   23.6842
algebra                    19    1   3   5.26316  15.7895
analytic geometry          19    6   8  31.5789   42.1053
arithmetic                 19    5   7  26.3158   36.8421
combinatorial geometry     19    7   2  36.8421   10.5263
combinatorics              19    1   4   5.26316  21.0526
counting                   19    0   3   0        15.7895
descriptive geometry       19    9   2  47.3684   10.5263
graph theory               19    6   3  31.5789   15.7895
logic                      19    8   5  42.1053   26.3158
metric geometry - angle    19   11   5  57.8947   26.3158
metric geometry - area     19    9   5  47.3684   26.3158
metric geometry - length   19   10   5  52.6316   26.3158
solid geometry             19    6   5  31.5789   26.3158
statistics                 19    6   8  31.5789   42.1053
topology                   19    7   4  36.8421   21.0526
transformation geometry    19   10   3  52.6316   15.7895
--  ------------------------  ---  ---  --  --------  -------

BLINK#

The BLINK dataset is a benchmark that challenges MLLMs on core visual perception tasks not typically covered by other benchmarks. It reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or multiple images and visual prompts. The tasks include relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning; humans can generally solve them quickly, but they remain challenging for current multimodal LLMs.

torchrun --nproc-per-node=8 run.py --model InternVL2-1B --data BLINK

The expected test results are:

2024-08-02 13:47:04,164 - RUN - INFO - The evaluation of model InternVL2-1B x dataset BLINK has finished!
2024-08-02 13:47:04,164 - RUN - INFO - Evaluation Results:
2024-08-02 13:47:04,166 - RUN - INFO -
-------------------------  -------------------
split                      none
Overall                    0.3855865334034719
Art_Style                  0.4700854700854701
Counting                   0.325
Forensic_Detection         0.25
Functional_Correspondence  0.26153846153846155
IQ_Test                    0.2866666666666667
Jigsaw                     0.5266666666666666
Multi-view_Reasoning       0.44360902255639095
Object_Localization        0.4918032786885246
Relative_Depth             0.49193548387096775
Relative_Reflectance       0.3283582089552239
Semantic_Correspondence    0.2446043165467626
Spatial_Relation           0.5664335664335665
Visual_Correspondence      0.27325581395348836
Visual_Similarity          0.4740740740740741
-------------------------  -------------------
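
Since minor discrepancies between runs are expected, it can be convenient to compare a reproduced table against the expected one with a tolerance rather than by eye. The following is a minimal sketch under stated assumptions: `parse_results_table` and `within_tolerance` are hypothetical helpers, not part of VLMEvalKit, and the `0.01` threshold is an arbitrary illustrative value.

```python
import math


def parse_results_table(text):
    """Parse 'name  value' rows into a dict of floats.

    Separator rules, log lines, and non-numeric rows such as
    'split  none' are skipped.
    """
    scores = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # blank lines and timestamped log lines
        name, value = parts
        try:
            scores[name] = float(value)
        except ValueError:
            continue  # separator rules and the 'split  none' row
    return scores


def within_tolerance(expected, actual, abs_tol=0.01):
    """Map each expected key to True if the actual score is within abs_tol."""
    return {
        name: math.isclose(value, actual.get(name, float("nan")), abs_tol=abs_tol)
        for name, value in expected.items()
    }


# A few rows copied from the InternVL2-1B table above.
SAMPLE = """\
-------------------------  -------------------
split                      none
Overall                    0.3855865334034719
Counting                   0.325
Forensic_Detection         0.25
-------------------------  -------------------
"""

if __name__ == "__main__":
    expected = parse_results_table(SAMPLE)
    # Made-up scores standing in for a reproduced run.
    reproduced = {"Overall": 0.39, "Counting": 0.325, "Forensic_Detection": 0.30}
    print(within_tolerance(expected, reproduced))
```

A missing key is treated as a mismatch because `math.isclose` returns `False` when either argument is NaN.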

torchrun --nproc-per-node=8 run.py --model InternVL2-2B --data BLINK

The expected test results are:

2024-08-02 13:46:22,686 - RUN - INFO - The evaluation of model InternVL2-2B x dataset BLINK has finished!
2024-08-02 13:46:22,686 - RUN - INFO - Evaluation Results:
2024-08-02 13:46:22,689 - RUN - INFO -
-------------------------  -------------------
split                      none
Overall                    0.43766438716465017
Art_Style                  0.5299145299145299
Counting                   0.4666666666666667
Forensic_Detection         0.2803030303030303
Functional_Correspondence  0.23076923076923078
IQ_Test                    0.2866666666666667
Jigsaw                     0.47333333333333333
Multi-view_Reasoning       0.556390977443609
Object_Localization        0.36885245901639346
Relative_Depth             0.6048387096774194
Relative_Reflectance       0.39552238805970147
Semantic_Correspondence    0.3669064748201439
Spatial_Relation           0.7622377622377622
Visual_Correspondence      0.3313953488372093
Visual_Similarity          0.5111111111111111
-------------------------  -------------------

torchrun --nproc-per-node=8 run.py --model InternVL2-4B --data BLINK

The expected test results are:

2024-08-02 13:34:06,982 - RUN - INFO - The evaluation of model InternVL2-4B x dataset BLINK has finished!
2024-08-02 13:34:06,982 - RUN - INFO - Evaluation Results:
2024-08-02 13:34:06,984 - RUN - INFO -
-------------------------  -------------------
split                      none
Overall                    0.46081009994739613
Art_Style                  0.5897435897435898
Counting                   0.55
Forensic_Detection         0.32575757575757575
Functional_Correspondence  0.25384615384615383
IQ_Test                    0.23333333333333334
Jigsaw                     0.48
Multi-view_Reasoning       0.556390977443609
Object_Localization        0.5245901639344263
Relative_Depth             0.6370967741935484
Relative_Reflectance       0.3283582089552239
Semantic_Correspondence    0.2805755395683453
Spatial_Relation           0.8111888111888111
Visual_Correspondence      0.36046511627906974
Visual_Similarity          0.5925925925925926
-------------------------  -------------------

torchrun --nproc-per-node=8 run.py --model InternVL2-8B --data BLINK

The expected test results are:

2024-08-02 13:28:10,915 - RUN - INFO - The evaluation of model InternVL2-8B x dataset BLINK has finished!
2024-08-02 13:28:10,915 - RUN - INFO - Evaluation Results:
2024-08-02 13:28:10,917 - RUN - INFO -
-------------------------  -------------------
split                      none
Overall                    0.5086796422935297
Art_Style                  0.7094017094017094
Counting                   0.75
Forensic_Detection         0.3484848484848485
Functional_Correspondence  0.17692307692307693
IQ_Test                    0.30666666666666664
Jigsaw                     0.5466666666666666
Multi-view_Reasoning       0.48872180451127817
Object_Localization        0.5573770491803278
Relative_Depth             0.7419354838709677
Relative_Reflectance       0.39552238805970147
Semantic_Correspondence    0.26618705035971224
Spatial_Relation           0.7972027972027972
Visual_Correspondence      0.36046511627906974
Visual_Similarity          0.7851851851851852
-------------------------  -------------------

torchrun --nproc-per-node=8 run.py --model InternVL2-26B --data BLINK

The expected test results are:

2024-08-02 13:00:51,453 - RUN - INFO - The evaluation of model InternVL2-26B x dataset BLINK has finished!
2024-08-02 13:00:51,453 - RUN - INFO - Evaluation Results:
2024-08-02 13:00:51,455 - RUN - INFO -
-------------------------  -------------------
split                      none
Overall                    0.5623356128353498
Art_Style                  0.7606837606837606
Counting                   0.675
Forensic_Detection         0.45454545454545453
Functional_Correspondence  0.3
IQ_Test                    0.30666666666666664
Jigsaw                     0.7466666666666667
Multi-view_Reasoning       0.41353383458646614
Object_Localization        0.5737704918032787
Relative_Depth             0.782258064516129
Relative_Reflectance       0.3582089552238806
Semantic_Correspondence    0.4172661870503597
Spatial_Relation           0.8461538461538461
Visual_Correspondence      0.47674418604651164
Visual_Similarity          0.8222222222222222
-------------------------  -------------------

torchrun --nproc-per-node=8 run.py --model InternVL2-40B --data BLINK

The expected test results are:

2024-08-02 14:03:54,291 - RUN - INFO - The evaluation of model InternVL2-40B x dataset BLINK has finished!
2024-08-02 14:03:54,291 - RUN - INFO - Evaluation Results:
2024-08-02 14:03:54,292 - RUN - INFO -
-------------------------  -------------------
split                      none
Overall                    0.5718043135192005
Art_Style                  0.6923076923076923
Counting                   0.7166666666666667
Forensic_Detection         0.44696969696969696
Functional_Correspondence  0.25384615384615383
IQ_Test                    0.22666666666666666
Jigsaw                     0.8
Multi-view_Reasoning       0.5639097744360902
Object_Localization        0.5819672131147541
Relative_Depth             0.7903225806451613
Relative_Reflectance       0.3880597014925373
Semantic_Correspondence    0.41007194244604317
Spatial_Relation           0.8461538461538461
Visual_Correspondence      0.4941860465116279
Visual_Similarity          0.8518518518518519
-------------------------  -------------------

torchrun --nproc-per-node=1 run.py --model InternVL2-76B --data BLINK

The expected test results are:

2024-08-02 16:08:58,199 - RUN - INFO - The evaluation of model InternVL2-76B x dataset BLINK has finished!
2024-08-02 16:08:58,199 - RUN - INFO - Evaluation Results:
2024-08-02 16:08:58,200 - RUN - INFO -
-------------------------  -------------------
split                      none
Overall                    0.5681220410310363
Art_Style                  0.6581196581196581
Counting                   0.7
Forensic_Detection         0.42424242424242425
Functional_Correspondence  0.3
IQ_Test                    0.2733333333333333
Jigsaw                     0.74
Multi-view_Reasoning       0.5639097744360902
Object_Localization        0.5245901639344263
Relative_Depth             0.782258064516129
Relative_Reflectance       0.30597014925373134
Semantic_Correspondence    0.4028776978417266
Spatial_Relation           0.8391608391608392
Visual_Correspondence      0.6802325581395349
Visual_Similarity          0.7555555555555555
-------------------------  -------------------

MTVQA#

MTVQA (Multilingual Text-Centric Visual Question Answering) is a benchmark with high-quality human expert annotations across nine diverse languages. It targets multilingual text-centric VQA (TEC-VQA), measuring how well models answer questions about text in images in languages other than English.

torchrun --nproc-per-node=8 run.py --model InternVL2-1B --data MTVQA_TEST

The expected test results are:

{
    "AR": 1.991465149359886,
    "Average": 12.570079669519032,
    "DE": 21.85114503816794,
    "FR": 20.54176072234763,
    "IT": 22.39819004524887,
    "JA": 6.159420289855073,
    "KR": 8.422939068100359,
    "RU": 3.571428571428571,
    "TH": 2.1645021645021645,
    "VI": 11.199095022624435
}
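
When reading these results, note that `Average` is not the plain mean of the nine language scores. The sketch below computes the unweighted mean for InternVL2-1B from the values above; the helper name `unweighted_language_mean` is hypothetical. The result is about 10.92, lower than the reported 12.57, so `Average` is presumably weighted in some way (for example by the number of questions per language). That interpretation is an assumption and is not stated in this document.

```python
import json

# InternVL2-1B results copied from the JSON above.
MTVQA_1B = json.loads("""
{
    "AR": 1.991465149359886,
    "Average": 12.570079669519032,
    "DE": 21.85114503816794,
    "FR": 20.54176072234763,
    "IT": 22.39819004524887,
    "JA": 6.159420289855073,
    "KR": 8.422939068100359,
    "RU": 3.571428571428571,
    "TH": 2.1645021645021645,
    "VI": 11.199095022624435
}
""")


def unweighted_language_mean(result):
    """Mean of the per-language scores, ignoring the 'Average' entry."""
    language_scores = [v for k, v in result.items() if k != "Average"]
    return sum(language_scores) / len(language_scores)


if __name__ == "__main__":
    print(f"unweighted mean:  {unweighted_language_mean(MTVQA_1B):.2f}")
    print(f"reported Average: {MTVQA_1B['Average']:.2f}")
```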

torchrun --nproc-per-node=8 run.py --model InternVL2-2B --data MTVQA_TEST

The expected test results are:

{
    "AR": 1.422475106685633,
    "Average": 10.88816760106226,
    "DE": 15.744274809160306,
    "FR": 19.751693002257337,
    "IT": 21.380090497737555,
    "JA": 7.367149758454106,
    "KR": 5.913978494623656,
    "RU": 3.0423280423280423,
    "TH": 0.8658008658008658,
    "VI": 9.049773755656108
}

torchrun --nproc-per-node=8 run.py --model InternVL2-4B --data MTVQA_TEST

The expected test results are:

{
    "AR": 1.849217638691323,
    "Average": 15.34375922100915,
    "DE": 24.904580152671755,
    "FR": 30.81264108352145,
    "IT": 26.923076923076923,
    "JA": 8.091787439613526,
    "KR": 8.064516129032258,
    "RU": 3.7037037037037033,
    "TH": 3.463203463203463,
    "VI": 12.104072398190045
}

torchrun --nproc-per-node=8 run.py --model InternVL2-8B --data MTVQA_TEST

The expected test results are:

{
    "AR": 2.418207681365576,
    "Average": 18.102685157863675,
    "DE": 28.435114503816795,
    "FR": 33.972911963882616,
    "IT": 30.20361990950226,
    "JA": 8.57487922705314,
    "KR": 10.931899641577061,
    "RU": 5.158730158730158,
    "TH": 6.926406926406926,
    "VI": 17.760180995475114
}

torchrun --nproc-per-node=8 run.py --model InternVL2-26B --data MTVQA_TEST

The expected test results are:

{
    "AR": 3.982930298719772,
    "Average": 17.71909117733845,
    "DE": 28.053435114503817,
    "FR": 26.52370203160271,
    "IT": 30.316742081447963,
    "JA": 9.903381642512077,
    "KR": 11.29032258064516,
    "RU": 6.613756613756613,
    "TH": 8.225108225108226,
    "VI": 18.32579185520362
}

torchrun --nproc-per-node=8 run.py --model InternVL2-40B --data MTVQA_TEST

The expected test results are:

{
    "AR": 4.551920341394026,
    "Average": 20.61079964591325,
    "DE": 30.62977099236641,
    "FR": 36.455981941309254,
    "IT": 34.61538461538461,
    "JA": 10.748792270531402,
    "KR": 13.261648745519713,
    "RU": 6.481481481481481,
    "TH": 5.627705627705628,
    "VI": 21.49321266968326
}

torchrun --nproc-per-node=1 run.py --model InternVL2-76B --data MTVQA_TEST

The expected test results are:

{
    "AR": 9.53058321479374,
    "Average": 22.794334611979934,
    "DE": 31.297709923664126,
    "FR": 35.66591422121896,
    "IT": 35.18099547511312,
    "JA": 11.11111111111111,
    "KR": 14.336917562724013,
    "RU": 11.904761904761903,
    "TH": 9.956709956709958,
    "VI": 26.923076923076923
}

Citation#

If you find this project useful in your research, please consider citing:

@article{chen2024far,
  title={How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={Science China Information Sciences},
  volume={67},
  number={12},
  pages={220101},
  year={2024},
  publisher={Springer}
}
@inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}
