# Evaluation of InternVL2 Series

To evaluate the performance of the InternVL2 series across various tasks, follow the instructions for each specific dataset. Ensure that the appropriate number of GPUs is allocated as specified.

> 1⃣️ We use both the InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results reported for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were obtained with the InternVL repository, while OCRBench, RealWorldQA, HallBench, and MathVista were evaluated with VLMEvalKit.

> 2⃣️ Please note that evaluating the same model using different testing toolkits like InternVL and VLMEvalKit can result in slight differences, which is normal. Updates to code versions and variations in environment and hardware can also cause minor discrepancies in results.

> 3⃣️ Note that the dataset descriptions were generated by GPT-4 and may contain errors.

## Model Preparation

| model name           | type | param | download                                                            |  size  |
| -------------------- | ---- | ----- | ------------------------------------------------------------------- | :----: |
| InternVL2-1B         | MLLM | 0.9B  | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL2-1B)         | 1.8 GB |
| InternVL2-2B         | MLLM | 2.2B  | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL2-2B)         | 4.2 GB |
| InternVL2-4B         | MLLM | 4.2B  | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL2-4B)         | 7.8 GB |
| InternVL2-8B         | MLLM | 8.1B  | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL2-8B)         | 16 GB  |
| InternVL2-26B        | MLLM | 25.5B | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL2-26B)        | 48 GB  |
| InternVL2-40B        | MLLM | 40.1B | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL2-40B)        | 75 GB  |
| InternVL2-Llama3-76B | MLLM | 76.3B | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B) | 143 GB |

Before evaluation, download the trained models we provide.

```sh
cd pretrained/
# pip install -U huggingface_hub
# Download OpenGVLab/InternVL2-1B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-1B --local-dir InternVL2-1B
# Download OpenGVLab/InternVL2-2B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-2B --local-dir InternVL2-2B
# Download OpenGVLab/InternVL2-4B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-4B --local-dir InternVL2-4B
# Download OpenGVLab/InternVL2-8B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-8B --local-dir InternVL2-8B
# Download OpenGVLab/InternVL2-26B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-26B --local-dir InternVL2-26B
# Download OpenGVLab/InternVL2-40B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-40B --local-dir InternVL2-40B
# Download OpenGVLab/InternVL2-Llama3-76B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-Llama3-76B --local-dir InternVL2-Llama3-76B
```

The directory structure is:

```sh
pretrained
├── InternVL2-1B
├── InternVL2-2B
├── InternVL2-4B
├── InternVL2-8B
├── InternVL2-26B
├── InternVL2-40B
└── InternVL2-Llama3-76B
```
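
Before launching an evaluation, it can help to confirm that every checkpoint is where the commands in this guide expect it. The following is a minimal sketch and not part of the InternVL codebase: the helper name `missing_models` is hypothetical, and it only checks that each directory exists and is non-empty, not that the download is complete.

```python
from pathlib import Path

# Directory names taken from the tree above.
MODEL_DIRS = [
    "InternVL2-1B",
    "InternVL2-2B",
    "InternVL2-4B",
    "InternVL2-8B",
    "InternVL2-26B",
    "InternVL2-40B",
    "InternVL2-Llama3-76B",
]


def missing_models(root="pretrained"):
    """Return the model directories that are absent or empty under `root`."""
    root_path = Path(root)
    missing = []
    for name in MODEL_DIRS:
        model_dir = root_path / name
        # A finished download should leave at least one entry in the directory.
        if not model_dir.is_dir() or not any(model_dir.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_models():
        print(f"missing or empty: pretrained/{name}")
```

Run it from the directory that contains `pretrained/`; no output means all seven directories are present and non-empty.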

## Evaluation using InternVL Codebase

### Data Preparation

Please prepare the evaluation data according to the [guidance provided here](../get_started/eval_data_preparation.md).

### MME

MME is a comprehensive benchmark designed to evaluate Multimodal Large Language Models (MLLMs) on both perception and cognition abilities across 14 different subtasks, ensuring robust and diverse testing of these models.

`````{tabs}

````{tab} 1B

Please use the following command to perform the test with 1 GPU:

```bash
GPUS=1 sh evaluate.sh pretrained/InternVL2-1B mme --dynamic
```

The expected test results are:

```
=========== Perception ===========
total score: 1346.1990796318528

         existence  score: 175.0
         count  score: 113.33333333333334
         position  score: 135.0
         color  score: 138.33333333333331
         posters  score: 116.32653061224491
         celebrity  score: 144.70588235294116
         scene  score: 143.25
         landmark  score: 128.5
         artwork  score: 141.75
         OCR  score: 110.0


=========== Cognition ===========
total score: 448.2142857142857

         commonsense_reasoning  score: 95.71428571428571
         numerical_calculation  score: 57.5
         text_translation  score: 177.5
         code_reasoning  score: 117.5
```

````

````{tab} 2B

Please use the following command to perform the test with 1 GPU:

```bash
GPUS=1 sh evaluate.sh pretrained/InternVL2-2B mme --dynamic
```

The expected test results are:

```
=========== Perception ===========
total score: 1439.6688675470189

         existence  score: 200.0
         count  score: 128.33333333333334
         position  score: 145.0
         color  score: 163.33333333333334
         posters  score: 131.97278911564626
         celebrity  score: 118.52941176470588
         scene  score: 157.0
         landmark  score: 154.0
         artwork  score: 146.5
         OCR  score: 95.0


=========== Cognition ===========
total score: 437.1428571428571

         commonsense_reasoning  score: 112.14285714285714
         numerical_calculation  score: 45.0
         text_translation  score: 177.5
         code_reasoning  score: 102.5
```

````

````{tab} 4B

Please use the following command to perform the test with 1 GPU:

```bash
GPUS=1 sh evaluate.sh pretrained/InternVL2-4B mme --dynamic
```

The expected test results are:

```
=========== Perception ===========
total score: 1532.31662665066

         existence  score: 200.0
         count  score: 123.33333333333333
         position  score: 148.33333333333331
         color  score: 165.0
         posters  score: 155.78231292517006
         celebrity  score: 124.11764705882354
         scene  score: 158.75
         landmark  score: 165.0
         artwork  score: 144.5
         OCR  score: 147.5


=========== Cognition ===========
total score: 531.7857142857142

         commonsense_reasoning  score: 129.28571428571428
         numerical_calculation  score: 115.0
         text_translation  score: 170.0
         code_reasoning  score: 117.5
```

````

````{tab} 8B

Please use the following command to perform the test with 1 GPU:

```bash
GPUS=1 sh evaluate.sh pretrained/InternVL2-8B mme --dynamic
```

The expected test results are:

```
=========== Perception ===========
total score: 1648.1331532613044

         existence  score: 190.0
         count  score: 158.33333333333331
         position  score: 163.33333333333334
         color  score: 175.0
         posters  score: 167.68707482993196
         celebrity  score: 148.52941176470586
         scene  score: 152.5
         landmark  score: 176.5
         artwork  score: 153.75
         OCR  score: 162.5


=========== Cognition ===========
total score: 562.1428571428571

         commonsense_reasoning  score: 147.14285714285714
         numerical_calculation  score: 87.5
         text_translation  score: 192.5
         code_reasoning  score: 135.0
```

````

````{tab} 26B

Please use the following command to perform the test with 1 GPU:

```bash
GPUS=1 sh evaluate.sh pretrained/InternVL2-26B mme --dynamic
```

The expected test results are:

```
=========== Perception ===========
total score: 1720.0325130052022

         existence  score: 195.0
         count  score: 170.0
         position  score: 176.66666666666669
         color  score: 168.33333333333331
         posters  score: 176.87074829931973
         celebrity  score: 159.41176470588235
         scene  score: 154.0
         landmark  score: 179.5
         artwork  score: 162.75
         OCR  score: 177.5


=========== Cognition ===========
total score: 540.7142857142858

         commonsense_reasoning  score: 145.71428571428572
         numerical_calculation  score: 95.0
         text_translation  score: 185.0
         code_reasoning  score: 115.0
```

````

````{tab} 40B

Please use the following command to perform the test with 8 GPUs:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mme --dynamic --auto
```

The expected test results are:

```
=========== Perception ===========
total score: 1715.390456182473

         existence  score: 185.0
         count  score: 175.0
         position  score: 158.33333333333331
         color  score: 188.33333333333331
         posters  score: 187.41496598639458
         celebrity  score: 162.05882352941177
         scene  score: 152.5
         landmark  score: 180.25
         artwork  score: 171.5
         OCR  score: 155.0


=========== Cognition ===========
total score: 599.6428571428571

         commonsense_reasoning  score: 152.14285714285714
         numerical_calculation  score: 125.0
         text_translation  score: 177.5
         code_reasoning  score: 145.0
```

````

````{tab} 76B

Please use the following command to perform the test with 8 GPUs:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mme --dynamic --auto
```

The expected test results are:

```
=========== Perception ===========
total score: 1731.095538215286

         existence  score: 200.0
         count  score: 175.0
         position  score: 168.33333333333331
         color  score: 185.0
         posters  score: 186.39455782312925
         celebrity  score: 169.11764705882354
         scene  score: 152.0
         landmark  score: 182.0
         artwork  score: 173.25
         OCR  score: 140.0


=========== Cognition ===========
total score: 683.5714285714286

         commonsense_reasoning  score: 158.57142857142856
         numerical_calculation  score: 185.0
         text_translation  score: 177.5
         code_reasoning  score: 162.5
```

````

`````
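
In the logs above, each category's `total score` is the sum of its subtask scores: 10 perception subtasks and 4 cognition subtasks, 14 in all. As a quick sanity check when reproducing a run, the sketch below re-adds the InternVL2-1B numbers. The variable and function names are illustrative only and are not part of the MME tooling.

```python
import math

# Subtask scores copied from the InternVL2-1B tab above.
PERCEPTION = {
    "existence": 175.0,
    "count": 113.33333333333334,
    "position": 135.0,
    "color": 138.33333333333331,
    "posters": 116.32653061224491,
    "celebrity": 144.70588235294116,
    "scene": 143.25,
    "landmark": 128.5,
    "artwork": 141.75,
    "OCR": 110.0,
}
COGNITION = {
    "commonsense_reasoning": 95.71428571428571,
    "numerical_calculation": 57.5,
    "text_translation": 177.5,
    "code_reasoning": 117.5,
}


def total_score(subtasks):
    """Sum the per-subtask scores of one MME category."""
    return math.fsum(subtasks.values())


perception_total = total_score(PERCEPTION)
cognition_total = total_score(COGNITION)
print(f"Perception: {perception_total:.2f}")
print(f"Cognition:  {cognition_total:.2f}")
print(f"Subtasks:   {len(PERCEPTION) + len(COGNITION)}")
```

The printed totals should agree with the `total score` lines in the 1B log up to floating-point rounding.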

### OKVQA

OKVQA (Outside Knowledge Visual Question Answering) is a dataset designed for visual question answering tasks that require external knowledge beyond what is visible in the image, featuring over 14,000 questions to evaluate the reasoning abilities of AI models.

`````{tabs}

````{tab} 1B

Please use the following command to perform the test with 8 GPUs:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-okvqa-val --dynamic
```

The expected test results are:

```
okvqa_val 0.48513674197383483
```

````

````{tab} 2B

Please use the following command to perform the test with 8 GPUs:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-okvqa-val --dynamic
```

The expected test results are:

```
okvqa_val 0.5316290130796605
```

````


````{tab} 4B

Please use the following command to perform the test with 8 GPUs:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-okvqa-val --dynamic
```

The expected test results are:

```
okvqa_val 0.6007530717399846
```

````


````{tab} 8B

Please use the following command to perform the test with 8 GPUs:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-okvqa-val --dynamic
```

The expected test results are:

```
okvqa_val 0.6289734443123187
```

````


````{tab} 26B

Please use the following command to perform the test with 8 GPUs:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-okvqa-val --dynamic
```

The expected test results are:

```
okvqa_val 0.6594530321046287
```

````


````{tab} 40B

Please use the following command to perform the test with 8 GPUs:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-okvqa-val --dynamic --auto
```

The expected test results are:

```
okvqa_val 0.664288545382473
```

````


````{tab} 76B

Please use the following command to perform the test with 8 GPUs:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-okvqa-val --dynamic --auto
```

The expected test results are:

```
okvqa_val 0.683432421720166
```

````

`````
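
The per-model commands in this guide follow one pattern: the model directory and dataset name change, and the 40B and 76B commands carry an additional `--auto` flag. When scripting a sweep over all checkpoints, that pattern could be captured roughly as follows. This is a sketch only: `build_command` and `AUTO_MODELS` are hypothetical names, and the function only assembles the command string without running it.

```python
# Model directories whose commands in this guide carry the extra --auto flag.
AUTO_MODELS = {"InternVL2-40B", "InternVL2-Llama3-76B"}


def build_command(model, dataset, gpus=8, extra_flags=()):
    """Assemble an evaluate.sh invocation in the style used in this guide."""
    parts = [
        f"GPUS={gpus}",
        "sh",
        "evaluate.sh",
        f"pretrained/{model}",
        dataset,
        "--dynamic",
    ]
    # Dataset-specific flags (for example "--max-num", "12") go before --auto.
    parts.extend(extra_flags)
    if model in AUTO_MODELS:
        parts.append("--auto")
    return " ".join(parts)


if __name__ == "__main__":
    for model in ["InternVL2-1B", "InternVL2-40B"]:
        print(build_command(model, "vqa-okvqa-val"))
```

Passing `gpus=1` reproduces the single-GPU MME commands for the smaller models.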

### TextVQA

TextVQA is a dataset designed to evaluate visual question answering models by requiring them to read and reason about text present within images, containing 45,336 questions over 28,408 images from the OpenImages dataset.

The TextVQA dataset provides official OCR results, specifically Rosetta OCR tokens. When InstructBLIP and LLaVA 1.5 are tested, these OCR results are fed to the LLM as part of the prompt. The commands below do not input Rosetta OCR tokens:

`````{tabs}

````{tab} 1B

We do not use Rosetta OCR tokens. Run this command:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-textvqa-val --dynamic
```

The expected test results are:

```
textvqa_val 0.7052400000000033
```

````

````{tab} 2B

We do not use Rosetta OCR tokens. Run this command:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-textvqa-val --dynamic
```

The expected test results are:

```
textvqa_val 0.7335600000000035
```

````

````{tab} 4B

We do not use Rosetta OCR tokens. Run this command:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-textvqa-val --dynamic
```

The expected test results are:

```
textvqa_val 0.7437000000000039
```

````

````{tab} 8B

We do not use Rosetta OCR tokens. Run this command:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-textvqa-val --dynamic
```

The expected test results are:

```
textvqa_val 0.773740000000004
```

````

````{tab} 26B

We do not use Rosetta OCR tokens. Run this command:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-textvqa-val --dynamic
```

The expected test results are:

```
textvqa_val 0.8228200000000048
```

````

````{tab} 40B

We do not use Rosetta OCR tokens. Run this command:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-textvqa-val --dynamic --auto
```

The expected test results are:

```
textvqa_val 0.8301600000000046
```

````

````{tab} 76B

We do not use Rosetta OCR tokens. Run this command:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-textvqa-val --dynamic --auto
```

The expected test results are:

```
textvqa_val 0.844100000000004
```

````

`````
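
The VQA-style logs in this guide print one `<dataset> <score>` pair per line, with the score as a fraction between 0 and 1. For side-by-side comparison it is often easier to read them as percentages. A small sketch, assuming the two-field line format shown above (the helper name `parse_result_line` is hypothetical):

```python
def parse_result_line(line):
    """Split a '<dataset> <score>' log line into (dataset, percentage)."""
    name, raw_score = line.split()
    # Scores are logged as fractions; report them as percentages to 1 decimal.
    return name, round(float(raw_score) * 100, 1)


if __name__ == "__main__":
    for line in [
        "textvqa_val 0.7052400000000033",
        "textvqa_val 0.844100000000004",
    ]:
        print(parse_result_line(line))
```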

### VizWiz

The VizWiz VQA dataset is a visual question answering dataset created to help answer visual questions posed by blind individuals. It contains over 31,000 visual questions, where users took a picture using a mobile phone and recorded a spoken question about it. Each question comes with 10 crowdsourced answers. This dataset addresses tasks such as predicting the answer to a visual question and determining whether a visual question can be answered.

`````{tabs}

````{tab} 1B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-vizwiz-val --dynamic
```

The expected test results are:

```
vizwiz_val 0.5306783977772626
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-vizwiz-test --dynamic
```

For the test set, submit the results to the [evaluation server](https://eval.ai/web/challenges/challenge-page/2185/overview).

````

````{tab} 2B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-vizwiz-val --dynamic
```

The expected test results are:

```
vizwiz_val 0.47376707571196724
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-vizwiz-test --dynamic
```

For the test set, submit the results to the [evaluation server](https://eval.ai/web/challenges/challenge-page/2185/overview).

````

````{tab} 4B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-vizwiz-val --dynamic
```

The expected test results are:

```
vizwiz_val 0.622088446399631
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-vizwiz-test --dynamic
```

For the test set, submit the results to the [evaluation server](https://eval.ai/web/challenges/challenge-page/2185/overview).

````

````{tab} 8B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-vizwiz-val --dynamic
```

The expected test results are:

```
vizwiz_val 0.6290808057420708
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-vizwiz-test --dynamic
```

For the test set, submit the results to the [evaluation server](https://eval.ai/web/challenges/challenge-page/2185/overview).

````

````{tab} 26B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-vizwiz-val --dynamic
```

The expected test results are:

```
vizwiz_val 0.6839083121092873
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-vizwiz-test --dynamic
```

For the test set, submit the results to the [evaluation server](https://eval.ai/web/challenges/challenge-page/2185/overview).

````

````{tab} 40B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-vizwiz-val --dynamic --auto
```

The expected test results are:

```
vizwiz_val 0.6521880064829846
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-vizwiz-test --dynamic --auto
```

For the test set, submit the results to the [evaluation server](https://eval.ai/web/challenges/challenge-page/2185/overview).

````

````{tab} 76B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-vizwiz-val --dynamic --auto
```

The expected test results are:

```
vizwiz_val 0.6767075711970381
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-vizwiz-test --dynamic --auto
```

For the test set, submit the results to the [evaluation server](https://eval.ai/web/challenges/challenge-page/2185/overview).

````

`````

### ChartQA

The ChartQA dataset is a comprehensive benchmark for question answering about charts that involves both visual and logical reasoning. It includes a mix of 9.6K human-written questions and 23.1K machine-generated questions derived from chart summaries. This dataset is designed to evaluate models that can understand and analyze charts by answering complex questions that often require multiple logical and arithmetic operations, as well as referencing visual features of the charts.

`````{tabs}

````{tab} 1B

The ChartQA dataset includes two test sets: `chartqa_test_human` and `chartqa_test_augmented`. The final score for model evaluation is calculated as the average of the scores on these two test sets:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-chartqa-test --dynamic --max-num 12
```

The expected test results are:

```
['chartqa_test_human', {'relaxed_accuracy': 0.5392}]
['chartqa_test_augmented', {'relaxed_accuracy': 0.9184}]

result = (53.92 + 91.84) / 2 = 72.88
```

````

````{tab} 2B

The ChartQA dataset includes two test sets: `chartqa_test_human` and `chartqa_test_augmented`. The final score for model evaluation is calculated as the average of the scores on these two test sets:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-chartqa-test --dynamic --max-num 12
```

The expected test results are:

```
['chartqa_test_human', {'relaxed_accuracy': 0.5952}]
['chartqa_test_augmented', {'relaxed_accuracy': 0.9296}]

result = (59.52 + 92.96) / 2 = 76.24
```

````

````{tab} 4B

The ChartQA dataset includes two test sets: `chartqa_test_human` and `chartqa_test_augmented`. The final score for model evaluation is calculated as the average of the scores on these two test sets:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-chartqa-test --dynamic --max-num 12
```

The expected test results are:

```
['chartqa_test_human', {'relaxed_accuracy': 0.6992}]
['chartqa_test_augmented', {'relaxed_accuracy': 0.9304}]

result = (69.92 + 93.04) / 2 = 81.48
```

````

````{tab} 8B

The ChartQA dataset includes two test sets: `chartqa_test_human` and `chartqa_test_augmented`. The final score for model evaluation is calculated as the average of the scores on these two test sets:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-chartqa-test --dynamic --max-num 12
```

The expected test results are:

```
['chartqa_test_human', {'relaxed_accuracy': 0.7288}]
['chartqa_test_augmented', {'relaxed_accuracy': 0.9368}]

result = (72.88 + 93.68) / 2 = 83.28
```

````

````{tab} 26B

The ChartQA dataset includes two test sets: `chartqa_test_human` and `chartqa_test_augmented`. The final score for model evaluation is calculated as the average of the scores on these two test sets:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-chartqa-test --dynamic --max-num 12
```

The expected test results are:

```
['chartqa_test_human', {'relaxed_accuracy': 0.7528}]
['chartqa_test_augmented', {'relaxed_accuracy': 0.9448}]

result = (75.28 + 94.48) / 2 = 84.88
```

````

````{tab} 40B

The ChartQA dataset includes two test sets: `chartqa_test_human` and `chartqa_test_augmented`. The final score for model evaluation is calculated as the average of the scores on these two test sets:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-chartqa-test --dynamic --max-num 12 --auto
```

The expected test results are:

```
['chartqa_test_human', {'relaxed_accuracy': 0.772}]
['chartqa_test_augmented', {'relaxed_accuracy': 0.952}]

result = (77.2 + 95.2) / 2 = 86.2
```

````

````{tab} 76B

The ChartQA dataset includes two test sets: `chartqa_test_human` and `chartqa_test_augmented`. The final score for model evaluation is calculated as the average of the scores on these two test sets:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-chartqa-test --dynamic --max-num 12 --auto
```

The expected test results are:

```
['chartqa_test_human', {'relaxed_accuracy': 0.816}]
['chartqa_test_augmented', {'relaxed_accuracy': 0.952}]

result = (81.6 + 95.2) / 2 = 88.4
```

````

`````
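
The final ChartQA score is the plain average of the two relaxed accuracies, as in the `result = ...` lines above. The sketch below reads the two log lines as Python literals and reproduces that arithmetic for the InternVL2-1B run. The function name `chartqa_final_score` is illustrative and does not come from the InternVL codebase.

```python
import ast

# Log lines copied from the InternVL2-1B tab above.
LOG_LINES = [
    "['chartqa_test_human', {'relaxed_accuracy': 0.5392}]",
    "['chartqa_test_augmented', {'relaxed_accuracy': 0.9184}]",
]


def chartqa_final_score(lines):
    """Average relaxed accuracy over the two ChartQA test sets, in percent."""
    scores = {}
    for line in lines:
        # Each log line is a Python literal: [split_name, {metric: value}].
        split_name, metrics = ast.literal_eval(line)
        scores[split_name] = metrics["relaxed_accuracy"]
    mean = (scores["chartqa_test_human"] + scores["chartqa_test_augmented"]) / 2
    return round(mean * 100, 2)


final_score = chartqa_final_score(LOG_LINES)
print(final_score)
```

The same function applies to any other tab once its two log lines are substituted.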

### DocVQA

The DocVQA dataset consists of 50,000 questions on 12,000+ document images. It is designed for visual question answering tasks where questions are answered using text within the document images. The dataset includes OCR transcriptions and ground truth answers, supporting evaluation of models that interpret and extract information from documents.

`````{tabs}

````{tab} 1B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-docvqa-val --dynamic --max-num 18
```

The expected test results are:

```
Overall ANLS: 0.7999
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-docvqa-test --dynamic --max-num 18
```

For the test set, submit the results to the [evaluation server](https://rrc.cvc.uab.es/?ch=17).

The expected test results are:

```
Overall ANLS: 0.8170
```

````

````{tab} 2B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-docvqa-val --dynamic --max-num 18
```

The expected test results are:

```
Overall ANLS: 0.8590
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-docvqa-test --dynamic --max-num 18
```

For the test set, submit the results to the [evaluation server](https://rrc.cvc.uab.es/?ch=17).

The expected test results are:

```
Overall ANLS: 0.8690
```

````

````{tab} 4B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-docvqa-val --dynamic --max-num 18
```

The expected test results are:

```
Overall ANLS: 0.8809
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-docvqa-test --dynamic --max-num 18
```

For the test set, submit the results to the [evaluation server](https://rrc.cvc.uab.es/?ch=17).

The expected test results are:

```
Overall ANLS: 0.8920
```

````

````{tab} 8B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-docvqa-val --dynamic --max-num 18
```

The expected test results are:

```
Overall ANLS: 0.9081
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-docvqa-test --dynamic --max-num 18
```

For the test set, submit the results to the [evaluation server](https://rrc.cvc.uab.es/?ch=17).

The expected test results are:

```
Overall ANLS: 0.9160
```

````

````{tab} 26B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-docvqa-val --dynamic --max-num 18
```

The expected test results are:

```
Overall ANLS: 0.9212
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-docvqa-test --dynamic --max-num 18
```

For the test set, submit the results to the [evaluation server](https://rrc.cvc.uab.es/?ch=17).

The expected test results are:

```
Overall ANLS: 0.9290
```

````

````{tab} 40B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-docvqa-val --dynamic --max-num 18 --auto
```

The expected test results are:

```
Overall ANLS: 0.9373
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-docvqa-test --dynamic --max-num 18 --auto
```

For the test set, submit the results to the [evaluation server](https://rrc.cvc.uab.es/?ch=17).

The expected test results are:

```
Overall ANLS: 0.9390
```

````

````{tab} 76B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-docvqa-val --dynamic --max-num 18 --auto
```

The expected test results are:

```
Overall ANLS: 0.9417
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-docvqa-test --dynamic --max-num 18 --auto
```

For the test set, submit the results to the [evaluation server](https://rrc.cvc.uab.es/?ch=17).

The expected test results are:

```
Overall ANLS: 0.9410
```

````

`````
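
The scores above are reported as ANLS (Average Normalized Levenshtein Similarity). For readers unfamiliar with the metric, here is a rough sketch of how it is commonly defined: for each question, take the best similarity against any accepted answer, where similarity is one minus the length-normalized edit distance, and count it as zero when the normalized distance reaches a threshold of 0.5. The lowercasing and whitespace stripping are assumptions of this sketch, and the official evaluation server may normalize answers differently.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def anls(predictions, references, threshold=0.5):
    """Average Normalized Levenshtein Similarity over a list of questions.

    `references[i]` is the list of accepted answers for question i.
    """
    total = 0.0
    for prediction, answers in zip(predictions, references):
        best = 0.0
        pred = prediction.strip().lower()
        for answer in answers:
            truth = answer.strip().lower()
            longest = max(len(pred), len(truth))
            distance = levenshtein(pred, truth) / longest if longest else 0.0
            # Normalized distances at or above the threshold score 0.
            similarity = 1.0 - distance if distance < threshold else 0.0
            best = max(best, similarity)
        total += best
    return total / len(predictions) if predictions else 0.0


if __name__ == "__main__":
    print(anls(["hallo", "42"], [["hello"], ["42", "forty-two"]]))
```

The threshold means a near miss such as a single OCR character error still earns partial credit, while an unrelated answer earns none.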

### AI2D

The AI2D dataset contains over 5,000 grade school science diagrams with extensive annotations and 15,000 multiple-choice questions for research on diagram understanding and question answering.

`````{tabs}

````{tab} 1B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-ai2d-test --dynamic
```

The expected test results are:

```
ai2diagram_test {'accuracy': 0.6408678756476683}
```

````

````{tab} 2B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-ai2d-test --dynamic
```

The expected test results are:

```
ai2diagram_test {'accuracy': 0.7409326424870466}
```

````

````{tab} 4B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-ai2d-test --dynamic
```

The expected test results are:

```
ai2diagram_test {'accuracy': 0.788860103626943}
```

````

````{tab} 8B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-ai2d-test --dynamic
```

The expected test results are:

```
ai2diagram_test {'accuracy': 0.8377590673575129}
```

````

````{tab} 26B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-ai2d-test --dynamic
```

The expected test results are:

```
ai2diagram_test {'accuracy': 0.844559585492228}
```

````

````{tab} 40B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-ai2d-test --dynamic --auto
```

The expected test results are:

```
ai2diagram_test {'accuracy': 0.8711139896373057}
```

````

````{tab} 76B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-ai2d-test --dynamic --auto
```

The expected test results are:

```
ai2diagram_test {'accuracy': 0.8762953367875648}
```

````

`````

### InfographicVQA

The InfographicVQA dataset is a collection of infographics accompanied by natural language questions and answers. This dataset includes a diverse range of infographics sourced from thousands of different websites, ensuring a variety of layouts and designs. It comprises 30,035 questions across 5,485 images, split into training, validation, and test sets.

`````{tabs}

````{tab} 1B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-infovqa-val --dynamic --max-num 24
```

The expected test results are:

```
Overall ANLS: 0.5018
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-infovqa-test --dynamic --max-num 24
```

For the test set, submit the results to the [evaluation server](https://rrc.cvc.uab.es/?ch=17).

The expected test results are:

```
Overall ANLS: 0.5090
```

````

````{tab} 2B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-infovqa-val --dynamic --max-num 24
```

The expected test results are:

```
Overall ANLS: 0.5766
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-infovqa-test --dynamic --max-num 24
```

For the test set, submit the results to the [evaluation server](https://rrc.cvc.uab.es/?ch=17).

The expected test results are:

```
Overall ANLS: 0.5890
```

````

````{tab} 4B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-infovqa-val --dynamic --max-num 24
```

The expected test results are:

```
Overall ANLS: 0.6625
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-infovqa-test --dynamic --max-num 24
```

For the test set, submit the results to the [evaluation server](https://rrc.cvc.uab.es/?ch=17).

The expected test results are:

```
Overall ANLS: 0.6700
```

````

````{tab} 8B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-infovqa-val --dynamic --max-num 24
```

The expected test results are:

```
Overall ANLS: 0.7260
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-infovqa-test --dynamic --max-num 24
```

For the test set, submit the results to the [evaluation server](https://rrc.cvc.uab.es/?ch=17).

The expected test results are:

```
Overall ANLS: 0.7480
```

````

````{tab} 26B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-infovqa-val --dynamic --max-num 24
```

The expected test results are:

```
Overall ANLS: 0.7601
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-infovqa-test --dynamic --max-num 24
```

For the test set, submit the results to the [evaluation server](https://rrc.cvc.uab.es/?ch=17).

The expected test results are:

```
Overall ANLS: 0.7590
```

````

````{tab} 40B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-infovqa-val --dynamic --max-num 24 --auto
```

The expected test results are:

```
Overall ANLS: 0.7851
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-infovqa-test --dynamic --max-num 24 --auto
```

For the test set, submit the results to the [evaluation server](https://rrc.cvc.uab.es/?ch=17).

The expected test results are:

```
Overall ANLS: 0.7870
```

````

````{tab} 76B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-infovqa-val --dynamic --max-num 24 --auto
```

The expected test results are:

```
Overall ANLS: 0.8021
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-infovqa-test --dynamic --max-num 24 --auto
```

For the test set, submit the results to the [evaluation server](https://rrc.cvc.uab.es/?ch=17).

The expected test results are:

```
Overall ANLS: 0.8200
```

````

`````

### GQA

The GQA dataset is a large-scale visual question answering dataset designed for real-world visual reasoning and compositional question answering. It contains over 22 million questions grounded in real images, each accompanied by detailed scene graphs that describe objects, their attributes, and relationships within the scene. The dataset includes images from the Visual Genome dataset, with questions that require various reasoning skills such as spatial understanding and multi-step inference.

`````{tabs}

````{tab} 1B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-gqa-testdev --dynamic
```

The expected test results are:

```
Accuracy: 59.77%
```

````

````{tab} 2B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-gqa-testdev --dynamic
```

The expected test results are:

```
Accuracy: 61.03%
```

````

````{tab} 4B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-gqa-testdev --dynamic
```

The expected test results are:

```
Accuracy: 62.07%
```

````

````{tab} 8B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-gqa-testdev --dynamic
```

The expected test results are:

```
Accuracy: 63.23%
```

````

````{tab} 26B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-gqa-testdev --dynamic
```

The expected test results are:

```
Accuracy: 64.89%
```

````

````{tab} 40B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-gqa-testdev --dynamic --auto
```

The expected test results are:

```
Accuracy: 64.89%
```

````

````{tab} 76B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-gqa-testdev --dynamic --auto
```

The expected test results are:

```
Accuracy: 65.22%
```

````

`````
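
The seven GQA commands above differ only in the checkpoint name and in the extra `--auto` flag on the 40B and 76B models. As a convenience, a small helper can print all of them. This is a sketch: `build_command` and the two constants are hypothetical names, and the script only prints the command strings rather than launching anything.

```python
# Checkpoint directory names as they appear in the commands above.
MODELS = [
    "InternVL2-1B",
    "InternVL2-2B",
    "InternVL2-4B",
    "InternVL2-8B",
    "InternVL2-26B",
    "InternVL2-40B",
    "InternVL2-Llama3-76B",
]

# Checkpoints whose commands above carry the additional --auto flag.
NEEDS_AUTO = {"InternVL2-40B", "InternVL2-Llama3-76B"}


def build_command(model: str, dataset: str = "vqa-gqa-testdev", gpus: int = 8) -> str:
    """Return the evaluate.sh command line for one checkpoint."""
    parts = ["sh", "evaluate.sh", f"pretrained/{model}", dataset, "--dynamic"]
    if model in NEEDS_AUTO:
        parts.append("--auto")
    return f"GPUS={gpus} " + " ".join(parts)


for model in MODELS:
    print(build_command(model))
```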

### POPE

The POPE (Polling-based Object Probing Evaluation) dataset is designed to evaluate object hallucination in MLLMs. The dataset consists of 3,000 questions related to the captions of 500 images. By treating the MLLMs' answers to these questions as a binary classification task, the dataset allows researchers to measure accuracy, precision, recall, and F1 scores to determine the extent of hallucination in the models.
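
The expected results in this section print a confusion matrix (`TP`, `FP`, `TN`, `FN`) for each of the `random`, `popular`, and `adversarial` categories, followed by the derived metrics. The final `result` line is the mean of the three F1 scores expressed as percentages. The sketch below shows how those numbers relate to each other. The function names are hypothetical, and this is not the repository's scoring script. The counts are the `random` category values reported for InternVL2-1B.

```python
def pope_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Binary-classification metrics, treating "yes" as the positive class."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / total,  # share of questions answered "yes"
    }


def mean_f1_percent(f1_percentages) -> float:
    """Average the per-category F1 percentages, rounded to one decimal."""
    return round(sum(f1_percentages) / len(f1_percentages), 1)


# InternVL2-1B, category "random".
metrics = pope_metrics(tp=1239, fp=51, tn=1359, fn=261)
for name, value in metrics.items():
    print(f"{name}: {value}")

# InternVL2-1B final result from the three per-category F1 scores.
print(mean_f1_percent([88.8, 87.5, 85.7]))
```

Recall is identical across the three categories for a given model because the positive ("yes") questions are shared; only the negative questions change, which is why `FP` and `TN` vary while `TP` and `FN` stay fixed.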

`````{tabs}

````{tab} 1B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B pope --dynamic
```

The expected test results are:

```
Category: random, # samples: 2910
TP      FP      TN      FN
1239    51      1359    261
Accuracy: 0.8927835051546392
Precision: 0.9604651162790697
Recall: 0.826
F1 score: 0.8881720430107527
Yes ratio: 0.44329896907216493
0.888, 0.893, 0.960, 0.826, 0.443
====================================
Category: popular, # samples: 3000
TP      FP      TN      FN
1239    93      1407    261
Accuracy: 0.882
Precision: 0.9301801801801802
Recall: 0.826
F1 score: 0.875
Yes ratio: 0.444
0.875, 0.882, 0.930, 0.826, 0.444
====================================
Category: adversarial, # samples: 3000
TP      FP      TN      FN
1239    151     1349    261
Accuracy: 0.8626666666666667
Precision: 0.8913669064748202
Recall: 0.826
F1 score: 0.8574394463667819
Yes ratio: 0.4633333333333333
0.857, 0.863, 0.891, 0.826, 0.463
====================================

result = (88.8 + 87.5 + 85.7) / 3 = 87.3
```

````

````{tab} 2B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B pope --dynamic
```

The expected test results are:

```
Category: random, # samples: 2910
TP      FP      TN      FN
1256    39      1371    244
Accuracy: 0.9027491408934708
Precision: 0.9698841698841699
Recall: 0.8373333333333334
F1 score: 0.898747763864043
Yes ratio: 0.44501718213058417
0.899, 0.903, 0.970, 0.837, 0.445
====================================
Category: popular, # samples: 3000
TP      FP      TN      FN
1256    89      1411    244
Accuracy: 0.889
Precision: 0.9338289962825279
Recall: 0.8373333333333334
F1 score: 0.8829525483304044
Yes ratio: 0.4483333333333333
0.883, 0.889, 0.934, 0.837, 0.448
====================================
Category: adversarial, # samples: 3000
TP      FP      TN      FN
1256    139     1361    244
Accuracy: 0.8723333333333333
Precision: 0.9003584229390681
Recall: 0.8373333333333334
F1 score: 0.8677029360967184
Yes ratio: 0.465
0.868, 0.872, 0.900, 0.837, 0.465
====================================

result = (89.9 + 88.3 + 86.8) / 3 = 88.3
```

````

````{tab} 4B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B pope --dynamic
```

The expected test results are:

```
Category: random, # samples: 2910
TP      FP      TN      FN
1247    54      1356    253
Accuracy: 0.8945017182130585
Precision: 0.9584934665641814
Recall: 0.8313333333333334
F1 score: 0.8903962870403428
Yes ratio: 0.4470790378006873
0.890, 0.895, 0.958, 0.831, 0.447
====================================
Category: popular, # samples: 3000
TP      FP      TN      FN
1247    116     1384    253
Accuracy: 0.877
Precision: 0.9148936170212766
Recall: 0.8313333333333334
F1 score: 0.8711142158574922
Yes ratio: 0.4543333333333333
0.871, 0.877, 0.915, 0.831, 0.454
====================================
Category: adversarial, # samples: 3000
TP      FP      TN      FN
1247    175     1325    253
Accuracy: 0.8573333333333333
Precision: 0.8769338959212377
Recall: 0.8313333333333334
F1 score: 0.8535249828884327
Yes ratio: 0.474
0.854, 0.857, 0.877, 0.831, 0.474
====================================

result = (89.0 + 87.1 + 85.4) / 3 = 87.2
```

````

````{tab} 8B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B pope --dynamic
```

The expected test results are:

```
Category: random, # samples: 2910
TP      FP      TN      FN
1204    29      1381    296
Accuracy: 0.8883161512027491
Precision: 0.9764801297648013
Recall: 0.8026666666666666
F1 score: 0.8810830589096232
Yes ratio: 0.42371134020618556
0.881, 0.888, 0.976, 0.803, 0.424
====================================
Category: popular, # samples: 3000
TP      FP      TN      FN
1204    67      1433    296
Accuracy: 0.879
Precision: 0.9472856018882769
Recall: 0.8026666666666666
F1 score: 0.8690003608805486
Yes ratio: 0.4236666666666667
0.869, 0.879, 0.947, 0.803, 0.424
====================================
Category: adversarial, # samples: 3000
TP      FP      TN      FN
1204    101     1399    296
Accuracy: 0.8676666666666667
Precision: 0.9226053639846743
Recall: 0.8026666666666666
F1 score: 0.8584670231729055
Yes ratio: 0.435
0.858, 0.868, 0.923, 0.803, 0.435
====================================

result = (88.1 + 86.9 + 85.8) / 3 = 86.9
```

````

````{tab} 26B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B pope --dynamic
```

The expected test results are:

```
Category: random, # samples: 2910
TP      FP      TN      FN
1221    25      1385    279
Accuracy: 0.89553264604811
Precision: 0.9799357945425361
Recall: 0.814
F1 score: 0.8892935178441369
Yes ratio: 0.4281786941580756
0.889, 0.896, 0.980, 0.814, 0.428
====================================
Category: popular, # samples: 3000
TP      FP      TN      FN
1221    57      1443    279
Accuracy: 0.888
Precision: 0.9553990610328639
Recall: 0.814
F1 score: 0.8790496760259179
Yes ratio: 0.426
0.879, 0.888, 0.955, 0.814, 0.426
====================================
Category: adversarial, # samples: 3000
TP      FP      TN      FN
1221    84      1416    279
Accuracy: 0.879
Precision: 0.9356321839080459
Recall: 0.814
F1 score: 0.8705882352941177
Yes ratio: 0.435
0.871, 0.879, 0.936, 0.814, 0.435
====================================

result = (88.9 + 87.9 + 87.1) / 3 = 88.0
```

````

````{tab} 40B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B pope --dynamic --auto
```

The expected test results are:

```
Category: random, # samples: 2910
TP      FP      TN      FN
1232    16      1394    268
Accuracy: 0.902405498281787
Precision: 0.9871794871794872
Recall: 0.8213333333333334
F1 score: 0.8966521106259098
Yes ratio: 0.4288659793814433
0.897, 0.902, 0.987, 0.821, 0.429
====================================
Category: popular, # samples: 3000
TP      FP      TN      FN
1232    65      1435    268
Accuracy: 0.889
Precision: 0.9498843484965305
Recall: 0.8213333333333334
F1 score: 0.8809438684304614
Yes ratio: 0.43233333333333335
0.881, 0.889, 0.950, 0.821, 0.432
====================================
Category: adversarial, # samples: 3000
TP      FP      TN      FN
1232    87      1413    268
Accuracy: 0.8816666666666667
Precision: 0.934040940106141
Recall: 0.8213333333333334
F1 score: 0.8740688187300462
Yes ratio: 0.43966666666666665
0.874, 0.882, 0.934, 0.821, 0.440
====================================

result = (89.7 + 88.1 + 87.4) / 3 = 88.4
```

````

````{tab} 76B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B pope --dynamic --auto
```

The expected test results are:

```
Category: random, # samples: 2910
TP      FP      TN      FN
1251    26      1384    249
Accuracy: 0.9054982817869416
Precision: 0.9796397807361003
Recall: 0.834
F1 score: 0.9009722722362261
Yes ratio: 0.4388316151202749
0.901, 0.905, 0.980, 0.834, 0.439
====================================
Category: popular, # samples: 3000
TP      FP      TN      FN
1251    62      1438    249
Accuracy: 0.8963333333333333
Precision: 0.9527798933739527
Recall: 0.834
F1 score: 0.8894418769996445
Yes ratio: 0.43766666666666665
0.889, 0.896, 0.953, 0.834, 0.438
====================================
Category: adversarial, # samples: 3000
TP      FP      TN      FN
1251    91      1409    249
Accuracy: 0.8866666666666667
Precision: 0.9321907600596125
Recall: 0.834
F1 score: 0.8803659394792399
Yes ratio: 0.44733333333333336
0.880, 0.887, 0.932, 0.834, 0.447
====================================

result = (90.1 + 88.9 + 88.0) / 3 = 89.0
```

````

`````

### Tiny LVLM

The Tiny LVLM-eHub is a streamlined evaluation benchmark designed to assess the multimodal capabilities of MLLMs, including models like Bard. It focuses on six categories of multimodal abilities: visual perception, visual knowledge acquisition, visual reasoning, visual commonsense, object hallucination, and embodied intelligence.

`````{tabs}

````{tab} 1B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B tiny_lvlm --dynamic
```

The expected test results are:

```
Visual_Knowledge_Acquisition: 0.6857142857142857
Object_Hallucination: 0.91
Visual_Commonsense: 0.556
Visual_Perception: 0.4875
Visual_Reasoning: 0.6145454545454545
Overall: 3.2537597402597402
```

````

````{tab} 2B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B tiny_lvlm --dynamic
```

The expected test results are:

```
Visual_Knowledge_Acquisition: 0.71
Object_Hallucination: 0.91
Visual_Commonsense: 0.558
Visual_Perception: 0.4675
Visual_Reasoning: 0.649090909090909
Overall: 3.294590909090909
```

````

````{tab} 4B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B tiny_lvlm --dynamic
```

The expected test results are:

```
Visual_Knowledge_Acquisition: 0.6814285714285714
Object_Hallucination: 0.89
Visual_Commonsense: 0.652
Visual_Perception: 0.4875
Visual_Reasoning: 0.6563636363636364
Overall: 3.3672922077922074
```

````

````{tab} 8B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B tiny_lvlm --dynamic
```

The expected test results are:

```
Visual_Knowledge_Acquisition: 0.6985714285714286
Object_Hallucination: 0.8966666666666666
Visual_Commonsense: 0.652
Visual_Perception: 0.485
Visual_Reasoning: 0.6854545454545454
Overall: 3.417692640692641
```

````

````{tab} 26B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B tiny_lvlm --dynamic
```

The expected test results are:

```
Visual_Knowledge_Acquisition: 0.7614285714285715
Object_Hallucination: 0.9
Visual_Commonsense: 0.652
Visual_Perception: 0.555
Visual_Reasoning: 0.7109090909090909
Overall: 3.5793376623376627
```

````

````{tab} 40B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B tiny_lvlm --dynamic --auto
```

The expected test results are:

```
Visual_Knowledge_Acquisition: 0.75
Object_Hallucination: 0.8966666666666666
Visual_Commonsense: 0.674
Visual_Perception: 0.5325
Visual_Reasoning: 0.730909090909091
Overall: 3.5840757575757576
```

````

````{tab} 76B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B tiny_lvlm --dynamic --auto
```

The expected test results are:

```
Visual_Knowledge_Acquisition: 0.7557142857142857
Object_Hallucination: 0.9166666666666666
Visual_Commonsense: 0.69
Visual_Perception: 0.525
Visual_Reasoning: 0.7418181818181818
Overall: 3.629199134199134
```

````

`````
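
In the expected results above, five category scores are printed and `Overall` matches their sum rather than their mean, so its maximum is 5. The check below reproduces the InternVL2-1B value. It is a sketch for reading the output, with `overall_score` as a hypothetical helper name, and not the repository's evaluation code.

```python
def overall_score(category_scores: dict) -> float:
    """Sum the per-category accuracies, as the reported Overall appears to do."""
    return sum(category_scores.values())


# Values copied from the InternVL2-1B expected results.
scores_1b = {
    "Visual_Knowledge_Acquisition": 0.6857142857142857,
    "Object_Hallucination": 0.91,
    "Visual_Commonsense": 0.556,
    "Visual_Perception": 0.4875,
    "Visual_Reasoning": 0.6145454545454545,
}

overall = overall_score(scores_1b)
print(f"Overall: {overall:.4f}")
print(f"Mean per category: {overall / len(scores_1b):.4f}")
```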

### MMMU

The MMMU dataset is a comprehensive benchmark designed to evaluate multimodal models on college-level tasks that require domain-specific knowledge and reasoning. It includes 11,500 questions sourced from college exams, quizzes, and textbooks, spanning six disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions cover 30 subjects and feature 30 types of images, such as charts, diagrams, maps, tables, and more.

`````{tabs}

````{tab} 1B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmmu-val --dynamic
```

The expected validation results are:

```
{'Overall-Art and Design': {'num': 120, 'acc': 0.383}, 'Art': {'num': 30, 'acc': 0.4}, 'Art_Theory': {'num': 30, 'acc': 0.4}, 'Design': {'num': 30, 'acc': 0.567}, 'Music': {'num': 30, 'acc': 0.167}, 'Overall-Business': {'num': 150, 'acc': 0.333}, 'Accounting': {'num': 30, 'acc': 0.333}, 'Economics': {'num': 30, 'acc': 0.433}, 'Finance': {'num': 30, 'acc': 0.067}, 'Manage': {'num': 30, 'acc': 0.367}, 'Marketing': {'num': 30, 'acc': 0.467}, 'Overall-Science': {'num': 150, 'acc': 0.3}, 'Biology': {'num': 30, 'acc': 0.267}, 'Chemistry': {'num': 30, 'acc': 0.233}, 'Geography': {'num': 30, 'acc': 0.367}, 'Math': {'num': 30, 'acc': 0.167}, 'Physics': {'num': 30, 'acc': 0.467}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.313}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.433}, 'Clinical_Medicine': {'num': 30, 'acc': 0.233}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.4}, 'Pharmacy': {'num': 30, 'acc': 0.3}, 'Public_Health': {'num': 30, 'acc': 0.2}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.483}, 'History': {'num': 30, 'acc': 0.4}, 'Literature': {'num': 30, 'acc': 0.667}, 'Sociology': {'num': 30, 'acc': 0.467}, 'Psychology': {'num': 30, 'acc': 0.4}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.348}, 'Agriculture': {'num': 30, 'acc': 0.233}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.367}, 'Computer_Science': {'num': 30, 'acc': 0.4}, 'Electronics': {'num': 30, 'acc': 0.4}, 'Energy_and_Power': {'num': 30, 'acc': 0.333}, 'Materials': {'num': 30, 'acc': 0.4}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.3}, 
'Overall': {'num': 900, 'acc': 0.354}}
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmmu-test --dynamic
```

For the test set, submit the results to the [evaluation server](https://eval.ai/web/challenges/challenge-page/2179/overview).

````

````{tab} 2B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B mmmu-val --dynamic
```

The expected validation results are:

```
{'Overall-Art and Design': {'num': 120, 'acc': 0.392}, 'Art': {'num': 30, 'acc': 0.467}, 'Art_Theory': {'num': 30, 'acc': 0.4}, 'Design': {'num': 30, 'acc': 0.5}, 'Music': {'num': 30, 'acc': 0.2}, 'Overall-Business': {'num': 150, 'acc': 0.347}, 'Accounting': {'num': 30, 'acc': 0.367}, 'Economics': {'num': 30, 'acc': 0.333}, 'Finance': {'num': 30, 'acc': 0.333}, 'Manage': {'num': 30, 'acc': 0.367}, 'Marketing': {'num': 30, 'acc': 0.333}, 'Overall-Science': {'num': 150, 'acc': 0.213}, 'Biology': {'num': 30, 'acc': 0.233}, 'Chemistry': {'num': 30, 'acc': 0.1}, 'Geography': {'num': 30, 'acc': 0.167}, 'Math': {'num': 30, 'acc': 0.367}, 'Physics': {'num': 30, 'acc': 0.2}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.373}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.433}, 'Clinical_Medicine': {'num': 30, 'acc': 0.4}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.4}, 'Pharmacy': {'num': 30, 'acc': 0.267}, 'Public_Health': {'num': 30, 'acc': 0.367}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.492}, 'History': {'num': 30, 'acc': 0.4}, 'Literature': {'num': 30, 'acc': 0.767}, 'Sociology': {'num': 30, 'acc': 0.433}, 'Psychology': {'num': 30, 'acc': 0.367}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.3}, 'Agriculture': {'num': 30, 'acc': 0.433}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.233}, 'Computer_Science': {'num': 30, 'acc': 0.233}, 'Electronics': {'num': 30, 'acc': 0.367}, 'Energy_and_Power': {'num': 30, 'acc': 0.233}, 'Materials': {'num': 30, 'acc': 0.4}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.2}, 
'Overall': {'num': 900, 'acc': 0.343}}
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B mmmu-test --dynamic
```

For the test set, submit the results to the [evaluation server](https://eval.ai/web/challenges/challenge-page/2179/overview).

````

````{tab} 4B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B mmmu-val --dynamic
```

The expected validation results are:

```
'Overall': {'num': 900, 'acc': 0.470}
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B mmmu-test --dynamic
```

For the test set, submit the results to the [evaluation server](https://eval.ai/web/challenges/challenge-page/2179/overview).

````

````{tab} 8B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B mmmu-val --dynamic
```

The expected validation results are:

```
{'Overall-Art and Design': {'num': 120, 'acc': 0.608}, 'Art': {'num': 30, 'acc': 0.733}, 'Art_Theory': {'num': 30, 'acc': 0.7}, 'Design': {'num': 30, 'acc': 0.733}, 'Music': {'num': 30, 'acc': 0.267}, 'Overall-Business': {'num': 150, 'acc': 0.453}, 'Accounting': {'num': 30, 'acc': 0.467}, 'Economics': {'num': 30, 'acc': 0.533}, 'Finance': {'num': 30, 'acc': 0.333}, 'Manage': {'num': 30, 'acc': 0.4}, 'Marketing': {'num': 30, 'acc': 0.533}, 'Overall-Science': {'num': 150, 'acc': 0.393}, 'Biology': {'num': 30, 'acc': 0.467}, 'Chemistry': {'num': 30, 'acc': 0.267}, 'Geography': {'num': 30, 'acc': 0.4}, 'Math': {'num': 30, 'acc': 0.5}, 'Physics': {'num': 30, 'acc': 0.333}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.507}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.567}, 'Clinical_Medicine': {'num': 30, 'acc': 0.667}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.467}, 'Pharmacy': {'num': 30, 'acc': 0.367}, 'Public_Health': {'num': 30, 'acc': 0.467}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.717}, 'History': {'num': 30, 'acc': 0.767}, 'Literature': {'num': 30, 'acc': 0.9}, 'Sociology': {'num': 30, 'acc': 0.7}, 'Psychology': {'num': 30, 'acc': 0.5}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.39}, 'Agriculture': {'num': 30, 'acc': 0.533}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.333}, 'Computer_Science': {'num': 30, 'acc': 0.5}, 'Electronics': {'num': 30, 'acc': 0.467}, 'Energy_and_Power': {'num': 30, 'acc': 0.4}, 'Materials': {'num': 30, 'acc': 0.233}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.267}, 
'Overall': {'num': 900, 'acc': 0.493}}
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B mmmu-test --dynamic
```

For the test set, submit the results to the [evaluation server](https://eval.ai/web/challenges/challenge-page/2179/overview).

````

````{tab} 26B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B mmmu-val --dynamic
```

The expected validation results are:

```
{'Overall-Art and Design': {'num': 120, 'acc': 0.7}, 'Art': {'num': 30, 'acc': 0.767}, 'Art_Theory': {'num': 30, 'acc': 0.867}, 'Design': {'num': 30, 'acc': 0.867}, 'Music': {'num': 30, 'acc': 0.3}, 'Overall-Business': {'num': 150, 'acc': 0.407}, 'Accounting': {'num': 30, 'acc': 0.467}, 'Economics': {'num': 30, 'acc': 0.3}, 'Finance': {'num': 30, 'acc': 0.333}, 'Manage': {'num': 30, 'acc': 0.5}, 'Marketing': {'num': 30, 'acc': 0.433}, 'Overall-Science': {'num': 150, 'acc': 0.373}, 'Biology': {'num': 30, 'acc': 0.6}, 'Chemistry': {'num': 30, 'acc': 0.2}, 'Geography': {'num': 30, 'acc': 0.5}, 'Math': {'num': 30, 'acc': 0.233}, 'Physics': {'num': 30, 'acc': 0.333}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.453}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.467}, 'Clinical_Medicine': {'num': 30, 'acc': 0.567}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.367}, 'Pharmacy': {'num': 30, 'acc': 0.367}, 'Public_Health': {'num': 30, 'acc': 0.5}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.7}, 'History': {'num': 30, 'acc': 0.7}, 'Literature': {'num': 30, 'acc': 0.9}, 'Sociology': {'num': 30, 'acc': 0.6}, 'Psychology': {'num': 30, 'acc': 0.6}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.39}, 'Agriculture': {'num': 30, 'acc': 0.467}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.267}, 'Computer_Science': {'num': 30, 'acc': 0.367}, 'Electronics': {'num': 30, 'acc': 0.367}, 'Energy_and_Power': {'num': 30, 'acc': 0.5}, 'Materials': {'num': 30, 'acc': 0.433}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.333}, 
'Overall': {'num': 900, 'acc': 0.483}}
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B mmmu-test --dynamic
```

For the test set, submit the results to the [evaluation server](https://eval.ai/web/challenges/challenge-page/2179/overview).

````

````{tab} 40B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mmmu-val --dynamic --auto
```

The expected validation results are:

```
{'Overall-Art and Design': {'num': 120, 'acc': 0.675}, 'Art': {'num': 30, 'acc': 0.733}, 'Art_Theory': {'num': 30, 'acc': 0.833}, 'Design': {'num': 30, 'acc': 0.767}, 'Music': {'num': 30, 'acc': 0.367}, 'Overall-Business': {'num': 150, 'acc': 0.44}, 'Accounting': {'num': 30, 'acc': 0.467}, 'Economics': {'num': 30, 'acc': 0.567}, 'Finance': {'num': 30, 'acc': 0.333}, 'Manage': {'num': 30, 'acc': 0.367}, 'Marketing': {'num': 30, 'acc': 0.467}, 'Overall-Science': {'num': 150, 'acc': 0.493}, 'Biology': {'num': 30, 'acc': 0.633}, 'Chemistry': {'num': 30, 'acc': 0.3}, 'Geography': {'num': 30, 'acc': 0.5}, 'Math': {'num': 30, 'acc': 0.5}, 'Physics': {'num': 30, 'acc': 0.533}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.593}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.5}, 'Clinical_Medicine': {'num': 30, 'acc': 0.6}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.4}, 'Pharmacy': {'num': 30, 'acc': 0.667}, 'Public_Health': {'num': 30, 'acc': 0.8}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.717}, 'History': {'num': 30, 'acc': 0.767}, 'Literature': {'num': 30, 'acc': 0.833}, 'Sociology': {'num': 30, 'acc': 0.6}, 'Psychology': {'num': 30, 'acc': 0.667}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.424}, 'Agriculture': {'num': 30, 'acc': 0.6}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.333}, 'Computer_Science': {'num': 30, 'acc': 0.467}, 'Electronics': {'num': 30, 'acc': 0.433}, 'Energy_and_Power': {'num': 30, 'acc': 0.467}, 'Materials': {'num': 30, 'acc': 0.3}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.367}, 
'Overall': {'num': 900, 'acc': 0.539}}
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mmmu-test --dynamic --auto
```

For the test set, submit the results to the [evaluation server](https://eval.ai/web/challenges/challenge-page/2179/overview).

````

````{tab} 76B

For the validation set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mmmu-val --dynamic --auto
```

The expected validation results are:

```
{'Overall-Art and Design': {'num': 120, 'acc': 0.683}, 'Art': {'num': 30, 'acc': 0.767}, 'Art_Theory': {'num': 30, 'acc': 0.933}, 'Design': {'num': 30, 'acc': 0.7}, 'Music': {'num': 30, 'acc': 0.333}, 'Overall-Business': {'num': 150, 'acc': 0.567}, 'Accounting': {'num': 30, 'acc': 0.5}, 'Economics': {'num': 30, 'acc': 0.567}, 'Finance': {'num': 30, 'acc': 0.433}, 'Manage': {'num': 30, 'acc': 0.633}, 'Marketing': {'num': 30, 'acc': 0.7}, 'Overall-Science': {'num': 150, 'acc': 0.413}, 'Biology': {'num': 30, 'acc': 0.467}, 'Chemistry': {'num': 30, 'acc': 0.3}, 'Geography': {'num': 30, 'acc': 0.433}, 'Math': {'num': 30, 'acc': 0.367}, 'Physics': {'num': 30, 'acc': 0.5}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.587}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.533}, 'Clinical_Medicine': {'num': 30, 'acc': 0.667}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.433}, 'Pharmacy': {'num': 30, 'acc': 0.6}, 'Public_Health': {'num': 30, 'acc': 0.7}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.725}, 'History': {'num': 30, 'acc': 0.733}, 'Literature': {'num': 30, 'acc': 0.867}, 'Sociology': {'num': 30, 'acc': 0.633}, 'Psychology': {'num': 30, 'acc': 0.667}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.443}, 'Agriculture': {'num': 30, 'acc': 0.6}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.367}, 'Computer_Science': {'num': 30, 'acc': 0.567}, 'Electronics': {'num': 30, 'acc': 0.433}, 'Energy_and_Power': {'num': 30, 'acc': 0.367}, 'Materials': {'num': 30, 'acc': 0.267}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.5}, 
'Overall': {'num': 900, 'acc': 0.552}}
```

For the test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mmmu-test --dynamic --auto
```

For the test set, submit the results to the [evaluation server](https://eval.ai/web/challenges/challenge-page/2179/overview).

````

`````
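
In the validation outputs above, each `Overall-<discipline>` entry aggregates its 30-question subjects, and the final `Overall` accuracy is consistent with a sample-weighted average over the six disciplines. The sketch below reproduces the InternVL2-1B figure from the printed discipline rows. `weighted_overall` is a hypothetical helper, and because the inputs are already rounded to three decimals this is a consistency check, not the repository's scoring code.

```python
def weighted_overall(groups: dict) -> dict:
    """Combine per-group accuracies, weighting each by its sample count."""
    total = sum(group["num"] for group in groups.values())
    if total == 0:
        raise ValueError("no samples to aggregate")
    correct = sum(group["num"] * group["acc"] for group in groups.values())
    return {"num": total, "acc": round(correct / total, 3)}


# Discipline rows copied from the InternVL2-1B validation output.
disciplines_1b = {
    "Overall-Art and Design": {"num": 120, "acc": 0.383},
    "Overall-Business": {"num": 150, "acc": 0.333},
    "Overall-Science": {"num": 150, "acc": 0.3},
    "Overall-Health and Medicine": {"num": 150, "acc": 0.313},
    "Overall-Humanities and Social Science": {"num": 120, "acc": 0.483},
    "Overall-Tech and Engineering": {"num": 210, "acc": 0.348},
}

print(weighted_overall(disciplines_1b))
```

Since Tech and Engineering contributes 210 of the 900 validation questions, it moves the overall score more than any other discipline.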

### MMVet (GPT-4-0613)

> **⚠️ Warning:** Here, we use `GPT-4-0613` as the judge model, whereas VLMEvalKit uses `GPT-4-Turbo`. Different versions of GPT-4 can score the same answers quite differently, so evaluating the same model with the two codebases can yield notably different scores.

The MM-Vet dataset is a comprehensive benchmark designed to evaluate the integrated capabilities of MLLMs. It encompasses six core vision-language (VL) capabilities: recognition, knowledge, optical character recognition (OCR), spatial awareness, language generation, and math. The dataset includes 200 images and 218 questions, each requiring one or more of these capabilities to answer. The evaluation uses an open-ended LLM-based approach, allowing assessment across various answer styles and question types.

`````{tabs}

````{tab} 1B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmvet --dynamic
```

Then, submit the results to the [evaluation server](https://huggingface.co/spaces/whyu/MM-Vet_Evaluator). The expected test results are:

```
runs: [37.8]
```

````

````{tab} 2B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B mmvet --dynamic
```

Then, submit the results to the [evaluation server](https://huggingface.co/spaces/whyu/MM-Vet_Evaluator). The expected test results are:

```
runs: [44.6]
```

````

````{tab} 4B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B mmvet --dynamic
```

Then, submit the results to the [evaluation server](https://huggingface.co/spaces/whyu/MM-Vet_Evaluator). The expected test results are:

```
runs: [55.7]
```

````

````{tab} 8B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B mmvet --dynamic
```

Then, submit the results to the [evaluation server](https://huggingface.co/spaces/whyu/MM-Vet_Evaluator). The expected test results are:

```
runs: [60.0]
```

````

````{tab} 26B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B mmvet --dynamic
```

Then, submit the results to the [evaluation server](https://huggingface.co/spaces/whyu/MM-Vet_Evaluator). The expected test results are:

```
runs: [64.2]
```

````

````{tab} 40B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mmvet --dynamic --auto
```

Then, submit the results to the [evaluation server](https://huggingface.co/spaces/whyu/MM-Vet_Evaluator). The expected test results are:

```
runs: [68.5]
```

````

````{tab} 76B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mmvet --dynamic --auto
```

Then, submit the results to the [evaluation server](https://huggingface.co/spaces/whyu/MM-Vet_Evaluator). The expected test results are:

```
runs: [69.8]
```

````

`````

### MMBench

The MMBench dataset is a comprehensive multi-modality benchmark designed to evaluate the fine-grained abilities of vision-language models. It contains around 3,000 multiple-choice questions covering 20 ability dimensions, structured into a hierarchical taxonomy. These dimensions include perception and reasoning abilities, further broken down into specific skills like coarse and fine-grained perception, attribute reasoning, and logic reasoning.

`````{tabs}

````{tab} 1B

For the English dev / test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmbench-dev-en --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmbench-test-en --dynamic
```

Then, submit the results to the [evaluation server](https://mmbench.opencompass.org.cn/mmbench-submission). The expected test results are:

```
mmbench-dev-en: -
mmbench-test-en: 65.4
```

For the Chinese dev / test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmbench-dev-cn --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmbench-test-cn --dynamic
```

Then, submit the results to the [evaluation server](https://mmbench.opencompass.org.cn/mmbench-submission). The expected test results are:

```
mmbench-dev-cn: -
mmbench-test-cn: 60.7
```

````

````{tab} 2B

For the English dev / test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B mmbench-dev-en --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B mmbench-test-en --dynamic
```

Then, submit the results to the [evaluation server](https://mmbench.opencompass.org.cn/mmbench-submission). The expected test results are:

```
mmbench-dev-en: -
mmbench-test-en: 73.2
```

For the Chinese dev / test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B mmbench-dev-cn --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B mmbench-test-cn --dynamic
```

Then, submit the results to the [evaluation server](https://mmbench.opencompass.org.cn/mmbench-submission). The expected test results are:

```
mmbench-dev-cn: -
mmbench-test-cn: 70.9
```

````

````{tab} 4B

For the English dev / test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B mmbench-dev-en --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B mmbench-test-en --dynamic
```

Then, submit the results to the [evaluation server](https://mmbench.opencompass.org.cn/mmbench-submission). The expected test results are:

```
mmbench-dev-en: -
mmbench-test-en: 78.6
```

For the Chinese dev / test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B mmbench-dev-cn --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B mmbench-test-cn --dynamic
```

Then, submit the results to the [evaluation server](https://mmbench.opencompass.org.cn/mmbench-submission). The expected test results are:

```
mmbench-dev-cn: -
mmbench-test-cn: 73.9
```

````

````{tab} 8B

For the English dev / test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B mmbench-dev-en --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B mmbench-test-en --dynamic
```

Then, submit the results to the [evaluation server](https://mmbench.opencompass.org.cn/mmbench-submission). The expected test results are:

```
mmbench-dev-en: -
mmbench-test-en: 81.7
```

For the Chinese dev / test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B mmbench-dev-cn --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B mmbench-test-cn --dynamic
```

Then, submit the results to the [evaluation server](https://mmbench.opencompass.org.cn/mmbench-submission). The expected test results are:

```
mmbench-dev-cn: -
mmbench-test-cn: 81.2
```

````

````{tab} 26B

For the English dev / test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B mmbench-dev-en --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B mmbench-test-en --dynamic
```

Then, submit the results to the [evaluation server](https://mmbench.opencompass.org.cn/mmbench-submission). The expected test results are:

```
mmbench-dev-en: -
mmbench-test-en: 83.4
```

For the Chinese dev / test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B mmbench-dev-cn --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B mmbench-test-cn --dynamic
```

Then, submit the results to the [evaluation server](https://mmbench.opencompass.org.cn/mmbench-submission). The expected test results are:

```
mmbench-dev-cn: -
mmbench-test-cn: 82.0
```

````

````{tab} 40B

For the English dev / test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mmbench-dev-en --dynamic --auto
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mmbench-test-en --dynamic --auto
```

Then, submit the results to the [evaluation server](https://mmbench.opencompass.org.cn/mmbench-submission). The expected test results are:

```
mmbench-dev-en: -
mmbench-test-en: 86.8
```

For the Chinese dev / test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mmbench-dev-cn --dynamic --auto
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mmbench-test-cn --dynamic --auto
```

Then, submit the results to the [evaluation server](https://mmbench.opencompass.org.cn/mmbench-submission). The expected test results are:

```
mmbench-dev-cn: -
mmbench-test-cn: 86.5
```

````

````{tab} 76B

For the English dev / test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mmbench-dev-en --dynamic --auto
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mmbench-test-en --dynamic --auto
```

Then, submit the results to the [evaluation server](https://mmbench.opencompass.org.cn/mmbench-submission). The expected test results are:

```
mmbench-dev-en: -
mmbench-test-en: 86.5
```

For the Chinese dev / test set, run:

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mmbench-dev-cn --dynamic --auto
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mmbench-test-cn --dynamic --auto
```

Then, submit the results to the [evaluation server](https://mmbench.opencompass.org.cn/mmbench-submission). The expected test results are:

```
mmbench-dev-cn: -
mmbench-test-cn: 86.3
```

````

`````

### CCBench

CCBench, a multi-modal benchmark in the domain of Chinese Culture, is designed to evaluate the performance of MLLMs on tasks specifically related to Chinese cultural content.

`````{tabs}

````{tab} 1B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B ccbench-dev --dynamic
```

Then, submit the results to the [evaluation server](https://mmbench.opencompass.org.cn/mmbench-submission). The expected test results are:

```
ccbench-dev: 75.7
```

````

````{tab} 2B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B ccbench-dev --dynamic
```

Then, submit the results to the [evaluation server](https://mmbench.opencompass.org.cn/mmbench-submission). The expected test results are:

```
ccbench-dev: 74.7
```

````

````{tab} 4B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B ccbench-dev --dynamic
```

Then, submit the results to the [evaluation server](https://mmbench.opencompass.org.cn/mmbench-submission). The expected test results are:

```
ccbench-dev: 66.5
```

````

````{tab} 8B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B ccbench-dev --dynamic
```

Then, submit the results to the [evaluation server](https://mmbench.opencompass.org.cn/mmbench-submission). The expected test results are:

```
ccbench-dev: 75.9
```

````

````{tab} 26B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B ccbench-dev --dynamic
```

Then, submit the results to the [evaluation server](https://mmbench.opencompass.org.cn/mmbench-submission). The expected test results are:

```
ccbench-dev: 73.5
```

````

````{tab} 40B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B ccbench-dev --dynamic --auto
```

Then, submit the results to the [evaluation server](https://mmbench.opencompass.org.cn/mmbench-submission). The expected test results are:

```
ccbench-dev: 80.6
```

````

````{tab} 76B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B ccbench-dev --dynamic --auto
```

Then, submit the results to the [evaluation server](https://mmbench.opencompass.org.cn/mmbench-submission). The expected test results are:

```
ccbench-dev: 81.0
```

````

`````
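
The per-model commands above differ only in the checkpoint name and in the extra `--auto` flag used for the 40B and 76B models. As a minimal sketch, the loop below prints the seven CCBench commands instead of running them; the function name `ccbench_commands` is hypothetical and not part of the InternVL repository.

```bash
# Hypothetical helper: print (not run) the CCBench command for each checkpoint.
# Model names and flags are copied from the tabs above.
ccbench_commands() {
  for model in InternVL2-1B InternVL2-2B InternVL2-4B InternVL2-8B \
               InternVL2-26B InternVL2-40B InternVL2-Llama3-76B; do
    extra=""
    case "$model" in
      # The 40B and 76B commands above add --auto.
      InternVL2-40B|InternVL2-Llama3-76B) extra=" --auto" ;;
    esac
    echo "GPUS=8 sh evaluate.sh pretrained/${model} ccbench-dev --dynamic${extra}"
  done
}

ccbench_commands
```

Piping the output to `sh` would run the evaluations one after another; printing first makes it easy to review the commands before committing GPU time.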

### SEED

SEED (SEED-Bench) is a multiple-choice benchmark for evaluating MLLMs on both image and video understanding. Its questions span 12 evaluation dimensions, ranging from scene understanding and instance-level perception (identity, location, attributes, counting) to text understanding and temporal tasks such as action recognition, action prediction, and procedure understanding. The evaluation script reports per-dimension accuracy along with total, image, and video accuracy.

`````{tabs}

````{tab} 1B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B seed --dynamic
```

The expected test results are:

```
Acc@1: 0.6074485825458588
length: 17990
Accuracy for each data type:
Data type Scene Understanding: 73.05%
Data type Instance Identity: 71.16%
Data type Instance Location: 69.23%
Data type Instance Attributes: 58.49%
Data type Instances Counting: 52.55%
Data type Spatial Relation: 43.53%
Data type Instance Interaction: 71.13%
Data type Visual Reasoning: 72.51%
Data type Text Understanding: 68.60%
Data type Action Recognition: 53.55%
Data type Action Prediction: 39.92%
Data type Procedure Understanding: 28.74%
Total accuracy: 60.76%
Image accuracy: 65.62%
Video accuracy: 42.35%
```

````

````{tab} 2B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B seed --dynamic
```

The expected test results are:

```
Acc@1: 0.6656475819899944
length: 17990
Accuracy for each data type:
Data type Scene Understanding: 76.92%
Data type Instance Identity: 76.79%
Data type Instance Location: 75.04%
Data type Instance Attributes: 65.44%
Data type Instances Counting: 60.40%
Data type Spatial Relation: 54.03%
Data type Instance Interaction: 72.16%
Data type Visual Reasoning: 76.74%
Data type Text Understanding: 74.42%
Data type Action Recognition: 60.04%
Data type Action Prediction: 43.27%
Data type Procedure Understanding: 34.70%
Total accuracy: 66.56%
Image accuracy: 71.55%
Video accuracy: 47.67%
```

````

````{tab} 4B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B seed --dynamic
```

The expected test results are:

```
Acc@1: 0.6934408004446915
length: 17990
Accuracy for each data type:
Data type Scene Understanding: 78.75%
Data type Instance Identity: 76.79%
Data type Instance Location: 77.45%
Data type Instance Attributes: 66.36%
Data type Instances Counting: 64.57%
Data type Spatial Relation: 56.47%
Data type Instance Interaction: 71.13%
Data type Visual Reasoning: 78.25%
Data type Text Understanding: 75.58%
Data type Action Recognition: 60.57%
Data type Action Prediction: 47.84%
Data type Procedure Understanding: 47.80%
Total accuracy: 69.34%
Image accuracy: 73.67%
Video accuracy: 52.94%
```

````

````{tab} 8B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B seed --dynamic
```

The expected test results are:

```
Acc@1: 0.7072262367982213
length: 17990
Accuracy for each data type:
Data type Scene Understanding: 79.89%
Data type Instance Identity: 78.97%
Data type Instance Location: 79.50%
Data type Instance Attributes: 69.84%
Data type Instances Counting: 68.08%
Data type Spatial Relation: 64.23%
Data type Instance Interaction: 79.38%
Data type Visual Reasoning: 78.85%
Data type Text Understanding: 75.58%
Data type Action Recognition: 60.70%
Data type Action Prediction: 48.57%
Data type Procedure Understanding: 36.56%
Total accuracy: 70.72%
Image accuracy: 76.15%
Video accuracy: 50.17%
```

````

````{tab} 26B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B seed --dynamic
```

The expected test results are:

```
Acc@1: 0.7245136186770428
length: 17990
Accuracy for each data type:
Data type Scene Understanding: 80.30%
Data type Instance Identity: 80.39%
Data type Instance Location: 79.88%
Data type Instance Attributes: 71.78%
Data type Instances Counting: 69.68%
Data type Spatial Relation: 61.95%
Data type Instance Interaction: 75.26%
Data type Visual Reasoning: 79.15%
Data type Text Understanding: 68.60%
Data type Action Recognition: 65.47%
Data type Action Prediction: 54.20%
Data type Procedure Understanding: 44.28%
Total accuracy: 72.45%
Image accuracy: 76.79%
Video accuracy: 56.03%
```

````

````{tab} 40B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B seed --dynamic --auto
```

The expected test results are:

```
Acc@1: 0.7464146748193441
length: 17990
Accuracy for each data type:
Data type Scene Understanding: 80.62%
Data type Instance Identity: 82.36%
Data type Instance Location: 80.92%
Data type Instance Attributes: 71.68%
Data type Instances Counting: 72.46%
Data type Spatial Relation: 66.36%
Data type Instance Interaction: 78.35%
Data type Visual Reasoning: 80.06%
Data type Text Understanding: 66.28%
Data type Action Recognition: 67.93%
Data type Action Prediction: 57.47%
Data type Procedure Understanding: 56.40%
Total accuracy: 74.65%
Image accuracy: 78.15%
Video accuracy: 61.38%
```

````

````{tab} 76B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B seed --dynamic --auto
```

The expected test results are:

```
Acc@1: 0.7446359088382435
length: 17990
Accuracy for each data type:
Data type Scene Understanding: 80.40%
Data type Instance Identity: 82.25%
Data type Instance Location: 80.66%
Data type Instance Attributes: 73.31%
Data type Instances Counting: 72.78%
Data type Spatial Relation: 65.14%
Data type Instance Interaction: 79.38%
Data type Visual Reasoning: 79.15%
Data type Text Understanding: 77.91%
Data type Action Recognition: 68.26%
Data type Action Prediction: 55.10%
Data type Procedure Understanding: 55.23%
Total accuracy: 74.46%
Image accuracy: 78.17%
Video accuracy: 60.42%
```

````

`````

### MMVP

The MMVP dataset is designed to benchmark the performance of multimodal large language models (MLLMs) in visual question answering tasks. This dataset focuses on identifying "CLIP-blind pairs," which are images that appear similar to the CLIP model despite having clear visual differences. The MMVP dataset includes 300 images derived from ImageNet-1k and LAION-Aesthetics, each paired with straightforward questions to evaluate the models' visual capabilities. It highlights visual patterns on which these systems often give incorrect responses and hallucinated explanations.

`````{tabs}

````{tab} 1B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmvp --dynamic
```

The expected test results are:

```
Evaluating MMVP ...
Results saved to results/MMVP_240708020850.jsonl
The accuracy is 0.2
```

````

````{tab} 2B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B mmvp --dynamic
```

The expected test results are:

```
Evaluating MMVP ...
Results saved to results/MMVP_240702122300.jsonl
The accuracy is 0.35333333333333333
```

````

````{tab} 4B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B mmvp --dynamic
```

The expected test results are:

```
Evaluating MMVP ...
Results saved to results/MMVP_240702144108.jsonl
The accuracy is 0.4066666666666667
```

````

````{tab} 8B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B mmvp --dynamic
```

The expected test results are:

```
Evaluating MMVP ...
Results saved to results/MMVP_240703200956.jsonl
The accuracy is 0.5133333333333333
```

````

````{tab} 26B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B mmvp --dynamic
```

The expected test results are:

```
Evaluating MMVP ...
Results saved to results/MMVP_240704024433.jsonl
The accuracy is 0.5466666666666666
```

````

````{tab} 40B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mmvp --dynamic --auto
```

The expected test results are:

```
Evaluating MMVP ...
Results saved to results/MMVP_240708045836.jsonl
The accuracy is 0.5866666666666667
```

````

````{tab} 76B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mmvp --dynamic --auto
```

The expected test results are:

```
Evaluating MMVP ...
Results saved to results/MMVP_240718203234.jsonl
The accuracy is 0.5266666666666666
```

````

`````

### RefCOCO Series

RefCOCO, RefCOCO+, and RefCOCOg are datasets used for tasks involving referring expression comprehension, segmentation, and generation. These datasets are built upon the MSCOCO dataset and are widely used to evaluate how well models link natural-language expressions to the image regions they describe.

`````{tabs}

````{tab} 1B
```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B refcoco --dynamic
```
````

````{tab} 2B
```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B refcoco --dynamic
```
````

````{tab} 4B
```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B refcoco --dynamic
```
````

````{tab} 8B
```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B refcoco --dynamic
```
````

````{tab} 26B
```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B refcoco --dynamic
```
````

````{tab} 40B
```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B refcoco --dynamic --auto
```
````

````{tab} 76B
```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B refcoco --dynamic --auto
```
````
`````

The expected test results are:

|          Model           | avg. | RefCOCO<br>(val) | RefCOCO<br>(testA) | RefCOCO<br>(testB) | RefCOCO+<br>(val) | RefCOCO+<br>(testA) | RefCOCO+<br>(testB) | RefCOCO‑g<br>(val) | RefCOCO‑g<br>(test) |
| :----------------------: | :--: | :--------------: | :----------------: | :----------------: | :---------------: | :-----------------: | :-----------------: | :----------------: | :-----------------: |
|       InternVL2‑1B       | 79.9 |       83.6       |        88.7        |        79.8        |       76.0        |        83.6         |        67.7         |        80.2        |        79.9         |
|       InternVL2‑2B       | 77.7 |       82.3       |        88.2        |        75.9        |       73.5        |        82.8         |        63.3         |        77.6        |        78.3         |
|       InternVL2‑4B       | 84.4 |       88.5       |        91.2        |        83.9        |       81.2        |        87.2         |        73.8         |        84.6        |        84.6         |
|       InternVL2‑8B       | 82.9 |       87.1       |        91.1        |        80.7        |       79.8        |        87.9         |        71.4         |        82.7        |        82.7         |
|      InternVL2‑26B       | 88.5 |       91.2       |        93.3        |        87.4        |       86.8        |        91.0         |        81.2         |        88.5        |        88.6         |
|      InternVL2‑40B       | 90.3 |       93.0       |        94.7        |        89.2        |       88.5        |        92.8         |        83.6         |        90.3        |        90.6         |
| InternVL2-<br>Llama3‑76B | 90.0 |       92.2       |        94.8        |        88.4        |       88.8        |        93.1         |        82.8         |        89.5        |        90.3         |
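
The `avg.` column in the table above matches the unweighted mean of the eight split scores in each row. The sketch below recomputes it for the InternVL2‑1B row; `refcoco_avg` is a hypothetical helper name, not a script shipped with the repository.

```bash
# Hypothetical helper: unweighted mean of the split scores passed as arguments,
# rounded to one decimal place like the avg. column.
refcoco_avg() {
  printf '%s\n' "$@" | awk '{ s += $1; n++ } END { printf "%.1f\n", s / n }'
}

# InternVL2-1B row: RefCOCO val/testA/testB, RefCOCO+ val/testA/testB, RefCOCO-g val/test
refcoco_avg 83.6 88.7 79.8 76.0 83.6 67.7 80.2 79.9   # prints 79.9
```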

### MVBench

MVBench is a comprehensive multimodal video understanding benchmark developed to evaluate the temporal comprehension capabilities of MLLMs. It includes 20 challenging video tasks that require temporal understanding and cannot be effectively solved using a single frame. The benchmark uses a novel static-to-dynamic method, transforming static tasks into dynamic ones to systematically generate video tasks that demand a wide range of temporal skills, from perception to cognition.

We evaluate our models on MVBench by extracting 16 frames from each video and resizing each frame to 448x448.
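
The text above does not say how the 16 frames are chosen. As an illustration only, the sketch below assumes the video is split into 16 equal segments and the middle frame of each segment is kept; both this sampling rule and the helper name `frame_indices` are assumptions, not the evaluation script's confirmed behavior.

```bash
# Hypothetical helper: pick k evenly spaced frame indices out of n frames
# by taking the midpoint of each of k equal segments (an assumed rule).
frame_indices() {
  # $1: total number of frames, $2: number of frames to keep (16 above)
  awk -v n="$1" -v k="$2" 'BEGIN {
    seg = n / k
    for (i = 0; i < k; i++)
      printf "%d%s", int(seg / 2 + seg * i), (i < k - 1 ? " " : "\n")
  }'
}

frame_indices 160 16
```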

`````{tabs}

````{tab} 1B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mvbench --dynamic --max-num 1
```

The expected test results are:

```
57.9
```

````

````{tab} 2B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B mvbench --dynamic --max-num 1
```

The expected test results are:

```
60.2
```

````

````{tab} 4B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B mvbench --dynamic --max-num 1
```

The expected test results are:

```
63.7
```

````

````{tab} 8B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B mvbench --dynamic --max-num 1
```

The expected test results are:

```
66.4
```

````

````{tab} 26B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B mvbench --dynamic --max-num 1
```

The expected test results are:

```
67.5
```

````

````{tab} 40B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mvbench --dynamic --max-num 1 --auto
```

The expected test results are:

```
72.5
```

````

````{tab} 76B

```bash
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mvbench --dynamic --max-num 1 --auto
```

The expected test results are:

```
69.6
```

````

`````

## Evaluation using VLMEvalKit Codebase

### Data Preparation

VLMEvalKit will automatically download the data for evaluation, so you do not need to prepare it manually.

### MathVista

The MathVista dataset is a comprehensive benchmark for evaluating mathematical reasoning within visual contexts. Alongside examples drawn from existing datasets, it includes three newly created datasets (IQTest, FunctionQA, and PaperQA), designed to address logical reasoning on puzzle test figures, algebraic reasoning over functional plots, and scientific reasoning with academic paper figures, respectively.

`````{tabs}

````{tab} 1B

```bash
torchrun --nproc-per-node=8 run.py --data MathVista_MINI --model InternVL2-1B --verbose
```

The expected test results are:

```
"Task&Skill","tot","prefetch","hit","prefetch_rate","acc"
"Overall","1000","489","377","48.9","37.7"
"scientific reasoning","122","85","45","69.67213114754098","36.885245901639344"
"textbook question answering","158","92","63","58.22784810126582","39.87341772151899"
"numeric commonsense","144","39","24","27.083333333333332","16.666666666666664"
"arithmetic reasoning","353","102","103","28.89518413597734","29.178470254957507"
"visual question answering","179","92","53","51.39664804469274","29.608938547486037"
"geometry reasoning","239","147","95","61.50627615062761","39.74895397489539"
"algebraic reasoning","281","170","112","60.4982206405694","39.8576512455516"
"geometry problem solving","208","138","85","66.34615384615384","40.86538461538461"
"math word problem","186","26","52","13.978494623655912","27.956989247311824"
"logical reasoning","37","11","5","29.72972972972973","13.513513513513514"
"figure question answering","269","141","124","52.41635687732342","46.09665427509294"
"statistical reasoning","301","144","148","47.840531561461795","49.16943521594684"
```

````

````{tab} 2B

```bash
torchrun --nproc-per-node=8 run.py --data MathVista_MINI --model InternVL2-2B --verbose
```

The expected test results are:

```
"Task&Skill","tot","prefetch","hit","prefetch_rate","acc"
"Overall","1000","476","464","47.599999999999994","46.400000000000006"
"scientific reasoning","122","83","68","68.0327868852459","55.73770491803278"
"textbook question answering","158","95","79","60.12658227848101","50.0"
"numeric commonsense","144","35","37","24.305555555555554","25.694444444444443"
"arithmetic reasoning","353","100","146","28.328611898016998","41.359773371104815"
"visual question answering","179","91","86","50.83798882681564","48.04469273743017"
"geometry reasoning","239","144","103","60.25104602510461","43.09623430962343"
"algebraic reasoning","281","171","117","60.854092526690394","41.637010676156585"
"geometry problem solving","208","136","94","65.38461538461539","45.19230769230769"
"math word problem","186","20","62","10.75268817204301","33.33333333333333"
"logical reasoning","37","11","4","29.72972972972973","10.81081081081081"
"figure question answering","269","134","143","49.814126394052046","53.159851301115246"
"statistical reasoning","301","137","180","45.51495016611295","59.800664451827245"
```

````

````{tab} 4B

```bash
torchrun --nproc-per-node=8 run.py --data MathVista_MINI --model InternVL2-4B --verbose
```

The expected test results are:

```
"Task&Skill","tot","prefetch","hit","prefetch_rate","acc"
"Overall","1000","544","587","54.400000000000006","58.699999999999996"
"scientific reasoning","122","88","73","72.1311475409836","59.83606557377049"
"textbook question answering","158","97","93","61.39240506329114","58.86075949367089"
"numeric commonsense","144","37","43","25.694444444444443","29.86111111111111"
"arithmetic reasoning","353","139","197","39.376770538243626","55.80736543909348"
"visual question answering","179","94","87","52.513966480446925","48.60335195530726"
"geometry reasoning","239","146","133","61.08786610878661","55.64853556485355"
"algebraic reasoning","281","169","156","60.14234875444839","55.51601423487544"
"geometry problem solving","208","137","119","65.86538461538461","57.21153846153846"
"math word problem","186","54","119","29.03225806451613","63.97849462365591"
"logical reasoning","37","19","9","51.35135135135135","24.324324324324326"
"figure question answering","269","162","169","60.223048327137555","62.825278810408925"
"statistical reasoning","301","167","215","55.48172757475083","71.42857142857143"
```

````

````{tab} 8B

```bash
torchrun --nproc-per-node=8 run.py --data MathVista_MINI --model InternVL2-8B --verbose
```

The expected test results are:

```
"Task&Skill","tot","prefetch","hit","prefetch_rate","acc"
"Overall","1000","549","583","54.900000000000006","58.3"
"scientific reasoning","122","89","72","72.95081967213115","59.01639344262295"
"textbook question answering","158","101","97","63.92405063291139","61.39240506329114"
"numeric commonsense","144","39","44","27.083333333333332","30.555555555555557"
"arithmetic reasoning","353","128","199","36.26062322946176","56.37393767705382"
"visual question answering","179","92","89","51.39664804469274","49.72067039106145"
"geometry reasoning","239","160","144","66.94560669456067","60.25104602510461"
"algebraic reasoning","281","185","168","65.83629893238434","59.7864768683274"
"geometry problem solving","208","150","129","72.11538461538461","62.019230769230774"
"math word problem","186","49","110","26.344086021505376","59.13978494623656"
"logical reasoning","37","16","4","43.24324324324324","10.81081081081081"
"figure question answering","269","157","158","58.36431226765799","58.7360594795539"
"statistical reasoning","301","155","207","51.49501661129568","68.77076411960132"
```

````

````{tab} 26B

```bash
torchrun --nproc-per-node=8 run.py --data MathVista_MINI --model InternVL2-26B --verbose
```

The expected test results are:

```
"Task&Skill","tot","prefetch","hit","prefetch_rate","acc"
"Overall","1000","588","594","58.8","59.4"
"scientific reasoning","122","87","73","71.31147540983606","59.83606557377049"
"textbook question answering","158","98","97","62.0253164556962","61.39240506329114"
"numeric commonsense","144","38","49","26.38888888888889","34.02777777777778"
"arithmetic reasoning","353","157","212","44.47592067988669","60.05665722379604"
"visual question answering","179","91","97","50.83798882681564","54.18994413407822"
"geometry reasoning","239","164","139","68.6192468619247","58.15899581589959"
"algebraic reasoning","281","188","159","66.90391459074732","56.58362989323843"
"geometry problem solving","208","154","121","74.03846153846155","58.17307692307693"
"math word problem","186","76","116","40.86021505376344","62.365591397849464"
"logical reasoning","37","17","3","45.94594594594595","8.108108108108109"
"figure question answering","269","169","163","62.825278810408925","60.594795539033456"
"statistical reasoning","301","168","212","55.81395348837209","70.43189368770764"
```

````

````{tab} 40B

```bash
torchrun --nproc-per-node=8 run.py --data MathVista_MINI --model InternVL2-40B --verbose
```

The expected test results are:

```
"Task&Skill","tot","prefetch","hit","prefetch_rate","acc"
"Overall","1000","552","637","55.2","63.7"
"scientific reasoning","122","90","76","73.77049180327869","62.295081967213115"
"textbook question answering","158","101","99","63.92405063291139","62.65822784810127"
"numeric commonsense","144","34","58","23.61111111111111","40.27777777777778"
"arithmetic reasoning","353","147","229","41.64305949008499","64.87252124645893"
"visual question answering","179","92","103","51.39664804469274","57.54189944134078"
"geometry reasoning","239","155","131","64.85355648535564","54.811715481171554"
"algebraic reasoning","281","180","152","64.05693950177937","54.092526690391466"
"geometry problem solving","208","146","114","70.1923076923077","54.807692307692314"
"math word problem","186","65","135","34.946236559139784","72.58064516129032"
"logical reasoning","37","11","10","29.72972972972973","27.027027027027028"
"figure question answering","269","148","186","55.01858736059479","69.14498141263941"
"statistical reasoning","301","150","233","49.83388704318937","77.40863787375415"
```

````

````{tab} 76B

```bash
torchrun --nproc-per-node=1 run.py --data MathVista_MINI --model InternVL2-76B --verbose
```

The expected test results are:

```
"Task&Skill","tot","prefetch","hit","prefetch_rate","acc"
"Overall","1000","534","655","53.400000000000006","65.5"
"scientific reasoning","122","89","77","72.95081967213115","63.114754098360656"
"textbook question answering","158","100","106","63.29113924050633","67.08860759493672"
"numeric commonsense","144","42","64","29.166666666666668","44.44444444444444"
"arithmetic reasoning","353","154","218","43.626062322946176","61.756373937677054"
"visual question answering","179","95","89","53.072625698324025","49.72067039106145"
"geometry reasoning","239","143","160","59.83263598326359","66.94560669456067"
"algebraic reasoning","281","168","187","59.7864768683274","66.54804270462633"
"geometry problem solving","208","134","142","64.42307692307693","68.26923076923077"
"math word problem","186","73","143","39.247311827956985","76.88172043010752"
"logical reasoning","37","7","6","18.91891891891892","16.216216216216218"
"figure question answering","269","132","175","49.07063197026022","65.05576208178438"
"statistical reasoning","301","139","232","46.179401993355484","77.0764119601329"
```

````

`````
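
In the CSV output above, the `acc` column equals `100 * hit / tot` for each row (for example, 377 hits out of 1000 gives 37.7). The sketch below recomputes that value from a CSV in the same layout read on standard input; `mathvista_acc` is a hypothetical helper name, and the two rows are copied from the 1B results.

```bash
# Hypothetical helper: recompute accuracy per row as 100 * hit / tot.
# Expects the column order shown above: Task&Skill, tot, prefetch, hit, ...
mathvista_acc() {
  tr -d '"' | awk -F',' 'NR > 1 { printf "%s: %.1f\n", $1, 100 * $4 / $2 }'
}

mathvista_acc <<'EOF'
"Task&Skill","tot","prefetch","hit","prefetch_rate","acc"
"Overall","1000","489","377","48.9","37.7"
"logical reasoning","37","11","5","29.72972972972973","13.513513513513514"
EOF
```

This relies on task names containing no commas, which holds for the rows shown above.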

### HallusionBench

HallusionBench is a comprehensive benchmark designed to evaluate image-context reasoning in MLLMs, focusing on identifying issues related to language hallucination and visual illusion. The dataset consists of 346 images paired with 1,129 questions crafted by human experts. These questions are divided into two categories: Visual Dependent (VD) and Visual Supplement (VS), allowing the benchmark to assess the nuanced understanding and interpretation of visual data by MLLMs.

`````{tabs}

````{tab} 1B

```bash
torchrun --nproc-per-node=8 run.py --data HallusionBench --model InternVL2-1B --verbose
```

The expected test results are:

```
"split","aAcc","fAcc","qAcc"
"Overall","54.363827549947416","23.98843930635838","21.978021978021978"
"VS","58.333333333333336","15.517241379310345","28.651685393258425"
"VD","51.945854483925544","28.26086956521739","17.689530685920577"
"VS_map","56.25","9.090909090909092","12.5"
"VD_illusion","48.61111111111111","25.806451612903224","8.333333333333332"
"VD_figure","58.75","36.58536585365854","23.076923076923077"
"VS_ocr","44.44444444444444","23.076923076923077","3.7037037037037033"
"VD_video","51.76470588235295","14.583333333333334","11.594202898550725"
"VD_ocr","78.65168539325843","58.139534883720934","55.81395348837209"
"VS_chart","66.15384615384615","17.5","47.368421052631575"
"VD_math","29.629629629629626","5.555555555555555","3.7037037037037033"
"VS_table","57.14285714285714","10.714285714285714","23.25581395348837"

result = (54.363827549947416 + 23.98843930635838 + 21.978021978021978) / 3 = 33.4
```

````

````{tab} 2B

```bash
torchrun --nproc-per-node=8 run.py --data HallusionBench --model InternVL2-2B --verbose
```

The expected test results are:

```
"split","aAcc","fAcc","qAcc"
"Overall","58.359621451104104","26.589595375722542","28.79120879120879"
"VS","65.27777777777779","24.137931034482758","41.57303370786517"
"VD","54.145516074450086","27.82608695652174","20.577617328519857"
"VS_chart","70.0","27.500000000000004","59.210526315789465"
"VD_math","38.88888888888889","2.7777777777777777","11.11111111111111"
"VS_table","65.17857142857143","14.285714285714285","37.2093023255814"
"VD_ocr","71.91011235955057","46.51162790697674","44.18604651162791"
"VD_figure","60.0","39.02439024390244","23.076923076923077"
"VD_illusion","57.638888888888886","32.25806451612903","23.61111111111111"
"VD_video","48.8235294117647","14.583333333333334","8.695652173913043"
"VS_map","64.0625","27.27272727272727","28.125"
"VS_ocr","55.55555555555556","26.923076923076923","14.814814814814813"

result = (58.359621451104104 + 26.589595375722542 + 28.79120879120879) / 3 = 37.9
```

````

````{tab} 4B

```bash
torchrun --nproc-per-node=8 run.py --data HallusionBench --model InternVL2-4B --verbose
```

The expected test results are:

```
"split","aAcc","fAcc","qAcc"
"Overall","61.09358569926393","32.369942196531795","32.30769230769231"
"VD","56.17597292724196","30.0","22.743682310469314"
"VS","69.16666666666667","37.06896551724138","47.19101123595505"
"VS_map","56.25","27.27272727272727","15.625"
"VS_ocr","55.55555555555556","38.46153846153847","18.51851851851852"
"VD_ocr","75.28089887640449","51.162790697674424","51.162790697674424"
"VS_table","75.89285714285714","35.714285714285715","55.81395348837209"
"VD_figure","62.5","39.02439024390244","25.64102564102564"
"VD_illusion","55.55555555555556","33.87096774193548","19.444444444444446"
"VD_video","48.8235294117647","8.333333333333332","7.246376811594203"
"VD_math","48.148148148148145","16.666666666666664","22.22222222222222"
"VS_chart","75.38461538461539","42.5","65.78947368421053"

result = (61.09358569926393 + 32.369942196531795 + 32.30769230769231) / 3 = 41.9
```

````

````{tab} 8B

```bash
torchrun --nproc-per-node=8 run.py --data HallusionBench --model InternVL2-8B --verbose
```

The expected test results are:

```
"split","aAcc","fAcc","qAcc"
"Overall","64.03785488958991","35.83815028901734","35.824175824175825"
"VS","69.16666666666667","36.206896551724135","45.50561797752809"
"VD","60.913705583756354","35.65217391304348","29.602888086642597"
"VS_chart","76.15384615384615","42.5","63.1578947368421"
"VD_ocr","74.15730337078652","51.162790697674424","48.837209302325576"
"VD_figure","67.5","53.65853658536586","35.8974358974359"
"VD_video","51.17647058823529","14.583333333333334","11.594202898550725"
"VD_math","55.55555555555556","16.666666666666664","29.629629629629626"
"VD_illusion","64.58333333333334","40.32258064516129","31.944444444444443"
"VS_map","56.25","31.818181818181817","18.75"
"VS_ocr","53.70370370370371","26.923076923076923","11.11111111111111"
"VS_table","75.89285714285714","39.285714285714285","55.81395348837209"

result = (64.03785488958991 + 35.83815028901734 + 35.824175824175825) / 3 = 45.2
```

````

````{tab} 26B

```bash
torchrun --nproc-per-node=8 run.py --data HallusionBench --model InternVL2-26B --verbose
```

The expected test results are:

```
"split","aAcc","fAcc","qAcc"
"Overall","67.2975814931651","43.641618497109825","41.098901098901095"
"VD","63.45177664974619","42.608695652173914","33.935018050541515"
"VS","73.61111111111111","45.689655172413794","52.24719101123596"
"VD_illusion","65.97222222222221","50.0","33.33333333333333"
"VS_chart","80.0","50.0","68.42105263157895"
"VD_ocr","77.52808988764045","58.139534883720934","55.81395348837209"
"VD_figure","72.5","53.65853658536586","43.58974358974359"
"VS_map","54.6875","22.727272727272727","18.75"
"VD_video","54.70588235294118","25.0","17.391304347826086"
"VS_ocr","51.85185185185185","34.61538461538461","14.814814814814813"
"VD_math","55.55555555555556","22.22222222222222","31.48148148148148"
"VS_table","87.5","67.85714285714286","72.09302325581395"

result = (67.2975814931651 + 43.641618497109825 + 41.098901098901095) / 3 = 50.7
```

````

````{tab} 40B

```bash
torchrun --nproc-per-node=8 run.py --data HallusionBench --model InternVL2-40B --verbose
```

The expected test results are:

```
"split","aAcc","fAcc","qAcc"
"Overall","71.39852786540484","51.73410404624278","47.69230769230769"
"VS","78.88888888888889","56.896551724137936","58.98876404494382"
"VD","66.83587140439933","49.130434782608695","40.43321299638989"
"VD_math","62.03703703703704","36.11111111111111","38.88888888888889"
"VD_ocr","80.89887640449437","62.7906976744186","60.46511627906976"
"VD_figure","85.0","78.04878048780488","69.23076923076923"
"VS_chart","84.61538461538461","60.0","76.31578947368422"
"VS_map","62.5","45.45454545454545","25.0"
"VS_ocr","72.22222222222221","53.84615384615385","44.44444444444444"
"VS_table","84.82142857142857","64.28571428571429","62.7906976744186"
"VD_video","52.94117647058824","20.833333333333336","15.942028985507244"
"VD_illusion","68.05555555555556","50.0","37.5"

result = (71.39852786540484 + 51.73410404624278 + 47.69230769230769) / 3 = 56.9
```

````

````{tab} 76B

```bash
torchrun --nproc-per-node=1 run.py --data HallusionBench --model InternVL2-76B --verbose
```

The expected test results are:

```
"split","aAcc","fAcc","qAcc"
"Overall","71.1882229232387","48.26589595375722","46.15384615384615"
"VS","76.38888888888889","53.44827586206896","56.74157303370787"
"VD","68.02030456852792","45.65217391304348","39.35018050541516"
"VD_ocr","80.89887640449437","65.11627906976744","65.11627906976744"
"VS_chart","81.53846153846153","60.0","73.68421052631578"
"VD_video","60.588235294117645","25.0","20.28985507246377"
"VD_math","64.81481481481481","27.77777777777778","37.03703703703704"
"VD_illusion","62.5","40.32258064516129","29.166666666666668"
"VS_ocr","64.81481481481481","42.30769230769231","29.629629629629626"
"VD_figure","83.75","73.17073170731707","66.66666666666666"
"VS_table","82.14285714285714","60.71428571428571","62.7906976744186"
"VS_map","65.625","45.45454545454545","31.25"

result = (71.1882229232387 + 48.26589595375722 + 46.15384615384615) / 3 = 55.2
```

````

`````
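
The `result` line in each tab above is the unweighted mean of the three `Overall` accuracies (`aAcc`, `fAcc`, `qAcc`), rounded to one decimal place. A minimal sketch of that arithmetic, assuming the CSV text has been copied into a string; the helper name `hallusion_score` is hypothetical and not part of VLMEvalKit:

```python
import csv
import io

# "Overall" row copied from the InternVL2-26B results above.
CSV_TEXT = '''"split","aAcc","fAcc","qAcc"
"Overall","67.2975814931651","43.641618497109825","41.098901098901095"
'''


def hallusion_score(csv_text: str) -> float:
    """Average aAcc, fAcc and qAcc of the "Overall" row, rounded to 0.1."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["split"] == "Overall":
            values = [float(row[key]) for key in ("aAcc", "fAcc", "qAcc")]
            return round(sum(values) / len(values), 1)
    raise ValueError('no "Overall" row found')


print(hallusion_score(CSV_TEXT))  # 50.7
```

Only the `Overall` row enters the average; the per-subset rows (`VD_*`, `VS_*`) are reported for diagnosis.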

### MMStar

The MMStar dataset is an advanced multimodal benchmark designed to evaluate the capabilities of MLLMs. It comprises 1,500 carefully selected samples that are balanced and purified to ensure they exhibit visual dependency and minimal data leakage. The dataset evaluates models across six core capabilities and 18 detailed axes, focusing on complex multimodal tasks that require advanced reasoning and understanding of visual content.

`````{tabs}

````{tab} 1B

```bash
torchrun --nproc-per-node=8 run.py --data MMStar --model InternVL2-1B --verbose
```

The expected test results are:

```
"split","Overall","coarse perception","fine-grained perception","instance reasoning","logical reasoning","math","science & technology"
"none","0.452","0.588","0.368","0.548","0.352","0.46","0.396"
```

````

````{tab} 2B

```bash
torchrun --nproc-per-node=8 run.py --data MMStar --model InternVL2-2B --verbose
```

The expected test results are:

```
"split","Overall","coarse perception","fine-grained perception","instance reasoning","logical reasoning","math","science & technology"
"none","0.5013333333333333","0.644","0.392","0.608","0.44","0.496","0.428"
```

````

````{tab} 4B

```bash
torchrun --nproc-per-node=8 run.py --data MMStar --model InternVL2-4B --verbose
```

The expected test results are:

```
"split","Overall","coarse perception","fine-grained perception","instance reasoning","logical reasoning","math","science & technology"
"none","0.5426666666666666","0.672","0.384","0.624","0.532","0.588","0.456"
```

````

````{tab} 8B

```bash
torchrun --nproc-per-node=8 run.py --data MMStar --model InternVL2-8B --verbose
```

The expected test results are:

```
"split","Overall","coarse perception","fine-grained perception","instance reasoning","logical reasoning","math","science & technology"
"none","0.62","0.704","0.504","0.68","0.656","0.672","0.504"
```

````

````{tab} 26B

```bash
torchrun --nproc-per-node=8 run.py --data MMStar --model InternVL2-26B --verbose
```

The expected test results are:

```
"split","Overall","coarse perception","fine-grained perception","instance reasoning","logical reasoning","math","science & technology"
"none","0.612","0.716","0.544","0.688","0.6","0.624","0.5"
```

````

````{tab} 40B

```bash
torchrun --nproc-per-node=8 run.py --data MMStar --model InternVL2-40B --verbose
```

The expected test results are:

```
"split","Overall","coarse perception","fine-grained perception","instance reasoning","logical reasoning","math","science & technology"
"none","0.654","0.692","0.528","0.716","0.696","0.72","0.572"
```

````

````{tab} 76B

```bash
torchrun --nproc-per-node=1 run.py --data MMStar --model InternVL2-76B --verbose
```

The expected test results are:

```
"split","Overall","coarse perception","fine-grained perception","instance reasoning","logical reasoning","math","science & technology"
"none","0.674","0.704","0.568","0.728","0.724","0.752","0.568"
```

````

`````
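
In the results above, `Overall` equals the plain average of the six capability scores, which is what one would expect if every capability contributes the same number of samples. The following sketch checks this for the 1B numbers; the function name `mmstar_overall` is an illustrative assumption, not a VLMEvalKit API:

```python
# Per-capability accuracies for InternVL2-1B, copied from the 1B tab above.
scores_1b = {
    "coarse perception": 0.588,
    "fine-grained perception": 0.368,
    "instance reasoning": 0.548,
    "logical reasoning": 0.352,
    "math": 0.46,
    "science & technology": 0.396,
}


def mmstar_overall(category_scores: dict) -> float:
    """Unweighted mean over the capability scores."""
    return sum(category_scores.values()) / len(category_scores)


overall = mmstar_overall(scores_1b)
print(f"{overall:.3f}")  # 0.452
```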

### OCRBench

OCRBench is a comprehensive evaluation benchmark designed to assess the OCR capabilities of MLLMs. It includes five components: Text Recognition, Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER). The benchmark encompasses data from 29 datasets, making it one of the most thorough OCR evaluation tools available. OCRBench aims to reveal both the strengths and weaknesses of MLLMs, particularly in handling multilingual text, handwritten text, non-semantic text, and mathematical expressions. The benchmark includes 1,000 question-answer pairs, all manually verified for precision.

`````{tabs}

````{tab} 1B

```bash
torchrun --nproc-per-node=8 run.py --data OCRBench --model InternVL2-1B --verbose
```

The expected test results are:

```
{
    "Text Recognition": 243,
    "Scene Text-centric VQA": 165,
    "Doc-oriented VQA": 125,
    "Key Information Extraction": 149,
    "Handwritten Mathematical Expression Recognition": 72,
    "Final Score": 754,
    "Final Score Norm": 75.4
}
```

````

````{tab} 2B

```bash
torchrun --nproc-per-node=8 run.py --data OCRBench --model InternVL2-2B --verbose
```

The expected test results are:

```
{
    "Text Recognition": 246,
    "Scene Text-centric VQA": 170,
    "Doc-oriented VQA": 133,
    "Key Information Extraction": 167,
    "Handwritten Mathematical Expression Recognition": 68,
    "Final Score": 784,
    "Final Score Norm": 78.4
}
```

````

````{tab} 4B

```bash
torchrun --nproc-per-node=8 run.py --data OCRBench --model InternVL2-4B --verbose
```

The expected test results are:

```
{
    "Text Recognition": 237,
    "Scene Text-centric VQA": 170,
    "Doc-oriented VQA": 154,
    "Key Information Extraction": 159,
    "Handwritten Mathematical Expression Recognition": 68,
    "Final Score": 788,
    "Final Score Norm": 78.8
}
```

````

````{tab} 8B

```bash
torchrun --nproc-per-node=8 run.py --data OCRBench --model InternVL2-8B --verbose
```

The expected test results are:

```
{
    "Text Recognition": 236,
    "Scene Text-centric VQA": 175,
    "Doc-oriented VQA": 156,
    "Key Information Extraction": 162,
    "Handwritten Mathematical Expression Recognition": 65,
    "Final Score": 794,
    "Final Score Norm": 79.4
}
```

````

````{tab} 26B

```bash
torchrun --nproc-per-node=8 run.py --data OCRBench --model InternVL2-26B --verbose
```

The expected test results are:

```
{
    "Text Recognition": 250,
    "Scene Text-centric VQA": 185,
    "Doc-oriented VQA": 154,
    "Key Information Extraction": 168,
    "Handwritten Mathematical Expression Recognition": 68,
    "Final Score": 825,
    "Final Score Norm": 82.5
}
```

````

````{tab} 40B

```bash
torchrun --nproc-per-node=8 run.py --data OCRBench --model InternVL2-40B --verbose
```

The expected test results are:

```
{
    "Text Recognition": 246,
    "Scene Text-centric VQA": 181,
    "Doc-oriented VQA": 160,
    "Key Information Extraction": 175,
    "Handwritten Mathematical Expression Recognition": 75,
    "Final Score": 837,
    "Final Score Norm": 83.7
}
```

````

````{tab} 76B

```bash
torchrun --nproc-per-node=1 run.py --data OCRBench --model InternVL2-76B --verbose
```

The expected test results are:

```
{
    "Text Recognition": 244,
    "Scene Text-centric VQA": 182,
    "Doc-oriented VQA": 165,
    "Key Information Extraction": 176,
    "Handwritten Mathematical Expression Recognition": 72,
    "Final Score": 839,
    "Final Score Norm": 83.9
}
```

````

`````
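
In every result above, `Final Score` is the sum of the five component scores and `Final Score Norm` is that sum divided by 10 (the benchmark has 1,000 questions). A small sketch for sanity-checking a result file, assuming its JSON has the same keys as shown above; `ocrbench_totals` is a hypothetical helper name:

```python
import json

# Result for InternVL2-1B, copied from the 1B tab above.
RESULT_JSON = '''{
    "Text Recognition": 243,
    "Scene Text-centric VQA": 165,
    "Doc-oriented VQA": 125,
    "Key Information Extraction": 149,
    "Handwritten Mathematical Expression Recognition": 72,
    "Final Score": 754,
    "Final Score Norm": 75.4
}'''


def ocrbench_totals(result: dict) -> tuple:
    """Sum the component scores and derive the normalized score."""
    components = [
        value for key, value in result.items()
        if not key.startswith("Final Score")
    ]
    total = sum(components)
    return total, total / 10


result = json.loads(RESULT_JSON)
total, norm = ocrbench_totals(result)
print(total, norm)  # 754 75.4
```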

### MMMU

The MMMU dataset is a comprehensive benchmark designed to evaluate multimodal models on college-level tasks that require domain-specific knowledge and reasoning. It includes 11,500 questions sourced from college exams, quizzes, and textbooks, spanning six disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions cover 30 subjects and feature 30 types of images, such as charts, diagrams, maps, tables, and more.

`````{tabs}

````{tab} 1B

```bash
torchrun --nproc-per-node=8 run.py --data MMMU_DEV_VAL --model InternVL2-1B --verbose
```

The expected test results are:

```
"split","Overall","Accounting","Agriculture","Architecture_and_Engineering","Art","Art_Theory","Basic_Medical_Science","Biology","Chemistry","Clinical_Medicine","Computer_Science","Design","Diagnostics_and_Laboratory_Medicine","Economics","Electronics","Energy_and_Power","Finance","Geography","History","Literature","Manage","Marketing","Materials","Math","Mechanical_Engineering","Music","Pharmacy","Physics","Psychology","Public_Health","Sociology","Art & Design","Business","Health & Medicine","Humanities & Social Science","Science","Tech & Engineering"
"dev","0.34","0.2","0.0","0.2","0.2","0.4","0.4","0.0","0.4","0.0","0.2","0.4","0.4","0.2","0.0","0.6","0.6","0.4","0.2","0.6","0.6","0.6","0.2","0.2","0.0","0.4","0.4","0.8","0.6","0.2","0.8","0.35","0.44","0.28","0.55","0.36","0.17142857142857143"
"validation","0.3688888888888889","0.2","0.2","0.23333333333333334","0.4666666666666667","0.43333333333333335","0.4666666666666667","0.3333333333333333","0.4","0.3333333333333333","0.3333333333333333","0.5333333333333333","0.4666666666666667","0.36666666666666664","0.4666666666666667","0.4","0.23333333333333334","0.4","0.43333333333333335","0.7666666666666667","0.43333333333333335","0.43333333333333335","0.4","0.16666666666666666","0.26666666666666666","0.26666666666666666","0.2","0.36666666666666664","0.26666666666666666","0.3","0.5","0.425","0.3333333333333333","0.35333333333333333","0.49166666666666664","0.3333333333333333","0.32857142857142857"
```

````

````{tab} 2B

```bash
torchrun --nproc-per-node=8 run.py --data MMMU_DEV_VAL --model InternVL2-2B --verbose
```

The expected test results are:

```
"split","Overall","Accounting","Agriculture","Architecture_and_Engineering","Art","Art_Theory","Basic_Medical_Science","Biology","Chemistry","Clinical_Medicine","Computer_Science","Design","Diagnostics_and_Laboratory_Medicine","Economics","Electronics","Energy_and_Power","Finance","Geography","History","Literature","Manage","Marketing","Materials","Math","Mechanical_Engineering","Music","Pharmacy","Physics","Psychology","Public_Health","Sociology","Art & Design","Business","Health & Medicine","Humanities & Social Science","Science","Tech & Engineering"
"dev","0.3333333333333333","0.4","0.0","0.0","0.2","0.2","0.6","0.2","0.2","0.2","0.4","0.6","0.2","0.8","0.6","0.2","0.6","0.0","0.4","0.8","0.2","0.2","0.2","0.8","0.8","0.0","0.2","0.2","0.2","0.0","0.6","0.25","0.44","0.24","0.5","0.28","0.3142857142857143"
"validation","0.36333333333333334","0.3333333333333333","0.4","0.26666666666666666","0.43333333333333335","0.36666666666666664","0.43333333333333335","0.23333333333333334","0.3","0.4","0.3","0.4666666666666667","0.36666666666666664","0.36666666666666664","0.5","0.26666666666666666","0.4","0.23333333333333334","0.43333333333333335","0.7666666666666667","0.43333333333333335","0.3333333333333333","0.3","0.4","0.23333333333333334","0.3","0.2","0.26666666666666666","0.36666666666666664","0.36666666666666664","0.43333333333333335","0.39166666666666666","0.37333333333333335","0.35333333333333333","0.5","0.2866666666666667","0.3238095238095238"
```

````

````{tab} 4B

```bash
torchrun --nproc-per-node=8 run.py --data MMMU_DEV_VAL --model InternVL2-4B --verbose
```

The expected test results are:

```
"split","Overall","Accounting","Agriculture","Architecture_and_Engineering","Art","Art_Theory","Basic_Medical_Science","Biology","Chemistry","Clinical_Medicine","Computer_Science","Design","Diagnostics_and_Laboratory_Medicine","Economics","Electronics","Energy_and_Power","Finance","Geography","History","Literature","Manage","Marketing","Materials","Math","Mechanical_Engineering","Music","Pharmacy","Physics","Psychology","Public_Health","Sociology","Art & Design","Business","Health & Medicine","Humanities & Social Science","Science","Tech & Engineering"
"validation","0.47888888888888886","0.43333333333333335","0.5333333333333333","0.3","0.6","0.6","0.43333333333333335","0.36666666666666664","0.36666666666666664","0.3333333333333333","0.4","0.9","0.4666666666666667","0.5666666666666667","0.43333333333333335","0.4666666666666667","0.4","0.36666666666666664","0.5666666666666667","0.8333333333333334","0.5666666666666667","0.43333333333333335","0.36666666666666664","0.3333333333333333","0.26666666666666666","0.3333333333333333","0.43333333333333335","0.3333333333333333","0.6666666666666666","0.5666666666666667","0.7","0.6083333333333333","0.48","0.44666666666666666","0.6916666666666667","0.35333333333333333","0.3952380952380952"
"dev","0.4866666666666667","0.2","0.2","0.4","0.6","0.6","0.8","1.0","0.4","0.0","0.4","0.6","0.2","0.6","0.4","0.4","0.4","0.0","1.0","0.8","0.6","0.6","0.2","0.6","0.6","0.4","0.4","0.2","0.8","0.6","0.6","0.55","0.48","0.4","0.8","0.44","0.37142857142857144"
```

````

````{tab} 8B

```bash
torchrun --nproc-per-node=8 run.py --data MMMU_DEV_VAL --model InternVL2-8B --verbose
```

The expected test results are:

```
"split","Overall","Accounting","Agriculture","Architecture_and_Engineering","Art","Art_Theory","Basic_Medical_Science","Biology","Chemistry","Clinical_Medicine","Computer_Science","Design","Diagnostics_and_Laboratory_Medicine","Economics","Electronics","Energy_and_Power","Finance","Geography","History","Literature","Manage","Marketing","Materials","Math","Mechanical_Engineering","Music","Pharmacy","Physics","Psychology","Public_Health","Sociology","Art & Design","Business","Health & Medicine","Humanities & Social Science","Science","Tech & Engineering"
"dev","0.49333333333333335","0.2","0.2","0.4","0.6","0.8","0.6","1.0","0.2","0.2","0.6","0.6","0.4","0.2","0.6","0.4","0.6","0.0","1.0","1.0","0.6","0.6","0.2","0.6","0.4","0.2","0.6","0.4","0.6","0.4","0.6","0.55","0.44","0.44","0.8","0.44","0.4"
"validation","0.5177777777777778","0.5333333333333333","0.5333333333333333","0.3","0.7","0.7","0.4666666666666667","0.5","0.5","0.7","0.6333333333333333","0.7","0.43333333333333335","0.5333333333333333","0.4666666666666667","0.4","0.3333333333333333","0.4666666666666667","0.7","0.9","0.5333333333333333","0.5333333333333333","0.3333333333333333","0.5","0.4","0.36666666666666664","0.3333333333333333","0.26666666666666666","0.6","0.5666666666666667","0.6","0.6166666666666667","0.49333333333333335","0.5","0.7","0.44666666666666666","0.4380952380952381"
```

````

````{tab} 26B

```bash
torchrun --nproc-per-node=8 run.py --data MMMU_DEV_VAL --model InternVL2-26B --verbose
```

The expected test results are:

```
"split","Overall","Accounting","Agriculture","Architecture_and_Engineering","Art","Art_Theory","Basic_Medical_Science","Biology","Chemistry","Clinical_Medicine","Computer_Science","Design","Diagnostics_and_Laboratory_Medicine","Economics","Electronics","Energy_and_Power","Finance","Geography","History","Literature","Manage","Marketing","Materials","Math","Mechanical_Engineering","Music","Pharmacy","Physics","Psychology","Public_Health","Sociology","Art & Design","Business","Health & Medicine","Humanities & Social Science","Science","Tech & Engineering"
"dev","0.5266666666666666","0.4","0.4","0.2","0.8","0.8","0.6","0.4","0.4","0.0","0.6","0.6","0.2","0.2","0.6","0.4","1.0","0.0","1.0","0.8","0.6","0.6","0.4","0.6","0.8","0.6","0.6","0.4","0.8","0.4","0.6","0.7","0.56","0.36","0.8","0.36","0.4857142857142857"
"validation","0.5122222222222222","0.43333333333333335","0.4666666666666667","0.26666666666666666","0.8","0.8666666666666667","0.5666666666666667","0.5666666666666667","0.3333333333333333","0.5666666666666667","0.4666666666666667","0.8333333333333334","0.36666666666666664","0.4","0.5","0.4666666666666667","0.4","0.5333333333333333","0.7","0.9","0.5666666666666667","0.4666666666666667","0.36666666666666664","0.3333333333333333","0.4","0.3","0.3333333333333333","0.3333333333333333","0.6","0.6","0.6333333333333333","0.7","0.4533333333333333","0.4866666666666667","0.7083333333333334","0.42","0.41904761904761906"
```

````

````{tab} 40B

```bash
torchrun --nproc-per-node=8 run.py --data MMMU_DEV_VAL --model InternVL2-40B --verbose
```

The expected test results are:

```
"split","Overall","Accounting","Agriculture","Architecture_and_Engineering","Art","Art_Theory","Basic_Medical_Science","Biology","Chemistry","Clinical_Medicine","Computer_Science","Design","Diagnostics_and_Laboratory_Medicine","Economics","Electronics","Energy_and_Power","Finance","Geography","History","Literature","Manage","Marketing","Materials","Math","Mechanical_Engineering","Music","Pharmacy","Physics","Psychology","Public_Health","Sociology","Art & Design","Business","Health & Medicine","Humanities & Social Science","Science","Tech & Engineering"
"validation","0.5522222222222222","0.4","0.6","0.36666666666666664","0.7","0.8666666666666667","0.5333333333333333","0.5333333333333333","0.4666666666666667","0.6","0.5666666666666667","0.7333333333333333","0.36666666666666664","0.6","0.4666666666666667","0.4666666666666667","0.43333333333333335","0.5333333333333333","0.7666666666666667","0.8333333333333334","0.4666666666666667","0.5666666666666667","0.3333333333333333","0.43333333333333335","0.36666666666666664","0.3","0.7","0.5333333333333333","0.6333333333333333","0.8","0.6","0.65","0.49333333333333335","0.6","0.7083333333333334","0.5","0.4523809523809524"
"dev","0.54","0.2","0.2","0.4","1.0","0.8","0.8","0.6","0.2","0.4","0.6","0.6","0.4","0.2","0.4","0.4","0.8","0.0","1.0","1.0","0.6","0.6","0.4","0.4","0.8","0.4","0.8","0.4","0.8","0.4","0.6","0.7","0.48","0.56","0.85","0.32","0.45714285714285713"
```

````

````{tab} 76B

```bash
torchrun --nproc-per-node=1 run.py --data MMMU_DEV_VAL --model InternVL2-76B --verbose
```

The expected test results are:

```
"split","Overall","Accounting","Agriculture","Architecture_and_Engineering","Art","Art_Theory","Basic_Medical_Science","Biology","Chemistry","Clinical_Medicine","Computer_Science","Design","Diagnostics_and_Laboratory_Medicine","Economics","Electronics","Energy_and_Power","Finance","Geography","History","Literature","Manage","Marketing","Materials","Math","Mechanical_Engineering","Music","Pharmacy","Physics","Psychology","Public_Health","Sociology","Art & Design","Business","Health & Medicine","Humanities & Social Science","Science","Tech & Engineering"
"validation","0.5822222222222222","0.5","0.6333333333333333","0.4666666666666667","0.7666666666666667","0.9666666666666667","0.5333333333333333","0.5","0.5","0.6666666666666666","0.6333333333333333","0.7666666666666667","0.43333333333333335","0.5333333333333333","0.6","0.4","0.6333333333333333","0.4666666666666667","0.7","0.9","0.7333333333333333","0.6","0.3","0.3","0.4666666666666667","0.3333333333333333","0.5666666666666667","0.5333333333333333","0.7","0.7","0.6333333333333333","0.7083333333333334","0.6","0.58","0.7333333333333333","0.46","0.5"
"dev","0.5666666666666667","0.2","0.2","0.4","0.8","0.8","0.8","1.0","0.2","0.4","0.6","0.6","0.6","0.2","0.4","0.4","1.0","0.0","1.0","1.0","0.8","0.4","0.2","0.6","1.0","0.2","0.6","0.4","0.8","0.6","0.8","0.6","0.52","0.6","0.9","0.44","0.45714285714285713"
```

````

`````
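
The last six columns of each table are discipline-level scores. In the numbers above they match the plain average of the subjects in that discipline. The sketch below checks this for Art & Design on the 1B validation row, assuming that discipline groups Art, Art_Theory, Design, and Music (an assumption taken from the MMMU subject taxonomy, not stated in this document); `discipline_mean` is a hypothetical helper:

```python
import math

# Validation accuracies for InternVL2-1B, copied from the 1B tab above.
# Assumed grouping: Art & Design = Art, Art_Theory, Design, Music.
art_and_design_subjects = {
    "Art": 0.4666666666666667,
    "Art_Theory": 0.43333333333333335,
    "Design": 0.5333333333333333,
    "Music": 0.26666666666666666,
}
reported_art_and_design = 0.425  # "Art & Design" column, validation row


def discipline_mean(subject_scores: dict) -> float:
    """Unweighted mean of the subject accuracies in one discipline."""
    return sum(subject_scores.values()) / len(subject_scores)


computed = discipline_mean(art_and_design_subjects)
print(math.isclose(computed, reported_art_and_design, abs_tol=1e-9))  # True
```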

### RealWorldQA

The RealWorldQA dataset is a benchmark designed to evaluate the real-world spatial understanding capabilities of multimodal AI models. It consists of over 700 images, each accompanied by a question and a verifiable answer, focusing on various real-world scenarios, including those captured from vehicles. This dataset aims to test how well AI models comprehend physical environments and spatial relations, enhancing their ability to interpret and analyze real-world scenes.

`````{tabs}

````{tab} 1B

```bash
torchrun --nproc-per-node=8 run.py --data RealWorldQA --model InternVL2-1B --verbose
```

The expected test results are:

```
"split","Overall"
"none","0.5032679738562091"
```

````

````{tab} 2B

```bash
torchrun --nproc-per-node=8 run.py --data RealWorldQA --model InternVL2-2B --verbose
```

The expected test results are:

```
"split","Overall"
"none","0.5725490196078431"
```

````

````{tab} 4B

```bash
torchrun --nproc-per-node=8 run.py --data RealWorldQA --model InternVL2-4B --verbose
```

The expected test results are:

```
"split","Overall"
"none","0.6065359477124183"
```

````

````{tab} 8B

```bash
torchrun --nproc-per-node=8 run.py --data RealWorldQA --model InternVL2-8B --verbose
```

The expected test results are:

```
"split","Overall"
"none","0.6444444444444445"
```

````

````{tab} 26B

```bash
torchrun --nproc-per-node=8 run.py --data RealWorldQA --model InternVL2-26B --verbose
```

The expected test results are:

```
"split","Overall"
"none","0.6836601307189543"
```

````

````{tab} 40B

```bash
torchrun --nproc-per-node=8 run.py --data RealWorldQA --model InternVL2-40B --verbose
```

The expected test results are:

```
"split","Overall"
"none","0.7176470588235294"
```

````

````{tab} 76B

```bash
torchrun --nproc-per-node=1 run.py --data RealWorldQA --model InternVL2-76B --verbose
```

The expected test results are:

```
"split","Overall"
"none","0.7215686274509804"
```

````

`````

### MMVet (GPT-4-Turbo)

The MM-Vet dataset is a comprehensive benchmark designed to evaluate the integrated capabilities of MLLMs. It encompasses six core vision-language (VL) capabilities: recognition, knowledge, optical character recognition (OCR), spatial awareness, language generation, and math. The dataset includes 200 images and 218 questions, each requiring one or more of these capabilities to answer. The evaluation uses an open-ended LLM-based approach, allowing assessment across various answer styles and question types.

`````{tabs}

````{tab} 1B

```bash
torchrun --nproc-per-node=8 run.py --data MMVet --model InternVL2-1B --verbose
```

The expected test results are:

```
"Category","tot","acc"
"rec","187","37.27272727272725"
"ocr","108","37.96296296296297"
"know","84","14.76190476190476"
"gen","80","14.624999999999996"
"spat","75","33.733333333333334"
"math","26","22.692307692307693"
"Overall","218","33.25688073394493"
```

````

````{tab} 2B

```bash
torchrun --nproc-per-node=8 run.py --data MMVet --model InternVL2-2B --verbose
```

The expected test results are:

```
"Category","tot","acc"
"rec","187","41.71122994652404"
"ocr","108","44.62962962962963"
"know","84","24.999999999999993"
"gen","80","26.25"
"spat","75","40.800000000000004"
"math","26","30.76923076923077"
"Overall","218","39.541284403669714"
```

````

````{tab} 4B

```bash
torchrun --nproc-per-node=8 run.py --data MMVet --model InternVL2-4B --verbose
```

The expected test results are:

```
"Category","tot","acc"
"rec","187","50.000000000000036"
"ocr","108","58.611111111111114"
"know","84","37.26190476190476"
"gen","80","36.499999999999986"
"spat","75","47.20000000000001"
"math","26","57.30769230769231"
"Overall","218","51.00917431192664"
```

````

````{tab} 8B

```bash
torchrun --nproc-per-node=8 run.py --data MMVet --model InternVL2-8B --verbose
```

The expected test results are:

```
"Category","tot","acc"
"rec","187","51.81818181818184"
"ocr","108","63.42592592592594"
"know","84","36.904761904761905"
"gen","80","35.87499999999999"
"spat","75","61.86666666666667"
"math","26","60.769230769230774"
"Overall","218","54.174311926605526"
```

````

````{tab} 26B

```bash
torchrun --nproc-per-node=8 run.py --data MMVet --model InternVL2-26B --verbose
```

The expected test results are:

```
"Category","tot","acc"
"rec","187","62.67379679144389"
"ocr","108","69.72222222222223"
"know","84","50.119047619047606"
"gen","80","48.62499999999999"
"spat","75","61.066666666666656"
"math","26","61.53846153846154"
"Overall","218","62.1100917431193"
```

````

````{tab} 40B

```bash
torchrun --nproc-per-node=8 run.py --data MMVet --model InternVL2-40B --verbose
```

The expected test results are:

```
"Category","tot","acc"
"rec","187","66.25668449197867"
"ocr","108","70.18518518518522"
"know","84","54.40476190476189"
"gen","80","54.74999999999998"
"spat","75","68.53333333333332"
"math","26","64.23076923076924"
"Overall","218","65.50458715596335"
```

````

````{tab} 76B

```bash
torchrun --nproc-per-node=1 run.py --data MMVet --model InternVL2-76B --verbose
```

The expected test results are:

```
"Category","tot","acc"
"rec","187","65.66844919786104"
"ocr","108","70.09259259259262"
"know","84","58.3333333333333"
"gen","80","58.49999999999997"
"spat","75","60.79999999999999"
"math","26","75.76923076923077"
"Overall","218","65.7339449541285"
```

````

`````

Note that the GPT-4 version used for scoring here differs from the one used by the official server, so scores obtained with VLMEvalKit will differ slightly from the official ones.

### LLaVA-Bench (GPT-4-Turbo)

The LLaVA-Bench-in-the-Wild dataset is designed to evaluate how well MLLMs handle complex and diverse visual tasks. It includes a set of 24 images with 60 associated questions, covering a range of indoor and outdoor scenes, memes, paintings, and sketches. Each image is paired with a detailed, manually curated description and questions that test the model's generalizability to novel domains.

`````{tabs}

````{tab} 1B

```bash
torchrun --nproc-per-node=8 run.py --data LLaVABench --model InternVL2-1B --verbose
```

The expected test results are:

```
"split","Relative Score (main)","VLM Score","GPT4 Score"
"overall","51.6","39.5","76.5"
"detail","58.9","37.3","63.3"
"conv","43.0","40.0","92.9"
"complex","54.9","40.4","73.6"
```

````

````{tab} 2B

```bash
torchrun --nproc-per-node=8 run.py --data LLaVABench --model InternVL2-2B --verbose
```

The expected test results are:

```
"split","Relative Score (main)","VLM Score","GPT4 Score"
"overall","62.5","47.8","76.5"
"detail","61.8","42.0","68.0"
"complex","63.5","46.1","72.5"
"conv","61.7","55.9","90.6"
```

````

````{tab} 4B

```bash
torchrun --nproc-per-node=8 run.py --data LLaVABench --model InternVL2-4B --verbose
```

The expected test results are:

```
"split","Relative Score (main)","VLM Score","GPT4 Score"
"overall","68.2","51.0","74.8"
"conv","62.3","55.3","88.8"
"detail","65.3","42.7","65.3"
"complex","74.0","52.9","71.4"
```

````

````{tab} 8B

```bash
torchrun --nproc-per-node=8 run.py --data LLaVABench --model InternVL2-8B --verbose
```

The expected test results are:

```
"split","Relative Score (main)","VLM Score","GPT4 Score"
"overall","73.2","53.3","72.8"
"complex","86.1","61.8","71.8"
"conv","61.6","54.7","88.8"
"detail","63.5","36.0","56.7"
```

````

````{tab} 26B

```bash
torchrun --nproc-per-node=8 run.py --data LLaVABench --model InternVL2-26B --verbose
```

The expected test results are:

```
"split","Relative Score (main)","VLM Score","GPT4 Score"
"overall","92.3","68.0","73.7"
"detail","85.6","51.3","60.0"
"complex","99.0","73.6","74.3"
"conv","86.8","73.5","84.7"
```

````

````{tab} 40B

```bash
torchrun --nproc-per-node=8 run.py --data LLaVABench --model InternVL2-40B --verbose
```

The expected test results are:

```
"split","Relative Score (main)","VLM Score","GPT4 Score"
"overall","100.5","72.7","72.3"
"detail","90.4","56.7","62.7"
"complex","104.4","76.1","72.9"
"conv","101.5","81.2","80.0"
```

````

````{tab} 76B

```bash
torchrun --nproc-per-node=1 run.py --data LLaVABench --model InternVL2-76B --verbose
```

The expected test results are:

```
"split","Relative Score (main)","VLM Score","GPT4 Score"
"overall","99.3","71.7","72.2"
"detail","92.1","54.7","59.3"
"complex","107.7","79.6","73.9"
"conv","91.2","73.5","80.6"
```

````

`````
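
The three columns in the tables above are related: the numbers are consistent with `Relative Score (main)` being the ratio of `VLM Score` to `GPT4 Score`, expressed as a percentage. The sketch below recomputes it from the 1B rows. Because the printed scores are already rounded to one decimal place, the recomputed value can differ from the reported one by about 0.1; `relative_score` is a hypothetical helper, not a VLMEvalKit function:

```python
# (split, reported relative score, VLM score, GPT4 score) for InternVL2-1B,
# copied from the 1B tab above.
rows_1b = [
    ("overall", 51.6, 39.5, 76.5),
    ("detail", 58.9, 37.3, 63.3),
    ("conv", 43.0, 40.0, 92.9),
    ("complex", 54.9, 40.4, 73.6),
]


def relative_score(vlm_score: float, gpt4_score: float) -> float:
    """Model score as a percentage of the GPT-4 reference score."""
    return vlm_score / gpt4_score * 100


for split, reported, vlm, gpt4 in rows_1b:
    recomputed = relative_score(vlm, gpt4)
    print(f"{split}: reported={reported}, recomputed={recomputed:.2f}")
```

A relative score above 100, as in some 40B and 76B rows, means the judge rated the model's answers higher on average than the GPT-4 reference answers.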

### VideoMME

Video-MME is a comprehensive benchmark for evaluating MLLMs on video analysis, and the first one tailored specifically to that purpose. It assesses how well models process sequential visual data. Results are reported separately for short, medium, and long videos, each broken down by domain, sub-category, and task type.

`````{tabs}

````{tab} 1B

When testing without subtitles:

```bash
torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-1B --verbose --nframe 16
```

The expected test results are:

```
{
    "short": {
        "overall": "0.5289",
        "domain": {
            "Knowledge": "0.5481",
            "Film & Television": "0.6167",
            "Sports Competition": "0.4667",
            "Artistic Performance": "0.5333",
            "Life Record": "0.5143",
            "Multilingual": "0.4000"
        },
        "sub_category": {
            "Humanity & History": "0.3333",
            "Literature & Art": "0.4000",
            "Biology & Medicine": "0.7000",
            "Finance & Commerce": "0.6333",
            "Astronomy": "0.5667",
            "Geography": "0.5333",
            "Law": "0.6000",
            "Life Tip": "0.5333",
            "Technology": "0.6333",
            "Animation": "0.6000",
            "Movie & TV Show": "0.7333",
            "Documentary": "0.5333",
            "News Report": "0.6000",
            "Esports": "0.3667",
            "Basketball": "0.3667",
            "Football": "0.5333",
            "Athletics": "0.5333",
            "Other Sports": "0.5333",
            "Stage Play": "0.7333",
            "Magic Show": "0.3333",
            "Variety Show": "0.6333",
            "Acrobatics": "0.4333",
            "Handicraft": "0.4667",
            "Food": "0.5000",
            "Fashion": "0.6333",
            "Daily Life": "0.4000",
            "Travel": "0.6333",
            "Pet & Animal": "0.6667",
            "Exercise": "0.3000",
            "Multilingual": "0.4000"
        },
        "task_type": {
            "Temporal Perception": "0.6667",
            "Spatial Perception": "0.6000",
            "Attribute Perception": "0.6721",
            "Action Recognition": "0.4427",
            "Object Recognition": "0.4821",
            "OCR Problems": "0.6316",
            "Counting Problem": "0.3040",
            "Temporal Reasoning": "0.6154",
            "Spatial Reasoning": "0.6667",
            "Action Reasoning": "0.6170",
            "Object Reasoning": "0.4750",
            "Information Synopsis": "0.7073"
        }
    },
    "medium": {
        "overall": "0.4144",
        "domain": {
            "Knowledge": "0.3630",
            "Film & Television": "0.5250",
            "Sports Competition": "0.3933",
            "Artistic Performance": "0.4750",
            "Life Record": "0.3952",
            "Multilingual": "0.4333"
        },
        "sub_category": {
            "Humanity & History": "0.2000",
            "Literature & Art": "0.4000",
            "Biology & Medicine": "0.5000",
            "Finance & Commerce": "0.4333",
            "Astronomy": "0.4333",
            "Geography": "0.2333",
            "Law": "0.4000",
            "Life Tip": "0.4333",
            "Technology": "0.2333",
            "Animation": "0.3333",
            "Movie & TV Show": "0.5333",
            "Documentary": "0.6000",
            "News Report": "0.6333",
            "Esports": "0.5000",
            "Basketball": "0.1333",
            "Football": "0.4333",
            "Athletics": "0.3333",
            "Other Sports": "0.5667",
            "Stage Play": "0.5667",
            "Magic Show": "0.3333",
            "Variety Show": "0.5000",
            "Acrobatics": "0.5000",
            "Handicraft": "0.4667",
            "Food": "0.3000",
            "Fashion": "0.3667",
            "Daily Life": "0.3333",
            "Travel": "0.4333",
            "Pet & Animal": "0.4000",
            "Exercise": "0.4667",
            "Multilingual": "0.4333"
        },
        "task_type": {
            "Temporal Perception": "0.3871",
            "Spatial Perception": "0.6190",
            "Attribute Perception": "0.4110",
            "Action Recognition": "0.3613",
            "Object Recognition": "0.5000",
            "OCR Problems": "0.4706",
            "Counting Problem": "0.2526",
            "Temporal Reasoning": "0.2740",
            "Spatial Reasoning": "0.6667",
            "Action Reasoning": "0.3276",
            "Object Reasoning": "0.4179",
            "Information Synopsis": "0.5897"
        }
    },
    "long": {
        "overall": "0.3333",
        "domain": {
            "Knowledge": "0.3259",
            "Film & Television": "0.3250",
            "Sports Competition": "0.3000",
            "Artistic Performance": "0.3167",
            "Life Record": "0.3762",
            "Multilingual": "0.3667"
        },
        "sub_category": {
            "Humanity & History": "0.3333",
            "Literature & Art": "0.3667",
            "Biology & Medicine": "0.3333",
            "Finance & Commerce": "0.4667",
            "Astronomy": "0.2000",
            "Geography": "0.3000",
            "Law": "0.2667",
            "Life Tip": "0.3000",
            "Technology": "0.3667",
            "Animation": "0.2000",
            "Movie & TV Show": "0.4667",
            "Documentary": "0.4000",
            "News Report": "0.2333",
            "Esports": "0.4000",
            "Basketball": "0.3333",
            "Football": "0.2333",
            "Athletics": "0.1333",
            "Other Sports": "0.4000",
            "Stage Play": "0.4000",
            "Magic Show": "0.2667",
            "Variety Show": "0.1333",
            "Acrobatics": "0.4667",
            "Handicraft": "0.5333",
            "Food": "0.4333",
            "Fashion": "0.3333",
            "Daily Life": "0.3667",
            "Travel": "0.2000",
            "Pet & Animal": "0.4333",
            "Exercise": "0.3333",
            "Multilingual": "0.3667"
        },
        "task_type": {
            "Temporal Perception": "0.3333",
            "Spatial Perception": "0.0000",
            "Attribute Perception": "0.5185",
            "Action Recognition": "0.3016",
            "Object Recognition": "0.2963",
            "OCR Problems": "0.5000",
            "Counting Problem": "0.1250",
            "Temporal Reasoning": "0.2857",
            "Spatial Reasoning": "0.6364",
            "Action Reasoning": "0.2556",
            "Object Reasoning": "0.3042",
            "Information Synopsis": "0.5153"
        }
    },
    "overall": {
        "overall": "0.4256",
        "domain": {
            "Knowledge": "0.4123",
            "Film & Television": "0.4889",
            "Sports Competition": "0.3867",
            "Artistic Performance": "0.4417",
            "Life Record": "0.4286",
            "Multilingual": "0.4000"
        },
        "sub_category": {
            "Humanity & History": "0.2889",
            "Literature & Art": "0.3889",
            "Biology & Medicine": "0.5111",
            "Finance & Commerce": "0.5111",
            "Astronomy": "0.4000",
            "Geography": "0.3556",
            "Law": "0.4222",
            "Life Tip": "0.4222",
            "Technology": "0.4111",
            "Animation": "0.3778",
            "Movie & TV Show": "0.5778",
            "Documentary": "0.5111",
            "News Report": "0.4889",
            "Esports": "0.4222",
            "Basketball": "0.2778",
            "Football": "0.4000",
            "Athletics": "0.3333",
            "Other Sports": "0.5000",
            "Stage Play": "0.5667",
            "Magic Show": "0.3111",
            "Variety Show": "0.4222",
            "Acrobatics": "0.4667",
            "Handicraft": "0.4889",
            "Food": "0.4111",
            "Fashion": "0.4444",
            "Daily Life": "0.3667",
            "Travel": "0.4222",
            "Pet & Animal": "0.5000",
            "Exercise": "0.3667",
            "Multilingual": "0.4000"
        },
        "task_type": {
            "Temporal Perception": "0.4727",
            "Spatial Perception": "0.5741",
            "Attribute Perception": "0.5676",
            "Action Recognition": "0.3834",
            "Object Recognition": "0.4605",
            "OCR Problems": "0.5396",
            "Counting Problem": "0.2537",
            "Temporal Reasoning": "0.3051",
            "Spatial Reasoning": "0.6607",
            "Action Reasoning": "0.3298",
            "Object Reasoning": "0.3678",
            "Information Synopsis": "0.5820"
        }
    }
}
```
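
To see at a glance where the model struggles, the `task_type` scores can be ranked. The sketch below is illustrative only: the values are copied from the `overall` entry of the result block above, and the variable names are not part of VLMEvalKit.

```python
# "task_type" scores from the "overall" entry above (InternVL2-1B, no subtitles).
task_type = {
    "Temporal Perception": "0.4727",
    "Spatial Perception": "0.5741",
    "Attribute Perception": "0.5676",
    "Action Recognition": "0.3834",
    "Object Recognition": "0.4605",
    "OCR Problems": "0.5396",
    "Counting Problem": "0.2537",
    "Temporal Reasoning": "0.3051",
    "Spatial Reasoning": "0.6607",
    "Action Reasoning": "0.3298",
    "Object Reasoning": "0.3678",
    "Information Synopsis": "0.5820",
}

# Scores are stored as strings, so convert to float before sorting (ascending).
ranked = sorted(task_type.items(), key=lambda item: float(item[1]))
weakest = [name for name, _ in ranked[:3]]
strongest = [name for name, _ in ranked[-3:]]

print("weakest:", weakest)
print("strongest:", strongest)
```

The same ranking can be applied to the `domain` or `sub_category` entries, since they use the same string-valued layout.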

When testing with subtitles:

```bash
torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-1B --verbose --nframe 16 --use-subtitle
```

The expected test results are:

```
{
    "short": {
        "overall": "0.5433",
        "domain": {
            "Knowledge": "0.5630",
            "Film & Television": "0.6000",
            "Sports Competition": "0.4933",
            "Artistic Performance": "0.5167",
            "Life Record": "0.5571",
            "Multilingual": "0.4000"
        },
        "sub_category": {
            "Humanity & History": "0.3333",
            "Literature & Art": "0.4000",
            "Biology & Medicine": "0.7667",
            "Finance & Commerce": "0.6000",
            "Astronomy": "0.6000",
            "Geography": "0.5000",
            "Law": "0.6667",
            "Life Tip": "0.6000",
            "Technology": "0.6000",
            "Animation": "0.5667",
            "Movie & TV Show": "0.7333",
            "Documentary": "0.5000",
            "News Report": "0.6000",
            "Esports": "0.4333",
            "Basketball": "0.4000",
            "Football": "0.5000",
            "Athletics": "0.5000",
            "Other Sports": "0.6333",
            "Stage Play": "0.7667",
            "Magic Show": "0.3333",
            "Variety Show": "0.5333",
            "Acrobatics": "0.4333",
            "Handicraft": "0.5000",
            "Food": "0.6000",
            "Fashion": "0.6333",
            "Daily Life": "0.4333",
            "Travel": "0.7333",
            "Pet & Animal": "0.6667",
            "Exercise": "0.3333",
            "Multilingual": "0.4000"
        },
        "task_type": {
            "Temporal Perception": "0.5556",
            "Spatial Perception": "0.5667",
            "Attribute Perception": "0.6557",
            "Action Recognition": "0.4656",
            "Object Recognition": "0.5238",
            "OCR Problems": "0.6667",
            "Counting Problem": "0.3120",
            "Temporal Reasoning": "0.4615",
            "Spatial Reasoning": "0.6296",
            "Action Reasoning": "0.5957",
            "Object Reasoning": "0.5375",
            "Information Synopsis": "0.7561"
        }
    },
    "medium": {
        "overall": "0.4289",
        "domain": {
            "Knowledge": "0.4111",
            "Film & Television": "0.5250",
            "Sports Competition": "0.4000",
            "Artistic Performance": "0.4917",
            "Life Record": "0.3714",
            "Multilingual": "0.5000"
        },
        "sub_category": {
            "Humanity & History": "0.3667",
            "Literature & Art": "0.4333",
            "Biology & Medicine": "0.5667",
            "Finance & Commerce": "0.5000",
            "Astronomy": "0.5333",
            "Geography": "0.3333",
            "Law": "0.3333",
            "Life Tip": "0.4000",
            "Technology": "0.2333",
            "Animation": "0.2667",
            "Movie & TV Show": "0.5000",
            "Documentary": "0.6333",
            "News Report": "0.7000",
            "Esports": "0.5000",
            "Basketball": "0.1667",
            "Football": "0.4333",
            "Athletics": "0.3667",
            "Other Sports": "0.5333",
            "Stage Play": "0.6333",
            "Magic Show": "0.4333",
            "Variety Show": "0.4333",
            "Acrobatics": "0.4667",
            "Handicraft": "0.5000",
            "Food": "0.3333",
            "Fashion": "0.3333",
            "Daily Life": "0.3000",
            "Travel": "0.4000",
            "Pet & Animal": "0.3000",
            "Exercise": "0.4333",
            "Multilingual": "0.5000"
        },
        "task_type": {
            "Temporal Perception": "0.4194",
            "Spatial Perception": "0.6667",
            "Attribute Perception": "0.4658",
            "Action Recognition": "0.3613",
            "Object Recognition": "0.4924",
            "OCR Problems": "0.4265",
            "Counting Problem": "0.2632",
            "Temporal Reasoning": "0.2877",
            "Spatial Reasoning": "0.7222",
            "Action Reasoning": "0.3276",
            "Object Reasoning": "0.4403",
            "Information Synopsis": "0.6538"
        }
    },
    "long": {
        "overall": "0.3689",
        "domain": {
            "Knowledge": "0.3852",
            "Film & Television": "0.3833",
            "Sports Competition": "0.3267",
            "Artistic Performance": "0.3417",
            "Life Record": "0.3905",
            "Multilingual": "0.3333"
        },
        "sub_category": {
            "Humanity & History": "0.2333",
            "Literature & Art": "0.4333",
            "Biology & Medicine": "0.4333",
            "Finance & Commerce": "0.6000",
            "Astronomy": "0.2667",
            "Geography": "0.2667",
            "Law": "0.5000",
            "Life Tip": "0.4333",
            "Technology": "0.3000",
            "Animation": "0.2667",
            "Movie & TV Show": "0.4667",
            "Documentary": "0.5000",
            "News Report": "0.3000",
            "Esports": "0.3667",
            "Basketball": "0.2667",
            "Football": "0.3667",
            "Athletics": "0.2000",
            "Other Sports": "0.4333",
            "Stage Play": "0.4333",
            "Magic Show": "0.2333",
            "Variety Show": "0.2333",
            "Acrobatics": "0.4667",
            "Handicraft": "0.4667",
            "Food": "0.4333",
            "Fashion": "0.3667",
            "Daily Life": "0.4000",
            "Travel": "0.1667",
            "Pet & Animal": "0.5333",
            "Exercise": "0.3667",
            "Multilingual": "0.3333"
        },
        "task_type": {
            "Temporal Perception": "0.3333",
            "Spatial Perception": "0.0000",
            "Attribute Perception": "0.5185",
            "Action Recognition": "0.3016",
            "Object Recognition": "0.3148",
            "OCR Problems": "0.2857",
            "Counting Problem": "0.1875",
            "Temporal Reasoning": "0.2637",
            "Spatial Reasoning": "0.5455",
            "Action Reasoning": "0.3278",
            "Object Reasoning": "0.3667",
            "Information Synopsis": "0.5521"
        }
    },
    "overall": {
        "overall": "0.4470",
        "domain": {
            "Knowledge": "0.4531",
            "Film & Television": "0.5028",
            "Sports Competition": "0.4067",
            "Artistic Performance": "0.4500",
            "Life Record": "0.4397",
            "Multilingual": "0.4111"
        },
        "sub_category": {
            "Humanity & History": "0.3111",
            "Literature & Art": "0.4222",
            "Biology & Medicine": "0.5889",
            "Finance & Commerce": "0.5667",
            "Astronomy": "0.4667",
            "Geography": "0.3667",
            "Law": "0.5000",
            "Life Tip": "0.4778",
            "Technology": "0.3778",
            "Animation": "0.3667",
            "Movie & TV Show": "0.5667",
            "Documentary": "0.5444",
            "News Report": "0.5333",
            "Esports": "0.4333",
            "Basketball": "0.2778",
            "Football": "0.4333",
            "Athletics": "0.3556",
            "Other Sports": "0.5333",
            "Stage Play": "0.6111",
            "Magic Show": "0.3333",
            "Variety Show": "0.4000",
            "Acrobatics": "0.4556",
            "Handicraft": "0.4889",
            "Food": "0.4556",
            "Fashion": "0.4444",
            "Daily Life": "0.3778",
            "Travel": "0.4333",
            "Pet & Animal": "0.5000",
            "Exercise": "0.3778",
            "Multilingual": "0.4111"
        },
        "task_type": {
            "Temporal Perception": "0.4545",
            "Spatial Perception": "0.5741",
            "Attribute Perception": "0.5766",
            "Action Recognition": "0.3930",
            "Object Recognition": "0.4802",
            "OCR Problems": "0.5108",
            "Counting Problem": "0.2724",
            "Temporal Reasoning": "0.2881",
            "Spatial Reasoning": "0.6429",
            "Action Reasoning": "0.3719",
            "Object Reasoning": "0.4185",
            "Information Synopsis": "0.6285"
        }
    }
}
```
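
To quantify how much subtitles help, the top-level scores of the two InternVL2-1B runs can be compared directly. This is a minimal sketch: the numbers are copied from the `medium`, `long`, and `overall` entries of the two result blocks above, and the variable names are illustrative.

```python
# "overall" accuracy for InternVL2-1B, copied from the two result blocks above.
without_subs = {"medium": 0.4144, "long": 0.3333, "overall": 0.4256}
with_subs = {"medium": 0.4289, "long": 0.3689, "overall": 0.4470}

# Absolute gain from --use-subtitle, expressed in percentage points.
gain = {
    split: round((with_subs[split] - without_subs[split]) * 100, 2)
    for split in without_subs
}

for split, delta in gain.items():
    print(f"{split:>7}: {delta:+.2f} points")
```

On these figures the gain from subtitles is larger on long videos than on medium ones.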

````

````{tab} 2B

When testing without subtitles:

```bash
torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-2B --verbose --nframe 16
```

The expected test results are:

```
{
    "short": {
        "overall": "0.5756",
        "domain": {
            "Knowledge": "0.5593",
            "Film & Television": "0.6417",
            "Sports Competition": "0.5800",
            "Artistic Performance": "0.5917",
            "Life Record": "0.5810",
            "Multilingual": "0.3333"
        },
        "sub_category": {
            "Humanity & History": "0.5333",
            "Literature & Art": "0.4333",
            "Biology & Medicine": "0.6667",
            "Finance & Commerce": "0.4667",
            "Astronomy": "0.5333",
            "Geography": "0.6000",
            "Law": "0.5667",
            "Life Tip": "0.6667",
            "Technology": "0.5667",
            "Animation": "0.6000",
            "Movie & TV Show": "0.6000",
            "Documentary": "0.6000",
            "News Report": "0.7667",
            "Esports": "0.5667",
            "Basketball": "0.4667",
            "Football": "0.6333",
            "Athletics": "0.5667",
            "Other Sports": "0.6667",
            "Stage Play": "0.7333",
            "Magic Show": "0.4333",
            "Variety Show": "0.6667",
            "Acrobatics": "0.5333",
            "Handicraft": "0.4000",
            "Food": "0.6000",
            "Fashion": "0.5333",
            "Daily Life": "0.6667",
            "Travel": "0.6000",
            "Pet & Animal": "0.7667",
            "Exercise": "0.5000",
            "Multilingual": "0.3333"
        },
        "task_type": {
            "Temporal Perception": "0.7222",
            "Spatial Perception": "0.7333",
            "Attribute Perception": "0.6967",
            "Action Recognition": "0.5115",
            "Object Recognition": "0.5536",
            "OCR Problems": "0.7368",
            "Counting Problem": "0.3120",
            "Temporal Reasoning": "0.3846",
            "Spatial Reasoning": "0.7407",
            "Action Reasoning": "0.6809",
            "Object Reasoning": "0.5375",
            "Information Synopsis": "0.6951"
        }
    },
    "medium": {
        "overall": "0.4067",
        "domain": {
            "Knowledge": "0.3741",
            "Film & Television": "0.4917",
            "Sports Competition": "0.3333",
            "Artistic Performance": "0.5417",
            "Life Record": "0.3762",
            "Multilingual": "0.4000"
        },
        "sub_category": {
            "Humanity & History": "0.2000",
            "Literature & Art": "0.4333",
            "Biology & Medicine": "0.4000",
            "Finance & Commerce": "0.3667",
            "Astronomy": "0.4000",
            "Geography": "0.3000",
            "Law": "0.5333",
            "Life Tip": "0.5000",
            "Technology": "0.2333",
            "Animation": "0.3000",
            "Movie & TV Show": "0.5667",
            "Documentary": "0.5000",
            "News Report": "0.6000",
            "Esports": "0.3333",
            "Basketball": "0.2000",
            "Football": "0.2667",
            "Athletics": "0.5000",
            "Other Sports": "0.3667",
            "Stage Play": "0.6667",
            "Magic Show": "0.5000",
            "Variety Show": "0.5000",
            "Acrobatics": "0.5000",
            "Handicraft": "0.4333",
            "Food": "0.2000",
            "Fashion": "0.2667",
            "Daily Life": "0.3333",
            "Travel": "0.4333",
            "Pet & Animal": "0.3667",
            "Exercise": "0.6000",
            "Multilingual": "0.4000"
        },
        "task_type": {
            "Temporal Perception": "0.2903",
            "Spatial Perception": "0.5238",
            "Attribute Perception": "0.4932",
            "Action Recognition": "0.3025",
            "Object Recognition": "0.4924",
            "OCR Problems": "0.3676",
            "Counting Problem": "0.2737",
            "Temporal Reasoning": "0.3151",
            "Spatial Reasoning": "0.6667",
            "Action Reasoning": "0.3966",
            "Object Reasoning": "0.4104",
            "Information Synopsis": "0.5769"
        }
    },
    "long": {
        "overall": "0.3689",
        "domain": {
            "Knowledge": "0.3444",
            "Film & Television": "0.3500",
            "Sports Competition": "0.3933",
            "Artistic Performance": "0.3417",
            "Life Record": "0.4000",
            "Multilingual": "0.4333"
        },
        "sub_category": {
            "Humanity & History": "0.3000",
            "Literature & Art": "0.4667",
            "Biology & Medicine": "0.3667",
            "Finance & Commerce": "0.3667",
            "Astronomy": "0.2333",
            "Geography": "0.2333",
            "Law": "0.4667",
            "Life Tip": "0.3000",
            "Technology": "0.3667",
            "Animation": "0.2333",
            "Movie & TV Show": "0.4333",
            "Documentary": "0.4333",
            "News Report": "0.3000",
            "Esports": "0.4333",
            "Basketball": "0.3000",
            "Football": "0.3333",
            "Athletics": "0.3667",
            "Other Sports": "0.5333",
            "Stage Play": "0.3333",
            "Magic Show": "0.3667",
            "Variety Show": "0.1667",
            "Acrobatics": "0.5000",
            "Handicraft": "0.5000",
            "Food": "0.2000",
            "Fashion": "0.3667",
            "Daily Life": "0.4000",
            "Travel": "0.2667",
            "Pet & Animal": "0.6667",
            "Exercise": "0.4000",
            "Multilingual": "0.4333"
        },
        "task_type": {
            "Temporal Perception": "0.0000",
            "Spatial Perception": "0.3333",
            "Attribute Perception": "0.3704",
            "Action Recognition": "0.3968",
            "Object Recognition": "0.4074",
            "OCR Problems": "0.3571",
            "Counting Problem": "0.2292",
            "Temporal Reasoning": "0.3077",
            "Spatial Reasoning": "0.5455",
            "Action Reasoning": "0.3056",
            "Object Reasoning": "0.3375",
            "Information Synopsis": "0.5399"
        }
    },
    "overall": {
        "overall": "0.4504",
        "domain": {
            "Knowledge": "0.4259",
            "Film & Television": "0.4944",
            "Sports Competition": "0.4356",
            "Artistic Performance": "0.4917",
            "Life Record": "0.4524",
            "Multilingual": "0.3889"
        },
        "sub_category": {
            "Humanity & History": "0.3444",
            "Literature & Art": "0.4444",
            "Biology & Medicine": "0.4778",
            "Finance & Commerce": "0.4000",
            "Astronomy": "0.3889",
            "Geography": "0.3778",
            "Law": "0.5222",
            "Life Tip": "0.4889",
            "Technology": "0.3889",
            "Animation": "0.3778",
            "Movie & TV Show": "0.5333",
            "Documentary": "0.5111",
            "News Report": "0.5556",
            "Esports": "0.4444",
            "Basketball": "0.3222",
            "Football": "0.4111",
            "Athletics": "0.4778",
            "Other Sports": "0.5222",
            "Stage Play": "0.5778",
            "Magic Show": "0.4333",
            "Variety Show": "0.4444",
            "Acrobatics": "0.5111",
            "Handicraft": "0.4444",
            "Food": "0.3333",
            "Fashion": "0.3889",
            "Daily Life": "0.4667",
            "Travel": "0.4333",
            "Pet & Animal": "0.6000",
            "Exercise": "0.5000",
            "Multilingual": "0.3889"
        },
        "task_type": {
            "Temporal Perception": "0.4000",
            "Spatial Perception": "0.6296",
            "Attribute Perception": "0.5901",
            "Action Recognition": "0.4089",
            "Object Recognition": "0.5085",
            "OCR Problems": "0.5180",
            "Counting Problem": "0.2836",
            "Temporal Reasoning": "0.3164",
            "Spatial Reasoning": "0.6786",
            "Action Reasoning": "0.3860",
            "Object Reasoning": "0.3943",
            "Information Synopsis": "0.5882"
        }
    }
}
```
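
A quick sanity check on a finished run is that the top-level `overall` score agrees with the unweighted mean of the `short`, `medium`, and `long` scores, which holds for the figures above. The sketch below assumes the three duration buckets contain the same number of questions; the helper name `mean_of_buckets` is illustrative and not part of VLMEvalKit.

```python
# "overall" accuracy per duration bucket for InternVL2-2B without subtitles,
# copied from the result block above (scores are stored as strings).
results = {
    "short": {"overall": "0.5756"},
    "medium": {"overall": "0.4067"},
    "long": {"overall": "0.3689"},
    "overall": {"overall": "0.4504"},
}


def mean_of_buckets(res):
    """Unweighted mean of the short/medium/long overall scores.

    Assumes equally sized buckets; the result file itself does not state this.
    """
    buckets = ("short", "medium", "long")
    return sum(float(res[b]["overall"]) for b in buckets) / len(buckets)


reported = float(results["overall"]["overall"])
recomputed = mean_of_buckets(results)

# Scores are printed with four decimals, so allow for rounding error.
consistent = abs(recomputed - reported) < 2e-4
print(f"reported={reported:.4f} recomputed={recomputed:.4f} consistent={consistent}")
```

If the check fails on your own output, one of the duration splits was probably evaluated on an incomplete set of videos.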

When testing with subtitles:

```bash
torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-2B --verbose --nframe 16 --use-subtitle
```

The expected test results are:

```
{
    "short": {
        "overall": "0.5978",
        "domain": {
            "Knowledge": "0.5926",
            "Film & Television": "0.6583",
            "Sports Competition": "0.5867",
            "Artistic Performance": "0.6083",
            "Life Record": "0.5952",
            "Multilingual": "0.4333"
        },
        "sub_category": {
            "Humanity & History": "0.4667",
            "Literature & Art": "0.5333",
            "Biology & Medicine": "0.8000",
            "Finance & Commerce": "0.5333",
            "Astronomy": "0.5667",
            "Geography": "0.6333",
            "Law": "0.6000",
            "Life Tip": "0.6333",
            "Technology": "0.5667",
            "Animation": "0.5667",
            "Movie & TV Show": "0.6333",
            "Documentary": "0.6333",
            "News Report": "0.8000",
            "Esports": "0.5667",
            "Basketball": "0.4333",
            "Football": "0.6667",
            "Athletics": "0.6333",
            "Other Sports": "0.6333",
            "Stage Play": "0.7000",
            "Magic Show": "0.5000",
            "Variety Show": "0.7000",
            "Acrobatics": "0.5333",
            "Handicraft": "0.4000",
            "Food": "0.6667",
            "Fashion": "0.5333",
            "Daily Life": "0.6667",
            "Travel": "0.5667",
            "Pet & Animal": "0.7333",
            "Exercise": "0.6000",
            "Multilingual": "0.4333"
        },
        "task_type": {
            "Temporal Perception": "0.8333",
            "Spatial Perception": "0.6333",
            "Attribute Perception": "0.7213",
            "Action Recognition": "0.5496",
            "Object Recognition": "0.5536",
            "OCR Problems": "0.7368",
            "Counting Problem": "0.3440",
            "Temporal Reasoning": "0.3077",
            "Spatial Reasoning": "0.8148",
            "Action Reasoning": "0.7021",
            "Object Reasoning": "0.5500",
            "Information Synopsis": "0.7683"
        }
    },
    "medium": {
        "overall": "0.4367",
        "domain": {
            "Knowledge": "0.4444",
            "Film & Television": "0.4833",
            "Sports Competition": "0.3600",
            "Artistic Performance": "0.5833",
            "Life Record": "0.3714",
            "Multilingual": "0.4333"
        },
        "sub_category": {
            "Humanity & History": "0.3000",
            "Literature & Art": "0.5000",
            "Biology & Medicine": "0.5333",
            "Finance & Commerce": "0.5333",
            "Astronomy": "0.4667",
            "Geography": "0.3667",
            "Law": "0.5000",
            "Life Tip": "0.6000",
            "Technology": "0.2000",
            "Animation": "0.3000",
            "Movie & TV Show": "0.5667",
            "Documentary": "0.5333",
            "News Report": "0.5333",
            "Esports": "0.3333",
            "Basketball": "0.2333",
            "Football": "0.3667",
            "Athletics": "0.4667",
            "Other Sports": "0.4000",
            "Stage Play": "0.6667",
            "Magic Show": "0.6000",
            "Variety Show": "0.5667",
            "Acrobatics": "0.5000",
            "Handicraft": "0.5000",
            "Food": "0.2000",
            "Fashion": "0.3000",
            "Daily Life": "0.2667",
            "Travel": "0.4333",
            "Pet & Animal": "0.3333",
            "Exercise": "0.5667",
            "Multilingual": "0.4333"
        },
        "task_type": {
            "Temporal Perception": "0.3226",
            "Spatial Perception": "0.5238",
            "Attribute Perception": "0.5068",
            "Action Recognition": "0.3277",
            "Object Recognition": "0.4924",
            "OCR Problems": "0.4118",
            "Counting Problem": "0.3053",
            "Temporal Reasoning": "0.3288",
            "Spatial Reasoning": "0.6667",
            "Action Reasoning": "0.4655",
            "Object Reasoning": "0.4478",
            "Information Synopsis": "0.6538"
        }
    },
    "long": {
        "overall": "0.3856",
        "domain": {
            "Knowledge": "0.3889",
            "Film & Television": "0.3750",
            "Sports Competition": "0.3867",
            "Artistic Performance": "0.3417",
            "Life Record": "0.4048",
            "Multilingual": "0.4333"
        },
        "sub_category": {
            "Humanity & History": "0.3000",
            "Literature & Art": "0.5000",
            "Biology & Medicine": "0.4333",
            "Finance & Commerce": "0.5000",
            "Astronomy": "0.3000",
            "Geography": "0.3000",
            "Law": "0.4333",
            "Life Tip": "0.3333",
            "Technology": "0.4000",
            "Animation": "0.2333",
            "Movie & TV Show": "0.4667",
            "Documentary": "0.4333",
            "News Report": "0.3667",
            "Esports": "0.4667",
            "Basketball": "0.2667",
            "Football": "0.3000",
            "Athletics": "0.3333",
            "Other Sports": "0.5667",
            "Stage Play": "0.4000",
            "Magic Show": "0.3000",
            "Variety Show": "0.2000",
            "Acrobatics": "0.4667",
            "Handicraft": "0.5000",
            "Food": "0.2000",
            "Fashion": "0.4000",
            "Daily Life": "0.4333",
            "Travel": "0.2333",
            "Pet & Animal": "0.7000",
            "Exercise": "0.3667",
            "Multilingual": "0.4333"
        },
        "task_type": {
            "Temporal Perception": "0.0000",
            "Spatial Perception": "0.3333",
            "Attribute Perception": "0.4444",
            "Action Recognition": "0.4603",
            "Object Recognition": "0.3519",
            "OCR Problems": "0.4286",
            "Counting Problem": "0.2292",
            "Temporal Reasoning": "0.3187",
            "Spatial Reasoning": "0.5455",
            "Action Reasoning": "0.3222",
            "Object Reasoning": "0.3625",
            "Information Synopsis": "0.5460"
        }
    },
    "overall": {
        "overall": "0.4733",
        "domain": {
            "Knowledge": "0.4753",
            "Film & Television": "0.5056",
            "Sports Competition": "0.4444",
            "Artistic Performance": "0.5111",
            "Life Record": "0.4571",
            "Multilingual": "0.4333"
        },
        "sub_category": {
            "Humanity & History": "0.3556",
            "Literature & Art": "0.5111",
            "Biology & Medicine": "0.5889",
            "Finance & Commerce": "0.5222",
            "Astronomy": "0.4444",
            "Geography": "0.4333",
            "Law": "0.5111",
            "Life Tip": "0.5222",
            "Technology": "0.3889",
            "Animation": "0.3667",
            "Movie & TV Show": "0.5556",
            "Documentary": "0.5333",
            "News Report": "0.5667",
            "Esports": "0.4556",
            "Basketball": "0.3111",
            "Football": "0.4444",
            "Athletics": "0.4778",
            "Other Sports": "0.5333",
            "Stage Play": "0.5889",
            "Magic Show": "0.4667",
            "Variety Show": "0.4889",
            "Acrobatics": "0.5000",
            "Handicraft": "0.4667",
            "Food": "0.3556",
            "Fashion": "0.4111",
            "Daily Life": "0.4556",
            "Travel": "0.4111",
            "Pet & Animal": "0.5889",
            "Exercise": "0.5111",
            "Multilingual": "0.4333"
        },
        "task_type": {
            "Temporal Perception": "0.4545",
            "Spatial Perception": "0.5741",
            "Attribute Perception": "0.6171",
            "Action Recognition": "0.4473",
            "Object Recognition": "0.5000",
            "OCR Problems": "0.5468",
            "Counting Problem": "0.3097",
            "Temporal Reasoning": "0.3220",
            "Spatial Reasoning": "0.7143",
            "Action Reasoning": "0.4140",
            "Object Reasoning": "0.4207",
            "Information Synopsis": "0.6285"
        }
    }
}
```
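
Because small deviations between runs are expected, a reproduced result file is better compared against the reference with a tolerance than by exact match. The following sketch flattens the nested result structure and reports scores that differ by more than a threshold. The function names, the 0.01 tolerance, and the `actual` values are assumptions made for illustration; only the `expected` excerpt comes from the block above.

```python
import json


def flatten(tree, prefix=""):
    """Turn the nested result dict into {"short/overall": 0.5978, ...}."""
    flat = {}
    for key, value in tree.items():
        path = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = float(value)  # scores are stored as strings
    return flat


def diff_results(expected, actual, tol=0.01):
    """Return {path: (expected, actual)} for scores differing by more than tol.

    A path missing from `actual` is reported with None as its actual value.
    """
    exp, act = flatten(expected), flatten(actual)
    return {
        path: (exp[path], act.get(path))
        for path in exp
        if path not in act or abs(exp[path] - act[path]) > tol
    }


# Excerpt of the expected InternVL2-2B results with subtitles (from above).
expected = json.loads('{"short": {"overall": "0.5978"}, "long": {"overall": "0.3856"}}')
# Hypothetical output of a local run, invented for illustration only.
actual = json.loads('{"short": {"overall": "0.5956"}, "long": {"overall": "0.3600"}}')

mismatches = diff_results(expected, actual)
print(mismatches)
```

With these sample values only `long/overall` is reported, since the `short` score is within the tolerance. Choose `tol` to match how much variation you consider acceptable for your toolkit version and hardware.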

````

````{tab} 4B

When testing without subtitles:

```bash
torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-4B --verbose --nframe 16
```

The expected test results are:

```
{
    "short": {
        "overall": "0.6289",
        "domain": {
            "Knowledge": "0.6519",
            "Film & Television": "0.7000",
            "Sports Competition": "0.5800",
            "Artistic Performance": "0.6417",
            "Life Record": "0.6095",
            "Multilingual": "0.4667"
        },
        "sub_category": {
            "Humanity & History": "0.5333",
            "Literature & Art": "0.6000",
            "Biology & Medicine": "0.7667",
            "Finance & Commerce": "0.6000",
            "Astronomy": "0.6333",
            "Geography": "0.5667",
            "Law": "0.7333",
            "Life Tip": "0.7667",
            "Technology": "0.6667",
            "Animation": "0.6000",
            "Movie & TV Show": "0.6667",
            "Documentary": "0.6333",
            "News Report": "0.9000",
            "Esports": "0.5333",
            "Basketball": "0.4667",
            "Football": "0.6667",
            "Athletics": "0.6333",
            "Other Sports": "0.6000",
            "Stage Play": "0.8000",
            "Magic Show": "0.6000",
            "Variety Show": "0.5667",
            "Acrobatics": "0.6000",
            "Handicraft": "0.5667",
            "Food": "0.5667",
            "Fashion": "0.5333",
            "Daily Life": "0.6000",
            "Travel": "0.7000",
            "Pet & Animal": "0.7667",
            "Exercise": "0.5333",
            "Multilingual": "0.4667"
        },
        "task_type": {
            "Temporal Perception": "0.8889",
            "Spatial Perception": "0.6333",
            "Attribute Perception": "0.7459",
            "Action Recognition": "0.6183",
            "Object Recognition": "0.6369",
            "OCR Problems": "0.6140",
            "Counting Problem": "0.3200",
            "Temporal Reasoning": "0.4615",
            "Spatial Reasoning": "0.7778",
            "Action Reasoning": "0.7021",
            "Object Reasoning": "0.6250",
            "Information Synopsis": "0.8171"
        }
    },
    "medium": {
        "overall": "0.4678",
        "domain": {
            "Knowledge": "0.4704",
            "Film & Television": "0.5083",
            "Sports Competition": "0.4133",
            "Artistic Performance": "0.5333",
            "Life Record": "0.4381",
            "Multilingual": "0.5000"
        },
        "sub_category": {
            "Humanity & History": "0.2667",
            "Literature & Art": "0.6000",
            "Biology & Medicine": "0.5333",
            "Finance & Commerce": "0.5333",
            "Astronomy": "0.5000",
            "Geography": "0.4000",
            "Law": "0.5000",
            "Life Tip": "0.5333",
            "Technology": "0.3667",
            "Animation": "0.2333",
            "Movie & TV Show": "0.6333",
            "Documentary": "0.6000",
            "News Report": "0.5667",
            "Esports": "0.3667",
            "Basketball": "0.3667",
            "Football": "0.4333",
            "Athletics": "0.4333",
            "Other Sports": "0.4667",
            "Stage Play": "0.6000",
            "Magic Show": "0.4000",
            "Variety Show": "0.5000",
            "Acrobatics": "0.6333",
            "Handicraft": "0.7000",
            "Food": "0.3667",
            "Fashion": "0.3333",
            "Daily Life": "0.3000",
            "Travel": "0.4333",
            "Pet & Animal": "0.3667",
            "Exercise": "0.5667",
            "Multilingual": "0.5000"
        },
        "task_type": {
            "Temporal Perception": "0.4839",
            "Spatial Perception": "0.4762",
            "Attribute Perception": "0.5205",
            "Action Recognition": "0.3866",
            "Object Recognition": "0.5530",
            "OCR Problems": "0.4559",
            "Counting Problem": "0.3053",
            "Temporal Reasoning": "0.3014",
            "Spatial Reasoning": "0.7222",
            "Action Reasoning": "0.5172",
            "Object Reasoning": "0.4925",
            "Information Synopsis": "0.6154"
        }
    },
    "long": {
        "overall": "0.4467",
        "domain": {
            "Knowledge": "0.4815",
            "Film & Television": "0.4333",
            "Sports Competition": "0.4267",
            "Artistic Performance": "0.4250",
            "Life Record": "0.4333",
            "Multilingual": "0.4667"
        },
        "sub_category": {
            "Humanity & History": "0.3333",
            "Literature & Art": "0.5000",
            "Biology & Medicine": "0.5000",
            "Finance & Commerce": "0.5333",
            "Astronomy": "0.5333",
            "Geography": "0.3333",
            "Law": "0.5000",
            "Life Tip": "0.5333",
            "Technology": "0.5667",
            "Animation": "0.2667",
            "Movie & TV Show": "0.5333",
            "Documentary": "0.5000",
            "News Report": "0.4333",
            "Esports": "0.4667",
            "Basketball": "0.4000",
            "Football": "0.4333",
            "Athletics": "0.3667",
            "Other Sports": "0.4667",
            "Stage Play": "0.6000",
            "Magic Show": "0.4333",
            "Variety Show": "0.2667",
            "Acrobatics": "0.4000",
            "Handicraft": "0.5000",
            "Food": "0.3000",
            "Fashion": "0.4667",
            "Daily Life": "0.3000",
            "Travel": "0.3000",
            "Pet & Animal": "0.6667",
            "Exercise": "0.5000",
            "Multilingual": "0.4667"
        },
        "task_type": {
            "Temporal Perception": "0.5000",
            "Spatial Perception": "0.6667",
            "Attribute Perception": "0.5185",
            "Action Recognition": "0.3810",
            "Object Recognition": "0.4815",
            "OCR Problems": "0.3571",
            "Counting Problem": "0.2708",
            "Temporal Reasoning": "0.2637",
            "Spatial Reasoning": "0.5455",
            "Action Reasoning": "0.4556",
            "Object Reasoning": "0.4500",
            "Information Synopsis": "0.5828"
        }
    },
    "overall": {
        "overall": "0.5144",
        "domain": {
            "Knowledge": "0.5346",
            "Film & Television": "0.5472",
            "Sports Competition": "0.4733",
            "Artistic Performance": "0.5333",
            "Life Record": "0.4937",
            "Multilingual": "0.4778"
        },
        "sub_category": {
            "Humanity & History": "0.3778",
            "Literature & Art": "0.5667",
            "Biology & Medicine": "0.6000",
            "Finance & Commerce": "0.5556",
            "Astronomy": "0.5556",
            "Geography": "0.4333",
            "Law": "0.5778",
            "Life Tip": "0.6111",
            "Technology": "0.5333",
            "Animation": "0.3667",
            "Movie & TV Show": "0.6111",
            "Documentary": "0.5778",
            "News Report": "0.6333",
            "Esports": "0.4556",
            "Basketball": "0.4111",
            "Football": "0.5111",
            "Athletics": "0.4778",
            "Other Sports": "0.5111",
            "Stage Play": "0.6667",
            "Magic Show": "0.4778",
            "Variety Show": "0.4444",
            "Acrobatics": "0.5444",
            "Handicraft": "0.5889",
            "Food": "0.4111",
            "Fashion": "0.4444",
            "Daily Life": "0.4000",
            "Travel": "0.4778",
            "Pet & Animal": "0.6000",
            "Exercise": "0.5333",
            "Multilingual": "0.4778"
        },
        "task_type": {
            "Temporal Perception": "0.6182",
            "Spatial Perception": "0.5741",
            "Attribute Perception": "0.6441",
            "Action Recognition": "0.4824",
            "Object Recognition": "0.5819",
            "OCR Problems": "0.5108",
            "Counting Problem": "0.3060",
            "Temporal Reasoning": "0.2938",
            "Spatial Reasoning": "0.7143",
            "Action Reasoning": "0.5088",
            "Object Reasoning": "0.4934",
            "Information Synopsis": "0.6502"
        }
    }
}
```

When testing with subtitles:

```bash
torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-4B --verbose --nframe 16 --use-subtitle
```

The expected test results are:

```
{
    "short": {
        "overall": "0.6511",
        "domain": {
            "Knowledge": "0.6852",
            "Film & Television": "0.7083",
            "Sports Competition": "0.5933",
            "Artistic Performance": "0.6750",
            "Life Record": "0.6286",
            "Multilingual": "0.4667"
        },
        "sub_category": {
            "Humanity & History": "0.5333",
            "Literature & Art": "0.5667",
            "Biology & Medicine": "0.8333",
            "Finance & Commerce": "0.6667",
            "Astronomy": "0.7000",
            "Geography": "0.6333",
            "Law": "0.7667",
            "Life Tip": "0.7667",
            "Technology": "0.7000",
            "Animation": "0.4667",
            "Movie & TV Show": "0.7333",
            "Documentary": "0.7000",
            "News Report": "0.9333",
            "Esports": "0.5000",
            "Basketball": "0.5000",
            "Football": "0.6333",
            "Athletics": "0.7000",
            "Other Sports": "0.6333",
            "Stage Play": "0.7667",
            "Magic Show": "0.7000",
            "Variety Show": "0.5667",
            "Acrobatics": "0.6667",
            "Handicraft": "0.6333",
            "Food": "0.6000",
            "Fashion": "0.5333",
            "Daily Life": "0.6667",
            "Travel": "0.7000",
            "Pet & Animal": "0.7333",
            "Exercise": "0.5333",
            "Multilingual": "0.4667"
        },
        "task_type": {
            "Temporal Perception": "0.8333",
            "Spatial Perception": "0.6667",
            "Attribute Perception": "0.7787",
            "Action Recognition": "0.6260",
            "Object Recognition": "0.6429",
            "OCR Problems": "0.6667",
            "Counting Problem": "0.3360",
            "Temporal Reasoning": "0.6154",
            "Spatial Reasoning": "0.8148",
            "Action Reasoning": "0.7234",
            "Object Reasoning": "0.6375",
            "Information Synopsis": "0.8659"
        }
    },
    "medium": {
        "overall": "0.4878",
        "domain": {
            "Knowledge": "0.5148",
            "Film & Television": "0.5417",
            "Sports Competition": "0.4067",
            "Artistic Performance": "0.5417",
            "Life Record": "0.4619",
            "Multilingual": "0.4000"
        },
        "sub_category": {
            "Humanity & History": "0.3667",
            "Literature & Art": "0.5667",
            "Biology & Medicine": "0.5667",
            "Finance & Commerce": "0.5667",
            "Astronomy": "0.7000",
            "Geography": "0.3667",
            "Law": "0.6000",
            "Life Tip": "0.4667",
            "Technology": "0.4333",
            "Animation": "0.2667",
            "Movie & TV Show": "0.6667",
            "Documentary": "0.5667",
            "News Report": "0.6667",
            "Esports": "0.4667",
            "Basketball": "0.2333",
            "Football": "0.4333",
            "Athletics": "0.4333",
            "Other Sports": "0.4667",
            "Stage Play": "0.6333",
            "Magic Show": "0.4333",
            "Variety Show": "0.5000",
            "Acrobatics": "0.6000",
            "Handicraft": "0.7000",
            "Food": "0.3333",
            "Fashion": "0.3667",
            "Daily Life": "0.3667",
            "Travel": "0.5000",
            "Pet & Animal": "0.4000",
            "Exercise": "0.5667",
            "Multilingual": "0.4000"
        },
        "task_type": {
            "Temporal Perception": "0.4194",
            "Spatial Perception": "0.4286",
            "Attribute Perception": "0.5479",
            "Action Recognition": "0.3950",
            "Object Recognition": "0.5606",
            "OCR Problems": "0.4559",
            "Counting Problem": "0.3474",
            "Temporal Reasoning": "0.2877",
            "Spatial Reasoning": "0.8333",
            "Action Reasoning": "0.4655",
            "Object Reasoning": "0.5522",
            "Information Synopsis": "0.7051"
        }
    },
    "long": {
        "overall": "0.4622",
        "domain": {
            "Knowledge": "0.4889",
            "Film & Television": "0.4750",
            "Sports Competition": "0.4267",
            "Artistic Performance": "0.4500",
            "Life Record": "0.4476",
            "Multilingual": "0.5000"
        },
        "sub_category": {
            "Humanity & History": "0.2667",
            "Literature & Art": "0.5667",
            "Biology & Medicine": "0.5333",
            "Finance & Commerce": "0.6000",
            "Astronomy": "0.5333",
            "Geography": "0.3333",
            "Law": "0.6000",
            "Life Tip": "0.5000",
            "Technology": "0.4667",
            "Animation": "0.3333",
            "Movie & TV Show": "0.5000",
            "Documentary": "0.6000",
            "News Report": "0.4667",
            "Esports": "0.5000",
            "Basketball": "0.4000",
            "Football": "0.5333",
            "Athletics": "0.3000",
            "Other Sports": "0.4000",
            "Stage Play": "0.7333",
            "Magic Show": "0.4333",
            "Variety Show": "0.2333",
            "Acrobatics": "0.4000",
            "Handicraft": "0.5667",
            "Food": "0.2667",
            "Fashion": "0.4667",
            "Daily Life": "0.3333",
            "Travel": "0.3000",
            "Pet & Animal": "0.7000",
            "Exercise": "0.5000",
            "Multilingual": "0.5000"
        },
        "task_type": {
            "Temporal Perception": "0.3333",
            "Spatial Perception": "0.3333",
            "Attribute Perception": "0.5185",
            "Action Recognition": "0.4444",
            "Object Recognition": "0.4815",
            "OCR Problems": "0.2857",
            "Counting Problem": "0.2708",
            "Temporal Reasoning": "0.2418",
            "Spatial Reasoning": "0.5455",
            "Action Reasoning": "0.4444",
            "Object Reasoning": "0.4708",
            "Information Synopsis": "0.6564"
        }
    },
    "overall": {
        "overall": "0.5337",
        "domain": {
            "Knowledge": "0.5630",
            "Film & Television": "0.5750",
            "Sports Competition": "0.4756",
            "Artistic Performance": "0.5556",
            "Life Record": "0.5127",
            "Multilingual": "0.4556"
        },
        "sub_category": {
            "Humanity & History": "0.3889",
            "Literature & Art": "0.5667",
            "Biology & Medicine": "0.6444",
            "Finance & Commerce": "0.6111",
            "Astronomy": "0.6444",
            "Geography": "0.4444",
            "Law": "0.6556",
            "Life Tip": "0.5778",
            "Technology": "0.5333",
            "Animation": "0.3556",
            "Movie & TV Show": "0.6333",
            "Documentary": "0.6222",
            "News Report": "0.6889",
            "Esports": "0.4889",
            "Basketball": "0.3778",
            "Football": "0.5333",
            "Athletics": "0.4778",
            "Other Sports": "0.5000",
            "Stage Play": "0.7111",
            "Magic Show": "0.5222",
            "Variety Show": "0.4333",
            "Acrobatics": "0.5556",
            "Handicraft": "0.6333",
            "Food": "0.4000",
            "Fashion": "0.4556",
            "Daily Life": "0.4556",
            "Travel": "0.5000",
            "Pet & Animal": "0.6111",
            "Exercise": "0.5333",
            "Multilingual": "0.4556"
        },
        "task_type": {
            "Temporal Perception": "0.5455",
            "Spatial Perception": "0.5556",
            "Attribute Perception": "0.6712",
            "Action Recognition": "0.5016",
            "Object Recognition": "0.5876",
            "OCR Problems": "0.5252",
            "Counting Problem": "0.3284",
            "Temporal Reasoning": "0.2881",
            "Spatial Reasoning": "0.7679",
            "Action Reasoning": "0.4947",
            "Object Reasoning": "0.5242",
            "Information Synopsis": "0.7214"
        }
    }
}
```
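
To see how much subtitles help, the two result dumps can be compared split by split. The following is a minimal sketch, assuming both JSON outputs have been loaded as dictionaries with the structure shown above. The function name `subtitle_gain` is hypothetical and not part of VLMEvalKit; the scores are the `overall` entries copied from the InternVL2-4B results in this tab.

```python
def subtitle_gain(without_subs: dict, with_subs: dict) -> dict:
    """Return the per-split change in overall accuracy when subtitles are added."""
    gains = {}
    for split, scores in with_subs.items():
        if split in without_subs:
            # Scores are stored as strings in the result dumps, so convert first.
            delta = float(scores["overall"]) - float(without_subs[split]["overall"])
            gains[split] = round(delta, 4)
    return gains


# "overall" entries copied from the InternVL2-4B results above.
without_subs = {
    "medium": {"overall": "0.4678"},
    "long": {"overall": "0.4467"},
    "overall": {"overall": "0.5144"},
}
with_subs = {
    "medium": {"overall": "0.4878"},
    "long": {"overall": "0.4622"},
    "overall": {"overall": "0.5337"},
}

gains = subtitle_gain(without_subs, with_subs)
print(gains)
```

For these numbers the gain is about 0.02 on the medium split, 0.016 on the long split, and 0.019 overall.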

````

````{tab} 8B

When testing without subtitles:

```bash
torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-8B --verbose --nframe 16
```

The expected test results are:

```
{
    "short": {
        "overall": "0.6567",
        "domain": {
            "Knowledge": "0.6704",
            "Film & Television": "0.7083",
            "Sports Competition": "0.5933",
            "Artistic Performance": "0.7000",
            "Life Record": "0.6619",
            "Multilingual": "0.4333"
        },
        "sub_category": {
            "Humanity & History": "0.6000",
            "Literature & Art": "0.6000",
            "Biology & Medicine": "0.7667",
            "Finance & Commerce": "0.7000",
            "Astronomy": "0.6000",
            "Geography": "0.7000",
            "Law": "0.7000",
            "Life Tip": "0.7000",
            "Technology": "0.6667",
            "Animation": "0.8000",
            "Movie & TV Show": "0.6000",
            "Documentary": "0.6333",
            "News Report": "0.8000",
            "Esports": "0.5333",
            "Basketball": "0.3667",
            "Football": "0.7000",
            "Athletics": "0.7333",
            "Other Sports": "0.6333",
            "Stage Play": "0.8333",
            "Magic Show": "0.6667",
            "Variety Show": "0.6333",
            "Acrobatics": "0.6667",
            "Handicraft": "0.7000",
            "Food": "0.6667",
            "Fashion": "0.5333",
            "Daily Life": "0.6667",
            "Travel": "0.7667",
            "Pet & Animal": "0.7667",
            "Exercise": "0.5333",
            "Multilingual": "0.4333"
        },
        "task_type": {
            "Temporal Perception": "0.7222",
            "Spatial Perception": "0.7667",
            "Attribute Perception": "0.7623",
            "Action Recognition": "0.5954",
            "Object Recognition": "0.6845",
            "OCR Problems": "0.7719",
            "Counting Problem": "0.4080",
            "Temporal Reasoning": "0.6154",
            "Spatial Reasoning": "0.8148",
            "Action Reasoning": "0.6596",
            "Object Reasoning": "0.6250",
            "Information Synopsis": "0.7683"
        }
    },
    "medium": {
        "overall": "0.5044",
        "domain": {
            "Knowledge": "0.5148",
            "Film & Television": "0.5750",
            "Sports Competition": "0.4533",
            "Artistic Performance": "0.5917",
            "Life Record": "0.4429",
            "Multilingual": "0.4667"
        },
        "sub_category": {
            "Humanity & History": "0.4333",
            "Literature & Art": "0.6333",
            "Biology & Medicine": "0.5667",
            "Finance & Commerce": "0.6000",
            "Astronomy": "0.4333",
            "Geography": "0.3333",
            "Law": "0.5667",
            "Life Tip": "0.6333",
            "Technology": "0.4333",
            "Animation": "0.4000",
            "Movie & TV Show": "0.6667",
            "Documentary": "0.5667",
            "News Report": "0.6667",
            "Esports": "0.5667",
            "Basketball": "0.2667",
            "Football": "0.4667",
            "Athletics": "0.4333",
            "Other Sports": "0.5333",
            "Stage Play": "0.8000",
            "Magic Show": "0.4667",
            "Variety Show": "0.5667",
            "Acrobatics": "0.5333",
            "Handicraft": "0.5667",
            "Food": "0.4000",
            "Fashion": "0.5000",
            "Daily Life": "0.3333",
            "Travel": "0.4333",
            "Pet & Animal": "0.3667",
            "Exercise": "0.5000",
            "Multilingual": "0.4667"
        },
        "task_type": {
            "Temporal Perception": "0.4516",
            "Spatial Perception": "0.5714",
            "Attribute Perception": "0.4932",
            "Action Recognition": "0.3782",
            "Object Recognition": "0.6212",
            "OCR Problems": "0.4706",
            "Counting Problem": "0.3053",
            "Temporal Reasoning": "0.3836",
            "Spatial Reasoning": "0.6111",
            "Action Reasoning": "0.5172",
            "Object Reasoning": "0.5970",
            "Information Synopsis": "0.7051"
        }
    },
    "long": {
        "overall": "0.4589",
        "domain": {
            "Knowledge": "0.5037",
            "Film & Television": "0.4500",
            "Sports Competition": "0.4733",
            "Artistic Performance": "0.4417",
            "Life Record": "0.4048",
            "Multilingual": "0.4667"
        },
        "sub_category": {
            "Humanity & History": "0.5333",
            "Literature & Art": "0.5000",
            "Biology & Medicine": "0.6000",
            "Finance & Commerce": "0.5000",
            "Astronomy": "0.5000",
            "Geography": "0.3667",
            "Law": "0.5333",
            "Life Tip": "0.5667",
            "Technology": "0.4333",
            "Animation": "0.2667",
            "Movie & TV Show": "0.5667",
            "Documentary": "0.4667",
            "News Report": "0.5000",
            "Esports": "0.5000",
            "Basketball": "0.3667",
            "Football": "0.5000",
            "Athletics": "0.5000",
            "Other Sports": "0.5000",
            "Stage Play": "0.6333",
            "Magic Show": "0.3333",
            "Variety Show": "0.3000",
            "Acrobatics": "0.5000",
            "Handicraft": "0.4667",
            "Food": "0.2667",
            "Fashion": "0.4000",
            "Daily Life": "0.3333",
            "Travel": "0.3667",
            "Pet & Animal": "0.6333",
            "Exercise": "0.3667",
            "Multilingual": "0.4667"
        },
        "task_type": {
            "Temporal Perception": "0.1667",
            "Spatial Perception": "0.0000",
            "Attribute Perception": "0.6296",
            "Action Recognition": "0.4127",
            "Object Recognition": "0.5000",
            "OCR Problems": "0.5000",
            "Counting Problem": "0.3542",
            "Temporal Reasoning": "0.3297",
            "Spatial Reasoning": "0.6364",
            "Action Reasoning": "0.4000",
            "Object Reasoning": "0.4625",
            "Information Synopsis": "0.6012"
        }
    },
    "overall": {
        "overall": "0.5400",
        "domain": {
            "Knowledge": "0.5630",
            "Film & Television": "0.5778",
            "Sports Competition": "0.5067",
            "Artistic Performance": "0.5778",
            "Life Record": "0.5032",
            "Multilingual": "0.4556"
        },
        "sub_category": {
            "Humanity & History": "0.5222",
            "Literature & Art": "0.5778",
            "Biology & Medicine": "0.6444",
            "Finance & Commerce": "0.6000",
            "Astronomy": "0.5111",
            "Geography": "0.4667",
            "Law": "0.6000",
            "Life Tip": "0.6333",
            "Technology": "0.5111",
            "Animation": "0.4889",
            "Movie & TV Show": "0.6111",
            "Documentary": "0.5556",
            "News Report": "0.6556",
            "Esports": "0.5333",
            "Basketball": "0.3333",
            "Football": "0.5556",
            "Athletics": "0.5556",
            "Other Sports": "0.5556",
            "Stage Play": "0.7556",
            "Magic Show": "0.4889",
            "Variety Show": "0.5000",
            "Acrobatics": "0.5667",
            "Handicraft": "0.5778",
            "Food": "0.4444",
            "Fashion": "0.4778",
            "Daily Life": "0.4444",
            "Travel": "0.5222",
            "Pet & Animal": "0.5889",
            "Exercise": "0.4667",
            "Multilingual": "0.4556"
        },
        "task_type": {
            "Temporal Perception": "0.5091",
            "Spatial Perception": "0.6481",
            "Attribute Perception": "0.6577",
            "Action Recognition": "0.4760",
            "Object Recognition": "0.6328",
            "OCR Problems": "0.5971",
            "Counting Problem": "0.3619",
            "Temporal Reasoning": "0.3729",
            "Spatial Reasoning": "0.7143",
            "Action Reasoning": "0.4667",
            "Object Reasoning": "0.5308",
            "Information Synopsis": "0.6687"
        }
    }
}
```
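
As a quick sanity check on a finished run, the top-level `overall` score can be compared with the mean of the three duration splits. The sketch below assumes the short, medium, and long splits carry equal weight; this is an assumption, although it is consistent with the InternVL2-8B numbers above. The helper name `mean_of_splits` is hypothetical.

```python
def mean_of_splits(results: dict) -> float:
    """Average the overall accuracy of the short, medium, and long splits."""
    splits = ("short", "medium", "long")
    # Scores are stored as strings, so convert before averaging.
    return sum(float(results[s]["overall"]) for s in splits) / len(splits)


# "overall" entries copied from the InternVL2-8B results without subtitles.
results = {
    "short": {"overall": "0.6567"},
    "medium": {"overall": "0.5044"},
    "long": {"overall": "0.4589"},
    "overall": {"overall": "0.5400"},
}

reported = float(results["overall"]["overall"])
# Allow for the reported scores being rounded to four decimal places.
consistent = abs(mean_of_splits(results) - reported) <= 1e-4
print(consistent)
```

If the check fails on your own output, one of the splits may be incomplete or the run may have been interrupted before all predictions were written.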

When testing with subtitles:

```bash
torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-8B --verbose --nframe 16 --use-subtitle
```

The expected test results are:

```
{
    "short": {
        "overall": "0.6900",
        "domain": {
            "Knowledge": "0.7148",
            "Film & Television": "0.7500",
            "Sports Competition": "0.5933",
            "Artistic Performance": "0.7250",
            "Life Record": "0.7000",
            "Multilingual": "0.5000"
        },
        "sub_category": {
            "Humanity & History": "0.5667",
            "Literature & Art": "0.6333",
            "Biology & Medicine": "0.8333",
            "Finance & Commerce": "0.8333",
            "Astronomy": "0.6667",
            "Geography": "0.7000",
            "Law": "0.7000",
            "Life Tip": "0.8000",
            "Technology": "0.7000",
            "Animation": "0.7667",
            "Movie & TV Show": "0.6667",
            "Documentary": "0.6667",
            "News Report": "0.9000",
            "Esports": "0.5333",
            "Basketball": "0.4000",
            "Football": "0.6333",
            "Athletics": "0.7667",
            "Other Sports": "0.6333",
            "Stage Play": "0.8000",
            "Magic Show": "0.6667",
            "Variety Show": "0.7667",
            "Acrobatics": "0.6667",
            "Handicraft": "0.6667",
            "Food": "0.7000",
            "Fashion": "0.5667",
            "Daily Life": "0.7000",
            "Travel": "0.8333",
            "Pet & Animal": "0.8333",
            "Exercise": "0.6000",
            "Multilingual": "0.5000"
        },
        "task_type": {
            "Temporal Perception": "0.6667",
            "Spatial Perception": "0.7667",
            "Attribute Perception": "0.7951",
            "Action Recognition": "0.6412",
            "Object Recognition": "0.6964",
            "OCR Problems": "0.7895",
            "Counting Problem": "0.4240",
            "Temporal Reasoning": "0.6923",
            "Spatial Reasoning": "0.8519",
            "Action Reasoning": "0.7021",
            "Object Reasoning": "0.6875",
            "Information Synopsis": "0.8537"
        }
    },
    "medium": {
        "overall": "0.5256",
        "domain": {
            "Knowledge": "0.5593",
            "Film & Television": "0.6167",
            "Sports Competition": "0.4400",
            "Artistic Performance": "0.6167",
            "Life Record": "0.4429",
            "Multilingual": "0.5000"
        },
        "sub_category": {
            "Humanity & History": "0.4667",
            "Literature & Art": "0.6000",
            "Biology & Medicine": "0.5667",
            "Finance & Commerce": "0.6667",
            "Astronomy": "0.5667",
            "Geography": "0.4667",
            "Law": "0.5667",
            "Life Tip": "0.6667",
            "Technology": "0.4667",
            "Animation": "0.3667",
            "Movie & TV Show": "0.6667",
            "Documentary": "0.6667",
            "News Report": "0.7667",
            "Esports": "0.5667",
            "Basketball": "0.2667",
            "Football": "0.4333",
            "Athletics": "0.4333",
            "Other Sports": "0.5000",
            "Stage Play": "0.8333",
            "Magic Show": "0.5333",
            "Variety Show": "0.5667",
            "Acrobatics": "0.5333",
            "Handicraft": "0.5667",
            "Food": "0.3667",
            "Fashion": "0.4000",
            "Daily Life": "0.4333",
            "Travel": "0.4333",
            "Pet & Animal": "0.4000",
            "Exercise": "0.5000",
            "Multilingual": "0.5000"
        },
        "task_type": {
            "Temporal Perception": "0.4516",
            "Spatial Perception": "0.5238",
            "Attribute Perception": "0.5068",
            "Action Recognition": "0.4034",
            "Object Recognition": "0.6515",
            "OCR Problems": "0.4118",
            "Counting Problem": "0.3053",
            "Temporal Reasoning": "0.3973",
            "Spatial Reasoning": "0.7778",
            "Action Reasoning": "0.5517",
            "Object Reasoning": "0.6194",
            "Information Synopsis": "0.7949"
        }
    },
    "long": {
        "overall": "0.4922",
        "domain": {
            "Knowledge": "0.5667",
            "Film & Television": "0.4917",
            "Sports Competition": "0.4800",
            "Artistic Performance": "0.4583",
            "Life Record": "0.4381",
            "Multilingual": "0.4000"
        },
        "sub_category": {
            "Humanity & History": "0.5667",
            "Literature & Art": "0.5667",
            "Biology & Medicine": "0.7333",
            "Finance & Commerce": "0.5333",
            "Astronomy": "0.5667",
            "Geography": "0.4000",
            "Law": "0.6667",
            "Life Tip": "0.6000",
            "Technology": "0.4667",
            "Animation": "0.3333",
            "Movie & TV Show": "0.5000",
            "Documentary": "0.6000",
            "News Report": "0.5333",
            "Esports": "0.4333",
            "Basketball": "0.4000",
            "Football": "0.5333",
            "Athletics": "0.4667",
            "Other Sports": "0.5667",
            "Stage Play": "0.7333",
            "Magic Show": "0.3333",
            "Variety Show": "0.3000",
            "Acrobatics": "0.4667",
            "Handicraft": "0.5667",
            "Food": "0.3333",
            "Fashion": "0.4333",
            "Daily Life": "0.2667",
            "Travel": "0.3667",
            "Pet & Animal": "0.7333",
            "Exercise": "0.3667",
            "Multilingual": "0.4000"
        },
        "task_type": {
            "Temporal Perception": "0.1667",
            "Spatial Perception": "0.0000",
            "Attribute Perception": "0.7037",
            "Action Recognition": "0.4286",
            "Object Recognition": "0.5000",
            "OCR Problems": "0.5714",
            "Counting Problem": "0.2917",
            "Temporal Reasoning": "0.3077",
            "Spatial Reasoning": "0.7273",
            "Action Reasoning": "0.4278",
            "Object Reasoning": "0.4917",
            "Information Synopsis": "0.7117"
        }
    },
    "overall": {
        "overall": "0.5693",
        "domain": {
            "Knowledge": "0.6136",
            "Film & Television": "0.6194",
            "Sports Competition": "0.5044",
            "Artistic Performance": "0.6000",
            "Life Record": "0.5270",
            "Multilingual": "0.4667"
        },
        "sub_category": {
            "Humanity & History": "0.5333",
            "Literature & Art": "0.6000",
            "Biology & Medicine": "0.7111",
            "Finance & Commerce": "0.6778",
            "Astronomy": "0.6000",
            "Geography": "0.5222",
            "Law": "0.6444",
            "Life Tip": "0.6889",
            "Technology": "0.5444",
            "Animation": "0.4889",
            "Movie & TV Show": "0.6111",
            "Documentary": "0.6444",
            "News Report": "0.7333",
            "Esports": "0.5111",
            "Basketball": "0.3556",
            "Football": "0.5333",
            "Athletics": "0.5556",
            "Other Sports": "0.5667",
            "Stage Play": "0.7889",
            "Magic Show": "0.5111",
            "Variety Show": "0.5444",
            "Acrobatics": "0.5556",
            "Handicraft": "0.6000",
            "Food": "0.4667",
            "Fashion": "0.4667",
            "Daily Life": "0.4667",
            "Travel": "0.5444",
            "Pet & Animal": "0.6556",
            "Exercise": "0.4889",
            "Multilingual": "0.4667"
        },
        "task_type": {
            "Temporal Perception": "0.4909",
            "Spatial Perception": "0.6296",
            "Attribute Perception": "0.6892",
            "Action Recognition": "0.5080",
            "Object Recognition": "0.6497",
            "OCR Problems": "0.5827",
            "Counting Problem": "0.3582",
            "Temporal Reasoning": "0.3729",
            "Spatial Reasoning": "0.8036",
            "Action Reasoning": "0.4982",
            "Object Reasoning": "0.5639",
            "Information Synopsis": "0.7678"
        }
    }
}
```
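
The `task_type` breakdown is often the most useful part of the dump when looking for weaknesses. The sketch below ranks task types by accuracy and returns the lowest-scoring ones. The function name `weakest_tasks` is hypothetical; the scores are copied from the `overall` → `task_type` block of the InternVL2-8B results with subtitles.

```python
def weakest_tasks(task_scores: dict, k: int = 3) -> list:
    """Return the names of the k task types with the lowest accuracy."""
    # Scores are stored as strings, so sort on their float value.
    ranked = sorted(task_scores, key=lambda name: float(task_scores[name]))
    return ranked[:k]


# Copied from the InternVL2-8B results with subtitles ("overall" -> "task_type").
task_scores = {
    "Temporal Perception": "0.4909",
    "Spatial Perception": "0.6296",
    "Attribute Perception": "0.6892",
    "Action Recognition": "0.5080",
    "Object Recognition": "0.6497",
    "OCR Problems": "0.5827",
    "Counting Problem": "0.3582",
    "Temporal Reasoning": "0.3729",
    "Spatial Reasoning": "0.8036",
    "Action Reasoning": "0.4982",
    "Object Reasoning": "0.5639",
    "Information Synopsis": "0.7678",
}

weakest = weakest_tasks(task_scores)
print(weakest)
```

For this model, counting and temporal reasoning are the two lowest-scoring task types, followed by temporal perception.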

````

````{tab} 26B

When testing without subtitles:

```bash
torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-26B --verbose --nframe 16
```

The expected test results are:

```
{
    "short": {
        "overall": "0.6667",
        "domain": {
            "Knowledge": "0.6741",
            "Film & Television": "0.7333",
            "Sports Competition": "0.6133",
            "Artistic Performance": "0.6750",
            "Life Record": "0.6762",
            "Multilingual": "0.5000"
        },
        "sub_category": {
            "Humanity & History": "0.4000",
            "Literature & Art": "0.5667",
            "Biology & Medicine": "0.8667",
            "Finance & Commerce": "0.7000",
            "Astronomy": "0.6667",
            "Geography": "0.6333",
            "Law": "0.8000",
            "Life Tip": "0.8000",
            "Technology": "0.6333",
            "Animation": "0.8000",
            "Movie & TV Show": "0.7000",
            "Documentary": "0.5667",
            "News Report": "0.8667",
            "Esports": "0.5333",
            "Basketball": "0.4667",
            "Football": "0.6333",
            "Athletics": "0.7667",
            "Other Sports": "0.6667",
            "Stage Play": "0.8667",
            "Magic Show": "0.5333",
            "Variety Show": "0.6333",
            "Acrobatics": "0.6667",
            "Handicraft": "0.7000",
            "Food": "0.7667",
            "Fashion": "0.6667",
            "Daily Life": "0.6667",
            "Travel": "0.7667",
            "Pet & Animal": "0.7333",
            "Exercise": "0.4333",
            "Multilingual": "0.5000"
        },
        "task_type": {
            "Temporal Perception": "0.8333",
            "Spatial Perception": "0.7333",
            "Attribute Perception": "0.7541",
            "Action Recognition": "0.6489",
            "Object Recognition": "0.6548",
            "OCR Problems": "0.7719",
            "Counting Problem": "0.4080",
            "Temporal Reasoning": "0.6154",
            "Spatial Reasoning": "0.7778",
            "Action Reasoning": "0.7234",
            "Object Reasoning": "0.6500",
            "Information Synopsis": "0.8049"
        }
    },
    "medium": {
        "overall": "0.5200",
        "domain": {
            "Knowledge": "0.5481",
            "Film & Television": "0.5833",
            "Sports Competition": "0.4267",
            "Artistic Performance": "0.6167",
            "Life Record": "0.4524",
            "Multilingual": "0.5667"
        },
        "sub_category": {
            "Humanity & History": "0.4000",
            "Literature & Art": "0.6000",
            "Biology & Medicine": "0.6667",
            "Finance & Commerce": "0.5667",
            "Astronomy": "0.5333",
            "Geography": "0.4667",
            "Law": "0.6667",
            "Life Tip": "0.5333",
            "Technology": "0.5000",
            "Animation": "0.3667",
            "Movie & TV Show": "0.6000",
            "Documentary": "0.7000",
            "News Report": "0.6667",
            "Esports": "0.4667",
            "Basketball": "0.3000",
            "Football": "0.5000",
            "Athletics": "0.3667",
            "Other Sports": "0.5000",
            "Stage Play": "0.6667",
            "Magic Show": "0.6333",
            "Variety Show": "0.6000",
            "Acrobatics": "0.5667",
            "Handicraft": "0.6667",
            "Food": "0.3000",
            "Fashion": "0.4000",
            "Daily Life": "0.4000",
            "Travel": "0.5333",
            "Pet & Animal": "0.4667",
            "Exercise": "0.4000",
            "Multilingual": "0.5667"
        },
        "task_type": {
            "Temporal Perception": "0.4839",
            "Spatial Perception": "0.5238",
            "Attribute Perception": "0.5890",
            "Action Recognition": "0.4454",
            "Object Recognition": "0.6364",
            "OCR Problems": "0.4412",
            "Counting Problem": "0.3474",
            "Temporal Reasoning": "0.3836",
            "Spatial Reasoning": "0.7222",
            "Action Reasoning": "0.4655",
            "Object Reasoning": "0.5448",
            "Information Synopsis": "0.7436"
        }
    },
    "long": {
        "overall": "0.4578",
        "domain": {
            "Knowledge": "0.4815",
            "Film & Television": "0.4583",
            "Sports Competition": "0.4200",
            "Artistic Performance": "0.4167",
            "Life Record": "0.4857",
            "Multilingual": "0.4000"
        },
        "sub_category": {
            "Humanity & History": "0.5000",
            "Literature & Art": "0.5667",
            "Biology & Medicine": "0.5333",
            "Finance & Commerce": "0.6000",
            "Astronomy": "0.4667",
            "Geography": "0.3000",
            "Law": "0.5000",
            "Life Tip": "0.4667",
            "Technology": "0.4000",
            "Animation": "0.3667",
            "Movie & TV Show": "0.4667",
            "Documentary": "0.5000",
            "News Report": "0.5000",
            "Esports": "0.4667",
            "Basketball": "0.4000",
            "Football": "0.4667",
            "Athletics": "0.4000",
            "Other Sports": "0.3667",
            "Stage Play": "0.5667",
            "Magic Show": "0.4667",
            "Variety Show": "0.1333",
            "Acrobatics": "0.5000",
            "Handicraft": "0.6333",
            "Food": "0.4333",
            "Fashion": "0.3667",
            "Daily Life": "0.5333",
            "Travel": "0.3667",
            "Pet & Animal": "0.6667",
            "Exercise": "0.4000",
            "Multilingual": "0.4000"
        },
        "task_type": {
            "Temporal Perception": "0.0000",
            "Spatial Perception": "0.3333",
            "Attribute Perception": "0.5926",
            "Action Recognition": "0.3968",
            "Object Recognition": "0.5741",
            "OCR Problems": "0.5000",
            "Counting Problem": "0.2917",
            "Temporal Reasoning": "0.2967",
            "Spatial Reasoning": "0.6364",
            "Action Reasoning": "0.4111",
            "Object Reasoning": "0.4583",
            "Information Synopsis": "0.6135"
        }
    },
    "overall": {
        "overall": "0.5481",
        "domain": {
            "Knowledge": "0.5679",
            "Film & Television": "0.5917",
            "Sports Competition": "0.4867",
            "Artistic Performance": "0.5694",
            "Life Record": "0.5381",
            "Multilingual": "0.4889"
        },
        "sub_category": {
            "Humanity & History": "0.4333",
            "Literature & Art": "0.5778",
            "Biology & Medicine": "0.6889",
            "Finance & Commerce": "0.6222",
            "Astronomy": "0.5556",
            "Geography": "0.4667",
            "Law": "0.6556",
            "Life Tip": "0.6000",
            "Technology": "0.5111",
            "Animation": "0.5111",
            "Movie & TV Show": "0.5889",
            "Documentary": "0.5889",
            "News Report": "0.6778",
            "Esports": "0.4889",
            "Basketball": "0.3889",
            "Football": "0.5333",
            "Athletics": "0.5111",
            "Other Sports": "0.5111",
            "Stage Play": "0.7000",
            "Magic Show": "0.5444",
            "Variety Show": "0.4556",
            "Acrobatics": "0.5778",
            "Handicraft": "0.6667",
            "Food": "0.5000",
            "Fashion": "0.4778",
            "Daily Life": "0.5333",
            "Travel": "0.5556",
            "Pet & Animal": "0.6222",
            "Exercise": "0.4111",
            "Multilingual": "0.4889"
        },
        "task_type": {
            "Temporal Perception": "0.5455",
            "Spatial Perception": "0.6296",
            "Attribute Perception": "0.6802",
            "Action Recognition": "0.5208",
            "Object Recognition": "0.6356",
            "OCR Problems": "0.5827",
            "Counting Problem": "0.3657",
            "Temporal Reasoning": "0.3559",
            "Spatial Reasoning": "0.7321",
            "Action Reasoning": "0.4737",
            "Object Reasoning": "0.5176",
            "Information Synopsis": "0.6935"
        }
    }
}
```

When testing with subtitles:

```bash
torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-26B --verbose --nframe 16 --use-subtitle
```

The expected test results are:

```
{
    "short": {
        "overall": "0.6844",
        "domain": {
            "Knowledge": "0.6889",
            "Film & Television": "0.7250",
            "Sports Competition": "0.6200",
            "Artistic Performance": "0.7167",
            "Life Record": "0.7000",
            "Multilingual": "0.5667"
        },
        "sub_category": {
            "Humanity & History": "0.3667",
            "Literature & Art": "0.6000",
            "Biology & Medicine": "0.9000",
            "Finance & Commerce": "0.7333",
            "Astronomy": "0.7000",
            "Geography": "0.7333",
            "Law": "0.8333",
            "Life Tip": "0.7000",
            "Technology": "0.6333",
            "Animation": "0.7333",
            "Movie & TV Show": "0.7333",
            "Documentary": "0.5667",
            "News Report": "0.8667",
            "Esports": "0.6667",
            "Basketball": "0.4333",
            "Football": "0.6667",
            "Athletics": "0.7333",
            "Other Sports": "0.6000",
            "Stage Play": "0.8333",
            "Magic Show": "0.6000",
            "Variety Show": "0.7667",
            "Acrobatics": "0.6667",
            "Handicraft": "0.6667",
            "Food": "0.8333",
            "Fashion": "0.6667",
            "Daily Life": "0.7667",
            "Travel": "0.7667",
            "Pet & Animal": "0.7333",
            "Exercise": "0.4667",
            "Multilingual": "0.5667"
        },
        "task_type": {
            "Temporal Perception": "0.7778",
            "Spatial Perception": "0.7000",
            "Attribute Perception": "0.7869",
            "Action Recognition": "0.6336",
            "Object Recognition": "0.6905",
            "OCR Problems": "0.8070",
            "Counting Problem": "0.4080",
            "Temporal Reasoning": "0.7692",
            "Spatial Reasoning": "0.8519",
            "Action Reasoning": "0.7021",
            "Object Reasoning": "0.7125",
            "Information Synopsis": "0.8049"
        }
    },
    "medium": {
        "overall": "0.5456",
        "domain": {
            "Knowledge": "0.5852",
            "Film & Television": "0.6167",
            "Sports Competition": "0.4400",
            "Artistic Performance": "0.6333",
            "Life Record": "0.4714",
            "Multilingual": "0.6000"
        },
        "sub_category": {
            "Humanity & History": "0.5333",
            "Literature & Art": "0.5667",
            "Biology & Medicine": "0.6333",
            "Finance & Commerce": "0.6667",
            "Astronomy": "0.6667",
            "Geography": "0.5000",
            "Law": "0.6333",
            "Life Tip": "0.5667",
            "Technology": "0.5000",
            "Animation": "0.3333",
            "Movie & TV Show": "0.6333",
            "Documentary": "0.7000",
            "News Report": "0.8000",
            "Esports": "0.4333",
            "Basketball": "0.2667",
            "Football": "0.6000",
            "Athletics": "0.3667",
            "Other Sports": "0.5333",
            "Stage Play": "0.7667",
            "Magic Show": "0.6000",
            "Variety Show": "0.6000",
            "Acrobatics": "0.5667",
            "Handicraft": "0.6333",
            "Food": "0.3000",
            "Fashion": "0.4333",
            "Daily Life": "0.3667",
            "Travel": "0.6000",
            "Pet & Animal": "0.4667",
            "Exercise": "0.5000",
            "Multilingual": "0.6000"
        },
        "task_type": {
            "Temporal Perception": "0.4839",
            "Spatial Perception": "0.4762",
            "Attribute Perception": "0.5890",
            "Action Recognition": "0.4622",
            "Object Recognition": "0.6591",
            "OCR Problems": "0.4706",
            "Counting Problem": "0.3474",
            "Temporal Reasoning": "0.4247",
            "Spatial Reasoning": "0.8333",
            "Action Reasoning": "0.4310",
            "Object Reasoning": "0.6194",
            "Information Synopsis": "0.7949"
        }
    },
    "long": {
        "overall": "0.4833",
        "domain": {
            "Knowledge": "0.5296",
            "Film & Television": "0.5083",
            "Sports Competition": "0.4333",
            "Artistic Performance": "0.4583",
            "Life Record": "0.4667",
            "Multilingual": "0.4333"
        },
        "sub_category": {
            "Humanity & History": "0.4667",
            "Literature & Art": "0.5000",
            "Biology & Medicine": "0.7000",
            "Finance & Commerce": "0.6667",
            "Astronomy": "0.6000",
            "Geography": "0.3333",
            "Law": "0.5667",
            "Life Tip": "0.5000",
            "Technology": "0.4333",
            "Animation": "0.4000",
            "Movie & TV Show": "0.4667",
            "Documentary": "0.6667",
            "News Report": "0.5000",
            "Esports": "0.4667",
            "Basketball": "0.3667",
            "Football": "0.5667",
            "Athletics": "0.3333",
            "Other Sports": "0.4333",
            "Stage Play": "0.7667",
            "Magic Show": "0.4000",
            "Variety Show": "0.2000",
            "Acrobatics": "0.4667",
            "Handicraft": "0.6333",
            "Food": "0.3333",
            "Fashion": "0.4333",
            "Daily Life": "0.4667",
            "Travel": "0.3000",
            "Pet & Animal": "0.7000",
            "Exercise": "0.4000",
            "Multilingual": "0.4333"
        },
        "task_type": {
            "Temporal Perception": "0.0000",
            "Spatial Perception": "0.3333",
            "Attribute Perception": "0.5556",
            "Action Recognition": "0.4444",
            "Object Recognition": "0.4815",
            "OCR Problems": "0.6429",
            "Counting Problem": "0.3333",
            "Temporal Reasoning": "0.2967",
            "Spatial Reasoning": "0.7273",
            "Action Reasoning": "0.4611",
            "Object Reasoning": "0.4667",
            "Information Synopsis": "0.6748"
        }
    },
    "overall": {
        "overall": "0.5711",
        "domain": {
            "Knowledge": "0.6012",
            "Film & Television": "0.6167",
            "Sports Competition": "0.4978",
            "Artistic Performance": "0.6028",
            "Life Record": "0.5460",
            "Multilingual": "0.5333"
        },
        "sub_category": {
            "Humanity & History": "0.4556",
            "Literature & Art": "0.5556",
            "Biology & Medicine": "0.7444",
            "Finance & Commerce": "0.6889",
            "Astronomy": "0.6556",
            "Geography": "0.5222",
            "Law": "0.6778",
            "Life Tip": "0.5889",
            "Technology": "0.5222",
            "Animation": "0.4889",
            "Movie & TV Show": "0.6111",
            "Documentary": "0.6444",
            "News Report": "0.7222",
            "Esports": "0.5222",
            "Basketball": "0.3556",
            "Football": "0.6111",
            "Athletics": "0.4778",
            "Other Sports": "0.5222",
            "Stage Play": "0.7889",
            "Magic Show": "0.5333",
            "Variety Show": "0.5222",
            "Acrobatics": "0.5667",
            "Handicraft": "0.6444",
            "Food": "0.4889",
            "Fashion": "0.5111",
            "Daily Life": "0.5333",
            "Travel": "0.5556",
            "Pet & Animal": "0.6333",
            "Exercise": "0.4556",
            "Multilingual": "0.5333"
        },
        "task_type": {
            "Temporal Perception": "0.5273",
            "Spatial Perception": "0.5926",
            "Attribute Perception": "0.6937",
            "Action Recognition": "0.5304",
            "Object Recognition": "0.6469",
            "OCR Problems": "0.6259",
            "Counting Problem": "0.3731",
            "Temporal Reasoning": "0.3842",
            "Spatial Reasoning": "0.8214",
            "Action Reasoning": "0.4947",
            "Object Reasoning": "0.5551",
            "Information Synopsis": "0.7368"
        }
    }
}
```
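
Because small discrepancies between runs are normal, a reproduced result can be checked against the expected values with a tolerance rather than for exact equality. The following is a minimal sketch, not part of VLMEvalKit: the helper names `flatten` and `find_mismatches` and the tolerance of `0.01` are assumptions made for illustration, and `EXPECTED` contains only the four `overall` scores copied from the results above.

```python
import json

# Excerpt of the expected InternVL2-26B results with subtitles (values copied from above).
EXPECTED = json.loads("""
{
    "short": {"overall": "0.6844"},
    "medium": {"overall": "0.5456"},
    "long": {"overall": "0.4833"},
    "overall": {"overall": "0.5711"}
}
""")


def flatten(node, prefix=""):
    """Flatten nested result dicts into {"short/overall": 0.6844, ...}."""
    flat = {}
    for key, value in node.items():
        path = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = float(value)  # scores are stored as strings
    return flat


def find_mismatches(expected, actual, tol=0.01):
    """Return {path: (expected, actual)} for scores that differ by more than tol.

    The default tolerance is an assumption, not a value from the toolkit.
    """
    exp, act = flatten(expected), flatten(actual)
    return {
        path: (score, act.get(path))
        for path, score in exp.items()
        if path not in act or abs(act[path] - score) > tol
    }


# Hypothetical reproduced run: the "long" score deviates by more than the tolerance.
reproduced = json.loads(json.dumps(EXPECTED))
reproduced["long"]["overall"] = "0.4500"
print(find_mismatches(EXPECTED, reproduced))
```

The same helpers work on the full nested result, since `flatten` also walks the `domain`, `sub_category`, and `task_type` entries.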

````

````{tab} 40B

When testing without subtitles:

```bash
torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-40B --verbose --nframe 16
```

The expected test results are:

```
{
    "short": {
        "overall": "0.7200",
        "domain": {
            "Knowledge": "0.7222",
            "Film & Television": "0.7417",
            "Sports Competition": "0.6667",
            "Artistic Performance": "0.7583",
            "Life Record": "0.7476",
            "Multilingual": "0.5333"
        },
        "sub_category": {
            "Humanity & History": "0.4333",
            "Literature & Art": "0.6667",
            "Biology & Medicine": "0.9667",
            "Finance & Commerce": "0.8000",
            "Astronomy": "0.8000",
            "Geography": "0.6333",
            "Law": "0.7333",
            "Life Tip": "0.7333",
            "Technology": "0.7333",
            "Animation": "0.8000",
            "Movie & TV Show": "0.7333",
            "Documentary": "0.5667",
            "News Report": "0.8667",
            "Esports": "0.6667",
            "Basketball": "0.4333",
            "Football": "0.7667",
            "Athletics": "0.8000",
            "Other Sports": "0.6667",
            "Stage Play": "0.9000",
            "Magic Show": "0.6667",
            "Variety Show": "0.7667",
            "Acrobatics": "0.7000",
            "Handicraft": "0.8667",
            "Food": "0.7333",
            "Fashion": "0.7333",
            "Daily Life": "0.7333",
            "Travel": "0.7667",
            "Pet & Animal": "0.8000",
            "Exercise": "0.6000",
            "Multilingual": "0.5333"
        },
        "task_type": {
            "Temporal Perception": "0.8889",
            "Spatial Perception": "0.7333",
            "Attribute Perception": "0.8033",
            "Action Recognition": "0.6718",
            "Object Recognition": "0.7262",
            "OCR Problems": "0.8596",
            "Counting Problem": "0.4400",
            "Temporal Reasoning": "0.8462",
            "Spatial Reasoning": "0.8889",
            "Action Reasoning": "0.7660",
            "Object Reasoning": "0.7250",
            "Information Synopsis": "0.8415"
        }
    },
    "medium": {
        "overall": "0.5911",
        "domain": {
            "Knowledge": "0.6074",
            "Film & Television": "0.6417",
            "Sports Competition": "0.5067",
            "Artistic Performance": "0.6583",
            "Life Record": "0.5429",
            "Multilingual": "0.7333"
        },
        "sub_category": {
            "Humanity & History": "0.5333",
            "Literature & Art": "0.6333",
            "Biology & Medicine": "0.6000",
            "Finance & Commerce": "0.6000",
            "Astronomy": "0.5667",
            "Geography": "0.5333",
            "Law": "0.8000",
            "Life Tip": "0.5667",
            "Technology": "0.6333",
            "Animation": "0.4000",
            "Movie & TV Show": "0.7000",
            "Documentary": "0.8000",
            "News Report": "0.6667",
            "Esports": "0.6333",
            "Basketball": "0.1667",
            "Football": "0.5333",
            "Athletics": "0.6000",
            "Other Sports": "0.6000",
            "Stage Play": "0.7667",
            "Magic Show": "0.6333",
            "Variety Show": "0.5667",
            "Acrobatics": "0.6667",
            "Handicraft": "0.7000",
            "Food": "0.3667",
            "Fashion": "0.4333",
            "Daily Life": "0.5333",
            "Travel": "0.6333",
            "Pet & Animal": "0.4000",
            "Exercise": "0.7333",
            "Multilingual": "0.7333"
        },
        "task_type": {
            "Temporal Perception": "0.5484",
            "Spatial Perception": "0.6190",
            "Attribute Perception": "0.6712",
            "Action Recognition": "0.5126",
            "Object Recognition": "0.6667",
            "OCR Problems": "0.5000",
            "Counting Problem": "0.3579",
            "Temporal Reasoning": "0.5068",
            "Spatial Reasoning": "0.7778",
            "Action Reasoning": "0.5345",
            "Object Reasoning": "0.6716",
            "Information Synopsis": "0.8205"
        }
    },
    "long": {
        "overall": "0.5256",
        "domain": {
            "Knowledge": "0.5926",
            "Film & Television": "0.4583",
            "Sports Competition": "0.5267",
            "Artistic Performance": "0.5417",
            "Life Record": "0.4762",
            "Multilingual": "0.4667"
        },
        "sub_category": {
            "Humanity & History": "0.5333",
            "Literature & Art": "0.6000",
            "Biology & Medicine": "0.7000",
            "Finance & Commerce": "0.7667",
            "Astronomy": "0.5667",
            "Geography": "0.4333",
            "Law": "0.5000",
            "Life Tip": "0.6333",
            "Technology": "0.6000",
            "Animation": "0.3000",
            "Movie & TV Show": "0.5333",
            "Documentary": "0.5333",
            "News Report": "0.4667",
            "Esports": "0.6667",
            "Basketball": "0.3667",
            "Football": "0.6000",
            "Athletics": "0.4000",
            "Other Sports": "0.6000",
            "Stage Play": "0.7000",
            "Magic Show": "0.5667",
            "Variety Show": "0.3667",
            "Acrobatics": "0.5333",
            "Handicraft": "0.5667",
            "Food": "0.3667",
            "Fashion": "0.4000",
            "Daily Life": "0.4333",
            "Travel": "0.3667",
            "Pet & Animal": "0.6333",
            "Exercise": "0.5667",
            "Multilingual": "0.4667"
        },
        "task_type": {
            "Temporal Perception": "0.3333",
            "Spatial Perception": "0.3333",
            "Attribute Perception": "0.6667",
            "Action Recognition": "0.5397",
            "Object Recognition": "0.5185",
            "OCR Problems": "0.4286",
            "Counting Problem": "0.2917",
            "Temporal Reasoning": "0.3297",
            "Spatial Reasoning": "0.6364",
            "Action Reasoning": "0.5000",
            "Object Reasoning": "0.5292",
            "Information Synopsis": "0.7117"
        }
    },
    "overall": {
        "overall": "0.6122",
        "domain": {
            "Knowledge": "0.6407",
            "Film & Television": "0.6139",
            "Sports Competition": "0.5667",
            "Artistic Performance": "0.6528",
            "Life Record": "0.5889",
            "Multilingual": "0.5778"
        },
        "sub_category": {
            "Humanity & History": "0.5000",
            "Literature & Art": "0.6333",
            "Biology & Medicine": "0.7556",
            "Finance & Commerce": "0.7222",
            "Astronomy": "0.6444",
            "Geography": "0.5333",
            "Law": "0.6778",
            "Life Tip": "0.6444",
            "Technology": "0.6556",
            "Animation": "0.5000",
            "Movie & TV Show": "0.6556",
            "Documentary": "0.6333",
            "News Report": "0.6667",
            "Esports": "0.6556",
            "Basketball": "0.3222",
            "Football": "0.6333",
            "Athletics": "0.6000",
            "Other Sports": "0.6222",
            "Stage Play": "0.7889",
            "Magic Show": "0.6222",
            "Variety Show": "0.5667",
            "Acrobatics": "0.6333",
            "Handicraft": "0.7111",
            "Food": "0.4889",
            "Fashion": "0.5222",
            "Daily Life": "0.5667",
            "Travel": "0.5889",
            "Pet & Animal": "0.6111",
            "Exercise": "0.6333",
            "Multilingual": "0.5778"
        },
        "task_type": {
            "Temporal Perception": "0.6364",
            "Spatial Perception": "0.6667",
            "Attribute Perception": "0.7432",
            "Action Recognition": "0.5847",
            "Object Recognition": "0.6723",
            "OCR Problems": "0.6403",
            "Counting Problem": "0.3843",
            "Temporal Reasoning": "0.4407",
            "Spatial Reasoning": "0.8036",
            "Action Reasoning": "0.5509",
            "Object Reasoning": "0.6057",
            "Information Synopsis": "0.7709"
        }
    }
}
```

When testing with subtitles:

```bash
torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-40B --verbose --nframe 16 --use-subtitle
```

The expected test results are:

```
{
    "short": {
        "overall": "0.7278",
        "domain": {
            "Knowledge": "0.7370",
            "Film & Television": "0.7583",
            "Sports Competition": "0.6800",
            "Artistic Performance": "0.7750",
            "Life Record": "0.7286",
            "Multilingual": "0.5667"
        },
        "sub_category": {
            "Humanity & History": "0.4333",
            "Literature & Art": "0.6333",
            "Biology & Medicine": "0.9667",
            "Finance & Commerce": "0.8667",
            "Astronomy": "0.8333",
            "Geography": "0.7000",
            "Law": "0.7667",
            "Life Tip": "0.7000",
            "Technology": "0.7333",
            "Animation": "0.7667",
            "Movie & TV Show": "0.7000",
            "Documentary": "0.6667",
            "News Report": "0.9000",
            "Esports": "0.6667",
            "Basketball": "0.3667",
            "Football": "0.8000",
            "Athletics": "0.8333",
            "Other Sports": "0.7333",
            "Stage Play": "0.8667",
            "Magic Show": "0.7333",
            "Variety Show": "0.8000",
            "Acrobatics": "0.7000",
            "Handicraft": "0.7667",
            "Food": "0.8000",
            "Fashion": "0.6667",
            "Daily Life": "0.7333",
            "Travel": "0.7667",
            "Pet & Animal": "0.8000",
            "Exercise": "0.5667",
            "Multilingual": "0.5667"
        },
        "task_type": {
            "Temporal Perception": "0.8889",
            "Spatial Perception": "0.7333",
            "Attribute Perception": "0.8115",
            "Action Recognition": "0.6870",
            "Object Recognition": "0.7202",
            "OCR Problems": "0.8596",
            "Counting Problem": "0.4640",
            "Temporal Reasoning": "0.6923",
            "Spatial Reasoning": "0.8889",
            "Action Reasoning": "0.7234",
            "Object Reasoning": "0.7625",
            "Information Synopsis": "0.8780"
        }
    },
    "medium": {
        "overall": "0.6133",
        "domain": {
            "Knowledge": "0.6630",
            "Film & Television": "0.6583",
            "Sports Competition": "0.5133",
            "Artistic Performance": "0.6917",
            "Life Record": "0.5333",
            "Multilingual": "0.7333"
        },
        "sub_category": {
            "Humanity & History": "0.6000",
            "Literature & Art": "0.7000",
            "Biology & Medicine": "0.5667",
            "Finance & Commerce": "0.7333",
            "Astronomy": "0.7000",
            "Geography": "0.5667",
            "Law": "0.8333",
            "Life Tip": "0.6667",
            "Technology": "0.6000",
            "Animation": "0.4333",
            "Movie & TV Show": "0.7667",
            "Documentary": "0.7333",
            "News Report": "0.7000",
            "Esports": "0.5667",
            "Basketball": "0.2667",
            "Football": "0.5667",
            "Athletics": "0.5667",
            "Other Sports": "0.6000",
            "Stage Play": "0.8000",
            "Magic Show": "0.6333",
            "Variety Show": "0.6667",
            "Acrobatics": "0.6667",
            "Handicraft": "0.7000",
            "Food": "0.3333",
            "Fashion": "0.4000",
            "Daily Life": "0.5333",
            "Travel": "0.6333",
            "Pet & Animal": "0.4667",
            "Exercise": "0.6667",
            "Multilingual": "0.7333"
        },
        "task_type": {
            "Temporal Perception": "0.5484",
            "Spatial Perception": "0.5238",
            "Attribute Perception": "0.6438",
            "Action Recognition": "0.5798",
            "Object Recognition": "0.7121",
            "OCR Problems": "0.4706",
            "Counting Problem": "0.3684",
            "Temporal Reasoning": "0.5479",
            "Spatial Reasoning": "0.8333",
            "Action Reasoning": "0.6034",
            "Object Reasoning": "0.6791",
            "Information Synopsis": "0.8462"
        }
    },
    "long": {
        "overall": "0.5300",
        "domain": {
            "Knowledge": "0.5889",
            "Film & Television": "0.5000",
            "Sports Competition": "0.5000",
            "Artistic Performance": "0.6000",
            "Life Record": "0.4571",
            "Multilingual": "0.5000"
        },
        "sub_category": {
            "Humanity & History": "0.5333",
            "Literature & Art": "0.6000",
            "Biology & Medicine": "0.6333",
            "Finance & Commerce": "0.6333",
            "Astronomy": "0.6667",
            "Geography": "0.4000",
            "Law": "0.7000",
            "Life Tip": "0.6000",
            "Technology": "0.5333",
            "Animation": "0.3667",
            "Movie & TV Show": "0.5000",
            "Documentary": "0.5333",
            "News Report": "0.6000",
            "Esports": "0.6000",
            "Basketball": "0.3333",
            "Football": "0.6333",
            "Athletics": "0.4000",
            "Other Sports": "0.5333",
            "Stage Play": "0.8667",
            "Magic Show": "0.5667",
            "Variety Show": "0.4333",
            "Acrobatics": "0.5333",
            "Handicraft": "0.5667",
            "Food": "0.3333",
            "Fashion": "0.3667",
            "Daily Life": "0.4333",
            "Travel": "0.4000",
            "Pet & Animal": "0.6667",
            "Exercise": "0.4333",
            "Multilingual": "0.5000"
        },
        "task_type": {
            "Temporal Perception": "0.1667",
            "Spatial Perception": "0.3333",
            "Attribute Perception": "0.6296",
            "Action Recognition": "0.5714",
            "Object Recognition": "0.5185",
            "OCR Problems": "0.5714",
            "Counting Problem": "0.2708",
            "Temporal Reasoning": "0.3187",
            "Spatial Reasoning": "0.6364",
            "Action Reasoning": "0.4889",
            "Object Reasoning": "0.5417",
            "Information Synopsis": "0.7301"
        }
    },
    "overall": {
        "overall": "0.6237",
        "domain": {
            "Knowledge": "0.6630",
            "Film & Television": "0.6389",
            "Sports Competition": "0.5644",
            "Artistic Performance": "0.6889",
            "Life Record": "0.5730",
            "Multilingual": "0.6000"
        },
        "sub_category": {
            "Humanity & History": "0.5222",
            "Literature & Art": "0.6444",
            "Biology & Medicine": "0.7222",
            "Finance & Commerce": "0.7444",
            "Astronomy": "0.7333",
            "Geography": "0.5556",
            "Law": "0.7667",
            "Life Tip": "0.6556",
            "Technology": "0.6222",
            "Animation": "0.5222",
            "Movie & TV Show": "0.6556",
            "Documentary": "0.6444",
            "News Report": "0.7333",
            "Esports": "0.6111",
            "Basketball": "0.3222",
            "Football": "0.6667",
            "Athletics": "0.6000",
            "Other Sports": "0.6222",
            "Stage Play": "0.8444",
            "Magic Show": "0.6444",
            "Variety Show": "0.6333",
            "Acrobatics": "0.6333",
            "Handicraft": "0.6778",
            "Food": "0.4889",
            "Fashion": "0.4778",
            "Daily Life": "0.5667",
            "Travel": "0.6000",
            "Pet & Animal": "0.6444",
            "Exercise": "0.5556",
            "Multilingual": "0.6000"
        },
        "task_type": {
            "Temporal Perception": "0.6182",
            "Spatial Perception": "0.6296",
            "Attribute Perception": "0.7342",
            "Action Recognition": "0.6230",
            "Object Recognition": "0.6864",
            "OCR Problems": "0.6403",
            "Counting Problem": "0.3955",
            "Temporal Reasoning": "0.4407",
            "Spatial Reasoning": "0.8214",
            "Action Reasoning": "0.5509",
            "Object Reasoning": "0.6211",
            "Information Synopsis": "0.7957"
        }
    }
}
```
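
To read the two InternVL2-40B result sets side by side, the sketch below computes the accuracy gain from subtitles for each duration split, using only the `overall` scores copied from the results above. It also checks that the top-level `overall` score matches the unweighted mean of the `short`, `medium`, and `long` scores to four decimals, which holds for these numbers. The function names are hypothetical and chosen for illustration.

```python
# "overall" accuracies for InternVL2-40B, copied from the results above.
WITHOUT_SUBTITLES = {"short": 0.7200, "medium": 0.5911, "long": 0.5256, "overall": 0.6122}
WITH_SUBTITLES = {"short": 0.7278, "medium": 0.6133, "long": 0.5300, "overall": 0.6237}


def subtitle_gain(without, with_subs):
    """Accuracy difference (with subtitles minus without) per duration split."""
    return {split: round(with_subs[split] - without[split], 4) for split in without}


def mean_of_splits(scores):
    """Unweighted mean of the short, medium, and long accuracies."""
    return round(sum(scores[split] for split in ("short", "medium", "long")) / 3, 4)


for split, gain in subtitle_gain(WITHOUT_SUBTITLES, WITH_SUBTITLES).items():
    print(f"{split:>7}: {gain:+.4f}")

# Consistency check: the reported overall score equals the mean of the three splits.
for name, scores in (("without", WITHOUT_SUBTITLES), ("with", WITH_SUBTITLES)):
    print(name, mean_of_splits(scores), scores["overall"])
```

For these results, subtitles help most on the medium-length videos and least on the long ones.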

````

````{tab} 76B

When testing without subtitles:

```bash
torchrun --nproc-per-node=1 run.py --data Video-MME --model InternVL2-76B --verbose --nframe 16
```

The expected test results are:

```
{
    "short": {
        "overall": "0.7222",
        "domain": {
            "Knowledge": "0.7593",
            "Film & Television": "0.7167",
            "Sports Competition": "0.6800",
            "Artistic Performance": "0.7500",
            "Life Record": "0.7143",
            "Multilingual": "0.5667"
        },
        "sub_category": {
            "Humanity & History": "0.5333",
            "Literature & Art": "0.6667",
            "Biology & Medicine": "0.9333",
            "Finance & Commerce": "0.8333",
            "Astronomy": "0.7667",
            "Geography": "0.7333",
            "Law": "0.8000",
            "Life Tip": "0.7667",
            "Technology": "0.8000",
            "Animation": "0.8000",
            "Movie & TV Show": "0.6333",
            "Documentary": "0.5667",
            "News Report": "0.8667",
            "Esports": "0.6667",
            "Basketball": "0.6000",
            "Football": "0.7667",
            "Athletics": "0.7333",
            "Other Sports": "0.6333",
            "Stage Play": "0.8667",
            "Magic Show": "0.6667",
            "Variety Show": "0.7333",
            "Acrobatics": "0.7333",
            "Handicraft": "0.8000",
            "Food": "0.7333",
            "Fashion": "0.6000",
            "Daily Life": "0.7333",
            "Travel": "0.8667",
            "Pet & Animal": "0.7667",
            "Exercise": "0.5000",
            "Multilingual": "0.5667"
        },
        "task_type": {
            "Temporal Perception": "0.9444",
            "Spatial Perception": "0.8333",
            "Attribute Perception": "0.7869",
            "Action Recognition": "0.6870",
            "Object Recognition": "0.6786",
            "OCR Problems": "0.8596",
            "Counting Problem": "0.4400",
            "Temporal Reasoning": "0.6923",
            "Spatial Reasoning": "0.8519",
            "Action Reasoning": "0.8085",
            "Object Reasoning": "0.8000",
            "Information Synopsis": "0.8537"
        }
    },
    "medium": {
        "overall": "0.5800",
        "domain": {
            "Knowledge": "0.5741",
            "Film & Television": "0.6833",
            "Sports Competition": "0.5200",
            "Artistic Performance": "0.6833",
            "Life Record": "0.5095",
            "Multilingual": "0.6000"
        },
        "sub_category": {
            "Humanity & History": "0.5000",
            "Literature & Art": "0.6000",
            "Biology & Medicine": "0.5667",
            "Finance & Commerce": "0.6333",
            "Astronomy": "0.6000",
            "Geography": "0.5000",
            "Law": "0.6333",
            "Life Tip": "0.6000",
            "Technology": "0.5333",
            "Animation": "0.6000",
            "Movie & TV Show": "0.7667",
            "Documentary": "0.7667",
            "News Report": "0.6000",
            "Esports": "0.5000",
            "Basketball": "0.4000",
            "Football": "0.6000",
            "Athletics": "0.4667",
            "Other Sports": "0.6333",
            "Stage Play": "0.8000",
            "Magic Show": "0.6333",
            "Variety Show": "0.6000",
            "Acrobatics": "0.7000",
            "Handicraft": "0.7333",
            "Food": "0.3000",
            "Fashion": "0.4000",
            "Daily Life": "0.3667",
            "Travel": "0.5667",
            "Pet & Animal": "0.6333",
            "Exercise": "0.5667",
            "Multilingual": "0.6000"
        },
        "task_type": {
            "Temporal Perception": "0.5806",
            "Spatial Perception": "0.5238",
            "Attribute Perception": "0.6027",
            "Action Recognition": "0.5546",
            "Object Recognition": "0.6212",
            "OCR Problems": "0.5000",
            "Counting Problem": "0.4000",
            "Temporal Reasoning": "0.3836",
            "Spatial Reasoning": "0.7222",
            "Action Reasoning": "0.6207",
            "Object Reasoning": "0.6642",
            "Information Synopsis": "0.8077"
        }
    },
    "long": {
        "overall": "0.5333",
        "domain": {
            "Knowledge": "0.5926",
            "Film & Television": "0.4667",
            "Sports Competition": "0.5200",
            "Artistic Performance": "0.5750",
            "Life Record": "0.4810",
            "Multilingual": "0.5333"
        },
        "sub_category": {
            "Humanity & History": "0.5333",
            "Literature & Art": "0.6000",
            "Biology & Medicine": "0.5667",
            "Finance & Commerce": "0.6667",
            "Astronomy": "0.7333",
            "Geography": "0.5000",
            "Law": "0.5333",
            "Life Tip": "0.7000",
            "Technology": "0.5000",
            "Animation": "0.4000",
            "Movie & TV Show": "0.4000",
            "Documentary": "0.4667",
            "News Report": "0.6000",
            "Esports": "0.4333",
            "Basketball": "0.5333",
            "Football": "0.5667",
            "Athletics": "0.5000",
            "Other Sports": "0.5667",
            "Stage Play": "0.7333",
            "Magic Show": "0.5667",
            "Variety Show": "0.3333",
            "Acrobatics": "0.6667",
            "Handicraft": "0.5667",
            "Food": "0.3667",
            "Fashion": "0.5000",
            "Daily Life": "0.4667",
            "Travel": "0.3667",
            "Pet & Animal": "0.7000",
            "Exercise": "0.4000",
            "Multilingual": "0.5333"
        },
        "task_type": {
            "Temporal Perception": "0.5000",
            "Spatial Perception": "0.3333",
            "Attribute Perception": "0.5185",
            "Action Recognition": "0.5556",
            "Object Recognition": "0.5741",
            "OCR Problems": "0.3571",
            "Counting Problem": "0.3750",
            "Temporal Reasoning": "0.4835",
            "Spatial Reasoning": "0.6364",
            "Action Reasoning": "0.4778",
            "Object Reasoning": "0.5250",
            "Information Synopsis": "0.6748"
        }
    },
    "overall": {
        "overall": "0.6119",
        "domain": {
            "Knowledge": "0.6420",
            "Film & Television": "0.6222",
            "Sports Competition": "0.5733",
            "Artistic Performance": "0.6694",
            "Life Record": "0.5683",
            "Multilingual": "0.5667"
        },
        "sub_category": {
            "Humanity & History": "0.5222",
            "Literature & Art": "0.6222",
            "Biology & Medicine": "0.6889",
            "Finance & Commerce": "0.7111",
            "Astronomy": "0.7000",
            "Geography": "0.5778",
            "Law": "0.6556",
            "Life Tip": "0.6889",
            "Technology": "0.6111",
            "Animation": "0.6000",
            "Movie & TV Show": "0.6000",
            "Documentary": "0.6000",
            "News Report": "0.6889",
            "Esports": "0.5333",
            "Basketball": "0.5111",
            "Football": "0.6444",
            "Athletics": "0.5667",
            "Other Sports": "0.6111",
            "Stage Play": "0.8000",
            "Magic Show": "0.6222",
            "Variety Show": "0.5556",
            "Acrobatics": "0.7000",
            "Handicraft": "0.7000",
            "Food": "0.4667",
            "Fashion": "0.5000",
            "Daily Life": "0.5222",
            "Travel": "0.6000",
            "Pet & Animal": "0.7000",
            "Exercise": "0.4889",
            "Multilingual": "0.5667"
        },
        "task_type": {
            "Temporal Perception": "0.6909",
            "Spatial Perception": "0.6852",
            "Attribute Perception": "0.6937",
            "Action Recognition": "0.6102",
            "Object Recognition": "0.6412",
            "OCR Problems": "0.6331",
            "Counting Problem": "0.4142",
            "Temporal Reasoning": "0.4576",
            "Spatial Reasoning": "0.7679",
            "Action Reasoning": "0.5614",
            "Object Reasoning": "0.6145",
            "Information Synopsis": "0.7523"
        }
    }
}
```

When testing with subtitles:

```bash
torchrun --nproc-per-node=1 run.py --data Video-MME --model InternVL2-76B --verbose --nframe 16 --use-subtitle
```

The expected test results are:

```
{
    "short": {
        "overall": "0.7422",
        "domain": {
            "Knowledge": "0.7667",
            "Film & Television": "0.7583",
            "Sports Competition": "0.7067",
            "Artistic Performance": "0.7833",
            "Life Record": "0.7286",
            "Multilingual": "0.5667"
        },
        "sub_category": {
            "Humanity & History": "0.5000",
            "Literature & Art": "0.6667",
            "Biology & Medicine": "0.9667",
            "Finance & Commerce": "0.8667",
            "Astronomy": "0.8000",
            "Geography": "0.7667",
            "Law": "0.8000",
            "Life Tip": "0.7667",
            "Technology": "0.7667",
            "Animation": "0.7667",
            "Movie & TV Show": "0.7000",
            "Documentary": "0.6667",
            "News Report": "0.9000",
            "Esports": "0.7000",
            "Basketball": "0.5000",
            "Football": "0.7667",
            "Athletics": "0.8333",
            "Other Sports": "0.7333",
            "Stage Play": "0.8333",
            "Magic Show": "0.7667",
            "Variety Show": "0.8000",
            "Acrobatics": "0.7333",
            "Handicraft": "0.8000",
            "Food": "0.8000",
            "Fashion": "0.6333",
            "Daily Life": "0.7333",
            "Travel": "0.8667",
            "Pet & Animal": "0.7333",
            "Exercise": "0.5333",
            "Multilingual": "0.5667"
        },
        "task_type": {
            "Temporal Perception": "0.8889",
            "Spatial Perception": "0.8000",
            "Attribute Perception": "0.8115",
            "Action Recognition": "0.7023",
            "Object Recognition": "0.6964",
            "OCR Problems": "0.9123",
            "Counting Problem": "0.4720",
            "Temporal Reasoning": "0.7692",
            "Spatial Reasoning": "0.8519",
            "Action Reasoning": "0.8511",
            "Object Reasoning": "0.7875",
            "Information Synopsis": "0.8902"
        }
    },
    "medium": {
        "overall": "0.5900",
        "domain": {
            "Knowledge": "0.6111",
            "Film & Television": "0.7083",
            "Sports Competition": "0.4800",
            "Artistic Performance": "0.7083",
            "Life Record": "0.5048",
            "Multilingual": "0.6000"
        },
        "sub_category": {
            "Humanity & History": "0.6000",
            "Literature & Art": "0.6333",
            "Biology & Medicine": "0.5667",
            "Finance & Commerce": "0.6333",
            "Astronomy": "0.6333",
            "Geography": "0.6000",
            "Law": "0.6667",
            "Life Tip": "0.6333",
            "Technology": "0.5333",
            "Animation": "0.5333",
            "Movie & TV Show": "0.8000",
            "Documentary": "0.7667",
            "News Report": "0.7333",
            "Esports": "0.5000",
            "Basketball": "0.3000",
            "Football": "0.5667",
            "Athletics": "0.4667",
            "Other Sports": "0.5667",
            "Stage Play": "0.8333",
            "Magic Show": "0.6667",
            "Variety Show": "0.6000",
            "Acrobatics": "0.7333",
            "Handicraft": "0.7333",
            "Food": "0.3333",
            "Fashion": "0.3333",
            "Daily Life": "0.4333",
            "Travel": "0.5333",
            "Pet & Animal": "0.6333",
            "Exercise": "0.5333",
            "Multilingual": "0.6000"
        },
        "task_type": {
            "Temporal Perception": "0.5161",
            "Spatial Perception": "0.5238",
            "Attribute Perception": "0.6027",
            "Action Recognition": "0.5546",
            "Object Recognition": "0.6439",
            "OCR Problems": "0.5147",
            "Counting Problem": "0.3579",
            "Temporal Reasoning": "0.3973",
            "Spatial Reasoning": "0.8889",
            "Action Reasoning": "0.6207",
            "Object Reasoning": "0.6791",
            "Information Synopsis": "0.8718"
        }
    },
    "long": {
        "overall": "0.5522",
        "domain": {
            "Knowledge": "0.6222",
            "Film & Television": "0.5167",
            "Sports Competition": "0.5267",
            "Artistic Performance": "0.5750",
            "Life Record": "0.4905",
            "Multilingual": "0.5333"
        },
        "sub_category": {
            "Humanity & History": "0.6333",
            "Literature & Art": "0.7000",
            "Biology & Medicine": "0.6000",
            "Finance & Commerce": "0.7667",
            "Astronomy": "0.6000",
            "Geography": "0.5333",
            "Law": "0.6667",
            "Life Tip": "0.6333",
            "Technology": "0.4667",
            "Animation": "0.4667",
            "Movie & TV Show": "0.4333",
            "Documentary": "0.5333",
            "News Report": "0.6333",
            "Esports": "0.5333",
            "Basketball": "0.4333",
            "Football": "0.6333",
            "Athletics": "0.5000",
            "Other Sports": "0.5333",
            "Stage Play": "0.7333",
            "Magic Show": "0.5667",
            "Variety Show": "0.3667",
            "Acrobatics": "0.6333",
            "Handicraft": "0.5667",
            "Food": "0.3667",
            "Fashion": "0.4667",
            "Daily Life": "0.4667",
            "Travel": "0.4333",
            "Pet & Animal": "0.7000",
            "Exercise": "0.4333",
            "Multilingual": "0.5333"
        },
        "task_type": {
            "Temporal Perception": "0.5000",
            "Spatial Perception": "0.6667",
            "Attribute Perception": "0.6667",
            "Action Recognition": "0.5238",
            "Object Recognition": "0.5000",
            "OCR Problems": "0.5714",
            "Counting Problem": "0.2917",
            "Temporal Reasoning": "0.5165",
            "Spatial Reasoning": "0.6364",
            "Action Reasoning": "0.4944",
            "Object Reasoning": "0.5458",
            "Information Synopsis": "0.7239"
        }
    },
    "overall": {
        "overall": "0.6281",
        "domain": {
            "Knowledge": "0.6667",
            "Film & Television": "0.6611",
            "Sports Competition": "0.5711",
            "Artistic Performance": "0.6889",
            "Life Record": "0.5746",
            "Multilingual": "0.5667"
        },
        "sub_category": {
            "Humanity & History": "0.5778",
            "Literature & Art": "0.6667",
            "Biology & Medicine": "0.7111",
            "Finance & Commerce": "0.7556",
            "Astronomy": "0.6778",
            "Geography": "0.6333",
            "Law": "0.7111",
            "Life Tip": "0.6778",
            "Technology": "0.5889",
            "Animation": "0.5889",
            "Movie & TV Show": "0.6444",
            "Documentary": "0.6556",
            "News Report": "0.7556",
            "Esports": "0.5778",
            "Basketball": "0.4111",
            "Football": "0.6556",
            "Athletics": "0.6000",
            "Other Sports": "0.6111",
            "Stage Play": "0.8000",
            "Magic Show": "0.6667",
            "Variety Show": "0.5889",
            "Acrobatics": "0.7000",
            "Handicraft": "0.7000",
            "Food": "0.5000",
            "Fashion": "0.4778",
            "Daily Life": "0.5444",
            "Travel": "0.6111",
            "Pet & Animal": "0.6889",
            "Exercise": "0.5000",
            "Multilingual": "0.5667"
        },
        "task_type": {
            "Temporal Perception": "0.6364",
            "Spatial Perception": "0.6852",
            "Attribute Perception": "0.7252",
            "Action Recognition": "0.6102",
            "Object Recognition": "0.6469",
            "OCR Problems": "0.6835",
            "Counting Problem": "0.3993",
            "Temporal Reasoning": "0.4859",
            "Spatial Reasoning": "0.8214",
            "Action Reasoning": "0.5789",
            "Object Reasoning": "0.6278",
            "Information Synopsis": "0.8019"
        }
    }
}
```
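
To see where subtitles help the 76B model, the two result sets above can be compared per duration split. The following is a minimal sketch and not part of VLMEvalKit: it hard-codes the `overall` accuracies reported above, and the names `NO_SUBS`, `WITH_SUBS`, and `subtitle_gain` are hypothetical.

```python
# Overall Video-MME accuracy of InternVL2-76B at 16 frames, copied from the
# expected results above. All names here are illustrative, not VLMEvalKit APIs.
NO_SUBS = {"short": "0.7222", "medium": "0.5800", "long": "0.5333", "overall": "0.6119"}
WITH_SUBS = {"short": "0.7422", "medium": "0.5900", "long": "0.5522", "overall": "0.6281"}


def subtitle_gain(without, with_subs):
    """Return the accuracy change (in percentage points) per duration split."""
    return {
        split: round((float(with_subs[split]) - float(without[split])) * 100, 2)
        for split in without
    }


if __name__ == "__main__":
    for split, gain in subtitle_gain(NO_SUBS, WITH_SUBS).items():
        print(f"{split:>7}: {gain:+.2f} points")
```

With these numbers, subtitles add about 2.00 points on short videos, 1.00 on medium, 1.89 on long, and 1.62 overall.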

````

`````

### MMBench-Video

MMBench-Video is a benchmark that evaluates how well MLLMs understand video content. It addresses the limitations of traditional VideoQA benchmarks by using long-form videos sourced from YouTube, which better reflect real-world scenarios. Its free-form questions require temporal reasoning and are human-annotated according to a comprehensive capability taxonomy.

`````{tabs}

````{tab} 1B

When testing with 8 frames:

```bash
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-1B --verbose --nframe 8
```

The expected test results are:

```
{
    "coarse_all": {
        "CP": "1.11",
        "FP-S": "1.00",
        "FP-C": "0.84",
        "HL": "0.27",
        "LR": "0.71",
        "AR": "1.01",
        "RR": "1.17",
        "CSR": "0.77",
        "TR": "0.71",
        "Perception": "0.97",
        "Reasoning": "0.88",
        "Overall": "0.95"
    },
    "coarse_valid": {
        "CP": "1.11",
        "FP-S": "1.00",
        "FP-C": "0.84",
        "HL": "0.27",
        "LR": "0.71",
        "AR": "1.01",
        "RR": "1.17",
        "CSR": "0.77",
        "TR": "0.71",
        "Perception": "0.97",
        "Reasoning": "0.88",
        "Overall": "0.95"
    },
    "fine_all": {
        "Video Topic": "1.05",
        "Video Emotion": "1.27",
        "Video Scene": "0.84",
        "Video Style": "1.38",
        "OCR": "0.87",
        "Object Recognition": "1.07",
        "Attribute Recognition": "1.41",
        "Event Recognition": "0.93",
        "Human Motion": "0.84",
        "Counting": "0.99",
        "Spatial Relationship": "1.16",
        "Human-object Interaction": "0.80",
        "Human Interaction": "0.70",
        "Hallucination": "0.27",
        "Structuralized Image-Text Understanding": "0.97",
        "Mathematical Calculation": "0.31",
        "Physical Property": "0.78",
        "Function Reasoning": "0.95",
        "Identity Reasoning": "1.30",
        "Natural Relation": "1.04",
        "Physical Relation": "0.92",
        "Social Relation": "1.48",
        "Common Sense Reasoning": "0.77",
        "Counterfactual Reasoning": "0.80",
        "Causal Reasoning": "0.67",
        "Future Prediction": "0.77"
    },
    "fine_valid": {
        "Video Topic": "1.05",
        "Video Emotion": "1.27",
        "Video Scene": "0.84",
        "Video Style": "1.38",
        "OCR": "0.87",
        "Object Recognition": "1.07",
        "Attribute Recognition": "1.41",
        "Event Recognition": "0.93",
        "Human Motion": "0.84",
        "Counting": "0.99",
        "Spatial Relationship": "1.16",
        "Human-object Interaction": "0.80",
        "Human Interaction": "0.70",
        "Hallucination": "0.27",
        "Structuralized Image-Text Understanding": "0.97",
        "Mathematical Calculation": "0.31",
        "Physical Property": "0.78",
        "Function Reasoning": "0.95",
        "Identity Reasoning": "1.30",
        "Natural Relation": "1.04",
        "Physical Relation": "0.92",
        "Social Relation": "1.48",
        "Common Sense Reasoning": "0.77",
        "Counterfactual Reasoning": "0.80",
        "Causal Reasoning": "0.67",
        "Future Prediction": "0.77"
    }
}
```
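
Because minor discrepancies between runs are expected, it can help to compare a local run against the expected results with a tolerance instead of checking for exact equality. The sketch below assumes the results are available as nested JSON with string scores, as shown above. The helper names `flatten` and `find_mismatches`, the tolerance of `0.05`, and the "actual" scores are all assumptions made for illustration.

```python
import json


def flatten(results, prefix=""):
    """Flatten nested result dicts into {"coarse_all/CP": 1.11, ...}."""
    flat = {}
    for key, value in results.items():
        path = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = float(value)  # scores are stored as strings
    return flat


def find_mismatches(expected, actual, tol=0.05):
    """Map each metric that is missing or off by more than tol to (expected, actual)."""
    exp, act = flatten(expected), flatten(actual)
    mismatches = {}
    for path, score in exp.items():
        if path not in act or abs(act[path] - score) > tol:
            mismatches[path] = (score, act.get(path))
    return mismatches


if __name__ == "__main__":
    # A small excerpt of the expected 8-frame results for InternVL2-1B.
    expected = json.loads('{"coarse_all": {"CP": "1.11", "HL": "0.27", "Overall": "0.95"}}')
    # Hypothetical scores from a local run, for illustration only.
    actual = json.loads('{"coarse_all": {"CP": "1.13", "HL": "0.40", "Overall": "0.96"}}')
    print(find_mismatches(expected, actual))
```

In this example only `coarse_all/HL` is flagged, since it is the only metric that differs by more than the chosen tolerance.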

When testing with 16 frames:

```bash
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-1B --verbose --nframe 16
```

The expected test results are:

```
{
    "coarse_all": {
        "CP": "1.21",
        "FP-S": "1.03",
        "FP-C": "0.85",
        "HL": "0.29",
        "LR": "0.73",
        "AR": "1.00",
        "RR": "1.26",
        "CSR": "0.70",
        "TR": "0.74",
        "Perception": "1.00",
        "Reasoning": "0.90",
        "Overall": "0.98"
    },
    "coarse_valid": {
        "CP": "1.21",
        "FP-S": "1.03",
        "FP-C": "0.85",
        "HL": "0.29",
        "LR": "0.73",
        "AR": "1.00",
        "RR": "1.26",
        "CSR": "0.70",
        "TR": "0.74",
        "Perception": "1.00",
        "Reasoning": "0.90",
        "Overall": "0.98"
    },
    "fine_all": {
        "Video Topic": "1.15",
        "Video Emotion": "1.37",
        "Video Scene": "0.96",
        "Video Style": "1.43",
        "OCR": "0.96",
        "Object Recognition": "1.08",
        "Attribute Recognition": "1.47",
        "Event Recognition": "0.86",
        "Human Motion": "0.77",
        "Counting": "0.94",
        "Spatial Relationship": "1.09",
        "Human-object Interaction": "0.85",
        "Human Interaction": "0.64",
        "Hallucination": "0.29",
        "Structuralized Image-Text Understanding": "0.96",
        "Mathematical Calculation": "0.38",
        "Physical Property": "0.76",
        "Function Reasoning": "0.89",
        "Identity Reasoning": "1.36",
        "Natural Relation": "1.00",
        "Physical Relation": "1.10",
        "Social Relation": "1.54",
        "Common Sense Reasoning": "0.70",
        "Counterfactual Reasoning": "0.88",
        "Causal Reasoning": "0.72",
        "Future Prediction": "0.74"
    },
    "fine_valid": {
        "Video Topic": "1.15",
        "Video Emotion": "1.37",
        "Video Scene": "0.96",
        "Video Style": "1.43",
        "OCR": "0.96",
        "Object Recognition": "1.08",
        "Attribute Recognition": "1.47",
        "Event Recognition": "0.86",
        "Human Motion": "0.77",
        "Counting": "0.94",
        "Spatial Relationship": "1.09",
        "Human-object Interaction": "0.85",
        "Human Interaction": "0.64",
        "Hallucination": "0.29",
        "Structuralized Image-Text Understanding": "0.96",
        "Mathematical Calculation": "0.38",
        "Physical Property": "0.76",
        "Function Reasoning": "0.89",
        "Identity Reasoning": "1.36",
        "Natural Relation": "1.00",
        "Physical Relation": "1.10",
        "Social Relation": "1.54",
        "Common Sense Reasoning": "0.70",
        "Counterfactual Reasoning": "0.88",
        "Causal Reasoning": "0.72",
        "Future Prediction": "0.74"
    }
}
```
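
The 8-frame and 16-frame runs differ only in the value passed to `--nframe`, so the commands can be generated instead of typed by hand. The following is a minimal sketch that only assembles the argument list using the flags shown in this document; `build_command` is a hypothetical helper, and `--use-subtitle` appears above only for Video-MME.

```python
import shlex


def build_command(data, model, nframe, gpus=8, use_subtitle=False):
    """Assemble the torchrun argument list in the form used throughout this section."""
    cmd = [
        "torchrun", f"--nproc-per-node={gpus}", "run.py",
        "--data", data, "--model", model, "--verbose",
        "--nframe", str(nframe),
    ]
    if use_subtitle:
        # Flag shown above for Video-MME runs with subtitles.
        cmd.append("--use-subtitle")
    return cmd


if __name__ == "__main__":
    # Print the two MMBench-Video commands for InternVL2-1B.
    for nframe in (8, 16):
        print(shlex.join(build_command("MMBench-Video", "InternVL2-1B", nframe)))
```

The returned list can be passed to `subprocess.run` from the directory that contains `run.py` if the runs should be launched from Python.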

````

````{tab} 2B

When testing with 8 frames:

```bash
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-2B --verbose --nframe 8
```

The expected test results are:

```
{
    "coarse_all": {
        "CP": "1.16",
        "FP-S": "1.05",
        "FP-C": "0.81",
        "HL": "0.26",
        "LR": "0.50",
        "AR": "1.12",
        "RR": "1.11",
        "CSR": "0.81",
        "TR": "0.83",
        "Perception": "1.00",
        "Reasoning": "0.91",
        "Overall": "0.97"
    },
    "coarse_valid": {
        "CP": "1.16",
        "FP-S": "1.05",
        "FP-C": "0.81",
        "HL": "0.26",
        "LR": "0.50",
        "AR": "1.12",
        "RR": "1.11",
        "CSR": "0.81",
        "TR": "0.83",
        "Perception": "1.00",
        "Reasoning": "0.91",
        "Overall": "0.97"
    },
    "fine_all": {
        "Video Topic": "1.12",
        "Video Emotion": "1.29",
        "Video Scene": "0.99",
        "Video Style": "1.24",
        "OCR": "0.94",
        "Object Recognition": "1.04",
        "Attribute Recognition": "1.46",
        "Event Recognition": "1.02",
        "Human Motion": "0.66",
        "Counting": "1.16",
        "Spatial Relationship": "0.93",
        "Human-object Interaction": "0.77",
        "Human Interaction": "0.77",
        "Hallucination": "0.26",
        "Structuralized Image-Text Understanding": "0.69",
        "Mathematical Calculation": "0.22",
        "Physical Property": "0.94",
        "Function Reasoning": "1.09",
        "Identity Reasoning": "1.32",
        "Natural Relation": "0.93",
        "Physical Relation": "0.98",
        "Social Relation": "1.33",
        "Common Sense Reasoning": "0.81",
        "Counterfactual Reasoning": "1.00",
        "Causal Reasoning": "0.76",
        "Future Prediction": "0.87"
    },
    "fine_valid": {
        "Video Topic": "1.12",
        "Video Emotion": "1.29",
        "Video Scene": "0.99",
        "Video Style": "1.24",
        "OCR": "0.94",
        "Object Recognition": "1.04",
        "Attribute Recognition": "1.46",
        "Event Recognition": "1.02",
        "Human Motion": "0.66",
        "Counting": "1.16",
        "Spatial Relationship": "0.93",
        "Human-object Interaction": "0.77",
        "Human Interaction": "0.77",
        "Hallucination": "0.26",
        "Structuralized Image-Text Understanding": "0.69",
        "Mathematical Calculation": "0.22",
        "Physical Property": "0.94",
        "Function Reasoning": "1.09",
        "Identity Reasoning": "1.32",
        "Natural Relation": "0.93",
        "Physical Relation": "0.98",
        "Social Relation": "1.33",
        "Common Sense Reasoning": "0.81",
        "Counterfactual Reasoning": "1.00",
        "Causal Reasoning": "0.76",
        "Future Prediction": "0.87"
    }
}
```

When testing with 16 frames:

```bash
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-2B --verbose --nframe 16
```

The expected test results are:

```
{
    "coarse_all": {
        "CP": "1.22",
        "FP-S": "1.13",
        "FP-C": "0.80",
        "HL": "0.34",
        "LR": "0.64",
        "AR": "1.01",
        "RR": "1.23",
        "CSR": "0.88",
        "TR": "0.87",
        "Perception": "1.06",
        "Reasoning": "0.95",
        "Overall": "1.03"
    },
    "coarse_valid": {
        "CP": "1.22",
        "FP-S": "1.13",
        "FP-C": "0.80",
        "HL": "0.34",
        "LR": "0.64",
        "AR": "1.01",
        "RR": "1.23",
        "CSR": "0.88",
        "TR": "0.87",
        "Perception": "1.06",
        "Reasoning": "0.95",
        "Overall": "1.03"
    },
    "fine_all": {
        "Video Topic": "1.14",
        "Video Emotion": "1.29",
        "Video Scene": "1.17",
        "Video Style": "1.21",
        "OCR": "1.02",
        "Object Recognition": "1.13",
        "Attribute Recognition": "1.59",
        "Event Recognition": "0.99",
        "Human Motion": "0.72",
        "Counting": "1.24",
        "Spatial Relationship": "1.02",
        "Human-object Interaction": "0.67",
        "Human Interaction": "0.85",
        "Hallucination": "0.34",
        "Structuralized Image-Text Understanding": "0.79",
        "Mathematical Calculation": "0.40",
        "Physical Property": "0.85",
        "Function Reasoning": "1.07",
        "Identity Reasoning": "1.11",
        "Natural Relation": "1.15",
        "Physical Relation": "1.00",
        "Social Relation": "1.48",
        "Common Sense Reasoning": "0.88",
        "Counterfactual Reasoning": "1.10",
        "Causal Reasoning": "0.82",
        "Future Prediction": "0.81"
    },
    "fine_valid": {
        "Video Topic": "1.14",
        "Video Emotion": "1.29",
        "Video Scene": "1.17",
        "Video Style": "1.21",
        "OCR": "1.02",
        "Object Recognition": "1.13",
        "Attribute Recognition": "1.59",
        "Event Recognition": "0.99",
        "Human Motion": "0.72",
        "Counting": "1.24",
        "Spatial Relationship": "1.02",
        "Human-object Interaction": "0.67",
        "Human Interaction": "0.85",
        "Hallucination": "0.34",
        "Structuralized Image-Text Understanding": "0.79",
        "Mathematical Calculation": "0.40",
        "Physical Property": "0.85",
        "Function Reasoning": "1.07",
        "Identity Reasoning": "1.11",
        "Natural Relation": "1.15",
        "Physical Relation": "1.00",
        "Social Relation": "1.48",
        "Common Sense Reasoning": "0.88",
        "Counterfactual Reasoning": "1.10",
        "Causal Reasoning": "0.82",
        "Future Prediction": "0.81"
    }
}
```

````

````{tab} 4B

When testing with 8 frames:

```bash
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-4B --verbose --nframe 8
```

The expected test results are:

```
{
    "coarse_all": {
        "CP": "1.34",
        "FP-S": "1.16",
        "FP-C": "0.97",
        "HL": "0.13",
        "LR": "0.58",
        "AR": "1.16",
        "RR": "1.26",
        "CSR": "1.02",
        "TR": "0.99",
        "Perception": "1.13",
        "Reasoning": "1.03",
        "Overall": "1.10"
    },
    "coarse_valid": {
        "CP": "1.34",
        "FP-S": "1.16",
        "FP-C": "0.97",
        "HL": "0.13",
        "LR": "0.58",
        "AR": "1.16",
        "RR": "1.26",
        "CSR": "1.02",
        "TR": "0.99",
        "Perception": "1.13",
        "Reasoning": "1.03",
        "Overall": "1.10"
    },
    "fine_all": {
        "Video Topic": "1.30",
        "Video Emotion": "1.43",
        "Video Scene": "1.18",
        "Video Style": "1.62",
        "OCR": "0.98",
        "Object Recognition": "1.24",
        "Attribute Recognition": "1.53",
        "Event Recognition": "1.11",
        "Human Motion": "0.95",
        "Counting": "1.31",
        "Spatial Relationship": "1.07",
        "Human-object Interaction": "0.95",
        "Human Interaction": "0.95",
        "Hallucination": "0.13",
        "Structuralized Image-Text Understanding": "0.75",
        "Mathematical Calculation": "0.33",
        "Physical Property": "1.11",
        "Function Reasoning": "1.07",
        "Identity Reasoning": "1.30",
        "Natural Relation": "0.96",
        "Physical Relation": "1.25",
        "Social Relation": "1.41",
        "Common Sense Reasoning": "1.02",
        "Counterfactual Reasoning": "0.97",
        "Causal Reasoning": "0.98",
        "Future Prediction": "1.02"
    },
    "fine_valid": {
        "Video Topic": "1.30",
        "Video Emotion": "1.43",
        "Video Scene": "1.18",
        "Video Style": "1.62",
        "OCR": "0.98",
        "Object Recognition": "1.24",
        "Attribute Recognition": "1.53",
        "Event Recognition": "1.11",
        "Human Motion": "0.95",
        "Counting": "1.31",
        "Spatial Relationship": "1.07",
        "Human-object Interaction": "0.95",
        "Human Interaction": "0.95",
        "Hallucination": "0.13",
        "Structuralized Image-Text Understanding": "0.75",
        "Mathematical Calculation": "0.33",
        "Physical Property": "1.11",
        "Function Reasoning": "1.07",
        "Identity Reasoning": "1.30",
        "Natural Relation": "0.96",
        "Physical Relation": "1.25",
        "Social Relation": "1.41",
        "Common Sense Reasoning": "1.02",
        "Counterfactual Reasoning": "0.97",
        "Causal Reasoning": "0.98",
        "Future Prediction": "1.02"
    }
}
```

When testing with 16 frames:

```bash
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-4B --verbose --nframe 16
```

The expected test results are:

```
{
    "coarse_all": {
        "CP": "1.38",
        "FP-S": "1.27",
        "FP-C": "1.03",
        "HL": "0.15",
        "LR": "0.73",
        "AR": "1.24",
        "RR": "1.29",
        "CSR": "1.17",
        "TR": "0.99",
        "Perception": "1.22",
        "Reasoning": "1.09",
        "Overall": "1.18"
    },
    "coarse_valid": {
        "CP": "1.38",
        "FP-S": "1.27",
        "FP-C": "1.03",
        "HL": "0.15",
        "LR": "0.73",
        "AR": "1.24",
        "RR": "1.29",
        "CSR": "1.17",
        "TR": "0.99",
        "Perception": "1.22",
        "Reasoning": "1.09",
        "Overall": "1.18"
    },
    "fine_all": {
        "Video Topic": "1.31",
        "Video Emotion": "1.47",
        "Video Scene": "1.22",
        "Video Style": "1.74",
        "OCR": "1.19",
        "Object Recognition": "1.29",
        "Attribute Recognition": "1.62",
        "Event Recognition": "1.13",
        "Human Motion": "1.02",
        "Counting": "1.25",
        "Spatial Relationship": "1.16",
        "Human-object Interaction": "0.99",
        "Human Interaction": "1.00",
        "Hallucination": "0.15",
        "Structuralized Image-Text Understanding": "0.87",
        "Mathematical Calculation": "0.51",
        "Physical Property": "1.17",
        "Function Reasoning": "1.05",
        "Identity Reasoning": "1.49",
        "Natural Relation": "1.00",
        "Physical Relation": "1.25",
        "Social Relation": "1.46",
        "Common Sense Reasoning": "1.17",
        "Counterfactual Reasoning": "1.05",
        "Causal Reasoning": "0.96",
        "Future Prediction": "1.04"
    },
    "fine_valid": {
        "Video Topic": "1.31",
        "Video Emotion": "1.47",
        "Video Scene": "1.22",
        "Video Style": "1.74",
        "OCR": "1.19",
        "Object Recognition": "1.29",
        "Attribute Recognition": "1.62",
        "Event Recognition": "1.13",
        "Human Motion": "1.02",
        "Counting": "1.25",
        "Spatial Relationship": "1.16",
        "Human-object Interaction": "0.99",
        "Human Interaction": "1.00",
        "Hallucination": "0.15",
        "Structuralized Image-Text Understanding": "0.87",
        "Mathematical Calculation": "0.51",
        "Physical Property": "1.17",
        "Function Reasoning": "1.05",
        "Identity Reasoning": "1.49",
        "Natural Relation": "1.00",
        "Physical Relation": "1.25",
        "Social Relation": "1.46",
        "Common Sense Reasoning": "1.17",
        "Counterfactual Reasoning": "1.05",
        "Causal Reasoning": "0.96",
        "Future Prediction": "1.04"
    }
}
```

````

````{tab} 8B

When testing with 8 frames:

```bash
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-8B --verbose --nframe 8
```

The expected test results are:

```
{
    "coarse_all": {
        "CP": "1.36",
        "FP-S": "1.26",
        "FP-C": "1.07",
        "HL": "0.32",
        "LR": "0.83",
        "AR": "1.19",
        "RR": "1.33",
        "CSR": "1.14",
        "TR": "1.02",
        "Perception": "1.22",
        "Reasoning": "1.12",
        "Overall": "1.19"
    },
    "coarse_valid": {
        "CP": "1.36",
        "FP-S": "1.26",
        "FP-C": "1.07",
        "HL": "0.32",
        "LR": "0.83",
        "AR": "1.19",
        "RR": "1.33",
        "CSR": "1.14",
        "TR": "1.02",
        "Perception": "1.22",
        "Reasoning": "1.12",
        "Overall": "1.19"
    },
    "fine_all": {
        "Video Topic": "1.23",
        "Video Emotion": "1.49",
        "Video Scene": "1.22",
        "Video Style": "1.67",
        "OCR": "1.14",
        "Object Recognition": "1.35",
        "Attribute Recognition": "1.66",
        "Event Recognition": "1.18",
        "Human Motion": "0.90",
        "Counting": "1.31",
        "Spatial Relationship": "1.24",
        "Human-object Interaction": "1.05",
        "Human Interaction": "1.02",
        "Hallucination": "0.32",
        "Structuralized Image-Text Understanding": "1.03",
        "Mathematical Calculation": "0.53",
        "Physical Property": "1.24",
        "Function Reasoning": "1.05",
        "Identity Reasoning": "1.26",
        "Natural Relation": "1.00",
        "Physical Relation": "1.27",
        "Social Relation": "1.56",
        "Common Sense Reasoning": "1.14",
        "Counterfactual Reasoning": "0.95",
        "Causal Reasoning": "1.07",
        "Future Prediction": "0.98"
    },
    "fine_valid": {
        "Video Topic": "1.23",
        "Video Emotion": "1.49",
        "Video Scene": "1.22",
        "Video Style": "1.67",
        "OCR": "1.14",
        "Object Recognition": "1.35",
        "Attribute Recognition": "1.66",
        "Event Recognition": "1.18",
        "Human Motion": "0.90",
        "Counting": "1.31",
        "Spatial Relationship": "1.24",
        "Human-object Interaction": "1.05",
        "Human Interaction": "1.02",
        "Hallucination": "0.32",
        "Structuralized Image-Text Understanding": "1.03",
        "Mathematical Calculation": "0.53",
        "Physical Property": "1.24",
        "Function Reasoning": "1.05",
        "Identity Reasoning": "1.26",
        "Natural Relation": "1.00",
        "Physical Relation": "1.27",
        "Social Relation": "1.56",
        "Common Sense Reasoning": "1.14",
        "Counterfactual Reasoning": "0.95",
        "Causal Reasoning": "1.07",
        "Future Prediction": "0.98"
    }
}
```

When testing with 16 frames:

```bash
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-8B --verbose --nframe 16
```

The expected test results are:

```
{
    "coarse_all": {
        "CP": "1.45",
        "FP-S": "1.40",
        "FP-C": "1.13",
        "HL": "0.18",
        "LR": "0.90",
        "AR": "1.32",
        "RR": "1.45",
        "CSR": "1.19",
        "TR": "1.04",
        "Perception": "1.32",
        "Reasoning": "1.18",
        "Overall": "1.28"
    },
    "coarse_valid": {
        "CP": "1.45",
        "FP-S": "1.40",
        "FP-C": "1.13",
        "HL": "0.18",
        "LR": "0.90",
        "AR": "1.32",
        "RR": "1.45",
        "CSR": "1.19",
        "TR": "1.04",
        "Perception": "1.32",
        "Reasoning": "1.18",
        "Overall": "1.28"
    },
    "fine_all": {
        "Video Topic": "1.38",
        "Video Emotion": "1.57",
        "Video Scene": "1.27",
        "Video Style": "1.69",
        "OCR": "1.32",
        "Object Recognition": "1.40",
        "Attribute Recognition": "1.80",
        "Event Recognition": "1.18",
        "Human Motion": "1.15",
        "Counting": "1.44",
        "Spatial Relationship": "1.22",
        "Human-object Interaction": "1.15",
        "Human Interaction": "1.03",
        "Hallucination": "0.18",
        "Structuralized Image-Text Understanding": "1.13",
        "Mathematical Calculation": "0.56",
        "Physical Property": "1.20",
        "Function Reasoning": "1.05",
        "Identity Reasoning": "1.72",
        "Natural Relation": "0.93",
        "Physical Relation": "1.45",
        "Social Relation": "1.70",
        "Common Sense Reasoning": "1.19",
        "Counterfactual Reasoning": "1.07",
        "Causal Reasoning": "1.04",
        "Future Prediction": "1.06"
    },
    "fine_valid": {
        "Video Topic": "1.38",
        "Video Emotion": "1.57",
        "Video Scene": "1.27",
        "Video Style": "1.69",
        "OCR": "1.32",
        "Object Recognition": "1.40",
        "Attribute Recognition": "1.80",
        "Event Recognition": "1.18",
        "Human Motion": "1.15",
        "Counting": "1.44",
        "Spatial Relationship": "1.22",
        "Human-object Interaction": "1.15",
        "Human Interaction": "1.03",
        "Hallucination": "0.18",
        "Structuralized Image-Text Understanding": "1.13",
        "Mathematical Calculation": "0.56",
        "Physical Property": "1.20",
        "Function Reasoning": "1.05",
        "Identity Reasoning": "1.72",
        "Natural Relation": "0.93",
        "Physical Relation": "1.45",
        "Social Relation": "1.70",
        "Common Sense Reasoning": "1.19",
        "Counterfactual Reasoning": "1.07",
        "Causal Reasoning": "1.04",
        "Future Prediction": "1.06"
    }
}
```

````

````{tab} 26B

When testing with 8 frames:

```bash
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-26B --verbose --nframe 8
```

The expected test results are:

```
{
    "coarse_all": {
        "CP": "1.47",
        "FP-S": "1.32",
        "FP-C": "1.07",
        "HL": "0.35",
        "LR": "1.04",
        "AR": "1.42",
        "RR": "1.43",
        "CSR": "1.16",
        "TR": "1.04",
        "Perception": "1.28",
        "Reasoning": "1.22",
        "Overall": "1.27"
    },
    "coarse_valid": {
        "CP": "1.47",
        "FP-S": "1.32",
        "FP-C": "1.07",
        "HL": "0.35",
        "LR": "1.04",
        "AR": "1.42",
        "RR": "1.43",
        "CSR": "1.16",
        "TR": "1.04",
        "Perception": "1.28",
        "Reasoning": "1.22",
        "Overall": "1.27"
    },
    "fine_all": {
        "Video Topic": "1.35",
        "Video Emotion": "1.47",
        "Video Scene": "1.51",
        "Video Style": "1.69",
        "OCR": "1.21",
        "Object Recognition": "1.37",
        "Attribute Recognition": "1.82",
        "Event Recognition": "1.16",
        "Human Motion": "0.97",
        "Counting": "1.43",
        "Spatial Relationship": "1.20",
        "Human-object Interaction": "1.05",
        "Human Interaction": "1.02",
        "Hallucination": "0.35",
        "Structuralized Image-Text Understanding": "1.22",
        "Mathematical Calculation": "0.76",
        "Physical Property": "1.43",
        "Function Reasoning": "1.29",
        "Identity Reasoning": "1.55",
        "Natural Relation": "1.33",
        "Physical Relation": "1.12",
        "Social Relation": "1.78",
        "Common Sense Reasoning": "1.16",
        "Counterfactual Reasoning": "1.05",
        "Causal Reasoning": "1.05",
        "Future Prediction": "1.06"
    },
    "fine_valid": {
        "Video Topic": "1.35",
        "Video Emotion": "1.47",
        "Video Scene": "1.51",
        "Video Style": "1.69",
        "OCR": "1.21",
        "Object Recognition": "1.37",
        "Attribute Recognition": "1.82",
        "Event Recognition": "1.16",
        "Human Motion": "0.97",
        "Counting": "1.43",
        "Spatial Relationship": "1.20",
        "Human-object Interaction": "1.05",
        "Human Interaction": "1.02",
        "Hallucination": "0.35",
        "Structuralized Image-Text Understanding": "1.22",
        "Mathematical Calculation": "0.76",
        "Physical Property": "1.43",
        "Function Reasoning": "1.29",
        "Identity Reasoning": "1.55",
        "Natural Relation": "1.33",
        "Physical Relation": "1.12",
        "Social Relation": "1.78",
        "Common Sense Reasoning": "1.16",
        "Counterfactual Reasoning": "1.05",
        "Causal Reasoning": "1.06",
        "Future Prediction": "1.06"
    }
}
```

When testing with 16 frames:

```bash
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-26B --verbose --nframe 16
```

The expected test results are:

```
{
    "coarse_all": {
        "CP": "1.56",
        "FP-S": "1.48",
        "FP-C": "1.23",
        "HL": "0.52",
        "LR": "1.06",
        "AR": "1.61",
        "RR": "1.45",
        "CSR": "1.38",
        "TR": "1.23",
        "Perception": "1.42",
        "Reasoning": "1.35",
        "Overall": "1.41"
    },
    "coarse_valid": {
        "CP": "1.56",
        "FP-S": "1.48",
        "FP-C": "1.23",
        "HL": "0.52",
        "LR": "1.06",
        "AR": "1.61",
        "RR": "1.47",
        "CSR": "1.38",
        "TR": "1.23",
        "Perception": "1.42",
        "Reasoning": "1.35",
        "Overall": "1.41"
    },
    "fine_all": {
        "Video Topic": "1.52",
        "Video Emotion": "1.48",
        "Video Scene": "1.59",
        "Video Style": "1.76",
        "OCR": "1.37",
        "Object Recognition": "1.55",
        "Attribute Recognition": "1.91",
        "Event Recognition": "1.30",
        "Human Motion": "1.15",
        "Counting": "1.46",
        "Spatial Relationship": "1.18",
        "Human-object Interaction": "1.35",
        "Human Interaction": "1.08",
        "Hallucination": "0.52",
        "Structuralized Image-Text Understanding": "1.25",
        "Mathematical Calculation": "0.78",
        "Physical Property": "1.46",
        "Function Reasoning": "1.42",
        "Identity Reasoning": "1.96",
        "Natural Relation": "1.44",
        "Physical Relation": "1.06",
        "Social Relation": "1.83",
        "Common Sense Reasoning": "1.38",
        "Counterfactual Reasoning": "1.25",
        "Causal Reasoning": "1.23",
        "Future Prediction": "1.17"
    },
    "fine_valid": {
        "Video Topic": "1.52",
        "Video Emotion": "1.48",
        "Video Scene": "1.59",
        "Video Style": "1.76",
        "OCR": "1.38",
        "Object Recognition": "1.56",
        "Attribute Recognition": "1.91",
        "Event Recognition": "1.30",
        "Human Motion": "1.15",
        "Counting": "1.46",
        "Spatial Relationship": "1.18",
        "Human-object Interaction": "1.35",
        "Human Interaction": "1.08",
        "Hallucination": "0.52",
        "Structuralized Image-Text Understanding": "1.25",
        "Mathematical Calculation": "0.78",
        "Physical Property": "1.46",
        "Function Reasoning": "1.42",
        "Identity Reasoning": "1.96",
        "Natural Relation": "1.50",
        "Physical Relation": "1.06",
        "Social Relation": "1.83",
        "Common Sense Reasoning": "1.38",
        "Counterfactual Reasoning": "1.25",
        "Causal Reasoning": "1.24",
        "Future Prediction": "1.17"
    }
}
```

````

````{tab} 40B

When testing with 8 frames:

```bash
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-40B --verbose --nframe 8
```

The expected test results are:

```
{
    "coarse_all": {
        "CP": "1.53",
        "FP-S": "1.39",
        "FP-C": "1.12",
        "HL": "0.32",
        "LR": "0.88",
        "AR": "1.45",
        "RR": "1.52",
        "CSR": "1.15",
        "TR": "1.13",
        "Perception": "1.34",
        "Reasoning": "1.25",
        "Overall": "1.32"
    },
    "coarse_valid": {
        "CP": "1.53",
        "FP-S": "1.39",
        "FP-C": "1.12",
        "HL": "0.32",
        "LR": "0.88",
        "AR": "1.45",
        "RR": "1.52",
        "CSR": "1.15",
        "TR": "1.13",
        "Perception": "1.34",
        "Reasoning": "1.25",
        "Overall": "1.32"
    },
    "fine_all": {
        "Video Topic": "1.57",
        "Video Emotion": "1.65",
        "Video Scene": "1.24",
        "Video Style": "1.81",
        "OCR": "1.29",
        "Object Recognition": "1.40",
        "Attribute Recognition": "1.80",
        "Event Recognition": "1.21",
        "Human Motion": "1.36",
        "Counting": "1.45",
        "Spatial Relationship": "1.22",
        "Human-object Interaction": "1.14",
        "Human Interaction": "1.02",
        "Hallucination": "0.32",
        "Structuralized Image-Text Understanding": "1.04",
        "Mathematical Calculation": "0.62",
        "Physical Property": "1.30",
        "Function Reasoning": "1.33",
        "Identity Reasoning": "1.74",
        "Natural Relation": "1.30",
        "Physical Relation": "1.35",
        "Social Relation": "1.78",
        "Common Sense Reasoning": "1.15",
        "Counterfactual Reasoning": "1.18",
        "Causal Reasoning": "1.14",
        "Future Prediction": "1.13"
    },
    "fine_valid": {
        "Video Topic": "1.57",
        "Video Emotion": "1.65",
        "Video Scene": "1.24",
        "Video Style": "1.81",
        "OCR": "1.29",
        "Object Recognition": "1.40",
        "Attribute Recognition": "1.80",
        "Event Recognition": "1.21",
        "Human Motion": "1.36",
        "Counting": "1.45",
        "Spatial Relationship": "1.22",
        "Human-object Interaction": "1.14",
        "Human Interaction": "1.02",
        "Hallucination": "0.32",
        "Structuralized Image-Text Understanding": "1.04",
        "Mathematical Calculation": "0.62",
        "Physical Property": "1.30",
        "Function Reasoning": "1.33",
        "Identity Reasoning": "1.74",
        "Natural Relation": "1.30",
        "Physical Relation": "1.35",
        "Social Relation": "1.78",
        "Common Sense Reasoning": "1.15",
        "Counterfactual Reasoning": "1.18",
        "Causal Reasoning": "1.14",
        "Future Prediction": "1.13"
    }
}
```

When testing with 16 frames:

```bash
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-40B --verbose --nframe 16
```

The expected test results are:

```
{
    "coarse_all": {
        "CP": "1.58",
        "FP-S": "1.56",
        "FP-C": "1.28",
        "HL": "0.39",
        "LR": "1.10",
        "AR": "1.61",
        "RR": "1.53",
        "CSR": "1.25",
        "TR": "1.20",
        "Perception": "1.48",
        "Reasoning": "1.35",
        "Overall": "1.45"
    },
    "coarse_valid": {
        "CP": "1.58",
        "FP-S": "1.56",
        "FP-C": "1.28",
        "HL": "0.39",
        "LR": "1.10",
        "AR": "1.61",
        "RR": "1.53",
        "CSR": "1.25",
        "TR": "1.20",
        "Perception": "1.48",
        "Reasoning": "1.35",
        "Overall": "1.45"
    },
    "fine_all": {
        "Video Topic": "1.57",
        "Video Emotion": "1.67",
        "Video Scene": "1.39",
        "Video Style": "1.83",
        "OCR": "1.47",
        "Object Recognition": "1.64",
        "Attribute Recognition": "2.03",
        "Event Recognition": "1.32",
        "Human Motion": "1.26",
        "Counting": "1.49",
        "Spatial Relationship": "1.31",
        "Human-object Interaction": "1.30",
        "Human Interaction": "1.26",
        "Hallucination": "0.39",
        "Structuralized Image-Text Understanding": "1.26",
        "Mathematical Calculation": "0.84",
        "Physical Property": "1.43",
        "Function Reasoning": "1.49",
        "Identity Reasoning": "1.92",
        "Natural Relation": "1.56",
        "Physical Relation": "1.27",
        "Social Relation": "1.76",
        "Common Sense Reasoning": "1.25",
        "Counterfactual Reasoning": "1.27",
        "Causal Reasoning": "1.19",
        "Future Prediction": "1.15"
    },
    "fine_valid": {
        "Video Topic": "1.57",
        "Video Emotion": "1.67",
        "Video Scene": "1.39",
        "Video Style": "1.83",
        "OCR": "1.47",
        "Object Recognition": "1.64",
        "Attribute Recognition": "2.03",
        "Event Recognition": "1.32",
        "Human Motion": "1.26",
        "Counting": "1.49",
        "Spatial Relationship": "1.31",
        "Human-object Interaction": "1.30",
        "Human Interaction": "1.26",
        "Hallucination": "0.39",
        "Structuralized Image-Text Understanding": "1.26",
        "Mathematical Calculation": "0.84",
        "Physical Property": "1.43",
        "Function Reasoning": "1.49",
        "Identity Reasoning": "1.92",
        "Natural Relation": "1.56",
        "Physical Relation": "1.27",
        "Social Relation": "1.76",
        "Common Sense Reasoning": "1.25",
        "Counterfactual Reasoning": "1.27",
        "Causal Reasoning": "1.19",
        "Future Prediction": "1.15"
    }
}
```

````

````{tab} 76B

When testing with 8 frames:

```bash
torchrun --nproc-per-node=1 run.py --data MMBench-Video --model InternVL2-76B --verbose --nframe 8
```

The expected test results are:

```
{
    "coarse_all": {
        "CP": "1.59",
        "FP-S": "1.41",
        "FP-C": "1.25",
        "HL": "0.42",
        "LR": "0.98",
        "AR": "1.60",
        "RR": "1.41",
        "CSR": "1.44",
        "TR": "1.27",
        "Perception": "1.38",
        "Reasoning": "1.35",
        "Overall": "1.37"
    },
    "coarse_valid": {
        "CP": "1.59",
        "FP-S": "1.41",
        "FP-C": "1.25",
        "HL": "0.42",
        "LR": "0.98",
        "AR": "1.60",
        "RR": "1.41",
        "CSR": "1.44",
        "TR": "1.27",
        "Perception": "1.38",
        "Reasoning": "1.35",
        "Overall": "1.37"
    },
    "fine_all": {
        "Video Topic": "1.51",
        "Video Emotion": "1.66",
        "Video Scene": "1.46",
        "Video Style": "1.90",
        "OCR": "1.32",
        "Object Recognition": "1.45",
        "Attribute Recognition": "1.78",
        "Event Recognition": "1.30",
        "Human Motion": "1.07",
        "Counting": "1.49",
        "Spatial Relationship": "1.36",
        "Human-object Interaction": "1.27",
        "Human Interaction": "1.21",
        "Hallucination": "0.42",
        "Structuralized Image-Text Understanding": "1.21",
        "Mathematical Calculation": "0.64",
        "Physical Property": "1.57",
        "Function Reasoning": "1.51",
        "Identity Reasoning": "1.72",
        "Natural Relation": "1.33",
        "Physical Relation": "1.33",
        "Social Relation": "1.52",
        "Common Sense Reasoning": "1.44",
        "Counterfactual Reasoning": "1.27",
        "Causal Reasoning": "1.33",
        "Future Prediction": "1.17"
    },
    "fine_valid": {
        "Video Topic": "1.51",
        "Video Emotion": "1.66",
        "Video Scene": "1.46",
        "Video Style": "1.90",
        "OCR": "1.32",
        "Object Recognition": "1.45",
        "Attribute Recognition": "1.78",
        "Event Recognition": "1.30",
        "Human Motion": "1.07",
        "Counting": "1.49",
        "Spatial Relationship": "1.36",
        "Human-object Interaction": "1.27",
        "Human Interaction": "1.21",
        "Hallucination": "0.42",
        "Structuralized Image-Text Understanding": "1.21",
        "Mathematical Calculation": "0.64",
        "Physical Property": "1.57",
        "Function Reasoning": "1.51",
        "Identity Reasoning": "1.72",
        "Natural Relation": "1.33",
        "Physical Relation": "1.33",
        "Social Relation": "1.52",
        "Common Sense Reasoning": "1.44",
        "Counterfactual Reasoning": "1.27",
        "Causal Reasoning": "1.33",
        "Future Prediction": "1.17"
    }
}
```

When testing with 16 frames:

```bash
torchrun --nproc-per-node=1 run.py --data MMBench-Video --model InternVL2-76B --verbose --nframe 16
```

The expected test results are:

```
{
    "coarse_all": {
        "CP": "1.69",
        "FP-S": "1.60",
        "FP-C": "1.34",
        "HL": "0.44",
        "LR": "1.19",
        "AR": "1.77",
        "RR": "1.48",
        "CSR": "1.51",
        "TR": "1.36",
        "Perception": "1.54",
        "Reasoning": "1.46",
        "Overall": "1.52"
    },
    "coarse_valid": {
        "CP": "1.69",
        "FP-S": "1.60",
        "FP-C": "1.34",
        "HL": "0.44",
        "LR": "1.19",
        "AR": "1.77",
        "RR": "1.48",
        "CSR": "1.51",
        "TR": "1.36",
        "Perception": "1.54",
        "Reasoning": "1.46",
        "Overall": "1.52"
    },
    "fine_all": {
        "Video Topic": "1.64",
        "Video Emotion": "1.73",
        "Video Scene": "1.60",
        "Video Style": "1.93",
        "OCR": "1.48",
        "Object Recognition": "1.65",
        "Attribute Recognition": "2.06",
        "Event Recognition": "1.42",
        "Human Motion": "1.39",
        "Counting": "1.69",
        "Spatial Relationship": "1.36",
        "Human-object Interaction": "1.44",
        "Human Interaction": "1.20",
        "Hallucination": "0.44",
        "Structuralized Image-Text Understanding": "1.40",
        "Mathematical Calculation": "0.89",
        "Physical Property": "1.65",
        "Function Reasoning": "1.49",
        "Identity Reasoning": "2.17",
        "Natural Relation": "1.30",
        "Physical Relation": "1.47",
        "Social Relation": "1.59",
        "Common Sense Reasoning": "1.51",
        "Counterfactual Reasoning": "1.43",
        "Causal Reasoning": "1.36",
        "Future Prediction": "1.34"
    },
    "fine_valid": {
        "Video Topic": "1.64",
        "Video Emotion": "1.73",
        "Video Scene": "1.60",
        "Video Style": "1.93",
        "OCR": "1.48",
        "Object Recognition": "1.65",
        "Attribute Recognition": "2.06",
        "Event Recognition": "1.42",
        "Human Motion": "1.39",
        "Counting": "1.69",
        "Spatial Relationship": "1.36",
        "Human-object Interaction": "1.44",
        "Human Interaction": "1.20",
        "Hallucination": "0.44",
        "Structuralized Image-Text Understanding": "1.40",
        "Mathematical Calculation": "0.89",
        "Physical Property": "1.65",
        "Function Reasoning": "1.49",
        "Identity Reasoning": "2.17",
        "Natural Relation": "1.30",
        "Physical Relation": "1.47",
        "Social Relation": "1.59",
        "Common Sense Reasoning": "1.51",
        "Counterfactual Reasoning": "1.43",
        "Causal Reasoning": "1.36",
        "Future Prediction": "1.34"
    }
}
```

````

`````
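
To compare the two frame settings at a glance, the short sketch below recomputes the gain in the `Overall` score when moving from 8 to 16 frames, and lists keys whose values differ between an `_all` block and its `_valid` counterpart. The `REPORTED_OVERALL` mapping, the `SAMPLE_26B_16F` excerpt, and both helper names are illustrative and not part of VLMEvalKit; the numbers are copied from the 8B to 76B tabs above.

```python
# Overall scores copied from the "coarse_all" blocks in the 8B-76B tabs.
# The result blocks store scores as strings, so they are converted with float().
REPORTED_OVERALL = {
    "InternVL2-8B": {"8": "1.19", "16": "1.28"},
    "InternVL2-26B": {"8": "1.27", "16": "1.41"},
    "InternVL2-40B": {"8": "1.32", "16": "1.45"},
    "InternVL2-76B": {"8": "1.37", "16": "1.52"},
}

# Excerpt (three keys only) of the 26B, 16-frame coarse blocks shown above.
SAMPLE_26B_16F = {
    "coarse_all": {"CP": "1.56", "RR": "1.45", "CSR": "1.38"},
    "coarse_valid": {"CP": "1.56", "RR": "1.47", "CSR": "1.38"},
}


def frame_gain(scores):
    """Return the 16-frame minus 8-frame Overall score, rounded to 2 decimals."""
    return round(float(scores["16"]) - float(scores["8"]), 2)


def differing_keys(results, left, right):
    """List keys whose values differ between two blocks of one result dict."""
    return [k for k in results[left] if results[left][k] != results[right].get(k)]


if __name__ == "__main__":
    for model, scores in REPORTED_OVERALL.items():
        print(f"{model}: +{frame_gain(scores):.2f} Overall with 16 frames")
    print(differing_keys(SAMPLE_26B_16F, "coarse_all", "coarse_valid"))
```

Running the same `differing_keys` check on your own output is a quick way to see whether the `_all` and `_valid` blocks diverge, as they do slightly in the 26B results.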

### MathVision

The MathVision (MATH-V) dataset is a benchmark for evaluating the mathematical reasoning capabilities of multimodal large models. It includes 3,040 high-quality mathematical problems, each paired with visual context and sourced from real math competitions. The problems span 16 mathematical disciplines, including algebra, geometry, topology, and graph theory, and are graded across five levels of difficulty, so the benchmark tests both visual perception and reasoning. The commands below use the `MathVision_MINI` subset, which, per the result tables, contains 304 problems (19 per discipline).

`````{tabs}

````{tab} 1B

```bash
torchrun --nproc-per-node=8 run.py --model InternVL2-1B --data MathVision_MINI
```

The expected test results are:

```
--  ------------------------  ---  ---  --  --------  --------
 0  Overall                   304  100  37  32.8947   12.1711
 1  algebra                    19    5   1  26.3158    5.26316
 2  analytic geometry          19    5   3  26.3158   15.7895
 3  arithmetic                 19    4   2  21.0526   10.5263
 4  combinatorial geometry     19    7   2  36.8421   10.5263
 5  combinatorics              19    1   3   5.26316  15.7895
 6  counting                   19    1   2   5.26316  10.5263
 7  descriptive geometry       19   10   4  52.6316   21.0526
 8  graph theory               19    7   2  36.8421   10.5263
 9  logic                      19    6   3  31.5789   15.7895
10  metric geometry - angle    19   10   4  52.6316   21.0526
11  metric geometry - area     19    8   1  42.1053    5.26316
12  metric geometry - length   19    8   3  42.1053   15.7895
13  solid geometry             19    6   0  31.5789    0
14  statistics                 19    6   2  31.5789   10.5263
15  topology                   19    8   2  42.1053   10.5263
16  transformation geometry    19    8   3  42.1053   15.7895
--  ------------------------  ---  ---  --  --------  --------
```

````

````{tab} 2B

```bash
torchrun --nproc-per-node=8 run.py --model InternVL2-2B --data MathVision_MINI
```

The expected test results are:

```
--  ------------------------  ---  ---  --  --------  --------
 0  Overall                   304  100  48  32.8947   15.7895
 1  algebra                    19    6   1  31.5789    5.26316
 2  analytic geometry          19    7   2  36.8421   10.5263
 3  arithmetic                 19    4   1  21.0526    5.26316
 4  combinatorial geometry     19    5   5  26.3158   26.3158
 5  combinatorics              19    1   1   5.26316   5.26316
 6  counting                   19    0   2   0        10.5263
 7  descriptive geometry       19    8   4  42.1053   21.0526
 8  graph theory               19    3   4  15.7895   21.0526
 9  logic                      19    9   5  47.3684   26.3158
10  metric geometry - angle    19   11   4  57.8947   21.0526
11  metric geometry - area     19    8   3  42.1053   15.7895
12  metric geometry - length   19   10   4  52.6316   21.0526
13  solid geometry             19    6   1  31.5789    5.26316
14  statistics                 19    7   5  36.8421   26.3158
15  topology                   19    5   1  26.3158    5.26316
16  transformation geometry    19   10   5  52.6316   26.3158
--  ------------------------  ---  ---  --  --------  --------
```

````

````{tab} 4B

```bash
torchrun --nproc-per-node=8 run.py --model InternVL2-4B --data MathVision_MINI
```

The expected test results are:

```
--  ------------------------  ---  --  --  --------  --------
 0  Overall                   304  89  54  29.2763   17.7632
 1  algebra                    19   4   4  21.0526   21.0526
 2  analytic geometry          19   7   4  36.8421   21.0526
 3  arithmetic                 19   1   4   5.26316  21.0526
 4  combinatorial geometry     19   6   2  31.5789   10.5263
 5  combinatorics              19   1   2   5.26316  10.5263
 6  counting                   19   0   5   0        26.3158
 7  descriptive geometry       19   8   5  42.1053   26.3158
 8  graph theory               19   6   2  31.5789   10.5263
 9  logic                      19   8   2  42.1053   10.5263
10  metric geometry - angle    19  10   6  52.6316   31.5789
11  metric geometry - area     19   7   5  36.8421   26.3158
12  metric geometry - length   19  11   2  57.8947   10.5263
13  solid geometry             19   7   2  36.8421   10.5263
14  statistics                 19   4   5  21.0526   26.3158
15  topology                   19   6   1  31.5789    5.26316
16  transformation geometry    19   3   3  15.7895   15.7895
--  ------------------------  ---  --  --  --------  --------
```

````

````{tab} 8B

```bash
torchrun --nproc-per-node=8 run.py --model InternVL2-8B --data MathVision_MINI
```

The expected test results are:

```
--  ------------------------  ---  ---  --  --------  -------
 0  Overall                   304  104  62  34.2105   20.3947
 1  algebra                    19    4   4  21.0526   21.0526
 2  analytic geometry          19    4   3  21.0526   15.7895
 3  arithmetic                 19    2   4  10.5263   21.0526
 4  combinatorial geometry     19    9   6  47.3684   31.5789
 5  combinatorics              19    1   3   5.26316  15.7895
 6  counting                   19    2   4  10.5263   21.0526
 7  descriptive geometry       19   11   4  57.8947   21.0526
 8  graph theory               19    6   2  31.5789   10.5263
 9  logic                      19   10   2  52.6316   10.5263
10  metric geometry - angle    19    7   4  36.8421   21.0526
11  metric geometry - area     19    7   7  36.8421   36.8421
12  metric geometry - length   19    7   2  36.8421   10.5263
13  solid geometry             19    8   4  42.1053   21.0526
14  statistics                 19    6   4  31.5789   21.0526
15  topology                   19   11   5  57.8947   26.3158
16  transformation geometry    19    9   4  47.3684   21.0526
--  ------------------------  ---  ---  --  --------  -------
```

````

````{tab} 26B

```bash
torchrun --nproc-per-node=8 run.py --model InternVL2-26B --data MathVision_MINI
```

The expected test results are:

```
--  ------------------------  ---  ---  --  --------  --------
 0  Overall                   304  105  71  34.5395   23.3553
 1  algebra                    19    6   3  31.5789   15.7895
 2  analytic geometry          19    6   7  31.5789   36.8421
 3  arithmetic                 19    4   4  21.0526   21.0526
 4  combinatorial geometry     19    4   3  21.0526   15.7895
 5  combinatorics              19    4   6  21.0526   31.5789
 6  counting                   19    1   3   5.26316  15.7895
 7  descriptive geometry       19    7   4  36.8421   21.0526
 8  graph theory               19    5   5  26.3158   26.3158
 9  logic                      19   11   7  57.8947   36.8421
10  metric geometry - angle    19    9   3  47.3684   15.7895
11  metric geometry - area     19    9   7  47.3684   36.8421
12  metric geometry - length   19   10   3  52.6316   15.7895
13  solid geometry             19    6   1  31.5789    5.26316
14  statistics                 19    8   7  42.1053   36.8421
15  topology                   19   10   5  52.6316   26.3158
16  transformation geometry    19    5   3  26.3158   15.7895
--  ------------------------  ---  ---  --  --------  --------
```

````

````{tab} 40B

```bash
torchrun --nproc-per-node=8 run.py --model InternVL2-40B --data MathVision_MINI
```

The expected test results are:

```
--  ------------------------  ---  ---  --  --------  -------
 0  Overall                   304  100  65  32.8947   21.3816
 1  algebra                    19    6   4  31.5789   21.0526
 2  analytic geometry          19    7   5  36.8421   26.3158
 3  arithmetic                 19    4   8  21.0526   42.1053
 4  combinatorial geometry     19    3   6  15.7895   31.5789
 5  combinatorics              19    0   4   0        21.0526
 6  counting                   19    1   2   5.26316  10.5263
 7  descriptive geometry       19    8   2  42.1053   10.5263
 8  graph theory               19    6   3  31.5789   15.7895
 9  logic                      19    8   4  42.1053   21.0526
10  metric geometry - angle    19   10   5  52.6316   26.3158
11  metric geometry - area     19    8   2  42.1053   10.5263
12  metric geometry - length   19   10   3  52.6316   15.7895
13  solid geometry             19    6   3  31.5789   15.7895
14  statistics                 19   10   6  52.6316   31.5789
15  topology                   19    7   4  36.8421   21.0526
16  transformation geometry    19    6   4  31.5789   21.0526
--  ------------------------  ---  ---  --  --------  -------
```

````

````{tab} 76B

```bash
torchrun --nproc-per-node=1 run.py --model InternVL2-76B --data MathVision_MINI
```

The expected test results are:

```
--  ------------------------  ---  ---  --  --------  -------
 0  Overall                   304  102  72  33.5526   23.6842
 1  algebra                    19    1   3   5.26316  15.7895
 2  analytic geometry          19    6   8  31.5789   42.1053
 3  arithmetic                 19    5   7  26.3158   36.8421
 4  combinatorial geometry     19    7   2  36.8421   10.5263
 5  combinatorics              19    1   4   5.26316  21.0526
 6  counting                   19    0   3   0        15.7895
 7  descriptive geometry       19    9   2  47.3684   10.5263
 8  graph theory               19    6   3  31.5789   15.7895
 9  logic                      19    8   5  42.1053   26.3158
10  metric geometry - angle    19   11   5  57.8947   26.3158
11  metric geometry - area     19    9   5  47.3684   26.3158
12  metric geometry - length   19   10   5  52.6316   26.3158
13  solid geometry             19    6   5  31.5789   26.3158
14  statistics                 19    6   8  31.5789   42.1053
15  topology                   19    7   4  36.8421   21.0526
16  transformation geometry    19   10   3  52.6316   15.7895
--  ------------------------  ---  ---  --  --------  -------
```

````
`````
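
As a quick consistency check on the tables above, the two percentage columns appear to be the two count columns divided by the row total. The sketch below verifies this for the `Overall` rows shown in the last two tabs. The column interpretation and the helper name `rates_match` are assumptions made for illustration and are not part of VLMEvalKit.

```python
import math


def rates_match(total, count_a, count_b, rate_a, rate_b, tol=1e-3):
    """Return True if both percentage columns equal count / total * 100."""
    return (
        math.isclose(count_a / total * 100, rate_a, abs_tol=tol)
        and math.isclose(count_b / total * 100, rate_b, abs_tol=tol)
    )


# "Overall" rows copied from the last two tabs above:
# (total, first count, second count, first rate, second rate)
overall_rows = {
    "tab before 76B": (304, 100, 65, 32.8947, 21.3816),
    "76B": (304, 102, 72, 33.5526, 23.6842),
}

for name, row in overall_rows.items():
    print(name, rates_match(*row))
```

Both rows print `True`. The 16 subject rows of 19 questions each also add up to the 304 questions in `Overall`.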

### BLINK

The BLINK dataset is a benchmark that targets core visual perception abilities rarely covered by other benchmarks. It recasts 14 classic computer vision tasks as 3,807 multiple-choice questions, each paired with one or more images and visual prompts. The tasks include relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning. Humans can generally solve them quickly, but they remain significantly challenging for current multimodal LLMs.

`````{tabs}

````{tab} 1B

```python
torchrun --nproc-per-node=8 run.py --model InternVL2-1B --data BLINK
```

The expected test results are:

```
2024-08-02 13:47:04,164 - RUN - INFO - The evaluation of model InternVL2-1B x dataset BLINK has finished!
2024-08-02 13:47:04,164 - RUN - INFO - Evaluation Results:
2024-08-02 13:47:04,166 - RUN - INFO -
-------------------------  -------------------
split                      none
Overall                    0.3855865334034719
Art_Style                  0.4700854700854701
Counting                   0.325
Forensic_Detection         0.25
Functional_Correspondence  0.26153846153846155
IQ_Test                    0.2866666666666667
Jigsaw                     0.5266666666666666
Multi-view_Reasoning       0.44360902255639095
Object_Localization        0.4918032786885246
Relative_Depth             0.49193548387096775
Relative_Reflectance       0.3283582089552239
Semantic_Correspondence    0.2446043165467626
Spatial_Relation           0.5664335664335665
Visual_Correspondence      0.27325581395348836
Visual_Similarity          0.4740740740740741
-------------------------  -------------------
```

````

````{tab} 2B

```python
torchrun --nproc-per-node=8 run.py --model InternVL2-2B --data BLINK
```

The expected test results are:

```
2024-08-02 13:46:22,686 - RUN - INFO - The evaluation of model InternVL2-2B x dataset BLINK has finished!
2024-08-02 13:46:22,686 - RUN - INFO - Evaluation Results:
2024-08-02 13:46:22,689 - RUN - INFO -
-------------------------  -------------------
split                      none
Overall                    0.43766438716465017
Art_Style                  0.5299145299145299
Counting                   0.4666666666666667
Forensic_Detection         0.2803030303030303
Functional_Correspondence  0.23076923076923078
IQ_Test                    0.2866666666666667
Jigsaw                     0.47333333333333333
Multi-view_Reasoning       0.556390977443609
Object_Localization        0.36885245901639346
Relative_Depth             0.6048387096774194
Relative_Reflectance       0.39552238805970147
Semantic_Correspondence    0.3669064748201439
Spatial_Relation           0.7622377622377622
Visual_Correspondence      0.3313953488372093
Visual_Similarity          0.5111111111111111
-------------------------  -------------------
```

````

````{tab} 4B

```python
torchrun --nproc-per-node=8 run.py --model InternVL2-4B --data BLINK
```

The expected test results are:

```
2024-08-02 13:34:06,982 - RUN - INFO - The evaluation of model InternVL2-4B x dataset BLINK has finished!
2024-08-02 13:34:06,982 - RUN - INFO - Evaluation Results:
2024-08-02 13:34:06,984 - RUN - INFO -
-------------------------  -------------------
split                      none
Overall                    0.46081009994739613
Art_Style                  0.5897435897435898
Counting                   0.55
Forensic_Detection         0.32575757575757575
Functional_Correspondence  0.25384615384615383
IQ_Test                    0.23333333333333334
Jigsaw                     0.48
Multi-view_Reasoning       0.556390977443609
Object_Localization        0.5245901639344263
Relative_Depth             0.6370967741935484
Relative_Reflectance       0.3283582089552239
Semantic_Correspondence    0.2805755395683453
Spatial_Relation           0.8111888111888111
Visual_Correspondence      0.36046511627906974
Visual_Similarity          0.5925925925925926
-------------------------  -------------------
```

````

````{tab} 8B

```python
torchrun --nproc-per-node=8 run.py --model InternVL2-8B --data BLINK
```

The expected test results are:

```
2024-08-02 13:28:10,915 - RUN - INFO - The evaluation of model InternVL2-8B x dataset BLINK has finished!
2024-08-02 13:28:10,915 - RUN - INFO - Evaluation Results:
2024-08-02 13:28:10,917 - RUN - INFO -
-------------------------  -------------------
split                      none
Overall                    0.5086796422935297
Art_Style                  0.7094017094017094
Counting                   0.75
Forensic_Detection         0.3484848484848485
Functional_Correspondence  0.17692307692307693
IQ_Test                    0.30666666666666664
Jigsaw                     0.5466666666666666
Multi-view_Reasoning       0.48872180451127817
Object_Localization        0.5573770491803278
Relative_Depth             0.7419354838709677
Relative_Reflectance       0.39552238805970147
Semantic_Correspondence    0.26618705035971224
Spatial_Relation           0.7972027972027972
Visual_Correspondence      0.36046511627906974
Visual_Similarity          0.7851851851851852
-------------------------  -------------------
```

````

````{tab} 26B

```python
torchrun --nproc-per-node=8 run.py --model InternVL2-26B --data BLINK
```

The expected test results are:

```
2024-08-02 13:00:51,453 - RUN - INFO - The evaluation of model InternVL2-26B x dataset BLINK has finished!
2024-08-02 13:00:51,453 - RUN - INFO - Evaluation Results:
2024-08-02 13:00:51,455 - RUN - INFO -
-------------------------  -------------------
split                      none
Overall                    0.5623356128353498
Art_Style                  0.7606837606837606
Counting                   0.675
Forensic_Detection         0.45454545454545453
Functional_Correspondence  0.3
IQ_Test                    0.30666666666666664
Jigsaw                     0.7466666666666667
Multi-view_Reasoning       0.41353383458646614
Object_Localization        0.5737704918032787
Relative_Depth             0.782258064516129
Relative_Reflectance       0.3582089552238806
Semantic_Correspondence    0.4172661870503597
Spatial_Relation           0.8461538461538461
Visual_Correspondence      0.47674418604651164
Visual_Similarity          0.8222222222222222
-------------------------  -------------------
```

````

````{tab} 40B

```python
torchrun --nproc-per-node=8 run.py --model InternVL2-40B --data BLINK
```

The expected test results are:

```
2024-08-02 14:03:54,291 - RUN - INFO - The evaluation of model InternVL2-40B x dataset BLINK has finished!
2024-08-02 14:03:54,291 - RUN - INFO - Evaluation Results:
2024-08-02 14:03:54,292 - RUN - INFO -
-------------------------  -------------------
split                      none
Overall                    0.5718043135192005
Art_Style                  0.6923076923076923
Counting                   0.7166666666666667
Forensic_Detection         0.44696969696969696
Functional_Correspondence  0.25384615384615383
IQ_Test                    0.22666666666666666
Jigsaw                     0.8
Multi-view_Reasoning       0.5639097744360902
Object_Localization        0.5819672131147541
Relative_Depth             0.7903225806451613
Relative_Reflectance       0.3880597014925373
Semantic_Correspondence    0.41007194244604317
Spatial_Relation           0.8461538461538461
Visual_Correspondence      0.4941860465116279
Visual_Similarity          0.8518518518518519
-------------------------  -------------------
```

````

````{tab} 76B

```python
torchrun --nproc-per-node=1 run.py --model InternVL2-76B --data BLINK
```

The expected test results are:

```
2024-08-02 16:08:58,199 - RUN - INFO - The evaluation of model InternVL2-76B x dataset BLINK has finished!
2024-08-02 16:08:58,199 - RUN - INFO - Evaluation Results:
2024-08-02 16:08:58,200 - RUN - INFO -
-------------------------  -------------------
split                      none
Overall                    0.5681220410310363
Art_Style                  0.6581196581196581
Counting                   0.7
Forensic_Detection         0.42424242424242425
Functional_Correspondence  0.3
IQ_Test                    0.2733333333333333
Jigsaw                     0.74
Multi-view_Reasoning       0.5639097744360902
Object_Localization        0.5245901639344263
Relative_Depth             0.782258064516129
Relative_Reflectance       0.30597014925373134
Semantic_Correspondence    0.4028776978417266
Spatial_Relation           0.8391608391608392
Visual_Correspondence      0.6802325581395349
Visual_Similarity          0.7555555555555555
-------------------------  -------------------
```

````


`````
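
Note that `Overall` in these logs is not the plain average of the 14 category rows. The sketch below computes the unweighted mean for the InternVL2-1B run and compares it with the reported value. The gap would be consistent with `Overall` being an accuracy over all questions, so that larger categories weigh more, but this is an assumption because the logs do not show per-category question counts. The helper name `macro_mean` is hypothetical.

```python
# Category accuracies copied from the InternVL2-1B BLINK log above.
blink_1b = {
    "Art_Style": 0.4700854700854701,
    "Counting": 0.325,
    "Forensic_Detection": 0.25,
    "Functional_Correspondence": 0.26153846153846155,
    "IQ_Test": 0.2866666666666667,
    "Jigsaw": 0.5266666666666666,
    "Multi-view_Reasoning": 0.44360902255639095,
    "Object_Localization": 0.4918032786885246,
    "Relative_Depth": 0.49193548387096775,
    "Relative_Reflectance": 0.3283582089552239,
    "Semantic_Correspondence": 0.2446043165467626,
    "Spatial_Relation": 0.5664335664335665,
    "Visual_Correspondence": 0.27325581395348836,
    "Visual_Similarity": 0.4740740740740741,
}
reported_overall = 0.3855865334034719  # "Overall" row of the same log


def macro_mean(scores):
    """Unweighted mean of the per-category accuracies."""
    return sum(scores.values()) / len(scores)


print(f"unweighted mean of categories: {macro_mean(blink_1b):.4f}")
print(f"reported Overall:              {reported_overall:.4f}")
```

For this run the unweighted mean is roughly 0.388 against a reported 0.386, so `Overall` should be read from the log rather than recomputed by averaging the category rows.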

### MTVQA

MTVQA (Multilingual Text-Centric Visual Question Answering) is a benchmark for multilingual text-centric VQA (TEC-VQA) with high-quality human expert annotations in nine languages. The results below report one score per language code (AR, DE, FR, IT, JA, KR, RU, TH, VI) together with an overall `Average`.

`````{tabs}

````{tab} 1B

```python
torchrun --nproc-per-node=8 run.py --model InternVL2-1B --data MTVQA_TEST
```

The expected test results are:

```
{
    "AR": 1.991465149359886,
    "Average": 12.570079669519032,
    "DE": 21.85114503816794,
    "FR": 20.54176072234763,
    "IT": 22.39819004524887,
    "JA": 6.159420289855073,
    "KR": 8.422939068100359,
    "RU": 3.571428571428571,
    "TH": 2.1645021645021645,
    "VI": 11.199095022624435
}
```

````

````{tab} 2B

```python
torchrun --nproc-per-node=8 run.py --model InternVL2-2B --data MTVQA_TEST
```

The expected test results are:

```
{
    "AR": 1.422475106685633,
    "Average": 10.88816760106226,
    "DE": 15.744274809160306,
    "FR": 19.751693002257337,
    "IT": 21.380090497737555,
    "JA": 7.367149758454106,
    "KR": 5.913978494623656,
    "RU": 3.0423280423280423,
    "TH": 0.8658008658008658,
    "VI": 9.049773755656108
}
```

````

````{tab} 4B

```python
torchrun --nproc-per-node=8 run.py --model InternVL2-4B --data MTVQA_TEST
```

The expected test results are:

```
{
    "AR": 1.849217638691323,
    "Average": 15.34375922100915,
    "DE": 24.904580152671755,
    "FR": 30.81264108352145,
    "IT": 26.923076923076923,
    "JA": 8.091787439613526,
    "KR": 8.064516129032258,
    "RU": 3.7037037037037033,
    "TH": 3.463203463203463,
    "VI": 12.104072398190045
}
```

````

````{tab} 8B

```python
torchrun --nproc-per-node=8 run.py --model InternVL2-8B --data MTVQA_TEST
```

The expected test results are:

```
{
    "AR": 2.418207681365576,
    "Average": 18.102685157863675,
    "DE": 28.435114503816795,
    "FR": 33.972911963882616,
    "IT": 30.20361990950226,
    "JA": 8.57487922705314,
    "KR": 10.931899641577061,
    "RU": 5.158730158730158,
    "TH": 6.926406926406926,
    "VI": 17.760180995475114
}
```

````

````{tab} 26B

```python
torchrun --nproc-per-node=8 run.py --model InternVL2-26B --data MTVQA_TEST
```

The expected test results are:

```
{
    "AR": 3.982930298719772,
    "Average": 17.71909117733845,
    "DE": 28.053435114503817,
    "FR": 26.52370203160271,
    "IT": 30.316742081447963,
    "JA": 9.903381642512077,
    "KR": 11.29032258064516,
    "RU": 6.613756613756613,
    "TH": 8.225108225108226,
    "VI": 18.32579185520362
}
```

````

````{tab} 40B

```python
torchrun --nproc-per-node=8 run.py --model InternVL2-40B --data MTVQA_TEST
```

The expected test results are:

```
{
    "AR": 4.551920341394026,
    "Average": 20.61079964591325,
    "DE": 30.62977099236641,
    "FR": 36.455981941309254,
    "IT": 34.61538461538461,
    "JA": 10.748792270531402,
    "KR": 13.261648745519713,
    "RU": 6.481481481481481,
    "TH": 5.627705627705628,
    "VI": 21.49321266968326
}
```

````

````{tab} 76B

```python
torchrun --nproc-per-node=1 run.py --model InternVL2-76B --data MTVQA_TEST
```

The expected test results are:

```
{
    "AR": 9.53058321479374,
    "Average": 22.794334611979934,
    "DE": 31.297709923664126,
    "FR": 35.66591422121896,
    "IT": 35.18099547511312,
    "JA": 11.11111111111111,
    "KR": 14.336917562724013,
    "RU": 11.904761904761903,
    "TH": 9.956709956709958,
    "VI": 26.923076923076923
}
```

````


`````
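
Because the MTVQA results are plain JSON, two runs can be compared key by key. The following is a minimal sketch that parses the 1B and 2B result blocks above and prints the per-key difference. The helper name `score_deltas` is hypothetical and not part of VLMEvalKit.

```python
import json

# Expected results copied verbatim from the 1B and 2B tabs above.
RESULT_1B = """{
    "AR": 1.991465149359886,
    "Average": 12.570079669519032,
    "DE": 21.85114503816794,
    "FR": 20.54176072234763,
    "IT": 22.39819004524887,
    "JA": 6.159420289855073,
    "KR": 8.422939068100359,
    "RU": 3.571428571428571,
    "TH": 2.1645021645021645,
    "VI": 11.199095022624435
}"""
RESULT_2B = """{
    "AR": 1.422475106685633,
    "Average": 10.88816760106226,
    "DE": 15.744274809160306,
    "FR": 19.751693002257337,
    "IT": 21.380090497737555,
    "JA": 7.367149758454106,
    "KR": 5.913978494623656,
    "RU": 3.0423280423280423,
    "TH": 0.8658008658008658,
    "VI": 9.049773755656108
}"""


def score_deltas(baseline, candidate):
    """Return candidate - baseline for every key present in both results."""
    return {k: candidate[k] - baseline[k] for k in baseline if k in candidate}


deltas = score_deltas(json.loads(RESULT_1B), json.loads(RESULT_2B))

# Print keys from the largest drop to the largest gain.
for key, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{key:8s} {delta:+.2f}")
```

For this pair, JA is the only key with a positive difference: the 2B results are lower than the 1B results on `Average` and on every other language.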

## Citation

If you find this project useful in your research, please consider citing:

```BibTeX
@article{chen2024far,
  title={How far are we to {GPT-4V}? Closing the gap to commercial multimodal models with open-source suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={Science China Information Sciences},
  volume={67},
  number={12},
  pages={220101},
  year={2024},
  publisher={Springer}
}
@inproceedings{chen2024internvl,
  title={{InternVL}: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}
```

<br>
<br>
