Evaluation of InternVL2 Series#
To evaluate the performance of the InternVL2 series across various tasks, follow the instructions for each specific dataset. Ensure that the appropriate number of GPUs is allocated as specified.
1️⃣ We use both the InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results reported for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were obtained with the InternVL repository, while OCRBench, RealWorldQA, HallBench, and MathVista were evaluated with VLMEvalKit.
2️⃣ Please note that evaluating the same model with different testing toolkits, such as InternVL and VLMEvalKit, can produce slightly different results, which is normal. Updates to code versions and variations in environment and hardware can also cause minor discrepancies.
3️⃣ Note that the dataset descriptions are generated by GPT-4 and may contain errors.
Model Preparation#
| model name | type | param | download | size |
|---|---|---|---|---|
| InternVL2-1B | MLLM | 0.9B | 🤗 HF link | 1.8 GB |
| InternVL2-2B | MLLM | 2.2B | 🤗 HF link | 4.2 GB |
| InternVL2-4B | MLLM | 4.2B | 🤗 HF link | 7.8 GB |
| InternVL2-8B | MLLM | 8.1B | 🤗 HF link | 16 GB |
| InternVL2-26B | MLLM | 25.5B | 🤗 HF link | 48 GB |
| InternVL2-40B | MLLM | 40.1B | 🤗 HF link | 75 GB |
| InternVL2-Llama3-76B | MLLM | 76.3B | 🤗 HF link | 143 GB |
Before evaluation, download the trained models we provide.
cd pretrained/
# pip install -U huggingface_hub
# Download OpenGVLab/InternVL2-1B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-1B --local-dir InternVL2-1B
# Download OpenGVLab/InternVL2-2B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-2B --local-dir InternVL2-2B
# Download OpenGVLab/InternVL2-4B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-4B --local-dir InternVL2-4B
# Download OpenGVLab/InternVL2-8B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-8B --local-dir InternVL2-8B
# Download OpenGVLab/InternVL2-26B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-26B --local-dir InternVL2-26B
# Download OpenGVLab/InternVL2-40B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-40B --local-dir InternVL2-40B
# Download OpenGVLab/InternVL2-Llama3-76B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-Llama3-76B --local-dir InternVL2-Llama3-76B
The directory structure is:
pretrained
├── InternVL2-1B
├── InternVL2-2B
├── InternVL2-4B
├── InternVL2-8B
├── InternVL2-26B
├── InternVL2-40B
└── InternVL2-Llama3-76B
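The seven download commands above differ only in the model name, so they can be generated from a list. The following is a minimal sketch under that assumption: the helper name `download_command` is hypothetical, and it only reproduces the `huggingface-cli` invocation shown above as a string rather than calling any Hugging Face API.

```python
# Model names as listed in the table above.
MODELS = [
    "InternVL2-1B",
    "InternVL2-2B",
    "InternVL2-4B",
    "InternVL2-8B",
    "InternVL2-26B",
    "InternVL2-40B",
    "InternVL2-Llama3-76B",
]


def download_command(model: str, org: str = "OpenGVLab") -> str:
    """Return the download command shown above for one model (hypothetical helper)."""
    return (
        "huggingface-cli download --resume-download "
        f"--local-dir-use-symlinks False {org}/{model} --local-dir {model}"
    )


if __name__ == "__main__":
    # Print one command per model; run them from inside pretrained/.
    for name in MODELS:
        print(download_command(name))
```

Printing the commands instead of executing them keeps the sketch free of side effects; the output can be pasted into a shell or saved as a script.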
Evaluation using InternVL Codebase#
Data Preparation#
Please prepare the evaluation data according to the guidance provided here.
MME#
MME is a comprehensive benchmark designed to evaluate Multimodal Large Language Models (MLLMs) on both perception and cognition abilities across 14 different subtasks, ensuring robust and diverse testing of these models.
Please use the following command to perform the test with 1 GPU:
GPUS=1 sh evaluate.sh pretrained/InternVL2-1B mme --dynamic
The expected test results are:
=========== Perception ===========
total score: 1346.1990796318528
existence score: 175.0
count score: 113.33333333333334
position score: 135.0
color score: 138.33333333333331
posters score: 116.32653061224491
celebrity score: 144.70588235294116
scene score: 143.25
landmark score: 128.5
artwork score: 141.75
OCR score: 110.0
=========== Cognition ===========
total score: 448.2142857142857
commonsense_reasoning score: 95.71428571428571
numerical_calculation score: 57.5
text_translation score: 177.5
code_reasoning score: 117.5
Please use the following command to perform the test with 1 GPU:
GPUS=1 sh evaluate.sh pretrained/InternVL2-2B mme --dynamic
The expected test results are:
=========== Perception ===========
total score: 1439.6688675470189
existence score: 200.0
count score: 128.33333333333334
position score: 145.0
color score: 163.33333333333334
posters score: 131.97278911564626
celebrity score: 118.52941176470588
scene score: 157.0
landmark score: 154.0
artwork score: 146.5
OCR score: 95.0
=========== Cognition ===========
total score: 437.1428571428571
commonsense_reasoning score: 112.14285714285714
numerical_calculation score: 45.0
text_translation score: 177.5
code_reasoning score: 102.5
Please use the following command to perform the test with 1 GPU:
GPUS=1 sh evaluate.sh pretrained/InternVL2-4B mme --dynamic
The expected test results are:
=========== Perception ===========
total score: 1532.31662665066
existence score: 200.0
count score: 123.33333333333333
position score: 148.33333333333331
color score: 165.0
posters score: 155.78231292517006
celebrity score: 124.11764705882354
scene score: 158.75
landmark score: 165.0
artwork score: 144.5
OCR score: 147.5
=========== Cognition ===========
total score: 531.7857142857142
commonsense_reasoning score: 129.28571428571428
numerical_calculation score: 115.0
text_translation score: 170.0
code_reasoning score: 117.5
Please use the following command to perform the test with 1 GPU:
GPUS=1 sh evaluate.sh pretrained/InternVL2-8B mme --dynamic
The expected test results are:
=========== Perception ===========
total score: 1648.1331532613044
existence score: 190.0
count score: 158.33333333333331
position score: 163.33333333333334
color score: 175.0
posters score: 167.68707482993196
celebrity score: 148.52941176470586
scene score: 152.5
landmark score: 176.5
artwork score: 153.75
OCR score: 162.5
=========== Cognition ===========
total score: 562.1428571428571
commonsense_reasoning score: 147.14285714285714
numerical_calculation score: 87.5
text_translation score: 192.5
code_reasoning score: 135.0
Please use the following command to perform the test with 1 GPU:
GPUS=1 sh evaluate.sh pretrained/InternVL2-26B mme --dynamic
The expected test results are:
=========== Perception ===========
total score: 1720.0325130052022
existence score: 195.0
count score: 170.0
position score: 176.66666666666669
color score: 168.33333333333331
posters score: 176.87074829931973
celebrity score: 159.41176470588235
scene score: 154.0
landmark score: 179.5
artwork score: 162.75
OCR score: 177.5
=========== Cognition ===========
total score: 540.7142857142858
commonsense_reasoning score: 145.71428571428572
numerical_calculation score: 95.0
text_translation score: 185.0
code_reasoning score: 115.0
Please use the following command to perform the test with 8 GPUs:
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mme --dynamic --auto
The expected test results are:
=========== Perception ===========
total score: 1715.390456182473
existence score: 185.0
count score: 175.0
position score: 158.33333333333331
color score: 188.33333333333331
posters score: 187.41496598639458
celebrity score: 162.05882352941177
scene score: 152.5
landmark score: 180.25
artwork score: 171.5
OCR score: 155.0
=========== Cognition ===========
total score: 599.6428571428571
commonsense_reasoning score: 152.14285714285714
numerical_calculation score: 125.0
text_translation score: 177.5
code_reasoning score: 145.0
Please use the following command to perform the test with 8 GPUs:
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mme --dynamic --auto
The expected test results are:
=========== Perception ===========
total score: 1731.095538215286
existence score: 200.0
count score: 175.0
position score: 168.33333333333331
color score: 185.0
posters score: 186.39455782312925
celebrity score: 169.11764705882354
scene score: 152.0
landmark score: 182.0
artwork score: 173.25
OCR score: 140.0
=========== Cognition ===========
total score: 683.5714285714286
commonsense_reasoning score: 158.57142857142856
numerical_calculation score: 185.0
text_translation score: 177.5
code_reasoning score: 162.5
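In the outputs above, each `total score` line equals the sum of the subtask scores listed beneath it. The sketch below checks this for the InternVL2-Llama3-76B numbers; the variable names are illustrative and are not part of the evaluation script.

```python
# Subtask scores copied from the InternVL2-Llama3-76B output above.
perception = {
    "existence": 200.0,
    "count": 175.0,
    "position": 168.33333333333331,
    "color": 185.0,
    "posters": 186.39455782312925,
    "celebrity": 169.11764705882354,
    "scene": 152.0,
    "landmark": 182.0,
    "artwork": 173.25,
    "OCR": 140.0,
}
cognition = {
    "commonsense_reasoning": 158.57142857142856,
    "numerical_calculation": 185.0,
    "text_translation": 177.5,
    "code_reasoning": 162.5,
}

# Each reported total is the plain sum of its subtask scores.
perception_total = sum(perception.values())
cognition_total = sum(cognition.values())

print(f"perception total: {perception_total:.4f}")
print(f"cognition total:  {cognition_total:.4f}")
```

The same check applies to the other models by swapping in their subtask scores.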
OKVQA#
OKVQA (Outside Knowledge Visual Question Answering) is a dataset designed for visual question answering tasks that require external knowledge beyond what is visible in the image, featuring over 14,000 questions to evaluate the reasoning abilities of AI models.
Please use the following command to perform the test with 8 GPUs:
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-okvqa-val --dynamic
The expected test results are:
okvqa_val 0.48513674197383483
Please use the following command to perform the test with 8 GPUs:
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-okvqa-val --dynamic
The expected test results are:
okvqa_val 0.5316290130796605
Please use the following command to perform the test with 8 GPUs:
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-okvqa-val --dynamic
The expected test results are:
okvqa_val 0.6007530717399846
Please use the following command to perform the test with 8 GPUs:
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-okvqa-val --dynamic
The expected test results are:
okvqa_val 0.6289734443123187
Please use the following command to perform the test with 8 GPUs:
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-okvqa-val --dynamic
The expected test results are:
okvqa_val 0.6594530321046287
Please use the following command to perform the test with 8 GPUs:
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-okvqa-val --dynamic --auto
The expected test results are:
okvqa_val 0.664288545382473
Please use the following command to perform the test with 8 GPUs:
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-okvqa-val --dynamic --auto
The expected test results are:
okvqa_val 0.683432421720166
TextVQA#
TextVQA is a dataset designed to evaluate visual question answering models by requiring them to read and reason about text present within images, containing 45,336 questions over 28,408 images from the OpenImages dataset.
The TextVQA dataset provides official OCR results, specifically Rosetta OCR tokens. During testing with InstructBLIP and LLaVA 1.5, the OCR results are input to the LLM as a prompt. If you want to input Rosetta OCR tokens, use the following command:
We do not use Rosetta OCR tokens. To evaluate without them, run this command:
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-textvqa-val --dynamic
The expected test results are:
textvqa_val 0.7052400000000033
We do not use Rosetta OCR tokens. To evaluate without them, run this command:
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-textvqa-val --dynamic
The expected test results are:
textvqa_val 0.7335600000000035
We do not use Rosetta OCR tokens. To evaluate without them, run this command:
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-textvqa-val --dynamic
The expected test results are:
textvqa_val 0.7437000000000039
We do not use Rosetta OCR tokens. To evaluate without them, run this command:
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-textvqa-val --dynamic
The expected test results are:
textvqa_val 0.773740000000004
We do not use Rosetta OCR tokens. To evaluate without them, run this command:
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-textvqa-val --dynamic
The expected test results are:
textvqa_val 0.8228200000000048
We do not use Rosetta OCR tokens. To evaluate without them, run this command:
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-textvqa-val --dynamic --auto
The expected test results are:
textvqa_val 0.8301600000000046
We do not use Rosetta OCR tokens. To evaluate without them, run this command:
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-textvqa-val --dynamic --auto
The expected test results are:
textvqa_val 0.844100000000004
VizWiz#
The VizWiz VQA dataset is a visual question answering dataset created to help answer visual questions posed by blind individuals. It contains over 31,000 visual questions, where users took a picture using a mobile phone and recorded a spoken question about it. Each question comes with 10 crowdsourced answers. This dataset addresses tasks such as predicting the answer to a visual question and determining whether a visual question can be answered.
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-vizwiz-val --dynamic
The expected test results are:
vizwiz_val 0.5306783977772626
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-vizwiz-test --dynamic
For the test set, submit the results to the evaluation server.
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-vizwiz-val --dynamic
The expected test results are:
vizwiz_val 0.47376707571196724
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-vizwiz-test --dynamic
For the test set, submit the results to the evaluation server.
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-vizwiz-val --dynamic
The expected test results are:
vizwiz_val 0.622088446399631
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-vizwiz-test --dynamic
For the test set, submit the results to the evaluation server.
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-vizwiz-val --dynamic
The expected test results are:
vizwiz_val 0.6290808057420708
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-vizwiz-test --dynamic
For the test set, submit the results to the evaluation server.
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-vizwiz-val --dynamic
The expected test results are:
vizwiz_val 0.6839083121092873
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-vizwiz-test --dynamic
For the test set, submit the results to the evaluation server.
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-vizwiz-val --dynamic --auto
The expected test results are:
vizwiz_val 0.6521880064829846
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-vizwiz-test --dynamic --auto
For the test set, submit the results to the evaluation server.
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-vizwiz-val --dynamic --auto
The expected test results are:
vizwiz_val 0.6767075711970381
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-vizwiz-test --dynamic --auto
For the test set, submit the results to the evaluation server.
ChartQA#
The ChartQA dataset is a comprehensive benchmark for question answering about charts that involves both visual and logical reasoning. It includes a mix of 9.6K human-written questions and 23.1K machine-generated questions derived from chart summaries. This dataset is designed to evaluate models that can understand and analyze charts by answering complex questions that often require multiple logical and arithmetic operations, as well as referencing visual features of the charts.
The ChartQA dataset includes two test sets: chartqa_test_human and chartqa_test_augmented. The final score for model evaluation is calculated as the average of the scores on these two test sets:
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-chartqa-test --dynamic --max-num 12
The expected test results are:
['chartqa_test_human', {'relaxed_accuracy': 0.5392}]
['chartqa_test_augmented', {'relaxed_accuracy': 0.9184}]
result = (53.92 + 91.84) / 2 = 72.88
The ChartQA dataset includes two test sets: chartqa_test_human and chartqa_test_augmented. The final score for model evaluation is calculated as the average of the scores on these two test sets:
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-chartqa-test --dynamic --max-num 12
The expected test results are:
['chartqa_test_human', {'relaxed_accuracy': 0.5952}]
['chartqa_test_augmented', {'relaxed_accuracy': 0.9296}]
result = (59.52 + 92.96) / 2 = 76.24
The ChartQA dataset includes two test sets: chartqa_test_human and chartqa_test_augmented. The final score for model evaluation is calculated as the average of the scores on these two test sets:
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-chartqa-test --dynamic --max-num 12
The expected test results are:
['chartqa_test_human', {'relaxed_accuracy': 0.6992}]
['chartqa_test_augmented', {'relaxed_accuracy': 0.9304}]
result = (69.92 + 93.04) / 2 = 81.48
The ChartQA dataset includes two test sets: chartqa_test_human and chartqa_test_augmented. The final score for model evaluation is calculated as the average of the scores on these two test sets:
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-chartqa-test --dynamic --max-num 12
The expected test results are:
['chartqa_test_human', {'relaxed_accuracy': 0.7288}]
['chartqa_test_augmented', {'relaxed_accuracy': 0.9368}]
result = (72.88 + 93.68) / 2 = 83.28
The ChartQA dataset includes two test sets: chartqa_test_human and chartqa_test_augmented. The final score for model evaluation is calculated as the average of the scores on these two test sets:
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-chartqa-test --dynamic --max-num 12
The expected test results are:
['chartqa_test_human', {'relaxed_accuracy': 0.7528}]
['chartqa_test_augmented', {'relaxed_accuracy': 0.9448}]
result = (75.28 + 94.48) / 2 = 84.88
The ChartQA dataset includes two test sets: chartqa_test_human and chartqa_test_augmented. The final score for model evaluation is calculated as the average of the scores on these two test sets:
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-chartqa-test --dynamic --max-num 12 --auto
The expected test results are:
['chartqa_test_human', {'relaxed_accuracy': 0.772}]
['chartqa_test_augmented', {'relaxed_accuracy': 0.952}]
result = (77.2 + 95.2) / 2 = 86.2
The ChartQA dataset includes two test sets: chartqa_test_human and chartqa_test_augmented. The final score for model evaluation is calculated as the average of the scores on these two test sets:
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-chartqa-test --dynamic --max-num 12 --auto
The expected test results are:
['chartqa_test_human', {'relaxed_accuracy': 0.816}]
['chartqa_test_augmented', {'relaxed_accuracy': 0.952}]
result = (81.6 + 95.2) / 2 = 88.4
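The averaging step above can be reproduced directly from the two printed result lines. The sketch below assumes the lines have exactly the format shown above; the function name `chartqa_score` is hypothetical and not part of the InternVL codebase.

```python
import ast

# The two result lines printed for InternVL2-Llama3-76B above.
result_lines = [
    "['chartqa_test_human', {'relaxed_accuracy': 0.816}]",
    "['chartqa_test_augmented', {'relaxed_accuracy': 0.952}]",
]


def chartqa_score(lines):
    """Average the relaxed accuracy over the given splits, as a percentage."""
    # Each line is a Python literal: [split_name, {"relaxed_accuracy": value}].
    accuracies = [ast.literal_eval(line)[1]["relaxed_accuracy"] for line in lines]
    return 100.0 * sum(accuracies) / len(accuracies)


final_score = chartqa_score(result_lines)
print(f"ChartQA average: {final_score:.2f}")
```

Using `ast.literal_eval` rather than `eval` keeps the parsing restricted to plain Python literals.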
DocVQA#
The DocVQA dataset consists of 50,000 questions on 12,000+ document images. It is designed for visual question answering tasks where questions are answered using text within the document images. The dataset includes OCR transcriptions and ground truth answers, supporting evaluation of models that interpret and extract information from documents.
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-docvqa-val --dynamic --max-num 18
The expected test results are:
Overall ANLS: 0.7999
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-docvqa-test --dynamic --max-num 18
For the test set, submit the results to the evaluation server.
The expected test results are:
Overall ANLS: 0.8170
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-docvqa-val --dynamic --max-num 18
The expected test results are:
Overall ANLS: 0.8590
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-docvqa-test --dynamic --max-num 18
For the test set, submit the results to the evaluation server.
The expected test results are:
Overall ANLS: 0.8690
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-docvqa-val --dynamic --max-num 18
The expected test results are:
Overall ANLS: 0.8809
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-docvqa-test --dynamic --max-num 18
For the test set, submit the results to the evaluation server.
The expected test results are:
Overall ANLS: 0.8920
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-docvqa-val --dynamic --max-num 18
The expected test results are:
Overall ANLS: 0.9081
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-docvqa-test --dynamic --max-num 18
For the test set, submit the results to the evaluation server.
The expected test results are:
Overall ANLS: 0.9160
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-docvqa-val --dynamic --max-num 18
The expected test results are:
Overall ANLS: 0.9212
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-docvqa-test --dynamic --max-num 18
For the test set, submit the results to the evaluation server.
The expected test results are:
Overall ANLS: 0.9290
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-docvqa-val --dynamic --max-num 18 --auto
The expected test results are:
Overall ANLS: 0.9373
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-docvqa-test --dynamic --max-num 18 --auto
For the test set, submit the results to the evaluation server.
The expected test results are:
Overall ANLS: 0.9390
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-docvqa-val --dynamic --max-num 18 --auto
The expected test results are:
Overall ANLS: 0.9417
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-docvqa-test --dynamic --max-num 18 --auto
For the test set, submit the results to the evaluation server.
The expected test results are:
Overall ANLS: 0.9410
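For readers unfamiliar with the metric, ANLS (Average Normalized Levenshtein Similarity) is commonly defined as follows: for each question, take the best similarity `1 - NL` over all ground-truth answers, where `NL` is the edit distance divided by the longer string length, and count the similarity as 0 when `NL` reaches a threshold of 0.5. The sketch below implements that common definition on toy strings. It is an assumption-laden illustration, not the code used by the evaluation server or by the InternVL repository, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Mean over questions of the best thresholded similarity to any reference."""
    scores = []
    for pred, answers in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for answer in answers:
            g = answer.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            # Similarities at or beyond the threshold distance count as 0.
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy example: one exact match (ignoring case) and one complete miss.
    print(anls(["Paris", "abc"], [["paris"], ["xyz"]]))
```

Lowercasing and stripping whitespace before comparison is a choice made for this sketch; an official scorer may normalize answers differently.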
AI2D#
The AI2D dataset contains over 5,000 grade school science diagrams with extensive annotations and 15,000 multiple-choice questions for research on diagram understanding and question answering.
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-ai2d-test --dynamic
The expected test results are:
ai2diagram_test {'accuracy': 0.6408678756476683}
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-ai2d-test --dynamic
The expected test results are:
ai2diagram_test {'accuracy': 0.7409326424870466}
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-ai2d-test --dynamic
The expected test results are:
ai2diagram_test {'accuracy': 0.788860103626943}
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-ai2d-test --dynamic
The expected test results are:
ai2diagram_test {'accuracy': 0.8377590673575129}
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-ai2d-test --dynamic
The expected test results are:
ai2diagram_test {'accuracy': 0.844559585492228}
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-ai2d-test --dynamic --auto
The expected test results are:
ai2diagram_test {'accuracy': 0.8711139896373057}
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-ai2d-test --dynamic --auto
The expected test results are:
ai2diagram_test {'accuracy': 0.8762953367875648}
InfographicVQA#
The InfographicVQA dataset is a collection of infographics accompanied by natural language questions and answers. This dataset includes a diverse range of infographics sourced from thousands of different websites, ensuring a variety of layouts and designs. It comprises 30,035 questions across 5,485 images, split into training, validation, and test sets.
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-infovqa-val --dynamic --max-num 24
The expected test results are:
Overall ANLS: 0.5018
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-infovqa-test --dynamic --max-num 24
For the test set, submit the results to the evaluation server.
The expected test results are:
Overall ANLS: 0.5090
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-infovqa-val --dynamic --max-num 24
The expected test results are:
Overall ANLS: 0.5766
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-infovqa-test --dynamic --max-num 24
For the test set, submit the results to the evaluation server.
The expected test results are:
Overall ANLS: 0.5890
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-infovqa-val --dynamic --max-num 24
The expected test results are:
Overall ANLS: 0.6625
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-infovqa-test --dynamic --max-num 24
For the test set, submit the results to the evaluation server.
The expected test results are:
Overall ANLS: 0.6700
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-infovqa-val --dynamic --max-num 24
The expected test results are:
Overall ANLS: 0.7260
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-infovqa-test --dynamic --max-num 24
For the test set, submit the results to the evaluation server.
The expected test results are:
Overall ANLS: 0.7480
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-infovqa-val --dynamic --max-num 24
The expected test results are:
Overall ANLS: 0.7601
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-infovqa-test --dynamic --max-num 24
For the test set, submit the results to the evaluation server.
The expected test results are:
Overall ANLS: 0.7590
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-infovqa-val --dynamic --max-num 24 --auto
The expected test results are:
Overall ANLS: 0.7851
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-infovqa-test --dynamic --max-num 24 --auto
For the test set, submit the results to the evaluation server.
The expected test results are:
Overall ANLS: 0.7870
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-infovqa-val --dynamic --max-num 24 --auto
The expected test results are:
Overall ANLS: 0.8021
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-infovqa-test --dynamic --max-num 24 --auto
For the test set, submit the results to the evaluation server.
The expected test results are:
Overall ANLS: 0.8200
GQA#
The GQA dataset is a large-scale visual question answering dataset designed for real-world visual reasoning and compositional question answering. It contains over 22 million questions grounded in real images, each accompanied by detailed scene graphs that describe objects, their attributes, and relationships within the scene. The dataset includes images from the Visual Genome dataset, with questions that require various reasoning skills such as spatial understanding and multi-step inference.
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-gqa-testdev --dynamic
The expected test results are:
Accuracy: 59.77%
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B vqa-gqa-testdev --dynamic
The expected test results are:
Accuracy: 61.03%
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B vqa-gqa-testdev --dynamic
The expected test results are:
Accuracy: 62.07%
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B vqa-gqa-testdev --dynamic
The expected test results are:
Accuracy: 63.23%
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B vqa-gqa-testdev --dynamic
The expected test results are:
Accuracy: 64.89%
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B vqa-gqa-testdev --dynamic --auto
The expected test results are:
Accuracy: 64.89%
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B vqa-gqa-testdev --dynamic --auto
The expected test results are:
Accuracy: 65.22%
POPE#
The POPE (Polling-based Object Probing Evaluation) dataset is designed to evaluate object hallucination in MLLMs. The dataset consists of 3,000 questions related to the captions of 500 images. By treating the MLLMs’ answers to these questions as a binary classification task, the dataset allows researchers to measure accuracy, precision, recall, and F1 scores to determine the extent of hallucination in the models.
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B pope --dynamic
The expected test results are:
Category: random, # samples: 2910
TP FP TN FN
1239 51 1359 261
Accuracy: 0.8927835051546392
Precision: 0.9604651162790697
Recall: 0.826
F1 score: 0.8881720430107527
Yes ratio: 0.44329896907216493
0.888, 0.893, 0.960, 0.826, 0.443
====================================
Category: popular, # samples: 3000
TP FP TN FN
1239 93 1407 261
Accuracy: 0.882
Precision: 0.9301801801801802
Recall: 0.826
F1 score: 0.875
Yes ratio: 0.444
0.875, 0.882, 0.930, 0.826, 0.444
====================================
Category: adversarial, # samples: 3000
TP FP TN FN
1239 151 1349 261
Accuracy: 0.8626666666666667
Precision: 0.8913669064748202
Recall: 0.826
F1 score: 0.8574394463667819
Yes ratio: 0.4633333333333333
0.857, 0.863, 0.891, 0.826, 0.463
====================================
result = (88.8 + 87.5 + 85.7) / 3 = 87.3
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B pope --dynamic
The expected test results are:
Category: random, # samples: 2910
TP FP TN FN
1256 39 1371 244
Accuracy: 0.9027491408934708
Precision: 0.9698841698841699
Recall: 0.8373333333333334
F1 score: 0.898747763864043
Yes ratio: 0.44501718213058417
0.899, 0.903, 0.970, 0.837, 0.445
====================================
Category: popular, # samples: 3000
TP FP TN FN
1256 89 1411 244
Accuracy: 0.889
Precision: 0.9338289962825279
Recall: 0.8373333333333334
F1 score: 0.8829525483304044
Yes ratio: 0.4483333333333333
0.883, 0.889, 0.934, 0.837, 0.448
====================================
Category: adversarial, # samples: 3000
TP FP TN FN
1256 139 1361 244
Accuracy: 0.8723333333333333
Precision: 0.9003584229390681
Recall: 0.8373333333333334
F1 score: 0.8677029360967184
Yes ratio: 0.465
0.868, 0.872, 0.900, 0.837, 0.465
====================================
result = (89.9 + 88.3 + 86.8) / 3 = 88.3
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B pope --dynamic
The expected test results are:
Category: random, # samples: 2910
TP FP TN FN
1247 54 1356 253
Accuracy: 0.8945017182130585
Precision: 0.9584934665641814
Recall: 0.8313333333333334
F1 score: 0.8903962870403428
Yes ratio: 0.4470790378006873
0.890, 0.895, 0.958, 0.831, 0.447
====================================
Category: popular, # samples: 3000
TP FP TN FN
1247 116 1384 253
Accuracy: 0.877
Precision: 0.9148936170212766
Recall: 0.8313333333333334
F1 score: 0.8711142158574922
Yes ratio: 0.4543333333333333
0.871, 0.877, 0.915, 0.831, 0.454
====================================
Category: adversarial, # samples: 3000
TP FP TN FN
1247 175 1325 253
Accuracy: 0.8573333333333333
Precision: 0.8769338959212377
Recall: 0.8313333333333334
F1 score: 0.8535249828884327
Yes ratio: 0.474
0.854, 0.857, 0.877, 0.831, 0.474
====================================
result = (89.0 + 87.1 + 85.4) / 3 = 87.2
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B pope --dynamic
The expected test results are:
Category: random, # samples: 2910
TP FP TN FN
1204 29 1381 296
Accuracy: 0.8883161512027491
Precision: 0.9764801297648013
Recall: 0.8026666666666666
F1 score: 0.8810830589096232
Yes ratio: 0.42371134020618556
0.881, 0.888, 0.976, 0.803, 0.424
====================================
Category: popular, # samples: 3000
TP FP TN FN
1204 67 1433 296
Accuracy: 0.879
Precision: 0.9472856018882769
Recall: 0.8026666666666666
F1 score: 0.8690003608805486
Yes ratio: 0.4236666666666667
0.869, 0.879, 0.947, 0.803, 0.424
====================================
Category: adversarial, # samples: 3000
TP FP TN FN
1204 101 1399 296
Accuracy: 0.8676666666666667
Precision: 0.9226053639846743
Recall: 0.8026666666666666
F1 score: 0.8584670231729055
Yes ratio: 0.435
0.858, 0.868, 0.923, 0.803, 0.435
====================================
result = (88.1 + 86.9 + 85.8) / 3 = 86.9
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B pope --dynamic
The expected test results are:
Category: random, # samples: 2910
TP FP TN FN
1221 25 1385 279
Accuracy: 0.89553264604811
Precision: 0.9799357945425361
Recall: 0.814
F1 score: 0.8892935178441369
Yes ratio: 0.4281786941580756
0.889, 0.896, 0.980, 0.814, 0.428
====================================
Category: popular, # samples: 3000
TP FP TN FN
1221 57 1443 279
Accuracy: 0.888
Precision: 0.9553990610328639
Recall: 0.814
F1 score: 0.8790496760259179
Yes ratio: 0.426
0.879, 0.888, 0.955, 0.814, 0.426
====================================
Category: adversarial, # samples: 3000
TP FP TN FN
1221 84 1416 279
Accuracy: 0.879
Precision: 0.9356321839080459
Recall: 0.814
F1 score: 0.8705882352941177
Yes ratio: 0.435
0.871, 0.879, 0.936, 0.814, 0.435
====================================
result = (88.9 + 87.9 + 87.1) / 3 = 88.0
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B pope --dynamic --auto
The expected test results are:
Category: random, # samples: 2910
TP FP TN FN
1232 16 1394 268
Accuracy: 0.902405498281787
Precision: 0.9871794871794872
Recall: 0.8213333333333334
F1 score: 0.8966521106259098
Yes ratio: 0.4288659793814433
0.897, 0.902, 0.987, 0.821, 0.429
====================================
Category: popular, # samples: 3000
TP FP TN FN
1232 65 1435 268
Accuracy: 0.889
Precision: 0.9498843484965305
Recall: 0.8213333333333334
F1 score: 0.8809438684304614
Yes ratio: 0.43233333333333335
0.881, 0.889, 0.950, 0.821, 0.432
====================================
Category: adversarial, # samples: 3000
TP FP TN FN
1232 87 1413 268
Accuracy: 0.8816666666666667
Precision: 0.934040940106141
Recall: 0.8213333333333334
F1 score: 0.8740688187300462
Yes ratio: 0.43966666666666665
0.874, 0.882, 0.934, 0.821, 0.440
====================================
result = (89.7 + 88.1 + 87.4) / 3 = 88.4
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B pope --dynamic --auto
The expected test results are:
Category: random, # samples: 2910
TP FP TN FN
1251 26 1384 249
Accuracy: 0.9054982817869416
Precision: 0.9796397807361003
Recall: 0.834
F1 score: 0.9009722722362261
Yes ratio: 0.4388316151202749
0.901, 0.905, 0.980, 0.834, 0.439
====================================
Category: popular, # samples: 3000
TP FP TN FN
1251 62 1438 249
Accuracy: 0.8963333333333333
Precision: 0.9527798933739527
Recall: 0.834
F1 score: 0.8894418769996445
Yes ratio: 0.43766666666666665
0.889, 0.896, 0.953, 0.834, 0.438
====================================
Category: adversarial, # samples: 3000
TP FP TN FN
1251 91 1409 249
Accuracy: 0.8866666666666667
Precision: 0.9321907600596125
Recall: 0.834
F1 score: 0.8803659394792399
Yes ratio: 0.44733333333333336
0.880, 0.887, 0.932, 0.834, 0.447
====================================
result = (90.1 + 88.9 + 88.0) / 3 = 89.0
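Every metric in the POPE output follows from the four confusion-matrix counts, and the final result line averages the per-category F1 scores after rounding each to one decimal place. The following is a minimal Python sketch of that arithmetic, using the InternVL2-Llama3-76B counts listed above. The function name pope_metrics is illustrative and is not part of the evaluation code.

```python
def pope_metrics(tp, fp, tn, fn):
    """Derive the POPE metrics from confusion-matrix counts ("yes" is the positive class)."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "yes_ratio": (tp + fp) / total,  # share of questions answered "yes"
    }


# (TP, FP, TN, FN) copied from the InternVL2-Llama3-76B output above.
counts = {
    "random": (1251, 26, 1384, 249),
    "popular": (1251, 62, 1438, 249),
    "adversarial": (1251, 91, 1409, 249),
}

metrics = {name: pope_metrics(*c) for name, c in counts.items()}

# Round each F1 to one decimal (as a percentage) before averaging,
# mirroring the "result = (...) / 3" line.
f1_percent = [round(m["f1"] * 100, 1) for m in metrics.values()]
result = round(sum(f1_percent) / len(f1_percent), 1)
print(f1_percent, result)
```

Swapping in the counts for another model should reproduce that model's result line in the same way.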
Tiny LVLM#
The Tiny LVLM-eHub is a streamlined evaluation benchmark designed to assess the multimodal capabilities of MLLMs, including models like Bard. It focuses on six categories of multimodal abilities: visual perception, visual knowledge acquisition, visual reasoning, visual commonsense, object hallucination, and embodied intelligence. The evaluation output below reports scores for the first five categories (embodied intelligence is not included), and Overall is the sum of those five scores.
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B tiny_lvlm --dynamic
The expected test results are:
Visual_Knowledge_Acquisition: 0.6857142857142857
Object_Hallucination: 0.91
Visual_Commonsense: 0.556
Visual_Perception: 0.4875
Visual_Reasoning: 0.6145454545454545
Overall: 3.2537597402597402
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B tiny_lvlm --dynamic
The expected test results are:
Visual_Knowledge_Acquisition: 0.71
Object_Hallucination: 0.91
Visual_Commonsense: 0.558
Visual_Perception: 0.4675
Visual_Reasoning: 0.649090909090909
Overall: 3.294590909090909
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B tiny_lvlm --dynamic
The expected test results are:
Visual_Knowledge_Acquisition: 0.6814285714285714
Object_Hallucination: 0.89
Visual_Commonsense: 0.652
Visual_Perception: 0.4875
Visual_Reasoning: 0.6563636363636364
Overall: 3.3672922077922074
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B tiny_lvlm --dynamic
The expected test results are:
Visual_Knowledge_Acquisition: 0.6985714285714286
Object_Hallucination: 0.8966666666666666
Visual_Commonsense: 0.652
Visual_Perception: 0.485
Visual_Reasoning: 0.6854545454545454
Overall: 3.417692640692641
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B tiny_lvlm --dynamic
The expected test results are:
Visual_Knowledge_Acquisition: 0.7614285714285715
Object_Hallucination: 0.9
Visual_Commonsense: 0.652
Visual_Perception: 0.555
Visual_Reasoning: 0.7109090909090909
Overall: 3.5793376623376627
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B tiny_lvlm --dynamic --auto
The expected test results are:
Visual_Knowledge_Acquisition: 0.75
Object_Hallucination: 0.8966666666666666
Visual_Commonsense: 0.674
Visual_Perception: 0.5325
Visual_Reasoning: 0.730909090909091
Overall: 3.5840757575757576
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B tiny_lvlm --dynamic --auto
The expected test results are:
Visual_Knowledge_Acquisition: 0.7557142857142857
Object_Hallucination: 0.9166666666666666
Visual_Commonsense: 0.69
Visual_Perception: 0.525
Visual_Reasoning: 0.7418181818181818
Overall: 3.629199134199134
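In the outputs above, the Overall value matches the plain sum of the five category scores rather than their average, so its maximum is 5.0. The sketch below checks this against the InternVL2-Llama3-76B numbers. The per-category mean is an extra convenience figure and is not printed by the evaluation script.

```python
# Category scores copied from the InternVL2-Llama3-76B output above.
scores = {
    "Visual_Knowledge_Acquisition": 0.7557142857142857,
    "Object_Hallucination": 0.9166666666666666,
    "Visual_Commonsense": 0.69,
    "Visual_Perception": 0.525,
    "Visual_Reasoning": 0.7418181818181818,
}

overall = sum(scores.values())      # sum of the five categories, maximum 5.0
mean_score = overall / len(scores)  # per-category average, for easier comparison
print(f"Overall: {overall:.4f}, mean per category: {mean_score:.4f}")
```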
MMMU#
The MMMU dataset is a comprehensive benchmark designed to evaluate multimodal models on college-level tasks that require domain-specific knowledge and reasoning. It includes 11,500 questions sourced from college exams, quizzes, and textbooks, spanning six disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions cover 30 subjects and feature 30 types of images, such as charts, diagrams, maps, tables, and more.
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmmu-val --dynamic
The expected test results are:
{'Overall-Art and Design': {'num': 120, 'acc': 0.383}, 'Art': {'num': 30, 'acc': 0.4}, 'Art_Theory': {'num': 30, 'acc': 0.4}, 'Design': {'num': 30, 'acc': 0.567}, 'Music': {'num': 30, 'acc': 0.167}, 'Overall-Business': {'num': 150, 'acc': 0.333}, 'Accounting': {'num': 30, 'acc': 0.333}, 'Economics': {'num': 30, 'acc': 0.433}, 'Finance': {'num': 30, 'acc': 0.067}, 'Manage': {'num': 30, 'acc': 0.367}, 'Marketing': {'num': 30, 'acc': 0.467}, 'Overall-Science': {'num': 150, 'acc': 0.3}, 'Biology': {'num': 30, 'acc': 0.267}, 'Chemistry': {'num': 30, 'acc': 0.233}, 'Geography': {'num': 30, 'acc': 0.367}, 'Math': {'num': 30, 'acc': 0.167}, 'Physics': {'num': 30, 'acc': 0.467}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.313}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.433}, 'Clinical_Medicine': {'num': 30, 'acc': 0.233}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.4}, 'Pharmacy': {'num': 30, 'acc': 0.3}, 'Public_Health': {'num': 30, 'acc': 0.2}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.483}, 'History': {'num': 30, 'acc': 0.4}, 'Literature': {'num': 30, 'acc': 0.667}, 'Sociology': {'num': 30, 'acc': 0.467}, 'Psychology': {'num': 30, 'acc': 0.4}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.348}, 'Agriculture': {'num': 30, 'acc': 0.233}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.367}, 'Computer_Science': {'num': 30, 'acc': 0.4}, 'Electronics': {'num': 30, 'acc': 0.4}, 'Energy_and_Power': {'num': 30, 'acc': 0.333}, 'Materials': {'num': 30, 'acc': 0.4}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.3},
'Overall': {'num': 900, 'acc': 0.354}}
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmmu-test --dynamic
For the test set, submit the results to the evaluation server.
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B mmmu-val --dynamic
The expected test results are:
{'Overall-Art and Design': {'num': 120, 'acc': 0.392}, 'Art': {'num': 30, 'acc': 0.467}, 'Art_Theory': {'num': 30, 'acc': 0.4}, 'Design': {'num': 30, 'acc': 0.5}, 'Music': {'num': 30, 'acc': 0.2}, 'Overall-Business': {'num': 150, 'acc': 0.347}, 'Accounting': {'num': 30, 'acc': 0.367}, 'Economics': {'num': 30, 'acc': 0.333}, 'Finance': {'num': 30, 'acc': 0.333}, 'Manage': {'num': 30, 'acc': 0.367}, 'Marketing': {'num': 30, 'acc': 0.333}, 'Overall-Science': {'num': 150, 'acc': 0.213}, 'Biology': {'num': 30, 'acc': 0.233}, 'Chemistry': {'num': 30, 'acc': 0.1}, 'Geography': {'num': 30, 'acc': 0.167}, 'Math': {'num': 30, 'acc': 0.367}, 'Physics': {'num': 30, 'acc': 0.2}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.373}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.433}, 'Clinical_Medicine': {'num': 30, 'acc': 0.4}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.4}, 'Pharmacy': {'num': 30, 'acc': 0.267}, 'Public_Health': {'num': 30, 'acc': 0.367}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.492}, 'History': {'num': 30, 'acc': 0.4}, 'Literature': {'num': 30, 'acc': 0.767}, 'Sociology': {'num': 30, 'acc': 0.433}, 'Psychology': {'num': 30, 'acc': 0.367}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.3}, 'Agriculture': {'num': 30, 'acc': 0.433}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.233}, 'Computer_Science': {'num': 30, 'acc': 0.233}, 'Electronics': {'num': 30, 'acc': 0.367}, 'Energy_and_Power': {'num': 30, 'acc': 0.233}, 'Materials': {'num': 30, 'acc': 0.4}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.2},
'Overall': {'num': 900, 'acc': 0.343}}
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B mmmu-test --dynamic
For the test set, submit the results to the evaluation server.
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B mmmu-val --dynamic
The expected test results are:
'Overall': {'num': 900, 'acc': 0.470}
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B mmmu-test --dynamic
For the test set, submit the results to the evaluation server.
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B mmmu-val --dynamic
The expected test results are:
{'Overall-Art and Design': {'num': 120, 'acc': 0.608}, 'Art': {'num': 30, 'acc': 0.733}, 'Art_Theory': {'num': 30, 'acc': 0.7}, 'Design': {'num': 30, 'acc': 0.733}, 'Music': {'num': 30, 'acc': 0.267}, 'Overall-Business': {'num': 150, 'acc': 0.453}, 'Accounting': {'num': 30, 'acc': 0.467}, 'Economics': {'num': 30, 'acc': 0.533}, 'Finance': {'num': 30, 'acc': 0.333}, 'Manage': {'num': 30, 'acc': 0.4}, 'Marketing': {'num': 30, 'acc': 0.533}, 'Overall-Science': {'num': 150, 'acc': 0.393}, 'Biology': {'num': 30, 'acc': 0.467}, 'Chemistry': {'num': 30, 'acc': 0.267}, 'Geography': {'num': 30, 'acc': 0.4}, 'Math': {'num': 30, 'acc': 0.5}, 'Physics': {'num': 30, 'acc': 0.333}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.507}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.567}, 'Clinical_Medicine': {'num': 30, 'acc': 0.667}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.467}, 'Pharmacy': {'num': 30, 'acc': 0.367}, 'Public_Health': {'num': 30, 'acc': 0.467}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.717}, 'History': {'num': 30, 'acc': 0.767}, 'Literature': {'num': 30, 'acc': 0.9}, 'Sociology': {'num': 30, 'acc': 0.7}, 'Psychology': {'num': 30, 'acc': 0.5}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.39}, 'Agriculture': {'num': 30, 'acc': 0.533}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.333}, 'Computer_Science': {'num': 30, 'acc': 0.5}, 'Electronics': {'num': 30, 'acc': 0.467}, 'Energy_and_Power': {'num': 30, 'acc': 0.4}, 'Materials': {'num': 30, 'acc': 0.233}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.267},
'Overall': {'num': 900, 'acc': 0.493}}
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B mmmu-test --dynamic
For the test set, submit the results to the evaluation server.
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B mmmu-val --dynamic
The expected test results are:
{'Overall-Art and Design': {'num': 120, 'acc': 0.7}, 'Art': {'num': 30, 'acc': 0.767}, 'Art_Theory': {'num': 30, 'acc': 0.867}, 'Design': {'num': 30, 'acc': 0.867}, 'Music': {'num': 30, 'acc': 0.3}, 'Overall-Business': {'num': 150, 'acc': 0.407}, 'Accounting': {'num': 30, 'acc': 0.467}, 'Economics': {'num': 30, 'acc': 0.3}, 'Finance': {'num': 30, 'acc': 0.333}, 'Manage': {'num': 30, 'acc': 0.5}, 'Marketing': {'num': 30, 'acc': 0.433}, 'Overall-Science': {'num': 150, 'acc': 0.373}, 'Biology': {'num': 30, 'acc': 0.6}, 'Chemistry': {'num': 30, 'acc': 0.2}, 'Geography': {'num': 30, 'acc': 0.5}, 'Math': {'num': 30, 'acc': 0.233}, 'Physics': {'num': 30, 'acc': 0.333}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.453}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.467}, 'Clinical_Medicine': {'num': 30, 'acc': 0.567}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.367}, 'Pharmacy': {'num': 30, 'acc': 0.367}, 'Public_Health': {'num': 30, 'acc': 0.5}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.7}, 'History': {'num': 30, 'acc': 0.7}, 'Literature': {'num': 30, 'acc': 0.9}, 'Sociology': {'num': 30, 'acc': 0.6}, 'Psychology': {'num': 30, 'acc': 0.6}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.39}, 'Agriculture': {'num': 30, 'acc': 0.467}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.267}, 'Computer_Science': {'num': 30, 'acc': 0.367}, 'Electronics': {'num': 30, 'acc': 0.367}, 'Energy_and_Power': {'num': 30, 'acc': 0.5}, 'Materials': {'num': 30, 'acc': 0.433}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.333},
'Overall': {'num': 900, 'acc': 0.483}}
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B mmmu-test --dynamic
For the test set, submit the results to the evaluation server.
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mmmu-val --dynamic --auto
The expected test results are:
{'Overall-Art and Design': {'num': 120, 'acc': 0.675}, 'Art': {'num': 30, 'acc': 0.733}, 'Art_Theory': {'num': 30, 'acc': 0.833}, 'Design': {'num': 30, 'acc': 0.767}, 'Music': {'num': 30, 'acc': 0.367}, 'Overall-Business': {'num': 150, 'acc': 0.44}, 'Accounting': {'num': 30, 'acc': 0.467}, 'Economics': {'num': 30, 'acc': 0.567}, 'Finance': {'num': 30, 'acc': 0.333}, 'Manage': {'num': 30, 'acc': 0.367}, 'Marketing': {'num': 30, 'acc': 0.467}, 'Overall-Science': {'num': 150, 'acc': 0.493}, 'Biology': {'num': 30, 'acc': 0.633}, 'Chemistry': {'num': 30, 'acc': 0.3}, 'Geography': {'num': 30, 'acc': 0.5}, 'Math': {'num': 30, 'acc': 0.5}, 'Physics': {'num': 30, 'acc': 0.533}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.593}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.5}, 'Clinical_Medicine': {'num': 30, 'acc': 0.6}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.4}, 'Pharmacy': {'num': 30, 'acc': 0.667}, 'Public_Health': {'num': 30, 'acc': 0.8}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.717}, 'History': {'num': 30, 'acc': 0.767}, 'Literature': {'num': 30, 'acc': 0.833}, 'Sociology': {'num': 30, 'acc': 0.6}, 'Psychology': {'num': 30, 'acc': 0.667}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.424}, 'Agriculture': {'num': 30, 'acc': 0.6}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.333}, 'Computer_Science': {'num': 30, 'acc': 0.467}, 'Electronics': {'num': 30, 'acc': 0.433}, 'Energy_and_Power': {'num': 30, 'acc': 0.467}, 'Materials': {'num': 30, 'acc': 0.3}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.367},
'Overall': {'num': 900, 'acc': 0.539}}
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mmmu-test --dynamic --auto
For the test set, submit the results to the evaluation server.
For the validation set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mmmu-val --dynamic --auto
The expected test results are:
{'Overall-Art and Design': {'num': 120, 'acc': 0.683}, 'Art': {'num': 30, 'acc': 0.767}, 'Art_Theory': {'num': 30, 'acc': 0.933}, 'Design': {'num': 30, 'acc': 0.7}, 'Music': {'num': 30, 'acc': 0.333}, 'Overall-Business': {'num': 150, 'acc': 0.567}, 'Accounting': {'num': 30, 'acc': 0.5}, 'Economics': {'num': 30, 'acc': 0.567}, 'Finance': {'num': 30, 'acc': 0.433}, 'Manage': {'num': 30, 'acc': 0.633}, 'Marketing': {'num': 30, 'acc': 0.7}, 'Overall-Science': {'num': 150, 'acc': 0.413}, 'Biology': {'num': 30, 'acc': 0.467}, 'Chemistry': {'num': 30, 'acc': 0.3}, 'Geography': {'num': 30, 'acc': 0.433}, 'Math': {'num': 30, 'acc': 0.367}, 'Physics': {'num': 30, 'acc': 0.5}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.587}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.533}, 'Clinical_Medicine': {'num': 30, 'acc': 0.667}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.433}, 'Pharmacy': {'num': 30, 'acc': 0.6}, 'Public_Health': {'num': 30, 'acc': 0.7}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.725}, 'History': {'num': 30, 'acc': 0.733}, 'Literature': {'num': 30, 'acc': 0.867}, 'Sociology': {'num': 30, 'acc': 0.633}, 'Psychology': {'num': 30, 'acc': 0.667}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.443}, 'Agriculture': {'num': 30, 'acc': 0.6}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.367}, 'Computer_Science': {'num': 30, 'acc': 0.567}, 'Electronics': {'num': 30, 'acc': 0.433}, 'Energy_and_Power': {'num': 30, 'acc': 0.367}, 'Materials': {'num': 30, 'acc': 0.267}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.5},
'Overall': {'num': 900, 'acc': 0.552}}
For the test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mmmu-test --dynamic --auto
For the test set, submit the results to the evaluation server.
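The Overall accuracy on the validation set is consistent with a question-count-weighted mean of the six discipline groups. The sketch below recomputes it from the InternVL2-Llama3-76B group results. Because the group accuracies are already rounded to three decimals, small deviations in the last digit are possible for other models.

```python
# (num, acc) for the six discipline groups, copied from the
# InternVL2-Llama3-76B validation output above.
disciplines = {
    "Art and Design": (120, 0.683),
    "Business": (150, 0.567),
    "Science": (150, 0.413),
    "Health and Medicine": (150, 0.587),
    "Humanities and Social Science": (120, 0.725),
    "Tech and Engineering": (210, 0.443),
}

total = sum(num for num, _ in disciplines.values())
# Weight each group's accuracy by its number of questions.
overall = sum(num * acc for num, acc in disciplines.values()) / total
print(total, round(overall, 3))
```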
MMVet (GPT-4-0613)#
⚠️ Warning: Here, we use GPT-4-0613 as the judge model, while in VLMEvalKit, GPT-4-Turbo is used as the judge model. Using different versions of GPT-4 can result in significant score variations. Therefore, testing the same model with the two codebases can lead to notable score differences.
The MM-Vet dataset is a comprehensive benchmark designed to evaluate the integrated capabilities of MLLMs. It encompasses six core vision-language (VL) capabilities: recognition, knowledge, optical character recognition (OCR), spatial awareness, language generation, and math. The dataset includes 200 images and 218 questions, each requiring one or more of these capabilities to answer. The evaluation uses an open-ended LLM-based approach, allowing assessment across various answer styles and question types.
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmvet --dynamic
Then, submit the results to the evaluation server. The expected test results are:
runs: [37.8]
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B mmvet --dynamic
Then, submit the results to the evaluation server. The expected test results are:
runs: [44.6]
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B mmvet --dynamic
Then, submit the results to the evaluation server. The expected test results are:
runs: [55.7]
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B mmvet --dynamic
Then, submit the results to the evaluation server. The expected test results are:
runs: [60.0]
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B mmvet --dynamic
Then, submit the results to the evaluation server. The expected test results are:
runs: [64.2]
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mmvet --dynamic --auto
Then, submit the results to the evaluation server. The expected test results are:
runs: [68.5]
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mmvet --dynamic --auto
Then, submit the results to the evaluation server. The expected test results are:
runs: [69.8]
MMBench#
The MMBench dataset is a comprehensive multi-modality benchmark designed to evaluate the fine-grained abilities of vision-language models. It contains around 3,000 multiple-choice questions covering 20 ability dimensions, structured into a hierarchical taxonomy. These dimensions include perception and reasoning abilities, further broken down into specific skills like coarse and fine-grained perception, attribute reasoning, and logic reasoning.
For the English dev / test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmbench-dev-en --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmbench-test-en --dynamic
Then, submit the results to the evaluation server. The expected test results are:
mmbench-dev-en: -
mmbench-test-en: 65.4
For the Chinese dev / test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmbench-dev-cn --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmbench-test-cn --dynamic
Then, submit the results to the evaluation server. The expected test results are:
mmbench-dev-cn: -
mmbench-test-cn: 60.7
For the English dev / test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B mmbench-dev-en --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B mmbench-test-en --dynamic
Then, submit the results to the evaluation server. The expected test results are:
mmbench-dev-en: -
mmbench-test-en: 73.2
For the Chinese dev / test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B mmbench-dev-cn --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B mmbench-test-cn --dynamic
Then, submit the results to the evaluation server. The expected test results are:
mmbench-dev-cn: -
mmbench-test-cn: 70.9
For the English dev / test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B mmbench-dev-en --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B mmbench-test-en --dynamic
Then, submit the results to the evaluation server. The expected test results are:
mmbench-dev-en: -
mmbench-test-en: 78.6
For the Chinese dev / test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B mmbench-dev-cn --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B mmbench-test-cn --dynamic
Then, submit the results to the evaluation server. The expected test results are:
mmbench-dev-cn: -
mmbench-test-cn: 73.9
For the English dev / test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B mmbench-dev-en --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B mmbench-test-en --dynamic
Then, submit the results to the evaluation server. The expected test results are:
mmbench-dev-en: -
mmbench-test-en: 81.7
For the Chinese dev / test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B mmbench-dev-cn --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B mmbench-test-cn --dynamic
Then, submit the results to the evaluation server. The expected test results are:
mmbench-dev-cn: -
mmbench-test-cn: 81.2
For the English dev / test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B mmbench-dev-en --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B mmbench-test-en --dynamic
Then, submit the results to the evaluation server. The expected test results are:
mmbench-dev-en: -
mmbench-test-en: 83.4
For the Chinese dev / test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B mmbench-dev-cn --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B mmbench-test-cn --dynamic
Then, submit the results to the evaluation server. The expected test results are:
mmbench-dev-cn: -
mmbench-test-cn: 82.0
For the English dev / test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mmbench-dev-en --dynamic --auto
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mmbench-test-en --dynamic --auto
Then, submit the results to the evaluation server. The expected test results are:
mmbench-dev-en: -
mmbench-test-en: 86.8
For the Chinese dev / test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mmbench-dev-cn --dynamic --auto
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mmbench-test-cn --dynamic --auto
Then, submit the results to the evaluation server. The expected test results are:
mmbench-dev-cn: -
mmbench-test-cn: 86.5
For the English dev / test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mmbench-dev-en --dynamic --auto
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mmbench-test-en --dynamic --auto
Then, submit the results to the evaluation server. The expected test results are:
mmbench-dev-en: -
mmbench-test-en: 86.5
For the Chinese dev / test set, run:
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mmbench-dev-cn --dynamic --auto
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mmbench-test-cn --dynamic --auto
Then, submit the results to the evaluation server. The expected test results are:
mmbench-dev-cn: -
mmbench-test-cn: 86.3
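The MMBench commands above differ only in the model name, the split, and whether --auto is passed. The sketch below generates them in one pass as a dry run: it only builds and prints the command strings and does not execute anything. The helper name build_commands is hypothetical, while the flags and split names are taken from the commands above.

```python
# Model names and flags as they appear in the commands above; the two
# largest models are the ones run with the additional --auto flag.
MODELS = {
    "InternVL2-1B": "--dynamic",
    "InternVL2-2B": "--dynamic",
    "InternVL2-4B": "--dynamic",
    "InternVL2-8B": "--dynamic",
    "InternVL2-26B": "--dynamic",
    "InternVL2-40B": "--dynamic --auto",
    "InternVL2-Llama3-76B": "--dynamic --auto",
}
SPLITS = ["mmbench-dev-en", "mmbench-test-en", "mmbench-dev-cn", "mmbench-test-cn"]


def build_commands(models=MODELS, splits=SPLITS, gpus=8):
    """Return the evaluate.sh command lines without executing them."""
    return [
        f"GPUS={gpus} sh evaluate.sh pretrained/{model} {split} {flags}"
        for model, flags in models.items()
        for split in splits
    ]


commands = build_commands()
for line in commands:
    print(line)
```

The printed lines can be pasted into a shell or written to a script. The same pattern applies to the other benchmarks by changing the split names.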
CCBench#
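CCBench is a multimodal benchmark in the domain of Chinese culture, designed to evaluate the performance of MLLMs on tasks specifically related to Chinese cultural content. It is part of the larger MMBench suite of benchmarks developed by the OpenCompass community and includes 510 multiple-choice questions focusing on cultural knowledge and understanding.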
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B ccbench-dev --dynamic
Then, submit the results to the evaluation server. The expected test results are:
ccbench-dev: 75.7
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B ccbench-dev --dynamic
Then, submit the results to the evaluation server. The expected test results are:
ccbench-dev: 74.7
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B ccbench-dev --dynamic
Then, submit the results to the evaluation server. The expected test results are:
ccbench-dev: 66.5
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B ccbench-dev --dynamic
Then, submit the results to the evaluation server. The expected test results are:
ccbench-dev: 75.9
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B ccbench-dev --dynamic
Then, submit the results to the evaluation server. The expected test results are:
ccbench-dev: 73.5
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B ccbench-dev --dynamic --auto
Then, submit the results to the evaluation server. The expected test results are:
ccbench-dev: 80.6
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B ccbench-dev --dynamic --auto
Then, submit the results to the evaluation server. The expected test results are:
ccbench-dev: 81.0
SEED#
SEED-Bench is a multiple-choice benchmark that evaluates MLLMs across 12 dimensions of image and video understanding, ranging from scene understanding and instance counting to action prediction and procedure understanding. The evaluation reports accuracy for each dimension, along with aggregate image and video accuracy.
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B seed --dynamic
The expected test results are:
Acc@1: 0.6074485825458588
length: 17990
Accuracy for each data type:
Data type Scene Understanding: 73.05%
Data type Instance Identity: 71.16%
Data type Instance Location: 69.23%
Data type Instance Attributes: 58.49%
Data type Instances Counting: 52.55%
Data type Spatial Relation: 43.53%
Data type Instance Interaction: 71.13%
Data type Visual Reasoning: 72.51%
Data type Text Understanding: 68.60%
Data type Action Recognition: 53.55%
Data type Action Prediction: 39.92%
Data type Procedure Understanding: 28.74%
Total accuracy: 60.76%
Image accuracy: 65.62%
Video accuracy: 42.35%
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B seed --dynamic
The expected test results are:
Acc@1: 0.6656475819899944
length: 17990
Accuracy for each data type:
Data type Scene Understanding: 76.92%
Data type Instance Identity: 76.79%
Data type Instance Location: 75.04%
Data type Instance Attributes: 65.44%
Data type Instances Counting: 60.40%
Data type Spatial Relation: 54.03%
Data type Instance Interaction: 72.16%
Data type Visual Reasoning: 76.74%
Data type Text Understanding: 74.42%
Data type Action Recognition: 60.04%
Data type Action Prediction: 43.27%
Data type Procedure Understanding: 34.70%
Total accuracy: 66.56%
Image accuracy: 71.55%
Video accuracy: 47.67%
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B seed --dynamic
The expected test results are:
Acc@1: 0.6934408004446915
length: 17990
Accuracy for each data type:
Data type Scene Understanding: 78.75%
Data type Instance Identity: 76.79%
Data type Instance Location: 77.45%
Data type Instance Attributes: 66.36%
Data type Instances Counting: 64.57%
Data type Spatial Relation: 56.47%
Data type Instance Interaction: 71.13%
Data type Visual Reasoning: 78.25%
Data type Text Understanding: 75.58%
Data type Action Recognition: 60.57%
Data type Action Prediction: 47.84%
Data type Procedure Understanding: 47.80%
Total accuracy: 69.34%
Image accuracy: 73.67%
Video accuracy: 52.94%
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B seed --dynamic
The expected test results are:
Acc@1: 0.7072262367982213
length: 17990
Accuracy for each data type:
Data type Scene Understanding: 79.89%
Data type Instance Identity: 78.97%
Data type Instance Location: 79.50%
Data type Instance Attributes: 69.84%
Data type Instances Counting: 68.08%
Data type Spatial Relation: 64.23%
Data type Instance Interaction: 79.38%
Data type Visual Reasoning: 78.85%
Data type Text Understanding: 75.58%
Data type Action Recognition: 60.70%
Data type Action Prediction: 48.57%
Data type Procedure Understanding: 36.56%
Total accuracy: 70.72%
Image accuracy: 76.15%
Video accuracy: 50.17%
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B seed --dynamic
The expected test results are:
Acc@1: 0.7245136186770428
length: 17990
Accuracy for each data type:
Data type Scene Understanding: 80.30%
Data type Instance Identity: 80.39%
Data type Instance Location: 79.88%
Data type Instance Attributes: 71.78%
Data type Instances Counting: 69.68%
Data type Spatial Relation: 61.95%
Data type Instance Interaction: 75.26%
Data type Visual Reasoning: 79.15%
Data type Text Understanding: 68.60%
Data type Action Recognition: 65.47%
Data type Action Prediction: 54.20%
Data type Procedure Understanding: 44.28%
Total accuracy: 72.45%
Image accuracy: 76.79%
Video accuracy: 56.03%
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B seed --dynamic --auto
The expected test results are:
Acc@1: 0.7464146748193441
length: 17990
Accuracy for each data type:
Data type Scene Understanding: 80.62%
Data type Instance Identity: 82.36%
Data type Instance Location: 80.92%
Data type Instance Attributes: 71.68%
Data type Instances Counting: 72.46%
Data type Spatial Relation: 66.36%
Data type Instance Interaction: 78.35%
Data type Visual Reasoning: 80.06%
Data type Text Understanding: 66.28%
Data type Action Recognition: 67.93%
Data type Action Prediction: 57.47%
Data type Procedure Understanding: 56.40%
Total accuracy: 74.65%
Image accuracy: 78.15%
Video accuracy: 61.38%
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B seed --dynamic --auto
The expected test results are:
Acc@1: 0.7446359088382435
length: 17990
Accuracy for each data type:
Data type Scene Understanding: 80.40%
Data type Instance Identity: 82.25%
Data type Instance Location: 80.66%
Data type Instance Attributes: 73.31%
Data type Instances Counting: 72.78%
Data type Spatial Relation: 65.14%
Data type Instance Interaction: 79.38%
Data type Visual Reasoning: 79.15%
Data type Text Understanding: 77.91%
Data type Action Recognition: 68.26%
Data type Action Prediction: 55.10%
Data type Procedure Understanding: 55.23%
Total accuracy: 74.46%
Image accuracy: 78.17%
Video accuracy: 60.42%
MMVP#
The MMVP dataset is designed to benchmark the performance of multimodal large language models (MLLMs) in visual question answering tasks. This dataset focuses on identifying “CLIP-blind pairs,” which are images that appear similar to the CLIP model despite having clear visual differences. The MMVP dataset includes 300 images derived from ImageNet-1k and LAION-Aesthetics, each paired with straightforward questions to evaluate the models’ visual capabilities. It highlights the challenges these systems face, often leading to incorrect responses and hallucinated explanations.
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmvp --dynamic
The expected test results are:
Evaluating MMVP ...
Results saved to results/MMVP_240708020850.jsonl
The accuracy is 0.2
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B mmvp --dynamic
The expected test results are:
Evaluating MMVP ...
Results saved to results/MMVP_240702122300.jsonl
The accuracy is 0.35333333333333333
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B mmvp --dynamic
The expected test results are:
Evaluating MMVP ...
Results saved to results/MMVP_240702144108.jsonl
The accuracy is 0.4066666666666667
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B mmvp --dynamic
The expected test results are:
Evaluating MMVP ...
Results saved to results/MMVP_240703200956.jsonl
The accuracy is 0.5133333333333333
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B mmvp --dynamic
The expected test results are:
Evaluating MMVP ...
Results saved to results/MMVP_240704024433.jsonl
The accuracy is 0.5466666666666666
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mmvp --dynamic --auto
The expected test results are:
Evaluating MMVP ...
Results saved to results/MMVP_240708045836.jsonl
The accuracy is 0.5866666666666667
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mmvp --dynamic --auto
The expected test results are:
Evaluating MMVP ...
Results saved to results/MMVP_240718203234.jsonl
The accuracy is 0.5266666666666666
RefCOCO Series#
RefCOCO, RefCOCO+, and RefCOCOg are datasets used for tasks involving referring expression comprehension, segmentation, and generation. They are built upon the MSCOCO dataset and are widely used to evaluate how well models connect natural-language expressions to specific objects in an image.
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B refcoco --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B refcoco --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B refcoco --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B refcoco --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B refcoco --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B refcoco --dynamic --auto
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B refcoco --dynamic --auto
The expected test results are:
| Model | avg. | RefCOCO (val) | RefCOCO (testA) | RefCOCO (testB) | RefCOCO+ (val) | RefCOCO+ (testA) | RefCOCO+ (testB) | RefCOCO-g (val) | RefCOCO-g (test) |
|---|---|---|---|---|---|---|---|---|---|
| InternVL2-1B | 79.9 | 83.6 | 88.7 | 79.8 | 76.0 | 83.6 | 67.7 | 80.2 | 79.9 |
| InternVL2-2B | 77.7 | 82.3 | 88.2 | 75.9 | 73.5 | 82.8 | 63.3 | 77.6 | 78.3 |
| InternVL2-4B | 84.4 | 88.5 | 91.2 | 83.9 | 81.2 | 87.2 | 73.8 | 84.6 | 84.6 |
| InternVL2-8B | 82.9 | 87.1 | 91.1 | 80.7 | 79.8 | 87.9 | 71.4 | 82.7 | 82.7 |
| InternVL2-26B | 88.5 | 91.2 | 93.3 | 87.4 | 86.8 | 91.0 | 81.2 | 88.5 | 88.6 |
| InternVL2-40B | 90.3 | 93.0 | 94.7 | 89.2 | 88.5 | 92.8 | 83.6 | 90.3 | 90.6 |
| InternVL2-Llama3-76B | 90.0 | 92.2 | 94.8 | 88.4 | 88.8 | 93.1 | 82.8 | 89.5 | 90.3 |
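The `avg.` column is consistent with an unweighted mean of the eight per-split scores in each row, rounded to one decimal place. The sketch below checks that reading against three rows copied from the table. The function name `refcoco_average` is hypothetical, and the averaging rule is inferred from the numbers rather than taken from the evaluation code.

```python
from statistics import mean

# Eight per-split scores per model, copied from the table above in column order.
SPLIT_SCORES = {
    "InternVL2-1B": [83.6, 88.7, 79.8, 76.0, 83.6, 67.7, 80.2, 79.9],
    "InternVL2-2B": [82.3, 88.2, 75.9, 73.5, 82.8, 63.3, 77.6, 78.3],
    "InternVL2-26B": [91.2, 93.3, 87.4, 86.8, 91.0, 81.2, 88.5, 88.6],
}

# The avg. values reported for the same models.
REPORTED_AVG = {"InternVL2-1B": 79.9, "InternVL2-2B": 77.7, "InternVL2-26B": 88.5}


def refcoco_average(scores: list[float]) -> float:
    """Unweighted mean of the per-split scores, rounded to one decimal."""
    return round(mean(scores), 1)


if __name__ == "__main__":
    for model, scores in SPLIT_SCORES.items():
        print(model, refcoco_average(scores), REPORTED_AVG[model])
```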
MVBench#
MVBench is a comprehensive multimodal video understanding benchmark developed to evaluate the temporal comprehension capabilities of MLLMs. It includes 20 challenging video tasks that require temporal understanding and cannot be effectively solved using a single frame. The benchmark uses a novel static-to-dynamic method, transforming static tasks into dynamic ones to systematically generate video tasks that demand a wide range of temporal skills, from perception to cognition.
We evaluate our models on MVBench by extracting 16 frames from each video and resizing each frame to 448x448.
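The text does not say how the 16 frames are chosen. A common approach is to split the video into equal segments and take the middle frame of each one. The sketch below illustrates that approach only. Uniform sampling is an assumption here, `sample_frame_indices` is a hypothetical name, and the repository's actual extraction code may differ. Decoding and resizing the frames would need a video or image library and is left out.

```python
def sample_frame_indices(total_frames: int, num_segments: int = 16) -> list[int]:
    """Pick one frame index from the middle of each equal-length segment.

    This is an illustrative uniform sampler, not the repository's implementation.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    segment_length = total_frames / num_segments
    # Clamp to the last valid index so short videos never go out of range.
    return [
        min(total_frames - 1, int(segment_length * i + segment_length / 2))
        for i in range(num_segments)
    ]


if __name__ == "__main__":
    # A 160-frame video yields one index from each 10-frame segment.
    print(sample_frame_indices(160))
```

For videos with fewer frames than segments, this sampler repeats indices rather than failing, so the model still receives 16 frames.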
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mvbench --dynamic --max-num 1
The expected test results are:
57.9
GPUS=8 sh evaluate.sh pretrained/InternVL2-2B mvbench --dynamic --max-num 1
The expected test results are:
60.2
GPUS=8 sh evaluate.sh pretrained/InternVL2-4B mvbench --dynamic --max-num 1
The expected test results are:
63.7
GPUS=8 sh evaluate.sh pretrained/InternVL2-8B mvbench --dynamic --max-num 1
The expected test results are:
66.4
GPUS=8 sh evaluate.sh pretrained/InternVL2-26B mvbench --dynamic --max-num 1
The expected test results are:
67.5
GPUS=8 sh evaluate.sh pretrained/InternVL2-40B mvbench --dynamic --max-num 1 --auto
The expected test results are:
72.5
GPUS=8 sh evaluate.sh pretrained/InternVL2-Llama3-76B mvbench --dynamic --max-num 1 --auto
The expected test results are:
69.6
Evaluation using VLMEvalKit Codebase#
Data Preparation#
VLMEvalKit will automatically download the data for evaluation, so you do not need to prepare it manually.
MathVista#
The MathVista dataset is a comprehensive benchmark for evaluating mathematical reasoning within visual contexts. It includes three newly created datasets (IQTest, FunctionQA, and PaperQA) designed to address logical reasoning on puzzle test figures, algebraic reasoning over functional plots, and scientific reasoning with academic paper figures, respectively.
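The result tables in this section are CSV with the columns `tot`, `prefetch`, `hit`, `prefetch_rate`, and `acc`. The numbers are consistent with `prefetch_rate = prefetch / tot * 100` and `acc = hit / tot * 100`. The sketch below recomputes both rates for two rows copied from the InternVL2-76B table. The relationship is inferred from the reported values, not from the VLMEvalKit source, and `recompute_rates` is a hypothetical helper.

```python
import csv
import io
import math

# Header and two rows copied from the InternVL2-76B results in this section.
SAMPLE = '''"Task&Skill","tot","prefetch","hit","prefetch_rate","acc"
"Overall","1000","534","655","53.400000000000006","65.5"
"logical reasoning","37","7","6","18.91891891891892","16.216216216216218"
'''


def recompute_rates(csv_text: str) -> list[dict]:
    """Recompute prefetch_rate and acc from the raw counts in each row."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        total = int(row["tot"])
        rows.append({
            "task": row["Task&Skill"],
            "prefetch_rate": int(row["prefetch"]) / total * 100,
            "acc": int(row["hit"]) / total * 100,
            "reported_prefetch_rate": float(row["prefetch_rate"]),
            "reported_acc": float(row["acc"]),
        })
    return rows


def matches_reported(row: dict) -> bool:
    """True when the recomputed rates agree with the reported ones."""
    return (
        math.isclose(row["acc"], row["reported_acc"], rel_tol=1e-9)
        and math.isclose(row["prefetch_rate"], row["reported_prefetch_rate"], rel_tol=1e-9)
    )


if __name__ == "__main__":
    for entry in recompute_rates(SAMPLE):
        print(entry["task"], round(entry["acc"], 1), matches_reported(entry))
```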
torchrun --nproc-per-node=8 run.py --data MathVista_MINI --model InternVL2-1B --verbose
The expected test results are:
"Task&Skill","tot","prefetch","hit","prefetch_rate","acc"
"Overall","1000","489","377","48.9","37.7"
"scientific reasoning","122","85","45","69.67213114754098","36.885245901639344"
"textbook question answering","158","92","63","58.22784810126582","39.87341772151899"
"numeric commonsense","144","39","24","27.083333333333332","16.666666666666664"
"arithmetic reasoning","353","102","103","28.89518413597734","29.178470254957507"
"visual question answering","179","92","53","51.39664804469274","29.608938547486037"
"geometry reasoning","239","147","95","61.50627615062761","39.74895397489539"
"algebraic reasoning","281","170","112","60.4982206405694","39.8576512455516"
"geometry problem solving","208","138","85","66.34615384615384","40.86538461538461"
"math word problem","186","26","52","13.978494623655912","27.956989247311824"
"logical reasoning","37","11","5","29.72972972972973","13.513513513513514"
"figure question answering","269","141","124","52.41635687732342","46.09665427509294"
"statistical reasoning","301","144","148","47.840531561461795","49.16943521594684"
torchrun --nproc-per-node=8 run.py --data MathVista_MINI --model InternVL2-2B --verbose
The expected test results are:
"Task&Skill","tot","prefetch","hit","prefetch_rate","acc"
"Overall","1000","476","464","47.599999999999994","46.400000000000006"
"scientific reasoning","122","83","68","68.0327868852459","55.73770491803278"
"textbook question answering","158","95","79","60.12658227848101","50.0"
"numeric commonsense","144","35","37","24.305555555555554","25.694444444444443"
"arithmetic reasoning","353","100","146","28.328611898016998","41.359773371104815"
"visual question answering","179","91","86","50.83798882681564","48.04469273743017"
"geometry reasoning","239","144","103","60.25104602510461","43.09623430962343"
"algebraic reasoning","281","171","117","60.854092526690394","41.637010676156585"
"geometry problem solving","208","136","94","65.38461538461539","45.19230769230769"
"math word problem","186","20","62","10.75268817204301","33.33333333333333"
"logical reasoning","37","11","4","29.72972972972973","10.81081081081081"
"figure question answering","269","134","143","49.814126394052046","53.159851301115246"
"statistical reasoning","301","137","180","45.51495016611295","59.800664451827245"
torchrun --nproc-per-node=8 run.py --data MathVista_MINI --model InternVL2-4B --verbose
The expected test results are:
"Task&Skill","tot","prefetch","hit","prefetch_rate","acc"
"Overall","1000","544","587","54.400000000000006","58.699999999999996"
"scientific reasoning","122","88","73","72.1311475409836","59.83606557377049"
"textbook question answering","158","97","93","61.39240506329114","58.86075949367089"
"numeric commonsense","144","37","43","25.694444444444443","29.86111111111111"
"arithmetic reasoning","353","139","197","39.376770538243626","55.80736543909348"
"visual question answering","179","94","87","52.513966480446925","48.60335195530726"
"geometry reasoning","239","146","133","61.08786610878661","55.64853556485355"
"algebraic reasoning","281","169","156","60.14234875444839","55.51601423487544"
"geometry problem solving","208","137","119","65.86538461538461","57.21153846153846"
"math word problem","186","54","119","29.03225806451613","63.97849462365591"
"logical reasoning","37","19","9","51.35135135135135","24.324324324324326"
"figure question answering","269","162","169","60.223048327137555","62.825278810408925"
"statistical reasoning","301","167","215","55.48172757475083","71.42857142857143"
torchrun --nproc-per-node=8 run.py --data MathVista_MINI --model InternVL2-8B --verbose
The expected test results are:
"Task&Skill","tot","prefetch","hit","prefetch_rate","acc"
"Overall","1000","549","583","54.900000000000006","58.3"
"scientific reasoning","122","89","72","72.95081967213115","59.01639344262295"
"textbook question answering","158","101","97","63.92405063291139","61.39240506329114"
"numeric commonsense","144","39","44","27.083333333333332","30.555555555555557"
"arithmetic reasoning","353","128","199","36.26062322946176","56.37393767705382"
"visual question answering","179","92","89","51.39664804469274","49.72067039106145"
"geometry reasoning","239","160","144","66.94560669456067","60.25104602510461"
"algebraic reasoning","281","185","168","65.83629893238434","59.7864768683274"
"geometry problem solving","208","150","129","72.11538461538461","62.019230769230774"
"math word problem","186","49","110","26.344086021505376","59.13978494623656"
"logical reasoning","37","16","4","43.24324324324324","10.81081081081081"
"figure question answering","269","157","158","58.36431226765799","58.7360594795539"
"statistical reasoning","301","155","207","51.49501661129568","68.77076411960132"
torchrun --nproc-per-node=8 run.py --data MathVista_MINI --model InternVL2-26B --verbose
The expected test results are:
"Task&Skill","tot","prefetch","hit","prefetch_rate","acc"
"Overall","1000","588","594","58.8","59.4"
"scientific reasoning","122","87","73","71.31147540983606","59.83606557377049"
"textbook question answering","158","98","97","62.0253164556962","61.39240506329114"
"numeric commonsense","144","38","49","26.38888888888889","34.02777777777778"
"arithmetic reasoning","353","157","212","44.47592067988669","60.05665722379604"
"visual question answering","179","91","97","50.83798882681564","54.18994413407822"
"geometry reasoning","239","164","139","68.6192468619247","58.15899581589959"
"algebraic reasoning","281","188","159","66.90391459074732","56.58362989323843"
"geometry problem solving","208","154","121","74.03846153846155","58.17307692307693"
"math word problem","186","76","116","40.86021505376344","62.365591397849464"
"logical reasoning","37","17","3","45.94594594594595","8.108108108108109"
"figure question answering","269","169","163","62.825278810408925","60.594795539033456"
"statistical reasoning","301","168","212","55.81395348837209","70.43189368770764"
torchrun --nproc-per-node=8 run.py --data MathVista_MINI --model InternVL2-40B --verbose
The expected test results are:
"Task&Skill","tot","prefetch","hit","prefetch_rate","acc"
"Overall","1000","552","637","55.2","63.7"
"scientific reasoning","122","90","76","73.77049180327869","62.295081967213115"
"textbook question answering","158","101","99","63.92405063291139","62.65822784810127"
"numeric commonsense","144","34","58","23.61111111111111","40.27777777777778"
"arithmetic reasoning","353","147","229","41.64305949008499","64.87252124645893"
"visual question answering","179","92","103","51.39664804469274","57.54189944134078"
"geometry reasoning","239","155","131","64.85355648535564","54.811715481171554"
"algebraic reasoning","281","180","152","64.05693950177937","54.092526690391466"
"geometry problem solving","208","146","114","70.1923076923077","54.807692307692314"
"math word problem","186","65","135","34.946236559139784","72.58064516129032"
"logical reasoning","37","11","10","29.72972972972973","27.027027027027028"
"figure question answering","269","148","186","55.01858736059479","69.14498141263941"
"statistical reasoning","301","150","233","49.83388704318937","77.40863787375415"
torchrun --nproc-per-node=1 run.py --data MathVista_MINI --model InternVL2-76B --verbose
The expected test results are:
"Task&Skill","tot","prefetch","hit","prefetch_rate","acc"
"Overall","1000","534","655","53.400000000000006","65.5"
"scientific reasoning","122","89","77","72.95081967213115","63.114754098360656"
"textbook question answering","158","100","106","63.29113924050633","67.08860759493672"
"numeric commonsense","144","42","64","29.166666666666668","44.44444444444444"
"arithmetic reasoning","353","154","218","43.626062322946176","61.756373937677054"
"visual question answering","179","95","89","53.072625698324025","49.72067039106145"
"geometry reasoning","239","143","160","59.83263598326359","66.94560669456067"
"algebraic reasoning","281","168","187","59.7864768683274","66.54804270462633"
"geometry problem solving","208","134","142","64.42307692307693","68.26923076923077"
"math word problem","186","73","143","39.247311827956985","76.88172043010752"
"logical reasoning","37","7","6","18.91891891891892","16.216216216216218"
"figure question answering","269","132","175","49.07063197026022","65.05576208178438"
"statistical reasoning","301","139","232","46.179401993355484","77.0764119601329"
HallusionBench#
HallusionBench is a comprehensive benchmark designed to evaluate image-context reasoning in MLLMs, focusing on identifying issues related to language hallucination and visual illusion. The dataset consists of 346 images paired with 1,129 questions crafted by human experts. These questions are divided into two categories: Visual Dependent (VD) and Visual Supplement (VS), allowing the benchmark to assess the nuanced understanding and interpretation of visual data by MLLMs.
torchrun --nproc-per-node=8 run.py --data HallusionBench --model InternVL2-1B --verbose
The expected test results are:
"split","aAcc","fAcc","qAcc"
"Overall","54.363827549947416","23.98843930635838","21.978021978021978"
"VS","58.333333333333336","15.517241379310345","28.651685393258425"
"VD","51.945854483925544","28.26086956521739","17.689530685920577"
"VS_map","56.25","9.090909090909092","12.5"
"VD_illusion","48.61111111111111","25.806451612903224","8.333333333333332"
"VD_figure","58.75","36.58536585365854","23.076923076923077"
"VS_ocr","44.44444444444444","23.076923076923077","3.7037037037037033"
"VD_video","51.76470588235295","14.583333333333334","11.594202898550725"
"VD_ocr","78.65168539325843","58.139534883720934","55.81395348837209"
"VS_chart","66.15384615384615","17.5","47.368421052631575"
"VD_math","29.629629629629626","5.555555555555555","3.7037037037037033"
"VS_table","57.14285714285714","10.714285714285714","23.25581395348837"
result = (54.363827549947416 + 23.98843930635838 + 21.978021978021978) / 3 = 33.4
torchrun --nproc-per-node=8 run.py --data HallusionBench --model InternVL2-2B --verbose
The expected test results are:
"split","aAcc","fAcc","qAcc"
"Overall","58.359621451104104","26.589595375722542","28.79120879120879"
"VS","65.27777777777779","24.137931034482758","41.57303370786517"
"VD","54.145516074450086","27.82608695652174","20.577617328519857"
"VS_chart","70.0","27.500000000000004","59.210526315789465"
"VD_math","38.88888888888889","2.7777777777777777","11.11111111111111"
"VS_table","65.17857142857143","14.285714285714285","37.2093023255814"
"VD_ocr","71.91011235955057","46.51162790697674","44.18604651162791"
"VD_figure","60.0","39.02439024390244","23.076923076923077"
"VD_illusion","57.638888888888886","32.25806451612903","23.61111111111111"
"VD_video","48.8235294117647","14.583333333333334","8.695652173913043"
"VS_map","64.0625","27.27272727272727","28.125"
"VS_ocr","55.55555555555556","26.923076923076923","14.814814814814813"
result = (58.359621451104104 + 26.589595375722542 + 28.79120879120879) / 3 = 37.9
torchrun --nproc-per-node=8 run.py --data HallusionBench --model InternVL2-4B --verbose
The expected test results are:
"split","aAcc","fAcc","qAcc"
"Overall","61.09358569926393","32.369942196531795","32.30769230769231"
"VD","56.17597292724196","30.0","22.743682310469314"
"VS","69.16666666666667","37.06896551724138","47.19101123595505"
"VS_map","56.25","27.27272727272727","15.625"
"VS_ocr","55.55555555555556","38.46153846153847","18.51851851851852"
"VD_ocr","75.28089887640449","51.162790697674424","51.162790697674424"
"VS_table","75.89285714285714","35.714285714285715","55.81395348837209"
"VD_figure","62.5","39.02439024390244","25.64102564102564"
"VD_illusion","55.55555555555556","33.87096774193548","19.444444444444446"
"VD_video","48.8235294117647","8.333333333333332","7.246376811594203"
"VD_math","48.148148148148145","16.666666666666664","22.22222222222222"
"VS_chart","75.38461538461539","42.5","65.78947368421053"
result = (61.09358569926393 + 32.369942196531795 + 32.30769230769231) / 3 = 41.9
torchrun --nproc-per-node=8 run.py --data HallusionBench --model InternVL2-8B --verbose
The expected test results are:
"split","aAcc","fAcc","qAcc"
"Overall","64.03785488958991","35.83815028901734","35.824175824175825"
"VS","69.16666666666667","36.206896551724135","45.50561797752809"
"VD","60.913705583756354","35.65217391304348","29.602888086642597"
"VS_chart","76.15384615384615","42.5","63.1578947368421"
"VD_ocr","74.15730337078652","51.162790697674424","48.837209302325576"
"VD_figure","67.5","53.65853658536586","35.8974358974359"
"VD_video","51.17647058823529","14.583333333333334","11.594202898550725"
"VD_math","55.55555555555556","16.666666666666664","29.629629629629626"
"VD_illusion","64.58333333333334","40.32258064516129","31.944444444444443"
"VS_map","56.25","31.818181818181817","18.75"
"VS_ocr","53.70370370370371","26.923076923076923","11.11111111111111"
"VS_table","75.89285714285714","39.285714285714285","55.81395348837209"
result = (64.03785488958991 + 35.83815028901734 + 35.824175824175825) / 3 = 45.2
torchrun --nproc-per-node=8 run.py --data HallusionBench --model InternVL2-26B --verbose
The expected test results are:
"split","aAcc","fAcc","qAcc"
"Overall","67.2975814931651","43.641618497109825","41.098901098901095"
"VD","63.45177664974619","42.608695652173914","33.935018050541515"
"VS","73.61111111111111","45.689655172413794","52.24719101123596"
"VD_illusion","65.97222222222221","50.0","33.33333333333333"
"VS_chart","80.0","50.0","68.42105263157895"
"VD_ocr","77.52808988764045","58.139534883720934","55.81395348837209"
"VD_figure","72.5","53.65853658536586","43.58974358974359"
"VS_map","54.6875","22.727272727272727","18.75"
"VD_video","54.70588235294118","25.0","17.391304347826086"
"VS_ocr","51.85185185185185","34.61538461538461","14.814814814814813"
"VD_math","55.55555555555556","22.22222222222222","31.48148148148148"
"VS_table","87.5","67.85714285714286","72.09302325581395"
result = (67.2975814931651 + 43.641618497109825 + 41.098901098901095) / 3 = 50.7
torchrun --nproc-per-node=8 run.py --data HallusionBench --model InternVL2-40B --verbose
The expected test results are:
"split","aAcc","fAcc","qAcc"
"Overall","71.39852786540484","51.73410404624278","47.69230769230769"
"VS","78.88888888888889","56.896551724137936","58.98876404494382"
"VD","66.83587140439933","49.130434782608695","40.43321299638989"
"VD_math","62.03703703703704","36.11111111111111","38.88888888888889"
"VD_ocr","80.89887640449437","62.7906976744186","60.46511627906976"
"VD_figure","85.0","78.04878048780488","69.23076923076923"
"VS_chart","84.61538461538461","60.0","76.31578947368422"
"VS_map","62.5","45.45454545454545","25.0"
"VS_ocr","72.22222222222221","53.84615384615385","44.44444444444444"
"VS_table","84.82142857142857","64.28571428571429","62.7906976744186"
"VD_video","52.94117647058824","20.833333333333336","15.942028985507244"
"VD_illusion","68.05555555555556","50.0","37.5"
result = (71.39852786540484 + 51.73410404624278 + 47.69230769230769) / 3 = 56.9
torchrun --nproc-per-node=1 run.py --data HallusionBench --model InternVL2-76B --verbose
The expected test results are:
"split","aAcc","fAcc","qAcc"
"Overall","71.1882229232387","48.26589595375722","46.15384615384615"
"VS","76.38888888888889","53.44827586206896","56.74157303370787"
"VD","68.02030456852792","45.65217391304348","39.35018050541516"
"VD_ocr","80.89887640449437","65.11627906976744","65.11627906976744"
"VS_chart","81.53846153846153","60.0","73.68421052631578"
"VD_video","60.588235294117645","25.0","20.28985507246377"
"VD_math","64.81481481481481","27.77777777777778","37.03703703703704"
"VD_illusion","62.5","40.32258064516129","29.166666666666668"
"VS_ocr","64.81481481481481","42.30769230769231","29.629629629629626"
"VD_figure","83.75","73.17073170731707","66.66666666666666"
"VS_table","82.14285714285714","60.71428571428571","62.7906976744186"
"VS_map","65.625","45.45454545454545","31.25"
result = (71.1882229232387 + 48.26589595375722 + 46.15384615384615) / 3 = 55.2
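The `result` lines above average the three `Overall` metrics (`aAcc`, `fAcc`, `qAcc`) and round to one decimal place. The sketch below reproduces that arithmetic for all seven models, with values copied from the tables in this section. The function name `hallusion_score` is hypothetical.

```python
# (aAcc, fAcc, qAcc) from the "Overall" row of each table above.
OVERALL = {
    "InternVL2-1B": (54.363827549947416, 23.98843930635838, 21.978021978021978),
    "InternVL2-2B": (58.359621451104104, 26.589595375722542, 28.79120879120879),
    "InternVL2-4B": (61.09358569926393, 32.369942196531795, 32.30769230769231),
    "InternVL2-8B": (64.03785488958991, 35.83815028901734, 35.824175824175825),
    "InternVL2-26B": (67.2975814931651, 43.641618497109825, 41.098901098901095),
    "InternVL2-40B": (71.39852786540484, 51.73410404624278, 47.69230769230769),
    "InternVL2-76B": (71.1882229232387, 48.26589595375722, 46.15384615384615),
}


def hallusion_score(a_acc: float, f_acc: float, q_acc: float) -> float:
    """Mean of the three Overall metrics, rounded to one decimal place."""
    return round((a_acc + f_acc + q_acc) / 3, 1)


if __name__ == "__main__":
    for model, metrics in OVERALL.items():
        print(model, hallusion_score(*metrics))
```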
MMStar#
The MMStar dataset is an advanced multimodal benchmark designed to evaluate the capabilities of MLLMs. It comprises 1,500 carefully selected samples that are balanced and purified to ensure they exhibit visual dependency and minimal data leakage. The dataset evaluates models across six core capabilities and 18 detailed axes, focusing on complex multimodal tasks that require advanced reasoning and understanding of visual content.
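In the result tables in this section, the `Overall` value is consistent with an unweighted mean of the six capability scores, which is what one would expect if the samples are split evenly across capabilities. The sketch below checks this for two rows copied from the tables. It is an observation about the reported numbers, not a description of VLMEvalKit internals, and `overall_from_categories` is a hypothetical name.

```python
from statistics import mean

# Column order used in the result tables of this section.
CATEGORIES = [
    "coarse perception",
    "fine-grained perception",
    "instance reasoning",
    "logical reasoning",
    "math",
    "science & technology",
]

# (reported Overall, six capability scores) copied from the tables.
REPORTED = {
    "InternVL2-1B": (0.452, [0.588, 0.368, 0.548, 0.352, 0.46, 0.396]),
    "InternVL2-76B": (0.674, [0.704, 0.568, 0.728, 0.724, 0.752, 0.568]),
}


def overall_from_categories(scores: list[float]) -> float:
    """Unweighted mean of the six capability scores."""
    if len(scores) != len(CATEGORIES):
        raise ValueError("expected one score per capability")
    return mean(scores)


if __name__ == "__main__":
    for model, (overall, scores) in REPORTED.items():
        print(model, round(overall_from_categories(scores), 3), overall)
```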
torchrun --nproc-per-node=8 run.py --data MMStar --model InternVL2-1B --verbose
The expected test results are:
"split","Overall","coarse perception","fine-grained perception","instance reasoning","logical reasoning","math","science & technology"
"none","0.452","0.588","0.368","0.548","0.352","0.46","0.396"
torchrun --nproc-per-node=8 run.py --data MMStar --model InternVL2-2B --verbose
The expected test results are:
"split","Overall","coarse perception","fine-grained perception","instance reasoning","logical reasoning","math","science & technology"
"none","0.5013333333333333","0.644","0.392","0.608","0.44","0.496","0.428"
torchrun --nproc-per-node=8 run.py --data MMStar --model InternVL2-4B --verbose
The expected test results are:
"split","Overall","coarse perception","fine-grained perception","instance reasoning","logical reasoning","math","science & technology"
"none","0.5426666666666666","0.672","0.384","0.624","0.532","0.588","0.456"
torchrun --nproc-per-node=8 run.py --data MMStar --model InternVL2-8B --verbose
The expected test results are:
"split","Overall","coarse perception","fine-grained perception","instance reasoning","logical reasoning","math","science & technology"
"none","0.62","0.704","0.504","0.68","0.656","0.672","0.504"
torchrun --nproc-per-node=8 run.py --data MMStar --model InternVL2-26B --verbose
The expected test results are:
"split","Overall","coarse perception","fine-grained perception","instance reasoning","logical reasoning","math","science & technology"
"none","0.612","0.716","0.544","0.688","0.6","0.624","0.5"
torchrun --nproc-per-node=8 run.py --data MMStar --model InternVL2-40B --verbose
The expected test results are:
"split","Overall","coarse perception","fine-grained perception","instance reasoning","logical reasoning","math","science & technology"
"none","0.654","0.692","0.528","0.716","0.696","0.72","0.572"
torchrun --nproc-per-node=1 run.py --data MMStar --model InternVL2-76B --verbose
The expected test results are:
"split","Overall","coarse perception","fine-grained perception","instance reasoning","logical reasoning","math","science & technology"
"none","0.674","0.704","0.568","0.728","0.724","0.752","0.568"
OCRBench#
OCRBench is a comprehensive evaluation benchmark designed to assess the OCR capabilities of MLLMs. It includes five components: Text Recognition, Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER). The benchmark encompasses data from 29 datasets, making it one of the most thorough OCR evaluation tools available. OCRBench aims to reveal both the strengths and weaknesses of MLLMs, particularly in handling multilingual text, handwritten text, non-semantic text, and mathematical expressions. The benchmark includes 1,000 question-answer pairs, all manually verified for precision.
torchrun --nproc-per-node=8 run.py --data OCRBench --model InternVL2-1B --verbose
The expected test results are:
{
"Text Recognition": 243,
"Scene Text-centric VQA": 165,
"Doc-oriented VQA": 125,
"Key Information Extraction": 149,
"Handwritten Mathematical Expression Recognition": 72,
"Final Score": 754,
"Final Score Norm": 75.4
}
torchrun --nproc-per-node=8 run.py --data OCRBench --model InternVL2-2B --verbose
The expected test results are:
{
"Text Recognition": 246,
"Scene Text-centric VQA": 170,
"Doc-oriented VQA": 133,
"Key Information Extraction": 167,
"Handwritten Mathematical Expression Recognition": 68,
"Final Score": 784,
"Final Score Norm": 78.4
}
torchrun --nproc-per-node=8 run.py --data OCRBench --model InternVL2-4B --verbose
The expected test results are:
{
"Text Recognition": 237,
"Scene Text-centric VQA": 170,
"Doc-oriented VQA": 154,
"Key Information Extraction": 159,
"Handwritten Mathematical Expression Recognition": 68,
"Final Score": 788,
"Final Score Norm": 78.8
}
torchrun --nproc-per-node=8 run.py --data OCRBench --model InternVL2-8B --verbose
The expected test results are:
{
"Text Recognition": 236,
"Scene Text-centric VQA": 175,
"Doc-oriented VQA": 156,
"Key Information Extraction": 162,
"Handwritten Mathematical Expression Recognition": 65,
"Final Score": 794,
"Final Score Norm": 79.4
}
torchrun --nproc-per-node=8 run.py --data OCRBench --model InternVL2-26B --verbose
The expected test results are:
{
"Text Recognition": 250,
"Scene Text-centric VQA": 185,
"Doc-oriented VQA": 154,
"Key Information Extraction": 168,
"Handwritten Mathematical Expression Recognition": 68,
"Final Score": 825,
"Final Score Norm": 82.5
}
torchrun --nproc-per-node=8 run.py --data OCRBench --model InternVL2-40B --verbose
The expected test results are:
{
"Text Recognition": 246,
"Scene Text-centric VQA": 181,
"Doc-oriented VQA": 160,
"Key Information Extraction": 175,
"Handwritten Mathematical Expression Recognition": 75,
"Final Score": 837,
"Final Score Norm": 83.7
}
torchrun --nproc-per-node=1 run.py --data OCRBench --model InternVL2-76B --verbose
The expected test results are:
{
"Text Recognition": 244,
"Scene Text-centric VQA": 182,
"Doc-oriented VQA": 165,
"Key Information Extraction": 176,
"Handwritten Mathematical Expression Recognition": 72,
"Final Score": 839,
"Final Score Norm": 83.9
}
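In each JSON report above, `Final Score` equals the sum of the five component scores, and `Final Score Norm` equals that sum divided by 10, which expresses it as a percentage of the 1,000 question-answer pairs. The sketch below recomputes both for the InternVL2-76B report. The helper name `final_score` is hypothetical.

```python
import json

# The InternVL2-76B report, copied from the output above.
REPORT_76B = """
{
  "Text Recognition": 244,
  "Scene Text-centric VQA": 182,
  "Doc-oriented VQA": 165,
  "Key Information Extraction": 176,
  "Handwritten Mathematical Expression Recognition": 72,
  "Final Score": 839,
  "Final Score Norm": 83.9
}
"""


def final_score(report: dict) -> tuple[int, float]:
    """Sum the component scores and normalise the total to a 0-100 scale."""
    components = [
        value for key, value in report.items()
        if not key.startswith("Final Score")
    ]
    total = sum(components)
    return total, total / 10


if __name__ == "__main__":
    report = json.loads(REPORT_76B)
    print(final_score(report))
```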
MMMU#
The MMMU dataset is a comprehensive benchmark designed to evaluate multimodal models on college-level tasks that require domain-specific knowledge and reasoning. It includes 11,500 questions sourced from college exams, quizzes, and textbooks, spanning six disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions cover 30 subjects and feature 30 types of images, such as charts, diagrams, maps, tables, and more.
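The result tables in this section list the 30 subject columns first and the six discipline columns last. Each discipline value is consistent with the mean of its subjects. For example, the InternVL2-1B `dev` row reports 0.35 for Art & Design, which matches the mean of Art, Art_Theory, Design, and Music. The sketch below shows that check. The grouping of those four subjects under Art & Design is an assumption based on MMMU's subject list, and `discipline_accuracy` is a hypothetical helper.

```python
from statistics import mean

# Assumed grouping: subjects that MMMU places under "Art & Design".
ART_AND_DESIGN = ("Art", "Art_Theory", "Design", "Music")

# Per-subject dev accuracies for InternVL2-1B, copied from the results below.
DEV_1B = {"Art": 0.2, "Art_Theory": 0.4, "Design": 0.4, "Music": 0.4}

# Per-subject validation accuracies for the same model.
VAL_1B = {
    "Art": 0.4666666666666667,
    "Art_Theory": 0.43333333333333335,
    "Design": 0.5333333333333333,
    "Music": 0.26666666666666666,
}


def discipline_accuracy(subject_scores: dict[str, float], subjects: tuple[str, ...]) -> float:
    """Unweighted mean over the subjects that make up one discipline."""
    return mean(subject_scores[name] for name in subjects)


if __name__ == "__main__":
    print(round(discipline_accuracy(DEV_1B, ART_AND_DESIGN), 3))
    print(round(discipline_accuracy(VAL_1B, ART_AND_DESIGN), 3))
```

An unweighted mean only equals the per-question accuracy when every subject contributes the same number of questions, so treat this as a sanity check rather than a definition.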
torchrun --nproc-per-node=8 run.py --data MMMU_DEV_VAL --model InternVL2-1B --verbose
The expected test results are:
"split","Overall","Accounting","Agriculture","Architecture_and_Engineering","Art","Art_Theory","Basic_Medical_Science","Biology","Chemistry","Clinical_Medicine","Computer_Science","Design","Diagnostics_and_Laboratory_Medicine","Economics","Electronics","Energy_and_Power","Finance","Geography","History","Literature","Manage","Marketing","Materials","Math","Mechanical_Engineering","Music","Pharmacy","Physics","Psychology","Public_Health","Sociology","Art & Design","Business","Health & Medicine","Humanities & Social Science","Science","Tech & Engineering"
"dev","0.34","0.2","0.0","0.2","0.2","0.4","0.4","0.0","0.4","0.0","0.2","0.4","0.4","0.2","0.0","0.6","0.6","0.4","0.2","0.6","0.6","0.6","0.2","0.2","0.0","0.4","0.4","0.8","0.6","0.2","0.8","0.35","0.44","0.28","0.55","0.36","0.17142857142857143"
"validation","0.3688888888888889","0.2","0.2","0.23333333333333334","0.4666666666666667","0.43333333333333335","0.4666666666666667","0.3333333333333333","0.4","0.3333333333333333","0.3333333333333333","0.5333333333333333","0.4666666666666667","0.36666666666666664","0.4666666666666667","0.4","0.23333333333333334","0.4","0.43333333333333335","0.7666666666666667","0.43333333333333335","0.43333333333333335","0.4","0.16666666666666666","0.26666666666666666","0.26666666666666666","0.2","0.36666666666666664","0.26666666666666666","0.3","0.5","0.425","0.3333333333333333","0.35333333333333333","0.49166666666666664","0.3333333333333333","0.32857142857142857"
torchrun --nproc-per-node=8 run.py --data MMMU_DEV_VAL --model InternVL2-2B --verbose
The expected test results are:
"split","Overall","Accounting","Agriculture","Architecture_and_Engineering","Art","Art_Theory","Basic_Medical_Science","Biology","Chemistry","Clinical_Medicine","Computer_Science","Design","Diagnostics_and_Laboratory_Medicine","Economics","Electronics","Energy_and_Power","Finance","Geography","History","Literature","Manage","Marketing","Materials","Math","Mechanical_Engineering","Music","Pharmacy","Physics","Psychology","Public_Health","Sociology","Art & Design","Business","Health & Medicine","Humanities & Social Science","Science","Tech & Engineering"
"dev","0.3333333333333333","0.4","0.0","0.0","0.2","0.2","0.6","0.2","0.2","0.2","0.4","0.6","0.2","0.8","0.6","0.2","0.6","0.0","0.4","0.8","0.2","0.2","0.2","0.8","0.8","0.0","0.2","0.2","0.2","0.0","0.6","0.25","0.44","0.24","0.5","0.28","0.3142857142857143"
"validation","0.36333333333333334","0.3333333333333333","0.4","0.26666666666666666","0.43333333333333335","0.36666666666666664","0.43333333333333335","0.23333333333333334","0.3","0.4","0.3","0.4666666666666667","0.36666666666666664","0.36666666666666664","0.5","0.26666666666666666","0.4","0.23333333333333334","0.43333333333333335","0.7666666666666667","0.43333333333333335","0.3333333333333333","0.3","0.4","0.23333333333333334","0.3","0.2","0.26666666666666666","0.36666666666666664","0.36666666666666664","0.43333333333333335","0.39166666666666666","0.37333333333333335","0.35333333333333333","0.5","0.2866666666666667","0.3238095238095238"
torchrun --nproc-per-node=8 run.py --data MMMU_DEV_VAL --model InternVL2-4B --verbose
The expected test results are:
"split","Overall","Accounting","Agriculture","Architecture_and_Engineering","Art","Art_Theory","Basic_Medical_Science","Biology","Chemistry","Clinical_Medicine","Computer_Science","Design","Diagnostics_and_Laboratory_Medicine","Economics","Electronics","Energy_and_Power","Finance","Geography","History","Literature","Manage","Marketing","Materials","Math","Mechanical_Engineering","Music","Pharmacy","Physics","Psychology","Public_Health","Sociology","Art & Design","Business","Health & Medicine","Humanities & Social Science","Science","Tech & Engineering"
"validation","0.47888888888888886","0.43333333333333335","0.5333333333333333","0.3","0.6","0.6","0.43333333333333335","0.36666666666666664","0.36666666666666664","0.3333333333333333","0.4","0.9","0.4666666666666667","0.5666666666666667","0.43333333333333335","0.4666666666666667","0.4","0.36666666666666664","0.5666666666666667","0.8333333333333334","0.5666666666666667","0.43333333333333335","0.36666666666666664","0.3333333333333333","0.26666666666666666","0.3333333333333333","0.43333333333333335","0.3333333333333333","0.6666666666666666","0.5666666666666667","0.7","0.6083333333333333","0.48","0.44666666666666666","0.6916666666666667","0.35333333333333333","0.3952380952380952"
"dev","0.4866666666666667","0.2","0.2","0.4","0.6","0.6","0.8","1.0","0.4","0.0","0.4","0.6","0.2","0.6","0.4","0.4","0.4","0.0","1.0","0.8","0.6","0.6","0.2","0.6","0.6","0.4","0.4","0.2","0.8","0.6","0.6","0.55","0.48","0.4","0.8","0.44","0.37142857142857144"
torchrun --nproc-per-node=8 run.py --data MMMU_DEV_VAL --model InternVL2-8B --verbose
The expected test results are:
"split","Overall","Accounting","Agriculture","Architecture_and_Engineering","Art","Art_Theory","Basic_Medical_Science","Biology","Chemistry","Clinical_Medicine","Computer_Science","Design","Diagnostics_and_Laboratory_Medicine","Economics","Electronics","Energy_and_Power","Finance","Geography","History","Literature","Manage","Marketing","Materials","Math","Mechanical_Engineering","Music","Pharmacy","Physics","Psychology","Public_Health","Sociology","Art & Design","Business","Health & Medicine","Humanities & Social Science","Science","Tech & Engineering"
"dev","0.49333333333333335","0.2","0.2","0.4","0.6","0.8","0.6","1.0","0.2","0.2","0.6","0.6","0.4","0.2","0.6","0.4","0.6","0.0","1.0","1.0","0.6","0.6","0.2","0.6","0.4","0.2","0.6","0.4","0.6","0.4","0.6","0.55","0.44","0.44","0.8","0.44","0.4"
"validation","0.5177777777777778","0.5333333333333333","0.5333333333333333","0.3","0.7","0.7","0.4666666666666667","0.5","0.5","0.7","0.6333333333333333","0.7","0.43333333333333335","0.5333333333333333","0.4666666666666667","0.4","0.3333333333333333","0.4666666666666667","0.7","0.9","0.5333333333333333","0.5333333333333333","0.3333333333333333","0.5","0.4","0.36666666666666664","0.3333333333333333","0.26666666666666666","0.6","0.5666666666666667","0.6","0.6166666666666667","0.49333333333333335","0.5","0.7","0.44666666666666666","0.4380952380952381"
torchrun --nproc-per-node=8 run.py --data MMMU_DEV_VAL --model InternVL2-26B --verbose
The expected test results are:
"split","Overall","Accounting","Agriculture","Architecture_and_Engineering","Art","Art_Theory","Basic_Medical_Science","Biology","Chemistry","Clinical_Medicine","Computer_Science","Design","Diagnostics_and_Laboratory_Medicine","Economics","Electronics","Energy_and_Power","Finance","Geography","History","Literature","Manage","Marketing","Materials","Math","Mechanical_Engineering","Music","Pharmacy","Physics","Psychology","Public_Health","Sociology","Art & Design","Business","Health & Medicine","Humanities & Social Science","Science","Tech & Engineering"
"dev","0.5266666666666666","0.4","0.4","0.2","0.8","0.8","0.6","0.4","0.4","0.0","0.6","0.6","0.2","0.2","0.6","0.4","1.0","0.0","1.0","0.8","0.6","0.6","0.4","0.6","0.8","0.6","0.6","0.4","0.8","0.4","0.6","0.7","0.56","0.36","0.8","0.36","0.4857142857142857"
"validation","0.5122222222222222","0.43333333333333335","0.4666666666666667","0.26666666666666666","0.8","0.8666666666666667","0.5666666666666667","0.5666666666666667","0.3333333333333333","0.5666666666666667","0.4666666666666667","0.8333333333333334","0.36666666666666664","0.4","0.5","0.4666666666666667","0.4","0.5333333333333333","0.7","0.9","0.5666666666666667","0.4666666666666667","0.36666666666666664","0.3333333333333333","0.4","0.3","0.3333333333333333","0.3333333333333333","0.6","0.6","0.6333333333333333","0.7","0.4533333333333333","0.4866666666666667","0.7083333333333334","0.42","0.41904761904761906"
torchrun --nproc-per-node=8 run.py --data MMMU_DEV_VAL --model InternVL2-40B --verbose
The expected test results are:
"split","Overall","Accounting","Agriculture","Architecture_and_Engineering","Art","Art_Theory","Basic_Medical_Science","Biology","Chemistry","Clinical_Medicine","Computer_Science","Design","Diagnostics_and_Laboratory_Medicine","Economics","Electronics","Energy_and_Power","Finance","Geography","History","Literature","Manage","Marketing","Materials","Math","Mechanical_Engineering","Music","Pharmacy","Physics","Psychology","Public_Health","Sociology","Art & Design","Business","Health & Medicine","Humanities & Social Science","Science","Tech & Engineering"
"validation","0.5522222222222222","0.4","0.6","0.36666666666666664","0.7","0.8666666666666667","0.5333333333333333","0.5333333333333333","0.4666666666666667","0.6","0.5666666666666667","0.7333333333333333","0.36666666666666664","0.6","0.4666666666666667","0.4666666666666667","0.43333333333333335","0.5333333333333333","0.7666666666666667","0.8333333333333334","0.4666666666666667","0.5666666666666667","0.3333333333333333","0.43333333333333335","0.36666666666666664","0.3","0.7","0.5333333333333333","0.6333333333333333","0.8","0.6","0.65","0.49333333333333335","0.6","0.7083333333333334","0.5","0.4523809523809524"
"dev","0.54","0.2","0.2","0.4","1.0","0.8","0.8","0.6","0.2","0.4","0.6","0.6","0.4","0.2","0.4","0.4","0.8","0.0","1.0","1.0","0.6","0.6","0.4","0.4","0.8","0.4","0.8","0.4","0.8","0.4","0.6","0.7","0.48","0.56","0.85","0.32","0.45714285714285713"
torchrun --nproc-per-node=1 run.py --data MMMU_DEV_VAL --model InternVL2-76B --verbose
The expected test results are:
"split","Overall","Accounting","Agriculture","Architecture_and_Engineering","Art","Art_Theory","Basic_Medical_Science","Biology","Chemistry","Clinical_Medicine","Computer_Science","Design","Diagnostics_and_Laboratory_Medicine","Economics","Electronics","Energy_and_Power","Finance","Geography","History","Literature","Manage","Marketing","Materials","Math","Mechanical_Engineering","Music","Pharmacy","Physics","Psychology","Public_Health","Sociology","Art & Design","Business","Health & Medicine","Humanities & Social Science","Science","Tech & Engineering"
"validation","0.5822222222222222","0.5","0.6333333333333333","0.4666666666666667","0.7666666666666667","0.9666666666666667","0.5333333333333333","0.5","0.5","0.6666666666666666","0.6333333333333333","0.7666666666666667","0.43333333333333335","0.5333333333333333","0.6","0.4","0.6333333333333333","0.4666666666666667","0.7","0.9","0.7333333333333333","0.6","0.3","0.3","0.4666666666666667","0.3333333333333333","0.5666666666666667","0.5333333333333333","0.7","0.7","0.6333333333333333","0.7083333333333334","0.6","0.58","0.7333333333333333","0.46","0.5"
"dev","0.5666666666666667","0.2","0.2","0.4","0.8","0.8","0.8","1.0","0.2","0.4","0.6","0.6","0.6","0.2","0.4","0.4","1.0","0.0","1.0","1.0","0.8","0.4","0.2","0.6","1.0","0.2","0.6","0.4","0.8","0.6","0.8","0.6","0.52","0.6","0.9","0.44","0.45714285714285713"
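The discipline-level columns at the end of each row (for example "Art & Design") appear to be aggregates of the per-subject columns. The following is a minimal sketch, not the VLMEvalKit implementation. It assumes that "Art & Design" groups the Art, Art_Theory, Design, and Music subjects and that every subject contributes the same number of validation questions. Neither assumption is stated in this document. Under those assumptions, the unweighted mean of the four subject scores matches the "Art & Design" value in the InternVL2-76B validation row above.

```python
from statistics import mean

# Subject scores copied from the InternVL2-76B "validation" row above.
validation_scores = {
    "Art": 0.7666666666666667,
    "Art_Theory": 0.9666666666666667,
    "Design": 0.7666666666666667,
    "Music": 0.3333333333333333,
}

# Assumed grouping of subjects into the "Art & Design" discipline
# (an assumption; the grouping is not given in this document).
ART_AND_DESIGN = ["Art", "Art_Theory", "Design", "Music"]

# "Art & Design" value reported in the same row.
REPORTED_ART_AND_DESIGN = 0.7083333333333334


def discipline_score(scores, subjects):
    """Unweighted mean of subject accuracies; only valid for equally sized subjects."""
    return mean(scores[name] for name in subjects)


art_and_design = discipline_score(validation_scores, ART_AND_DESIGN)
print(round(art_and_design, 4), round(REPORTED_ART_AND_DESIGN, 4))
```

The same check is less reliable on the "dev" rows, where each subject score is a multiple of 0.2 and a single question shifts a subject by 20 points.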
RealWorldQA#
The RealWorldQA dataset is a benchmark designed to evaluate the real-world spatial understanding capabilities of multimodal AI models. It consists of over 700 images, each accompanied by a question and a verifiable answer, focusing on various real-world scenarios, including those captured from vehicles. The dataset tests how well AI models comprehend physical environments and spatial relations when interpreting and analyzing real-world scenes.
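The expected RealWorldQA results listed below are two-column CSV snippets, and small deviations between runs are normal. The following is a minimal sketch of how a reproduced score could be compared against an expected one. The helper names, the 0.01 tolerance, and the "actual" score are all hypothetical and chosen only for illustration; they are not part of VLMEvalKit.

```python
import csv
import io


def read_overall(csv_text):
    """Return the 'Overall' value from the first data row of a results CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    row = next(reader)
    return float(row["Overall"])


def matches_expected(expected_csv, actual_csv, tol=0.01):
    """True when the two 'Overall' scores differ by at most tol.

    The default tolerance is an arbitrary, hypothetical threshold.
    """
    return abs(read_overall(expected_csv) - read_overall(actual_csv)) <= tol


# Expected InternVL2-8B result as listed in this section.
expected = '"split","Overall"\n"none","0.6444444444444445"\n'
# Made-up score standing in for a local run (illustration only).
actual = '"split","Overall"\n"none","0.6431372549019608"\n'

print(matches_expected(expected, actual))
```

In practice, the second argument would be the contents of the result file written by your own run.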
torchrun --nproc-per-node=8 run.py --data RealWorldQA --model InternVL2-1B --verbose
The expected test results are:
"split","Overall"
"none","0.5032679738562091"
torchrun --nproc-per-node=8 run.py --data RealWorldQA --model InternVL2-2B --verbose
The expected test results are:
"split","Overall"
"none","0.5725490196078431"
torchrun --nproc-per-node=8 run.py --data RealWorldQA --model InternVL2-4B --verbose
The expected test results are:
"split","Overall"
"none","0.6065359477124183"
torchrun --nproc-per-node=8 run.py --data RealWorldQA --model InternVL2-8B --verbose
The expected test results are:
"split","Overall"
"none","0.6444444444444445"
torchrun --nproc-per-node=8 run.py --data RealWorldQA --model InternVL2-26B --verbose
The expected test results are:
"split","Overall"
"none","0.6836601307189543"
torchrun --nproc-per-node=8 run.py --data RealWorldQA --model InternVL2-40B --verbose
The expected test results are:
"split","Overall"
"none","0.7176470588235294"
torchrun --nproc-per-node=1 run.py --data RealWorldQA --model InternVL2-76B --verbose
The expected test results are:
"split","Overall"
"none","0.7215686274509804"
MMVet (GPT-4-Turbo)#
The MM-Vet dataset is a comprehensive benchmark designed to evaluate the integrated capabilities of MLLMs. It encompasses six core vision-language (VL) capabilities: recognition, knowledge, optical character recognition (OCR), spatial awareness, language generation, and math. The dataset includes 200 images and 218 questions, each requiring one or more of these capabilities to answer. The evaluation uses an open-ended LLM-based approach, allowing assessment across various answer styles and question types.
torchrun --nproc-per-node=8 run.py --data MMVet --model InternVL2-1B --verbose
The expected test results are:
"Category","tot","acc"
"rec","187","37.27272727272725"
"ocr","108","37.96296296296297"
"know","84","14.76190476190476"
"gen","80","14.624999999999996"
"spat","75","33.733333333333334"
"math","26","22.692307692307693"
"Overall","218","33.25688073394493"
torchrun --nproc-per-node=8 run.py --data MMVet --model InternVL2-2B --verbose
The expected test results are:
"Category","tot","acc"
"rec","187","41.71122994652404"
"ocr","108","44.62962962962963"
"know","84","24.999999999999993"
"gen","80","26.25"
"spat","75","40.800000000000004"
"math","26","30.76923076923077"
"Overall","218","39.541284403669714"
torchrun --nproc-per-node=8 run.py --data MMVet --model InternVL2-4B --verbose
The expected test results are:
"Category","tot","acc"
"rec","187","50.000000000000036"
"ocr","108","58.611111111111114"
"know","84","37.26190476190476"
"gen","80","36.499999999999986"
"spat","75","47.20000000000001"
"math","26","57.30769230769231"
"Overall","218","51.00917431192664"
torchrun --nproc-per-node=8 run.py --data MMVet --model InternVL2-8B --verbose
The expected test results are:
"Category","tot","acc"
"rec","187","51.81818181818184"
"ocr","108","63.42592592592594"
"know","84","36.904761904761905"
"gen","80","35.87499999999999"
"spat","75","61.86666666666667"
"math","26","60.769230769230774"
"Overall","218","54.174311926605526"
torchrun --nproc-per-node=8 run.py --data MMVet --model InternVL2-26B --verbose
The expected test results are:
"Category","tot","acc"
"rec","187","62.67379679144389"
"ocr","108","69.72222222222223"
"know","84","50.119047619047606"
"gen","80","48.62499999999999"
"spat","75","61.066666666666656"
"math","26","61.53846153846154"
"Overall","218","62.1100917431193"
torchrun --nproc-per-node=8 run.py --data MMVet --model InternVL2-40B --verbose
The expected test results are:
"Category","tot","acc"
"rec","187","66.25668449197867"
"ocr","108","70.18518518518522"
"know","84","54.40476190476189"
"gen","80","54.74999999999998"
"spat","75","68.53333333333332"
"math","26","64.23076923076924"
"Overall","218","65.50458715596335"
torchrun --nproc-per-node=1 run.py --data MMVet --model InternVL2-76B --verbose
The expected test results are:
"Category","tot","acc"
"rec","187","65.66844919786104"
"ocr","108","70.09259259259262"
"know","84","58.3333333333333"
"gen","80","58.49999999999997"
"spat","75","60.79999999999999"
"math","26","75.76923076923077"
"Overall","218","65.7339449541285"
Note that the GPT-4 version used for scoring here differs from the one on the official server, so the scores produced by VLMEvalKit will differ slightly from the official ones.
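Because each MM-Vet question can require more than one capability, the per-capability counts in the tables above overlap. The sketch below uses the InternVL2-8B rows (accuracies rounded to two decimals) to illustrate two consequences: the "tot" values add up to more than the 218 questions, and the "Overall" score cannot be recovered as a count-weighted average of the capability rows. The variable names are illustrative only.

```python
# (capability, number of questions, accuracy) from the InternVL2-8B table above,
# with accuracies rounded to two decimals.
rows = [
    ("rec", 187, 51.82),
    ("ocr", 108, 63.43),
    ("know", 84, 36.90),
    ("gen", 80, 35.87),
    ("spat", 75, 61.87),
    ("math", 26, 60.77),
]
total_questions = 218
reported_overall = 54.17

# Questions are counted once per capability they require, so this exceeds 218.
tagged = sum(count for _, count, _ in rows)

# Count-weighted average over capability rows; questions with several
# capabilities are counted several times, so this is NOT the Overall score.
weighted = sum(count * acc for _, count, acc in rows) / tagged

print(tagged, total_questions)
print(round(weighted, 2), reported_overall)
```

When comparing your own run with the expected results, compare the "Overall" row directly rather than recombining the capability rows.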
LLaVA-Bench (GPT-4-Turbo)#
The LLaVA-Bench-in-the-Wild dataset is designed to evaluate the capabilities of MLLMs in handling more complex and diverse visual tasks. It includes a set of 24 images with 60 associated questions, covering a range of indoor and outdoor scenes, memes, paintings, and sketches. Each image is paired with detailed, manually curated descriptions and questions that test the model’s generalizability to novel domains.
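In the result tables below, the three columns appear to be related: the "Relative Score (main)" matches the VLM score expressed as a percentage of the GPT-4 score. This relationship is inferred from the numbers themselves and is not stated in this document, so treat the following as a sketch of a sanity check rather than the scoring code. The function name is hypothetical.

```python
def relative_score(vlm_score, gpt4_score):
    """Model score as a percentage of the GPT-4 reference score (assumed formula)."""
    return 100.0 * vlm_score / gpt4_score


# (split, reported relative score, VLM score, GPT4 score)
# copied from the InternVL2-1B results in this section.
rows = [
    ("overall", 51.6, 39.5, 76.5),
    ("detail", 58.9, 37.3, 63.3),
    ("conv", 43.0, 40.0, 92.9),
    ("complex", 54.9, 40.4, 73.6),
]

for split, reported, vlm, gpt4 in rows:
    recomputed = relative_score(vlm, gpt4)
    # The inputs are already rounded to one decimal, so allow a small gap.
    assert abs(recomputed - reported) < 0.1, split
    print(f"{split}: reported {reported}, recomputed {recomputed:.1f}")
```

A relative score above 100, as in some InternVL2-40B and InternVL2-76B rows, then simply means the judge rated the model's answers higher than the GPT-4 reference answers on that split.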
torchrun --nproc-per-node=8 run.py --data LLaVABench --model InternVL2-1B --verbose
The expected test results are:
"split","Relative Score (main)","VLM Score","GPT4 Score"
"overall","51.6","39.5","76.5"
"detail","58.9","37.3","63.3"
"conv","43.0","40.0","92.9"
"complex","54.9","40.4","73.6"
torchrun --nproc-per-node=8 run.py --data LLaVABench --model InternVL2-2B --verbose
The expected test results are:
"split","Relative Score (main)","VLM Score","GPT4 Score"
"overall","62.5","47.8","76.5"
"detail","61.8","42.0","68.0"
"complex","63.5","46.1","72.5"
"conv","61.7","55.9","90.6"
torchrun --nproc-per-node=8 run.py --data LLaVABench --model InternVL2-4B --verbose
The expected test results are:
"split","Relative Score (main)","VLM Score","GPT4 Score"
"overall","68.2","51.0","74.8"
"conv","62.3","55.3","88.8"
"detail","65.3","42.7","65.3"
"complex","74.0","52.9","71.4"
torchrun --nproc-per-node=8 run.py --data LLaVABench --model InternVL2-8B --verbose
The expected test results are:
"split","Relative Score (main)","VLM Score","GPT4 Score"
"overall","73.2","53.3","72.8"
"complex","86.1","61.8","71.8"
"conv","61.6","54.7","88.8"
"detail","63.5","36.0","56.7"
torchrun --nproc-per-node=8 run.py --data LLaVABench --model InternVL2-26B --verbose
The expected test results are:
"split","Relative Score (main)","VLM Score","GPT4 Score"
"overall","92.3","68.0","73.7"
"detail","85.6","51.3","60.0"
"complex","99.0","73.6","74.3"
"conv","86.8","73.5","84.7"
torchrun --nproc-per-node=8 run.py --data LLaVABench --model InternVL2-40B --verbose
The expected test results are:
"split","Relative Score (main)","VLM Score","GPT4 Score"
"overall","100.5","72.7","72.3"
"detail","90.4","56.7","62.7"
"complex","104.4","76.1","72.9"
"conv","101.5","81.2","80.0"
torchrun --nproc-per-node=1 run.py --data LLaVABench --model InternVL2-76B --verbose
The expected test results are:
"split","Relative Score (main)","VLM Score","GPT4 Score"
"overall","99.3","71.7","72.2"
"detail","92.1","54.7","59.3"
"complex","107.7","79.6","73.9"
"conv","91.2","73.5","80.6"
VideoMME#
The Video-MME dataset is a comprehensive benchmark designed to evaluate the capabilities of MLLMs in video analysis. It is the first benchmark built specifically for this purpose and focuses on high-quality assessment of how well models process sequential visual data.
When testing without subtitles:
torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-1B --verbose --nframe 16
The expected test results are:
{
"short": {
"overall": "0.5289",
"domain": {
"Knowledge": "0.5481",
"Film & Television": "0.6167",
"Sports Competition": "0.4667",
"Artistic Performance": "0.5333",
"Life Record": "0.5143",
"Multilingual": "0.4000"
},
"sub_category": {
"Humanity & History": "0.3333",
"Literature & Art": "0.4000",
"Biology & Medicine": "0.7000",
"Finance & Commerce": "0.6333",
"Astronomy": "0.5667",
"Geography": "0.5333",
"Law": "0.6000",
"Life Tip": "0.5333",
"Technology": "0.6333",
"Animation": "0.6000",
"Movie & TV Show": "0.7333",
"Documentary": "0.5333",
"News Report": "0.6000",
"Esports": "0.3667",
"Basketball": "0.3667",
"Football": "0.5333",
"Athletics": "0.5333",
"Other Sports": "0.5333",
"Stage Play": "0.7333",
"Magic Show": "0.3333",
"Variety Show": "0.6333",
"Acrobatics": "0.4333",
"Handicraft": "0.4667",
"Food": "0.5000",
"Fashion": "0.6333",
"Daily Life": "0.4000",
"Travel": "0.6333",
"Pet & Animal": "0.6667",
"Exercise": "0.3000",
"Multilingual": "0.4000"
},
"task_type": {
"Temporal Perception": "0.6667",
"Spatial Perception": "0.6000",
"Attribute Perception": "0.6721",
"Action Recognition": "0.4427",
"Object Recognition": "0.4821",
"OCR Problems": "0.6316",
"Counting Problem": "0.3040",
"Temporal Reasoning": "0.6154",
"Spatial Reasoning": "0.6667",
"Action Reasoning": "0.6170",
"Object Reasoning": "0.4750",
"Information Synopsis": "0.7073"
}
},
"medium": {
"overall": "0.4144",
"domain": {
"Knowledge": "0.3630",
"Film & Television": "0.5250",
"Sports Competition": "0.3933",
"Artistic Performance": "0.4750",
"Life Record": "0.3952",
"Multilingual": "0.4333"
},
"sub_category": {
"Humanity & History": "0.2000",
"Literature & Art": "0.4000",
"Biology & Medicine": "0.5000",
"Finance & Commerce": "0.4333",
"Astronomy": "0.4333",
"Geography": "0.2333",
"Law": "0.4000",
"Life Tip": "0.4333",
"Technology": "0.2333",
"Animation": "0.3333",
"Movie & TV Show": "0.5333",
"Documentary": "0.6000",
"News Report": "0.6333",
"Esports": "0.5000",
"Basketball": "0.1333",
"Football": "0.4333",
"Athletics": "0.3333",
"Other Sports": "0.5667",
"Stage Play": "0.5667",
"Magic Show": "0.3333",
"Variety Show": "0.5000",
"Acrobatics": "0.5000",
"Handicraft": "0.4667",
"Food": "0.3000",
"Fashion": "0.3667",
"Daily Life": "0.3333",
"Travel": "0.4333",
"Pet & Animal": "0.4000",
"Exercise": "0.4667",
"Multilingual": "0.4333"
},
"task_type": {
"Temporal Perception": "0.3871",
"Spatial Perception": "0.6190",
"Attribute Perception": "0.4110",
"Action Recognition": "0.3613",
"Object Recognition": "0.5000",
"OCR Problems": "0.4706",
"Counting Problem": "0.2526",
"Temporal Reasoning": "0.2740",
"Spatial Reasoning": "0.6667",
"Action Reasoning": "0.3276",
"Object Reasoning": "0.4179",
"Information Synopsis": "0.5897"
}
},
"long": {
"overall": "0.3333",
"domain": {
"Knowledge": "0.3259",
"Film & Television": "0.3250",
"Sports Competition": "0.3000",
"Artistic Performance": "0.3167",
"Life Record": "0.3762",
"Multilingual": "0.3667"
},
"sub_category": {
"Humanity & History": "0.3333",
"Literature & Art": "0.3667",
"Biology & Medicine": "0.3333",
"Finance & Commerce": "0.4667",
"Astronomy": "0.2000",
"Geography": "0.3000",
"Law": "0.2667",
"Life Tip": "0.3000",
"Technology": "0.3667",
"Animation": "0.2000",
"Movie & TV Show": "0.4667",
"Documentary": "0.4000",
"News Report": "0.2333",
"Esports": "0.4000",
"Basketball": "0.3333",
"Football": "0.2333",
"Athletics": "0.1333",
"Other Sports": "0.4000",
"Stage Play": "0.4000",
"Magic Show": "0.2667",
"Variety Show": "0.1333",
"Acrobatics": "0.4667",
"Handicraft": "0.5333",
"Food": "0.4333",
"Fashion": "0.3333",
"Daily Life": "0.3667",
"Travel": "0.2000",
"Pet & Animal": "0.4333",
"Exercise": "0.3333",
"Multilingual": "0.3667"
},
"task_type": {
"Temporal Perception": "0.3333",
"Spatial Perception": "0.0000",
"Attribute Perception": "0.5185",
"Action Recognition": "0.3016",
"Object Recognition": "0.2963",
"OCR Problems": "0.5000",
"Counting Problem": "0.1250",
"Temporal Reasoning": "0.2857",
"Spatial Reasoning": "0.6364",
"Action Reasoning": "0.2556",
"Object Reasoning": "0.3042",
"Information Synopsis": "0.5153"
}
},
"overall": {
"overall": "0.4256",
"domain": {
"Knowledge": "0.4123",
"Film & Television": "0.4889",
"Sports Competition": "0.3867",
"Artistic Performance": "0.4417",
"Life Record": "0.4286",
"Multilingual": "0.4000"
},
"sub_category": {
"Humanity & History": "0.2889",
"Literature & Art": "0.3889",
"Biology & Medicine": "0.5111",
"Finance & Commerce": "0.5111",
"Astronomy": "0.4000",
"Geography": "0.3556",
"Law": "0.4222",
"Life Tip": "0.4222",
"Technology": "0.4111",
"Animation": "0.3778",
"Movie & TV Show": "0.5778",
"Documentary": "0.5111",
"News Report": "0.4889",
"Esports": "0.4222",
"Basketball": "0.2778",
"Football": "0.4000",
"Athletics": "0.3333",
"Other Sports": "0.5000",
"Stage Play": "0.5667",
"Magic Show": "0.3111",
"Variety Show": "0.4222",
"Acrobatics": "0.4667",
"Handicraft": "0.4889",
"Food": "0.4111",
"Fashion": "0.4444",
"Daily Life": "0.3667",
"Travel": "0.4222",
"Pet & Animal": "0.5000",
"Exercise": "0.3667",
"Multilingual": "0.4000"
},
"task_type": {
"Temporal Perception": "0.4727",
"Spatial Perception": "0.5741",
"Attribute Perception": "0.5676",
"Action Recognition": "0.3834",
"Object Recognition": "0.4605",
"OCR Problems": "0.5396",
"Counting Problem": "0.2537",
"Temporal Reasoning": "0.3051",
"Spatial Reasoning": "0.6607",
"Action Reasoning": "0.3298",
"Object Reasoning": "0.3678",
"Information Synopsis": "0.5820"
}
}
}
When testing with subtitles:
torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-1B --verbose --nframe 16 --use-subtitle
The expected test results are:
{
"short": {
"overall": "0.5433",
"domain": {
"Knowledge": "0.5630",
"Film & Television": "0.6000",
"Sports Competition": "0.4933",
"Artistic Performance": "0.5167",
"Life Record": "0.5571",
"Multilingual": "0.4000"
},
"sub_category": {
"Humanity & History": "0.3333",
"Literature & Art": "0.4000",
"Biology & Medicine": "0.7667",
"Finance & Commerce": "0.6000",
"Astronomy": "0.6000",
"Geography": "0.5000",
"Law": "0.6667",
"Life Tip": "0.6000",
"Technology": "0.6000",
"Animation": "0.5667",
"Movie & TV Show": "0.7333",
"Documentary": "0.5000",
"News Report": "0.6000",
"Esports": "0.4333",
"Basketball": "0.4000",
"Football": "0.5000",
"Athletics": "0.5000",
"Other Sports": "0.6333",
"Stage Play": "0.7667",
"Magic Show": "0.3333",
"Variety Show": "0.5333",
"Acrobatics": "0.4333",
"Handicraft": "0.5000",
"Food": "0.6000",
"Fashion": "0.6333",
"Daily Life": "0.4333",
"Travel": "0.7333",
"Pet & Animal": "0.6667",
"Exercise": "0.3333",
"Multilingual": "0.4000"
},
"task_type": {
"Temporal Perception": "0.5556",
"Spatial Perception": "0.5667",
"Attribute Perception": "0.6557",
"Action Recognition": "0.4656",
"Object Recognition": "0.5238",
"OCR Problems": "0.6667",
"Counting Problem": "0.3120",
"Temporal Reasoning": "0.4615",
"Spatial Reasoning": "0.6296",
"Action Reasoning": "0.5957",
"Object Reasoning": "0.5375",
"Information Synopsis": "0.7561"
}
},
"medium": {
"overall": "0.4289",
"domain": {
"Knowledge": "0.4111",
"Film & Television": "0.5250",
"Sports Competition": "0.4000",
"Artistic Performance": "0.4917",
"Life Record": "0.3714",
"Multilingual": "0.5000"
},
"sub_category": {
"Humanity & History": "0.3667",
"Literature & Art": "0.4333",
"Biology & Medicine": "0.5667",
"Finance & Commerce": "0.5000",
"Astronomy": "0.5333",
"Geography": "0.3333",
"Law": "0.3333",
"Life Tip": "0.4000",
"Technology": "0.2333",
"Animation": "0.2667",
"Movie & TV Show": "0.5000",
"Documentary": "0.6333",
"News Report": "0.7000",
"Esports": "0.5000",
"Basketball": "0.1667",
"Football": "0.4333",
"Athletics": "0.3667",
"Other Sports": "0.5333",
"Stage Play": "0.6333",
"Magic Show": "0.4333",
"Variety Show": "0.4333",
"Acrobatics": "0.4667",
"Handicraft": "0.5000",
"Food": "0.3333",
"Fashion": "0.3333",
"Daily Life": "0.3000",
"Travel": "0.4000",
"Pet & Animal": "0.3000",
"Exercise": "0.4333",
"Multilingual": "0.5000"
},
"task_type": {
"Temporal Perception": "0.4194",
"Spatial Perception": "0.6667",
"Attribute Perception": "0.4658",
"Action Recognition": "0.3613",
"Object Recognition": "0.4924",
"OCR Problems": "0.4265",
"Counting Problem": "0.2632",
"Temporal Reasoning": "0.2877",
"Spatial Reasoning": "0.7222",
"Action Reasoning": "0.3276",
"Object Reasoning": "0.4403",
"Information Synopsis": "0.6538"
}
},
"long": {
"overall": "0.3689",
"domain": {
"Knowledge": "0.3852",
"Film & Television": "0.3833",
"Sports Competition": "0.3267",
"Artistic Performance": "0.3417",
"Life Record": "0.3905",
"Multilingual": "0.3333"
},
"sub_category": {
"Humanity & History": "0.2333",
"Literature & Art": "0.4333",
"Biology & Medicine": "0.4333",
"Finance & Commerce": "0.6000",
"Astronomy": "0.2667",
"Geography": "0.2667",
"Law": "0.5000",
"Life Tip": "0.4333",
"Technology": "0.3000",
"Animation": "0.2667",
"Movie & TV Show": "0.4667",
"Documentary": "0.5000",
"News Report": "0.3000",
"Esports": "0.3667",
"Basketball": "0.2667",
"Football": "0.3667",
"Athletics": "0.2000",
"Other Sports": "0.4333",
"Stage Play": "0.4333",
"Magic Show": "0.2333",
"Variety Show": "0.2333",
"Acrobatics": "0.4667",
"Handicraft": "0.4667",
"Food": "0.4333",
"Fashion": "0.3667",
"Daily Life": "0.4000",
"Travel": "0.1667",
"Pet & Animal": "0.5333",
"Exercise": "0.3667",
"Multilingual": "0.3333"
},
"task_type": {
"Temporal Perception": "0.3333",
"Spatial Perception": "0.0000",
"Attribute Perception": "0.5185",
"Action Recognition": "0.3016",
"Object Recognition": "0.3148",
"OCR Problems": "0.2857",
"Counting Problem": "0.1875",
"Temporal Reasoning": "0.2637",
"Spatial Reasoning": "0.5455",
"Action Reasoning": "0.3278",
"Object Reasoning": "0.3667",
"Information Synopsis": "0.5521"
}
},
"overall": {
"overall": "0.4470",
"domain": {
"Knowledge": "0.4531",
"Film & Television": "0.5028",
"Sports Competition": "0.4067",
"Artistic Performance": "0.4500",
"Life Record": "0.4397",
"Multilingual": "0.4111"
},
"sub_category": {
"Humanity & History": "0.3111",
"Literature & Art": "0.4222",
"Biology & Medicine": "0.5889",
"Finance & Commerce": "0.5667",
"Astronomy": "0.4667",
"Geography": "0.3667",
"Law": "0.5000",
"Life Tip": "0.4778",
"Technology": "0.3778",
"Animation": "0.3667",
"Movie & TV Show": "0.5667",
"Documentary": "0.5444",
"News Report": "0.5333",
"Esports": "0.4333",
"Basketball": "0.2778",
"Football": "0.4333",
"Athletics": "0.3556",
"Other Sports": "0.5333",
"Stage Play": "0.6111",
"Magic Show": "0.3333",
"Variety Show": "0.4000",
"Acrobatics": "0.4556",
"Handicraft": "0.4889",
"Food": "0.4556",
"Fashion": "0.4444",
"Daily Life": "0.3778",
"Travel": "0.4333",
"Pet & Animal": "0.5000",
"Exercise": "0.3778",
"Multilingual": "0.4111"
},
"task_type": {
"Temporal Perception": "0.4545",
"Spatial Perception": "0.5741",
"Attribute Perception": "0.5766",
"Action Recognition": "0.3930",
"Object Recognition": "0.4802",
"OCR Problems": "0.5108",
"Counting Problem": "0.2724",
"Temporal Reasoning": "0.2881",
"Spatial Reasoning": "0.6429",
"Action Reasoning": "0.3719",
"Object Reasoning": "0.4185",
"Information Synopsis": "0.6285"
}
}
}
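The Video-MME results are nested JSON with one block per video duration plus a combined "overall" block, and every score is stored as a string. The sketch below shows one way to pull out the headline numbers and compare the two InternVL2-1B runs above. Only the "overall" fields are copied here, and the helper name is hypothetical. The final check assumes the three duration splits are equally sized, which is not stated in this document.

```python
import json

# Trimmed copies of the two InternVL2-1B results above ("overall" fields only).
without_subtitles = json.loads("""
{"short": {"overall": "0.5289"}, "medium": {"overall": "0.4144"},
 "long": {"overall": "0.3333"}, "overall": {"overall": "0.4256"}}
""")
with_subtitles = json.loads("""
{"short": {"overall": "0.5433"}, "medium": {"overall": "0.4289"},
 "long": {"overall": "0.3689"}, "overall": {"overall": "0.4470"}}
""")


def overall_by_duration(results):
    """Map each top-level key (short/medium/long/overall) to its accuracy as a float."""
    return {name: float(block["overall"]) for name, block in results.items()}


base = overall_by_duration(without_subtitles)
subs = overall_by_duration(with_subtitles)

# Print each split with the change obtained by adding subtitles.
for name in base:
    delta = subs[name] - base[name]
    print(f"{name}: {base[name]:.4f} -> {subs[name]:.4f} ({delta:+.4f})")

# Assumption: equally sized splits, so the combined score should be close
# to the unweighted mean of the three duration scores (up to rounding).
mean_of_splits = sum(base[d] for d in ("short", "medium", "long")) / 3
print(round(mean_of_splits, 4), base["overall"])
```

For this model, subtitles help most on the long split, where 16 sampled frames cover the video most sparsely.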
When testing without subtitles:
torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-2B --verbose --nframe 16
The expected test results are:
{
"short": {
"overall": "0.5756",
"domain": {
"Knowledge": "0.5593",
"Film & Television": "0.6417",
"Sports Competition": "0.5800",
"Artistic Performance": "0.5917",
"Life Record": "0.5810",
"Multilingual": "0.3333"
},
"sub_category": {
"Humanity & History": "0.5333",
"Literature & Art": "0.4333",
"Biology & Medicine": "0.6667",
"Finance & Commerce": "0.4667",
"Astronomy": "0.5333",
"Geography": "0.6000",
"Law": "0.5667",
"Life Tip": "0.6667",
"Technology": "0.5667",
"Animation": "0.6000",
"Movie & TV Show": "0.6000",
"Documentary": "0.6000",
"News Report": "0.7667",
"Esports": "0.5667",
"Basketball": "0.4667",
"Football": "0.6333",
"Athletics": "0.5667",
"Other Sports": "0.6667",
"Stage Play": "0.7333",
"Magic Show": "0.4333",
"Variety Show": "0.6667",
"Acrobatics": "0.5333",
"Handicraft": "0.4000",
"Food": "0.6000",
"Fashion": "0.5333",
"Daily Life": "0.6667",
"Travel": "0.6000",
"Pet & Animal": "0.7667",
"Exercise": "0.5000",
"Multilingual": "0.3333"
},
"task_type": {
"Temporal Perception": "0.7222",
"Spatial Perception": "0.7333",
"Attribute Perception": "0.6967",
"Action Recognition": "0.5115",
"Object Recognition": "0.5536",
"OCR Problems": "0.7368",
"Counting Problem": "0.3120",
"Temporal Reasoning": "0.3846",
"Spatial Reasoning": "0.7407",
"Action Reasoning": "0.6809",
"Object Reasoning": "0.5375",
"Information Synopsis": "0.6951"
}
},
"medium": {
"overall": "0.4067",
"domain": {
"Knowledge": "0.3741",
"Film & Television": "0.4917",
"Sports Competition": "0.3333",
"Artistic Performance": "0.5417",
"Life Record": "0.3762",
"Multilingual": "0.4000"
},
"sub_category": {
"Humanity & History": "0.2000",
"Literature & Art": "0.4333",
"Biology & Medicine": "0.4000",
"Finance & Commerce": "0.3667",
"Astronomy": "0.4000",
"Geography": "0.3000",
"Law": "0.5333",
"Life Tip": "0.5000",
"Technology": "0.2333",
"Animation": "0.3000",
"Movie & TV Show": "0.5667",
"Documentary": "0.5000",
"News Report": "0.6000",
"Esports": "0.3333",
"Basketball": "0.2000",
"Football": "0.2667",
"Athletics": "0.5000",
"Other Sports": "0.3667",
"Stage Play": "0.6667",
"Magic Show": "0.5000",
"Variety Show": "0.5000",
"Acrobatics": "0.5000",
"Handicraft": "0.4333",
"Food": "0.2000",
"Fashion": "0.2667",
"Daily Life": "0.3333",
"Travel": "0.4333",
"Pet & Animal": "0.3667",
"Exercise": "0.6000",
"Multilingual": "0.4000"
},
"task_type": {
"Temporal Perception": "0.2903",
"Spatial Perception": "0.5238",
"Attribute Perception": "0.4932",
"Action Recognition": "0.3025",
"Object Recognition": "0.4924",
"OCR Problems": "0.3676",
"Counting Problem": "0.2737",
"Temporal Reasoning": "0.3151",
"Spatial Reasoning": "0.6667",
"Action Reasoning": "0.3966",
"Object Reasoning": "0.4104",
"Information Synopsis": "0.5769"
}
},
"long": {
"overall": "0.3689",
"domain": {
"Knowledge": "0.3444",
"Film & Television": "0.3500",
"Sports Competition": "0.3933",
"Artistic Performance": "0.3417",
"Life Record": "0.4000",
"Multilingual": "0.4333"
},
"sub_category": {
"Humanity & History": "0.3000",
"Literature & Art": "0.4667",
"Biology & Medicine": "0.3667",
"Finance & Commerce": "0.3667",
"Astronomy": "0.2333",
"Geography": "0.2333",
"Law": "0.4667",
"Life Tip": "0.3000",
"Technology": "0.3667",
"Animation": "0.2333",
"Movie & TV Show": "0.4333",
"Documentary": "0.4333",
"News Report": "0.3000",
"Esports": "0.4333",
"Basketball": "0.3000",
"Football": "0.3333",
"Athletics": "0.3667",
"Other Sports": "0.5333",
"Stage Play": "0.3333",
"Magic Show": "0.3667",
"Variety Show": "0.1667",
"Acrobatics": "0.5000",
"Handicraft": "0.5000",
"Food": "0.2000",
"Fashion": "0.3667",
"Daily Life": "0.4000",
"Travel": "0.2667",
"Pet & Animal": "0.6667",
"Exercise": "0.4000",
"Multilingual": "0.4333"
},
"task_type": {
"Temporal Perception": "0.0000",
"Spatial Perception": "0.3333",
"Attribute Perception": "0.3704",
"Action Recognition": "0.3968",
"Object Recognition": "0.4074",
"OCR Problems": "0.3571",
"Counting Problem": "0.2292",
"Temporal Reasoning": "0.3077",
"Spatial Reasoning": "0.5455",
"Action Reasoning": "0.3056",
"Object Reasoning": "0.3375",
"Information Synopsis": "0.5399"
}
},
"overall": {
"overall": "0.4504",
"domain": {
"Knowledge": "0.4259",
"Film & Television": "0.4944",
"Sports Competition": "0.4356",
"Artistic Performance": "0.4917",
"Life Record": "0.4524",
"Multilingual": "0.3889"
},
"sub_category": {
"Humanity & History": "0.3444",
"Literature & Art": "0.4444",
"Biology & Medicine": "0.4778",
"Finance & Commerce": "0.4000",
"Astronomy": "0.3889",
"Geography": "0.3778",
"Law": "0.5222",
"Life Tip": "0.4889",
"Technology": "0.3889",
"Animation": "0.3778",
"Movie & TV Show": "0.5333",
"Documentary": "0.5111",
"News Report": "0.5556",
"Esports": "0.4444",
"Basketball": "0.3222",
"Football": "0.4111",
"Athletics": "0.4778",
"Other Sports": "0.5222",
"Stage Play": "0.5778",
"Magic Show": "0.4333",
"Variety Show": "0.4444",
"Acrobatics": "0.5111",
"Handicraft": "0.4444",
"Food": "0.3333",
"Fashion": "0.3889",
"Daily Life": "0.4667",
"Travel": "0.4333",
"Pet & Animal": "0.6000",
"Exercise": "0.5000",
"Multilingual": "0.3889"
},
"task_type": {
"Temporal Perception": "0.4000",
"Spatial Perception": "0.6296",
"Attribute Perception": "0.5901",
"Action Recognition": "0.4089",
"Object Recognition": "0.5085",
"OCR Problems": "0.5180",
"Counting Problem": "0.2836",
"Temporal Reasoning": "0.3164",
"Spatial Reasoning": "0.6786",
"Action Reasoning": "0.3860",
"Object Reasoning": "0.3943",
"Information Synopsis": "0.5882"
}
}
}
When testing with subtitles:
torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-2B --verbose --nframe 16 --use-subtitle
The expected test results are:
{
"short": {
"overall": "0.5978",
"domain": {
"Knowledge": "0.5926",
"Film & Television": "0.6583",
"Sports Competition": "0.5867",
"Artistic Performance": "0.6083",
"Life Record": "0.5952",
"Multilingual": "0.4333"
},
"sub_category": {
"Humanity & History": "0.4667",
"Literature & Art": "0.5333",
"Biology & Medicine": "0.8000",
"Finance & Commerce": "0.5333",
"Astronomy": "0.5667",
"Geography": "0.6333",
"Law": "0.6000",
"Life Tip": "0.6333",
"Technology": "0.5667",
"Animation": "0.5667",
"Movie & TV Show": "0.6333",
"Documentary": "0.6333",
"News Report": "0.8000",
"Esports": "0.5667",
"Basketball": "0.4333",
"Football": "0.6667",
"Athletics": "0.6333",
"Other Sports": "0.6333",
"Stage Play": "0.7000",
"Magic Show": "0.5000",
"Variety Show": "0.7000",
"Acrobatics": "0.5333",
"Handicraft": "0.4000",
"Food": "0.6667",
"Fashion": "0.5333",
"Daily Life": "0.6667",
"Travel": "0.5667",
"Pet & Animal": "0.7333",
"Exercise": "0.6000",
"Multilingual": "0.4333"
},
"task_type": {
"Temporal Perception": "0.8333",
"Spatial Perception": "0.6333",
"Attribute Perception": "0.7213",
"Action Recognition": "0.5496",
"Object Recognition": "0.5536",
"OCR Problems": "0.7368",
"Counting Problem": "0.3440",
"Temporal Reasoning": "0.3077",
"Spatial Reasoning": "0.8148",
"Action Reasoning": "0.7021",
"Object Reasoning": "0.5500",
"Information Synopsis": "0.7683"
}
},
"medium": {
"overall": "0.4367",
"domain": {
"Knowledge": "0.4444",
"Film & Television": "0.4833",
"Sports Competition": "0.3600",
"Artistic Performance": "0.5833",
"Life Record": "0.3714",
"Multilingual": "0.4333"
},
"sub_category": {
"Humanity & History": "0.3000",
"Literature & Art": "0.5000",
"Biology & Medicine": "0.5333",
"Finance & Commerce": "0.5333",
"Astronomy": "0.4667",
"Geography": "0.3667",
"Law": "0.5000",
"Life Tip": "0.6000",
"Technology": "0.2000",
"Animation": "0.3000",
"Movie & TV Show": "0.5667",
"Documentary": "0.5333",
"News Report": "0.5333",
"Esports": "0.3333",
"Basketball": "0.2333",
"Football": "0.3667",
"Athletics": "0.4667",
"Other Sports": "0.4000",
"Stage Play": "0.6667",
"Magic Show": "0.6000",
"Variety Show": "0.5667",
"Acrobatics": "0.5000",
"Handicraft": "0.5000",
"Food": "0.2000",
"Fashion": "0.3000",
"Daily Life": "0.2667",
"Travel": "0.4333",
"Pet & Animal": "0.3333",
"Exercise": "0.5667",
"Multilingual": "0.4333"
},
"task_type": {
"Temporal Perception": "0.3226",
"Spatial Perception": "0.5238",
"Attribute Perception": "0.5068",
"Action Recognition": "0.3277",
"Object Recognition": "0.4924",
"OCR Problems": "0.4118",
"Counting Problem": "0.3053",
"Temporal Reasoning": "0.3288",
"Spatial Reasoning": "0.6667",
"Action Reasoning": "0.4655",
"Object Reasoning": "0.4478",
"Information Synopsis": "0.6538"
}
},
"long": {
"overall": "0.3856",
"domain": {
"Knowledge": "0.3889",
"Film & Television": "0.3750",
"Sports Competition": "0.3867",
"Artistic Performance": "0.3417",
"Life Record": "0.4048",
"Multilingual": "0.4333"
},
"sub_category": {
"Humanity & History": "0.3000",
"Literature & Art": "0.5000",
"Biology & Medicine": "0.4333",
"Finance & Commerce": "0.5000",
"Astronomy": "0.3000",
"Geography": "0.3000",
"Law": "0.4333",
"Life Tip": "0.3333",
"Technology": "0.4000",
"Animation": "0.2333",
"Movie & TV Show": "0.4667",
"Documentary": "0.4333",
"News Report": "0.3667",
"Esports": "0.4667",
"Basketball": "0.2667",
"Football": "0.3000",
"Athletics": "0.3333",
"Other Sports": "0.5667",
"Stage Play": "0.4000",
"Magic Show": "0.3000",
"Variety Show": "0.2000",
"Acrobatics": "0.4667",
"Handicraft": "0.5000",
"Food": "0.2000",
"Fashion": "0.4000",
"Daily Life": "0.4333",
"Travel": "0.2333",
"Pet & Animal": "0.7000",
"Exercise": "0.3667",
"Multilingual": "0.4333"
},
"task_type": {
"Temporal Perception": "0.0000",
"Spatial Perception": "0.3333",
"Attribute Perception": "0.4444",
"Action Recognition": "0.4603",
"Object Recognition": "0.3519",
"OCR Problems": "0.4286",
"Counting Problem": "0.2292",
"Temporal Reasoning": "0.3187",
"Spatial Reasoning": "0.5455",
"Action Reasoning": "0.3222",
"Object Reasoning": "0.3625",
"Information Synopsis": "0.5460"
}
},
"overall": {
"overall": "0.4733",
"domain": {
"Knowledge": "0.4753",
"Film & Television": "0.5056",
"Sports Competition": "0.4444",
"Artistic Performance": "0.5111",
"Life Record": "0.4571",
"Multilingual": "0.4333"
},
"sub_category": {
"Humanity & History": "0.3556",
"Literature & Art": "0.5111",
"Biology & Medicine": "0.5889",
"Finance & Commerce": "0.5222",
"Astronomy": "0.4444",
"Geography": "0.4333",
"Law": "0.5111",
"Life Tip": "0.5222",
"Technology": "0.3889",
"Animation": "0.3667",
"Movie & TV Show": "0.5556",
"Documentary": "0.5333",
"News Report": "0.5667",
"Esports": "0.4556",
"Basketball": "0.3111",
"Football": "0.4444",
"Athletics": "0.4778",
"Other Sports": "0.5333",
"Stage Play": "0.5889",
"Magic Show": "0.4667",
"Variety Show": "0.4889",
"Acrobatics": "0.5000",
"Handicraft": "0.4667",
"Food": "0.3556",
"Fashion": "0.4111",
"Daily Life": "0.4556",
"Travel": "0.4111",
"Pet & Animal": "0.5889",
"Exercise": "0.5111",
"Multilingual": "0.4333"
},
"task_type": {
"Temporal Perception": "0.4545",
"Spatial Perception": "0.5741",
"Attribute Perception": "0.6171",
"Action Recognition": "0.4473",
"Object Recognition": "0.5000",
"OCR Problems": "0.5468",
"Counting Problem": "0.3097",
"Temporal Reasoning": "0.3220",
"Spatial Reasoning": "0.7143",
"Action Reasoning": "0.4140",
"Object Reasoning": "0.4207",
"Information Synopsis": "0.6285"
}
}
}
When testing InternVL2-4B without subtitles:
torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-4B --verbose --nframe 16
The expected test results are:
{
"short": {
"overall": "0.6289",
"domain": {
"Knowledge": "0.6519",
"Film & Television": "0.7000",
"Sports Competition": "0.5800",
"Artistic Performance": "0.6417",
"Life Record": "0.6095",
"Multilingual": "0.4667"
},
"sub_category": {
"Humanity & History": "0.5333",
"Literature & Art": "0.6000",
"Biology & Medicine": "0.7667",
"Finance & Commerce": "0.6000",
"Astronomy": "0.6333",
"Geography": "0.5667",
"Law": "0.7333",
"Life Tip": "0.7667",
"Technology": "0.6667",
"Animation": "0.6000",
"Movie & TV Show": "0.6667",
"Documentary": "0.6333",
"News Report": "0.9000",
"Esports": "0.5333",
"Basketball": "0.4667",
"Football": "0.6667",
"Athletics": "0.6333",
"Other Sports": "0.6000",
"Stage Play": "0.8000",
"Magic Show": "0.6000",
"Variety Show": "0.5667",
"Acrobatics": "0.6000",
"Handicraft": "0.5667",
"Food": "0.5667",
"Fashion": "0.5333",
"Daily Life": "0.6000",
"Travel": "0.7000",
"Pet & Animal": "0.7667",
"Exercise": "0.5333",
"Multilingual": "0.4667"
},
"task_type": {
"Temporal Perception": "0.8889",
"Spatial Perception": "0.6333",
"Attribute Perception": "0.7459",
"Action Recognition": "0.6183",
"Object Recognition": "0.6369",
"OCR Problems": "0.6140",
"Counting Problem": "0.3200",
"Temporal Reasoning": "0.4615",
"Spatial Reasoning": "0.7778",
"Action Reasoning": "0.7021",
"Object Reasoning": "0.6250",
"Information Synopsis": "0.8171"
}
},
"medium": {
"overall": "0.4678",
"domain": {
"Knowledge": "0.4704",
"Film & Television": "0.5083",
"Sports Competition": "0.4133",
"Artistic Performance": "0.5333",
"Life Record": "0.4381",
"Multilingual": "0.5000"
},
"sub_category": {
"Humanity & History": "0.2667",
"Literature & Art": "0.6000",
"Biology & Medicine": "0.5333",
"Finance & Commerce": "0.5333",
"Astronomy": "0.5000",
"Geography": "0.4000",
"Law": "0.5000",
"Life Tip": "0.5333",
"Technology": "0.3667",
"Animation": "0.2333",
"Movie & TV Show": "0.6333",
"Documentary": "0.6000",
"News Report": "0.5667",
"Esports": "0.3667",
"Basketball": "0.3667",
"Football": "0.4333",
"Athletics": "0.4333",
"Other Sports": "0.4667",
"Stage Play": "0.6000",
"Magic Show": "0.4000",
"Variety Show": "0.5000",
"Acrobatics": "0.6333",
"Handicraft": "0.7000",
"Food": "0.3667",
"Fashion": "0.3333",
"Daily Life": "0.3000",
"Travel": "0.4333",
"Pet & Animal": "0.3667",
"Exercise": "0.5667",
"Multilingual": "0.5000"
},
"task_type": {
"Temporal Perception": "0.4839",
"Spatial Perception": "0.4762",
"Attribute Perception": "0.5205",
"Action Recognition": "0.3866",
"Object Recognition": "0.5530",
"OCR Problems": "0.4559",
"Counting Problem": "0.3053",
"Temporal Reasoning": "0.3014",
"Spatial Reasoning": "0.7222",
"Action Reasoning": "0.5172",
"Object Reasoning": "0.4925",
"Information Synopsis": "0.6154"
}
},
"long": {
"overall": "0.4467",
"domain": {
"Knowledge": "0.4815",
"Film & Television": "0.4333",
"Sports Competition": "0.4267",
"Artistic Performance": "0.4250",
"Life Record": "0.4333",
"Multilingual": "0.4667"
},
"sub_category": {
"Humanity & History": "0.3333",
"Literature & Art": "0.5000",
"Biology & Medicine": "0.5000",
"Finance & Commerce": "0.5333",
"Astronomy": "0.5333",
"Geography": "0.3333",
"Law": "0.5000",
"Life Tip": "0.5333",
"Technology": "0.5667",
"Animation": "0.2667",
"Movie & TV Show": "0.5333",
"Documentary": "0.5000",
"News Report": "0.4333",
"Esports": "0.4667",
"Basketball": "0.4000",
"Football": "0.4333",
"Athletics": "0.3667",
"Other Sports": "0.4667",
"Stage Play": "0.6000",
"Magic Show": "0.4333",
"Variety Show": "0.2667",
"Acrobatics": "0.4000",
"Handicraft": "0.5000",
"Food": "0.3000",
"Fashion": "0.4667",
"Daily Life": "0.3000",
"Travel": "0.3000",
"Pet & Animal": "0.6667",
"Exercise": "0.5000",
"Multilingual": "0.4667"
},
"task_type": {
"Temporal Perception": "0.5000",
"Spatial Perception": "0.6667",
"Attribute Perception": "0.5185",
"Action Recognition": "0.3810",
"Object Recognition": "0.4815",
"OCR Problems": "0.3571",
"Counting Problem": "0.2708",
"Temporal Reasoning": "0.2637",
"Spatial Reasoning": "0.5455",
"Action Reasoning": "0.4556",
"Object Reasoning": "0.4500",
"Information Synopsis": "0.5828"
}
},
"overall": {
"overall": "0.5144",
"domain": {
"Knowledge": "0.5346",
"Film & Television": "0.5472",
"Sports Competition": "0.4733",
"Artistic Performance": "0.5333",
"Life Record": "0.4937",
"Multilingual": "0.4778"
},
"sub_category": {
"Humanity & History": "0.3778",
"Literature & Art": "0.5667",
"Biology & Medicine": "0.6000",
"Finance & Commerce": "0.5556",
"Astronomy": "0.5556",
"Geography": "0.4333",
"Law": "0.5778",
"Life Tip": "0.6111",
"Technology": "0.5333",
"Animation": "0.3667",
"Movie & TV Show": "0.6111",
"Documentary": "0.5778",
"News Report": "0.6333",
"Esports": "0.4556",
"Basketball": "0.4111",
"Football": "0.5111",
"Athletics": "0.4778",
"Other Sports": "0.5111",
"Stage Play": "0.6667",
"Magic Show": "0.4778",
"Variety Show": "0.4444",
"Acrobatics": "0.5444",
"Handicraft": "0.5889",
"Food": "0.4111",
"Fashion": "0.4444",
"Daily Life": "0.4000",
"Travel": "0.4778",
"Pet & Animal": "0.6000",
"Exercise": "0.5333",
"Multilingual": "0.4778"
},
"task_type": {
"Temporal Perception": "0.6182",
"Spatial Perception": "0.5741",
"Attribute Perception": "0.6441",
"Action Recognition": "0.4824",
"Object Recognition": "0.5819",
"OCR Problems": "0.5108",
"Counting Problem": "0.3060",
"Temporal Reasoning": "0.2938",
"Spatial Reasoning": "0.7143",
"Action Reasoning": "0.5088",
"Object Reasoning": "0.4934",
"Information Synopsis": "0.6502"
}
}
}
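Every score in the report is stored as a string, so it usually needs converting before it can be sorted or compared. The following is a minimal sketch, assuming the report has already been loaded as a Python dict. `REPORT_JSON` and `flatten_scores` are illustrative names, not part of VLMEvalKit, and only an excerpt of the report above is embedded:

```python
import json

# Excerpt of the report above (InternVL2-4B, no subtitles); values copied verbatim.
REPORT_JSON = """
{
  "overall": {
    "overall": "0.5144",
    "domain": {
      "Knowledge": "0.5346",
      "Film & Television": "0.5472",
      "Sports Competition": "0.4733",
      "Artistic Performance": "0.5333",
      "Life Record": "0.4937",
      "Multilingual": "0.4778"
    }
  }
}
"""


def flatten_scores(report):
    """Turn the nested report into (duration, group, name, score) rows with float scores."""
    rows = []
    for duration, block in report.items():
        for group, value in block.items():
            if isinstance(value, dict):
                for name, score in value.items():
                    rows.append((duration, group, name, float(score)))
            else:
                # The per-duration "overall" entry is a bare string, not a dict.
                rows.append((duration, group, group, float(value)))
    return rows


rows = flatten_scores(json.loads(REPORT_JSON))
# Lowest-scoring domain in the excerpt.
weakest = min((r for r in rows if r[1] == "domain"), key=lambda r: r[3])
print(weakest)
```

The same function works on the full report, since `sub_category` and `task_type` have the same shape as `domain`.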
When testing with subtitles:
torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-4B --verbose --nframe 16 --use-subtitle
The expected test results are:
{
"short": {
"overall": "0.6511",
"domain": {
"Knowledge": "0.6852",
"Film & Television": "0.7083",
"Sports Competition": "0.5933",
"Artistic Performance": "0.6750",
"Life Record": "0.6286",
"Multilingual": "0.4667"
},
"sub_category": {
"Humanity & History": "0.5333",
"Literature & Art": "0.5667",
"Biology & Medicine": "0.8333",
"Finance & Commerce": "0.6667",
"Astronomy": "0.7000",
"Geography": "0.6333",
"Law": "0.7667",
"Life Tip": "0.7667",
"Technology": "0.7000",
"Animation": "0.4667",
"Movie & TV Show": "0.7333",
"Documentary": "0.7000",
"News Report": "0.9333",
"Esports": "0.5000",
"Basketball": "0.5000",
"Football": "0.6333",
"Athletics": "0.7000",
"Other Sports": "0.6333",
"Stage Play": "0.7667",
"Magic Show": "0.7000",
"Variety Show": "0.5667",
"Acrobatics": "0.6667",
"Handicraft": "0.6333",
"Food": "0.6000",
"Fashion": "0.5333",
"Daily Life": "0.6667",
"Travel": "0.7000",
"Pet & Animal": "0.7333",
"Exercise": "0.5333",
"Multilingual": "0.4667"
},
"task_type": {
"Temporal Perception": "0.8333",
"Spatial Perception": "0.6667",
"Attribute Perception": "0.7787",
"Action Recognition": "0.6260",
"Object Recognition": "0.6429",
"OCR Problems": "0.6667",
"Counting Problem": "0.3360",
"Temporal Reasoning": "0.6154",
"Spatial Reasoning": "0.8148",
"Action Reasoning": "0.7234",
"Object Reasoning": "0.6375",
"Information Synopsis": "0.8659"
}
},
"medium": {
"overall": "0.4878",
"domain": {
"Knowledge": "0.5148",
"Film & Television": "0.5417",
"Sports Competition": "0.4067",
"Artistic Performance": "0.5417",
"Life Record": "0.4619",
"Multilingual": "0.4000"
},
"sub_category": {
"Humanity & History": "0.3667",
"Literature & Art": "0.5667",
"Biology & Medicine": "0.5667",
"Finance & Commerce": "0.5667",
"Astronomy": "0.7000",
"Geography": "0.3667",
"Law": "0.6000",
"Life Tip": "0.4667",
"Technology": "0.4333",
"Animation": "0.2667",
"Movie & TV Show": "0.6667",
"Documentary": "0.5667",
"News Report": "0.6667",
"Esports": "0.4667",
"Basketball": "0.2333",
"Football": "0.4333",
"Athletics": "0.4333",
"Other Sports": "0.4667",
"Stage Play": "0.6333",
"Magic Show": "0.4333",
"Variety Show": "0.5000",
"Acrobatics": "0.6000",
"Handicraft": "0.7000",
"Food": "0.3333",
"Fashion": "0.3667",
"Daily Life": "0.3667",
"Travel": "0.5000",
"Pet & Animal": "0.4000",
"Exercise": "0.5667",
"Multilingual": "0.4000"
},
"task_type": {
"Temporal Perception": "0.4194",
"Spatial Perception": "0.4286",
"Attribute Perception": "0.5479",
"Action Recognition": "0.3950",
"Object Recognition": "0.5606",
"OCR Problems": "0.4559",
"Counting Problem": "0.3474",
"Temporal Reasoning": "0.2877",
"Spatial Reasoning": "0.8333",
"Action Reasoning": "0.4655",
"Object Reasoning": "0.5522",
"Information Synopsis": "0.7051"
}
},
"long": {
"overall": "0.4622",
"domain": {
"Knowledge": "0.4889",
"Film & Television": "0.4750",
"Sports Competition": "0.4267",
"Artistic Performance": "0.4500",
"Life Record": "0.4476",
"Multilingual": "0.5000"
},
"sub_category": {
"Humanity & History": "0.2667",
"Literature & Art": "0.5667",
"Biology & Medicine": "0.5333",
"Finance & Commerce": "0.6000",
"Astronomy": "0.5333",
"Geography": "0.3333",
"Law": "0.6000",
"Life Tip": "0.5000",
"Technology": "0.4667",
"Animation": "0.3333",
"Movie & TV Show": "0.5000",
"Documentary": "0.6000",
"News Report": "0.4667",
"Esports": "0.5000",
"Basketball": "0.4000",
"Football": "0.5333",
"Athletics": "0.3000",
"Other Sports": "0.4000",
"Stage Play": "0.7333",
"Magic Show": "0.4333",
"Variety Show": "0.2333",
"Acrobatics": "0.4000",
"Handicraft": "0.5667",
"Food": "0.2667",
"Fashion": "0.4667",
"Daily Life": "0.3333",
"Travel": "0.3000",
"Pet & Animal": "0.7000",
"Exercise": "0.5000",
"Multilingual": "0.5000"
},
"task_type": {
"Temporal Perception": "0.3333",
"Spatial Perception": "0.3333",
"Attribute Perception": "0.5185",
"Action Recognition": "0.4444",
"Object Recognition": "0.4815",
"OCR Problems": "0.2857",
"Counting Problem": "0.2708",
"Temporal Reasoning": "0.2418",
"Spatial Reasoning": "0.5455",
"Action Reasoning": "0.4444",
"Object Reasoning": "0.4708",
"Information Synopsis": "0.6564"
}
},
"overall": {
"overall": "0.5337",
"domain": {
"Knowledge": "0.5630",
"Film & Television": "0.5750",
"Sports Competition": "0.4756",
"Artistic Performance": "0.5556",
"Life Record": "0.5127",
"Multilingual": "0.4556"
},
"sub_category": {
"Humanity & History": "0.3889",
"Literature & Art": "0.5667",
"Biology & Medicine": "0.6444",
"Finance & Commerce": "0.6111",
"Astronomy": "0.6444",
"Geography": "0.4444",
"Law": "0.6556",
"Life Tip": "0.5778",
"Technology": "0.5333",
"Animation": "0.3556",
"Movie & TV Show": "0.6333",
"Documentary": "0.6222",
"News Report": "0.6889",
"Esports": "0.4889",
"Basketball": "0.3778",
"Football": "0.5333",
"Athletics": "0.4778",
"Other Sports": "0.5000",
"Stage Play": "0.7111",
"Magic Show": "0.5222",
"Variety Show": "0.4333",
"Acrobatics": "0.5556",
"Handicraft": "0.6333",
"Food": "0.4000",
"Fashion": "0.4556",
"Daily Life": "0.4556",
"Travel": "0.5000",
"Pet & Animal": "0.6111",
"Exercise": "0.5333",
"Multilingual": "0.4556"
},
"task_type": {
"Temporal Perception": "0.5455",
"Spatial Perception": "0.5556",
"Attribute Perception": "0.6712",
"Action Recognition": "0.5016",
"Object Recognition": "0.5876",
"OCR Problems": "0.5252",
"Counting Problem": "0.3284",
"Temporal Reasoning": "0.2881",
"Spatial Reasoning": "0.7679",
"Action Reasoning": "0.4947",
"Object Reasoning": "0.5242",
"Information Synopsis": "0.7214"
}
}
}
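To see how much the `--use-subtitle` flag changes the result, the per-duration `overall` scores of the two InternVL2-4B reports above can be subtracted. This is a small sketch with the values copied by hand. The helper name `subtitle_gain` is an assumption made for illustration:

```python
# "overall" scores per duration, copied from the two InternVL2-4B reports above.
NO_SUBTITLES = {"short": "0.6289", "medium": "0.4678", "long": "0.4467", "overall": "0.5144"}
WITH_SUBTITLES = {"short": "0.6511", "medium": "0.4878", "long": "0.4622", "overall": "0.5337"}


def subtitle_gain(without, with_subs):
    """Return the absolute accuracy change per duration when subtitles are added."""
    return {
        duration: round(float(with_subs[duration]) - float(without[duration]), 4)
        for duration in without
    }


gains = subtitle_gain(NO_SUBTITLES, WITH_SUBTITLES)
for duration, gain in gains.items():
    print(f"{duration:<8}{gain:+.4f}")
```

For this model the change is positive for every duration, at roughly two points or less.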
When testing InternVL2-8B without subtitles:
torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-8B --verbose --nframe 16
The expected test results are:
{
"short": {
"overall": "0.6567",
"domain": {
"Knowledge": "0.6704",
"Film & Television": "0.7083",
"Sports Competition": "0.5933",
"Artistic Performance": "0.7000",
"Life Record": "0.6619",
"Multilingual": "0.4333"
},
"sub_category": {
"Humanity & History": "0.6000",
"Literature & Art": "0.6000",
"Biology & Medicine": "0.7667",
"Finance & Commerce": "0.7000",
"Astronomy": "0.6000",
"Geography": "0.7000",
"Law": "0.7000",
"Life Tip": "0.7000",
"Technology": "0.6667",
"Animation": "0.8000",
"Movie & TV Show": "0.6000",
"Documentary": "0.6333",
"News Report": "0.8000",
"Esports": "0.5333",
"Basketball": "0.3667",
"Football": "0.7000",
"Athletics": "0.7333",
"Other Sports": "0.6333",
"Stage Play": "0.8333",
"Magic Show": "0.6667",
"Variety Show": "0.6333",
"Acrobatics": "0.6667",
"Handicraft": "0.7000",
"Food": "0.6667",
"Fashion": "0.5333",
"Daily Life": "0.6667",
"Travel": "0.7667",
"Pet & Animal": "0.7667",
"Exercise": "0.5333",
"Multilingual": "0.4333"
},
"task_type": {
"Temporal Perception": "0.7222",
"Spatial Perception": "0.7667",
"Attribute Perception": "0.7623",
"Action Recognition": "0.5954",
"Object Recognition": "0.6845",
"OCR Problems": "0.7719",
"Counting Problem": "0.4080",
"Temporal Reasoning": "0.6154",
"Spatial Reasoning": "0.8148",
"Action Reasoning": "0.6596",
"Object Reasoning": "0.6250",
"Information Synopsis": "0.7683"
}
},
"medium": {
"overall": "0.5044",
"domain": {
"Knowledge": "0.5148",
"Film & Television": "0.5750",
"Sports Competition": "0.4533",
"Artistic Performance": "0.5917",
"Life Record": "0.4429",
"Multilingual": "0.4667"
},
"sub_category": {
"Humanity & History": "0.4333",
"Literature & Art": "0.6333",
"Biology & Medicine": "0.5667",
"Finance & Commerce": "0.6000",
"Astronomy": "0.4333",
"Geography": "0.3333",
"Law": "0.5667",
"Life Tip": "0.6333",
"Technology": "0.4333",
"Animation": "0.4000",
"Movie & TV Show": "0.6667",
"Documentary": "0.5667",
"News Report": "0.6667",
"Esports": "0.5667",
"Basketball": "0.2667",
"Football": "0.4667",
"Athletics": "0.4333",
"Other Sports": "0.5333",
"Stage Play": "0.8000",
"Magic Show": "0.4667",
"Variety Show": "0.5667",
"Acrobatics": "0.5333",
"Handicraft": "0.5667",
"Food": "0.4000",
"Fashion": "0.5000",
"Daily Life": "0.3333",
"Travel": "0.4333",
"Pet & Animal": "0.3667",
"Exercise": "0.5000",
"Multilingual": "0.4667"
},
"task_type": {
"Temporal Perception": "0.4516",
"Spatial Perception": "0.5714",
"Attribute Perception": "0.4932",
"Action Recognition": "0.3782",
"Object Recognition": "0.6212",
"OCR Problems": "0.4706",
"Counting Problem": "0.3053",
"Temporal Reasoning": "0.3836",
"Spatial Reasoning": "0.6111",
"Action Reasoning": "0.5172",
"Object Reasoning": "0.5970",
"Information Synopsis": "0.7051"
}
},
"long": {
"overall": "0.4589",
"domain": {
"Knowledge": "0.5037",
"Film & Television": "0.4500",
"Sports Competition": "0.4733",
"Artistic Performance": "0.4417",
"Life Record": "0.4048",
"Multilingual": "0.4667"
},
"sub_category": {
"Humanity & History": "0.5333",
"Literature & Art": "0.5000",
"Biology & Medicine": "0.6000",
"Finance & Commerce": "0.5000",
"Astronomy": "0.5000",
"Geography": "0.3667",
"Law": "0.5333",
"Life Tip": "0.5667",
"Technology": "0.4333",
"Animation": "0.2667",
"Movie & TV Show": "0.5667",
"Documentary": "0.4667",
"News Report": "0.5000",
"Esports": "0.5000",
"Basketball": "0.3667",
"Football": "0.5000",
"Athletics": "0.5000",
"Other Sports": "0.5000",
"Stage Play": "0.6333",
"Magic Show": "0.3333",
"Variety Show": "0.3000",
"Acrobatics": "0.5000",
"Handicraft": "0.4667",
"Food": "0.2667",
"Fashion": "0.4000",
"Daily Life": "0.3333",
"Travel": "0.3667",
"Pet & Animal": "0.6333",
"Exercise": "0.3667",
"Multilingual": "0.4667"
},
"task_type": {
"Temporal Perception": "0.1667",
"Spatial Perception": "0.0000",
"Attribute Perception": "0.6296",
"Action Recognition": "0.4127",
"Object Recognition": "0.5000",
"OCR Problems": "0.5000",
"Counting Problem": "0.3542",
"Temporal Reasoning": "0.3297",
"Spatial Reasoning": "0.6364",
"Action Reasoning": "0.4000",
"Object Reasoning": "0.4625",
"Information Synopsis": "0.6012"
}
},
"overall": {
"overall": "0.5400",
"domain": {
"Knowledge": "0.5630",
"Film & Television": "0.5778",
"Sports Competition": "0.5067",
"Artistic Performance": "0.5778",
"Life Record": "0.5032",
"Multilingual": "0.4556"
},
"sub_category": {
"Humanity & History": "0.5222",
"Literature & Art": "0.5778",
"Biology & Medicine": "0.6444",
"Finance & Commerce": "0.6000",
"Astronomy": "0.5111",
"Geography": "0.4667",
"Law": "0.6000",
"Life Tip": "0.6333",
"Technology": "0.5111",
"Animation": "0.4889",
"Movie & TV Show": "0.6111",
"Documentary": "0.5556",
"News Report": "0.6556",
"Esports": "0.5333",
"Basketball": "0.3333",
"Football": "0.5556",
"Athletics": "0.5556",
"Other Sports": "0.5556",
"Stage Play": "0.7556",
"Magic Show": "0.4889",
"Variety Show": "0.5000",
"Acrobatics": "0.5667",
"Handicraft": "0.5778",
"Food": "0.4444",
"Fashion": "0.4778",
"Daily Life": "0.4444",
"Travel": "0.5222",
"Pet & Animal": "0.5889",
"Exercise": "0.4667",
"Multilingual": "0.4556"
},
"task_type": {
"Temporal Perception": "0.5091",
"Spatial Perception": "0.6481",
"Attribute Perception": "0.6577",
"Action Recognition": "0.4760",
"Object Recognition": "0.6328",
"OCR Problems": "0.5971",
"Counting Problem": "0.3619",
"Temporal Reasoning": "0.3729",
"Spatial Reasoning": "0.7143",
"Action Reasoning": "0.4667",
"Object Reasoning": "0.5308",
"Information Synopsis": "0.6687"
}
}
}
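A quick sanity check on a reproduced report is to confirm that the top-level `overall` score agrees with the three duration splits. In the figures above it matches their unweighted mean. The sketch below assumes the three splits carry equal weight, which the source does not state. It is an observation about the published numbers, not a description of how VLMEvalKit computes them:

```python
# "overall" scores copied from the InternVL2-8B report above (no subtitles).
REPORTED = {"short": "0.6567", "medium": "0.5044", "long": "0.4589", "overall": "0.5400"}


def mean_of_durations(scores):
    """Unweighted mean of the short/medium/long scores."""
    parts = [float(scores[d]) for d in ("short", "medium", "long")]
    return sum(parts) / len(parts)


recomputed = mean_of_durations(REPORTED)
# Each input is rounded to 4 decimals, so allow a small tolerance.
consistent = abs(recomputed - float(REPORTED["overall"])) < 1e-4
print(f"recomputed={recomputed:.4f} reported={REPORTED['overall']} consistent={consistent}")
```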
When testing with subtitles:
torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-8B --verbose --nframe 16 --use-subtitle
The expected test results are:
{
"short": {
"overall": "0.6900",
"domain": {
"Knowledge": "0.7148",
"Film & Television": "0.7500",
"Sports Competition": "0.5933",
"Artistic Performance": "0.7250",
"Life Record": "0.7000",
"Multilingual": "0.5000"
},
"sub_category": {
"Humanity & History": "0.5667",
"Literature & Art": "0.6333",
"Biology & Medicine": "0.8333",
"Finance & Commerce": "0.8333",
"Astronomy": "0.6667",
"Geography": "0.7000",
"Law": "0.7000",
"Life Tip": "0.8000",
"Technology": "0.7000",
"Animation": "0.7667",
"Movie & TV Show": "0.6667",
"Documentary": "0.6667",
"News Report": "0.9000",
"Esports": "0.5333",
"Basketball": "0.4000",
"Football": "0.6333",
"Athletics": "0.7667",
"Other Sports": "0.6333",
"Stage Play": "0.8000",
"Magic Show": "0.6667",
"Variety Show": "0.7667",
"Acrobatics": "0.6667",
"Handicraft": "0.6667",
"Food": "0.7000",
"Fashion": "0.5667",
"Daily Life": "0.7000",
"Travel": "0.8333",
"Pet & Animal": "0.8333",
"Exercise": "0.6000",
"Multilingual": "0.5000"
},
"task_type": {
"Temporal Perception": "0.6667",
"Spatial Perception": "0.7667",
"Attribute Perception": "0.7951",
"Action Recognition": "0.6412",
"Object Recognition": "0.6964",
"OCR Problems": "0.7895",
"Counting Problem": "0.4240",
"Temporal Reasoning": "0.6923",
"Spatial Reasoning": "0.8519",
"Action Reasoning": "0.7021",
"Object Reasoning": "0.6875",
"Information Synopsis": "0.8537"
}
},
"medium": {
"overall": "0.5256",
"domain": {
"Knowledge": "0.5593",
"Film & Television": "0.6167",
"Sports Competition": "0.4400",
"Artistic Performance": "0.6167",
"Life Record": "0.4429",
"Multilingual": "0.5000"
},
"sub_category": {
"Humanity & History": "0.4667",
"Literature & Art": "0.6000",
"Biology & Medicine": "0.5667",
"Finance & Commerce": "0.6667",
"Astronomy": "0.5667",
"Geography": "0.4667",
"Law": "0.5667",
"Life Tip": "0.6667",
"Technology": "0.4667",
"Animation": "0.3667",
"Movie & TV Show": "0.6667",
"Documentary": "0.6667",
"News Report": "0.7667",
"Esports": "0.5667",
"Basketball": "0.2667",
"Football": "0.4333",
"Athletics": "0.4333",
"Other Sports": "0.5000",
"Stage Play": "0.8333",
"Magic Show": "0.5333",
"Variety Show": "0.5667",
"Acrobatics": "0.5333",
"Handicraft": "0.5667",
"Food": "0.3667",
"Fashion": "0.4000",
"Daily Life": "0.4333",
"Travel": "0.4333",
"Pet & Animal": "0.4000",
"Exercise": "0.5000",
"Multilingual": "0.5000"
},
"task_type": {
"Temporal Perception": "0.4516",
"Spatial Perception": "0.5238",
"Attribute Perception": "0.5068",
"Action Recognition": "0.4034",
"Object Recognition": "0.6515",
"OCR Problems": "0.4118",
"Counting Problem": "0.3053",
"Temporal Reasoning": "0.3973",
"Spatial Reasoning": "0.7778",
"Action Reasoning": "0.5517",
"Object Reasoning": "0.6194",
"Information Synopsis": "0.7949"
}
},
"long": {
"overall": "0.4922",
"domain": {
"Knowledge": "0.5667",
"Film & Television": "0.4917",
"Sports Competition": "0.4800",
"Artistic Performance": "0.4583",
"Life Record": "0.4381",
"Multilingual": "0.4000"
},
"sub_category": {
"Humanity & History": "0.5667",
"Literature & Art": "0.5667",
"Biology & Medicine": "0.7333",
"Finance & Commerce": "0.5333",
"Astronomy": "0.5667",
"Geography": "0.4000",
"Law": "0.6667",
"Life Tip": "0.6000",
"Technology": "0.4667",
"Animation": "0.3333",
"Movie & TV Show": "0.5000",
"Documentary": "0.6000",
"News Report": "0.5333",
"Esports": "0.4333",
"Basketball": "0.4000",
"Football": "0.5333",
"Athletics": "0.4667",
"Other Sports": "0.5667",
"Stage Play": "0.7333",
"Magic Show": "0.3333",
"Variety Show": "0.3000",
"Acrobatics": "0.4667",
"Handicraft": "0.5667",
"Food": "0.3333",
"Fashion": "0.4333",
"Daily Life": "0.2667",
"Travel": "0.3667",
"Pet & Animal": "0.7333",
"Exercise": "0.3667",
"Multilingual": "0.4000"
},
"task_type": {
"Temporal Perception": "0.1667",
"Spatial Perception": "0.0000",
"Attribute Perception": "0.7037",
"Action Recognition": "0.4286",
"Object Recognition": "0.5000",
"OCR Problems": "0.5714",
"Counting Problem": "0.2917",
"Temporal Reasoning": "0.3077",
"Spatial Reasoning": "0.7273",
"Action Reasoning": "0.4278",
"Object Reasoning": "0.4917",
"Information Synopsis": "0.7117"
}
},
"overall": {
"overall": "0.5693",
"domain": {
"Knowledge": "0.6136",
"Film & Television": "0.6194",
"Sports Competition": "0.5044",
"Artistic Performance": "0.6000",
"Life Record": "0.5270",
"Multilingual": "0.4667"
},
"sub_category": {
"Humanity & History": "0.5333",
"Literature & Art": "0.6000",
"Biology & Medicine": "0.7111",
"Finance & Commerce": "0.6778",
"Astronomy": "0.6000",
"Geography": "0.5222",
"Law": "0.6444",
"Life Tip": "0.6889",
"Technology": "0.5444",
"Animation": "0.4889",
"Movie & TV Show": "0.6111",
"Documentary": "0.6444",
"News Report": "0.7333",
"Esports": "0.5111",
"Basketball": "0.3556",
"Football": "0.5333",
"Athletics": "0.5556",
"Other Sports": "0.5667",
"Stage Play": "0.7889",
"Magic Show": "0.5111",
"Variety Show": "0.5444",
"Acrobatics": "0.5556",
"Handicraft": "0.6000",
"Food": "0.4667",
"Fashion": "0.4667",
"Daily Life": "0.4667",
"Travel": "0.5444",
"Pet & Animal": "0.6556",
"Exercise": "0.4889",
"Multilingual": "0.4667"
},
"task_type": {
"Temporal Perception": "0.4909",
"Spatial Perception": "0.6296",
"Attribute Perception": "0.6892",
"Action Recognition": "0.5080",
"Object Recognition": "0.6497",
"OCR Problems": "0.5827",
"Counting Problem": "0.3582",
"Temporal Reasoning": "0.3729",
"Spatial Reasoning": "0.8036",
"Action Reasoning": "0.4982",
"Object Reasoning": "0.5639",
"Information Synopsis": "0.7678"
}
}
}
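Since small discrepancies between runs are expected, it can help to compare a reproduced report against the expected one with a tolerance instead of requiring exact equality. The following is a sketch only. `find_drift`, the tolerance of 0.01, and the `REPRODUCED` values are all assumptions made for illustration (the reproduced numbers are synthetic, not measured):

```python
# Excerpt of the expected InternVL2-8B report above (with subtitles).
EXPECTED = {
    "short": {"overall": "0.6900"},
    "medium": {"overall": "0.5256"},
    "long": {"overall": "0.4922"},
    "overall": {"overall": "0.5693"},
}

# Synthetic stand-in for a locally reproduced report; NOT real measurements.
REPRODUCED = {
    "short": {"overall": "0.6889"},
    "medium": {"overall": "0.5256"},
    "long": {"overall": "0.4700"},
    "overall": {"overall": "0.5615"},
}


def find_drift(expected, actual, tol, path=()):
    """Return (path, expected, actual) for every score that differs by more than tol."""
    drifts = []
    for key, exp_value in expected.items():
        here = path + (key,)
        if isinstance(exp_value, dict):
            # Recurse into nested groups such as "domain" or "task_type".
            drifts.extend(find_drift(exp_value, actual.get(key, {}), tol, here))
            continue
        act_value = actual.get(key)
        if act_value is None:
            drifts.append(("/".join(here), float(exp_value), None))
        elif abs(float(act_value) - float(exp_value)) > tol:
            drifts.append(("/".join(here), float(exp_value), float(act_value)))
    return drifts


drifts = find_drift(EXPECTED, REPRODUCED, tol=0.01)
for entry in drifts:
    print(entry)
```

Categories with few questions can swing by large steps between runs, so a single tolerance may be too strict for the finer-grained `task_type` entries.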
When testing InternVL2-26B without subtitles:
torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-26B --verbose --nframe 16
The expected test results are:
{
"short": {
"overall": "0.6667",
"domain": {
"Knowledge": "0.6741",
"Film & Television": "0.7333",
"Sports Competition": "0.6133",
"Artistic Performance": "0.6750",
"Life Record": "0.6762",
"Multilingual": "0.5000"
},
"sub_category": {
"Humanity & History": "0.4000",
"Literature & Art": "0.5667",
"Biology & Medicine": "0.8667",
"Finance & Commerce": "0.7000",
"Astronomy": "0.6667",
"Geography": "0.6333",
"Law": "0.8000",
"Life Tip": "0.8000",
"Technology": "0.6333",
"Animation": "0.8000",
"Movie & TV Show": "0.7000",
"Documentary": "0.5667",
"News Report": "0.8667",
"Esports": "0.5333",
"Basketball": "0.4667",
"Football": "0.6333",
"Athletics": "0.7667",
"Other Sports": "0.6667",
"Stage Play": "0.8667",
"Magic Show": "0.5333",
"Variety Show": "0.6333",
"Acrobatics": "0.6667",
"Handicraft": "0.7000",
"Food": "0.7667",
"Fashion": "0.6667",
"Daily Life": "0.6667",
"Travel": "0.7667",
"Pet & Animal": "0.7333",
"Exercise": "0.4333",
"Multilingual": "0.5000"
},
"task_type": {
"Temporal Perception": "0.8333",
"Spatial Perception": "0.7333",
"Attribute Perception": "0.7541",
"Action Recognition": "0.6489",
"Object Recognition": "0.6548",
"OCR Problems": "0.7719",
"Counting Problem": "0.4080",
"Temporal Reasoning": "0.6154",
"Spatial Reasoning": "0.7778",
"Action Reasoning": "0.7234",
"Object Reasoning": "0.6500",
"Information Synopsis": "0.8049"
}
},
"medium": {
"overall": "0.5200",
"domain": {
"Knowledge": "0.5481",
"Film & Television": "0.5833",
"Sports Competition": "0.4267",
"Artistic Performance": "0.6167",
"Life Record": "0.4524",
"Multilingual": "0.5667"
},
"sub_category": {
"Humanity & History": "0.4000",
"Literature & Art": "0.6000",
"Biology & Medicine": "0.6667",
"Finance & Commerce": "0.5667",
"Astronomy": "0.5333",
"Geography": "0.4667",
"Law": "0.6667",
"Life Tip": "0.5333",
"Technology": "0.5000",
"Animation": "0.3667",
"Movie & TV Show": "0.6000",
"Documentary": "0.7000",
"News Report": "0.6667",
"Esports": "0.4667",
"Basketball": "0.3000",
"Football": "0.5000",
"Athletics": "0.3667",
"Other Sports": "0.5000",
"Stage Play": "0.6667",
"Magic Show": "0.6333",
"Variety Show": "0.6000",
"Acrobatics": "0.5667",
"Handicraft": "0.6667",
"Food": "0.3000",
"Fashion": "0.4000",
"Daily Life": "0.4000",
"Travel": "0.5333",
"Pet & Animal": "0.4667",
"Exercise": "0.4000",
"Multilingual": "0.5667"
},
"task_type": {
"Temporal Perception": "0.4839",
"Spatial Perception": "0.5238",
"Attribute Perception": "0.5890",
"Action Recognition": "0.4454",
"Object Recognition": "0.6364",
"OCR Problems": "0.4412",
"Counting Problem": "0.3474",
"Temporal Reasoning": "0.3836",
"Spatial Reasoning": "0.7222",
"Action Reasoning": "0.4655",
"Object Reasoning": "0.5448",
"Information Synopsis": "0.7436"
}
},
"long": {
"overall": "0.4578",
"domain": {
"Knowledge": "0.4815",
"Film & Television": "0.4583",
"Sports Competition": "0.4200",
"Artistic Performance": "0.4167",
"Life Record": "0.4857",
"Multilingual": "0.4000"
},
"sub_category": {
"Humanity & History": "0.5000",
"Literature & Art": "0.5667",
"Biology & Medicine": "0.5333",
"Finance & Commerce": "0.6000",
"Astronomy": "0.4667",
"Geography": "0.3000",
"Law": "0.5000",
"Life Tip": "0.4667",
"Technology": "0.4000",
"Animation": "0.3667",
"Movie & TV Show": "0.4667",
"Documentary": "0.5000",
"News Report": "0.5000",
"Esports": "0.4667",
"Basketball": "0.4000",
"Football": "0.4667",
"Athletics": "0.4000",
"Other Sports": "0.3667",
"Stage Play": "0.5667",
"Magic Show": "0.4667",
"Variety Show": "0.1333",
"Acrobatics": "0.5000",
"Handicraft": "0.6333",
"Food": "0.4333",
"Fashion": "0.3667",
"Daily Life": "0.5333",
"Travel": "0.3667",
"Pet & Animal": "0.6667",
"Exercise": "0.4000",
"Multilingual": "0.4000"
},
"task_type": {
"Temporal Perception": "0.0000",
"Spatial Perception": "0.3333",
"Attribute Perception": "0.5926",
"Action Recognition": "0.3968",
"Object Recognition": "0.5741",
"OCR Problems": "0.5000",
"Counting Problem": "0.2917",
"Temporal Reasoning": "0.2967",
"Spatial Reasoning": "0.6364",
"Action Reasoning": "0.4111",
"Object Reasoning": "0.4583",
"Information Synopsis": "0.6135"
}
},
"overall": {
"overall": "0.5481",
"domain": {
"Knowledge": "0.5679",
"Film & Television": "0.5917",
"Sports Competition": "0.4867",
"Artistic Performance": "0.5694",
"Life Record": "0.5381",
"Multilingual": "0.4889"
},
"sub_category": {
"Humanity & History": "0.4333",
"Literature & Art": "0.5778",
"Biology & Medicine": "0.6889",
"Finance & Commerce": "0.6222",
"Astronomy": "0.5556",
"Geography": "0.4667",
"Law": "0.6556",
"Life Tip": "0.6000",
"Technology": "0.5111",
"Animation": "0.5111",
"Movie & TV Show": "0.5889",
"Documentary": "0.5889",
"News Report": "0.6778",
"Esports": "0.4889",
"Basketball": "0.3889",
"Football": "0.5333",
"Athletics": "0.5111",
"Other Sports": "0.5111",
"Stage Play": "0.7000",
"Magic Show": "0.5444",
"Variety Show": "0.4556",
"Acrobatics": "0.5778",
"Handicraft": "0.6667",
"Food": "0.5000",
"Fashion": "0.4778",
"Daily Life": "0.5333",
"Travel": "0.5556",
"Pet & Animal": "0.6222",
"Exercise": "0.4111",
"Multilingual": "0.4889"
},
"task_type": {
"Temporal Perception": "0.5455",
"Spatial Perception": "0.6296",
"Attribute Perception": "0.6802",
"Action Recognition": "0.5208",
"Object Recognition": "0.6356",
"OCR Problems": "0.5827",
"Counting Problem": "0.3657",
"Temporal Reasoning": "0.3559",
"Spatial Reasoning": "0.7321",
"Action Reasoning": "0.4737",
"Object Reasoning": "0.5176",
"Information Synopsis": "0.6935"
}
}
}
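The no-subtitle reports for InternVL2-4B, InternVL2-8B, and InternVL2-26B in this section can be lined up to see how the scores move with model size. The sketch below copies the per-duration `overall` scores by hand. `best_per_duration` is an illustrative helper, not part of any toolkit:

```python
# "overall" scores without subtitles, copied from the reports in this section.
SCORES = {
    "InternVL2-4B": {"short": 0.6289, "medium": 0.4678, "long": 0.4467, "overall": 0.5144},
    "InternVL2-8B": {"short": 0.6567, "medium": 0.5044, "long": 0.4589, "overall": 0.5400},
    "InternVL2-26B": {"short": 0.6667, "medium": 0.5200, "long": 0.4578, "overall": 0.5481},
}
DURATIONS = ("short", "medium", "long", "overall")


def best_per_duration(scores):
    """Map each duration to the model with the highest score for it."""
    return {d: max(scores, key=lambda m: scores[m][d]) for d in DURATIONS}


# Print a fixed-width comparison table.
print(f"{'model':<16}" + "".join(f"{d:>9}" for d in DURATIONS))
for model, row in SCORES.items():
    print(f"{model:<16}" + "".join(f"{row[d]:>9.4f}" for d in DURATIONS))

best = best_per_duration(SCORES)
print(best)
```

With these figures the largest model leads on short and medium videos and overall, while the 8B model is marginally ahead on long videos (0.4589 versus 0.4578).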
When testing with subtitles:
torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-26B --verbose --nframe 16 --use-subtitle
The expected test results are:
{
"short": {
"overall": "0.6844",
"domain": {
"Knowledge": "0.6889",
"Film & Television": "0.7250",
"Sports Competition": "0.6200",
"Artistic Performance": "0.7167",
"Life Record": "0.7000",
"Multilingual": "0.5667"
},
"sub_category": {
"Humanity & History": "0.3667",
"Literature & Art": "0.6000",
"Biology & Medicine": "0.9000",
"Finance & Commerce": "0.7333",
"Astronomy": "0.7000",
"Geography": "0.7333",
"Law": "0.8333",
"Life Tip": "0.7000",
"Technology": "0.6333",
"Animation": "0.7333",
"Movie & TV Show": "0.7333",
"Documentary": "0.5667",
"News Report": "0.8667",
"Esports": "0.6667",
"Basketball": "0.4333",
"Football": "0.6667",
"Athletics": "0.7333",
"Other Sports": "0.6000",
"Stage Play": "0.8333",
"Magic Show": "0.6000",
"Variety Show": "0.7667",
"Acrobatics": "0.6667",
"Handicraft": "0.6667",
"Food": "0.8333",
"Fashion": "0.6667",
"Daily Life": "0.7667",
"Travel": "0.7667",
"Pet & Animal": "0.7333",
"Exercise": "0.4667",
"Multilingual": "0.5667"
},
"task_type": {
"Temporal Perception": "0.7778",
"Spatial Perception": "0.7000",
"Attribute Perception": "0.7869",
"Action Recognition": "0.6336",
"Object Recognition": "0.6905",
"OCR Problems": "0.8070",
"Counting Problem": "0.4080",
"Temporal Reasoning": "0.7692",
"Spatial Reasoning": "0.8519",
"Action Reasoning": "0.7021",
"Object Reasoning": "0.7125",
"Information Synopsis": "0.8049"
}
},
"medium": {
"overall": "0.5456",
"domain": {
"Knowledge": "0.5852",
"Film & Television": "0.6167",
"Sports Competition": "0.4400",
"Artistic Performance": "0.6333",
"Life Record": "0.4714",
"Multilingual": "0.6000"
},
"sub_category": {
"Humanity & History": "0.5333",
"Literature & Art": "0.5667",
"Biology & Medicine": "0.6333",
"Finance & Commerce": "0.6667",
"Astronomy": "0.6667",
"Geography": "0.5000",
"Law": "0.6333",
"Life Tip": "0.5667",
"Technology": "0.5000",
"Animation": "0.3333",
"Movie & TV Show": "0.6333",
"Documentary": "0.7000",
"News Report": "0.8000",
"Esports": "0.4333",
"Basketball": "0.2667",
"Football": "0.6000",
"Athletics": "0.3667",
"Other Sports": "0.5333",
"Stage Play": "0.7667",
"Magic Show": "0.6000",
"Variety Show": "0.6000",
"Acrobatics": "0.5667",
"Handicraft": "0.6333",
"Food": "0.3000",
"Fashion": "0.4333",
"Daily Life": "0.3667",
"Travel": "0.6000",
"Pet & Animal": "0.4667",
"Exercise": "0.5000",
"Multilingual": "0.6000"
},
"task_type": {
"Temporal Perception": "0.4839",
"Spatial Perception": "0.4762",
"Attribute Perception": "0.5890",
"Action Recognition": "0.4622",
"Object Recognition": "0.6591",
"OCR Problems": "0.4706",
"Counting Problem": "0.3474",
"Temporal Reasoning": "0.4247",
"Spatial Reasoning": "0.8333",
"Action Reasoning": "0.4310",
"Object Reasoning": "0.6194",
"Information Synopsis": "0.7949"
}
},
"long": {
"overall": "0.4833",
"domain": {
"Knowledge": "0.5296",
"Film & Television": "0.5083",
"Sports Competition": "0.4333",
"Artistic Performance": "0.4583",
"Life Record": "0.4667",
"Multilingual": "0.4333"
},
"sub_category": {
"Humanity & History": "0.4667",
"Literature & Art": "0.5000",
"Biology & Medicine": "0.7000",
"Finance & Commerce": "0.6667",
"Astronomy": "0.6000",
"Geography": "0.3333",
"Law": "0.5667",
"Life Tip": "0.5000",
"Technology": "0.4333",
"Animation": "0.4000",
"Movie & TV Show": "0.4667",
"Documentary": "0.6667",
"News Report": "0.5000",
"Esports": "0.4667",
"Basketball": "0.3667",
"Football": "0.5667",
"Athletics": "0.3333",
"Other Sports": "0.4333",
"Stage Play": "0.7667",
"Magic Show": "0.4000",
"Variety Show": "0.2000",
"Acrobatics": "0.4667",
"Handicraft": "0.6333",
"Food": "0.3333",
"Fashion": "0.4333",
"Daily Life": "0.4667",
"Travel": "0.3000",
"Pet & Animal": "0.7000",
"Exercise": "0.4000",
"Multilingual": "0.4333"
},
"task_type": {
"Temporal Perception": "0.0000",
"Spatial Perception": "0.3333",
"Attribute Perception": "0.5556",
"Action Recognition": "0.4444",
"Object Recognition": "0.4815",
"OCR Problems": "0.6429",
"Counting Problem": "0.3333",
"Temporal Reasoning": "0.2967",
"Spatial Reasoning": "0.7273",
"Action Reasoning": "0.4611",
"Object Reasoning": "0.4667",
"Information Synopsis": "0.6748"
}
},
"overall": {
"overall": "0.5711",
"domain": {
"Knowledge": "0.6012",
"Film & Television": "0.6167",
"Sports Competition": "0.4978",
"Artistic Performance": "0.6028",
"Life Record": "0.5460",
"Multilingual": "0.5333"
},
"sub_category": {
"Humanity & History": "0.4556",
"Literature & Art": "0.5556",
"Biology & Medicine": "0.7444",
"Finance & Commerce": "0.6889",
"Astronomy": "0.6556",
"Geography": "0.5222",
"Law": "0.6778",
"Life Tip": "0.5889",
"Technology": "0.5222",
"Animation": "0.4889",
"Movie & TV Show": "0.6111",
"Documentary": "0.6444",
"News Report": "0.7222",
"Esports": "0.5222",
"Basketball": "0.3556",
"Football": "0.6111",
"Athletics": "0.4778",
"Other Sports": "0.5222",
"Stage Play": "0.7889",
"Magic Show": "0.5333",
"Variety Show": "0.5222",
"Acrobatics": "0.5667",
"Handicraft": "0.6444",
"Food": "0.4889",
"Fashion": "0.5111",
"Daily Life": "0.5333",
"Travel": "0.5556",
"Pet & Animal": "0.6333",
"Exercise": "0.4556",
"Multilingual": "0.5333"
},
"task_type": {
"Temporal Perception": "0.5273",
"Spatial Perception": "0.5926",
"Attribute Perception": "0.6937",
"Action Recognition": "0.5304",
"Object Recognition": "0.6469",
"OCR Problems": "0.6259",
"Counting Problem": "0.3731",
"Temporal Reasoning": "0.3842",
"Spatial Reasoning": "0.8214",
"Action Reasoning": "0.4947",
"Object Reasoning": "0.5551",
"Information Synopsis": "0.7368"
}
}
}
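As a quick sanity check on a reproduced result, the top-level overall score can be compared with the mean of the three duration splits. The sketch below assumes the short, medium, and long splits contain the same number of questions (the figures above are consistent with that assumption). The helper name `mean_of_splits` is hypothetical and not part of VLMEvalKit.

```python
# Scores copied from the expected InternVL2-26B results with subtitles above.
expected = {
    "short": {"overall": "0.6844"},
    "medium": {"overall": "0.5456"},
    "long": {"overall": "0.4833"},
    "overall": {"overall": "0.5711"},
}


def mean_of_splits(results):
    """Average the short/medium/long overall scores (assumes equal split sizes)."""
    splits = ("short", "medium", "long")
    return sum(float(results[name]["overall"]) for name in splits) / len(splits)


reported = float(expected["overall"]["overall"])
recomputed = mean_of_splits(expected)
# Scores are rounded to four decimals, so allow a small rounding margin.
consistent = abs(recomputed - reported) < 5e-4
print(f"recomputed={recomputed:.4f} reported={reported:.4f} consistent={consistent}")
```

If the two values disagree by more than the rounding margin, the result file is probably incomplete or the splits were not all evaluated.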
When testing InternVL2-40B without subtitles:
torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-40B --verbose --nframe 16
The expected test results are:
{
"short": {
"overall": "0.7200",
"domain": {
"Knowledge": "0.7222",
"Film & Television": "0.7417",
"Sports Competition": "0.6667",
"Artistic Performance": "0.7583",
"Life Record": "0.7476",
"Multilingual": "0.5333"
},
"sub_category": {
"Humanity & History": "0.4333",
"Literature & Art": "0.6667",
"Biology & Medicine": "0.9667",
"Finance & Commerce": "0.8000",
"Astronomy": "0.8000",
"Geography": "0.6333",
"Law": "0.7333",
"Life Tip": "0.7333",
"Technology": "0.7333",
"Animation": "0.8000",
"Movie & TV Show": "0.7333",
"Documentary": "0.5667",
"News Report": "0.8667",
"Esports": "0.6667",
"Basketball": "0.4333",
"Football": "0.7667",
"Athletics": "0.8000",
"Other Sports": "0.6667",
"Stage Play": "0.9000",
"Magic Show": "0.6667",
"Variety Show": "0.7667",
"Acrobatics": "0.7000",
"Handicraft": "0.8667",
"Food": "0.7333",
"Fashion": "0.7333",
"Daily Life": "0.7333",
"Travel": "0.7667",
"Pet & Animal": "0.8000",
"Exercise": "0.6000",
"Multilingual": "0.5333"
},
"task_type": {
"Temporal Perception": "0.8889",
"Spatial Perception": "0.7333",
"Attribute Perception": "0.8033",
"Action Recognition": "0.6718",
"Object Recognition": "0.7262",
"OCR Problems": "0.8596",
"Counting Problem": "0.4400",
"Temporal Reasoning": "0.8462",
"Spatial Reasoning": "0.8889",
"Action Reasoning": "0.7660",
"Object Reasoning": "0.7250",
"Information Synopsis": "0.8415"
}
},
"medium": {
"overall": "0.5911",
"domain": {
"Knowledge": "0.6074",
"Film & Television": "0.6417",
"Sports Competition": "0.5067",
"Artistic Performance": "0.6583",
"Life Record": "0.5429",
"Multilingual": "0.7333"
},
"sub_category": {
"Humanity & History": "0.5333",
"Literature & Art": "0.6333",
"Biology & Medicine": "0.6000",
"Finance & Commerce": "0.6000",
"Astronomy": "0.5667",
"Geography": "0.5333",
"Law": "0.8000",
"Life Tip": "0.5667",
"Technology": "0.6333",
"Animation": "0.4000",
"Movie & TV Show": "0.7000",
"Documentary": "0.8000",
"News Report": "0.6667",
"Esports": "0.6333",
"Basketball": "0.1667",
"Football": "0.5333",
"Athletics": "0.6000",
"Other Sports": "0.6000",
"Stage Play": "0.7667",
"Magic Show": "0.6333",
"Variety Show": "0.5667",
"Acrobatics": "0.6667",
"Handicraft": "0.7000",
"Food": "0.3667",
"Fashion": "0.4333",
"Daily Life": "0.5333",
"Travel": "0.6333",
"Pet & Animal": "0.4000",
"Exercise": "0.7333",
"Multilingual": "0.7333"
},
"task_type": {
"Temporal Perception": "0.5484",
"Spatial Perception": "0.6190",
"Attribute Perception": "0.6712",
"Action Recognition": "0.5126",
"Object Recognition": "0.6667",
"OCR Problems": "0.5000",
"Counting Problem": "0.3579",
"Temporal Reasoning": "0.5068",
"Spatial Reasoning": "0.7778",
"Action Reasoning": "0.5345",
"Object Reasoning": "0.6716",
"Information Synopsis": "0.8205"
}
},
"long": {
"overall": "0.5256",
"domain": {
"Knowledge": "0.5926",
"Film & Television": "0.4583",
"Sports Competition": "0.5267",
"Artistic Performance": "0.5417",
"Life Record": "0.4762",
"Multilingual": "0.4667"
},
"sub_category": {
"Humanity & History": "0.5333",
"Literature & Art": "0.6000",
"Biology & Medicine": "0.7000",
"Finance & Commerce": "0.7667",
"Astronomy": "0.5667",
"Geography": "0.4333",
"Law": "0.5000",
"Life Tip": "0.6333",
"Technology": "0.6000",
"Animation": "0.3000",
"Movie & TV Show": "0.5333",
"Documentary": "0.5333",
"News Report": "0.4667",
"Esports": "0.6667",
"Basketball": "0.3667",
"Football": "0.6000",
"Athletics": "0.4000",
"Other Sports": "0.6000",
"Stage Play": "0.7000",
"Magic Show": "0.5667",
"Variety Show": "0.3667",
"Acrobatics": "0.5333",
"Handicraft": "0.5667",
"Food": "0.3667",
"Fashion": "0.4000",
"Daily Life": "0.4333",
"Travel": "0.3667",
"Pet & Animal": "0.6333",
"Exercise": "0.5667",
"Multilingual": "0.4667"
},
"task_type": {
"Temporal Perception": "0.3333",
"Spatial Perception": "0.3333",
"Attribute Perception": "0.6667",
"Action Recognition": "0.5397",
"Object Recognition": "0.5185",
"OCR Problems": "0.4286",
"Counting Problem": "0.2917",
"Temporal Reasoning": "0.3297",
"Spatial Reasoning": "0.6364",
"Action Reasoning": "0.5000",
"Object Reasoning": "0.5292",
"Information Synopsis": "0.7117"
}
},
"overall": {
"overall": "0.6122",
"domain": {
"Knowledge": "0.6407",
"Film & Television": "0.6139",
"Sports Competition": "0.5667",
"Artistic Performance": "0.6528",
"Life Record": "0.5889",
"Multilingual": "0.5778"
},
"sub_category": {
"Humanity & History": "0.5000",
"Literature & Art": "0.6333",
"Biology & Medicine": "0.7556",
"Finance & Commerce": "0.7222",
"Astronomy": "0.6444",
"Geography": "0.5333",
"Law": "0.6778",
"Life Tip": "0.6444",
"Technology": "0.6556",
"Animation": "0.5000",
"Movie & TV Show": "0.6556",
"Documentary": "0.6333",
"News Report": "0.6667",
"Esports": "0.6556",
"Basketball": "0.3222",
"Football": "0.6333",
"Athletics": "0.6000",
"Other Sports": "0.6222",
"Stage Play": "0.7889",
"Magic Show": "0.6222",
"Variety Show": "0.5667",
"Acrobatics": "0.6333",
"Handicraft": "0.7111",
"Food": "0.4889",
"Fashion": "0.5222",
"Daily Life": "0.5667",
"Travel": "0.5889",
"Pet & Animal": "0.6111",
"Exercise": "0.6333",
"Multilingual": "0.5778"
},
"task_type": {
"Temporal Perception": "0.6364",
"Spatial Perception": "0.6667",
"Attribute Perception": "0.7432",
"Action Recognition": "0.5847",
"Object Recognition": "0.6723",
"OCR Problems": "0.6403",
"Counting Problem": "0.3843",
"Temporal Reasoning": "0.4407",
"Spatial Reasoning": "0.8036",
"Action Reasoning": "0.5509",
"Object Reasoning": "0.6057",
"Information Synopsis": "0.7709"
}
}
}
When testing InternVL2-40B with subtitles:
torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-40B --verbose --nframe 16 --use-subtitle
The expected test results are:
{
"short": {
"overall": "0.7278",
"domain": {
"Knowledge": "0.7370",
"Film & Television": "0.7583",
"Sports Competition": "0.6800",
"Artistic Performance": "0.7750",
"Life Record": "0.7286",
"Multilingual": "0.5667"
},
"sub_category": {
"Humanity & History": "0.4333",
"Literature & Art": "0.6333",
"Biology & Medicine": "0.9667",
"Finance & Commerce": "0.8667",
"Astronomy": "0.8333",
"Geography": "0.7000",
"Law": "0.7667",
"Life Tip": "0.7000",
"Technology": "0.7333",
"Animation": "0.7667",
"Movie & TV Show": "0.7000",
"Documentary": "0.6667",
"News Report": "0.9000",
"Esports": "0.6667",
"Basketball": "0.3667",
"Football": "0.8000",
"Athletics": "0.8333",
"Other Sports": "0.7333",
"Stage Play": "0.8667",
"Magic Show": "0.7333",
"Variety Show": "0.8000",
"Acrobatics": "0.7000",
"Handicraft": "0.7667",
"Food": "0.8000",
"Fashion": "0.6667",
"Daily Life": "0.7333",
"Travel": "0.7667",
"Pet & Animal": "0.8000",
"Exercise": "0.5667",
"Multilingual": "0.5667"
},
"task_type": {
"Temporal Perception": "0.8889",
"Spatial Perception": "0.7333",
"Attribute Perception": "0.8115",
"Action Recognition": "0.6870",
"Object Recognition": "0.7202",
"OCR Problems": "0.8596",
"Counting Problem": "0.4640",
"Temporal Reasoning": "0.6923",
"Spatial Reasoning": "0.8889",
"Action Reasoning": "0.7234",
"Object Reasoning": "0.7625",
"Information Synopsis": "0.8780"
}
},
"medium": {
"overall": "0.6133",
"domain": {
"Knowledge": "0.6630",
"Film & Television": "0.6583",
"Sports Competition": "0.5133",
"Artistic Performance": "0.6917",
"Life Record": "0.5333",
"Multilingual": "0.7333"
},
"sub_category": {
"Humanity & History": "0.6000",
"Literature & Art": "0.7000",
"Biology & Medicine": "0.5667",
"Finance & Commerce": "0.7333",
"Astronomy": "0.7000",
"Geography": "0.5667",
"Law": "0.8333",
"Life Tip": "0.6667",
"Technology": "0.6000",
"Animation": "0.4333",
"Movie & TV Show": "0.7667",
"Documentary": "0.7333",
"News Report": "0.7000",
"Esports": "0.5667",
"Basketball": "0.2667",
"Football": "0.5667",
"Athletics": "0.5667",
"Other Sports": "0.6000",
"Stage Play": "0.8000",
"Magic Show": "0.6333",
"Variety Show": "0.6667",
"Acrobatics": "0.6667",
"Handicraft": "0.7000",
"Food": "0.3333",
"Fashion": "0.4000",
"Daily Life": "0.5333",
"Travel": "0.6333",
"Pet & Animal": "0.4667",
"Exercise": "0.6667",
"Multilingual": "0.7333"
},
"task_type": {
"Temporal Perception": "0.5484",
"Spatial Perception": "0.5238",
"Attribute Perception": "0.6438",
"Action Recognition": "0.5798",
"Object Recognition": "0.7121",
"OCR Problems": "0.4706",
"Counting Problem": "0.3684",
"Temporal Reasoning": "0.5479",
"Spatial Reasoning": "0.8333",
"Action Reasoning": "0.6034",
"Object Reasoning": "0.6791",
"Information Synopsis": "0.8462"
}
},
"long": {
"overall": "0.5300",
"domain": {
"Knowledge": "0.5889",
"Film & Television": "0.5000",
"Sports Competition": "0.5000",
"Artistic Performance": "0.6000",
"Life Record": "0.4571",
"Multilingual": "0.5000"
},
"sub_category": {
"Humanity & History": "0.5333",
"Literature & Art": "0.6000",
"Biology & Medicine": "0.6333",
"Finance & Commerce": "0.6333",
"Astronomy": "0.6667",
"Geography": "0.4000",
"Law": "0.7000",
"Life Tip": "0.6000",
"Technology": "0.5333",
"Animation": "0.3667",
"Movie & TV Show": "0.5000",
"Documentary": "0.5333",
"News Report": "0.6000",
"Esports": "0.6000",
"Basketball": "0.3333",
"Football": "0.6333",
"Athletics": "0.4000",
"Other Sports": "0.5333",
"Stage Play": "0.8667",
"Magic Show": "0.5667",
"Variety Show": "0.4333",
"Acrobatics": "0.5333",
"Handicraft": "0.5667",
"Food": "0.3333",
"Fashion": "0.3667",
"Daily Life": "0.4333",
"Travel": "0.4000",
"Pet & Animal": "0.6667",
"Exercise": "0.4333",
"Multilingual": "0.5000"
},
"task_type": {
"Temporal Perception": "0.1667",
"Spatial Perception": "0.3333",
"Attribute Perception": "0.6296",
"Action Recognition": "0.5714",
"Object Recognition": "0.5185",
"OCR Problems": "0.5714",
"Counting Problem": "0.2708",
"Temporal Reasoning": "0.3187",
"Spatial Reasoning": "0.6364",
"Action Reasoning": "0.4889",
"Object Reasoning": "0.5417",
"Information Synopsis": "0.7301"
}
},
"overall": {
"overall": "0.6237",
"domain": {
"Knowledge": "0.6630",
"Film & Television": "0.6389",
"Sports Competition": "0.5644",
"Artistic Performance": "0.6889",
"Life Record": "0.5730",
"Multilingual": "0.6000"
},
"sub_category": {
"Humanity & History": "0.5222",
"Literature & Art": "0.6444",
"Biology & Medicine": "0.7222",
"Finance & Commerce": "0.7444",
"Astronomy": "0.7333",
"Geography": "0.5556",
"Law": "0.7667",
"Life Tip": "0.6556",
"Technology": "0.6222",
"Animation": "0.5222",
"Movie & TV Show": "0.6556",
"Documentary": "0.6444",
"News Report": "0.7333",
"Esports": "0.6111",
"Basketball": "0.3222",
"Football": "0.6667",
"Athletics": "0.6000",
"Other Sports": "0.6222",
"Stage Play": "0.8444",
"Magic Show": "0.6444",
"Variety Show": "0.6333",
"Acrobatics": "0.6333",
"Handicraft": "0.6778",
"Food": "0.4889",
"Fashion": "0.4778",
"Daily Life": "0.5667",
"Travel": "0.6000",
"Pet & Animal": "0.6444",
"Exercise": "0.5556",
"Multilingual": "0.6000"
},
"task_type": {
"Temporal Perception": "0.6182",
"Spatial Perception": "0.6296",
"Attribute Perception": "0.7342",
"Action Recognition": "0.6230",
"Object Recognition": "0.6864",
"OCR Problems": "0.6403",
"Counting Problem": "0.3955",
"Temporal Reasoning": "0.4407",
"Spatial Reasoning": "0.8214",
"Action Reasoning": "0.5509",
"Object Reasoning": "0.6211",
"Information Synopsis": "0.7957"
}
}
}
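To see how much subtitles change the scores, the per-split overall values of the two InternVL2-40B runs above can be subtracted. This is only an illustrative sketch: the numbers are copied from the expected results above, and the function name `subtitle_gain` is hypothetical.

```python
# Overall scores copied from the two InternVL2-40B result blocks above.
without_subtitles = {"short": 0.7200, "medium": 0.5911, "long": 0.5256, "overall": 0.6122}
with_subtitles = {"short": 0.7278, "medium": 0.6133, "long": 0.5300, "overall": 0.6237}


def subtitle_gain(base, subs):
    """Return the per-split score change when subtitles are added."""
    return {split: round(subs[split] - base[split], 4) for split in base}


gains = subtitle_gain(without_subtitles, with_subtitles)
for split, delta in gains.items():
    print(f"{split:>7}: {delta:+.4f}")
```

The same function works for any pair of runs once their overall scores are extracted into flat dictionaries of this shape.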
When testing InternVL2-76B without subtitles:
torchrun --nproc-per-node=1 run.py --data Video-MME --model InternVL2-76B --verbose --nframe 16
The expected test results are:
{
"short": {
"overall": "0.7222",
"domain": {
"Knowledge": "0.7593",
"Film & Television": "0.7167",
"Sports Competition": "0.6800",
"Artistic Performance": "0.7500",
"Life Record": "0.7143",
"Multilingual": "0.5667"
},
"sub_category": {
"Humanity & History": "0.5333",
"Literature & Art": "0.6667",
"Biology & Medicine": "0.9333",
"Finance & Commerce": "0.8333",
"Astronomy": "0.7667",
"Geography": "0.7333",
"Law": "0.8000",
"Life Tip": "0.7667",
"Technology": "0.8000",
"Animation": "0.8000",
"Movie & TV Show": "0.6333",
"Documentary": "0.5667",
"News Report": "0.8667",
"Esports": "0.6667",
"Basketball": "0.6000",
"Football": "0.7667",
"Athletics": "0.7333",
"Other Sports": "0.6333",
"Stage Play": "0.8667",
"Magic Show": "0.6667",
"Variety Show": "0.7333",
"Acrobatics": "0.7333",
"Handicraft": "0.8000",
"Food": "0.7333",
"Fashion": "0.6000",
"Daily Life": "0.7333",
"Travel": "0.8667",
"Pet & Animal": "0.7667",
"Exercise": "0.5000",
"Multilingual": "0.5667"
},
"task_type": {
"Temporal Perception": "0.9444",
"Spatial Perception": "0.8333",
"Attribute Perception": "0.7869",
"Action Recognition": "0.6870",
"Object Recognition": "0.6786",
"OCR Problems": "0.8596",
"Counting Problem": "0.4400",
"Temporal Reasoning": "0.6923",
"Spatial Reasoning": "0.8519",
"Action Reasoning": "0.8085",
"Object Reasoning": "0.8000",
"Information Synopsis": "0.8537"
}
},
"medium": {
"overall": "0.5800",
"domain": {
"Knowledge": "0.5741",
"Film & Television": "0.6833",
"Sports Competition": "0.5200",
"Artistic Performance": "0.6833",
"Life Record": "0.5095",
"Multilingual": "0.6000"
},
"sub_category": {
"Humanity & History": "0.5000",
"Literature & Art": "0.6000",
"Biology & Medicine": "0.5667",
"Finance & Commerce": "0.6333",
"Astronomy": "0.6000",
"Geography": "0.5000",
"Law": "0.6333",
"Life Tip": "0.6000",
"Technology": "0.5333",
"Animation": "0.6000",
"Movie & TV Show": "0.7667",
"Documentary": "0.7667",
"News Report": "0.6000",
"Esports": "0.5000",
"Basketball": "0.4000",
"Football": "0.6000",
"Athletics": "0.4667",
"Other Sports": "0.6333",
"Stage Play": "0.8000",
"Magic Show": "0.6333",
"Variety Show": "0.6000",
"Acrobatics": "0.7000",
"Handicraft": "0.7333",
"Food": "0.3000",
"Fashion": "0.4000",
"Daily Life": "0.3667",
"Travel": "0.5667",
"Pet & Animal": "0.6333",
"Exercise": "0.5667",
"Multilingual": "0.6000"
},
"task_type": {
"Temporal Perception": "0.5806",
"Spatial Perception": "0.5238",
"Attribute Perception": "0.6027",
"Action Recognition": "0.5546",
"Object Recognition": "0.6212",
"OCR Problems": "0.5000",
"Counting Problem": "0.4000",
"Temporal Reasoning": "0.3836",
"Spatial Reasoning": "0.7222",
"Action Reasoning": "0.6207",
"Object Reasoning": "0.6642",
"Information Synopsis": "0.8077"
}
},
"long": {
"overall": "0.5333",
"domain": {
"Knowledge": "0.5926",
"Film & Television": "0.4667",
"Sports Competition": "0.5200",
"Artistic Performance": "0.5750",
"Life Record": "0.4810",
"Multilingual": "0.5333"
},
"sub_category": {
"Humanity & History": "0.5333",
"Literature & Art": "0.6000",
"Biology & Medicine": "0.5667",
"Finance & Commerce": "0.6667",
"Astronomy": "0.7333",
"Geography": "0.5000",
"Law": "0.5333",
"Life Tip": "0.7000",
"Technology": "0.5000",
"Animation": "0.4000",
"Movie & TV Show": "0.4000",
"Documentary": "0.4667",
"News Report": "0.6000",
"Esports": "0.4333",
"Basketball": "0.5333",
"Football": "0.5667",
"Athletics": "0.5000",
"Other Sports": "0.5667",
"Stage Play": "0.7333",
"Magic Show": "0.5667",
"Variety Show": "0.3333",
"Acrobatics": "0.6667",
"Handicraft": "0.5667",
"Food": "0.3667",
"Fashion": "0.5000",
"Daily Life": "0.4667",
"Travel": "0.3667",
"Pet & Animal": "0.7000",
"Exercise": "0.4000",
"Multilingual": "0.5333"
},
"task_type": {
"Temporal Perception": "0.5000",
"Spatial Perception": "0.3333",
"Attribute Perception": "0.5185",
"Action Recognition": "0.5556",
"Object Recognition": "0.5741",
"OCR Problems": "0.3571",
"Counting Problem": "0.3750",
"Temporal Reasoning": "0.4835",
"Spatial Reasoning": "0.6364",
"Action Reasoning": "0.4778",
"Object Reasoning": "0.5250",
"Information Synopsis": "0.6748"
}
},
"overall": {
"overall": "0.6119",
"domain": {
"Knowledge": "0.6420",
"Film & Television": "0.6222",
"Sports Competition": "0.5733",
"Artistic Performance": "0.6694",
"Life Record": "0.5683",
"Multilingual": "0.5667"
},
"sub_category": {
"Humanity & History": "0.5222",
"Literature & Art": "0.6222",
"Biology & Medicine": "0.6889",
"Finance & Commerce": "0.7111",
"Astronomy": "0.7000",
"Geography": "0.5778",
"Law": "0.6556",
"Life Tip": "0.6889",
"Technology": "0.6111",
"Animation": "0.6000",
"Movie & TV Show": "0.6000",
"Documentary": "0.6000",
"News Report": "0.6889",
"Esports": "0.5333",
"Basketball": "0.5111",
"Football": "0.6444",
"Athletics": "0.5667",
"Other Sports": "0.6111",
"Stage Play": "0.8000",
"Magic Show": "0.6222",
"Variety Show": "0.5556",
"Acrobatics": "0.7000",
"Handicraft": "0.7000",
"Food": "0.4667",
"Fashion": "0.5000",
"Daily Life": "0.5222",
"Travel": "0.6000",
"Pet & Animal": "0.7000",
"Exercise": "0.4889",
"Multilingual": "0.5667"
},
"task_type": {
"Temporal Perception": "0.6909",
"Spatial Perception": "0.6852",
"Attribute Perception": "0.6937",
"Action Recognition": "0.6102",
"Object Recognition": "0.6412",
"OCR Problems": "0.6331",
"Counting Problem": "0.4142",
"Temporal Reasoning": "0.4576",
"Spatial Reasoning": "0.7679",
"Action Reasoning": "0.5614",
"Object Reasoning": "0.6145",
"Information Synopsis": "0.7523"
}
}
}
When testing InternVL2-76B with subtitles:
torchrun --nproc-per-node=1 run.py --data Video-MME --model InternVL2-76B --verbose --nframe 16 --use-subtitle
The expected test results are:
{
"short": {
"overall": "0.7422",
"domain": {
"Knowledge": "0.7667",
"Film & Television": "0.7583",
"Sports Competition": "0.7067",
"Artistic Performance": "0.7833",
"Life Record": "0.7286",
"Multilingual": "0.5667"
},
"sub_category": {
"Humanity & History": "0.5000",
"Literature & Art": "0.6667",
"Biology & Medicine": "0.9667",
"Finance & Commerce": "0.8667",
"Astronomy": "0.8000",
"Geography": "0.7667",
"Law": "0.8000",
"Life Tip": "0.7667",
"Technology": "0.7667",
"Animation": "0.7667",
"Movie & TV Show": "0.7000",
"Documentary": "0.6667",
"News Report": "0.9000",
"Esports": "0.7000",
"Basketball": "0.5000",
"Football": "0.7667",
"Athletics": "0.8333",
"Other Sports": "0.7333",
"Stage Play": "0.8333",
"Magic Show": "0.7667",
"Variety Show": "0.8000",
"Acrobatics": "0.7333",
"Handicraft": "0.8000",
"Food": "0.8000",
"Fashion": "0.6333",
"Daily Life": "0.7333",
"Travel": "0.8667",
"Pet & Animal": "0.7333",
"Exercise": "0.5333",
"Multilingual": "0.5667"
},
"task_type": {
"Temporal Perception": "0.8889",
"Spatial Perception": "0.8000",
"Attribute Perception": "0.8115",
"Action Recognition": "0.7023",
"Object Recognition": "0.6964",
"OCR Problems": "0.9123",
"Counting Problem": "0.4720",
"Temporal Reasoning": "0.7692",
"Spatial Reasoning": "0.8519",
"Action Reasoning": "0.8511",
"Object Reasoning": "0.7875",
"Information Synopsis": "0.8902"
}
},
"medium": {
"overall": "0.5900",
"domain": {
"Knowledge": "0.6111",
"Film & Television": "0.7083",
"Sports Competition": "0.4800",
"Artistic Performance": "0.7083",
"Life Record": "0.5048",
"Multilingual": "0.6000"
},
"sub_category": {
"Humanity & History": "0.6000",
"Literature & Art": "0.6333",
"Biology & Medicine": "0.5667",
"Finance & Commerce": "0.6333",
"Astronomy": "0.6333",
"Geography": "0.6000",
"Law": "0.6667",
"Life Tip": "0.6333",
"Technology": "0.5333",
"Animation": "0.5333",
"Movie & TV Show": "0.8000",
"Documentary": "0.7667",
"News Report": "0.7333",
"Esports": "0.5000",
"Basketball": "0.3000",
"Football": "0.5667",
"Athletics": "0.4667",
"Other Sports": "0.5667",
"Stage Play": "0.8333",
"Magic Show": "0.6667",
"Variety Show": "0.6000",
"Acrobatics": "0.7333",
"Handicraft": "0.7333",
"Food": "0.3333",
"Fashion": "0.3333",
"Daily Life": "0.4333",
"Travel": "0.5333",
"Pet & Animal": "0.6333",
"Exercise": "0.5333",
"Multilingual": "0.6000"
},
"task_type": {
"Temporal Perception": "0.5161",
"Spatial Perception": "0.5238",
"Attribute Perception": "0.6027",
"Action Recognition": "0.5546",
"Object Recognition": "0.6439",
"OCR Problems": "0.5147",
"Counting Problem": "0.3579",
"Temporal Reasoning": "0.3973",
"Spatial Reasoning": "0.8889",
"Action Reasoning": "0.6207",
"Object Reasoning": "0.6791",
"Information Synopsis": "0.8718"
}
},
"long": {
"overall": "0.5522",
"domain": {
"Knowledge": "0.6222",
"Film & Television": "0.5167",
"Sports Competition": "0.5267",
"Artistic Performance": "0.5750",
"Life Record": "0.4905",
"Multilingual": "0.5333"
},
"sub_category": {
"Humanity & History": "0.6333",
"Literature & Art": "0.7000",
"Biology & Medicine": "0.6000",
"Finance & Commerce": "0.7667",
"Astronomy": "0.6000",
"Geography": "0.5333",
"Law": "0.6667",
"Life Tip": "0.6333",
"Technology": "0.4667",
"Animation": "0.4667",
"Movie & TV Show": "0.4333",
"Documentary": "0.5333",
"News Report": "0.6333",
"Esports": "0.5333",
"Basketball": "0.4333",
"Football": "0.6333",
"Athletics": "0.5000",
"Other Sports": "0.5333",
"Stage Play": "0.7333",
"Magic Show": "0.5667",
"Variety Show": "0.3667",
"Acrobatics": "0.6333",
"Handicraft": "0.5667",
"Food": "0.3667",
"Fashion": "0.4667",
"Daily Life": "0.4667",
"Travel": "0.4333",
"Pet & Animal": "0.7000",
"Exercise": "0.4333",
"Multilingual": "0.5333"
},
"task_type": {
"Temporal Perception": "0.5000",
"Spatial Perception": "0.6667",
"Attribute Perception": "0.6667",
"Action Recognition": "0.5238",
"Object Recognition": "0.5000",
"OCR Problems": "0.5714",
"Counting Problem": "0.2917",
"Temporal Reasoning": "0.5165",
"Spatial Reasoning": "0.6364",
"Action Reasoning": "0.4944",
"Object Reasoning": "0.5458",
"Information Synopsis": "0.7239"
}
},
"overall": {
"overall": "0.6281",
"domain": {
"Knowledge": "0.6667",
"Film & Television": "0.6611",
"Sports Competition": "0.5711",
"Artistic Performance": "0.6889",
"Life Record": "0.5746",
"Multilingual": "0.5667"
},
"sub_category": {
"Humanity & History": "0.5778",
"Literature & Art": "0.6667",
"Biology & Medicine": "0.7111",
"Finance & Commerce": "0.7556",
"Astronomy": "0.6778",
"Geography": "0.6333",
"Law": "0.7111",
"Life Tip": "0.6778",
"Technology": "0.5889",
"Animation": "0.5889",
"Movie & TV Show": "0.6444",
"Documentary": "0.6556",
"News Report": "0.7556",
"Esports": "0.5778",
"Basketball": "0.4111",
"Football": "0.6556",
"Athletics": "0.6000",
"Other Sports": "0.6111",
"Stage Play": "0.8000",
"Magic Show": "0.6667",
"Variety Show": "0.5889",
"Acrobatics": "0.7000",
"Handicraft": "0.7000",
"Food": "0.5000",
"Fashion": "0.4778",
"Daily Life": "0.5444",
"Travel": "0.6111",
"Pet & Animal": "0.6889",
"Exercise": "0.5000",
"Multilingual": "0.5667"
},
"task_type": {
"Temporal Perception": "0.6364",
"Spatial Perception": "0.6852",
"Attribute Perception": "0.7252",
"Action Recognition": "0.6102",
"Object Recognition": "0.6469",
"OCR Problems": "0.6835",
"Counting Problem": "0.3993",
"Temporal Reasoning": "0.4859",
"Spatial Reasoning": "0.8214",
"Action Reasoning": "0.5789",
"Object Reasoning": "0.6278",
"Information Synopsis": "0.8019"
}
}
}
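Because small deviations between runs are expected, a reproduced result can be compared with the expected values using a tolerance instead of exact equality. The sketch below flattens the nested result dictionaries and lists the metrics that differ by more than a threshold. The default tolerance of 0.02 is an arbitrary assumption, the helper names are hypothetical, and the `actual` values are placeholders for illustration, not real measurements.

```python
def flatten(results, prefix=""):
    """Flatten nested result dicts into {"short/domain/Knowledge": 0.7667, ...}."""
    flat = {}
    for key, value in results.items():
        path = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = float(value)
    return flat


def find_deviations(expected, actual, tolerance=0.02):
    """List metrics whose absolute difference exceeds the (assumed) tolerance."""
    exp, act = flatten(expected), flatten(actual)
    return sorted(
        (path, exp[path], act[path])
        for path in exp.keys() & act.keys()
        if abs(exp[path] - act[path]) > tolerance
    )


# A small excerpt of the expected InternVL2-76B results with subtitles above.
expected = {
    "short": {"overall": "0.7422", "domain": {"Knowledge": "0.7667"}},
    "overall": {"overall": "0.6281"},
}
# Placeholder "reproduced" values for illustration only, not real measurements.
actual = {
    "short": {"overall": "0.7400", "domain": {"Knowledge": "0.7000"}},
    "overall": {"overall": "0.6281"},
}

for path, want, got in find_deviations(expected, actual):
    print(f"{path}: expected {want:.4f}, got {got:.4f}")
```

Fine-grained categories with few questions can swing by several points between runs, so a looser tolerance may be appropriate for `sub_category` and `task_type` entries than for the `overall` scores.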
MMBench-Video#
MMBench-Video is a benchmark for evaluating how well MLLMs understand video content. It addresses the limitations of traditional VideoQA benchmarks by using long-form videos sourced from YouTube, which better reflect real-world scenarios. Its free-form questions require temporal reasoning and are human-annotated according to a comprehensive capability taxonomy.
When testing InternVL2-1B with 8 frames:
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-1B --verbose --nframe 8
The expected test results are:
{
"coarse_all": {
"CP": "1.11",
"FP-S": "1.00",
"FP-C": "0.84",
"HL": "0.27",
"LR": "0.71",
"AR": "1.01",
"RR": "1.17",
"CSR": "0.77",
"TR": "0.71",
"Perception": "0.97",
"Reasoning": "0.88",
"Overall": "0.95"
},
"coarse_valid": {
"CP": "1.11",
"FP-S": "1.00",
"FP-C": "0.84",
"HL": "0.27",
"LR": "0.71",
"AR": "1.01",
"RR": "1.17",
"CSR": "0.77",
"TR": "0.71",
"Perception": "0.97",
"Reasoning": "0.88",
"Overall": "0.95"
},
"fine_all": {
"Video Topic": "1.05",
"Video Emotion": "1.27",
"Video Scene": "0.84",
"Video Style": "1.38",
"OCR": "0.87",
"Object Recognition": "1.07",
"Attribute Recognition": "1.41",
"Event Recognition": "0.93",
"Human Motion": "0.84",
"Counting": "0.99",
"Spatial Relationship": "1.16",
"Human-object Interaction": "0.80",
"Human Interaction": "0.70",
"Hallucination": "0.27",
"Structuralized Image-Text Understanding": "0.97",
"Mathematical Calculation": "0.31",
"Physical Property": "0.78",
"Function Reasoning": "0.95",
"Identity Reasoning": "1.30",
"Natural Relation": "1.04",
"Physical Relation": "0.92",
"Social Relation": "1.48",
"Common Sense Reasoning": "0.77",
"Counterfactual Reasoning": "0.80",
"Causal Reasoning": "0.67",
"Future Prediction": "0.77"
},
"fine_valid": {
"Video Topic": "1.05",
"Video Emotion": "1.27",
"Video Scene": "0.84",
"Video Style": "1.38",
"OCR": "0.87",
"Object Recognition": "1.07",
"Attribute Recognition": "1.41",
"Event Recognition": "0.93",
"Human Motion": "0.84",
"Counting": "0.99",
"Spatial Relationship": "1.16",
"Human-object Interaction": "0.80",
"Human Interaction": "0.70",
"Hallucination": "0.27",
"Structuralized Image-Text Understanding": "0.97",
"Mathematical Calculation": "0.31",
"Physical Property": "0.78",
"Function Reasoning": "0.95",
"Identity Reasoning": "1.30",
"Natural Relation": "1.04",
"Physical Relation": "0.92",
"Social Relation": "1.48",
"Common Sense Reasoning": "0.77",
"Counterfactual Reasoning": "0.80",
"Causal Reasoning": "0.67",
"Future Prediction": "0.77"
}
}
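In the expected results above, the `coarse_all` and `coarse_valid` blocks are identical, as are `fine_all` and `fine_valid`. A reproduced result can be checked for the same property with a sketch like the following. The helper name `differing_keys` is hypothetical, and the data is a short excerpt of the results above.

```python
# Excerpt of the expected InternVL2-1B results with 8 frames above.
results = {
    "coarse_all": {"Perception": "0.97", "Reasoning": "0.88", "Overall": "0.95"},
    "coarse_valid": {"Perception": "0.97", "Reasoning": "0.88", "Overall": "0.95"},
}


def differing_keys(results, granularity):
    """Return metric names whose "<granularity>_all" and "<granularity>_valid" scores differ."""
    all_scores = results[f"{granularity}_all"]
    valid_scores = results[f"{granularity}_valid"]
    return sorted(
        name
        for name in all_scores.keys() | valid_scores.keys()
        if all_scores.get(name) != valid_scores.get(name)
    )


print(differing_keys(results, "coarse"))
```

An empty list means the two blocks agree. Any names it returns are the metrics worth inspecting first.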
When testing InternVL2-1B with 16 frames:
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-1B --verbose --nframe 16
The expected test results are:
{
"coarse_all": {
"CP": "1.21",
"FP-S": "1.03",
"FP-C": "0.85",
"HL": "0.29",
"LR": "0.73",
"AR": "1.00",
"RR": "1.26",
"CSR": "0.70",
"TR": "0.74",
"Perception": "1.00",
"Reasoning": "0.90",
"Overall": "0.98"
},
"coarse_valid": {
"CP": "1.21",
"FP-S": "1.03",
"FP-C": "0.85",
"HL": "0.29",
"LR": "0.73",
"AR": "1.00",
"RR": "1.26",
"CSR": "0.70",
"TR": "0.74",
"Perception": "1.00",
"Reasoning": "0.90",
"Overall": "0.98"
},
"fine_all": {
"Video Topic": "1.15",
"Video Emotion": "1.37",
"Video Scene": "0.96",
"Video Style": "1.43",
"OCR": "0.96",
"Object Recognition": "1.08",
"Attribute Recognition": "1.47",
"Event Recognition": "0.86",
"Human Motion": "0.77",
"Counting": "0.94",
"Spatial Relationship": "1.09",
"Human-object Interaction": "0.85",
"Human Interaction": "0.64",
"Hallucination": "0.29",
"Structuralized Image-Text Understanding": "0.96",
"Mathematical Calculation": "0.38",
"Physical Property": "0.76",
"Function Reasoning": "0.89",
"Identity Reasoning": "1.36",
"Natural Relation": "1.00",
"Physical Relation": "1.10",
"Social Relation": "1.54",
"Common Sense Reasoning": "0.70",
"Counterfactual Reasoning": "0.88",
"Causal Reasoning": "0.72",
"Future Prediction": "0.74"
},
"fine_valid": {
"Video Topic": "1.15",
"Video Emotion": "1.37",
"Video Scene": "0.96",
"Video Style": "1.43",
"OCR": "0.96",
"Object Recognition": "1.08",
"Attribute Recognition": "1.47",
"Event Recognition": "0.86",
"Human Motion": "0.77",
"Counting": "0.94",
"Spatial Relationship": "1.09",
"Human-object Interaction": "0.85",
"Human Interaction": "0.64",
"Hallucination": "0.29",
"Structuralized Image-Text Understanding": "0.96",
"Mathematical Calculation": "0.38",
"Physical Property": "0.76",
"Function Reasoning": "0.89",
"Identity Reasoning": "1.36",
"Natural Relation": "1.00",
"Physical Relation": "1.10",
"Social Relation": "1.54",
"Common Sense Reasoning": "0.70",
"Counterfactual Reasoning": "0.88",
"Causal Reasoning": "0.72",
"Future Prediction": "0.74"
}
}
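The MMBench-Video commands in this section differ only in the model name and the frame count. As a convenience, the sketch below generates the command lines for several combinations. It uses only the flags shown in the commands above and prints the commands without executing anything. The function name `mmbench_video_command` and the default of 8 processes are assumptions for illustration.

```python
import shlex
from itertools import product


def mmbench_video_command(model, nframe, gpus=8):
    """Build the torchrun argument list used above for one model and frame count."""
    return [
        "torchrun", f"--nproc-per-node={gpus}", "run.py",
        "--data", "MMBench-Video",
        "--model", model,
        "--verbose",
        "--nframe", str(nframe),
    ]


models = ["InternVL2-1B", "InternVL2-2B"]
frame_counts = [8, 16]
commands = [mmbench_video_command(m, n) for m, n in product(models, frame_counts)]
for argv in commands:
    print(shlex.join(argv))  # print only; nothing is executed here
```

Keeping each command as an argument list means it could later be passed to `subprocess.run` without shell quoting issues.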
When testing InternVL2-2B with 8 frames:
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-2B --verbose --nframe 8
The expected test results are:
{
"coarse_all": {
"CP": "1.16",
"FP-S": "1.05",
"FP-C": "0.81",
"HL": "0.26",
"LR": "0.50",
"AR": "1.12",
"RR": "1.11",
"CSR": "0.81",
"TR": "0.83",
"Perception": "1.00",
"Reasoning": "0.91",
"Overall": "0.97"
},
"coarse_valid": {
"CP": "1.16",
"FP-S": "1.05",
"FP-C": "0.81",
"HL": "0.26",
"LR": "0.50",
"AR": "1.12",
"RR": "1.11",
"CSR": "0.81",
"TR": "0.83",
"Perception": "1.00",
"Reasoning": "0.91",
"Overall": "0.97"
},
"fine_all": {
"Video Topic": "1.12",
"Video Emotion": "1.29",
"Video Scene": "0.99",
"Video Style": "1.24",
"OCR": "0.94",
"Object Recognition": "1.04",
"Attribute Recognition": "1.46",
"Event Recognition": "1.02",
"Human Motion": "0.66",
"Counting": "1.16",
"Spatial Relationship": "0.93",
"Human-object Interaction": "0.77",
"Human Interaction": "0.77",
"Hallucination": "0.26",
"Structuralized Image-Text Understanding": "0.69",
"Mathematical Calculation": "0.22",
"Physical Property": "0.94",
"Function Reasoning": "1.09",
"Identity Reasoning": "1.32",
"Natural Relation": "0.93",
"Physical Relation": "0.98",
"Social Relation": "1.33",
"Common Sense Reasoning": "0.81",
"Counterfactual Reasoning": "1.00",
"Causal Reasoning": "0.76",
"Future Prediction": "0.87"
},
"fine_valid": {
"Video Topic": "1.12",
"Video Emotion": "1.29",
"Video Scene": "0.99",
"Video Style": "1.24",
"OCR": "0.94",
"Object Recognition": "1.04",
"Attribute Recognition": "1.46",
"Event Recognition": "1.02",
"Human Motion": "0.66",
"Counting": "1.16",
"Spatial Relationship": "0.93",
"Human-object Interaction": "0.77",
"Human Interaction": "0.77",
"Hallucination": "0.26",
"Structuralized Image-Text Understanding": "0.69",
"Mathematical Calculation": "0.22",
"Physical Property": "0.94",
"Function Reasoning": "1.09",
"Identity Reasoning": "1.32",
"Natural Relation": "0.93",
"Physical Relation": "0.98",
"Social Relation": "1.33",
"Common Sense Reasoning": "0.81",
"Counterfactual Reasoning": "1.00",
"Causal Reasoning": "0.76",
"Future Prediction": "0.87"
}
}
When testing with 16 frames:
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-2B --verbose --nframe 16
The expected test results are:
{
"coarse_all": {
"CP": "1.22",
"FP-S": "1.13",
"FP-C": "0.80",
"HL": "0.34",
"LR": "0.64",
"AR": "1.01",
"RR": "1.23",
"CSR": "0.88",
"TR": "0.87",
"Perception": "1.06",
"Reasoning": "0.95",
"Overall": "1.03"
},
"coarse_valid": {
"CP": "1.22",
"FP-S": "1.13",
"FP-C": "0.80",
"HL": "0.34",
"LR": "0.64",
"AR": "1.01",
"RR": "1.23",
"CSR": "0.88",
"TR": "0.87",
"Perception": "1.06",
"Reasoning": "0.95",
"Overall": "1.03"
},
"fine_all": {
"Video Topic": "1.14",
"Video Emotion": "1.29",
"Video Scene": "1.17",
"Video Style": "1.21",
"OCR": "1.02",
"Object Recognition": "1.13",
"Attribute Recognition": "1.59",
"Event Recognition": "0.99",
"Human Motion": "0.72",
"Counting": "1.24",
"Spatial Relationship": "1.02",
"Human-object Interaction": "0.67",
"Human Interaction": "0.85",
"Hallucination": "0.34",
"Structuralized Image-Text Understanding": "0.79",
"Mathematical Calculation": "0.40",
"Physical Property": "0.85",
"Function Reasoning": "1.07",
"Identity Reasoning": "1.11",
"Natural Relation": "1.15",
"Physical Relation": "1.00",
"Social Relation": "1.48",
"Common Sense Reasoning": "0.88",
"Counterfactual Reasoning": "1.10",
"Causal Reasoning": "0.82",
"Future Prediction": "0.81"
},
"fine_valid": {
"Video Topic": "1.14",
"Video Emotion": "1.29",
"Video Scene": "1.17",
"Video Style": "1.21",
"OCR": "1.02",
"Object Recognition": "1.13",
"Attribute Recognition": "1.59",
"Event Recognition": "0.99",
"Human Motion": "0.72",
"Counting": "1.24",
"Spatial Relationship": "1.02",
"Human-object Interaction": "0.67",
"Human Interaction": "0.85",
"Hallucination": "0.34",
"Structuralized Image-Text Understanding": "0.79",
"Mathematical Calculation": "0.40",
"Physical Property": "0.85",
"Function Reasoning": "1.07",
"Identity Reasoning": "1.11",
"Natural Relation": "1.15",
"Physical Relation": "1.00",
"Social Relation": "1.48",
"Common Sense Reasoning": "0.88",
"Counterfactual Reasoning": "1.10",
"Causal Reasoning": "0.82",
"Future Prediction": "0.81"
}
}
When testing with 8 frames:
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-4B --verbose --nframe 8
The expected test results are:
{
"coarse_all": {
"CP": "1.34",
"FP-S": "1.16",
"FP-C": "0.97",
"HL": "0.13",
"LR": "0.58",
"AR": "1.16",
"RR": "1.26",
"CSR": "1.02",
"TR": "0.99",
"Perception": "1.13",
"Reasoning": "1.03",
"Overall": "1.10"
},
"coarse_valid": {
"CP": "1.34",
"FP-S": "1.16",
"FP-C": "0.97",
"HL": "0.13",
"LR": "0.58",
"AR": "1.16",
"RR": "1.26",
"CSR": "1.02",
"TR": "0.99",
"Perception": "1.13",
"Reasoning": "1.03",
"Overall": "1.10"
},
"fine_all": {
"Video Topic": "1.30",
"Video Emotion": "1.43",
"Video Scene": "1.18",
"Video Style": "1.62",
"OCR": "0.98",
"Object Recognition": "1.24",
"Attribute Recognition": "1.53",
"Event Recognition": "1.11",
"Human Motion": "0.95",
"Counting": "1.31",
"Spatial Relationship": "1.07",
"Human-object Interaction": "0.95",
"Human Interaction": "0.95",
"Hallucination": "0.13",
"Structuralized Image-Text Understanding": "0.75",
"Mathematical Calculation": "0.33",
"Physical Property": "1.11",
"Function Reasoning": "1.07",
"Identity Reasoning": "1.30",
"Natural Relation": "0.96",
"Physical Relation": "1.25",
"Social Relation": "1.41",
"Common Sense Reasoning": "1.02",
"Counterfactual Reasoning": "0.97",
"Causal Reasoning": "0.98",
"Future Prediction": "1.02"
},
"fine_valid": {
"Video Topic": "1.30",
"Video Emotion": "1.43",
"Video Scene": "1.18",
"Video Style": "1.62",
"OCR": "0.98",
"Object Recognition": "1.24",
"Attribute Recognition": "1.53",
"Event Recognition": "1.11",
"Human Motion": "0.95",
"Counting": "1.31",
"Spatial Relationship": "1.07",
"Human-object Interaction": "0.95",
"Human Interaction": "0.95",
"Hallucination": "0.13",
"Structuralized Image-Text Understanding": "0.75",
"Mathematical Calculation": "0.33",
"Physical Property": "1.11",
"Function Reasoning": "1.07",
"Identity Reasoning": "1.30",
"Natural Relation": "0.96",
"Physical Relation": "1.25",
"Social Relation": "1.41",
"Common Sense Reasoning": "1.02",
"Counterfactual Reasoning": "0.97",
"Causal Reasoning": "0.98",
"Future Prediction": "1.02"
}
}
When testing with 16 frames:
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-4B --verbose --nframe 16
The expected test results are:
{
"coarse_all": {
"CP": "1.38",
"FP-S": "1.27",
"FP-C": "1.03",
"HL": "0.15",
"LR": "0.73",
"AR": "1.24",
"RR": "1.29",
"CSR": "1.17",
"TR": "0.99",
"Perception": "1.22",
"Reasoning": "1.09",
"Overall": "1.18"
},
"coarse_valid": {
"CP": "1.38",
"FP-S": "1.27",
"FP-C": "1.03",
"HL": "0.15",
"LR": "0.73",
"AR": "1.24",
"RR": "1.29",
"CSR": "1.17",
"TR": "0.99",
"Perception": "1.22",
"Reasoning": "1.09",
"Overall": "1.18"
},
"fine_all": {
"Video Topic": "1.31",
"Video Emotion": "1.47",
"Video Scene": "1.22",
"Video Style": "1.74",
"OCR": "1.19",
"Object Recognition": "1.29",
"Attribute Recognition": "1.62",
"Event Recognition": "1.13",
"Human Motion": "1.02",
"Counting": "1.25",
"Spatial Relationship": "1.16",
"Human-object Interaction": "0.99",
"Human Interaction": "1.00",
"Hallucination": "0.15",
"Structuralized Image-Text Understanding": "0.87",
"Mathematical Calculation": "0.51",
"Physical Property": "1.17",
"Function Reasoning": "1.05",
"Identity Reasoning": "1.49",
"Natural Relation": "1.00",
"Physical Relation": "1.25",
"Social Relation": "1.46",
"Common Sense Reasoning": "1.17",
"Counterfactual Reasoning": "1.05",
"Causal Reasoning": "0.96",
"Future Prediction": "1.04"
},
"fine_valid": {
"Video Topic": "1.31",
"Video Emotion": "1.47",
"Video Scene": "1.22",
"Video Style": "1.74",
"OCR": "1.19",
"Object Recognition": "1.29",
"Attribute Recognition": "1.62",
"Event Recognition": "1.13",
"Human Motion": "1.02",
"Counting": "1.25",
"Spatial Relationship": "1.16",
"Human-object Interaction": "0.99",
"Human Interaction": "1.00",
"Hallucination": "0.15",
"Structuralized Image-Text Understanding": "0.87",
"Mathematical Calculation": "0.51",
"Physical Property": "1.17",
"Function Reasoning": "1.05",
"Identity Reasoning": "1.49",
"Natural Relation": "1.00",
"Physical Relation": "1.25",
"Social Relation": "1.46",
"Common Sense Reasoning": "1.17",
"Counterfactual Reasoning": "1.05",
"Causal Reasoning": "0.96",
"Future Prediction": "1.04"
}
}
When testing with 8 frames:
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-8B --verbose --nframe 8
The expected test results are:
{
"coarse_all": {
"CP": "1.36",
"FP-S": "1.26",
"FP-C": "1.07",
"HL": "0.32",
"LR": "0.83",
"AR": "1.19",
"RR": "1.33",
"CSR": "1.14",
"TR": "1.02",
"Perception": "1.22",
"Reasoning": "1.12",
"Overall": "1.19"
},
"coarse_valid": {
"CP": "1.36",
"FP-S": "1.26",
"FP-C": "1.07",
"HL": "0.32",
"LR": "0.83",
"AR": "1.19",
"RR": "1.33",
"CSR": "1.14",
"TR": "1.02",
"Perception": "1.22",
"Reasoning": "1.12",
"Overall": "1.19"
},
"fine_all": {
"Video Topic": "1.23",
"Video Emotion": "1.49",
"Video Scene": "1.22",
"Video Style": "1.67",
"OCR": "1.14",
"Object Recognition": "1.35",
"Attribute Recognition": "1.66",
"Event Recognition": "1.18",
"Human Motion": "0.90",
"Counting": "1.31",
"Spatial Relationship": "1.24",
"Human-object Interaction": "1.05",
"Human Interaction": "1.02",
"Hallucination": "0.32",
"Structuralized Image-Text Understanding": "1.03",
"Mathematical Calculation": "0.53",
"Physical Property": "1.24",
"Function Reasoning": "1.05",
"Identity Reasoning": "1.26",
"Natural Relation": "1.00",
"Physical Relation": "1.27",
"Social Relation": "1.56",
"Common Sense Reasoning": "1.14",
"Counterfactual Reasoning": "0.95",
"Causal Reasoning": "1.07",
"Future Prediction": "0.98"
},
"fine_valid": {
"Video Topic": "1.23",
"Video Emotion": "1.49",
"Video Scene": "1.22",
"Video Style": "1.67",
"OCR": "1.14",
"Object Recognition": "1.35",
"Attribute Recognition": "1.66",
"Event Recognition": "1.18",
"Human Motion": "0.90",
"Counting": "1.31",
"Spatial Relationship": "1.24",
"Human-object Interaction": "1.05",
"Human Interaction": "1.02",
"Hallucination": "0.32",
"Structuralized Image-Text Understanding": "1.03",
"Mathematical Calculation": "0.53",
"Physical Property": "1.24",
"Function Reasoning": "1.05",
"Identity Reasoning": "1.26",
"Natural Relation": "1.00",
"Physical Relation": "1.27",
"Social Relation": "1.56",
"Common Sense Reasoning": "1.14",
"Counterfactual Reasoning": "0.95",
"Causal Reasoning": "1.07",
"Future Prediction": "0.98"
}
}
When testing with 16 frames:
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-8B --verbose --nframe 16
The expected test results are:
{
"coarse_all": {
"CP": "1.45",
"FP-S": "1.40",
"FP-C": "1.13",
"HL": "0.18",
"LR": "0.90",
"AR": "1.32",
"RR": "1.45",
"CSR": "1.19",
"TR": "1.04",
"Perception": "1.32",
"Reasoning": "1.18",
"Overall": "1.28"
},
"coarse_valid": {
"CP": "1.45",
"FP-S": "1.40",
"FP-C": "1.13",
"HL": "0.18",
"LR": "0.90",
"AR": "1.32",
"RR": "1.45",
"CSR": "1.19",
"TR": "1.04",
"Perception": "1.32",
"Reasoning": "1.18",
"Overall": "1.28"
},
"fine_all": {
"Video Topic": "1.38",
"Video Emotion": "1.57",
"Video Scene": "1.27",
"Video Style": "1.69",
"OCR": "1.32",
"Object Recognition": "1.40",
"Attribute Recognition": "1.80",
"Event Recognition": "1.18",
"Human Motion": "1.15",
"Counting": "1.44",
"Spatial Relationship": "1.22",
"Human-object Interaction": "1.15",
"Human Interaction": "1.03",
"Hallucination": "0.18",
"Structuralized Image-Text Understanding": "1.13",
"Mathematical Calculation": "0.56",
"Physical Property": "1.20",
"Function Reasoning": "1.05",
"Identity Reasoning": "1.72",
"Natural Relation": "0.93",
"Physical Relation": "1.45",
"Social Relation": "1.70",
"Common Sense Reasoning": "1.19",
"Counterfactual Reasoning": "1.07",
"Causal Reasoning": "1.04",
"Future Prediction": "1.06"
},
"fine_valid": {
"Video Topic": "1.38",
"Video Emotion": "1.57",
"Video Scene": "1.27",
"Video Style": "1.69",
"OCR": "1.32",
"Object Recognition": "1.40",
"Attribute Recognition": "1.80",
"Event Recognition": "1.18",
"Human Motion": "1.15",
"Counting": "1.44",
"Spatial Relationship": "1.22",
"Human-object Interaction": "1.15",
"Human Interaction": "1.03",
"Hallucination": "0.18",
"Structuralized Image-Text Understanding": "1.13",
"Mathematical Calculation": "0.56",
"Physical Property": "1.20",
"Function Reasoning": "1.05",
"Identity Reasoning": "1.72",
"Natural Relation": "0.93",
"Physical Relation": "1.45",
"Social Relation": "1.70",
"Common Sense Reasoning": "1.19",
"Counterfactual Reasoning": "1.07",
"Causal Reasoning": "1.04",
"Future Prediction": "1.06"
}
}
When testing with 8 frames:
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-26B --verbose --nframe 8
The expected test results are:
{
"coarse_all": {
"CP": "1.47",
"FP-S": "1.32",
"FP-C": "1.07",
"HL": "0.35",
"LR": "1.04",
"AR": "1.42",
"RR": "1.43",
"CSR": "1.16",
"TR": "1.04",
"Perception": "1.28",
"Reasoning": "1.22",
"Overall": "1.27"
},
"coarse_valid": {
"CP": "1.47",
"FP-S": "1.32",
"FP-C": "1.07",
"HL": "0.35",
"LR": "1.04",
"AR": "1.42",
"RR": "1.43",
"CSR": "1.16",
"TR": "1.04",
"Perception": "1.28",
"Reasoning": "1.22",
"Overall": "1.27"
},
"fine_all": {
"Video Topic": "1.35",
"Video Emotion": "1.47",
"Video Scene": "1.51",
"Video Style": "1.69",
"OCR": "1.21",
"Object Recognition": "1.37",
"Attribute Recognition": "1.82",
"Event Recognition": "1.16",
"Human Motion": "0.97",
"Counting": "1.43",
"Spatial Relationship": "1.20",
"Human-object Interaction": "1.05",
"Human Interaction": "1.02",
"Hallucination": "0.35",
"Structuralized Image-Text Understanding": "1.22",
"Mathematical Calculation": "0.76",
"Physical Property": "1.43",
"Function Reasoning": "1.29",
"Identity Reasoning": "1.55",
"Natural Relation": "1.33",
"Physical Relation": "1.12",
"Social Relation": "1.78",
"Common Sense Reasoning": "1.16",
"Counterfactual Reasoning": "1.05",
"Causal Reasoning": "1.05",
"Future Prediction": "1.06"
},
"fine_valid": {
"Video Topic": "1.35",
"Video Emotion": "1.47",
"Video Scene": "1.51",
"Video Style": "1.69",
"OCR": "1.21",
"Object Recognition": "1.37",
"Attribute Recognition": "1.82",
"Event Recognition": "1.16",
"Human Motion": "0.97",
"Counting": "1.43",
"Spatial Relationship": "1.20",
"Human-object Interaction": "1.05",
"Human Interaction": "1.02",
"Hallucination": "0.35",
"Structuralized Image-Text Understanding": "1.22",
"Mathematical Calculation": "0.76",
"Physical Property": "1.43",
"Function Reasoning": "1.29",
"Identity Reasoning": "1.55",
"Natural Relation": "1.33",
"Physical Relation": "1.12",
"Social Relation": "1.78",
"Common Sense Reasoning": "1.16",
"Counterfactual Reasoning": "1.05",
"Causal Reasoning": "1.06",
"Future Prediction": "1.06"
}
}
When testing with 16 frames:
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-26B --verbose --nframe 16
The expected test results are:
{
"coarse_all": {
"CP": "1.56",
"FP-S": "1.48",
"FP-C": "1.23",
"HL": "0.52",
"LR": "1.06",
"AR": "1.61",
"RR": "1.45",
"CSR": "1.38",
"TR": "1.23",
"Perception": "1.42",
"Reasoning": "1.35",
"Overall": "1.41"
},
"coarse_valid": {
"CP": "1.56",
"FP-S": "1.48",
"FP-C": "1.23",
"HL": "0.52",
"LR": "1.06",
"AR": "1.61",
"RR": "1.47",
"CSR": "1.38",
"TR": "1.23",
"Perception": "1.42",
"Reasoning": "1.35",
"Overall": "1.41"
},
"fine_all": {
"Video Topic": "1.52",
"Video Emotion": "1.48",
"Video Scene": "1.59",
"Video Style": "1.76",
"OCR": "1.37",
"Object Recognition": "1.55",
"Attribute Recognition": "1.91",
"Event Recognition": "1.30",
"Human Motion": "1.15",
"Counting": "1.46",
"Spatial Relationship": "1.18",
"Human-object Interaction": "1.35",
"Human Interaction": "1.08",
"Hallucination": "0.52",
"Structuralized Image-Text Understanding": "1.25",
"Mathematical Calculation": "0.78",
"Physical Property": "1.46",
"Function Reasoning": "1.42",
"Identity Reasoning": "1.96",
"Natural Relation": "1.44",
"Physical Relation": "1.06",
"Social Relation": "1.83",
"Common Sense Reasoning": "1.38",
"Counterfactual Reasoning": "1.25",
"Causal Reasoning": "1.23",
"Future Prediction": "1.17"
},
"fine_valid": {
"Video Topic": "1.52",
"Video Emotion": "1.48",
"Video Scene": "1.59",
"Video Style": "1.76",
"OCR": "1.38",
"Object Recognition": "1.56",
"Attribute Recognition": "1.91",
"Event Recognition": "1.30",
"Human Motion": "1.15",
"Counting": "1.46",
"Spatial Relationship": "1.18",
"Human-object Interaction": "1.35",
"Human Interaction": "1.08",
"Hallucination": "0.52",
"Structuralized Image-Text Understanding": "1.25",
"Mathematical Calculation": "0.78",
"Physical Property": "1.46",
"Function Reasoning": "1.42",
"Identity Reasoning": "1.96",
"Natural Relation": "1.50",
"Physical Relation": "1.06",
"Social Relation": "1.83",
"Common Sense Reasoning": "1.38",
"Counterfactual Reasoning": "1.25",
"Causal Reasoning": "1.24",
"Future Prediction": "1.17"
}
}
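In the InternVL2-26B, 16-frame block above, a few `_valid` entries differ slightly from their `_all` counterparts (for example `RR`). The following is a minimal Python sketch for listing such differences in a saved result file. The helper name `diff_splits` is hypothetical. A plausible reading is that the `_valid` splits average only over answers the judge scored successfully, but that is an assumption and is not stated on this page.

```python
import json


def diff_splits(results, all_key, valid_key):
    """Return {metric: (all_score, valid_score)} for metrics where the two splits disagree."""
    all_split, valid_split = results[all_key], results[valid_key]
    return {
        name: (all_split[name], valid_split[name])
        for name in all_split
        if name in valid_split and all_split[name] != valid_split[name]
    }


# Excerpt of the InternVL2-26B, 16-frame results shown above.
results = json.loads("""
{
  "coarse_all":   {"CP": "1.56", "RR": "1.45", "Overall": "1.41"},
  "coarse_valid": {"CP": "1.56", "RR": "1.47", "Overall": "1.41"}
}
""")

coarse_diff = diff_splits(results, "coarse_all", "coarse_valid")
print(coarse_diff)
```

The same call with `"fine_all"` and `"fine_valid"` would apply to the fine-grained splits of a full result file.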
When testing with 8 frames:
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-40B --verbose --nframe 8
The expected test results are:
{
"coarse_all": {
"CP": "1.53",
"FP-S": "1.39",
"FP-C": "1.12",
"HL": "0.32",
"LR": "0.88",
"AR": "1.45",
"RR": "1.52",
"CSR": "1.15",
"TR": "1.13",
"Perception": "1.34",
"Reasoning": "1.25",
"Overall": "1.32"
},
"coarse_valid": {
"CP": "1.53",
"FP-S": "1.39",
"FP-C": "1.12",
"HL": "0.32",
"LR": "0.88",
"AR": "1.45",
"RR": "1.52",
"CSR": "1.15",
"TR": "1.13",
"Perception": "1.34",
"Reasoning": "1.25",
"Overall": "1.32"
},
"fine_all": {
"Video Topic": "1.57",
"Video Emotion": "1.65",
"Video Scene": "1.24",
"Video Style": "1.81",
"OCR": "1.29",
"Object Recognition": "1.40",
"Attribute Recognition": "1.80",
"Event Recognition": "1.21",
"Human Motion": "1.36",
"Counting": "1.45",
"Spatial Relationship": "1.22",
"Human-object Interaction": "1.14",
"Human Interaction": "1.02",
"Hallucination": "0.32",
"Structuralized Image-Text Understanding": "1.04",
"Mathematical Calculation": "0.62",
"Physical Property": "1.30",
"Function Reasoning": "1.33",
"Identity Reasoning": "1.74",
"Natural Relation": "1.30",
"Physical Relation": "1.35",
"Social Relation": "1.78",
"Common Sense Reasoning": "1.15",
"Counterfactual Reasoning": "1.18",
"Causal Reasoning": "1.14",
"Future Prediction": "1.13"
},
"fine_valid": {
"Video Topic": "1.57",
"Video Emotion": "1.65",
"Video Scene": "1.24",
"Video Style": "1.81",
"OCR": "1.29",
"Object Recognition": "1.40",
"Attribute Recognition": "1.80",
"Event Recognition": "1.21",
"Human Motion": "1.36",
"Counting": "1.45",
"Spatial Relationship": "1.22",
"Human-object Interaction": "1.14",
"Human Interaction": "1.02",
"Hallucination": "0.32",
"Structuralized Image-Text Understanding": "1.04",
"Mathematical Calculation": "0.62",
"Physical Property": "1.30",
"Function Reasoning": "1.33",
"Identity Reasoning": "1.74",
"Natural Relation": "1.30",
"Physical Relation": "1.35",
"Social Relation": "1.78",
"Common Sense Reasoning": "1.15",
"Counterfactual Reasoning": "1.18",
"Causal Reasoning": "1.14",
"Future Prediction": "1.13"
}
}
When testing with 16 frames:
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-40B --verbose --nframe 16
The expected test results are:
{
"coarse_all": {
"CP": "1.58",
"FP-S": "1.56",
"FP-C": "1.28",
"HL": "0.39",
"LR": "1.10",
"AR": "1.61",
"RR": "1.53",
"CSR": "1.25",
"TR": "1.20",
"Perception": "1.48",
"Reasoning": "1.35",
"Overall": "1.45"
},
"coarse_valid": {
"CP": "1.58",
"FP-S": "1.56",
"FP-C": "1.28",
"HL": "0.39",
"LR": "1.10",
"AR": "1.61",
"RR": "1.53",
"CSR": "1.25",
"TR": "1.20",
"Perception": "1.48",
"Reasoning": "1.35",
"Overall": "1.45"
},
"fine_all": {
"Video Topic": "1.57",
"Video Emotion": "1.67",
"Video Scene": "1.39",
"Video Style": "1.83",
"OCR": "1.47",
"Object Recognition": "1.64",
"Attribute Recognition": "2.03",
"Event Recognition": "1.32",
"Human Motion": "1.26",
"Counting": "1.49",
"Spatial Relationship": "1.31",
"Human-object Interaction": "1.30",
"Human Interaction": "1.26",
"Hallucination": "0.39",
"Structuralized Image-Text Understanding": "1.26",
"Mathematical Calculation": "0.84",
"Physical Property": "1.43",
"Function Reasoning": "1.49",
"Identity Reasoning": "1.92",
"Natural Relation": "1.56",
"Physical Relation": "1.27",
"Social Relation": "1.76",
"Common Sense Reasoning": "1.25",
"Counterfactual Reasoning": "1.27",
"Causal Reasoning": "1.19",
"Future Prediction": "1.15"
},
"fine_valid": {
"Video Topic": "1.57",
"Video Emotion": "1.67",
"Video Scene": "1.39",
"Video Style": "1.83",
"OCR": "1.47",
"Object Recognition": "1.64",
"Attribute Recognition": "2.03",
"Event Recognition": "1.32",
"Human Motion": "1.26",
"Counting": "1.49",
"Spatial Relationship": "1.31",
"Human-object Interaction": "1.30",
"Human Interaction": "1.26",
"Hallucination": "0.39",
"Structuralized Image-Text Understanding": "1.26",
"Mathematical Calculation": "0.84",
"Physical Property": "1.43",
"Function Reasoning": "1.49",
"Identity Reasoning": "1.92",
"Natural Relation": "1.56",
"Physical Relation": "1.27",
"Social Relation": "1.76",
"Common Sense Reasoning": "1.25",
"Counterfactual Reasoning": "1.27",
"Causal Reasoning": "1.19",
"Future Prediction": "1.15"
}
}
When testing with 8 frames:
torchrun --nproc-per-node=1 run.py --data MMBench-Video --model InternVL2-76B --verbose --nframe 8
The expected test results are:
{
"coarse_all": {
"CP": "1.59",
"FP-S": "1.41",
"FP-C": "1.25",
"HL": "0.42",
"LR": "0.98",
"AR": "1.60",
"RR": "1.41",
"CSR": "1.44",
"TR": "1.27",
"Perception": "1.38",
"Reasoning": "1.35",
"Overall": "1.37"
},
"coarse_valid": {
"CP": "1.59",
"FP-S": "1.41",
"FP-C": "1.25",
"HL": "0.42",
"LR": "0.98",
"AR": "1.60",
"RR": "1.41",
"CSR": "1.44",
"TR": "1.27",
"Perception": "1.38",
"Reasoning": "1.35",
"Overall": "1.37"
},
"fine_all": {
"Video Topic": "1.51",
"Video Emotion": "1.66",
"Video Scene": "1.46",
"Video Style": "1.90",
"OCR": "1.32",
"Object Recognition": "1.45",
"Attribute Recognition": "1.78",
"Event Recognition": "1.30",
"Human Motion": "1.07",
"Counting": "1.49",
"Spatial Relationship": "1.36",
"Human-object Interaction": "1.27",
"Human Interaction": "1.21",
"Hallucination": "0.42",
"Structuralized Image-Text Understanding": "1.21",
"Mathematical Calculation": "0.64",
"Physical Property": "1.57",
"Function Reasoning": "1.51",
"Identity Reasoning": "1.72",
"Natural Relation": "1.33",
"Physical Relation": "1.33",
"Social Relation": "1.52",
"Common Sense Reasoning": "1.44",
"Counterfactual Reasoning": "1.27",
"Causal Reasoning": "1.33",
"Future Prediction": "1.17"
},
"fine_valid": {
"Video Topic": "1.51",
"Video Emotion": "1.66",
"Video Scene": "1.46",
"Video Style": "1.90",
"OCR": "1.32",
"Object Recognition": "1.45",
"Attribute Recognition": "1.78",
"Event Recognition": "1.30",
"Human Motion": "1.07",
"Counting": "1.49",
"Spatial Relationship": "1.36",
"Human-object Interaction": "1.27",
"Human Interaction": "1.21",
"Hallucination": "0.42",
"Structuralized Image-Text Understanding": "1.21",
"Mathematical Calculation": "0.64",
"Physical Property": "1.57",
"Function Reasoning": "1.51",
"Identity Reasoning": "1.72",
"Natural Relation": "1.33",
"Physical Relation": "1.33",
"Social Relation": "1.52",
"Common Sense Reasoning": "1.44",
"Counterfactual Reasoning": "1.27",
"Causal Reasoning": "1.33",
"Future Prediction": "1.17"
}
}
When testing with 16 frames:
torchrun --nproc-per-node=1 run.py --data MMBench-Video --model InternVL2-76B --verbose --nframe 16
The expected test results are:
{
"coarse_all": {
"CP": "1.69",
"FP-S": "1.60",
"FP-C": "1.34",
"HL": "0.44",
"LR": "1.19",
"AR": "1.77",
"RR": "1.48",
"CSR": "1.51",
"TR": "1.36",
"Perception": "1.54",
"Reasoning": "1.46",
"Overall": "1.52"
},
"coarse_valid": {
"CP": "1.69",
"FP-S": "1.60",
"FP-C": "1.34",
"HL": "0.44",
"LR": "1.19",
"AR": "1.77",
"RR": "1.48",
"CSR": "1.51",
"TR": "1.36",
"Perception": "1.54",
"Reasoning": "1.46",
"Overall": "1.52"
},
"fine_all": {
"Video Topic": "1.64",
"Video Emotion": "1.73",
"Video Scene": "1.60",
"Video Style": "1.93",
"OCR": "1.48",
"Object Recognition": "1.65",
"Attribute Recognition": "2.06",
"Event Recognition": "1.42",
"Human Motion": "1.39",
"Counting": "1.69",
"Spatial Relationship": "1.36",
"Human-object Interaction": "1.44",
"Human Interaction": "1.20",
"Hallucination": "0.44",
"Structuralized Image-Text Understanding": "1.40",
"Mathematical Calculation": "0.89",
"Physical Property": "1.65",
"Function Reasoning": "1.49",
"Identity Reasoning": "2.17",
"Natural Relation": "1.30",
"Physical Relation": "1.47",
"Social Relation": "1.59",
"Common Sense Reasoning": "1.51",
"Counterfactual Reasoning": "1.43",
"Causal Reasoning": "1.36",
"Future Prediction": "1.34"
},
"fine_valid": {
"Video Topic": "1.64",
"Video Emotion": "1.73",
"Video Scene": "1.60",
"Video Style": "1.93",
"OCR": "1.48",
"Object Recognition": "1.65",
"Attribute Recognition": "2.06",
"Event Recognition": "1.42",
"Human Motion": "1.39",
"Counting": "1.69",
"Spatial Relationship": "1.36",
"Human-object Interaction": "1.44",
"Human Interaction": "1.20",
"Hallucination": "0.44",
"Structuralized Image-Text Understanding": "1.40",
"Mathematical Calculation": "0.89",
"Physical Property": "1.65",
"Function Reasoning": "1.49",
"Identity Reasoning": "2.17",
"Natural Relation": "1.30",
"Physical Relation": "1.47",
"Social Relation": "1.59",
"Common Sense Reasoning": "1.51",
"Counterfactual Reasoning": "1.43",
"Causal Reasoning": "1.36",
"Future Prediction": "1.34"
}
}
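For the 2B through 76B models listed above, the `Overall` score in `coarse_all` is higher with 16 frames than with 8. The sketch below, which is only an illustration and not part of VLMEvalKit, tabulates that gain from values copied out of the result blocks. The function name `frame_gain` and the dictionary layout are assumptions made for this example.

```python
# "Overall" entries of coarse_all, copied from the result blocks above.
# Scores are stored as strings in the JSON output, so they are kept as strings here.
overall = {
    "InternVL2-2B": {"8": "0.97", "16": "1.03"},
    "InternVL2-4B": {"8": "1.10", "16": "1.18"},
    "InternVL2-8B": {"8": "1.19", "16": "1.28"},
    "InternVL2-26B": {"8": "1.27", "16": "1.41"},
    "InternVL2-40B": {"8": "1.32", "16": "1.45"},
    "InternVL2-76B": {"8": "1.37", "16": "1.52"},
}


def frame_gain(scores):
    """Difference between the 16-frame and 8-frame score, rounded to two decimals."""
    return round(float(scores["16"]) - float(scores["8"]), 2)


gains = {model: frame_gain(scores) for model, scores in overall.items()}
for model, gain in gains.items():
    print(f"{model}: {gain:+.2f}")
```

As noted at the top of this page, small deviations from these numbers are expected across toolkit versions and hardware, so the gains are indicative rather than exact.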
MathVision#
The MathVision (MATH-V) dataset is a benchmark for evaluating the mathematical reasoning capabilities of multimodal large language models. It contains 3,040 high-quality mathematical problems, each paired with a visual context and sourced from real math competitions. The problems span 16 mathematical disciplines, including algebra, geometry, topology, and graph theory, and are graded into five difficulty levels, so the benchmark assesses both visual perception and reasoning. The commands below evaluate on the MathVision_MINI subset, which, as the result tables show, contains 304 problems (19 per discipline).
torchrun --nproc-per-node=8 run.py --model InternVL2-1B --data MathVision_MINI
The expected test results are:
-- ------------------------ --- --- -- -------- --------
0 Overall 304 100 37 32.8947 12.1711
1 algebra 19 5 1 26.3158 5.26316
2 analytic geometry 19 5 3 26.3158 15.7895
3 arithmetic 19 4 2 21.0526 10.5263
4 combinatorial geometry 19 7 2 36.8421 10.5263
5 combinatorics 19 1 3 5.26316 15.7895
6 counting 19 1 2 5.26316 10.5263
7 descriptive geometry 19 10 4 52.6316 21.0526
8 graph theory 19 7 2 36.8421 10.5263
9 logic 19 6 3 31.5789 15.7895
10 metric geometry - angle 19 10 4 52.6316 21.0526
11 metric geometry - area 19 8 1 42.1053 5.26316
12 metric geometry - length 19 8 3 42.1053 15.7895
13 solid geometry 19 6 0 31.5789 0
14 statistics 19 6 2 31.5789 10.5263
15 topology 19 8 2 42.1053 10.5263
16 transformation geometry 19 8 3 42.1053 15.7895
-- ------------------------ --- --- -- -------- --------
torchrun --nproc-per-node=8 run.py --model InternVL2-2B --data MathVision_MINI
The expected test results are:
-- ------------------------ --- --- -- -------- --------
0 Overall 304 100 48 32.8947 15.7895
1 algebra 19 6 1 31.5789 5.26316
2 analytic geometry 19 7 2 36.8421 10.5263
3 arithmetic 19 4 1 21.0526 5.26316
4 combinatorial geometry 19 5 5 26.3158 26.3158
5 combinatorics 19 1 1 5.26316 5.26316
6 counting 19 0 2 0 10.5263
7 descriptive geometry 19 8 4 42.1053 21.0526
8 graph theory 19 3 4 15.7895 21.0526
9 logic 19 9 5 47.3684 26.3158
10 metric geometry - angle 19 11 4 57.8947 21.0526
11 metric geometry - area 19 8 3 42.1053 15.7895
12 metric geometry - length 19 10 4 52.6316 21.0526
13 solid geometry 19 6 1 31.5789 5.26316
14 statistics 19 7 5 36.8421 26.3158
15 topology 19 5 1 26.3158 5.26316
16 transformation geometry 19 10 5 52.6316 26.3158
-- ------------------------ --- --- -- -------- --------
torchrun --nproc-per-node=8 run.py --model InternVL2-4B --data MathVision_MINI
The expected test results are:
-- ------------------------ --- -- -- -------- --------
0 Overall 304 89 54 29.2763 17.7632
1 algebra 19 4 4 21.0526 21.0526
2 analytic geometry 19 7 4 36.8421 21.0526
3 arithmetic 19 1 4 5.26316 21.0526
4 combinatorial geometry 19 6 2 31.5789 10.5263
5 combinatorics 19 1 2 5.26316 10.5263
6 counting 19 0 5 0 26.3158
7 descriptive geometry 19 8 5 42.1053 26.3158
8 graph theory 19 6 2 31.5789 10.5263
9 logic 19 8 2 42.1053 10.5263
10 metric geometry - angle 19 10 6 52.6316 31.5789
11 metric geometry - area 19 7 5 36.8421 26.3158
12 metric geometry - length 19 11 2 57.8947 10.5263
13 solid geometry 19 7 2 36.8421 10.5263
14 statistics 19 4 5 21.0526 26.3158
15 topology 19 6 1 31.5789 5.26316
16 transformation geometry 19 3 3 15.7895 15.7895
-- ------------------------ --- -- -- -------- --------
torchrun --nproc-per-node=8 run.py --model InternVL2-8B --data MathVision_MINI
The expected test results are:
-- ------------------------ --- --- -- -------- -------
0 Overall 304 104 62 34.2105 20.3947
1 algebra 19 4 4 21.0526 21.0526
2 analytic geometry 19 4 3 21.0526 15.7895
3 arithmetic 19 2 4 10.5263 21.0526
4 combinatorial geometry 19 9 6 47.3684 31.5789
5 combinatorics 19 1 3 5.26316 15.7895
6 counting 19 2 4 10.5263 21.0526
7 descriptive geometry 19 11 4 57.8947 21.0526
8 graph theory 19 6 2 31.5789 10.5263
9 logic 19 10 2 52.6316 10.5263
10 metric geometry - angle 19 7 4 36.8421 21.0526
11 metric geometry - area 19 7 7 36.8421 36.8421
12 metric geometry - length 19 7 2 36.8421 10.5263
13 solid geometry 19 8 4 42.1053 21.0526
14 statistics 19 6 4 31.5789 21.0526
15 topology 19 11 5 57.8947 26.3158
16 transformation geometry 19 9 4 47.3684 21.0526
-- ------------------------ --- --- -- -------- -------
torchrun --nproc-per-node=8 run.py --model InternVL2-26B --data MathVision_MINI
The expected test results are:
-- ------------------------ --- --- -- -------- --------
0 Overall 304 105 71 34.5395 23.3553
1 algebra 19 6 3 31.5789 15.7895
2 analytic geometry 19 6 7 31.5789 36.8421
3 arithmetic 19 4 4 21.0526 21.0526
4 combinatorial geometry 19 4 3 21.0526 15.7895
5 combinatorics 19 4 6 21.0526 31.5789
6 counting 19 1 3 5.26316 15.7895
7 descriptive geometry 19 7 4 36.8421 21.0526
8 graph theory 19 5 5 26.3158 26.3158
9 logic 19 11 7 57.8947 36.8421
10 metric geometry - angle 19 9 3 47.3684 15.7895
11 metric geometry - area 19 9 7 47.3684 36.8421
12 metric geometry - length 19 10 3 52.6316 15.7895
13 solid geometry 19 6 1 31.5789 5.26316
14 statistics 19 8 7 42.1053 36.8421
15 topology 19 10 5 52.6316 26.3158
16 transformation geometry 19 5 3 26.3158 15.7895
-- ------------------------ --- --- -- -------- --------
torchrun --nproc-per-node=8 run.py --model InternVL2-40B --data MathVision_MINI
The expected test results are:
-- ------------------------ --- --- -- -------- -------
0 Overall 304 100 65 32.8947 21.3816
1 algebra 19 6 4 31.5789 21.0526
2 analytic geometry 19 7 5 36.8421 26.3158
3 arithmetic 19 4 8 21.0526 42.1053
4 combinatorial geometry 19 3 6 15.7895 31.5789
5 combinatorics 19 0 4 0 21.0526
6 counting 19 1 2 5.26316 10.5263
7 descriptive geometry 19 8 2 42.1053 10.5263
8 graph theory 19 6 3 31.5789 15.7895
9 logic 19 8 4 42.1053 21.0526
10 metric geometry - angle 19 10 5 52.6316 26.3158
11 metric geometry - area 19 8 2 42.1053 10.5263
12 metric geometry - length 19 10 3 52.6316 15.7895
13 solid geometry 19 6 3 31.5789 15.7895
14 statistics 19 10 6 52.6316 31.5789
15 topology 19 7 4 36.8421 21.0526
16 transformation geometry 19 6 4 31.5789 21.0526
-- ------------------------ --- --- -- -------- -------
torchrun --nproc-per-node=1 run.py --model InternVL2-76B --data MathVision_MINI
The expected test results are:
-- ------------------------ --- --- -- -------- -------
0 Overall 304 102 72 33.5526 23.6842
1 algebra 19 1 3 5.26316 15.7895
2 analytic geometry 19 6 8 31.5789 42.1053
3 arithmetic 19 5 7 26.3158 36.8421
4 combinatorial geometry 19 7 2 36.8421 10.5263
5 combinatorics 19 1 4 5.26316 21.0526
6 counting 19 0 3 0 15.7895
7 descriptive geometry 19 9 2 47.3684 10.5263
8 graph theory 19 6 3 31.5789 15.7895
9 logic 19 8 5 42.1053 26.3158
10 metric geometry - angle 19 11 5 57.8947 26.3158
11 metric geometry - area 19 9 5 47.3684 26.3158
12 metric geometry - length 19 10 5 52.6316 26.3158
13 solid geometry 19 6 5 31.5789 26.3158
14 statistics 19 6 8 31.5789 42.1053
15 topology 19 7 4 36.8421 21.0526
16 transformation geometry 19 10 3 52.6316 15.7895
-- ------------------------ --- --- -- -------- -------
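The tables above are printed without a header row. Numerically, the last two columns are the fourth and fifth columns divided by the third and multiplied by 100. The short check below reproduces the `Overall` rows under that reading. The names `count_a`, `count_b`, `rate_a`, and `rate_b` are placeholders chosen for this sketch, because the page does not state what each column means.

```python
# Overall rows copied from the tables above:
# (total, count_a, count_b, rate_a, rate_b). Column names are placeholders.
overall_rows = {
    "InternVL2-1B": (304, 100, 37, 32.8947, 12.1711),
    "InternVL2-2B": (304, 100, 48, 32.8947, 15.7895),
    "InternVL2-4B": (304, 89, 54, 29.2763, 17.7632),
    "InternVL2-8B": (304, 104, 62, 34.2105, 20.3947),
    "InternVL2-26B": (304, 105, 71, 34.5395, 23.3553),
    "InternVL2-40B": (304, 100, 65, 32.8947, 21.3816),
    "InternVL2-76B": (304, 102, 72, 33.5526, 23.6842),
}


def percent(count, total):
    """Share of `count` in `total`, expressed in percent."""
    return 100.0 * count / total


for model, (total, count_a, count_b, rate_a, rate_b) in overall_rows.items():
    # Each printed rate should match the recomputed one up to print rounding.
    assert abs(percent(count_a, total) - rate_a) < 1e-3
    assert abs(percent(count_b, total) - rate_b) < 1e-3
    print(f"{model}: {percent(count_a, total):.4f} {percent(count_b, total):.4f}")
```

The same relation holds per discipline, where each row has a total of 19, for example 1 / 19 × 100 ≈ 5.26316.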
BLINK#
The BLINK dataset is a benchmark that challenges MLLMs on core visual perception tasks that other benchmarks typically do not cover. It reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, each paired with one or more images and visual prompts. The tasks include relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning. Humans can generally solve them quickly, but they remain significantly challenging for current multimodal LLMs.
torchrun --nproc-per-node=8 run.py --model InternVL2-1B --data BLINK
The expected test results are:
2024-08-02 13:47:04,164 - RUN - INFO - The evaluation of model InternVL2-1B x dataset BLINK has finished!
2024-08-02 13:47:04,164 - RUN - INFO - Evaluation Results:
2024-08-02 13:47:04,166 - RUN - INFO -
------------------------- -------------------
split none
Overall 0.3855865334034719
Art_Style 0.4700854700854701
Counting 0.325
Forensic_Detection 0.25
Functional_Correspondence 0.26153846153846155
IQ_Test 0.2866666666666667
Jigsaw 0.5266666666666666
Multi-view_Reasoning 0.44360902255639095
Object_Localization 0.4918032786885246
Relative_Depth 0.49193548387096775
Relative_Reflectance 0.3283582089552239
Semantic_Correspondence 0.2446043165467626
Spatial_Relation 0.5664335664335665
Visual_Correspondence 0.27325581395348836
Visual_Similarity 0.4740740740740741
------------------------- -------------------
torchrun --nproc-per-node=8 run.py --model InternVL2-2B --data BLINK
The expected test results are:
2024-08-02 13:46:22,686 - RUN - INFO - The evaluation of model InternVL2-2B x dataset BLINK has finished!
2024-08-02 13:46:22,686 - RUN - INFO - Evaluation Results:
2024-08-02 13:46:22,689 - RUN - INFO -
------------------------- -------------------
split none
Overall 0.43766438716465017
Art_Style 0.5299145299145299
Counting 0.4666666666666667
Forensic_Detection 0.2803030303030303
Functional_Correspondence 0.23076923076923078
IQ_Test 0.2866666666666667
Jigsaw 0.47333333333333333
Multi-view_Reasoning 0.556390977443609
Object_Localization 0.36885245901639346
Relative_Depth 0.6048387096774194
Relative_Reflectance 0.39552238805970147
Semantic_Correspondence 0.3669064748201439
Spatial_Relation 0.7622377622377622
Visual_Correspondence 0.3313953488372093
Visual_Similarity 0.5111111111111111
------------------------- -------------------
torchrun --nproc-per-node=8 run.py --model InternVL2-4B --data BLINK
The expected test results are:
2024-08-02 13:34:06,982 - RUN - INFO - The evaluation of model InternVL2-4B x dataset BLINK has finished!
2024-08-02 13:34:06,982 - RUN - INFO - Evaluation Results:
2024-08-02 13:34:06,984 - RUN - INFO -
------------------------- -------------------
split none
Overall 0.46081009994739613
Art_Style 0.5897435897435898
Counting 0.55
Forensic_Detection 0.32575757575757575
Functional_Correspondence 0.25384615384615383
IQ_Test 0.23333333333333334
Jigsaw 0.48
Multi-view_Reasoning 0.556390977443609
Object_Localization 0.5245901639344263
Relative_Depth 0.6370967741935484
Relative_Reflectance 0.3283582089552239
Semantic_Correspondence 0.2805755395683453
Spatial_Relation 0.8111888111888111
Visual_Correspondence 0.36046511627906974
Visual_Similarity 0.5925925925925926
------------------------- -------------------
torchrun --nproc-per-node=8 run.py --model InternVL2-8B --data BLINK
The expected test results are:
2024-08-02 13:28:10,915 - RUN - INFO - The evaluation of model InternVL2-8B x dataset BLINK has finished!
2024-08-02 13:28:10,915 - RUN - INFO - Evaluation Results:
2024-08-02 13:28:10,917 - RUN - INFO -
------------------------- -------------------
split none
Overall 0.5086796422935297
Art_Style 0.7094017094017094
Counting 0.75
Forensic_Detection 0.3484848484848485
Functional_Correspondence 0.17692307692307693
IQ_Test 0.30666666666666664
Jigsaw 0.5466666666666666
Multi-view_Reasoning 0.48872180451127817
Object_Localization 0.5573770491803278
Relative_Depth 0.7419354838709677
Relative_Reflectance 0.39552238805970147
Semantic_Correspondence 0.26618705035971224
Spatial_Relation 0.7972027972027972
Visual_Correspondence 0.36046511627906974
Visual_Similarity 0.7851851851851852
------------------------- -------------------
torchrun --nproc-per-node=8 run.py --model InternVL2-26B --data BLINK
The expected test results are:
2024-08-02 13:00:51,453 - RUN - INFO - The evaluation of model InternVL2-26B x dataset BLINK has finished!
2024-08-02 13:00:51,453 - RUN - INFO - Evaluation Results:
2024-08-02 13:00:51,455 - RUN - INFO -
------------------------- -------------------
split none
Overall 0.5623356128353498
Art_Style 0.7606837606837606
Counting 0.675
Forensic_Detection 0.45454545454545453
Functional_Correspondence 0.3
IQ_Test 0.30666666666666664
Jigsaw 0.7466666666666667
Multi-view_Reasoning 0.41353383458646614
Object_Localization 0.5737704918032787
Relative_Depth 0.782258064516129
Relative_Reflectance 0.3582089552238806
Semantic_Correspondence 0.4172661870503597
Spatial_Relation 0.8461538461538461
Visual_Correspondence 0.47674418604651164
Visual_Similarity 0.8222222222222222
------------------------- -------------------
torchrun --nproc-per-node=8 run.py --model InternVL2-40B --data BLINK
The expected test results are:
2024-08-02 14:03:54,291 - RUN - INFO - The evaluation of model InternVL2-40B x dataset BLINK has finished!
2024-08-02 14:03:54,291 - RUN - INFO - Evaluation Results:
2024-08-02 14:03:54,292 - RUN - INFO -
------------------------- -------------------
split none
Overall 0.5718043135192005
Art_Style 0.6923076923076923
Counting 0.7166666666666667
Forensic_Detection 0.44696969696969696
Functional_Correspondence 0.25384615384615383
IQ_Test 0.22666666666666666
Jigsaw 0.8
Multi-view_Reasoning 0.5639097744360902
Object_Localization 0.5819672131147541
Relative_Depth 0.7903225806451613
Relative_Reflectance 0.3880597014925373
Semantic_Correspondence 0.41007194244604317
Spatial_Relation 0.8461538461538461
Visual_Correspondence 0.4941860465116279
Visual_Similarity 0.8518518518518519
------------------------- -------------------
torchrun --nproc-per-node=1 run.py --model InternVL2-76B --data BLINK
The expected test results are:
2024-08-02 16:08:58,199 - RUN - INFO - The evaluation of model InternVL2-76B x dataset BLINK has finished!
2024-08-02 16:08:58,199 - RUN - INFO - Evaluation Results:
2024-08-02 16:08:58,200 - RUN - INFO -
------------------------- -------------------
split none
Overall 0.5681220410310363
Art_Style 0.6581196581196581
Counting 0.7
Forensic_Detection 0.42424242424242425
Functional_Correspondence 0.3
IQ_Test 0.2733333333333333
Jigsaw 0.74
Multi-view_Reasoning 0.5639097744360902
Object_Localization 0.5245901639344263
Relative_Depth 0.782258064516129
Relative_Reflectance 0.30597014925373134
Semantic_Correspondence 0.4028776978417266
Spatial_Relation 0.8391608391608392
Visual_Correspondence 0.6802325581395349
Visual_Similarity 0.7555555555555555
------------------------- -------------------
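Since small deviations between runs are expected, a reproduction is more usefully checked against a tolerance than for exact equality. The sketch below parses a two-column results table as printed in the log and lists the categories that differ from the reference by more than a threshold. The helper names, the 0.02 threshold, and the "measured" numbers are hypothetical choices for illustration and are not part of VLMEvalKit.

```python
# Excerpt of the InternVL2-8B BLINK table above, used as reference values.
REFERENCE = """\
-------------------------  -------------------
split                      none
Overall                    0.5086796422935297
Counting                   0.75
Jigsaw                     0.5466666666666666
-------------------------  -------------------
"""


def parse_table(text):
    """Parse 'name value' lines into a dict, skipping rule lines and the 'split' row."""
    scores = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2 or parts[0] == "split":
            continue
        try:
            scores[parts[0]] = float(parts[1])
        except ValueError:
            continue  # rule lines such as '----- -----' are not numeric
    return scores


def deviations(reference, measured, tol=0.02):
    """Return {category: measured - reference} where the gap exceeds tol."""
    return {
        name: measured[name] - ref
        for name, ref in reference.items()
        if name in measured and abs(measured[name] - ref) > tol
    }


reference = parse_table(REFERENCE)
# Hypothetical numbers standing in for a local re-run; not real results.
measured = {"Overall": 0.505, "Counting": 0.70, "Jigsaw": 0.55}
print(deviations(reference, measured))
```

With these placeholder inputs only the Counting category is reported, because it is the only one whose gap exceeds the chosen threshold.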
MTVQA#
MTVQA (Multilingual Text-Centric Visual Question Answering) is a benchmark for multilingual text-centric VQA (TEC-VQA) with high-quality human expert annotations across nine languages. The results below report one score per language (AR, DE, FR, IT, JA, KR, RU, TH, VI) together with an overall Average.
torchrun --nproc-per-node=8 run.py --model InternVL2-1B --data MTVQA_TEST
The expected test results are:
{
"AR": 1.991465149359886,
"Average": 12.570079669519032,
"DE": 21.85114503816794,
"FR": 20.54176072234763,
"IT": 22.39819004524887,
"JA": 6.159420289855073,
"KR": 8.422939068100359,
"RU": 3.571428571428571,
"TH": 2.1645021645021645,
"VI": 11.199095022624435
}
torchrun --nproc-per-node=8 run.py --model InternVL2-2B --data MTVQA_TEST
The expected test results are:
{
"AR": 1.422475106685633,
"Average": 10.88816760106226,
"DE": 15.744274809160306,
"FR": 19.751693002257337,
"IT": 21.380090497737555,
"JA": 7.367149758454106,
"KR": 5.913978494623656,
"RU": 3.0423280423280423,
"TH": 0.8658008658008658,
"VI": 9.049773755656108
}
torchrun --nproc-per-node=8 run.py --model InternVL2-4B --data MTVQA_TEST
The expected test results are:
{
"AR": 1.849217638691323,
"Average": 15.34375922100915,
"DE": 24.904580152671755,
"FR": 30.81264108352145,
"IT": 26.923076923076923,
"JA": 8.091787439613526,
"KR": 8.064516129032258,
"RU": 3.7037037037037033,
"TH": 3.463203463203463,
"VI": 12.104072398190045
}
torchrun --nproc-per-node=8 run.py --model InternVL2-8B --data MTVQA_TEST
The expected test results are:
{
"AR": 2.418207681365576,
"Average": 18.102685157863675,
"DE": 28.435114503816795,
"FR": 33.972911963882616,
"IT": 30.20361990950226,
"JA": 8.57487922705314,
"KR": 10.931899641577061,
"RU": 5.158730158730158,
"TH": 6.926406926406926,
"VI": 17.760180995475114
}
torchrun --nproc-per-node=8 run.py --model InternVL2-26B --data MTVQA_TEST
The expected test results are:
{
"AR": 3.982930298719772,
"Average": 17.71909117733845,
"DE": 28.053435114503817,
"FR": 26.52370203160271,
"IT": 30.316742081447963,
"JA": 9.903381642512077,
"KR": 11.29032258064516,
"RU": 6.613756613756613,
"TH": 8.225108225108226,
"VI": 18.32579185520362
}
torchrun --nproc-per-node=8 run.py --model InternVL2-40B --data MTVQA_TEST
The expected test results are:
{
"AR": 4.551920341394026,
"Average": 20.61079964591325,
"DE": 30.62977099236641,
"FR": 36.455981941309254,
"IT": 34.61538461538461,
"JA": 10.748792270531402,
"KR": 13.261648745519713,
"RU": 6.481481481481481,
"TH": 5.627705627705628,
"VI": 21.49321266968326
}
torchrun --nproc-per-node=1 run.py --model InternVL2-76B --data MTVQA_TEST
The expected test results are:
{
"AR": 9.53058321479374,
"Average": 22.794334611979934,
"DE": 31.297709923664126,
"FR": 35.66591422121896,
"IT": 35.18099547511312,
"JA": 11.11111111111111,
"KR": 14.336917562724013,
"RU": 11.904761904761903,
"TH": 9.956709956709958,
"VI": 26.923076923076923
}
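When comparing runs, note that the reported Average is not the unweighted mean of the nine per-language scores. The sketch below loads the InternVL2-1B numbers shown above and computes the unweighted mean for comparison. One plausible explanation for the gap, which is an assumption and not something this document states, is that the Average is computed over all questions, so languages with more questions carry more weight.

```python
import json
from statistics import mean

# Expected InternVL2-1B results on MTVQA_TEST, copied from the output above.
RESULTS_JSON = """
{
  "AR": 1.991465149359886,
  "Average": 12.570079669519032,
  "DE": 21.85114503816794,
  "FR": 20.54176072234763,
  "IT": 22.39819004524887,
  "JA": 6.159420289855073,
  "KR": 8.422939068100359,
  "RU": 3.571428571428571,
  "TH": 2.1645021645021645,
  "VI": 11.199095022624435
}
"""

results = json.loads(RESULTS_JSON)

# Keep only the per-language entries.
per_language = {k: v for k, v in results.items() if k != "Average"}

# Unweighted (macro) mean over the nine languages.
macro_mean = mean(per_language.values())

print(f"languages:       {len(per_language)}")
print(f"unweighted mean: {macro_mean:.2f}")
print(f"reported:        {results['Average']:.2f}")
```

For this model the unweighted mean comes out lower than the reported Average, so the two should not be used interchangeably when comparing against the numbers in this section.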
Citation#
If you find this project useful in your research, please consider citing:
@article{chen2024far,
title={How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites},
author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
journal={Science China Information Sciences},
volume={67},
number={12},
pages={220101},
year={2024},
publisher={Springer}
}
@inproceedings{chen2024internvl,
title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={24185--24198},
year={2024}
}