Evaluation of InternVL2 Series#

To evaluate the performance of the InternVL2 series across various tasks, follow the instructions for each specific dataset. Ensure that the appropriate number of GPUs is allocated as specified.

1⃣️ We use both the InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results reported for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were obtained using the InternVL repository. OCRBench, RealWorldQA, HallusionBench, and MathVista were evaluated using VLMEvalKit.

2⃣️ Please note that evaluating the same model using different testing toolkits like InternVL and VLMEvalKit can result in slight differences, which is normal. Updates to code versions and variations in environment and hardware can also cause minor discrepancies in results.

3⃣️ Note that the dataset descriptions were generated by GPT-4 and may contain errors.

Model Preparation#

| model name | type | param | download | size |
| --- | --- | --- | --- | --- |
| InternVL2-1B | MLLM | 0.9B | 🤗 HF link | 1.8 GB |
| InternVL2-2B | MLLM | 2.2B | 🤗 HF link | 4.2 GB |
| InternVL2-4B | MLLM | 4.2B | 🤗 HF link | 7.8 GB |
| InternVL2-8B | MLLM | 8.1B | 🤗 HF link | 16 GB |
| InternVL2-26B | MLLM | 25.5B | 🤗 HF link | 48 GB |
| InternVL2-40B | MLLM | 40.1B | 🤗 HF link | 75 GB |
| InternVL2-Llama3-76B | MLLM | 76.3B | 🤗 HF link | 143 GB |

Before evaluation, download the trained model we provide.

cd pretrained/
# pip install -U huggingface_hub
# Download OpenGVLab/InternVL2-1B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-1B --local-dir InternVL2-1B
# Download OpenGVLab/InternVL2-2B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-2B --local-dir InternVL2-2B
# Download OpenGVLab/InternVL2-4B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-4B --local-dir InternVL2-4B
# Download OpenGVLab/InternVL2-8B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-8B --local-dir InternVL2-8B
# Download OpenGVLab/InternVL2-26B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-26B --local-dir InternVL2-26B
# Download OpenGVLab/InternVL2-40B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-40B --local-dir InternVL2-40B
# Download OpenGVLab/InternVL2-Llama3-76B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-Llama3-76B --local-dir InternVL2-Llama3-76B

The directory structure is:

pretrained
├── InternVL2-1B
├── InternVL2-2B
├── InternVL2-4B
├── InternVL2-8B
├── InternVL2-26B
├── InternVL2-40B
└── InternVL2-Llama3-76B
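
Before launching an evaluation, it can help to confirm that every checkpoint landed in the expected place. The following is a minimal sketch and is not part of the InternVL repository: the helper name `missing_checkpoints` is hypothetical, and it only checks that the directories listed above exist, not that their contents are complete.

```python
from pathlib import Path

# Directory names taken from the structure listed above.
EXPECTED_MODELS = [
    "InternVL2-1B",
    "InternVL2-2B",
    "InternVL2-4B",
    "InternVL2-8B",
    "InternVL2-26B",
    "InternVL2-40B",
    "InternVL2-Llama3-76B",
]


def missing_checkpoints(root="pretrained", expected=EXPECTED_MODELS):
    """Return the expected model directories that are absent under `root`.

    Hypothetical helper for illustration; it does not validate file contents.
    """
    root_path = Path(root)
    return [name for name in expected if not (root_path / name).is_dir()]


missing = missing_checkpoints()
if missing:
    print("Missing checkpoints:", ", ".join(missing))
else:
    print("All checkpoints found.")
```

If you only plan to evaluate a subset of the models, pass a shorter list as `expected`.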

Evaluation using InternVL Codebase#

Data Preparation#

Please prepare the evaluation data according to the guidance provided here.

MME#

MME is a comprehensive benchmark designed to evaluate Multimodal Large Language Models (MLLMs) on both perception and cognition abilities across 14 different subtasks, ensuring robust and diverse testing of these models.

Please use the following command to perform the test with 1 GPU:

GPUS=1 sh evaluate.sh pretrained/InternVL2-1B mme --dynamic

The expected test results are:

=========== Perception ===========
total score: 1346.1990796318528

         existence  score: 175.0
         count  score: 113.33333333333334
         position  score: 135.0
         color  score: 138.33333333333331
         posters  score: 116.32653061224491
         celebrity  score: 144.70588235294116
         scene  score: 143.25
         landmark  score: 128.5
         artwork  score: 141.75
         OCR  score: 110.0


=========== Cognition ===========
total score: 448.2142857142857

         commonsense_reasoning  score: 95.71428571428571
         numerical_calculation  score: 57.5
         text_translation  score: 177.5
         code_reasoning  score: 117.5
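
In the output above, each total equals the sum of its subtask scores (10 perception subtasks and 4 cognition subtasks). The sketch below re-derives both totals from the printed lines. It is an illustration only: the function name `total_score` is hypothetical, and the parsing assumes the `name  score: value` layout shown above.

```python
import re

# Subtask lines copied from the expected output above.
PERCEPTION = """
existence  score: 175.0
count  score: 113.33333333333334
position  score: 135.0
color  score: 138.33333333333331
posters  score: 116.32653061224491
celebrity  score: 144.70588235294116
scene  score: 143.25
landmark  score: 128.5
artwork  score: 141.75
OCR  score: 110.0
"""

COGNITION = """
commonsense_reasoning  score: 95.71428571428571
numerical_calculation  score: 57.5
text_translation  score: 177.5
code_reasoning  score: 117.5
"""


def total_score(block):
    """Sum every 'score: <number>' entry in a block of MME output."""
    return sum(float(v) for v in re.findall(r"score:\s*([0-9.]+)", block))


print("Perception total:", total_score(PERCEPTION))
print("Cognition total:", total_score(COGNITION))
```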

OKVQA#

OKVQA (Outside Knowledge Visual Question Answering) is a dataset designed for visual question answering tasks that require external knowledge beyond what is visible in the image, featuring over 14,000 questions to evaluate the reasoning abilities of AI models.

Please use the following command to perform the test with 8 GPUs:

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-okvqa-val --dynamic

The expected test results are:

okvqa_val 0.48513674197383483

TextVQA#

TextVQA is a dataset designed to evaluate visual question answering models by requiring them to read and reason about text present within images, containing 45,336 questions over 28,408 images from the OpenImages dataset.

The TextVQA dataset provides official OCR results, specifically Rosetta OCR tokens. During testing with InstructBLIP and LLaVA 1.5, the OCR results are input to the LLM as a prompt. If you want to input Rosetta OCR tokens, use the following command:

We do not use Rosetta OCR tokens. To evaluate without them, run this command:

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-textvqa-val --dynamic

The expected test results are:

textvqa_val 0.7052400000000033

VizWiz#

The VizWiz VQA dataset is a visual question answering dataset created to help answer visual questions posed by blind individuals. It contains over 31,000 visual questions, where users took a picture using a mobile phone and recorded a spoken question about it. Each question comes with 10 crowdsourced answers. This dataset addresses tasks such as predicting the answer to a visual question and determining whether a visual question can be answered.

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-vizwiz-val --dynamic

The expected test results are:

vizwiz_val 0.5306783977772626

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-vizwiz-test --dynamic

For the test set, submit the results to the evaluation server.

ChartQA#

The ChartQA dataset is a comprehensive benchmark for question answering about charts that involves both visual and logical reasoning. It includes a mix of 9.6K human-written questions and 23.1K machine-generated questions derived from chart summaries. This dataset is designed to evaluate models that can understand and analyze charts by answering complex questions that often require multiple logical and arithmetic operations, as well as referencing visual features of the charts.

The ChartQA dataset includes two test sets: chartqa_test_human and chartqa_test_augmented. The final score is the average of the scores on these two test sets. Run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-chartqa-test --dynamic --max-num 12

The expected test results are:

['chartqa_test_human', {'relaxed_accuracy': 0.5392}]
['chartqa_test_augmented', {'relaxed_accuracy': 0.9184}]

result = (53.92 + 91.84) / 2 = 72.88
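
The averaging step can be reproduced with a few lines of Python. This is a minimal sketch: the `results` list mirrors the printed output above, and the function name `chartqa_score` is hypothetical rather than part of the evaluation script.

```python
# Relaxed-accuracy values copied from the expected output above.
results = [
    ["chartqa_test_human", {"relaxed_accuracy": 0.5392}],
    ["chartqa_test_augmented", {"relaxed_accuracy": 0.9184}],
]


def chartqa_score(results):
    """Average relaxed accuracy over the test splits, in percent."""
    accs = [metrics["relaxed_accuracy"] for _, metrics in results]
    return 100 * sum(accs) / len(accs)


print(f"{chartqa_score(results):.2f}")  # 72.88
```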

DocVQA#

The DocVQA dataset consists of 50,000 questions on 12,000+ document images. It is designed for visual question answering tasks where questions are answered using text within the document images. The dataset includes OCR transcriptions and ground truth answers, supporting evaluation of models that interpret and extract information from documents.

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-docvqa-val --dynamic --max-num 18

The expected test results are:

Overall ANLS: 0.7999

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-docvqa-test --dynamic --max-num 18

For the test set, submit the results to the evaluation server.

The expected test results are:

Overall ANLS: 0.8170

AI2D#

The AI2D dataset contains over 5,000 grade school science diagrams with extensive annotations and 15,000 multiple-choice questions for research on diagram understanding and question answering.

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-ai2d-test --dynamic

The expected test results are:

ai2diagram_test {'accuracy': 0.6408678756476683}

InfographicVQA#

The InfographicVQA dataset is a collection of infographics accompanied by natural language questions and answers. This dataset includes a diverse range of infographics sourced from thousands of different websites, ensuring a variety of layouts and designs. It comprises 30,035 questions across 5,485 images, split into training, validation, and test sets.

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-infovqa-val --dynamic --max-num 24

The expected test results are:

Overall ANLS: 0.5018

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-infovqa-test --dynamic --max-num 24

For the test set, submit the results to the evaluation server.

The expected test results are:

Overall ANLS: 0.5090

GQA#

The GQA dataset is a large-scale visual question answering dataset designed for real-world visual reasoning and compositional question answering. It contains over 22 million questions grounded in real images, each accompanied by detailed scene graphs that describe objects, their attributes, and relationships within the scene. The dataset includes images from the Visual Genome dataset, with questions that require various reasoning skills such as spatial understanding and multi-step inference.

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B vqa-gqa-testdev --dynamic

The expected test results are:

Accuracy: 59.77%

POPE#

The POPE (Polling-based Object Probing Evaluation) dataset is designed to evaluate object hallucination in MLLMs. The dataset consists of 3,000 questions related to the captions of 500 images. By treating the MLLMs’ answers to these questions as a binary classification task, the dataset allows researchers to measure accuracy, precision, recall, and F1 scores to determine the extent of hallucination in the models.

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B pope --dynamic

The expected test results are:

Category: random, # samples: 2910
TP      FP      TN      FN
1239    51      1359    261
Accuracy: 0.8927835051546392
Precision: 0.9604651162790697
Recall: 0.826
F1 score: 0.8881720430107527
Yes ratio: 0.44329896907216493
0.888, 0.893, 0.960, 0.826, 0.443
====================================
Category: popular, # samples: 3000
TP      FP      TN      FN
1239    93      1407    261
Accuracy: 0.882
Precision: 0.9301801801801802
Recall: 0.826
F1 score: 0.875
Yes ratio: 0.444
0.875, 0.882, 0.930, 0.826, 0.444
====================================
Category: adversarial, # samples: 3000
TP      FP      TN      FN
1239    151     1349    261
Accuracy: 0.8626666666666667
Precision: 0.8913669064748202
Recall: 0.826
F1 score: 0.8574394463667819
Yes ratio: 0.4633333333333333
0.857, 0.863, 0.891, 0.826, 0.463
====================================

result = (88.8 + 87.5 + 85.7) / 3 = 87.3
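
Every metric in the output above follows from the four confusion-matrix counts, with "yes" treated as the positive class. The sketch below recomputes them and then averages the F1 scores. It is an illustration, not the repository's scoring code: `pope_metrics` is a hypothetical name, and the per-category F1 is rounded to one decimal before averaging because that is how the final figure above is formed.

```python
def pope_metrics(tp, fp, tn, fn):
    """Binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "yes_ratio": (tp + fp) / total,  # share of "yes" predictions
    }


# (TP, FP, TN, FN) copied from the expected output above.
counts = {
    "random": (1239, 51, 1359, 261),
    "popular": (1239, 93, 1407, 261),
    "adversarial": (1239, 151, 1349, 261),
}

# Round each F1 (in percent) to one decimal, then average the three categories.
f1_percent = {
    name: round(100 * pope_metrics(*c)["f1"], 1) for name, c in counts.items()
}
average_f1 = sum(f1_percent.values()) / len(f1_percent)
print(f1_percent, f"{average_f1:.1f}")
```

Averaging the unrounded F1 values instead gives a slightly different last digit, so keep the rounding step when comparing against the figure above.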

Tiny LVLM#

The Tiny LVLM-eHub is a streamlined evaluation benchmark designed to assess the multimodal capabilities of MLLMs, including models like Bard. It focuses on six categories of multimodal abilities: visual perception, visual knowledge acquisition, visual reasoning, visual commonsense, object hallucination, and embodied intelligence.

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B tiny_lvlm --dynamic

The expected test results are:

Visual_Knowledge_Acquisition: 0.6857142857142857
Object_Hallucination: 0.91
Visual_Commonsense: 0.556
Visual_Perception: 0.4875
Visual_Reasoning: 0.6145454545454545
Overall: 3.2537597402597402
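
Note that the `Overall` value above is the sum of the five per-category scores, not their mean. A minimal check, with the values copied from the output above:

```python
# Per-category scores copied from the expected output above.
scores = {
    "Visual_Knowledge_Acquisition": 0.6857142857142857,
    "Object_Hallucination": 0.91,
    "Visual_Commonsense": 0.556,
    "Visual_Perception": 0.4875,
    "Visual_Reasoning": 0.6145454545454545,
}

overall = sum(scores.values())  # sum over categories, not an average
print(overall)
```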

MMMU#

The MMMU dataset is a comprehensive benchmark designed to evaluate multimodal models on college-level tasks that require domain-specific knowledge and reasoning. It includes 11,500 questions sourced from college exams, quizzes, and textbooks, spanning six disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions cover 30 subjects and feature 30 types of images, such as charts, diagrams, maps, tables, and more.

For the validation set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmmu-val --dynamic

The expected test results are:

{'Overall-Art and Design': {'num': 120, 'acc': 0.383}, 'Art': {'num': 30, 'acc': 0.4}, 'Art_Theory': {'num': 30, 'acc': 0.4}, 'Design': {'num': 30, 'acc': 0.567}, 'Music': {'num': 30, 'acc': 0.167}, 'Overall-Business': {'num': 150, 'acc': 0.333}, 'Accounting': {'num': 30, 'acc': 0.333}, 'Economics': {'num': 30, 'acc': 0.433}, 'Finance': {'num': 30, 'acc': 0.067}, 'Manage': {'num': 30, 'acc': 0.367}, 'Marketing': {'num': 30, 'acc': 0.467}, 'Overall-Science': {'num': 150, 'acc': 0.3}, 'Biology': {'num': 30, 'acc': 0.267}, 'Chemistry': {'num': 30, 'acc': 0.233}, 'Geography': {'num': 30, 'acc': 0.367}, 'Math': {'num': 30, 'acc': 0.167}, 'Physics': {'num': 30, 'acc': 0.467}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.313}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.433}, 'Clinical_Medicine': {'num': 30, 'acc': 0.233}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.4}, 'Pharmacy': {'num': 30, 'acc': 0.3}, 'Public_Health': {'num': 30, 'acc': 0.2}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.483}, 'History': {'num': 30, 'acc': 0.4}, 'Literature': {'num': 30, 'acc': 0.667}, 'Sociology': {'num': 30, 'acc': 0.467}, 'Psychology': {'num': 30, 'acc': 0.4}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.348}, 'Agriculture': {'num': 30, 'acc': 0.233}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.367}, 'Computer_Science': {'num': 30, 'acc': 0.4}, 'Electronics': {'num': 30, 'acc': 0.4}, 'Energy_and_Power': {'num': 30, 'acc': 0.333}, 'Materials': {'num': 30, 'acc': 0.4}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.3}, 
'Overall': {'num': 900, 'acc': 0.354}}
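
The `Overall` entry is consistent with a sample-weighted average of the six discipline-level accuracies. The sketch below reconstructs it from the `Overall-*` entries in the output above. Because those accuracies are already rounded to three decimals, the reconstruction is approximate, and `weighted_accuracy` is a hypothetical helper name.

```python
# (num, acc) pairs for the six "Overall-*" discipline entries in the output above.
disciplines = {
    "Art and Design": (120, 0.383),
    "Business": (150, 0.333),
    "Science": (150, 0.3),
    "Health and Medicine": (150, 0.313),
    "Humanities and Social Science": (120, 0.483),
    "Tech and Engineering": (210, 0.348),
}


def weighted_accuracy(groups):
    """Return (total samples, accuracy weighted by samples per group)."""
    total = sum(num for num, _ in groups.values())
    correct = sum(num * acc for num, acc in groups.values())
    return total, correct / total


total, overall = weighted_accuracy(disciplines)
print(total, round(overall, 3))
```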

For the test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmmu-test --dynamic

For the test set, submit the results to the evaluation server.

MMVet (GPT-4-0613)#

⚠️ Warning: Here, we use GPT-4-0613 as the judge model, while in VLMEvalKit, GPT-4-Turbo is used as the judge model. Using different versions of GPT-4 can result in significant score variations. Therefore, testing the same model with the two codebases can lead to notable score differences.

The MM-Vet dataset is a comprehensive benchmark designed to evaluate the integrated capabilities of MLLMs. It encompasses six core vision-language (VL) capabilities: recognition, knowledge, optical character recognition (OCR), spatial awareness, language generation, and math. The dataset includes 200 images and 218 questions, each requiring one or more of these capabilities to answer. The evaluation uses an open-ended LLM-based approach, allowing assessment across various answer styles and question types.

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmvet --dynamic

Then, submit the results to the evaluation server. The expected test results are:

runs: [37.8]

MMBench#

The MMBench dataset is a comprehensive multi-modality benchmark designed to evaluate the fine-grained abilities of vision-language models. It contains around 3,000 multiple-choice questions covering 20 ability dimensions, structured into a hierarchical taxonomy. These dimensions include perception and reasoning abilities, further broken down into specific skills like coarse and fine-grained perception, attribute reasoning, and logic reasoning.

For the English dev / test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmbench-dev-en --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmbench-test-en --dynamic

Then, submit the results to the evaluation server. The expected test results are:

mmbench-dev-en: -
mmbench-test-en: 65.4

For the Chinese dev / test set, run:

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmbench-dev-cn --dynamic
GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmbench-test-cn --dynamic

Then, submit the results to the evaluation server. The expected test results are:

mmbench-dev-cn: -
mmbench-test-cn: 60.7

CCBench#

CCBench, a multi-modal benchmark in the domain of Chinese Culture, is designed to evaluate the performance of MLLMs on tasks specifically related to Chinese cultural content.

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B ccbench-dev --dynamic

Then, submit the results to the evaluation server. The expected test results are:

ccbench-dev: 75.7

SEED#

SEED-Bench is a multiple-choice benchmark for evaluating MLLMs on both image and video understanding. It spans 12 evaluation dimensions: scene understanding, instance identity, instance location, instance attributes, instance counting, spatial relations, instance interaction, visual reasoning, text understanding, action recognition, action prediction, and procedure understanding.

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B seed --dynamic

The expected test results are:

Acc@1: 0.6074485825458588
length: 17990
Accuracy for each data type:
Data type Scene Understanding: 73.05%
Data type Instance Identity: 71.16%
Data type Instance Location: 69.23%
Data type Instance Attributes: 58.49%
Data type Instances Counting: 52.55%
Data type Spatial Relation: 43.53%
Data type Instance Interaction: 71.13%
Data type Visual Reasoning: 72.51%
Data type Text Understanding: 68.60%
Data type Action Recognition: 53.55%
Data type Action Prediction: 39.92%
Data type Procedure Understanding: 28.74%
Total accuracy: 60.76%
Image accuracy: 65.62%
Video accuracy: 42.35%

MMVP#

The MMVP dataset is designed to benchmark the performance of multimodal large language models (MLLMs) in visual question answering tasks. This dataset focuses on identifying “CLIP-blind pairs,” which are images that appear similar to the CLIP model despite having clear visual differences. The MMVP dataset includes 300 images derived from ImageNet-1k and LAION-Aesthetics, each paired with straightforward questions to evaluate the models’ visual capabilities. It highlights the challenges these systems face, often leading to incorrect responses and hallucinated explanations.

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mmvp --dynamic

The expected test results are:

Evaluating MMVP ...
Results saved to results/MMVP_240708020850.jsonl
The accuracy is 0.2

RefCOCO Series#

RefCOCO, RefCOCO+, and RefCOCOg are datasets used for tasks involving referring expression comprehension, segmentation, and generation. These datasets are built upon the MSCOCO dataset, and they are essential for evaluating models in natural language processing and computer vision.

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B refcoco --dynamic

The expected test results are:

| Model | avg. | RefCOCO (val) | RefCOCO (testA) | RefCOCO (testB) | RefCOCO+ (val) | RefCOCO+ (testA) | RefCOCO+ (testB) | RefCOCO-g (val) | RefCOCO-g (test) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| InternVL2-1B | 79.9 | 83.6 | 88.7 | 79.8 | 76.0 | 83.6 | 67.7 | 80.2 | 79.9 |
| InternVL2-2B | 77.7 | 82.3 | 88.2 | 75.9 | 73.5 | 82.8 | 63.3 | 77.6 | 78.3 |
| InternVL2-4B | 84.4 | 88.5 | 91.2 | 83.9 | 81.2 | 87.2 | 73.8 | 84.6 | 84.6 |
| InternVL2-8B | 82.9 | 87.1 | 91.1 | 80.7 | 79.8 | 87.9 | 71.4 | 82.7 | 82.7 |
| InternVL2-26B | 88.5 | 91.2 | 93.3 | 87.4 | 86.8 | 91.0 | 81.2 | 88.5 | 88.6 |
| InternVL2-40B | 90.3 | 93.0 | 94.7 | 89.2 | 88.5 | 92.8 | 83.6 | 90.3 | 90.6 |
| InternVL2-Llama3-76B | 90.0 | 92.2 | 94.8 | 88.4 | 88.8 | 93.1 | 82.8 | 89.5 | 90.3 |
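
The `avg.` value for each model is the unweighted mean of its eight split scores. A minimal sketch using the InternVL2-1B scores reported above (the variable names are illustrative):

```python
# InternVL2-1B scores, in the order: RefCOCO val/testA/testB,
# RefCOCO+ val/testA/testB, RefCOCO-g val/test.
internvl2_1b = [83.6, 88.7, 79.8, 76.0, 83.6, 67.7, 80.2, 79.9]

average = sum(internvl2_1b) / len(internvl2_1b)
print(round(average, 1))  # 79.9
```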

MVBench#

MVBench is a comprehensive multimodal video understanding benchmark developed to evaluate the temporal comprehension capabilities of MLLMs. It includes 20 challenging video tasks that require temporal understanding and cannot be effectively solved using a single frame. The benchmark uses a novel static-to-dynamic method, transforming static tasks into dynamic ones to systematically generate video tasks that demand a wide range of temporal skills, from perception to cognition.

We evaluate our models on MVBench by extracting 16 frames from each video and resizing each frame to a 448x448 image.

GPUS=8 sh evaluate.sh pretrained/InternVL2-1B mvbench --dynamic --max-num 1

The expected test results are:

57.9

Evaluation using VLMEvalKit Codebase#

Data Preparation#

VLMEvalKit will automatically download the data for evaluation, so you do not need to prepare it manually.

MathVista#

The MathVista dataset is a comprehensive benchmark for evaluating mathematical reasoning within visual contexts. It consists of three newly created datasets—IQTest, FunctionQA, and PaperQA—designed to address logical reasoning on puzzle test figures, algebraic reasoning over functional plots, and scientific reasoning with academic paper figures, respectively.

torchrun --nproc-per-node=8 run.py --data MathVista_MINI --model InternVL2-1B --verbose

The expected test results are:

"Task&Skill","tot","prefetch","hit","prefetch_rate","acc"
"Overall","1000","489","377","48.9","37.7"
"scientific reasoning","122","85","45","69.67213114754098","36.885245901639344"
"textbook question answering","158","92","63","58.22784810126582","39.87341772151899"
"numeric commonsense","144","39","24","27.083333333333332","16.666666666666664"
"arithmetic reasoning","353","102","103","28.89518413597734","29.178470254957507"
"visual question answering","179","92","53","51.39664804469274","29.608938547486037"
"geometry reasoning","239","147","95","61.50627615062761","39.74895397489539"
"algebraic reasoning","281","170","112","60.4982206405694","39.8576512455516"
"geometry problem solving","208","138","85","66.34615384615384","40.86538461538461"
"math word problem","186","26","52","13.978494623655912","27.956989247311824"
"logical reasoning","37","11","5","29.72972972972973","13.513513513513514"
"figure question answering","269","141","124","52.41635687732342","46.09665427509294"
"statistical reasoning","301","144","148","47.840531561461795","49.16943521594684"
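
In the CSV output above, `acc` and `prefetch_rate` are the `hit` and `prefetch` counts expressed as percentages of `tot`. The sketch below parses a few of the rows and re-derives both columns. It is an illustration only: `recompute` is a hypothetical helper and the rows are copied from the output above.

```python
import csv
import io

# A subset of the rows from the expected output above.
CSV_TEXT = '''"Task&Skill","tot","prefetch","hit","prefetch_rate","acc"
"Overall","1000","489","377","48.9","37.7"
"scientific reasoning","122","85","45","69.67213114754098","36.885245901639344"
"logical reasoning","37","11","5","29.72972972972973","13.513513513513514"
'''


def recompute(row):
    """Re-derive the percentage columns from the raw counts in one CSV row."""
    tot = int(row["tot"])
    return {
        "prefetch_rate": 100 * int(row["prefetch"]) / tot,
        "acc": 100 * int(row["hit"]) / tot,
    }


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
for row in rows:
    derived = recompute(row)
    print(row["Task&Skill"], round(derived["acc"], 1), round(derived["prefetch_rate"], 1))
```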

HallusionBench#

HallusionBench is a comprehensive benchmark designed to evaluate image-context reasoning in MLLMs, focusing on identifying issues related to language hallucination and visual illusion. The dataset consists of 346 images paired with 1,129 questions crafted by human experts. These questions are divided into two categories: Visual Dependent (VD) and Visual Supplement (VS), allowing the benchmark to assess the nuanced understanding and interpretation of visual data by MLLMs.

torchrun --nproc-per-node=8 run.py --data HallusionBench --model InternVL2-1B --verbose

The expected test results are:

"split","aAcc","fAcc","qAcc"
"Overall","54.363827549947416","23.98843930635838","21.978021978021978"
"VS","58.333333333333336","15.517241379310345","28.651685393258425"
"VD","51.945854483925544","28.26086956521739","17.689530685920577"
"VS_map","56.25","9.090909090909092","12.5"
"VD_illusion","48.61111111111111","25.806451612903224","8.333333333333332"
"VD_figure","58.75","36.58536585365854","23.076923076923077"
"VS_ocr","44.44444444444444","23.076923076923077","3.7037037037037033"
"VD_video","51.76470588235295","14.583333333333334","11.594202898550725"
"VD_ocr","78.65168539325843","58.139534883720934","55.81395348837209"
"VS_chart","66.15384615384615","17.5","47.368421052631575"
"VD_math","29.629629629629626","5.555555555555555","3.7037037037037033"
"VS_table","57.14285714285714","10.714285714285714","23.25581395348837"

result = (54.363827549947416 + 23.98843930635838 + 21.978021978021978) / 3 = 33.4
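
The same average can be computed directly from the `Overall` row. A minimal sketch, with the three values copied from the output above:

```python
# aAcc, fAcc, and qAcc from the "Overall" row above.
overall = {
    "aAcc": 54.363827549947416,
    "fAcc": 23.98843930635838,
    "qAcc": 21.978021978021978,
}

result = sum(overall.values()) / len(overall)
print(f"{result:.1f}")  # 33.4
```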

MMStar#

The MMStar dataset is an advanced multimodal benchmark designed to evaluate the capabilities of MLLMs. It comprises 1,500 carefully selected samples that are balanced and purified to ensure they exhibit visual dependency and minimal data leakage. The dataset evaluates models across six core capabilities and 18 detailed axes, focusing on complex multimodal tasks that require advanced reasoning and understanding of visual content.

torchrun --nproc-per-node=8 run.py --data MMStar --model InternVL2-1B --verbose

The expected test results are:

"split","Overall","coarse perception","fine-grained perception","instance reasoning","logical reasoning","math","science & technology"
"none","0.452","0.588","0.368","0.548","0.352","0.46","0.396"

OCRBench#

OCRBench is a comprehensive evaluation benchmark designed to assess the OCR capabilities of MLLMs. It includes five components: Text Recognition, Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER). The benchmark encompasses data from 29 datasets, making it one of the most thorough OCR evaluation tools available. OCRBench aims to reveal both the strengths and weaknesses of MLLMs, particularly in handling multilingual text, handwritten text, non-semantic text, and mathematical expressions. The benchmark includes 1,000 question-answer pairs, all manually verified for precision.

torchrun --nproc-per-node=8 run.py --data OCRBench --model InternVL2-1B --verbose

The expected test results are:

{
    "Text Recognition": 243,
    "Scene Text-centric VQA": 165,
    "Doc-oriented VQA": 125,
    "Key Information Extraction": 149,
    "Handwritten Mathematical Expression Recognition": 72,
    "Final Score": 754,
    "Final Score Norm": 75.4
}
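
In the JSON above, `Final Score` is the sum of the five component scores, and `Final Score Norm` is that sum divided by 10 (there are 1,000 question-answer pairs). A minimal sketch that checks this on the output above; the variable names are illustrative:

```python
import json

# Expected output copied from above.
OUTPUT = """
{
    "Text Recognition": 243,
    "Scene Text-centric VQA": 165,
    "Doc-oriented VQA": 125,
    "Key Information Extraction": 149,
    "Handwritten Mathematical Expression Recognition": 72,
    "Final Score": 754,
    "Final Score Norm": 75.4
}
"""

scores = json.loads(OUTPUT)
# Sum every entry except the two aggregate fields.
components = {k: v for k, v in scores.items() if not k.startswith("Final Score")}
final_score = sum(components.values())
final_norm = final_score / 10

print(final_score, final_norm)  # 754 75.4
```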

MMMU#

The MMMU dataset is a comprehensive benchmark designed to evaluate multimodal models on college-level tasks that require domain-specific knowledge and reasoning. It includes 11,500 questions sourced from college exams, quizzes, and textbooks, spanning six disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions cover 30 subjects and feature 30 types of images, such as charts, diagrams, maps, tables, and more.

torchrun --nproc-per-node=8 run.py --data MMMU_DEV_VAL --model InternVL2-1B --verbose

The expected test results are:

"split","Overall","Accounting","Agriculture","Architecture_and_Engineering","Art","Art_Theory","Basic_Medical_Science","Biology","Chemistry","Clinical_Medicine","Computer_Science","Design","Diagnostics_and_Laboratory_Medicine","Economics","Electronics","Energy_and_Power","Finance","Geography","History","Literature","Manage","Marketing","Materials","Math","Mechanical_Engineering","Music","Pharmacy","Physics","Psychology","Public_Health","Sociology","Art & Design","Business","Health & Medicine","Humanities & Social Science","Science","Tech & Engineering"
"dev","0.34","0.2","0.0","0.2","0.2","0.4","0.4","0.0","0.4","0.0","0.2","0.4","0.4","0.2","0.0","0.6","0.6","0.4","0.2","0.6","0.6","0.6","0.2","0.2","0.0","0.4","0.4","0.8","0.6","0.2","0.8","0.35","0.44","0.28","0.55","0.36","0.17142857142857143"
"validation","0.3688888888888889","0.2","0.2","0.23333333333333334","0.4666666666666667","0.43333333333333335","0.4666666666666667","0.3333333333333333","0.4","0.3333333333333333","0.3333333333333333","0.5333333333333333","0.4666666666666667","0.36666666666666664","0.4666666666666667","0.4","0.23333333333333334","0.4","0.43333333333333335","0.7666666666666667","0.43333333333333335","0.43333333333333335","0.4","0.16666666666666666","0.26666666666666666","0.26666666666666666","0.2","0.36666666666666664","0.26666666666666666","0.3","0.5","0.425","0.3333333333333333","0.35333333333333333","0.49166666666666664","0.3333333333333333","0.32857142857142857"

RealWorldQA#

The RealWorldQA dataset is a benchmark designed to evaluate the real-world spatial understanding capabilities of multimodal AI models. It consists of over 700 images, each accompanied by a question and a verifiable answer, focusing on various real-world scenarios, including those captured from vehicles. This dataset aims to test how well AI models comprehend physical environments and spatial relations, enhancing their ability to interpret and analyze real-world scenes.

torchrun --nproc-per-node=8 run.py --data RealWorldQA --model InternVL2-1B --verbose

The expected test results are:

"split","Overall"
"none","0.5032679738562091"

MMVet (GPT-4-Turbo)#

The MM-Vet dataset is a comprehensive benchmark designed to evaluate the integrated capabilities of MLLMs. It encompasses six core vision-language (VL) capabilities: recognition, knowledge, optical character recognition (OCR), spatial awareness, language generation, and math. The dataset includes 200 images and 218 questions, each requiring one or more of these capabilities to answer. The evaluation uses an open-ended LLM-based approach, allowing assessment across various answer styles and question types.

torchrun --nproc-per-node=8 run.py --data MMVet --model InternVL2-1B --verbose

The expected test results are:

"Category","tot","acc"
"rec","187","37.27272727272725"
"ocr","108","37.96296296296297"
"know","84","14.76190476190476"
"gen","80","14.624999999999996"
"spat","75","33.733333333333334"
"math","26","22.692307692307693"
"Overall","218","33.25688073394493"

Note that because the GPT-4 version used for scoring differs from the one on the official server, the scores obtained with VLMEvalKit will differ slightly.

LLaVA-Bench (GPT-4-Turbo)#

The LLaVA-Bench-in-the-Wild dataset is designed to evaluate the capabilities of MLLMs in handling more complex and diverse visual tasks. It includes a set of 24 images with 60 associated questions, covering a range of indoor and outdoor scenes, memes, paintings, and sketches. Each image is paired with detailed, manually curated descriptions and questions that test the model’s generalizability to novel domains.

torchrun --nproc-per-node=8 run.py --data LLaVABench --model InternVL2-1B --verbose

The expected test results are:

"split","Relative Score (main)","VLM Score","GPT4 Score"
"overall","51.6","39.5","76.5"
"detail","58.9","37.3","63.3"
"conv","43.0","40.0","92.9"
"complex","54.9","40.4","73.6"

VideoMME#

The Video-MME dataset is a comprehensive benchmark designed to evaluate the capabilities of MLLMs in video analysis. It is the first benchmark specifically tailored for this purpose, focusing on a high-quality assessment of models’ performance in processing sequential visual data.

When testing without subtitles:

torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-1B --verbose --nframe 16

The expected test results are:

{
    "short": {
        "overall": "0.5289",
        "domain": {
            "Knowledge": "0.5481",
            "Film & Television": "0.6167",
            "Sports Competition": "0.4667",
            "Artistic Performance": "0.5333",
            "Life Record": "0.5143",
            "Multilingual": "0.4000"
        },
        "sub_category": {
            "Humanity & History": "0.3333",
            "Literature & Art": "0.4000",
            "Biology & Medicine": "0.7000",
            "Finance & Commerce": "0.6333",
            "Astronomy": "0.5667",
            "Geography": "0.5333",
            "Law": "0.6000",
            "Life Tip": "0.5333",
            "Technology": "0.6333",
            "Animation": "0.6000",
            "Movie & TV Show": "0.7333",
            "Documentary": "0.5333",
            "News Report": "0.6000",
            "Esports": "0.3667",
            "Basketball": "0.3667",
            "Football": "0.5333",
            "Athletics": "0.5333",
            "Other Sports": "0.5333",
            "Stage Play": "0.7333",
            "Magic Show": "0.3333",
            "Variety Show": "0.6333",
            "Acrobatics": "0.4333",
            "Handicraft": "0.4667",
            "Food": "0.5000",
            "Fashion": "0.6333",
            "Daily Life": "0.4000",
            "Travel": "0.6333",
            "Pet & Animal": "0.6667",
            "Exercise": "0.3000",
            "Multilingual": "0.4000"
        },
        "task_type": {
            "Temporal Perception": "0.6667",
            "Spatial Perception": "0.6000",
            "Attribute Perception": "0.6721",
            "Action Recognition": "0.4427",
            "Object Recognition": "0.4821",
            "OCR Problems": "0.6316",
            "Counting Problem": "0.3040",
            "Temporal Reasoning": "0.6154",
            "Spatial Reasoning": "0.6667",
            "Action Reasoning": "0.6170",
            "Object Reasoning": "0.4750",
            "Information Synopsis": "0.7073"
        }
    },
    "medium": {
        "overall": "0.4144",
        "domain": {
            "Knowledge": "0.3630",
            "Film & Television": "0.5250",
            "Sports Competition": "0.3933",
            "Artistic Performance": "0.4750",
            "Life Record": "0.3952",
            "Multilingual": "0.4333"
        },
        "sub_category": {
            "Humanity & History": "0.2000",
            "Literature & Art": "0.4000",
            "Biology & Medicine": "0.5000",
            "Finance & Commerce": "0.4333",
            "Astronomy": "0.4333",
            "Geography": "0.2333",
            "Law": "0.4000",
            "Life Tip": "0.4333",
            "Technology": "0.2333",
            "Animation": "0.3333",
            "Movie & TV Show": "0.5333",
            "Documentary": "0.6000",
            "News Report": "0.6333",
            "Esports": "0.5000",
            "Basketball": "0.1333",
            "Football": "0.4333",
            "Athletics": "0.3333",
            "Other Sports": "0.5667",
            "Stage Play": "0.5667",
            "Magic Show": "0.3333",
            "Variety Show": "0.5000",
            "Acrobatics": "0.5000",
            "Handicraft": "0.4667",
            "Food": "0.3000",
            "Fashion": "0.3667",
            "Daily Life": "0.3333",
            "Travel": "0.4333",
            "Pet & Animal": "0.4000",
            "Exercise": "0.4667",
            "Multilingual": "0.4333"
        },
        "task_type": {
            "Temporal Perception": "0.3871",
            "Spatial Perception": "0.6190",
            "Attribute Perception": "0.4110",
            "Action Recognition": "0.3613",
            "Object Recognition": "0.5000",
            "OCR Problems": "0.4706",
            "Counting Problem": "0.2526",
            "Temporal Reasoning": "0.2740",
            "Spatial Reasoning": "0.6667",
            "Action Reasoning": "0.3276",
            "Object Reasoning": "0.4179",
            "Information Synopsis": "0.5897"
        }
    },
    "long": {
        "overall": "0.3333",
        "domain": {
            "Knowledge": "0.3259",
            "Film & Television": "0.3250",
            "Sports Competition": "0.3000",
            "Artistic Performance": "0.3167",
            "Life Record": "0.3762",
            "Multilingual": "0.3667"
        },
        "sub_category": {
            "Humanity & History": "0.3333",
            "Literature & Art": "0.3667",
            "Biology & Medicine": "0.3333",
            "Finance & Commerce": "0.4667",
            "Astronomy": "0.2000",
            "Geography": "0.3000",
            "Law": "0.2667",
            "Life Tip": "0.3000",
            "Technology": "0.3667",
            "Animation": "0.2000",
            "Movie & TV Show": "0.4667",
            "Documentary": "0.4000",
            "News Report": "0.2333",
            "Esports": "0.4000",
            "Basketball": "0.3333",
            "Football": "0.2333",
            "Athletics": "0.1333",
            "Other Sports": "0.4000",
            "Stage Play": "0.4000",
            "Magic Show": "0.2667",
            "Variety Show": "0.1333",
            "Acrobatics": "0.4667",
            "Handicraft": "0.5333",
            "Food": "0.4333",
            "Fashion": "0.3333",
            "Daily Life": "0.3667",
            "Travel": "0.2000",
            "Pet & Animal": "0.4333",
            "Exercise": "0.3333",
            "Multilingual": "0.3667"
        },
        "task_type": {
            "Temporal Perception": "0.3333",
            "Spatial Perception": "0.0000",
            "Attribute Perception": "0.5185",
            "Action Recognition": "0.3016",
            "Object Recognition": "0.2963",
            "OCR Problems": "0.5000",
            "Counting Problem": "0.1250",
            "Temporal Reasoning": "0.2857",
            "Spatial Reasoning": "0.6364",
            "Action Reasoning": "0.2556",
            "Object Reasoning": "0.3042",
            "Information Synopsis": "0.5153"
        }
    },
    "overall": {
        "overall": "0.4256",
        "domain": {
            "Knowledge": "0.4123",
            "Film & Television": "0.4889",
            "Sports Competition": "0.3867",
            "Artistic Performance": "0.4417",
            "Life Record": "0.4286",
            "Multilingual": "0.4000"
        },
        "sub_category": {
            "Humanity & History": "0.2889",
            "Literature & Art": "0.3889",
            "Biology & Medicine": "0.5111",
            "Finance & Commerce": "0.5111",
            "Astronomy": "0.4000",
            "Geography": "0.3556",
            "Law": "0.4222",
            "Life Tip": "0.4222",
            "Technology": "0.4111",
            "Animation": "0.3778",
            "Movie & TV Show": "0.5778",
            "Documentary": "0.5111",
            "News Report": "0.4889",
            "Esports": "0.4222",
            "Basketball": "0.2778",
            "Football": "0.4000",
            "Athletics": "0.3333",
            "Other Sports": "0.5000",
            "Stage Play": "0.5667",
            "Magic Show": "0.3111",
            "Variety Show": "0.4222",
            "Acrobatics": "0.4667",
            "Handicraft": "0.4889",
            "Food": "0.4111",
            "Fashion": "0.4444",
            "Daily Life": "0.3667",
            "Travel": "0.4222",
            "Pet & Animal": "0.5000",
            "Exercise": "0.3667",
            "Multilingual": "0.4000"
        },
        "task_type": {
            "Temporal Perception": "0.4727",
            "Spatial Perception": "0.5741",
            "Attribute Perception": "0.5676",
            "Action Recognition": "0.3834",
            "Object Recognition": "0.4605",
            "OCR Problems": "0.5396",
            "Counting Problem": "0.2537",
            "Temporal Reasoning": "0.3051",
            "Spatial Reasoning": "0.6607",
            "Action Reasoning": "0.3298",
            "Object Reasoning": "0.3678",
            "Information Synopsis": "0.5820"
        }
    }
}

When testing with subtitles:

torchrun --nproc-per-node=8 run.py --data Video-MME --model InternVL2-1B --verbose --nframe 16 --use-subtitle

The expected test results are:

{
    "short": {
        "overall": "0.5433",
        "domain": {
            "Knowledge": "0.5630",
            "Film & Television": "0.6000",
            "Sports Competition": "0.4933",
            "Artistic Performance": "0.5167",
            "Life Record": "0.5571",
            "Multilingual": "0.4000"
        },
        "sub_category": {
            "Humanity & History": "0.3333",
            "Literature & Art": "0.4000",
            "Biology & Medicine": "0.7667",
            "Finance & Commerce": "0.6000",
            "Astronomy": "0.6000",
            "Geography": "0.5000",
            "Law": "0.6667",
            "Life Tip": "0.6000",
            "Technology": "0.6000",
            "Animation": "0.5667",
            "Movie & TV Show": "0.7333",
            "Documentary": "0.5000",
            "News Report": "0.6000",
            "Esports": "0.4333",
            "Basketball": "0.4000",
            "Football": "0.5000",
            "Athletics": "0.5000",
            "Other Sports": "0.6333",
            "Stage Play": "0.7667",
            "Magic Show": "0.3333",
            "Variety Show": "0.5333",
            "Acrobatics": "0.4333",
            "Handicraft": "0.5000",
            "Food": "0.6000",
            "Fashion": "0.6333",
            "Daily Life": "0.4333",
            "Travel": "0.7333",
            "Pet & Animal": "0.6667",
            "Exercise": "0.3333",
            "Multilingual": "0.4000"
        },
        "task_type": {
            "Temporal Perception": "0.5556",
            "Spatial Perception": "0.5667",
            "Attribute Perception": "0.6557",
            "Action Recognition": "0.4656",
            "Object Recognition": "0.5238",
            "OCR Problems": "0.6667",
            "Counting Problem": "0.3120",
            "Temporal Reasoning": "0.4615",
            "Spatial Reasoning": "0.6296",
            "Action Reasoning": "0.5957",
            "Object Reasoning": "0.5375",
            "Information Synopsis": "0.7561"
        }
    },
    "medium": {
        "overall": "0.4289",
        "domain": {
            "Knowledge": "0.4111",
            "Film & Television": "0.5250",
            "Sports Competition": "0.4000",
            "Artistic Performance": "0.4917",
            "Life Record": "0.3714",
            "Multilingual": "0.5000"
        },
        "sub_category": {
            "Humanity & History": "0.3667",
            "Literature & Art": "0.4333",
            "Biology & Medicine": "0.5667",
            "Finance & Commerce": "0.5000",
            "Astronomy": "0.5333",
            "Geography": "0.3333",
            "Law": "0.3333",
            "Life Tip": "0.4000",
            "Technology": "0.2333",
            "Animation": "0.2667",
            "Movie & TV Show": "0.5000",
            "Documentary": "0.6333",
            "News Report": "0.7000",
            "Esports": "0.5000",
            "Basketball": "0.1667",
            "Football": "0.4333",
            "Athletics": "0.3667",
            "Other Sports": "0.5333",
            "Stage Play": "0.6333",
            "Magic Show": "0.4333",
            "Variety Show": "0.4333",
            "Acrobatics": "0.4667",
            "Handicraft": "0.5000",
            "Food": "0.3333",
            "Fashion": "0.3333",
            "Daily Life": "0.3000",
            "Travel": "0.4000",
            "Pet & Animal": "0.3000",
            "Exercise": "0.4333",
            "Multilingual": "0.5000"
        },
        "task_type": {
            "Temporal Perception": "0.4194",
            "Spatial Perception": "0.6667",
            "Attribute Perception": "0.4658",
            "Action Recognition": "0.3613",
            "Object Recognition": "0.4924",
            "OCR Problems": "0.4265",
            "Counting Problem": "0.2632",
            "Temporal Reasoning": "0.2877",
            "Spatial Reasoning": "0.7222",
            "Action Reasoning": "0.3276",
            "Object Reasoning": "0.4403",
            "Information Synopsis": "0.6538"
        }
    },
    "long": {
        "overall": "0.3689",
        "domain": {
            "Knowledge": "0.3852",
            "Film & Television": "0.3833",
            "Sports Competition": "0.3267",
            "Artistic Performance": "0.3417",
            "Life Record": "0.3905",
            "Multilingual": "0.3333"
        },
        "sub_category": {
            "Humanity & History": "0.2333",
            "Literature & Art": "0.4333",
            "Biology & Medicine": "0.4333",
            "Finance & Commerce": "0.6000",
            "Astronomy": "0.2667",
            "Geography": "0.2667",
            "Law": "0.5000",
            "Life Tip": "0.4333",
            "Technology": "0.3000",
            "Animation": "0.2667",
            "Movie & TV Show": "0.4667",
            "Documentary": "0.5000",
            "News Report": "0.3000",
            "Esports": "0.3667",
            "Basketball": "0.2667",
            "Football": "0.3667",
            "Athletics": "0.2000",
            "Other Sports": "0.4333",
            "Stage Play": "0.4333",
            "Magic Show": "0.2333",
            "Variety Show": "0.2333",
            "Acrobatics": "0.4667",
            "Handicraft": "0.4667",
            "Food": "0.4333",
            "Fashion": "0.3667",
            "Daily Life": "0.4000",
            "Travel": "0.1667",
            "Pet & Animal": "0.5333",
            "Exercise": "0.3667",
            "Multilingual": "0.3333"
        },
        "task_type": {
            "Temporal Perception": "0.3333",
            "Spatial Perception": "0.0000",
            "Attribute Perception": "0.5185",
            "Action Recognition": "0.3016",
            "Object Recognition": "0.3148",
            "OCR Problems": "0.2857",
            "Counting Problem": "0.1875",
            "Temporal Reasoning": "0.2637",
            "Spatial Reasoning": "0.5455",
            "Action Reasoning": "0.3278",
            "Object Reasoning": "0.3667",
            "Information Synopsis": "0.5521"
        }
    },
    "overall": {
        "overall": "0.4470",
        "domain": {
            "Knowledge": "0.4531",
            "Film & Television": "0.5028",
            "Sports Competition": "0.4067",
            "Artistic Performance": "0.4500",
            "Life Record": "0.4397",
            "Multilingual": "0.4111"
        },
        "sub_category": {
            "Humanity & History": "0.3111",
            "Literature & Art": "0.4222",
            "Biology & Medicine": "0.5889",
            "Finance & Commerce": "0.5667",
            "Astronomy": "0.4667",
            "Geography": "0.3667",
            "Law": "0.5000",
            "Life Tip": "0.4778",
            "Technology": "0.3778",
            "Animation": "0.3667",
            "Movie & TV Show": "0.5667",
            "Documentary": "0.5444",
            "News Report": "0.5333",
            "Esports": "0.4333",
            "Basketball": "0.2778",
            "Football": "0.4333",
            "Athletics": "0.3556",
            "Other Sports": "0.5333",
            "Stage Play": "0.6111",
            "Magic Show": "0.3333",
            "Variety Show": "0.4000",
            "Acrobatics": "0.4556",
            "Handicraft": "0.4889",
            "Food": "0.4556",
            "Fashion": "0.4444",
            "Daily Life": "0.3778",
            "Travel": "0.4333",
            "Pet & Animal": "0.5000",
            "Exercise": "0.3778",
            "Multilingual": "0.4111"
        },
        "task_type": {
            "Temporal Perception": "0.4545",
            "Spatial Perception": "0.5741",
            "Attribute Perception": "0.5766",
            "Action Recognition": "0.3930",
            "Object Recognition": "0.4802",
            "OCR Problems": "0.5108",
            "Counting Problem": "0.2724",
            "Temporal Reasoning": "0.2881",
            "Spatial Reasoning": "0.6429",
            "Action Reasoning": "0.3719",
            "Object Reasoning": "0.4185",
            "Information Synopsis": "0.6285"
        }
    }
}
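
As a quick sanity check on the two Video-MME runs, the sketch below compares the top-level scores. It assumes the three duration groups contain the same number of questions, so that the combined score is a plain mean of the three; that assumption is not stated in this document. The values are copied from the expected results above and the variable names are illustrative.

```python
# "overall" scores with subtitles, copied from the expected results above.
with_subtitles = {"short": 0.5433, "medium": 0.4289, "long": 0.3689}
reported_with_subtitles = 0.4470

# Top-level "overall" score from the run without subtitles.
reported_without_subtitles = 0.4256

# Assumption: each duration group has the same number of questions,
# so the combined score is the unweighted mean of the three groups.
mean_overall = sum(with_subtitles.values()) / len(with_subtitles)

# Gain in the top-level score from adding subtitles.
subtitle_gain = reported_with_subtitles - reported_without_subtitles

print(f"mean of duration groups: {mean_overall:.4f}")
print(f"reported overall:        {reported_with_subtitles:.4f}")
print(f"gain from subtitles:     {subtitle_gain:+.4f}")
```

If your own run produces a top-level score that does not match the mean of its three duration groups, the result file is probably incomplete.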

MMBench-Video

MMBench-Video is a benchmark designed to evaluate how well MLLMs understand video content. It addresses the limitations of traditional VideoQA benchmarks by using long-form videos sourced from YouTube, which better reflect real-world scenarios. Its free-form questions require temporal reasoning and are human-annotated according to a comprehensive capability taxonomy.

When testing with 8 frames:

torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-1B --verbose --nframe 8

The expected test results are:

{
    "coarse_all": {
        "CP": "1.11",
        "FP-S": "1.00",
        "FP-C": "0.84",
        "HL": "0.27",
        "LR": "0.71",
        "AR": "1.01",
        "RR": "1.17",
        "CSR": "0.77",
        "TR": "0.71",
        "Perception": "0.97",
        "Reasoning": "0.88",
        "Overall": "0.95"
    },
    "coarse_valid": {
        "CP": "1.11",
        "FP-S": "1.00",
        "FP-C": "0.84",
        "HL": "0.27",
        "LR": "0.71",
        "AR": "1.01",
        "RR": "1.17",
        "CSR": "0.77",
        "TR": "0.71",
        "Perception": "0.97",
        "Reasoning": "0.88",
        "Overall": "0.95"
    },
    "fine_all": {
        "Video Topic": "1.05",
        "Video Emotion": "1.27",
        "Video Scene": "0.84",
        "Video Style": "1.38",
        "OCR": "0.87",
        "Object Recognition": "1.07",
        "Attribute Recognition": "1.41",
        "Event Recognition": "0.93",
        "Human Motion": "0.84",
        "Counting": "0.99",
        "Spatial Relationship": "1.16",
        "Human-object Interaction": "0.80",
        "Human Interaction": "0.70",
        "Hallucination": "0.27",
        "Structuralized Image-Text Understanding": "0.97",
        "Mathematical Calculation": "0.31",
        "Physical Property": "0.78",
        "Function Reasoning": "0.95",
        "Identity Reasoning": "1.30",
        "Natural Relation": "1.04",
        "Physical Relation": "0.92",
        "Social Relation": "1.48",
        "Common Sense Reasoning": "0.77",
        "Counterfactual Reasoning": "0.80",
        "Causal Reasoning": "0.67",
        "Future Prediction": "0.77"
    },
    "fine_valid": {
        "Video Topic": "1.05",
        "Video Emotion": "1.27",
        "Video Scene": "0.84",
        "Video Style": "1.38",
        "OCR": "0.87",
        "Object Recognition": "1.07",
        "Attribute Recognition": "1.41",
        "Event Recognition": "0.93",
        "Human Motion": "0.84",
        "Counting": "0.99",
        "Spatial Relationship": "1.16",
        "Human-object Interaction": "0.80",
        "Human Interaction": "0.70",
        "Hallucination": "0.27",
        "Structuralized Image-Text Understanding": "0.97",
        "Mathematical Calculation": "0.31",
        "Physical Property": "0.78",
        "Function Reasoning": "0.95",
        "Identity Reasoning": "1.30",
        "Natural Relation": "1.04",
        "Physical Relation": "0.92",
        "Social Relation": "1.48",
        "Common Sense Reasoning": "0.77",
        "Counterfactual Reasoning": "0.80",
        "Causal Reasoning": "0.67",
        "Future Prediction": "0.77"
    }
}

When testing with 16 frames:

torchrun --nproc-per-node=8 run.py --data MMBench-Video --model InternVL2-1B --verbose --nframe 16

The expected test results are:

{
    "coarse_all": {
        "CP": "1.21",
        "FP-S": "1.03",
        "FP-C": "0.85",
        "HL": "0.29",
        "LR": "0.73",
        "AR": "1.00",
        "RR": "1.26",
        "CSR": "0.70",
        "TR": "0.74",
        "Perception": "1.00",
        "Reasoning": "0.90",
        "Overall": "0.98"
    },
    "coarse_valid": {
        "CP": "1.21",
        "FP-S": "1.03",
        "FP-C": "0.85",
        "HL": "0.29",
        "LR": "0.73",
        "AR": "1.00",
        "RR": "1.26",
        "CSR": "0.70",
        "TR": "0.74",
        "Perception": "1.00",
        "Reasoning": "0.90",
        "Overall": "0.98"
    },
    "fine_all": {
        "Video Topic": "1.15",
        "Video Emotion": "1.37",
        "Video Scene": "0.96",
        "Video Style": "1.43",
        "OCR": "0.96",
        "Object Recognition": "1.08",
        "Attribute Recognition": "1.47",
        "Event Recognition": "0.86",
        "Human Motion": "0.77",
        "Counting": "0.94",
        "Spatial Relationship": "1.09",
        "Human-object Interaction": "0.85",
        "Human Interaction": "0.64",
        "Hallucination": "0.29",
        "Structuralized Image-Text Understanding": "0.96",
        "Mathematical Calculation": "0.38",
        "Physical Property": "0.76",
        "Function Reasoning": "0.89",
        "Identity Reasoning": "1.36",
        "Natural Relation": "1.00",
        "Physical Relation": "1.10",
        "Social Relation": "1.54",
        "Common Sense Reasoning": "0.70",
        "Counterfactual Reasoning": "0.88",
        "Causal Reasoning": "0.72",
        "Future Prediction": "0.74"
    },
    "fine_valid": {
        "Video Topic": "1.15",
        "Video Emotion": "1.37",
        "Video Scene": "0.96",
        "Video Style": "1.43",
        "OCR": "0.96",
        "Object Recognition": "1.08",
        "Attribute Recognition": "1.47",
        "Event Recognition": "0.86",
        "Human Motion": "0.77",
        "Counting": "0.94",
        "Spatial Relationship": "1.09",
        "Human-object Interaction": "0.85",
        "Human Interaction": "0.64",
        "Hallucination": "0.29",
        "Structuralized Image-Text Understanding": "0.96",
        "Mathematical Calculation": "0.38",
        "Physical Property": "0.76",
        "Function Reasoning": "0.89",
        "Identity Reasoning": "1.36",
        "Natural Relation": "1.00",
        "Physical Relation": "1.10",
        "Social Relation": "1.54",
        "Common Sense Reasoning": "0.70",
        "Counterfactual Reasoning": "0.88",
        "Causal Reasoning": "0.72",
        "Future Prediction": "0.74"
    }
}
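
To see where the extra frames help, the coarse-level scores of the two runs can be compared directly. The sketch below is only an illustration: the values are copied from the `coarse_all` blocks above, and the variable names are made up for this example.

```python
# Coarse-level scores copied from the expected results above.
coarse_8_frames = {
    "CP": 1.11, "FP-S": 1.00, "FP-C": 0.84, "HL": 0.27, "LR": 0.71, "AR": 1.01,
    "RR": 1.17, "CSR": 0.77, "TR": 0.71,
    "Perception": 0.97, "Reasoning": 0.88, "Overall": 0.95,
}
coarse_16_frames = {
    "CP": 1.21, "FP-S": 1.03, "FP-C": 0.85, "HL": 0.29, "LR": 0.73, "AR": 1.00,
    "RR": 1.26, "CSR": 0.70, "TR": 0.74,
    "Perception": 1.00, "Reasoning": 0.90, "Overall": 0.98,
}

# Per-category change when moving from 8 to 16 frames (rounded to the
# two decimals used in the reports).
deltas = {
    name: round(coarse_16_frames[name] - coarse_8_frames[name], 2)
    for name in coarse_8_frames
}

# Categories whose score dropped with more frames.
regressed = sorted(name for name, delta in deltas.items() if delta < 0)

for name, delta in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:12s} {delta:+.2f}")
print("regressed:", regressed)
```

Differences of a few hundredths are within the run-to-run variation mentioned at the top of this page, so only the larger changes should be read as meaningful.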

MathVision

The MathVision (MATH-V) dataset is a comprehensive benchmark designed to evaluate the mathematical reasoning capabilities of large multimodal models. It includes 3,040 high-quality mathematical problems, each paired with a visual context sourced from real math competitions. The problems span 16 mathematical disciplines, including algebra, geometry, topology, and graph theory, and are graded across five levels of difficulty, so the benchmark assesses both the visual perception and the reasoning abilities of models. The command below evaluates the MathVision_MINI split, for which the results table lists 304 problems (19 per discipline).

torchrun --nproc-per-node=8 run.py --model InternVL2-1B --data MathVision_MINI

The expected test results are:

--  ------------------------  ---  ---  --  --------  --------
 0  Overall                   304  100  37  32.8947   12.1711
 1  algebra                    19    5   1  26.3158    5.26316
 2  analytic geometry          19    5   3  26.3158   15.7895
 3  arithmetic                 19    4   2  21.0526   10.5263
 4  combinatorial geometry     19    7   2  36.8421   10.5263
 5  combinatorics              19    1   3   5.26316  15.7895
 6  counting                   19    1   2   5.26316  10.5263
 7  descriptive geometry       19   10   4  52.6316   21.0526
 8  graph theory               19    7   2  36.8421   10.5263
 9  logic                      19    6   3  31.5789   15.7895
10  metric geometry - angle    19   10   4  52.6316   21.0526
11  metric geometry - area     19    8   1  42.1053    5.26316
12  metric geometry - length   19    8   3  42.1053   15.7895
13  solid geometry             19    6   0  31.5789    0
14  statistics                 19    6   2  31.5789   10.5263
15  topology                   19    8   2  42.1053   10.5263
16  transformation geometry    19    8   3  42.1053   15.7895
--  ------------------------  ---  ---  --  --------  --------
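
The table above is printed without a header row. Judging from the numbers alone, the third column is the number of problems per subject, the next two columns are counts, and the last two columns are those counts as percentages of the total. The sketch below recomputes a few rows under that reading; `count_a` and `count_b` are placeholder names, not the toolkit's column names.

```python
# A few rows copied from the table above: (subject, total, count_a, count_b).
# count_a / count_b are placeholder names; the table itself has no header.
rows = [
    ("Overall", 304, 100, 37),
    ("algebra", 19, 5, 1),
    ("descriptive geometry", 19, 10, 4),
    ("solid geometry", 19, 6, 0),
]


def percent(count: int, total: int) -> float:
    """Return count as a percentage of total."""
    return 100.0 * count / total


for subject, total, count_a, count_b in rows:
    print(
        f"{subject:22s} {total:4d} "
        f"{percent(count_a, total):8.4f} {percent(count_b, total):8.4f}"
    )

# 16 subjects with 19 problems each give the 304 problems in the first row.
assert 16 * 19 == 304
```

Under this reading, the final column of the "Overall" row (12.1711) is 37 of 304 problems.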

MTVQA

MTVQA (Multilingual Text-Centric Visual Question Answering) is a benchmark for text-centric VQA (TEC-VQA) in multilingual settings. It provides high-quality human expert annotations across nine languages, so that models can be evaluated on text-centric visual questions beyond English.

torchrun --nproc-per-node=8 run.py --model InternVL2-1B --data MTVQA_TEST

The expected test results are:

{
    "AR": 1.991465149359886,
    "Average": 12.570079669519032,
    "DE": 21.85114503816794,
    "FR": 20.54176072234763,
    "IT": 22.39819004524887,
    "JA": 6.159420289855073,
    "KR": 8.422939068100359,
    "RU": 3.571428571428571,
    "TH": 2.1645021645021645,
    "VI": 11.199095022624435
}
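
Note that the reported "Average" is not the plain mean of the nine per-language scores. The sketch below computes that unweighted mean from the JSON above. The gap suggests the reported value is aggregated over all samples rather than per language, but that is an inference from the numbers, not something this document confirms.

```python
import json

# Expected results copied from the JSON above.
results = json.loads("""
{
    "AR": 1.991465149359886,
    "Average": 12.570079669519032,
    "DE": 21.85114503816794,
    "FR": 20.54176072234763,
    "IT": 22.39819004524887,
    "JA": 6.159420289855073,
    "KR": 8.422939068100359,
    "RU": 3.571428571428571,
    "TH": 2.1645021645021645,
    "VI": 11.199095022624435
}
""")

# Every key except "Average" is a language code.
language_scores = {key: value for key, value in results.items() if key != "Average"}

# Unweighted (macro) mean over languages.
macro_average = sum(language_scores.values()) / len(language_scores)

print(f"languages:        {len(language_scores)}")
print(f"unweighted mean:  {macro_average:.2f}")
print(f"reported Average: {results['Average']:.2f}")
```

When comparing against other reports, check which of the two averages they quote.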

Citation

If you find this project useful in your research, please consider citing:

@article{chen2024far,
  title={How Far Are We to {GPT-4V}? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={Science China Information Sciences},
  volume={67},
  number={12},
  pages={220101},
  year={2024},
  publisher={Springer}
}
@inproceedings{chen2024internvl,
  title={{InternVL}: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}