InternViT-6B for Semantic Segmentation

This folder contains the implementation of InternViT-6B for semantic segmentation, developed on top of MMSegmentation v0.30.0. It corresponds to Section 4.2.2 of our InternVL 1.0 paper.

In this part, we validate the visual perception capabilities of InternViT-6B, the core component of InternVL 1.0. To investigate its pixel-level perceptual capacity, we conduct extensive semantic segmentation experiments on the ADE20K dataset.

Data Preparation

To set up your dataset for segmentation, it is recommended to symlink the dataset root to segmentation/data. If your folder structure is different, you may need to adjust the corresponding paths in the config files.

segmentation
├── data
│   ├── ade
│   │   ├── ADEChallengeData2016
│   │   │   ├── annotations
│   │   │   │   ├── training
│   │   │   │   ├── validation
│   │   │   ├── images
│   │   │   │   ├── training
│   │   │   │   ├── validation

The training and validation sets of ADE20K can be downloaded from this link.
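Before launching a job, it can help to confirm that the layout above is in place. The following is a minimal sketch and not part of this repository: the helper name `find_missing_dirs` is hypothetical, and the example assumes it is run from the `segmentation` folder with the dataset at `data/ade/ADEChallengeData2016`.

```python
from pathlib import Path

# Sub-directories expected under ADEChallengeData2016, per the tree above.
EXPECTED_SUBDIRS = (
    "annotations/training",
    "annotations/validation",
    "images/training",
    "images/validation",
)


def find_missing_dirs(dataset_root):
    """Return the expected sub-directories that do not exist under dataset_root.

    Hypothetical helper for illustration; symlinked directories count as present.
    """
    root = Path(dataset_root)
    return [sub for sub in EXPECTED_SUBDIRS if not (root / sub).is_dir()]


if __name__ == "__main__":
    # Assumed path, relative to the segmentation folder.
    missing = find_missing_dirs("data/ade/ADEChallengeData2016")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Dataset layout looks complete.")
```

If your data lives elsewhere, pass that path instead and adjust the config files accordingly.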

If you want to use other datasets, please refer to the guidelines in MMSegmentation.

Model Preparation

| model name | type | param | download | size |
| --- | --- | --- | --- | --- |
| intern_vit_6b_224px.pth | pytorch | 6B | 🤗 HF link | 12 GB |

Download the model weights listed above and place them in the pretrained/ folder:

mkdir pretrained && cd pretrained
wget https://huggingface.co/OpenGVLab/InternVL/resolve/main/intern_vit_6b_224px.pth

The directory structure should be:

pretrained
└── intern_vit_6b_224px.pth

Training

Note that this open-source code does not integrate DeepSpeed into MMSegmentation, so it currently supports only linear probing and head tuning, not full-parameter training.

If you want to train a super-large segmentation model, please refer to this codebase.

To train a linear classifier for InternViT-6B with 8 GPUs on 1 node (total batch size 16), run:

sh dist_train.sh configs/intern_vit_6b/linear_probing/linear_intern_vit_6b_504_80k_ade20k_bs16_lr4e-5_frozen.py 8
# or manage jobs with slurm
GPUS=8 sh slurm_train.sh <partition> <job-name> configs/intern_vit_6b/linear_probing/linear_intern_vit_6b_504_80k_ade20k_bs16_lr4e-5_frozen.py
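The config file name summarizes the main settings of the run. As a rough sketch, and assuming the common MMSegmentation-style naming in which `504` is the input crop size, `80k` the number of iterations, `bs16` the total batch size, and `lr4e-5` the learning rate, the name can be decoded as follows. Both helpers (`parse_config_name`, `per_gpu_batch_size`) are hypothetical and not part of this codebase; the config file itself remains the authoritative source for these values.

```python
import re


def parse_config_name(name):
    """Decode settings from a config name such as
    linear_intern_vit_6b_504_80k_ade20k_bs16_lr4e-5_frozen.

    The meaning assigned to each field is an assumption based on naming
    conventions, not something read from the config file.
    """
    match = re.search(r"_(\d+)_(\d+)k_(\w+?)_bs(\d+)_lr([0-9.e-]+)", name)
    if match is None:
        raise ValueError(f"unrecognized config name: {name}")
    crop, iters_k, dataset, batch, lr = match.groups()
    return {
        "crop_size": int(crop),
        "iterations": int(iters_k) * 1000,
        "dataset": dataset,
        "total_batch_size": int(batch),
        "learning_rate": float(lr),
    }


def per_gpu_batch_size(total_batch_size, num_gpus):
    """Split the total batch size evenly across GPUs."""
    if total_batch_size % num_gpus != 0:
        raise ValueError("total batch size must be divisible by the GPU count")
    return total_batch_size // num_gpus


settings = parse_config_name(
    "linear_intern_vit_6b_504_80k_ade20k_bs16_lr4e-5_frozen"
)
# With 8 GPUs and a total batch size of 16, each GPU processes 2 samples.
print(settings, per_gpu_batch_size(settings["total_batch_size"], 8))
```

If you train with a different number of GPUs, the per-GPU batch size in the config would need to change to keep the total batch size at 16.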

Note that the following message is expected during training and can be safely ignored:

INFO:mmseg:_IncompatibleKeys(missing_keys=[], unexpected_keys=['clip_projector.norm1_q.weight', 'clip_projector.norm1_q.bias', 'clip_projector.norm1_k.weight', 'clip_projector.norm1_k.bias', 'clip_projector.norm1_v.weight', 'clip_projector.norm1_v.bias', 'clip_projector.cross_attn.q_bias', 'clip_projector.cross_attn.k_bias', 'clip_projector.cross_attn.v_bias', 'clip_projector.cross_attn.q.weight', 'clip_projector.cross_attn.k.weight', 'clip_projector.cross_attn.v.weight', 'clip_projector.cross_attn.proj.weight', 'clip_projector.cross_attn.proj.bias'])
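Every unexpected key in this message starts with `clip_projector.`, meaning the checkpoint contains entries that the segmentation model does not define, while `missing_keys` is empty. If you would rather avoid the message, one option is to drop those entries before loading. The sketch below illustrates the idea with a plain dictionary standing in for a real state dict; the helper `drop_prefixed_keys` is hypothetical and is not how this codebase loads weights.

```python
def drop_prefixed_keys(state_dict, prefix="clip_projector."):
    """Return a copy of state_dict without the keys that start with prefix."""
    return {k: v for k, v in state_dict.items() if not k.startswith(prefix)}


# Stand-in for a loaded checkpoint; real values would be tensors.
checkpoint = {
    "clip_projector.norm1_q.weight": 0,
    "clip_projector.cross_attn.proj.bias": 0,
    "pos_embed": 0,  # hypothetical backbone key, for illustration only
}
filtered = drop_prefixed_keys(checkpoint)
print(sorted(filtered))
```

Filtering is optional: the message itself does not indicate a problem.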

Evaluation

| type | backbone | head | mIoU | config | download |
| --- | --- | --- | --- | --- | --- |
| few-shot (1/16) | InternViT-6B | Linear | 46.5 | config | ckpt \| log |
| few-shot (1/8) | InternViT-6B | Linear | 50.0 | config | ckpt \| log |
| few-shot (1/4) | InternViT-6B | Linear | 53.3 | config | ckpt \| log |
| few-shot (1/2) | InternViT-6B | Linear | 55.8 | config | ckpt \| log |
| few-shot (1/1) | InternViT-6B | Linear | 57.2 | config | ckpt \| log |
| linear probing | InternViT-6B (frozen) | Linear | 47.2 | config | ckpt \| log |
| head tuning | InternViT-6B (frozen) | UperNet | 54.9 | config | ckpt \| log |
| full tuning | InternViT-6B | UperNet | 58.9 | config | ckpt \| log |

You can download checkpoints from here or from the table above, then place them in segmentation/checkpoints/.

For example, to evaluate InternViT-6B with a single GPU:

python test.py configs/intern_vit_6b/linear_probing/linear_intern_vit_6b_504_80k_ade20k_bs16_lr4e-5_frozen.py checkpoints/linear_intern_vit_6b_504_80k_ade20k_bs16_lr4e-5_frozen.pth --eval mIoU

To evaluate InternViT-6B on a single node with 8 GPUs:

sh dist_test.sh configs/intern_vit_6b/linear_probing/linear_intern_vit_6b_504_80k_ade20k_bs16_lr4e-5_frozen.py checkpoints/linear_intern_vit_6b_504_80k_ade20k_bs16_lr4e-5_frozen.pth 8 --eval mIoU
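The `--eval mIoU` option reports mean Intersection over Union, the metric listed in the table above. For readers unfamiliar with it, here is a minimal pure-Python sketch of how mIoU can be computed from flattened label maps. It is an illustration only and not MMSegmentation's implementation; the function name `mean_iou` and the default ignored label of 255 are assumptions made for this example.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Compute mean IoU over classes from two flat sequences of class ids.

    Pixels whose target equals ignore_index are skipped. Classes that appear
    in neither prediction nor target are left out of the average.
    """
    intersect = [0] * num_classes
    pred_area = [0] * num_classes
    target_area = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        pred_area[p] += 1
        target_area[t] += 1
        if p == t:
            intersect[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_area[c] + target_area[c] - intersect[c]
        if union > 0:
            ious.append(intersect[c] / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Toy example with 3 classes; the last pixel is ignored.
pred = [0, 0, 1, 1, 2]
target = [0, 1, 1, 1, 255]
# Class 0: IoU 1/2, class 1: IoU 2/3, class 2: absent, so mIoU = 7/12.
print(round(mean_iou(pred, target, num_classes=3), 4))
```

The values in the table are this kind of average expressed as a percentage, accumulated over the whole validation set rather than a single image.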

Citation

If you find this project useful in your research, please consider citing:

@inproceedings{chen2024internvl,
  title={InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}