Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding

This paper presents a new Vision Transformer (ViT) architecture, Multi-Scale Vision Longformer, which significantly enhances the ViT of \cite{dosovitskiy2020image} for encoding high-resolution images using two techniques. The first is the multi-scale model structure, which provides image encodings at multiple scales with manageable computational cost. The second is the attention mechanism of vision Longformer, a variant of Longformer \cite{beltagy2020longformer} originally developed for natural language processing, which achieves linear complexity w.r.t. the number of input tokens. A comprehensive empirical study shows that the new ViT significantly outperforms several strong baselines, including existing ViT models and their ResNet counterparts, as well as the Pyramid Vision Transformer from concurrent work \cite{wang2021pyramid}, on a range of vision tasks, including image classification, object detection, and segmentation. The models and source code are released at \url{https://github.com/microsoft/vision-longformer}.

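The linear complexity comes from restricting self-attention to a 2-D sliding window: each token attends only to tokens inside a small local neighborhood (plus a handful of global tokens in the full model). The sketch below illustrates that windowed-attention idea in plain PyTorch; it is not the released implementation, and the window radius `w`, the single head, the omission of query/key/value projections and global tokens, and the zero-padded (unmasked) borders are simplifying assumptions made only for illustration.

```python
# Minimal sketch of 2-D sliding-window ("conv-like") attention, the idea behind
# Vision Longformer's linear-complexity attention. NOT the authors' code: no
# multi-head, no q/k/v projections, no global tokens, and padded border
# positions are not masked out.
import torch
import torch.nn.functional as F


def local_window_attention(x, w=2):
    """x: (B, H, W, C) feature map; each token attends to its (2w+1)x(2w+1) window."""
    B, H, W, C = x.shape
    q = x.reshape(B, H * W, C)                                        # one query per token

    # Gather the (2w+1)^2 neighbours of every token (zero-padded at the borders).
    k = 2 * w + 1
    cols = F.unfold(x.permute(0, 3, 1, 2), kernel_size=k, padding=w)  # (B, C*k*k, H*W)
    cols = cols.reshape(B, C, k * k, H * W).permute(0, 3, 2, 1)       # (B, H*W, k*k, C)

    # Attention restricted to the local window.
    attn = torch.einsum('bnc,bnkc->bnk', q, cols) / C ** 0.5          # (B, H*W, k*k)
    attn = attn.softmax(dim=-1)
    out = torch.einsum('bnk,bnkc->bnc', attn, cols)                   # (B, H*W, C)
    return out.reshape(B, H, W, C)


# Usage: a 14x14 grid of 64-d tokens.
tokens = torch.randn(2, 14, 14, 64)
print(local_window_attention(tokens).shape)  # torch.Size([2, 14, 14, 64])
```

For an H x W token grid this costs O(HW * (2w+1)^2) in time and memory rather than O((HW)^2), which is what makes attention over high-resolution feature maps affordable.
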
| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Instance Segmentation | COCO minival | Mask R-CNN (ViL-Base, multi-scale, 3x lr) | mask AP | 45.7 | #45 |
| Instance Segmentation | COCO minival | Mask R-CNN (ViL-Base, multi-scale, 3x lr) | AP75 | 49.9 | #5 |
| Instance Segmentation | COCO minival | Mask R-CNN (ViL-Base, 1x lr) | mask AP | 45.1 | #46 |
| Instance Segmentation | COCO minival | Mask R-CNN (ViL-Base, 1x lr) | AP50 | 67.2 | #9 |
| Instance Segmentation | COCO minival | Mask R-CNN (ViL-Base, 1x lr) | AP75 | 49.3 | #6 |
| Object Detection | COCO minival | RetinaNet (ViL-Base, multi-scale, 3x) | box AP | 44.7 | #114 |
| Object Detection | COCO minival | RetinaNet (ViL-Base, multi-scale, 3x) | AP75 | 47.6 | #43 |
| Object Detection | COCO minival | RetinaNet (ViL-Base, multi-scale, 3x) | APS | 29.9 | #19 |
| Object Detection | COCO minival | RetinaNet (ViL-Base, multi-scale, 3x) | APM | 48 | #30 |
| Object Detection | COCO minival | RetinaNet (ViL-Base, multi-scale, 3x) | APL | 58.1 | #44 |
| Object Detection | COCO minival | RetinaNet (ViL-Base) | box AP | 44.3 | #121 |
| Object Detection | COCO minival | RetinaNet (ViL-Base) | AP50 | 65.5 | #35 |
| Object Detection | COCO minival | RetinaNet (ViL-Base) | AP75 | 47.1 | #46 |
| Object Detection | COCO minival | RetinaNet (ViL-Base) | APS | 28.9 | #21 |
| Object Detection | COCO minival | RetinaNet (ViL-Base) | APM | 47.9 | #31 |
| Object Detection | COCO minival | RetinaNet (ViL-Base) | APL | 58.3 | #43 |
| Image Classification | ImageNet | ViL-Medium-W | Top 1 Accuracy | 82.9% | #446 |
| Image Classification | ImageNet | ViL-Medium-W | Number of params | 39.8M | #673 |
| Image Classification | ImageNet | ViL-Small | Top 1 Accuracy | 82% | #531 |
| Image Classification | ImageNet | ViL-Small | Number of params | 24.6M | #584 |
| Image Classification | ImageNet | ViL-Small | GFLOPs | 4.86 | #225 |
| Image Classification | ImageNet | ViL-Tiny-RPB | Top 1 Accuracy | 76.7% | #829 |
| Image Classification | ImageNet | ViL-Tiny-RPB | Number of params | 6.7M | #451 |
| Image Classification | ImageNet | ViL-Tiny-RPB | GFLOPs | 1.3 | #118 |
| Image Classification | ImageNet | ViL-Base-D | Top 1 Accuracy | 83.2% | #414 |
| Image Classification | ImageNet | ViL-Base-D | Number of params | 55.7M | #741 |
| Image Classification | ImageNet | ViL-Base-D | GFLOPs | 13.4 | #322 |
| Image Classification | ImageNet | ViL-Base-W | Top 1 Accuracy | 81.9% | #544 |
| Image Classification | ImageNet | ViL-Base-W | Number of params | 79M | #801 |
| Image Classification | ImageNet | ViL-Base-W | GFLOPs | 6.74 | #241 |
| Image Classification | ImageNet | ViL-Medium-D | Top 1 Accuracy | 83.3% | #404 |
| Image Classification | ImageNet | ViL-Medium-D | Number of params | 39.7M | #672 |
| Image Classification | ImageNet | ViL-Medium-D | GFLOPs | 8.7 | #282 |

Methods