Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding

This paper presents a new Vision Transformer (ViT) architecture, Multi-Scale Vision Longformer, which significantly enhances the ViT of \cite{dosovitskiy2020image} for encoding high-resolution images using two techniques. The first is the multi-scale model structure, which provides image encodings at multiple scales with manageable computational cost. The second is the attention mechanism of vision Longformer, a variant of Longformer \cite{beltagy2020longformer} originally developed for natural language processing, which achieves linear complexity w.r.t. the number of input tokens. A comprehensive empirical study shows that the new ViT significantly outperforms several strong baselines, including existing ViT models and their ResNet counterparts, as well as the Pyramid Vision Transformer from concurrent work \cite{wang2021pyramid}, on a range of vision tasks, including image classification, object detection, and segmentation. The models and source code are released at \url{https://github.com/microsoft/vision-longformer}.

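The linear complexity comes from restricting self-attention to a 2-D sliding window: each token attends only to tokens inside a small local neighborhood (plus a handful of global tokens in the full model). The sketch below illustrates that windowed-attention idea in plain PyTorch; it is not the released implementation, and the window radius `w`, the single head, the omission of query/key/value projections and global tokens, and the zero-padded (unmasked) borders are simplifying assumptions made only for illustration.

```python
# Minimal sketch of 2-D sliding-window ("conv-like") attention, the idea behind
# Vision Longformer's linear-complexity attention. NOT the authors' code: no
# multi-head, no q/k/v projections, no global tokens, and padded border
# positions are not masked out.
import torch
import torch.nn.functional as F


def local_window_attention(x, w=2):
    """x: (B, H, W, C) feature map; each token attends to its (2w+1)x(2w+1) window."""
    B, H, W, C = x.shape
    q = x.reshape(B, H * W, C)                                        # one query per token

    # Gather the (2w+1)^2 neighbours of every token (zero-padded at the borders).
    k = 2 * w + 1
    cols = F.unfold(x.permute(0, 3, 1, 2), kernel_size=k, padding=w)  # (B, C*k*k, H*W)
    cols = cols.reshape(B, C, k * k, H * W).permute(0, 3, 2, 1)       # (B, H*W, k*k, C)

    # Attention restricted to the local window.
    attn = torch.einsum('bnc,bnkc->bnk', q, cols) / C ** 0.5          # (B, H*W, k*k)
    attn = attn.softmax(dim=-1)
    out = torch.einsum('bnk,bnkc->bnc', attn, cols)                   # (B, H*W, C)
    return out.reshape(B, H, W, C)


# Usage: a 14x14 grid of 64-d tokens.
tokens = torch.randn(2, 14, 14, 64)
print(local_window_attention(tokens).shape)  # torch.Size([2, 14, 14, 64])
```

For an H x W token grid this costs O(HW * (2w+1)^2) in time and memory rather than O((HW)^2), which is what makes attention over high-resolution feature maps affordable.
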
| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Instance Segmentation | COCO minival | Mask R-CNN (ViL-Base, multi-scale, 3x lr) | mask AP | 45.7 | #45 |
| Instance Segmentation | COCO minival | Mask R-CNN (ViL-Base, multi-scale, 3x lr) | AP75 | 49.9 | #5 |
| Instance Segmentation | COCO minival | Mask R-CNN (ViL-Base, 1x lr) | mask AP | 45.1 | #46 |
| Instance Segmentation | COCO minival | Mask R-CNN (ViL-Base, 1x lr) | AP50 | 67.2 | #9 |
| Instance Segmentation | COCO minival | Mask R-CNN (ViL-Base, 1x lr) | AP75 | 49.3 | #6 |
| Object Detection | COCO minival | RetinaNet (ViL-Base, multi-scale, 3x) | box AP | 44.7 | #114 |
| Object Detection | COCO minival | RetinaNet (ViL-Base, multi-scale, 3x) | AP75 | 47.6 | #43 |
| Object Detection | COCO minival | RetinaNet (ViL-Base, multi-scale, 3x) | APS | 29.9 | #19 |
| Object Detection | COCO minival | RetinaNet (ViL-Base, multi-scale, 3x) | APM | 48 | #30 |
| Object Detection | COCO minival | RetinaNet (ViL-Base, multi-scale, 3x) | APL | 58.1 | #44 |
| Object Detection | COCO minival | RetinaNet (ViL-Base) | box AP | 44.3 | #121 |
| Object Detection | COCO minival | RetinaNet (ViL-Base) | AP50 | 65.5 | #35 |
| Object Detection | COCO minival | RetinaNet (ViL-Base) | AP75 | 47.1 | #46 |
| Object Detection | COCO minival | RetinaNet (ViL-Base) | APS | 28.9 | #21 |
| Object Detection | COCO minival | RetinaNet (ViL-Base) | APM | 47.9 | #31 |
| Object Detection | COCO minival | RetinaNet (ViL-Base) | APL | 58.3 | #43 |
| Image Classification | ImageNet | ViL-Medium-W | Top 1 Accuracy | 82.9% | #446 |
| Image Classification | ImageNet | ViL-Medium-W | Number of params | 39.8M | #673 |
| Image Classification | ImageNet | ViL-Small | Top 1 Accuracy | 82% | #531 |
| Image Classification | ImageNet | ViL-Small | Number of params | 24.6M | #584 |
| Image Classification | ImageNet | ViL-Small | GFLOPs | 4.86 | #225 |
| Image Classification | ImageNet | ViL-Tiny-RPB | Top 1 Accuracy | 76.7% | #829 |
| Image Classification | ImageNet | ViL-Tiny-RPB | Number of params | 6.7M | #451 |
| Image Classification | ImageNet | ViL-Tiny-RPB | GFLOPs | 1.3 | #118 |
| Image Classification | ImageNet | ViL-Base-D | Top 1 Accuracy | 83.2% | #414 |
| Image Classification | ImageNet | ViL-Base-D | Number of params | 55.7M | #741 |
| Image Classification | ImageNet | ViL-Base-D | GFLOPs | 13.4 | #322 |
| Image Classification | ImageNet | ViL-Base-W | Top 1 Accuracy | 81.9% | #544 |
| Image Classification | ImageNet | ViL-Base-W | Number of params | 79M | #801 |
| Image Classification | ImageNet | ViL-Base-W | GFLOPs | 6.74 | #241 |
| Image Classification | ImageNet | ViL-Medium-D | Top 1 Accuracy | 83.3% | #404 |
| Image Classification | ImageNet | ViL-Medium-D | Number of params | 39.7M | #672 |
| Image Classification | ImageNet | ViL-Medium-D | GFLOPs | 8.7 | #282 |

Methods