PVT v2: Improved Baselines with Pyramid Vision Transformer

25 Jun 2021  ·  Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao

Transformers have recently shown encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) with three designs: (1) a linear-complexity attention layer, (2) overlapping patch embedding, and (3) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and achieves significant improvements on fundamental vision tasks such as classification, detection, and segmentation. Notably, PVT v2 achieves performance comparable to or better than recent works such as the Swin Transformer. We hope this work will facilitate state-of-the-art Transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
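To make the first two designs concrete, the sketch below does the back-of-envelope arithmetic behind them: an overlapping patch embedding is just a strided convolution whose kernel is larger than its stride, and the linear-complexity attention pools the keys/values down to a fixed spatial size so attention cost grows linearly in the number of tokens. This is not the authors' code; the specific kernel/stride/pool values (7×7 kernel, stride 4, 7×7 pooled size) are assumptions based on the paper's typical stage-1 settings.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1

# Overlapping patch embedding: a stride-4 conv with a 7x7 kernel (padding 3)
# gives the same 4x downsampling as a non-overlapping 4x4 patchify, but
# neighboring patches now share pixels, so local continuity is preserved.
H = 224
h_overlap = conv_out(H, kernel=7, stride=4, padding=3)   # 56
h_patchify = conv_out(H, kernel=4, stride=4, padding=0)  # 56
assert h_overlap == h_patchify == 56

# Attention cost (entries in the QK^T matrix, up to constants):
n = h_overlap * h_overlap      # 3136 tokens at stage 1
full_attention = n * n         # vanilla self-attention: O(n^2)
pooled_attention = n * (7 * 7) # keys/values pooled to 7x7 tokens: O(n)
print(n, full_attention, pooled_attention)
```

Because the pooled key/value length (49 here) is a constant independent of the input resolution, the second cost scales linearly with the token count, which is the sense in which PVT v2's attention is linear.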


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Object Detection | COCO minival | Sparse R-CNN (PVTv2-B2) | box AP | 50.1 | # 76 |
| | | | AP50 | 69.5 | # 23 |
| | | | AP75 | 54.9 | # 18 |
| Object Detection | COCO-O | PVTv2-B5 (Mask R-CNN) | Average mAP | 28.2 | # 23 |
| | | | Effective Robustness | 6.85 | # 17 |
| Image Classification | ImageNet | PVTv2-B0 | Top 1 Accuracy | 70.5% | # 947 |
| | | | Number of params | 3.4M | # 375 |
| | | | GFLOPs | 0.6 | # 65 |
| Image Classification | ImageNet | PVTv2-B1 | Top 1 Accuracy | 78.7% | # 745 |
| | | | Number of params | 13.1M | # 508 |
| | | | GFLOPs | 2.1 | # 151 |
| Image Classification | ImageNet | PVTv2-B2 | Top 1 Accuracy | 82.0% | # 529 |
| | | | Number of params | 25.4M | # 597 |
| | | | GFLOPs | 4.0 | # 191 |
| Image Classification | ImageNet | PVTv2-B3 | Top 1 Accuracy | 83.2% | # 412 |
| | | | Number of params | 45.2M | # 710 |
| | | | GFLOPs | 6.9 | # 248 |
| Image Classification | ImageNet | PVTv2-B4 | Top 1 Accuracy | 83.8% | # 357 |
| | | | Number of params | 82M | # 810 |
| | | | GFLOPs | 11.8 | # 313 |
