ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases

19 Mar 2021  ·  Stéphane d'Ascoli, Hugo Touvron, Matthew Leavitt, Ari Morcos, Giulio Biroli, Levent Sagun

Convolutional architectures have proven extremely successful for vision tasks. Their hard inductive biases enable sample-efficient learning, but come at the cost of a potentially lower performance ceiling. Vision Transformers (ViTs) rely on more flexible self-attention layers, and have recently outperformed CNNs for image classification. However, they require costly pre-training on large external datasets or distillation from pre-trained convolutional networks. In this paper, we ask the following question: is it possible to combine the strengths of these two architectures while avoiding their respective limitations? To this end, we introduce gated positional self-attention (GPSA), a form of positional self-attention which can be equipped with a "soft" convolutional inductive bias. We initialise the GPSA layers to mimic the locality of convolutional layers, then give each attention head the freedom to escape locality by adjusting a gating parameter regulating the attention paid to position versus content information. The resulting convolutional-like ViT architecture, ConViT, outperforms DeiT on ImageNet, while offering much improved sample efficiency. We further investigate the role of locality in learning by first quantifying how it is encouraged in vanilla self-attention layers, then analysing how it is escaped in GPSA layers. We conclude by presenting various ablations to better understand the success of the ConViT. Our code and models are released publicly at https://github.com/facebookresearch/convit.
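The gating mechanism described above can be sketched per attention head as a convex combination of a content-attention map and a positional-attention map, weighted by a learned sigmoid gate. The following is a minimal NumPy sketch under simplified assumptions: function and variable names are illustrative, and the locality initialisation of the positional logits (which in ConViT comes from relative patch positions) is abstracted into a generic `pos_scores` argument.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gpsa_attention(q, k, pos_scores, gate):
    """Gated positional self-attention for a single head (illustrative sketch).

    q, k       : (n_patches, d_head) content queries and keys
    pos_scores : (n_patches, n_patches) positional attention logits
                 (in ConViT, derived from relative patch positions and
                  initialised to mimic a convolutional kernel's locality)
    gate       : scalar gating parameter lambda; sigmoid(lambda) sets the
                 weight given to position versus content information
    """
    d = q.shape[-1]
    content = softmax(q @ k.T / np.sqrt(d))   # standard content self-attention
    position = softmax(pos_scores)            # purely positional attention
    sigma = 1.0 / (1.0 + np.exp(-gate))       # sigmoid gate
    # sigma -> 1: convolution-like, position-only attention (the initialisation)
    # sigma -> 0: the head has "escaped locality" into vanilla self-attention
    return (1.0 - sigma) * content + sigma * position
```

Because each head carries its own gate, the network can keep some heads local (convolution-like) while others learn content-based, long-range attention.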


Datasets

ImageNet

Results from the Paper


Task: Image Classification · Dataset: ImageNet

Model        Top 1 Accuracy   Params   GFLOPs
ConViT-Ti    73.1%            6M       1
ConViT-Ti+   76.7%            10M      2
ConViT-S     81.3%            27M      5.4
ConViT-S+    82.2%            48M      10
ConViT-B     82.4%            86M      17
ConViT-B+    82.5%            152M     30
