ConvFormer: Closing the Gap Between CNN and Vision Transformers

Zimian Wei, Hengyue Pan, Xin Niu, Dongsheng Li

Vision transformers have shown excellent performance in computer vision tasks. However, the computation cost of their (local) self-attention mechanism is expensive. Comparatively, CNN is more efficient with built-in inductive bias. Recent works show that CNN is promising to compete with vision transformers by learning their architecture design and training protocols. Nevertheless, existing methods either ignore multi-level features or lack dynamic prosperity, leading to sub-optimal performance. In this paper, we propose a novel attention mechanism named MCA, which captures different patterns of input images by multiple kernel sizes and enables input-adaptive weights with a gating mechanism. Based on MCA, we present a neural network named ConvFormer. ConvFormer adopts the general architecture of vision transformers, while replacing the (local) self-attention mechanism with our proposed MCA. Extensive experimental results demonstrated that ConvFormer outperforms similar size vision transformers(ViTs) and convolutional neural networks (CNNs) in various tasks. For example, ConvFormer-S, ConvFormer-L achieve state-of-the-art performance of 82.8%, 83.6% top-1 accuracy on ImageNet dataset. Moreover, ConvFormer-S outperforms Swin-T by 1.5 mIoU on ADE20K, and 0.9 bounding box AP on COCO with a smaller model size. Code and models will be available.

Knowledge Graph



Sign up or login to leave a comment