Computer vision has seen significant progress in the last decade, and this progress can be mainly attributed to the emergence of convolutional neural networks (CNNs). CNNs excel at processing 2D data thanks to their hierarchical feature extraction mechanism, which is a key factor behind their success.
Modern CNNs have come a long way since their introduction: updated training recipes, data augmentations, improved network design paradigms, and more. The literature is full of successful proposals that make CNNs more powerful and efficient.
The open-source culture of the computer vision community has also contributed to significant improvements. Thanks to pre-trained large-scale visual models, feature learning has become far more efficient, so the majority of vision models no longer need to be trained from scratch.
Today, the performance of a vision model is mainly determined by three factors: the chosen neural network architecture, the training method, and the training data. An improvement in any of these three leads to a significant boost in overall performance.
Of these three, innovation in network architecture has played a particularly important role. CNNs eliminated the need for manual feature engineering by enabling generic feature learning. Then came the breakthrough of the transformer architecture in natural language processing, and transformers soon moved to the vision domain; they are quite successful thanks to their strong scaling behavior in both data and model size. Most recently, the ConvNeXt architecture was introduced. It modernizes traditional convolutional networks and shows that pure convolutional models can also scale.
There is a small problem here, though. All of this “progress” is measured through a single computer vision task: image recognition performance on ImageNet. It is still the most common way to explore the design space of neural network architectures.
On the other hand, researchers have been looking at different ways to teach neural networks how to process images. Instead of using labeled images, they use a self-supervised approach where the network has to figure out what is in the image on its own. Masked autoencoders are one of the most popular ways to achieve this; they are based on masked language modeling, a technique widely used in natural language processing.
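The core of the idea is simple: randomly hide a large fraction of the image patches and train the network to reconstruct the hidden parts from the visible ones. Below is a minimal sketch of just the masking step in NumPy; the function name is hypothetical, and the 0.6 masking ratio and 32×32 patch size follow the values reported for ConvNeXt V2's fully convolutional masked autoencoder.

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio=0.6, seed=0):
    """Randomly select patches to hide (True = masked / hidden from the encoder)."""
    rng = np.random.default_rng(seed)
    num_masked = int(num_patches * mask_ratio)
    mask = np.zeros(num_patches, dtype=bool)
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    mask[masked_idx] = True
    return mask

# A 224x224 image split into 32x32 patches yields a 7x7 = 49-patch grid.
mask = random_patch_mask(49, mask_ratio=0.6)
```

During pre-training, the network only sees the unmasked patches and the reconstruction loss is computed on the masked ones.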
It is possible to mix and match different techniques when training neural networks, but it is difficult. One could combine ConvNeXt with masked autoencoders. However, since masked autoencoders are designed to work best with transformers, which process only the visible patches as a sequence, using them with convolutional networks may be too computationally expensive. The design may also be incompatible with convolutional networks because the sliding-window mechanism lets information leak between visible and masked regions. Moreover, previous research has shown that it can be tough to get good results when combining self-supervised methods like masked autoencoders with convolutional networks. It is therefore important to remember that different architectures may have different feature learning behaviors that affect the quality of the final result.
This is where ConvNeXt V2 comes into play. It is a co-designed architecture that uses a masked autoencoder within the ConvNeXt framework to achieve results comparable to those obtained with transformers. It is a step towards making mask-based self-supervised learning effective for ConvNeXt models.
Designing a masked autoencoder for ConvNeXt was the first challenge, and the authors solved it in a clever way. They treat the masked input as a set of sparse patches and use sparse convolutions to process only the visible parts. Moreover, the transformer decoder of the masked autoencoder is replaced with a single ConvNeXt block, which makes the whole structure fully convolutional and increases pre-training efficiency.
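To see why this matters, note that a sparse convolution can be emulated with dense operations by zeroing out masked positions both before and after the convolution, so no information leaks across the mask boundary. The sketch below illustrates this idea with a plain 3×3 convolution in NumPy; it is an illustrative emulation under that assumption, not the authors' actual sparse-convolution implementation.

```python
import numpy as np

def masked_conv3x3(x, visible, kernel):
    """Emulate a sparse convolution with dense ops: zero masked positions
    before and after the convolution so information never flows between
    visible and masked regions.
    x: (H, W) feature map, visible: (H, W) bool mask, kernel: (3, 3)."""
    x = np.where(visible, x, 0.0)          # hide masked inputs
    padded = np.pad(x, 1)
    out = np.zeros(x.shape, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = (padded[i:i + 3, j:j + 3] * kernel).sum()
    return np.where(visible, out, 0.0)     # discard outputs at masked positions
```

A dedicated sparse-convolution library skips the masked positions entirely rather than computing and discarding them, which is where the pre-training efficiency gain comes from.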
Finally, a global response normalization (GRN) layer is added to the framework to enhance inter-channel feature competition. However, this change is only effective when the model is trained with masked autoencoders; reusing a fixed architectural design from supervised learning may therefore be suboptimal.
ConvNeXt V2 improves performance when used together with masked autoencoders because it is designed specifically for self-supervised learning. Fully convolutional masked autoencoder pre-training can significantly improve the performance of pure convolutional networks.
Check out the paper and GitHub. All credit for this research goes to the researchers on this project.
Ekrem Çetinkaya received his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis on image denoising using deep convolutional networks. He is currently pursuing a Ph.D. degree at the University of Klagenfurt, Austria, where he works as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.