An Overview of Vision Transformers and Deep Learning Methods for Classifying Remote Sensing Images

Keerthishree P V, Suhas G K, S G Gollagi, Yathisha L

Abstract

The diverse, heterogeneous, and high-dimensional nature of remote-sensing images makes remote-sensing image scene classification (RSISC) an important and challenging task for understanding changes on Earth's surface. The main goal of RSISC is to assign semantic labels to acquired images so that they can be organized according to their semantic content. Deep learning frameworks, especially for image analysis, have seen a sharp rise in interest and development in recent years. Although deep learning approaches are more computationally expensive than conventional machine learning techniques, they have demonstrated great potential in this field. This study provides a thorough evaluation of several deep learning (DL) methods, including Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) such as ResNet, VGG16, InceptionV3, and DenseNet. We use the publicly available NWPU-RESISC45 and RSI-CB256 remote sensing datasets to assess how well these models perform. The findings show that although conventional CNN architectures perform competitively, Vision Transformers are better at capturing intricate spatial correlations in the data for remote sensing image classification. Because Vision Transformers rely on self-attention, they efficiently model complex spatial relationships and long-range dependencies, which makes them perform exceptionally well on remote sensing imagery. Furthermore, their patch-based processing enables multi-scale feature extraction, which improves accuracy, particularly on high-resolution images.
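
To make the patch-based, self-attention pipeline described above concrete, the following is a minimal, illustrative PyTorch sketch of a ViT-style scene classifier. It is not code from the study; the class name `MiniViT`, the image/patch sizes, and the layer widths are assumptions chosen for brevity, and the 45-class head simply mirrors the number of scene categories in NWPU-RESISC45.

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Minimal ViT-style classifier: patch embedding + transformer encoder + linear head."""
    def __init__(self, image_size=256, patch_size=16, dim=192, depth=4,
                 heads=3, num_classes=45):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding: a strided convolution splits the image into
        # non-overlapping patches and projects each patch to a `dim`-d token.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 4,
            batch_first=True, norm_first=True)
        # Self-attention lets every patch attend to every other patch,
        # capturing long-range spatial dependencies across the whole scene.
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])                            # classify from the [CLS] token

# Example: a batch of four 256x256 RGB scenes -> logits over 45 scene classes.
logits = MiniViT()(torch.randn(4, 3, 256, 256))
print(logits.shape)  # torch.Size([4, 45])
```

In practice the reviewed ViT models are pretrained on large image corpora and fine-tuned on the remote sensing datasets, rather than trained from scratch as this sketch implies.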
