Transformer-based U-Net for Semantic Segmentation of Aerial Imagery
Abstract
Semantic segmentation is an integral component of computer vision, providing detailed scene analysis by classifying each pixel in an image. It is particularly valuable in remote sensing applications such as land cover mapping, urban change detection, and environmental protection. However, semantic segmentation models often struggle to capture both local and global context effectively. Traditional machine learning models are limited by suboptimal feature extraction, sensitivity to noisy data, and poor adaptation to varying data distributions. Deep learning models address these challenges with improved adaptability and feature learning capabilities. In particular, Transformer architectures have shown promise in modeling global information, leading to enhanced performance in various vision tasks, including semantic segmentation. In this work, we propose a novel approach that integrates a Transformer-based decoder into the U-Net architecture for real-time urban scene segmentation. The model combines a CNN-based encoder, using ResNet-101 for feature extraction, with a Transformer-based decoder to capture both local and global context. This hybrid architecture improves the segmentation of complex urban elements, enabling the model to delineate fine details as well as large-scale structures. The proposed model is evaluated on the UAVid dataset, achieving 89% accuracy and 80% mean Intersection over Union (mIoU), confirming its effectiveness for urban scene segmentation.
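To make the described hybrid concrete, the following is a minimal PyTorch sketch of one plausible reading of the architecture: a ResNet-101 encoder whose deepest features are passed through Transformer self-attention for global context, followed by a U-Net-style decoder with skip connections. Layer widths, the number of Transformer layers, and the class count are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch of a ResNet-101 + Transformer U-Net hybrid.
# Hyperparameters (embed_dim, nhead, num_layers, num_classes) are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet101

class TransformerUNet(nn.Module):
    def __init__(self, num_classes=8, embed_dim=512, nhead=8, num_layers=4):
        super().__init__()
        backbone = resnet101(weights=None)
        # Encoder stages; intermediate feature maps feed the skip connections.
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.layer1 = backbone.layer1   # 256 channels, 1/4 resolution
        self.layer2 = backbone.layer2   # 512 channels, 1/8
        self.layer3 = backbone.layer3   # 1024 channels, 1/16
        self.layer4 = backbone.layer4   # 2048 channels, 1/32

        # Project the deepest features and model global context with
        # self-attention over the flattened spatial grid.
        self.proj = nn.Conv2d(2048, embed_dim, kernel_size=1)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=nhead,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

        # U-Net-style decoder: upsample, concatenate the skip, fuse with convs.
        self.up3 = self._fuse(embed_dim + 1024, 256)
        self.up2 = self._fuse(256 + 512, 128)
        self.up1 = self._fuse(128 + 256, 64)
        self.head = nn.Conv2d(64, num_classes, kernel_size=1)

    @staticmethod
    def _fuse(in_ch, out_ch):
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        s0 = self.stem(x)
        s1 = self.layer1(s0)
        s2 = self.layer2(s1)
        s3 = self.layer3(s2)
        s4 = self.layer4(s3)

        # Flatten HxW into a token sequence so attention sees the whole scene.
        f = self.proj(s4)
        b, c, h, w = f.shape
        tokens = self.transformer(f.flatten(2).transpose(1, 2))  # (B, HW, C)
        f = tokens.transpose(1, 2).reshape(b, c, h, w)

        up = F.interpolate(f, size=s3.shape[2:], mode='bilinear',
                           align_corners=False)
        d3 = self.up3(torch.cat([up, s3], dim=1))
        up = F.interpolate(d3, size=s2.shape[2:], mode='bilinear',
                           align_corners=False)
        d2 = self.up2(torch.cat([up, s2], dim=1))
        up = F.interpolate(d2, size=s1.shape[2:], mode='bilinear',
                           align_corners=False)
        d1 = self.up1(torch.cat([up, s1], dim=1))
        # Restore the input resolution for per-pixel class logits.
        return F.interpolate(self.head(d1), size=x.shape[2:],
                             mode='bilinear', align_corners=False)
```

For a 512x512 input, the deepest feature map is 16x16, so the Transformer attends over 256 tokens; the skip connections then recover the spatial detail that attention alone would lose, which is the division of labor between local and global context the abstract describes.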