Hybrid Intelligence in Image Segmentation: Weaving Context and Detail with High Fidelity using Attention-Enhanced ViT-CNNs
Abstract
Image segmentation is a fundamental problem in computer vision that aims to partition an image into semantically meaningful regions. In this work, we combine two advanced deep learning components, hierarchical feature-extraction architectures and recent attention mechanisms, to improve the effectiveness of image segmentation. We introduce a new architecture, TransCNN-Seg, which couples the global attention mechanisms of the Transformer with the local feature extraction of convolutional neural networks (CNNs). This integration exploits the Transformer's expressive modeling of long-range dependencies and the CNN's capacity to capture fine-grained spatial detail. The proposed model follows a multi-stage segmentation pipeline built on a Vision Transformer augmented with spatial-channel attention modules and an enhanced decoder that uses attention-gated skip connections. The architecture performs well on challenging segmentation problems, notably tumor segmentation in medical imaging and road-scene segmentation for autonomous driving. On the standard mean Intersection over Union (mIoU) metric, TransCNN-Seg achieves an mIoU of at least 83.7%, a 2.8% improvement over previous state-of-the-art methods, when evaluated on the Cityscapes dataset and a standardized medical imaging dataset. Relative to a pure-CNN counterpart and averaged over all tested methods, it improves the Boundary F1-score by 13% within occluded regions and 5% outside them, and improves the Concave F1-score by 10%.
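To make the decoder design concrete, the attention-gated skip connection described above can be sketched as follows. This is a minimal NumPy illustration in the style of Attention U-Net gating, not the paper's implementation: the function name `attention_gated_skip` and the projection matrices `W_s`, `W_g`, and `psi` are assumed names introduced here for exposition. Features are flattened to shape (positions, channels); the decoder's coarser "gating" signal is assumed to already be upsampled to the skip resolution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_gated_skip(skip, gating, W_s, W_g, psi):
    """Sketch of an attention gate over an encoder skip connection.

    skip:   encoder features at this scale, shape (H*W, C_s)
    gating: decoder features upsampled to the same resolution, shape (H*W, C_g)
    W_s, W_g: learned projections to a shared intermediate dimension
    psi:    learned projection to one scalar gate per spatial position
    """
    # Project both feature maps into a shared space and fuse additively
    q = np.maximum(skip @ W_s + gating @ W_g, 0.0)  # ReLU activation
    # Scalar attention coefficient per position, squashed into (0, 1)
    alpha = sigmoid(q @ psi)                        # shape (H*W, 1)
    # Suppress skip features where the decoder context deems them irrelevant,
    # before they are concatenated back into the decoder path
    return skip * alpha

# Toy example with random features and weights
rng = np.random.default_rng(0)
skip = rng.standard_normal((16, 8))    # 4x4 feature map, 8 channels
gating = rng.standard_normal((16, 4))  # decoder signal, 4 channels
W_s = rng.standard_normal((8, 6))
W_g = rng.standard_normal((4, 6))
psi = rng.standard_normal((6, 1))
gated = attention_gated_skip(skip, gating, W_s, W_g, psi)
```

Because the gate lies in (0, 1), the gated skip features are always a damped copy of the originals; the decoder thus receives fine CNN detail only where the global context supports it.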