Transformer Models in Digital Image Processing: A Systematic Review of Architectures and Applications

Manar Abdulkareem Al-Abaji; Meaad Salih; Maher Khalaf Hussein

doi:10.33387/protk.v13i2.11858

Authors

Manar Abdulkareem Al-Abaji, University of Mosul
Meaad Salih, University of Mosul
Maher Khalaf Hussein, University of Mosul

Keywords:

Vision Transformer (ViT), Self-Attention, Image Segmentation, Object Detection, Swin Transformer

Abstract

The past few years have brought a rapid and profound shift in digital image processing: Transformer-based architectures have come to dominate a wide range of tasks, displacing long-standing convolutional counterparts. The self-attention mechanism of Transformer models, which originated in natural language processing, captures long-range spatial relationships in images more effectively than the inherently limited receptive fields of Convolutional Neural Networks (CNNs). In this paper, we present a systematic review of Transformer architectures for digital image processing from 2020 to 2026, covering the key foundational models, namely Vision Transformer (ViT), Swin Transformer, DeiT, and BEiT, together with their numerous variants. We trace the development of these models from simple image classification to complex tasks, including object detection, semantic and instance segmentation, image restoration, medical imaging, and generative image synthesis. We identify four major trends in architectural design: purely Transformer-based vision models, CNN-Transformer hybrid architectures, hierarchical windowed-attention networks, and diffusion-Transformer fusion models. We also provide a structured comparative analysis of 42 influential methods on 18 benchmark datasets, covering their performance trajectories, computational and memory trade-offs, and emerging best practices in model design.
Finally, we discuss open challenges, such as the quadratic computational cost of standard attention, the need for large-scale pre-training data, and limited domain generalization, and we summarize future directions, including more efficient attention, tighter integration of multi-modal information, and lightweight Transformer designs for edge and resource-constrained devices. This review is intended as a rigorous and timely reference for researchers and practitioners interested in advancing visual intelligence with Transformer-based methods.
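
The quadratic cost of standard attention and the appeal of windowed attention can be illustrated with a small counting exercise. The sketch below is not taken from the review: the function names are hypothetical, and the image size, patch sizes, and window size are illustrative assumptions. It only counts query-key pairs, ignoring heads, channels, and shifted windows.

```python
def num_tokens(height, width, patch):
    """Number of non-overlapping patch tokens for an image (illustrative helper)."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    return (height // patch) * (width // patch)


def global_attention_pairs(n_tokens):
    """Query-key pairs when every token attends to every token: N * N."""
    return n_tokens * n_tokens


def windowed_attention_pairs(tokens_h, tokens_w, window):
    """Query-key pairs when attention is restricted to window x window blocks."""
    if tokens_h % window or tokens_w % window:
        raise ValueError("token grid must be divisible by window size")
    n_windows = (tokens_h // window) * (tokens_w // window)
    per_window = window * window  # tokens inside one window
    return n_windows * per_window * per_window


# Assumed setting: 224x224 input, 16x16 patches (a common ViT configuration).
n_small = num_tokens(224, 224, 16)   # 14 * 14 tokens
n_large = num_tokens(448, 448, 16)   # doubling the side length gives 4x the tokens

# Assumed setting: 4x4 patches and 7x7 windows on the same 224x224 input.
n_fine = num_tokens(224, 224, 4)     # 56 * 56 tokens
global_fine = global_attention_pairs(n_fine)
window_fine = windowed_attention_pairs(56, 56, 7)

if __name__ == "__main__":
    print("global pairs, 224 px:", global_attention_pairs(n_small))
    print("global pairs, 448 px:", global_attention_pairs(n_large))
    print("fine grid, global vs windowed:", global_fine, window_fine)
```

Under these assumptions, doubling the image side multiplies the global pair count by 16, while restricting attention to fixed-size windows makes the count grow linearly with the number of tokens.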
