Research Article | Peer-Reviewed

A Comparative Study of Text-to-Image Synthesis Techniques Using Generative Adversarial Networks

Received: 8 April 2025     Accepted: 19 April 2025     Published: 22 May 2025
Abstract

Text-to-image synthesis using Generative Adversarial Networks (GANs) has become a pivotal area of research, offering significant potential in automated content generation and multimodal understanding. This study provides a comparative evaluation of six prominent GAN-based models—namely, the foundational work by Reed et al., StackGAN, AttnGAN, MirrorGAN, MimicGAN, and In-domain GAN Inversion—applied to a standardized dataset under consistent conditions. The analysis focused on four key performance dimensions: visual quality, semantic alignment between text and image, training stability, and robustness to noise in textual input. The results reveal a clear progression in model capability over time. While early models laid essential groundwork, they were limited in resolution and semantic coherence. Subsequent models introduced architectural innovations such as multi-stage generation, attention mechanisms, and semantic feedback loops, which significantly enhanced image fidelity and alignment with textual descriptions. Notably, AttnGAN and MirrorGAN achieved strong alignment performance due to their integration of attention and redescription modules, respectively. MimicGAN demonstrated superior robustness to noisy or ambiguous inputs, addressing a critical gap in earlier approaches. In contrast, In-domain GAN Inversion, though not a traditional text-to-image method, offered high image quality and valuable insights for latent-space manipulation. Overall, the comparative findings emphasize the trade-offs between model complexity and performance gains. Advances in attention, robustness, and semantic feedback have led to more reliable and realistic image synthesis. This study contributes a structured overview of current approaches and identifies pathways for future research aimed at balancing accuracy, interpretability, and generalizability in text-to-image systems.
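
For readers unfamiliar with how text conditioning enters a GAN, the short PyTorch sketch below illustrates the core mechanism introduced by Reed et al. and shared, in elaborated form, by the later models compared in this study: a sentence embedding is concatenated with a noise vector inside the generator, and the discriminator scores image/text pairs jointly. This is a minimal illustration rather than any paper's exact architecture; the layer sizes, the 64x64 output resolution, and the random placeholder text embeddings are assumptions made for brevity.

# Minimal text-conditional GAN sketch (illustrative only, not a published architecture).
import torch
import torch.nn as nn

class TextConditionalGenerator(nn.Module):
    def __init__(self, noise_dim=100, text_dim=256, proj_dim=128):
        super().__init__()
        # Compress the sentence embedding before conditioning the generator.
        self.text_proj = nn.Sequential(nn.Linear(text_dim, proj_dim), nn.LeakyReLU(0.2))
        self.net = nn.Sequential(
            nn.ConvTranspose2d(noise_dim + proj_dim, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(True),
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),   # 64x64 RGB output
        )

    def forward(self, z, text_emb):
        cond = self.text_proj(text_emb)
        # Concatenate noise and condition, reshape to a 1x1 spatial map for upsampling.
        x = torch.cat([z, cond], dim=1).unsqueeze(-1).unsqueeze(-1)
        return self.net(x)

class TextConditionalDiscriminator(nn.Module):
    def __init__(self, text_dim=256, proj_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
            nn.Conv2d(256, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
        )  # 64x64 image -> (256, 4, 4) feature map
        self.text_proj = nn.Sequential(nn.Linear(text_dim, proj_dim), nn.LeakyReLU(0.2))
        # Score the image features together with the spatially replicated text condition.
        self.head = nn.Sequential(nn.Conv2d(256 + proj_dim, 1, 4, 1, 0), nn.Sigmoid())

    def forward(self, img, text_emb):
        h = self.features(img)
        cond = self.text_proj(text_emb).unsqueeze(-1).unsqueeze(-1).expand(-1, -1, 4, 4)
        return self.head(torch.cat([h, cond], dim=1)).view(-1)

# Usage: one forward pass with random stand-ins for a pretrained text encoder's output.
if __name__ == "__main__":
    G, D = TextConditionalGenerator(), TextConditionalDiscriminator()
    z = torch.randn(4, 100)          # noise vectors
    text_emb = torch.randn(4, 256)   # placeholder sentence embeddings
    fake = G(z, text_emb)            # (4, 3, 64, 64)
    score = D(fake, text_emb)        # per-image probability of "real and matching"
    print(fake.shape, score.shape)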

Published in American Journal of Neural Networks and Applications (Volume 11, Issue 1)
DOI 10.11648/j.ajae.20231001.13
Page(s) 24-30
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2025. Published by Science Publishing Group

Keywords

Generative Adversarial Networks (GANs), Text to Image Synthesis, StackGAN, AttnGAN, MirrorGAN, MimicGAN, In-domain GAN Inversion

References
[1] K. Ganguly, (2017), “Learning Generative Adversarial Networks: Next-generation deep learning simplified”. Packt Publishing.
[2] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein, (2017), “Unrolled Generative Adversarial Networks”, in Proceedings of the International Conference on Learning Representations (ICLR), pp. 1–25.
[3] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, (2016), “Generative Adversarial Text to Image Synthesis”, in Proceedings of the 33rd International Conference on Machine Learning (ICML), PMLR vol. 48.
[4] H. Zhang et al., (2017), “StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks”, in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5907–5915.
[5] Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., and He, X., (2018), “AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks”, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1316–1324.
[6] Qiao, T., Zhang, J., Xu, D., and Lu, H., (2019), “MirrorGAN: Learning Text-to-Image Generation by Redescription”, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1505–1514.
[7] Zhao, J., Zhang, Y., He, X., and Xing, E. P., (2020), “MimicGAN: Robust Projection onto Image Manifolds with Corruption Mimicking”, International Journal of Computer Vision, 128(184), Springer.
[8] Zhu, Z., Huang, W., Zhan, D., Dong, D., Yan, J., and Liu, W., (2020), “In-domain GAN Inversion for Real Image Editing”, in Proceedings of the European Conference on Computer Vision (ECCV).
[9] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y., (2014), “Generative Adversarial Nets”, in NeurIPS.
[10] Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y., (2018), “Spectral Normalization for Generative Adversarial Networks”, in ICLR.
[11] Zhang, H., Goodfellow, I., Metaxas, D., and Odena, A., (2019), “Self-Attention Generative Adversarial Networks”, in ICML.
[12] Bau, D., Strobelt, H., Peebles, W., Wulff, J., Zhou, B., Zhu, J. Y., and Torralba, A., (2019), “Semantic Photo Manipulation with a Generative Image Prior”, SIGGRAPH.
[13] Perarnau, G., Van De Weijer, J., Raducanu, B., and Alvarez, J. M., (2016), “Invertible Conditional GANs for Image Editing”, in NeurIPS Workshop.
[14] Brock, A., Donahue, J., and Simonyan, K., (2019), “Large Scale GAN Training for High Fidelity Natural Image Synthesis”, in ICLR.
[15] Karras, T., Laine, S., and Aila, T., (2019), “A Style-Based Generator Architecture for Generative Adversarial Networks”, in CVPR.
[16] Shen, Y., Gu, J., Tang, X., and Zhou, B., (2020), “Interpreting the Latent Space of GANs for Semantic Face Editing”, in CVPR.
[17] A. Nguyen, J. Clune, Y. Bengio, A. Dosovitskiy, and J. Yosinski, (2017), “Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space”, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[19] P. Salehi and A. Chalechale, (2020), “Pix2Pix-based Stain-to-Stain Translation: A Solution for Robust Stain Normalization in Histopathology Images Analysis”, arXiv preprint arXiv:2002.00647.
[20] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, (2018), “High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs”, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8798–8807.
[21] A. Brock, J. Donahue, and K. Simonyan, (2019), “Large Scale GAN Training for High Fidelity Natural Image Synthesis”, in Proceedings of the International Conference on Learning Representations (ICLR).
[22] L. Tran, X. Yin, and X. Liu, (2017), “Disentangled Representation Learning GAN for Pose-Invariant Face Recognition”, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1415–1424.
Cite This Article
  • APA Style

    Kraitem, Z. (2025). A Comparative Study of Text-to-Image Synthesis Techniques Using Generative Adversarial Networks. American Journal of Neural Networks and Applications, 11(1), 24-30. https://doi.org/10.11648/j.ajae.20231001.13


    ACS Style

    Kraitem, Z. A Comparative Study of Text-to-Image Synthesis Techniques Using Generative Adversarial Networks. Am. J. Neural Netw. Appl. 2025, 11(1), 24-30. doi: 10.11648/j.ajae.20231001.13


    AMA Style

    Kraitem Z. A Comparative Study of Text-to-Image Synthesis Techniques Using Generative Adversarial Networks. Am J Neural Netw Appl. 2025;11(1):24-30. doi: 10.11648/j.ajae.20231001.13


  • @article{10.11648/j.ajae.20231001.13,
      author = {Zaid Kraitem},
  title = {A Comparative Study of Text-to-Image Synthesis Techniques Using Generative Adversarial Networks},
      journal = {American Journal of Neural Networks and Applications},
      volume = {11},
      number = {1},
      pages = {24-30},
      doi = {10.11648/j.ajae.20231001.13},
      url = {https://doi.org/10.11648/j.ajae.20231001.13},
      eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ajae.20231001.13},
      abstract = {Text-to-image synthesis using Generative Adversarial Networks (GANs) has become a pivotal area of research, offering significant potential in automated content generation and multimodal understanding. This study provides a comparative evaluation of six prominent GAN-based models—namely, the foundational work by Reed et al., StackGAN, AttnGAN, MirrorGAN, MimicGAN, and In-domain GAN Inversion—applied to a standardized dataset under consistent conditions. The analysis focused on four key performance dimensions: visual quality, semantic alignment between text and image, training stability, and robustness to noise in textual input. The results reveal a clear progression in model capability over time. While early models laid essential groundwork, they were limited in resolution and semantic coherence. Subsequent models introduced architectural innovations such as multi-stage generation, attention mechanisms, and semantic feedback loops, which significantly enhanced image fidelity and alignment with textual descriptions. Notably, AttnGAN and MirrorGAN achieved strong alignment performance due to their integration of attention and redescription modules, respectively. MimicGAN demonstrated superior robustness to noisy or ambiguous inputs, addressing a critical gap in earlier approaches. In contrast, In-domain GAN Inversion, though not a traditional text-to-image method, offered high image quality and valuable insights for latent-space manipulation. Overall, the comparative findings emphasize the trade-offs between model complexity and performance gains. Advances in attention, robustness, and semantic feedback have led to more reliable and realistic image synthesis. This study contributes a structured overview of current approaches and identifies pathways for future research aimed at balancing accuracy, interpretability, and generalizability in text-to-image systems.
    },
     year = {2025}
    }
    


  • TY  - JOUR
    T1  - A Comparative Study of Text-to-Image Synthesis Techniques Using Generative Adversarial Networks
    
    AU  - Zaid Kraitem
    Y1  - 2025/05/22
    PY  - 2025
    N1  - https://doi.org/10.11648/j.ajae.20231001.13
    DO  - 10.11648/j.ajae.20231001.13
    T2  - American Journal of Neural Networks and Applications
    JF  - American Journal of Neural Networks and Applications
    JO  - American Journal of Neural Networks and Applications
    SP  - 24
    EP  - 30
    PB  - Science Publishing Group
    SN  - 2469-7419
    UR  - https://doi.org/10.11648/j.ajae.20231001.13
    AB  - Text-to-image synthesis using Generative Adversarial Networks (GANs) has become a pivotal area of research, offering significant potential in automated content generation and multimodal understanding. This study provides a comparative evaluation of six prominent GAN-based models—namely, the foundational work by Reed et al., StackGAN, AttnGAN, MirrorGAN, MimicGAN, and In-domain GAN Inversion—applied to a standardized dataset under consistent conditions. The analysis focused on four key performance dimensions: visual quality, semantic alignment between text and image, training stability, and robustness to noise in textual input. The results reveal a clear progression in model capability over time. While early models laid essential groundwork, they were limited in resolution and semantic coherence. Subsequent models introduced architectural innovations such as multi-stage generation, attention mechanisms, and semantic feedback loops, which significantly enhanced image fidelity and alignment with textual descriptions. Notably, AttnGAN and MirrorGAN achieved strong alignment performance due to their integration of attention and redescription modules, respectively. MimicGAN demonstrated superior robustness to noisy or ambiguous inputs, addressing a critical gap in earlier approaches. In contrast, In-domain GAN Inversion, though not a traditional text-to-image method, offered high image quality and valuable insights for latent-space manipulation. Overall, the comparative findings emphasize the trade-offs between model complexity and performance gains. Advances in attention, robustness, and semantic feedback have led to more reliable and realistic image synthesis. This study contributes a structured overview of current approaches and identifies pathways for future research aimed at balancing accuracy, interpretability, and generalizability in text-to-image systems.
    
    VL  - 11
    IS  - 1
    ER  - 


Author Information