Text-to-image synthesis using Generative Adversarial Networks (GANs) has become a pivotal area of research, offering significant potential in automated content generation and multimodal understanding. This study provides a comparative evaluation of six prominent GAN-based models—namely, the foundational work by Reed et al., StackGAN, AttnGAN, MirrorGAN, MimicGAN, and In-domain GAN Inversion—applied to a standardized dataset under consistent conditions. The analysis focused on four key performance dimensions: visual quality, semantic alignment between text and image, training stability, and robustness to noise in textual input. The results reveal a clear progression in model capability over time. While early models laid essential groundwork, they were limited in resolution and semantic coherence. Subsequent models introduced architectural innovations such as multi-stage generation, attention mechanisms, and semantic feedback loops, which significantly enhanced image fidelity and alignment with textual descriptions. Notably, AttnGAN and MirrorGAN achieved strong alignment performance due to their integration of attention and redescription modules, respectively. MimicGAN demonstrated superior robustness to noisy or ambiguous inputs, addressing a critical gap in earlier approaches. In contrast, In-domain GAN Inversion, though not a traditional text-to-image method, offered high image quality and valuable insights for latent-space manipulation. Overall, the comparative findings emphasize the trade-offs between model complexity and performance gains. Advances in attention, robustness, and semantic feedback have led to more reliable and realistic image synthesis. This study contributes a structured overview of current approaches and identifies pathways for future research aimed at balancing accuracy, interpretability, and generalizability in text-to-image systems.
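All six systems compared above build on the conditional GAN framework of Goodfellow et al., adapted to text conditioning by Reed et al.: a generator maps a noise vector and a sentence embedding to an image, while a discriminator judges whether an image is real and matches its description. A minimal sketch of that text-conditional objective follows; the symbol φ for the text encoder is our notational assumption, and the mismatched-text, multi-stage, attention, and redescription terms added by the later models are omitted.

\[
\min_{G}\max_{D}\; V(D,G)=
\mathbb{E}_{(x,t)\sim p_{\mathrm{data}}}\!\left[\log D\big(x,\varphi(t)\big)\right]
+\mathbb{E}_{z\sim p_{z},\, t\sim p_{\mathrm{data}}}\!\left[\log\Big(1-D\big(G(z,\varphi(t)),\varphi(t)\big)\Big)\right]
\]

Here G is the generator, D the discriminator, z a noise vector, x a real image, and φ(t) the embedding of the text description t. In these terms, StackGAN applies this objective at several resolutions in sequence, AttnGAN adds word-level attention losses, and MirrorGAN augments it with a redescription (image-to-text) loss.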
Published in: American Journal of Neural Networks and Applications (Volume 11, Issue 1)
DOI: 10.11648/j.ajae.20231001.13
Page(s): 24-30
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium or format, provided the original work is properly cited.
Copyright: © The Author(s), 2025. Published by Science Publishing Group.
Keywords: Generative Adversarial Networks (GANs), Text-to-Image Synthesis, StackGAN, AttnGAN, MirrorGAN, MimicGAN, In-domain GAN Inversion
[1] K. Ganguly (2017). "Learning Generative Adversarial Networks: Next-Generation Deep Learning Simplified". Packt Publishing.
[2] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein (2017). "Unrolled Generative Adversarial Networks". In Proceedings of the International Conference on Learning Representations (ICLR), pp. 1-25.
[3] S. Reed et al. (2016). "Generative Adversarial Text to Image Synthesis". In Proceedings of the International Conference on Machine Learning (ICML), vol. 48. University of Michigan, Ann Arbor, MI, USA.
[4] H. Zhang et al. (2017). "StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks". In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5907-5915.
[5] Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., He, X. (2018). "AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1316-1324. Lehigh University.
[6] Qiao, T., Zhang, J., Xu, D., Lu, H. (2019). "MirrorGAN: Learning Text-to-Image Generation by Redescription". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1505-1514. Zhejiang University, China.
[7] Anirudh, R., Thiagarajan, J. J., Kailkhura, B., Bremer, T. (2020). "MimicGAN: Robust Projection onto Image Manifolds with Corruption Mimicking". International Journal of Computer Vision, 128, Springer.
[8] Zhu, J., Shen, Y., Zhao, D., Zhou, B. (2020). "In-domain GAN Inversion for Real Image Editing". In Proceedings of the European Conference on Computer Vision (ECCV). The Chinese University of Hong Kong.
[9] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y. (2014). "Generative Adversarial Nets". In Advances in Neural Information Processing Systems (NeurIPS).
[10] Miyato, T., Kataoka, T., Koyama, M., Yoshida, Y. (2018). "Spectral Normalization for Generative Adversarial Networks". In Proceedings of the International Conference on Learning Representations (ICLR).
[11] Zhang, H., Goodfellow, I., Metaxas, D., Odena, A. (2019). "Self-Attention Generative Adversarial Networks". In Proceedings of the International Conference on Machine Learning (ICML).
[12] Bau, D., Strobelt, H., Peebles, W., Wulff, J., Zhou, B., Zhu, J. Y., Torralba, A. (2019). "Semantic Photo Manipulation with a Generative Image Prior". ACM SIGGRAPH.
[13] Perarnau, G., Van De Weijer, J., Raducanu, B., Alvarez, J. M. (2016). "Invertible Conditional GANs for Image Editing". In NeurIPS Workshop.
[14] Brock, A., Donahue, J., Simonyan, K. (2019). "Large Scale GAN Training for High Fidelity Natural Image Synthesis". In Proceedings of the International Conference on Learning Representations (ICLR).
[15] Karras, T., Laine, S., Aila, T. (2019). "A Style-Based Generator Architecture for Generative Adversarial Networks". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[16] Shen, Y., Gu, J., Tang, X., Zhou, B. (2020). "Interpreting the Latent Space of GANs for Semantic Face Editing". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[17] A. Nguyen, J. Clune, Y. Bengio, A. Dosovitskiy, and J. Yosinski (2017). "Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[19] P. Salehi and A. Chalechale (2020). "Pix2Pix-based Stain-to-Stain Translation: A Solution for Robust Stain Normalization in Histopathology Images Analysis". arXiv preprint arXiv:2002.00647.
[20] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018). "High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8798-8807.
[21] A. Brock, J. Donahue, and K. Simonyan (2019). "Large Scale GAN Training for High Fidelity Natural Image Synthesis". In Proceedings of the International Conference on Learning Representations (ICLR).
[22] L. Tran, X. Yin, and X. Liu (2017). "Disentangled Representation Learning GAN for Pose-Invariant Face Recognition". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1415-1424.
APA Style
Kraitem, Z. (2025). A Comparative Study of Text-to-Image Synthesis Techniques Using Generative Adversarial Networks. American Journal of Neural Networks and Applications, 11(1), 24-30. https://doi.org/10.11648/j.ajae.20231001.13
ACS Style
Kraitem, Z. A Comparative Study of Text-to-Image Synthesis Techniques Using Generative Adversarial Networks. Am. J. Neural Netw. Appl. 2025, 11(1), 24-30. doi: 10.11648/j.ajae.20231001.13