Text-to-image synthesis using Generative Adversarial Networks (GANs) has become a pivotal area of research, offering significant potential in automated content generation and multimodal understanding. This study provides a comparative evaluation of six prominent GAN-based models—namely, the foundational work by Reed et al., StackGAN, AttnGAN, MirrorGAN, MimicGAN, and In-domain GAN Inversion—applied to a standardized dataset under consistent conditions. The analysis focused on four key performance dimensions: visual quality, semantic alignment between text and image, training stability, and robustness to noise in textual input. The results reveal a clear progression in model capability over time. While early models laid essential groundwork, they were limited in resolution and semantic coherence. Subsequent models introduced architectural innovations such as multi-stage generation, attention mechanisms, and semantic feedback loops, which significantly enhanced image fidelity and alignment with textual descriptions. Notably, AttnGAN and MirrorGAN achieved strong alignment performance due to their integration of attention and redescription modules, respectively. MimicGAN demonstrated superior robustness to noisy or ambiguous inputs, addressing a critical gap in earlier approaches. In contrast, In-domain GAN Inversion, though not a traditional text-to-image method, offered high image quality and valuable insights for latent-space manipulation. Overall, the comparative findings emphasize the trade-offs between model complexity and performance gains. Advances in attention, robustness, and semantic feedback have led to more reliable and realistic image synthesis. This study contributes a structured overview of current approaches and identifies pathways for future research aimed at balancing accuracy, interpretability, and generalizability in text-to-image systems.
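All six systems compared above build on the conditional GAN framework of Goodfellow et al., adapted to text conditioning by Reed et al.: a generator maps a noise vector and a sentence embedding to an image, while a discriminator judges whether an image is real and matches its description. A minimal sketch of that text-conditional objective follows; the symbol φ for the text encoder is our notational assumption, and the mismatched-text, multi-stage, attention, and redescription terms added by the later models are omitted.

\[
\min_{G}\max_{D}\; V(D,G)=
\mathbb{E}_{(x,t)\sim p_{\mathrm{data}}}\!\left[\log D\big(x,\varphi(t)\big)\right]
+\mathbb{E}_{z\sim p_{z},\, t\sim p_{\mathrm{data}}}\!\left[\log\Big(1-D\big(G(z,\varphi(t)),\varphi(t)\big)\Big)\right]
\]

Here G is the generator, D the discriminator, z a noise vector, x a real image, and φ(t) the embedding of the text description t. In these terms, StackGAN applies this objective at several resolutions in sequence, AttnGAN adds word-level attention losses, and MirrorGAN augments it with a redescription (image-to-text) loss.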
Published in: American Journal of Neural Networks and Applications (Volume 11, Issue 1)
DOI: 10.11648/j.ajae.20231001.13
Page(s): 24-30
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium or format, provided the original work is properly cited.
Copyright: © The Author(s), 2025. Published by Science Publishing Group.
Keywords: Generative Adversarial Networks (GANs), Text-to-Image Synthesis, StackGAN, AttnGAN, MirrorGAN, MimicGAN, In-domain GAN Inversion
[1] K. Ganguly (2017). "Learning Generative Adversarial Networks: Next-Generation Deep Learning Simplified". Packt Publishing.
[2] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein (2017). "Unrolled Generative Adversarial Networks". In Proceedings of the International Conference on Learning Representations (ICLR), pp. 1-25.
[3] S. Reed et al. (2016). "Generative Adversarial Text to Image Synthesis". In Proceedings of the International Conference on Machine Learning (ICML), vol. 48. University of Michigan, Ann Arbor, MI, USA.
[4] H. Zhang et al. (2017). "StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks". In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5907-5915.
[5] Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., He, X. (2018). "AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1316-1324. Lehigh University.
[6] Qiao, T., Zhang, J., Xu, D., Lu, H. (2019). "MirrorGAN: Learning Text-to-Image Generation by Redescription". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1505-1514. Zhejiang University, China.
[7] Anirudh, R., Thiagarajan, J. J., Kailkhura, B., Bremer, T. (2020). "MimicGAN: Robust Projection onto Image Manifolds with Corruption Mimicking". International Journal of Computer Vision, 128, Springer.
[8] Zhu, J., Shen, Y., Zhao, D., Zhou, B. (2020). "In-domain GAN Inversion for Real Image Editing". In Proceedings of the European Conference on Computer Vision (ECCV). The Chinese University of Hong Kong.
[9] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y. (2014). "Generative Adversarial Nets". In Advances in Neural Information Processing Systems (NeurIPS).
[10] Miyato, T., Kataoka, T., Koyama, M., Yoshida, Y. (2018). "Spectral Normalization for Generative Adversarial Networks". In Proceedings of the International Conference on Learning Representations (ICLR).
[11] Zhang, H., Goodfellow, I., Metaxas, D., Odena, A. (2019). "Self-Attention Generative Adversarial Networks". In Proceedings of the International Conference on Machine Learning (ICML).
[12] Bau, D., Strobelt, H., Peebles, W., Wulff, J., Zhou, B., Zhu, J. Y., Torralba, A. (2019). "Semantic Photo Manipulation with a Generative Image Prior". ACM SIGGRAPH.
[13] Perarnau, G., Van De Weijer, J., Raducanu, B., Alvarez, J. M. (2016). "Invertible Conditional GANs for Image Editing". In NeurIPS Workshop.
[14] Brock, A., Donahue, J., Simonyan, K. (2019). "Large Scale GAN Training for High Fidelity Natural Image Synthesis". In Proceedings of the International Conference on Learning Representations (ICLR).
[15] Karras, T., Laine, S., Aila, T. (2019). "A Style-Based Generator Architecture for Generative Adversarial Networks". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[16] Shen, Y., Gu, J., Tang, X., Zhou, B. (2020). "Interpreting the Latent Space of GANs for Semantic Face Editing". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[17] A. Nguyen, J. Clune, Y. Bengio, A. Dosovitskiy, and J. Yosinski (2017). "Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[19] P. Salehi and A. Chalechale (2020). "Pix2Pix-based Stain-to-Stain Translation: A Solution for Robust Stain Normalization in Histopathology Images Analysis". arXiv preprint arXiv:2002.00647.
[20] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018). "High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8798-8807.
[21] A. Brock, J. Donahue, and K. Simonyan (2019). "Large Scale GAN Training for High Fidelity Natural Image Synthesis". In Proceedings of the International Conference on Learning Representations (ICLR).
[22] L. Tran, X. Yin, and X. Liu (2017). "Disentangled Representation Learning GAN for Pose-Invariant Face Recognition". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1415-1424.
APA Style
Kraitem, Z. (2025). A Comparative Study of Text-to-Image Synthesis Techniques Using Generative Adversarial Networks. American Journal of Neural Networks and Applications, 11(1), 24-30. https://doi.org/10.11648/j.ajae.20231001.13
ACS Style
Kraitem, Z. A Comparative Study of Text-to-Image Synthesis Techniques Using Generative Adversarial Networks. Am. J. Neural Netw. Appl. 2025, 11(1), 24-30. doi: 10.11648/j.ajae.20231001.13