Presenting an Optimal Method for Constructing an English-Persian Comparable Corpus

Seyede Roya Mohammadi; Noushin Riahi

doi:10.11648/j.ijiis.20160503.12

Published in International Journal of Intelligent Information Systems (Volume 5, Issue 3)

Received: 23 March 2016 Accepted: 7 June 2016 Published: 18 June 2016

Abstract

Multilingual corpora are a main resource in the fields of language information retrieval. The quality of much research, such as work on machine translation, depends strongly on the quality of these corpora. One type of multilingual corpus is the comparable corpus. Comparable corpora contain a broad range of information, but constructing them poses particular problems, which result in only a small number of aligned pairs in the comparable corpus despite the large size of the underlying dataset. In this paper we present a new method for increasing the quality and quantity of a comparable corpus. We built a Persian-English comparable corpus from two independent news collections: BBC news in English and Hamshahri news in Persian.

Published in	International Journal of Intelligent Information Systems (Volume 5, Issue 3)
DOI	10.11648/j.ijiis.20160503.12
Page(s)	42-47
Creative Commons	This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.
Copyright	Copyright © The Author(s), 2024. Published by Science Publishing Group

Keywords

Comparable Corpus, Corpus Quality, Hamshahri Corpus, Query, RATF Factor

Cite This Article

APA Style

Seyede Roya Mohammadi, Noushin Riahi. (2016). Presenting an Optimal Method for Constructing an English-Persian Comparable Corpus. International Journal of Intelligent Information Systems, 5(3), 42-47. https://doi.org/10.11648/j.ijiis.20160503.12

ACS Style

Seyede Roya Mohammadi; Noushin Riahi. Presenting an Optimal Method for Constructing an English-Persian Comparable Corpus. Int. J. Intell. Inf. Syst. 2016, 5(3), 42-47. doi: 10.11648/j.ijiis.20160503.12

AMA Style

Seyede Roya Mohammadi, Noushin Riahi. Presenting an Optimal Method for Constructing an English-Persian Comparable Corpus. Int J Intell Inf Syst. 2016;5(3):42-47. doi: 10.11648/j.ijiis.20160503.12

@article{10.11648/j.ijiis.20160503.12,
  author = {Seyede Roya Mohammadi and Noushin Riahi},
  title = {Presenting an Optimal Method for Constructing an English-Persian Comparable Corpus},
  journal = {International Journal of Intelligent Information Systems},
  volume = {5},
  number = {3},
  pages = {42-47},
  doi = {10.11648/j.ijiis.20160503.12},
  url = {https://doi.org/10.11648/j.ijiis.20160503.12},
  eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ijiis.20160503.12},
  abstract = {Multilingual corpora are the main sources in language information retrieval fields. The quality of many researches such as machine translation strongly depends on the quality of these corpora. One of these corpora's is comparable corpus. Considering their quality, these corpora contain broad range of information but constructing them has its special problems which lead to a few numbers of pairs in comparable corpus unlike its large dataset. In this paper we present a new method for increasing the quality and quantity of comparable corpus. We built a Persian-English comparable corpus from two independent news collections: BBC news in English and Hamshahri news in Persian.},
  year = {2016}
}

TY  - JOUR
T1  - Presenting an Optimal Method for Constructing an English-Persian Comparable Corpus
AU  - Seyede Roya Mohammadi
AU  - Noushin Riahi
Y1  - 2016/06/18
PY  - 2016
N1  - https://doi.org/10.11648/j.ijiis.20160503.12
DO  - 10.11648/j.ijiis.20160503.12
T2  - International Journal of Intelligent Information Systems
JF  - International Journal of Intelligent Information Systems
JO  - International Journal of Intelligent Information Systems
SP  - 42
EP  - 47
PB  - Science Publishing Group
SN  - 2328-7683
UR  - https://doi.org/10.11648/j.ijiis.20160503.12
AB  - Multilingual corpora are the main sources in language information retrieval fields. The quality of many researches such as machine translation strongly depends on the quality of these corpora. One of these corpora's is comparable corpus. Considering their quality, these corpora contain broad range of information but constructing them has its special problems which lead to a few numbers of pairs in comparable corpus unlike its large dataset. In this paper we present a new method for increasing the quality and quantity of comparable corpus. We built a Persian-English comparable corpus from two independent news collections: BBC news in English and Hamshahri news in Persian.
VL  - 5
IS  - 3
ER  -

Author Information

Seyede Roya Mohammadi

Computer Engineering Department, Alzahra University, Tehran, Iran

Noushin Riahi

Computer Engineering Department, Alzahra University, Tehran, Iran
