Software Development for Identifying Persian Text Similarity
International Journal of Intelligent Information Systems
Volume 3, Issue 6-1, December 2014, Pages: 61-66
Received: Oct. 21, 2014; Accepted: Oct. 23, 2014; Published: Oct. 29, 2014
Authors
Elham Mahdipour, Computer Engineering Department, Khavaran Institute of Higher Education, Mashhad, Iran
Rahele Shojaeian Razavi, Computer Engineering Department, Khavaran Institute of Higher Education, Mashhad, Iran
Zahra Gheibi, Computer Engineering Department, Khavaran Institute of Higher Education, Mashhad, Iran
Abstract
The wide range of nouns, verbs, and other words in the Persian language, together with the availability of information in every field in papers, books, and on the internet, creates a need for a system that compares texts and evaluates their similarity. This paper presents a system for comparing Persian (Farsi) texts and determining their degree of similarity. The system uses the TF-IDF method to weight sentences. In addition, nouns are reduced to their roots, and synonyms and members of the same word family are given identical scores. Implementation results indicate that the proposed system achieves the desired efficiency when comparing short texts.
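The general idea of combining TF-IDF weighting with root and synonym normalization can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `CANONICAL` table, the transliterated sample tokens, the smoothed IDF formula, the use of cosine similarity, and the treatment of each short text as one document are all assumptions made for illustration. A real system would rely on a Persian stemmer and a synonym lexicon.

```python
import math
from collections import Counter

# Hypothetical lookup table (assumption): maps inflected forms and synonyms
# to one canonical token so that they receive identical scores.
# Tokens are transliterated Persian, chosen only for illustration.
CANONICAL = {
    "ketabha": "ketab",   # plural form -> root ("books" -> "book")
    "manzel": "khaneh",   # synonym -> canonical word (both mean "house")
}

def normalize(text):
    """Split on whitespace and map each token to its canonical form."""
    return [CANONICAL.get(tok, tok) for tok in text.lower().split()]

def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (dict) per input text."""
    tokenized = [normalize(t) for t in texts]
    n = len(tokenized)
    # Document frequency: number of texts that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            # Smoothed IDF (assumption) keeps every weight positive.
            term: (count / len(toks)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

texts = [
    "man ketab ra dar khaneh khandam",
    "man ketabha ra dar manzel khandam",   # plural and synonym variants
    "hava emruz sard ast",                 # unrelated sentence
]
vecs = tfidf_vectors(texts)
print(cosine(vecs[0], vecs[1]))  # close to 1.0: identical after normalization
print(cosine(vecs[0], vecs[2]))  # 0.0: no shared terms
```

Because normalization runs before weighting, a plural form or a synonym contributes to the same vector component as its canonical word, which is one simple way to give identical scores to synonyms and word families.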
Keywords
Text Similarity, TF-IDF, Semantic Similarity, Stemming
To cite this article
Elham Mahdipour, Rahele Shojaeian Razavi, Zahra Gheibi, Software Development for Identifying Persian Text Similarity, International Journal of Intelligent Information Systems. Special Issue: Research and Practices in Information Systems and Technologies in Developing Countries. Vol. 3, No. 6-1, 2014, pp. 61-66. doi: 10.11648/j.ijiis.s.2014030601.21