Peer-Reviewed

Extracting Textual Information from Google Using Wrapper Class

Received: 22 April 2017    Accepted: 11 May 2017    Published: 5 July 2017
Abstract

Text documents on the web come in structured, unstructured, or semi-structured formats, and their volume grows rapidly every day. Users are provided with many tools for searching for relevant information: keyword searching and topic and subject browsing can help users find relevant information quickly, and index search mechanisms allow the user to retrieve a set of relevant documents. Occasionally these search mechanisms are not sufficient. With the rapid development of the Internet, the amount of data available on the web has steadily increased, which makes it difficult for humans to distinguish relevant information. A wrapper class is proposed to extract relevant text information, with a focus on finding useful facts from unstructured web documents using Google. Techniques from information retrieval (IR), information extraction (IE), and pattern recognition are explored.

Published in Advances in Networks (Volume 5, Issue 1)
DOI 10.11648/j.net.20170501.11
Page(s) 1-13
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2024. Published by Science Publishing Group

Keywords

Information Extraction, Retrieval, Semantic Web, Web Search Engine

Cite This Article
  • APA Style

    A. Muthusamy, A. Subramani. (2017). Extracting Textual Information from Google Using Wrapper Class. Advances in Networks, 5(1), 1-13. https://doi.org/10.11648/j.net.20170501.11

    ACS Style

    A. Muthusamy; A. Subramani. Extracting Textual Information from Google Using Wrapper Class. Adv. Netw. 2017, 5(1), 1-13. doi: 10.11648/j.net.20170501.11

    AMA Style

    A. Muthusamy, A. Subramani. Extracting Textual Information from Google Using Wrapper Class. Adv Netw. 2017;5(1):1-13. doi: 10.11648/j.net.20170501.11

  • @article{10.11648/j.net.20170501.11,
      author = {A. Muthusamy and A. Subramani},
      title = {Extracting Textual Information from Google Using Wrapper Class},
      journal = {Advances in Networks},
      volume = {5},
      number = {1},
      pages = {1-13},
      doi = {10.11648/j.net.20170501.11},
      url = {https://doi.org/10.11648/j.net.20170501.11},
      eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.net.20170501.11},
      abstract = {Text documents on the web come in structured, unstructured, or semi-structured formats, and their volume grows rapidly every day. Users are provided with many tools for searching for relevant information: keyword searching and topic and subject browsing can help users find relevant information quickly, and index search mechanisms allow the user to retrieve a set of relevant documents. Occasionally these search mechanisms are not sufficient. With the rapid development of the Internet, the amount of data available on the web has steadily increased, which makes it difficult for humans to distinguish relevant information. A wrapper class is proposed to extract relevant text information, with a focus on finding useful facts from unstructured web documents using Google. Techniques from information retrieval (IR), information extraction (IE), and pattern recognition are explored.},
      year = {2017}
    }
    

  • TY  - JOUR
    T1  - Extracting Textual Information from Google Using Wrapper Class
    AU  - A. Muthusamy
    AU  - A. Subramani
    Y1  - 2017/07/05
    PY  - 2017
    N1  - https://doi.org/10.11648/j.net.20170501.11
    DO  - 10.11648/j.net.20170501.11
    T2  - Advances in Networks
    JF  - Advances in Networks
    JO  - Advances in Networks
    SP  - 1
    EP  - 13
    PB  - Science Publishing Group
    SN  - 2326-9782
    UR  - https://doi.org/10.11648/j.net.20170501.11
    AB  - Text documents on the web come in structured, unstructured, or semi-structured formats, and their volume grows rapidly every day. Users are provided with many tools for searching for relevant information: keyword searching and topic and subject browsing can help users find relevant information quickly, and index search mechanisms allow the user to retrieve a set of relevant documents. Occasionally these search mechanisms are not sufficient. With the rapid development of the Internet, the amount of data available on the web has steadily increased, which makes it difficult for humans to distinguish relevant information. A wrapper class is proposed to extract relevant text information, with a focus on finding useful facts from unstructured web documents using Google. Techniques from information retrieval (IR), information extraction (IE), and pattern recognition are explored.
    VL  - 5
    IS  - 1
    ER  - 

Author Information
  • Department of Computer Technology, N. G. P Arts and Science College, Coimbatore, India

  • Department of Computer Science, Govt. Arts College, Dharmapuri, India
