A Comprehensive Comparison between Web Content Mining Tools: Usages, Capabilities and Limitations


Zahra Hojati 1, Rozita Jamili Oskouei 2*
1 Department of Electrical, Computer & IT, Zanjan Branch, Islamic Azad University, Zanjan, Iran
2 Department of Computer Science and Information Technology, Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan, Iran
zahojati@gmail.com 1, rozita2010r@gmail.com 2* (Corresponding Author)

Abstract: Today, with the significant growth in data volume and web development, techniques and methods for efficient access to data and for web mining are needed more than ever. Web mining is the process of discovering previously unknown, useful knowledge from web content and hyperlinks; it can also be used to examine the behaviour of users as they surf the web and access different web sites. Depending on the type of data that is explored, web mining methods are divided into three categories: web content mining, web structure mining, and web usage mining. This paper divides web content mining tools into two major categories, commercial and non-commercial tools, and examines these tools together with their usages, advantages, and disadvantages. Besides explaining the capabilities of each tool for researchers, including students and practitioners who use these tools for specific purposes but do not have enough time to test each one, the paper offers a practical way to identify and select the appropriate tool for a desired goal. The distinctive aspect of this research compared with former studies is that we practically examine the efficiency and behaviour of each tool, provide simple and clear instructions for using it, and show the degree to which it meets expectations.

Key words: Web mining, Web content mining, Commercial web content mining tools, Non-commercial web content mining tools.

1. Introduction
Analysing and discovering useful information on the World Wide Web has become a challenge for researchers in this field [2]. Search engines are important tools for obtaining the required information on the Internet, but unfortunately they have low precision; we therefore need web mining techniques to extract information more precisely. Web mining is the application of data mining techniques to discover and extract information automatically from web documents and web services. Searching the web for completely accurate information on a subject is difficult; web mining is a technique for obtaining accurate and exact information quickly, with a high degree of precision and reliability. In fact, web mining is a process of accessing general data on the web: it takes structured, unstructured, and semi-structured data from databases on the web and, based on data mining techniques, discovers and extracts hidden information. Based on the type of data that is explored, web mining is divided into three forms: web content mining, web structure mining, and web usage mining. Web content mining extracts useful information from the web, such as text, image, video, or audio files. Web structure mining uses graph theory to analyse nodes and the connections between them, which represent web pages and the links inside them. Finally, web usage mining determines users' access patterns when they use the web. Web mining, however, faces challenges that distinguish it from data mining over a structured database. Some of these challenges are the following:
- The information on the web has a very large volume, and it is not easy to access all of it.
- Information and data on the web come in various types, such as structured tables, text, multimedia data, images, and so on.
- Most of the information on the web, such as HTML, is semi-structured.
- Web sites contain noisy and redundant information.
- Information on websites is dynamic and constantly changing.
- Finally, in addition to holding data, information, and services, the web, like a virtual community, provides possibilities for interaction between people and organizations [4].

It seems that, given the increase in data volume and in users' requirements, more items could be added to the above list. It is therefore necessary to introduce and produce tools that overcome these challenges. The rest of this paper is organized as follows: Section 2 briefly discusses the concepts of web content mining and web content mining tools. Section 3 presents a classification of the available web content mining tools. Section 4 discusses several commercial web content mining tools and compares their strengths and drawbacks. Section 5 focuses on non-commercial web content mining tools, and finally Section 6 concludes the paper.

2. Basic Concepts
In this section we discuss the general concepts used in the paper.

2.1. Web Mining
Data mining is the extraction of information and knowledge from large volumes of collected data. The obtained information or knowledge should have a set of attributes, including being non-duplicate, understandable, and interesting. Data mining involves steps such as pre-processing, data transformation, mining, pattern recognition, and knowledge discovery [5]. Web mining is the application of data mining techniques to extract information and knowledge from the data and content available on the web, and it has three types:
- Web content mining: searching the contents of web pages, including text, figures, tables, etc.
- Web structure mining: searching the hyperlinks, connections, and links between web pages.
- Web usage mining: searching users' internet usage patterns by analysing access log files.
In this paper we discuss web content mining tools.

2.1.1. Web Content Mining
Web content mining extracts useful information from the content, documents, and data on the web. In general, web content data come in three forms: structured data (such as data in tables), unstructured data (such as free text), and semi-structured data (such as HTML documents). According to the literature, two methods have been proposed for mining web content: mining unstructured data, and mining semi-structured and structured data. As most of the information on web pages is unstructured, techniques such as knowledge discovery and text mining are required to search this type of data for user preferences; machine learning techniques can be used to process unstructured data. In short, HTML websites are classified as having unstructured content, whereas websites designed or coded with XML are structured, and searching them is easier and faster. Examples of structured data are data in lists, tables, and trees. Web content mining has four stages (a short code sketch of these stages is given at the end of Section 3):
- Collect: fetch content from the web.
- Parse: extract usable data from formatted data (such as HTML or PDF).
- Analyse: mark, evaluate, classify, filter, and sort the extracted data.
- Produce: convert the analysis results into useful information (such as reports or a search index).

3. Web Content Mining Tools
Web content mining tools extract the necessary and required information from the data available on the web; this is usually done by downloading data in different formats. Because of the variety of data types on the web, this process can be very time consuming and tedious, and tools are used to make the extraction of the desired information easier. These tools can be broadly divided into commercial and non-commercial. An overview of the available tools suggests that most web content mining tools are able to fulfil the same needs. Our studies show that, as of April 2014, nearly 238 different tools with similar and, in some cases, different functionality had been created [8]. Most of these tools, however, have features that stand out in comparison with the others.
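To make these ideas concrete before turning to the individual tools, the following minimal Python sketch walks through the four stages listed in Section 2.1.1. It relies on the third-party requests and beautifulsoup4 packages, and the target URL and keyword are placeholders; these are illustrative assumptions, and none of the tools surveyed below is implemented this way.

# A minimal sketch of the four web content mining stages (Collect, Parse, Analyse, Produce).
# The target URL and the keyword are placeholders chosen for illustration only.
import csv
import requests
from bs4 import BeautifulSoup

def collect(url):
    # Stage 1 - Collect: fetch the raw page content from the web.
    return requests.get(url, timeout=10).text

def parse(html):
    # Stage 2 - Parse: turn formatted HTML into usable text fragments.
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(strip=True) for p in soup.find_all("p")]

def analyse(paragraphs, keyword):
    # Stage 3 - Analyse: mark/filter the fragments, here with a simple keyword test.
    return [p for p in paragraphs if keyword.lower() in p.lower()]

def produce(rows, path="results.csv"):
    # Stage 4 - Produce: convert the analysis results into a report (a CSV file).
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow([row])

if __name__ == "__main__":
    html = collect("https://example.com")                  # hypothetical target page
    produce(analyse(parse(html), keyword="example"))

The tools discussed in the next two sections automate exactly this pipeline behind graphical interfaces, adding crawling rules, scheduling, filters, and export formats.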

In the remainder of this paper we evaluate and compare the features and advantages of the most widely used tools.

4. Commercial Web Content Mining Tools
4.1. Web Content Extractor
Web Content Extractor is software for data mining and data extraction. It can collect data from online stores, commercial and business sites, buy-and-sell sites, search engine results, and so on, and it allows the user to export the mined data as Excel (.csv), text (ASCII), and HTML files, as well as to Access and MySQL databases. Its web crawling and spidering capabilities, multi-threaded downloading, the ability to collect data from password-protected sites, and its ease of use and fast, accurate learning curve [2] are some of the qualities of this tool. For example, on price-listing and stock exchange websites it can be very useful. Part of the worksheet of this tool is shown in Figure 1.

Figure 1. Snapshot of Web Content Extractor

4.2. Easy Web Extract
Easy Web Extract is a powerful and easy-to-use tool for data mining, data extraction, and web mining. One of its important features is extracting similar data: web data that follow an extraction pattern can be extracted automatically. For example, this tool allows users to create a single project for websites with a similar structure, such as online stores, buy-and-sell sites, and search engines.

The tool has a customizable web crawler, and crawling and multi-threaded download rules can be defined. Another important feature is access to the information of password-protected sites. It can also save the extracted data in various formats such as Excel (CSV), Text (TXT), HTML, XML, Microsoft Access, SQL, and MySQL. Part of the worksheet of this tool is shown in Figure 2.

Figure 2. Snapshot of Easy Web Extract

4.3. Web Information Extractor
Web Information Extractor is a powerful tool for data mining and extracting web content. It extracts structured and unstructured data from web pages, reshapes the data into a local file or stores them in a database, and can send the results to a web server. An important feature of this tool is updating when new content is created: the tool monitors a web page constantly and, when new content is added to the page, the task assigned to the tool is updated according to the change. The extracted data can be saved in CSV and Text (TXT) formats. Depending on whether the data are text, links, images, source code, or a list of table data, the project steps and option selections differ slightly, but in general the steps are the same. For example, to select image data, the image is first selected and a red box is drawn around it; the selected data are then displayed as the current object. Next, the Create Item button is pressed and, in the newly opened window, the item's name, type, and attribute must be entered.

Finally, through an operation specified by the user, the data selected in the initial stages are extracted and stored in the result list. The shortcoming of this tool is in loading the website, which is time consuming. Part of the worksheet of this tool is shown in Figure 3.

Figure 3. Snapshot of Web Information Extractor

4.4. Web Data Extractor
At first glance, Web Data Extractor appears to be an easy-to-use and relatively comprehensive tool. It is able to extract different kinds of data such as meta tags, URLs, e-mail addresses, phone and fax numbers, and so on. The default format for storing the extracted data is .csv. Part of the worksheet of this tool is shown in Figure 4.

Figure 4. Snapshot of Web Data Extractor

One of the important features of this tool is that its settings can be changed according to user preferences; the user can also select the type of search engine. Extraction can be adjusted through the filters that the tool provides for URLs, text, and data, and through its parsers. For example, the URL filter contains a list of accredited domains for website addresses, and entries can be added to or removed from this list. With the text and data filters, a web page is extracted only if the desired text or data actually occur in it. In the parser, prefix phrases are defined for telephone, fax, and e-mail, i.e. the terms that websites usually place before phone numbers, fax numbers, and e-mail addresses. This tool also differs from the others in how the extraction operation is started: first, all of the user's desired settings are selected, and the data are then displayed according to this configuration. In the other tools, the site is loaded first and the user then creates his or her patterns and rules depending on its contents. Although the settings and options offered at the beginning of the work are general, they can be useful for a user who is already familiar with extraction in different software packages: the user faces the various options directly at the start of the project, which opens his or her view of the upcoming extraction procedure.

Figure 5. Creating a Filter in Web Data Extractor
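The filtering and parsing behaviour described above can be pictured with a short, purely conceptual Python sketch. This is not the tool's implementation; the domain list, required text, and prefix phrases are illustrative assumptions chosen for the example.

import re

# Illustrative stand-ins for the tool's settings; these values are assumptions, not defaults.
ALLOWED_DOMAINS = ["example.com", "example.org"]     # URL filter: list of accredited domains
REQUIRED_TEXT = "price"                              # text/data filter: page must contain this
PHONE_PREFIXES = ("Tel:", "Phone:", "Call us:")      # parser: prefix phrases before phone numbers

def url_passes(url):
    # Keep a URL only if it belongs to one of the accredited domains.
    return any(domain in url for domain in ALLOWED_DOMAINS)

def page_passes(text):
    # Extract a page only if the required text actually occurs in it.
    return REQUIRED_TEXT.lower() in text.lower()

def parse_phones(text):
    # Pull out the number that follows any of the known phone prefixes.
    prefixes = "|".join(re.escape(p) for p in PHONE_PREFIXES)
    pattern = rf"(?:{prefixes})\s*([+\d][\d\s\-]{{5,}})"
    return [m.strip() for m in re.findall(pattern, text)]

print(url_passes("https://shop.example.com/item/1"))               # True
print(parse_phones("Contact us. Tel: +98 241 123 4567 for help"))  # ['+98 241 123 4567']

The point of the sketch is only to show why such filters reduce manual work: the URL filter prunes the crawl space, the text filter skips irrelevant pages, and the prefix parser turns free text into structured fields.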

4.5. Mozenda
Mozenda is a popular, highly ranked tool that allows both professional and non-professional users to extract data from web pages easily. Thanks to its selection facilities on web content and its emphasis on cloud computing, extraction, storage, and data management are performed in a centralized and comprehensive way. One of its features is automatic data extraction without leaving any trace: Mozenda uses anonymous proxies that produce a rotating IP address and prevent identification of the user. Like the Automation Anywhere tool, Mozenda lets the user schedule data extraction; it is enough, in the first step, to present the desired data to the tool and to determine the number of repetitions (in hours or days), so that the collected data stay up to date. The parts of the tool related to data mining form a complete package of applications useful for marketers: with its features one can handle forecasting, gather information for budgeting, and analyse competitors and their pricing. Using its text filter, Mozenda can filter a user's text intelligently and extract certain parts of it; as previously mentioned, this feature is similar to the corresponding feature of the Web Data Extractor tool [10]. Mozenda has two sections:
- Web Console: an application that allows the user to run the agents and to organize and publish the results of the data extracted from the web.
- Agent Builder: a Windows application used to build the projects associated with data extraction [4].

Figure 6. Parts of Mozenda [4]

4.6. Screen-Scraper
Historically, screen scraping made it possible for a computer to repeatedly receive character-based data from a central processor and render it in a form recognizable by a graphical interface. New versions of Screen-Scraper obtain data from HTML, so they can access information through a browser. The tool provides a graphical interface that allows you to specify the addresses of the data elements to be extracted, and it also makes it possible to work with the extracted data. An advantage of this software is that all of a website's products can be downloaded into a spreadsheet [2]. Screen-Scraper collects data from web pages easily: when information from a particular site is needed, most of the website's files can be downloaded at once for access and comparison. Its graphical interface and advanced extraction tools make it easy to use. Some of the key features of Screen-Scraper are:
- copying the text of web pages automatically;
- opening links automatically;
- filling in and submitting forms on websites automatically;
- downloading files such as PDF, Word, and image files automatically from websites;
- integrating with, or being driven by, most programming languages (Java, PHP, .NET, ASP), so it can be run on a server.

4.7. Automation Anywhere
Working with this software is very easy, and the user can extract the required data without any difficulty. Because of the intelligent and automated technology in this tool, complicated and time-consuming operations are performed quickly and do not require programming. One of its features is the ability to repeat an action over hours, minutes, or seconds [12]; the rate of the required action can also be specified. Another important feature is a section called the Scheduler, in which the required action can be scheduled to execute at a specified time, for example once a day, on certain days of the week, or once a month. In addition to recording actions performed in Windows, this tool is able to record actions on the web. This part, called the Web Recorder, records all activities on the web, from opening a website to selecting a button or link. Using this capability of Automation Anywhere, two types of data, multiple data and tables, can be extracted.

The output of this tool can be produced in XML, TXT, Excel, and MySQL formats. In general, the Web Recorder of this tool can be used for the following tasks:
- entering a website;
- filling out and submitting forms;
- updating database records;
- navigating through a required search;
- using a web-based ERP;
- extracting data from the web;
- testing an online program.
Part of the worksheet of this tool is shown in Figure 7.

Figure 7. Part of the workspace of the Automation Anywhere tool

5. Non-Commercial Web Content Mining Tools
5.1. Import.io
Although Import.io is a non-commercial web mining tool, its graphical environment is better than that of many commercial tools. On the tool's home page there is a feature that lets the user choose how data will be extracted, through one of the Connector, Extractor, and Crawler options. Although the overall purpose of these three options is the same, they have some differences.

If the data extracted from a page follow a pattern, in the form of a table, the Extractor option identifies this relation and, without any intervention from the user, extracts the other data in the page that match the pattern and places them in the table. With the Crawler, once a pattern has been designed to extract one web page, data can also be extracted from other, similar web pages. With the Connector option, the user can access structured data obtained from search results: after the search, Import.io stores the data in a structured table; the desired data can be extracted from these tables and, if necessary, the stored data can later be used to explore the website again. The tool extracts structured data, and the extracted data are stored on virtual servers. Each time data are placed on the tool's platform, an API is created for accessing the classified data, so online access to the integrated data is easily possible. Part of the worksheet of this tool is shown in Figure 8.

Figure 8. Import.io
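The pattern-based table extraction described for the Extractor option is, conceptually, the same task that a general-purpose library such as pandas performs with read_html. The snippet below is only a generic illustration of that idea and is not part of Import.io; the target URL is hypothetical, and the pandas and lxml dependencies are assumptions.

# Generic illustration of pattern-based table extraction; this is not Import.io itself.
# Requires the third-party pandas and lxml packages, and the URL below is hypothetical.
import pandas as pd

# read_html scans the page for <table> elements and returns one DataFrame per table,
# so the repeated row pattern is recovered without the user marking every row by hand.
tables = pd.read_html("https://example.com/products.html")

for i, table in enumerate(tables):
    print(f"table {i}: {table.shape[0]} rows x {table.shape[1]} columns")
    table.to_csv(f"table_{i}.csv", index=False)      # store each table as structured CSV output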

5.2. Webextractor360
Webextractor360 is a completely free tool with a very simple environment. After the address of the desired site is entered, parameters such as images, words, tables, internet addresses, e-mail addresses, phone and fax numbers, etc. are displayed in the tool, which can then extract this varied information from the website. The tool includes extraction options that in some cases can be compared with commercial tools such as Web Data Extractor. For example, in Webextractor360 certain hyperlinks defined for the search can be ignored, and including page addresses in the search results is optional. Although Web Data Extractor has more options and flexibility, these two capabilities of Webextractor360 are comparable to the filtering feature of Web Data Extractor. Part of the worksheet of this tool is shown in Figure 9.

Figure 9. Webextractor360

5.3. Irobotsoft
Irobotsoft is a portable tool, so it does not need installation; with it, an intelligent robot performs all activities related to a website, such as filling out membership forms, selecting links, and connecting to databases. Programming is not required, and anyone without programming knowledge can learn to use this tool, but with a little programming skill one can build a stronger iRobot. Irobotsoft runs on various operating systems such as Windows XP, 7, Vista, and NT, and it needs the IE browser to run. One of its features is automatic extraction of multiple data items from various websites. The extracted data can be stored in CSV and XML formats. Part of the worksheet of this tool is shown in Figure 10.

Figure 10. Irobotsoft

5.4. Scrapy
Scrapy is a free web data extraction framework that extracts structured data from web pages. It is suitable for various purposes such as automatic testing of web pages, monitoring, and data mining. Scrapy is written in Python and runs portably on Linux, Windows, Mac, and BSD operating systems. Before running it, Python must be installed and some changes to the operating system may be required, which differ according to the type of operating system. Overall, this tool is a little more difficult to use than the other tools (a minimal spider sketch is given at the end of Section 5).

5.5. Context Miner
Context Miner mines web content online and for free. The output formats of the extracted data are XML and CSV. The difference between this tool and the others is that it mines only certain sites; in other words, it mines the web only from sites such as YouTube, Flickr, and Twitter.
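As a concrete illustration of Scrapy (Section 5.4), a minimal spider might look like the following sketch. The start URL and the CSS selectors are illustrative assumptions made for the example, not part of the Scrapy distribution.

# A minimal Scrapy spider; run with: scrapy runspider example_spider.py -o items.csv
# The start URL and the CSS selectors are illustrative assumptions.
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com/catalogue/"]   # hypothetical listing page

    def parse(self, response):
        # Yield one structured item per product block found on the page.
        for product in response.css("div.product"):
            yield {
                "title": product.css("h2::text").get(),
                "price": product.css("span.price::text").get(),
            }
        # Follow the 'next page' link, if present, and parse it in the same way.
        next_page = response.css("a.next::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Unlike the graphical tools above, the extraction pattern here is written as code, which is why Scrapy is somewhat harder to use but also more flexible about what it extracts and how the results are exported (CSV, XML, JSON).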

6. Conclusions and Discussion
In this paper we first introduced the concept of web mining and the need for it, given the growth in the volume and variety of information on the web, dynamic web pages, and so on. To access their desired information, users need techniques for extracting it, and web content mining is one of the methods that can be used for this purpose. Several tools have been designed for web content mining, and they can be divided into two categories: commercial and non-commercial. The main objective of this paper has been to introduce and compare the tools of these two groups. Table 1 compares the tools in terms of the required operating system, whether they are commercial or non-commercial, the type of data they extract, and their data output formats.

Table 1. Comparing general characteristics of commercial and non-commercial tools
(Columns: Tool name | OS | Commercial | What to extract | Data export format)
Automation Anywhere | Windows | yes | unstructured data or tabular data | XML, TXT, Excel, MySQL
Easy Web Extract | Windows | yes | text, URL, image, HTML | Microsoft Excel, Access, TXT, SQL, HTML, XML, MySQL, ODBC
Web Information Extractor | Windows 2000/XP/2003 | yes | structured or unstructured data | Excel (CSV), Text (TXT)
Mozenda | Windows | yes | image, text, price, date, address, phone, and fax | CSV, XML, TSV
Screen-Scraper | Windows | yes | image, text | —
Web Data Extractor | Windows 95/98/NT/2000/Me/XP | yes | meta tags, URLs, text, phone, e-mail address, fax | Microsoft Excel
Web Content Extractor | Windows | yes | text, image, multimedia | Excel, text, HTML, MS Access DB, SQL script file, MySQL script file, XML file, HTTP submit form, ODBC data source
Import.io | Windows, OS X, Linux | no | text, numbers, locations, URLs, images | CSV, HTML, or XLS
Webextractor360 | Windows 2000/XP/2003/Vista | no | images, phrases, HTML headers, HTML tables, URLs (links), URLs (keywords), e-mails, phone, fax | —
Context Miner | online | no | text | CSV, XML
Irobotsoft | Windows XP/Seven/Vista/NT | no | structured or unstructured data | CSV, XML
Scrapy | Windows, Linux, Mac, BSD | no | structured data | CSV, XML, JSON (JavaScript Object Notation)

References:
[1] S. Balan, P. Ponmuthuramalingam, (2013), "A Study of Various Techniques of Web Content Mining Research Issues and Tools", International Journal of Innovative Research and Studies (IJIRS), Vol. 2, Issue 5, pp. 507-517.
[2] M. Karpagam, R. Sasikala, (2013), "Analysis of Web Content Mining Tools", International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 3, Issue 12, pp. 124-130.
[3] Aishwarya Rastogi, Smita Gupta, Srishti Agarwal, Nimisha Agarwal, (2012), "Web Mining: A Comparative Study", International Journal of Computational Engineering Research, pp. 325-331.
[4] Abdelhakim Herrouz, Chabane Khentout, Mahieddine Djoudi, (2013), "Overview of Web Content Mining Tools", The International Journal of Engineering and Science (IJES), Vol. 2, Issue 6, pp. 1-6.
[5] Jiawei Han and Micheline Kamber, (2006), Data Mining: Concepts and Techniques, book, ISBN-13: 978-1-55860-901-3.
[6] Liu, B., (2007), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, book published by Springer-Verlag, Berlin, Heidelberg, ISBN: 978-3-642-19459-7 (Print), 978-3-642-19460-3 (Online).
[7] Bharanipriya, V., Prasad, V. K., (2011), "Web Content Mining Tools: A Comparative Study", International Journal of Information Technology and Knowledge Management, Vol. 4, No. 1, pp. 211-215.
[8] Marcus P. Zillman, M.S., A.M.H.A., (2014), "Web Data Extractors: A White Paper Link Compilation", http://virtualprivatelibrary.blogspot.com/web%20data%20extractors.pdf [Accessed: April 29, 2014].
[9] Cooley, R., Srivastava, J., Mobasher, B., (1997), "Web Mining: Information and Pattern Discovery on the World Wide Web", Proc. of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97).
[10] http://www.garethjames.net/a-guide-to-web-scrapping-tools/ [Accessed: April 29, 2014].
[11] Singh, T. P., Seetha, D. A., Pandey, K. K., (2012), "HIT: Web Content Mining Tool", IJECCE, Vol. 3, No. 6, pp. 1388-1394.
[12] Automation Anywhere Manual, http://www.automationanywhere.com [Accessed: July 2014].
[13] Easy Web Extract Review, http://scraping.pro/easy-web-extract-review/ [Accessed: July 2014].
[14] Kshitija Pol, Nita Patil, Shreya Patankar, Chhaya Das, (2008), "A Survey on Web Content Mining and Extraction of Structured and Semi-structured Data", First International Conference on Emerging Trends in Engineering and Technology.
[15] Faustina Johnson, Santosh Kumar Gupta, (2012), "Web Content Mining Techniques: A Survey", International Journal of Computer Applications (0975-8887), Vol. 47, No. 11, pp. 44-50.