An Approach To Web Content Mining


Nita Patil, Chhaya Das, Shreya Patanakar, Kshitija Pol
Department of Computer Engg., Datta Meghe College of Engineering, Airoli, Navi Mumbai

Abstract - With research in information retrieval and the phenomenal growth of the web, today's websites have become a key communication and information medium for various organizations. The web also offers unprecedented opportunities and challenges for data mining. Various techniques are available to extract useful data from the web. It is very important for users to utilize this information effectively, which helps them understand the structure of information on the web more deeply and precisely. This paper surveys how Web content mining serves as an efficient tool for extracting structured and semi-structured data and mining it into useful knowledge.

Key Words - Web content mining, semi-structured data, structured data.

I. INTRODUCTION
The web is a medium for accessing a great variety of information stored in different parts of the world. The information is mostly in the form of unstructured data. As the data on the web grows at explosive rates, it has led to several problems, such as increased difficulty in finding relevant information, extracting potentially useful knowledge and learning about consumers or individual users. Efforts are being made to make such data available, usually in some structured form such as tables, for querying and further manipulation. Web mining is an emerging research area focused on resolving these problems. Its main techniques are Web content mining, Web usage mining and Web structure mining. Web content mining extracts information from web page content. Web content mining approaches fall into two groups: those that directly mine the content of documents and those that improve on the content search of other tools such as search engines. For Web content mining the data can be image, audio, text and video.
Any mining method focuses on information extraction and integration. Web content mining extracts information from different web sites for access and knowledge discovery. It is a challenging job for the following reasons:
1. All types of data are available.
2. Due to the nested structure of HTML code, web information is semi-structured. And it is needed that web
3. Information on the web is constantly increasing and changing. It is important for many applications to keep up with the changes and to monitor them for a particular type of information. The web is therefore dynamic.
4. Since the web is so wide, it is possible to extract information of any kind.
5. The same piece of information may appear in many pages or sites.

The following are the problems in web content mining.

Data/information extraction: Our focus is on the extraction of structured data from Web pages, such as products and search results. Extracting such data allows one to provide services. Two main types of techniques, machine learning and automatic extraction, are covered.

Web information integration and schema matching: Although the Web contains a huge amount of data, each web site (or even each page) represents similar information differently. How to identify or match semantically similar data is a very important problem with many practical applications. Some existing techniques and problems are examined.

Opinion extraction from online sources: There are many online opinion sources, e.g., customer reviews of products, forums, blogs and chat rooms. Mining opinions (especially consumer opinions) is of great importance for marketing intelligence and product benchmarking. A few tasks and techniques for mining such sources are introduced.

Knowledge synthesis: Concept hierarchies or ontologies are useful in many applications. However, generating them manually is very time consuming. A few existing methods that exploit the information redundancy of the Web are presented. The main application is to synthesize and organize the pieces of information on the Web to give the user a coherent picture of the topic domain.

Segmenting Web pages and detecting noise: In many Web applications, one only wants the main content of the Web page, without advertisements, navigation links and copyright notices. Automatically segmenting Web pages to extract their main content is an interesting problem. A number of interesting techniques have been proposed in the past few years.

II. TYPES OF DATA ON WEB AND ITS EXTRACTION

Data available on the web is classified as structured data, semi-structured data and unstructured data.

A. Structured Data Extraction
Structured data is easier to extract from web pages than semi-structured or unstructured data. Examples of structured data are lists, trees and tables. Structured data can be extracted using wrapper generation. In wrapper learning, the user manually labels a set of training pages, extraction rules are learned from the labeled pages, and the rules are then applied to extract information from Web pages.

B. Unstructured Data Extraction
Unstructured data is in the form of text documents. Its extraction is related to text mining, natural language processing, machine learning and web question answering.

C. Semi-Structured Data Extraction
Semi-structured data is not full grammatical text. It is hierarchically structured but has no predefined structure: its inherent structure appears implicitly on the page and varies from one page to another. There are several techniques to extract semi-structured data, among them NLP techniques, wrapper generation, ontologies and TINTIN. To extract such data we must know what to extract. A common approach is to build a specific grammar that describes the surroundings of each piece of data to extract.

III. TECHNIQUES FOR EXTRACTING STRUCTURED DATA
There are various techniques to extract structured data using Web content mining. Some of them are described below.

A. Web Crawler
Search engines such as Google rely on Web crawlers. To extract data, the crawler takes input from the user, searches for it and retrieves well-selected pages. The crawler starts from the given pages and follows their links. It uses web search engines and performs a breadth-first search, exploring only the small portion of the web indicated by the user. It then returns new pages that have not yet been indexed. An index is assigned to each individual page by the web search engine. Web search provides an abstract view of the pages and lists relevant websites for different types of data.

For some searches, web crawler techniques give more accurate retrieval than search over single web pages, because a company is normally represented by an entire web site, not by an individual web page. For example, to find the price of a washing machine it is more useful to search the web sites of washing machine retailers than the whole World Wide Web. Web directories have several drawbacks. Most of the time they return only a small portion of the web sites relevant to a given topic. Moreover, web services may require up-to-date information. The simplest way is to use a well-established method: the web crawler collects data, applies a post-processing step and analyses the resulting web pages to find the relevant ones. This analysis is possible by applying a web site classifier to all pages retrieved from a given web site. This approach guarantees that the crawled web pages are handled according to the web sites they belong to. Accurate and efficient web site crawling requires a two-level graph abstraction. The two levels are the external crawler and the internal crawler.

a. External crawler
The external crawler orders the not-yet-visited web sites in the external crawl frontier, decides which site is crawled next, and invokes an internal crawl on that site's first page. It starts from user-specified web sites and expands the graph with newly found web sites.

b. Internal crawler
The internal crawler examines the current web site to identify its purpose while downloading only a few web pages. It processes the web sites handed over by the external crawler. The results produced by the internal crawler are more reliable because of better classification accuracy.
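The two-level crawl described in this section can be illustrated with a minimal sketch. It is not the authors' implementation: the in-memory `WEB` dictionary, the site and page names, and the function names `internal_crawl` and `external_crawl` are all hypothetical, no network access is performed, and the external frontier is a plain first-in-first-out queue where a real external crawler would order sites by estimated relevance and apply a web site classifier.

```python
from collections import deque

# Hypothetical in-memory web: site -> {page -> list of (site, page) links}.
WEB = {
    "shop-a": {"home": [("shop-a", "washers"), ("shop-b", "home")],
               "washers": [("shop-a", "home")]},
    "shop-b": {"home": [("shop-b", "prices"), ("news", "home")],
               "prices": []},
    "news": {"home": [("shop-a", "home")]},
}


def internal_crawl(site, max_pages=5):
    """Breadth-first crawl inside one site, starting at its home page.

    Returns the pages visited (in order) and the external sites discovered.
    """
    seen, queue = {"home"}, deque(["home"])
    visited, external = [], []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for target_site, target_page in WEB[site].get(page, []):
            if target_site != site:
                # Link leaves the site: report it to the external crawler.
                if target_site not in external:
                    external.append(target_site)
            elif target_page not in seen:
                seen.add(target_page)
                queue.append(target_page)
    return visited, external


def external_crawl(seed_sites, max_sites=10):
    """Pick the next site from the frontier and run an internal crawl on it."""
    frontier = deque(seed_sites)   # simplification: FIFO, no relevance ordering
    known = set(seed_sites)
    result = {}
    while frontier and len(result) < max_sites:
        site = frontier.popleft()
        pages, found = internal_crawl(site)
        result[site] = pages
        for other in found:        # expand the site graph with new sites
            if other not in known:
                known.add(other)
                frontier.append(other)
    return result


if __name__ == "__main__":
    print(external_crawl(["shop-a"]))
```

Keeping the two loops separate mirrors the two-level graph abstraction: the inner loop only ever sees pages of one site, while the outer loop only ever sees sites.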
The internal crawl starts with the home page, because most publishers use it to tell the user what the web site is about and what specific information it provides. The internal crawl traverses the web site breadth-first. The internal crawler keeps retrieving enough links to extend the external frontier, and in this way it finds related web pages. If pages are not found by traversing links, the process continues until a reasonable number of links has been traversed.

B. Dynamic Web Content Mining
To mine news from an online news site, dynamic web content mining is useful. It consists of four stages: resource identification, preprocessing, generalization and analysis.

a. Resource identification
The crawler navigates across a web site and extracts news reports from it. It first downloads a document and checks whether it is outdated; if it is not, the document is useful. It then analyses the identified news report and eliminates irrelevant information. It continues the analysis until the queue of URLs is empty. This process is activated periodically. The documents then constitute a snapshot of current events and are subsequently preprocessed and stored.

b. Preprocessing stage
The preprocessing stage transforms incoming news reports into a structured representation consisting of the source information, the date and a formal representation of the contents. The sentences are marked with part-of-speech (POS) tags. Using the POS tags, nouns are identified, and nouns that occur in sequence are joined to form a single item. The items are then selected and inserted into a list of topics.

c. Generalization stage
Generalization covers two tasks: the construction of topic distributions and the analysis of trends. The dynamic crawler continuously downloads the latest news reports and applies simple NLP techniques to extract meaningful topics. In the discovery stage it uses straightforward statistical measures to identify the contributing topics. A graphical interface supports the user in interpreting the discovered patterns.

d. Analysis stage
In this stage the user selects a time frame of interest and sets the parameters. The user then works with the patterns discovered by the system in the generalization stage. If the discovered patterns are not interesting to the user, the process can be repeated.

C. Wrapper Generation
Web page data can be extracted using an HTML wrapper. The data is taken from the DOM tree that a web browser such as Mozilla or Internet Explorer constructs for the page; extraction operates on this DOM tree rather than on the HTML itself. A data chunk extracted from a DOM tree is called an instance. An instance is a set of tree nodes in the input DOM tree, a substring of the text content of nodes, or values of tree element attributes. A human wrapper designer, however, does not naturally think of an HTML table as a tree with some text values. The building block of a wrapper is called a pattern [3]. Each pattern extracts its instances from one parent pattern.
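The three kinds of instance mentioned above (tree nodes, substrings of text content, and attribute values) can be illustrated with a minimal sketch. The page markup, the `offers` table and the helper `price_of` are hypothetical, and the sketch assumes well-formed markup so that Python's standard `xml.etree.ElementTree` (which supports only a limited XPath subset) can stand in for a browser DOM; real-world HTML would normally need a tolerant HTML parser instead.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<html><body>
  <table id="offers">
    <tr class="item"><td>Washer A</td><td>Price: 300 USD</td></tr>
    <tr class="item"><td>Washer B</td><td>Price: 450 USD</td></tr>
    <tr class="footer"><td>Contact us</td><td></td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

# 1. Instances that are tree nodes: the rows holding one offer each.
rows = root.findall(".//table[@id='offers']/tr[@class='item']")


# 2. Instances that are substrings of a node's text content.
def price_of(row):
    """Return the first number found in the second cell, or None."""
    text = row.findall("td")[1].text or ""
    match = re.search(r"(\d+)", text)
    return int(match.group(1)) if match else None


# 3. Instances that are attribute values of tree elements.
row_classes = [tr.get("class") for tr in root.iter("tr")]

# Combine node-level and substring-level instances into records.
records = [(row.find("td").text, price_of(row)) for row in rows]

if __name__ == "__main__":
    print(records)
    print(row_classes)
```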
Each pattern has one or more filters that specify how to extract the relevant instances for the pattern. A filter returns a set of output instances extracted from a given input instance. The set of instances extracted for a pattern is the combination of the instances extracted by all of its filters. An interactive wrapper generator creates filters from visual interaction with a human wrapper designer [3]. The user's responses are equivalent to marking nodes in the DOM tree. In this way the system finds the filters that identify the entire pattern the designer intends. An input instance is selected and the missing instances are marked. When the system correctly matches all intended instances of the current input, the user decides whether to continue with another input instance or another HTML document.

This wrapper generation algorithm uses clustering and attribute classification. Instances are clustered by the similarity of their tree structure. The algorithm builds the list of features used for classification from the attributes and their values, constructs the training database and, for every cluster, builds a decision-tree-based attribute classifier. Each cluster is divided into blocks. Each cluster defines its extraction rule, which consists of a core XPath expression and an attribute classifier. Instances in the input DOM are found by the XPath expression, which matches the particular tree shape of the cluster. The attribute classifier then sorts out the instances. A DOM attribute may repeat over several blocks with different values, so each block is given an index; DOM attributes within the same block are treated as non-repeating. The user can highlight the block of interest. In this way interactive wrapper generation is used to extract data.

D. Page Content Mining
For a given query q and a conventional web search engine, a set of pages is first retrieved and ranked by the web search method. These pages are then classified according to their importance using PCR (Page Content Rank).
PCR classifies pages from the set R(q) of retrieved pages.

IV. TECHNIQUES FOR EXTRACTING SEMI STRUCTURED DATA

A. Using OEM and Schema Knowledge Mining
In this method, useful information is extracted from the group of relevant information in which it is embedded and is stored using the Object Exchange Model (OEM). Schema knowledge mining is then applied to the stored data. It helps the user to understand the information structure of the web more deeply and thoroughly. In OEM each object contains an object identifier and a value. A value can be atomic or complex. An atomic value can be an integer, a real, a string or a program. A complex value is a collection of zero or more OEM sub-objects, each linked to the parent via a descriptive textual label.

To use the method, the user must provide an initial HTTP address to the semi-structured data extractor. The extractor then fetches the needed HTML file from the corresponding remote web server, extracts the useful data based on the specification file that directs the extraction, and stores it in the OEM model. Any useful hyperlinks found are inserted into a queue, so that their HTML files can be fetched and their information extracted in turn. The semi-structured data can then be used for schema knowledge discovery.

Semi-structured data has no fixed schema, and the same attribute may have a different number of values, or no value at all, in different but similarly structured web pages. This makes the extraction task difficult. Assuming that web pages are stored in HTML format, it is possible to design a specification file for every class of similarly structured web pages. After the specification file is designed, labels are added for the interesting attributes. The file is used to extract particular information from a web site: the information to be extracted is identified, and then the label is added. When a hyperlink is extracted, the program must be told which class the linked pages belong to. The algorithm for extracting semi-structured data from the WWW and storing it in OEM gets an HTML document from a web site and extracts the needed information according to the specification file. The schema discovery algorithm is implemented with a hash table, which indexes each object so that all its sub-objects can be found during extraction.

B. Using Top-Down Extraction
The top-down approach extracts complex objects from data-rich web sources and decomposes them into less complex objects. The text on a page may have an inherent structure, and that structure may vary from one page to another, so some description of what to extract is needed, and objects must be distinguished from one another. From a set of data-rich pages, objects and their attributes are extracted and inserted into tables for querying. This allows retrieval of information that is not possible with other text-searching techniques. Assumptions are made about how to parse and recognize tokens for insertion into a table. A grammar-based approach is too rigid for processing the typical text that appears on the Web: to handle such text, the designer of the grammar has to anticipate which exceptions could occur in practice and adapt the grammar. Once an object is properly structured, it is inserted for later querying. The nested table can then be flattened for querying as a standard relational table.

C. Natural Language Processing
Information extraction here relies on natural language processing (NLP).
Such information extraction is used to develop systems that take natural-language free text and extract a limited range of key parts from it. The process also uses the structural information of semi-structured data: some kind of structural pattern can be found in the data to assist extraction.

D. Web Information Collection, Collaging and Programming System (WICCAP)
A tree language can be used to process trees and to describe semi-structured web data sources. WICCAP is a Web data mediator system. A language is required to represent wrapper rules. The system introduces tools based on the tree language framework, including a visual interface with which the user builds and modifies rules. WICCAP interprets the learned wrapper rules and builds structured XML documents from web data sources.

a. Web Data Extraction Language
The kernel task of Web data extraction is to transform Web data into structured data that meets the user's requirements. Web information can be stored in tables, since most of it consists of semi-structured documents that are easily rendered by a browser and read by a user. The main problem with semi-structured documents is the lack of an explicit data schema and of constraints on the data. A web data extraction system can convert web data to the relational data model and store it in a relational database, which is highly efficient to access. Since semi-structured data is hierarchical in nature, a hierarchical data model can also be used, with the data stored in an output store that corresponds to this model. Here the Web Data Extraction Language (WDEL) is used, which is based on a formal language theory, the theory of tree languages. A tree language is a tool to represent and process trees. The basic unit of a tree language is the symbol, which represents the structural information of a node. The set of symbols forms an alphabet, and a tree built from these symbols is called a term over that alphabet. Web data takes the form of a directed graph in which each node is a web document or part of one.
The edges are hyperlinks or other kinds of links, such as links in XML documents. After web data extraction the information is reorganized and appears as a logical view. Cyclic routes in the graph can be discarded, so the graph can be treated as a tree and the web documents as a tree language. The tree language grammar describes the initial symbol. The next step is to work out the grammar and map the relation between the input and output schemas. The extracted data is then stored using a hierarchical data model, and a tree automaton is constructed according to the tree language grammar. The following step is to transform the hierarchical data into relational tables. Transforming the data into relational tables requires more work than storing it in hierarchical storage. WICCAP stores the extracted data in XML format. The output data is based on tree language theory. Its logical view is not the same as the physical structure of the web documents, so a virtual node, called a mapping node, is added to the logical data model. A mapping node represents a physical path. In this way a physical path in the physical structure of the web is mapped to a logical path in the logical view. Users have to apply their domain knowledge to define a logical view mapped to the actual website and to exploit the extracted data.

CONCLUSION
Web mining is a rapidly growing research area. Web content mining is related to, but different from, data mining and text mining. Web data are mainly semi-structured and/or unstructured. Web content mining requires creative applications of data mining and/or text mining techniques as well as its own unique approaches. Due to the heterogeneity and the lack of structure of Web data, automated discovery of targeted or unexpected knowledge still presents many challenging research problems. We first described the various types of data available on the web. We then described various problems of web content mining and techniques to mine Web pages, covering structured and semi-structured data.

REFERENCES
[1] J. P. Callan, Passage-Level Evidence in Document Retrieval, In Proceedings of the ACM SIGIR Conference on Information Retrieval, Dublin, Ireland.
[2] M. Kaszkiel and J. Zobel, Passage Retrieval Revisited, In Proceedings of the ACM SIGIR Conference on Information Retrieval, Philadelphia, USA, 1997.
[3] Bing Liu, Kevin Chen-Chuan Chang, Editorial Issue on Web Content Mining, issue 2.
[4] A. Mendez-Torreblanca, M. Monte, A Trend Discovery for Dynamic Web Content Mining, IEEE, Intelligence System, Vol 14, pages 20-22.
[5] Chen Enhog, Wang Sufi, Semi Structure Data Extraction and Schema Knowledge Mining, EUROMICRO Conference, Proceedings 25, Volume 2, 1999.
[6] Zhao Li, Wee Keong Ng, WICCAP: From Semi-Structured Data to Structured Data, In Proceedings of the 11th IEEE International Conference and Workshop on the Engineering of Computer-Based Systems.
