Wrapper Implementation for Information Extraction from House Music Web Sources

Wrapper Implementation for Information Extraction from House Music Web Sources

Author: Matthew Rowe
Supervisor: Prof Fabio Ciravegna
Module Code: COM3010/COM
4th May 2005

This report is submitted in partial fulfilment of the requirement for the degree of Master of Software Engineering by Matthew Rowe.

Signed Declaration

All sentences or passages quoted in this dissertation from other people's work have been specifically acknowledged by clear cross-referencing to author, work and page(s). Any illustrations which are not the work of the author of this dissertation have been used with the explicit permission of the originator and are specifically acknowledged. I understand that failure to do this amounts to plagiarism and will be considered grounds for failure in this dissertation and the degree examination as a whole.

Name: Matthew Rowe
Signature:
Date: 4th May 2005

Abstract

The genre of house music is an expanding area, with new producers creating new records every week for many record labels and eager listeners. Until the creation of the World Wide Web, purchasing records could be a mundane and laborious task. Consumers are now able to browse large lists of available records and preview a record before it is purchased, either by listening to the record, looking at the sleeve, or simply recognising the artist, title or label. More and more people buy their records online, and this has led to the creation of online record stores, all competing for custom. The aim of this project is to adapt relevant techniques to replace the process of browsing many different online record stores, gathering records from several different web sites in one place.

Acknowledgements

I would like to thank Professor Fabio Ciravegna for his continual support and guidance throughout the project. I would also like to thank Sam Chapman for his help with the primary stages of research, and for taking time out of his busy schedule to answer any questions I had. Finally, I would like to thank Nicholas Starkey, my housemate at university, who proofread this work when he was not obliged to.

Table of Contents

1. Introduction
   1.1 Overview
   1.2 Objectives
   1.3 Outline
2. Wrapper Classes
   2.1 LR Wrapper Class
   2.2 HLRT Wrapper Class
   2.3 OCLR Wrapper Class
   2.4 HOCLR Wrapper Class
   2.5 N-LR Wrapper Class
   2.6 N-HLRT Wrapper Class
   2.7 Softmealy Wrapper Representation
   2.8 Wrapper Coverage
3. Extraction Patterns and Rules
   3.1 Embedded Catalog Formalism and Extraction Rules
   3.2 Whisk Extraction Rules
   3.3 Information Extraction Patterns from Tabular Web Pages
4. Wrapper Induction
   4.1 WIEN
   4.2 STALKER
   4.3 Soft Mealy
   4.4 Whisk
5. Information Extraction
   5.1 Armadillo
   5.2 Amilcare
   5.3 Regular Expressions
6. Requirements and Analysis
   6.1 Motivation
   6.2 Objectives
   6.3 System Structure
      6.3.1 Wrappers
      6.3.2 Wrapper Processing System
      6.3.3 Presentation
   6.4 System Analysis
      6.4.1 House Record Web Sites
      6.4.2 Wrapper Technology to be Implemented
   6.5 Requirements
      6.5.1 Functional Requirements
      6.5.2 Non-functional Requirements
      6.5.3 Priority Matrix
   6.6 Evaluation Methodology
      6.6.1 Empirical Evaluation (primary)
      6.6.2 Test Case Evaluation (primary)
      6.6.3 Alpha Beta Testing (secondary)
7. System Design
   7.1 Choice of Technologies
   7.2 System Structure
      7.2.1 Skeleton Wrapper
      7.2.2 Wrapper Processing System
      7.2.3 Presentation
   7.3 Prospective Wrapping Sources
8. Component Implementation
   8.1 Extrax
      8.1.1 Skeleton Wrapper Implementation
      8.1.2 Wrapper Processing System
      8.1.3 Presentation
9. Evaluation
   9.1 Requirements Evaluation
   9.2 Test Case Evaluation
      9.2.1 Test Source 1: DCS Staff Page
      9.2.2 Test Source 2: Juno
   9.3 Empirical Evaluation
      9.3.1 Test Source 1: DCS Staff Page
      9.3.2 Test Source 2: Juno
   9.4 Alpha Beta Testing
      9.4.1 Alpha Version
      9.4.2 Beta Version
   9.5 Evaluation Analysis
10. Conclusion
   10.1 Achievements
   10.2 Limitations
   10.3 Problems Encountered
   10.4 Possible Improvements
   10.5 Future Work
References
Appendix

Table of Figures

Figure 2.2 Web page containing country codes
Figure 2.3 Web page containing country codes
Figure 2.5a Web page containing nested content
Figure 2.7a Softmealy token classes
Figure 2.7b Source code from computer science department at UTC
Figure 2.7c First instance separator notation
Figure 2.7d Separator class notation
Figure 2.8a Table of wrapper class coverage
Figure 2.8b Relative Expressiveness of LR, HLRT, OCLR and HOCLRT
Figure 3.1a Example of an EC description
Figure 3.1b Extraction rule for restaurant name
Figure 3.1c Section of HTML code from restaurant lists web page
Figure 3.1d Extraction rule for restaurant name using a wildcard
Figure 3.1e Extraction rule for restaurant address
Figure 3.1f Section of HTML code from restaurant lists web page
Figure 3.1g Extraction rule for telephone number using a disjunctive
Figure 3.1h Iterative extraction rule for list of credit cards
Figure 3.2a Target source containing housing information
Figure 3.2b Extraction rule for number of bedrooms and prices
Figure 3.2c Semantic class for Bdrm
Figure 3.2d Extraction rule using semantic classes Nghbr and Bdrm
Figure 3.3a Syntax used by the pattern language
Figure 3.3b Syntax of the tag tokens
Figure 3.3c Syntax of the text tokens
Figure 3.3d Tag token using wildcards
Figure 4.1 Screen shot of WIEN running
Figure 4.4a Untagged HTML code
Figure 4.4b Tagged HTML code
Figure 4.4c Most general extraction rule
Figure 5.3a Java source code to create a pattern matcher
Figure 5.3b Java source code to capture information using a group
Figure 5.3c Java source code to apply a pattern to a matcher
Figure 6.1a Screen shot of
Figure 6.1b Screen shot of
Figure 6.1c Screen shot of
Figure 6.1d Screen shot of
Figure 6.3 The structure of the system
Figure 6.4.1a Screen shot of
Figure 6.4.1b Screen shot of
Figure 6.4.1c Embedded Catalog Formalism of a house record web site
Figure Tier architecture of the proposed system
Figure 7.2.1a Screenshot of displaying information intended for extraction
Figure 7.2.1b Class structure for the skeleton wrapper
Figure 7.2.2a Flow diagram of wrapper processing system
Figure 7.2.2c Illustration of the HLRT wrapper running over a table of records
Figure Structure of JSP files
Figure 8.1 The Extrax Logo
Figure Implementation of the HLRT wrapper algorithm
Figure 8.1.3a Screen shot of Extrax displaying records from all sources
Figure 8.1.3b Screen shot of the search tool in Extrax
Figure 8.1.3c Screen shot of Extrax displaying a single record
Figure 8.1.3d Screen shot of Extrax offering different audio formats
Figure 9.2.1a Screen shot of the DCS staff page
Figure 9.2.1b Section of HTML code from the DCS staff page
Figure 9.2.1c DCS wrapper console output
Figure 9.2.2b HTML source code of
Figure 9.2.2c Juno wrapper console output
Figure 10.5 System structure involving a wrapper induction system

1. Introduction

1.1 Overview

The World Wide Web has grown exponentially both in size and in popularity since its creation. With the current estimate [1] of Internet users standing at over 888 million people worldwide, and usage growth between 2000 and 2005 at 146.2%, the World Wide Web has become an important and useful tool in everyday life. Not only is the Internet a medium of communication, it also provides goods and services to the consumer which would previously have been unobtainable. Companies have set up extensive catalogues of their stock for users to browse using web browsers. The multi-platform nature of web sites allows any user with access to the World Wide Web to view a company's products. The ability to view companies' stock on their web sites has brought about competition between many companies, causing them to compete for users by providing more extensive stock lists, or more details about the stock they sell.

Several companies have now set up sites to act as an intermediate party between the consumer and many companies [2]. By extracting information from the stock lists of various companies' web sites, the intermediate sites are able to create their own stock list based on a culmination of the extracted stock lists. The intermediate sites offer details about each product to the user and include a link to the site where the information was extracted and the product can be purchased.

In order to extract the information from web sites, the intermediate sites implement wrappers. Wrappers are designed to represent a pattern or structure of a document or web page. Specific content areas on a web page can be set for extraction by the wrapper. The intermediate web sites use wrapper technology to extract stock lists from various web sites. Several of the intermediate web sites are capable of creating wrappers automatically using wrapper induction (the practice of inducing wrappers from annotated examples, where the annotation trains the system on the information to be extracted). Many intermediate web sites create wrappers manually to improve their accuracy and guarantee the extraction of the correct content. This project focuses on those manually created wrappers and looks at the various wrapper technologies available, their different roles, and their application environments. Extensive research has taken place into the field of wrapper technology and extraction patterns, centred on the extraction of information from semi-structured web pages (pages containing both free text and tables).

Retailers and distributors of records have benefited from the World Wide Web by offering lists of their currently available records to the consumer, where each record offers multimedia content such as an audio sample. However, a problem arises when compiling a list of the latest records available. The large number of record web sites offering lists of the latest releases forces users to look at many web sites consecutively and cumulatively build a mental picture of the latest releases. This project will focus on applying wrapper technology to retrieve records from various record web sites.

1.2 Objectives

The aim of this project is to implement a current wrapper technology and use this implementation to extract information from house record web sites on the World Wide Web. The information extracted from the various sources will then be gathered together and displayed to the user.

1.3 Outline

Chapter two focuses on the various wrapper technologies that are available. Each technology is described by giving examples of the algorithms they require for processing, the notations used for each, and the environments in which they are applied. Chapter three looks at the different varieties of extraction patterns and describes the different notations used, along with the different application environments. Chapter four describes wrapper induction and its role in generating wrappers automatically; this chapter also includes example wrapper induction systems and how they function. Chapter five introduces methods of information extraction, and several systems specialised in annotating information as it is extracted.

Chapter six looks at the requirements for the project, and the analysis that will be applied to the project. This chapter attempts to capture suitable testing strategies and methodologies, and outlines the essential characteristics the project system should adhere to. The following chapter, chapter seven, focuses on the design of the system, detailing the proposed structure of the system and outlining the designs for the various system components. Chapter eight explains the implementation of the system, what challenges arose, and how they were tackled. Chapter nine contains the evaluation of the project. It comments on the outcome of the system when compared to the requirements presented previously. It also recalls the testing methods from chapter six and applies them to the system for analysis. Chapter ten includes the conclusion of the project, detailing the work completed, possible future developments, and discussion of the final system.

2. Wrapper Classes

For information to be extracted from a web page, a mechanism must be set up that takes as inputs the HTML source code of the web page and some method or pattern for extracting information from that code. One such method is the use of wrappers. Wrappers function by being given delimiters (parameters) relating to the information that is to be extracted. When extracting information from a web page, these parameters will be HTML tags that surround the required information. There are many different wrapper classes available for use. This section will look at the different roles and methods involved with each.

2.1 LR Wrapper Class

This wrapper class [3] uses a left-to-right mechanism to extract information. It is supplied with a list of tuples, where each tuple consists of a left and right delimiter. Within the target source code, the information to be extracted lies between these delimiters. The algorithm this class uses is as follows:

Given wrapper(l1,r1,...,lk,rk) and page P
  While there is a next occurrence of l1 in P
    For each (lk,rk) ∈ {(l1,r1),...,(lk,rk)}
      Extract from P the value between lk and rk
  Return all extracted tuples

By using this algorithm it is possible to set the tuples to extract the necessary information from a given page. For example, suppose a page consists of:

ho[A11](A12)co[A21](A22)co[A31](A32)ct

Applying the following wrapper to the page:

{[,],(,)}

would produce:

{<A11,A12>,<A21,A22>,<A31,A32>}
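To make the algorithm concrete, the following is a minimal sketch of an LR extractor in Java. It is an illustration only, not code from any of the systems discussed: the class and method names are invented, and it assumes the delimiters never overlap.

import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the LR algorithm: scan the page for each (left, right)
// delimiter pair in turn and collect the values found in between.
public class LRWrapper {

    public static List<String[]> extract(String page, String[][] pairs) {
        List<String[]> tuples = new ArrayList<>();
        int pos = 0;
        // While there is a next occurrence of l1 in P...
        while (page.indexOf(pairs[0][0], pos) != -1) {
            String[] tuple = new String[pairs.length];
            for (int k = 0; k < pairs.length; k++) {
                int start = page.indexOf(pairs[k][0], pos);
                if (start == -1) return tuples;          // no complete tuple left
                start += pairs[k][0].length();
                int end = page.indexOf(pairs[k][1], start);
                if (end == -1) return tuples;
                tuple[k] = page.substring(start, end);   // value between lk and rk
                pos = end + pairs[k][1].length();
            }
            tuples.add(tuple);
        }
        return tuples;
    }

    public static void main(String[] args) {
        String page = "ho[A11](A12)co[A21](A22)co[A31](A32)ct";
        String[][] pairs = {{"[", "]"}, {"(", ")"}};
        for (String[] t : extract(page, pairs)) {
            System.out.println("<" + t[0] + "," + t[1] + ">");
        }
    }
}

Running this sketch on the example page prints the three tuples shown above.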

2.2 HLRT Wrapper Class

This class of wrapper [3] uses the LR wrapper algorithm on a specific area of the page, which can be set by two more delimiters: the head and tail. Setting these parameters produces the following algorithm:

Given wrapper(h,t,l1,r1,...,lk,rk) and page P
  Skip past the next occurrence of h in P
  While the next occurrence of l1 is before t in P
    For each (lk,rk) ∈ {(l1,r1),...,(lk,rk)}
      Extract from P the value between lk and rk
  Return all extracted tuples

The HLRT algorithm differs little from the LR algorithm; the only significant alterations are the setting of the head and tail. Although this seems like a basic alteration, it has a powerful effect when working with an arbitrarily large page, allowing extraction within a specific area of the content and therefore avoiding the processing of redundant information. This matters because minor alterations to a page can render an LR wrapper useless [4].

To show the functioning of this wrapper class, consider the following page in HTML form:

<HTML><TITLE>Some Country Code</TITLE><BODY>
<B>Some Country Codes</B><P>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
<HR><B>End</B></BODY></HTML>

Figure 2.2 Web page containing country codes

Applying the following wrapper to the page:

{<P>,<HR>,<B>,</B>,<I>,</I>}

would produce:

{<Congo,242>,<Egypt,20>,<Belize,501>,<Spain,34>}

This mechanism for information extraction is very useful for tabular data, as a lot of information on the web is stored in tables. Tables are written in a structured and repetitive manner, allowing the creation of several left and right delimiter tuples to match one row. Once the information has been extracted from one row, the algorithm reapplies the left and right delimiters to the next row, and so forth, until the tail of the wrapper has been reached. The head and tail parameters can also be used as a mechanism to avoid incorrect extraction of information from the head and tail of the page: specific areas of a page can be focused on, and information extracted from those areas rather than from the whole page.
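Extending the earlier LR sketch to HLRT only requires narrowing the scanned region to the text between the head and the tail before the LR loop is applied. The method below is again a hypothetical illustration (it could be added to the LRWrapper sketch above), not the implementation described later in this report.

// Hypothetical HLRT variant of the earlier LR sketch: restrict the scan
// to the region between the head h and the tail t, then reuse extract().
public static List<String[]> extractHLRT(String page, String h, String t,
                                         String[][] pairs) {
    int head = page.indexOf(h);
    if (head == -1) return new ArrayList<>();
    int tail = page.indexOf(t, head + h.length());
    if (tail == -1) return new ArrayList<>();
    String region = page.substring(head + h.length(), tail);
    return extract(region, pairs);
}

Calling extractHLRT(html, "<P>", "<HR>", new String[][]{{"<B>", "</B>"}, {"<I>", "</I>"}}) on the country code page returns the four country/code pairs while ignoring the page title and footer.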

2.3 OCLR Wrapper Class

Another class of wrappers is the OCLR wrapper class [3]. The theory behind this class states that a page consists of tuples separated by irrelevant text, whereas the HLRT wrapper class treats the page as containing irrelevant text only before the head and after the tail. This class of wrappers works by specifying an opening and closing parameter for each tuple.

Given wrapper(o,c,l1,r1,...,lk,rk) and page P
  While there is a next occurrence of o in P
    Skip to the next occurrence of o in P
    For each (lk,rk) ∈ {(l1,r1),...,(lk,rk)}
      Extract from P the value between lk and rk
    Skip past the next occurrence of c in P
  Return all extracted tuples

Consider the following page P:

<HTML><TITLE>Some Country Code</TITLE><BODY>
<B>1</B> <B>Congo</B> <I>242</I><BR>
<B>2</B> <B>Egypt</B> <I>20</I><BR>
<B>3</B> <B>Belize</B> <I>501</I><BR>
<B>4</B> <B>Spain</B> <I>34</I><BR>
<HR><B>End</B></BODY></HTML>

Figure 2.3 Web page containing country codes

Applying the following wrapper to the page:

{</B>,<BR>,<B>,</B>,<I>,</I>}

would produce:

{<Congo,242>,<Egypt,20>,<Belize,501>,<Spain,34>}

By setting the opening of the tuple as </B> and the closing as <BR>, all the irrelevant numbers can be skipped.

2.4 HOCLR Wrapper Class

This class of wrapper combines the functionality of the OCLR wrapper class and the HLRT wrapper class [3, 4]. By taking the OCLR view of tuples separated by irrelevant text and combining it with a specific area for extraction, set by the head and tail delimiters, the HOCLR wrapper class is more precise and accurate in its extractions.

Given wrapper(h,t,o,c,l1,r1,...,lk,rk) and page P
  Skip past the first occurrence of h in P
  While the next occurrence of o is before t in P
    Skip to the next occurrence of o in P
    For each (lk,rk) ∈ {(l1,r1),...,(lk,rk)}
      Extract from P the value between lk and rk
    Skip past the next occurrence of c in P
  Return all extracted tuples

Consider the following page P:

ho[A11](A12)co[A21](A22)co[A31](A32)ct

Applying the following wrapper to the page:

{h,t,o,c,[,],(,)}

would produce:

{<A11,A12>,<A21,A22>,<A31,A32>}

Although this mechanism for wrapping information is more precise due to the combination of techniques between wrapper classes, it is only useful when the page to be wrapped contains irrelevant information; when this is the case, it is extremely efficient and capable of extracting information effectively. The comparison to note between this class and the HLRT wrapper class is the accuracy of the extraction: for a piece of information to be extracted, the opening and closing tags must be present surrounding the tuple, the left and right delimiters must be present, and the head and tail delimiters must be present.

2.5 N-LR Wrapper Class

The wrapper classes seen previously [3] deal with information that is stored in tables and therefore has a repetitive structure. This class of wrapper deals with nested information. Nested information differs from tabular information in that it requires the extraction of tuples of differing lengths: a list is not constrained to have the same number of values under each attribute; instead, the number varies from attribute to attribute. For example:

Figure 2.5a Web page containing nested content

As figure 2.5a shows, certain attributes have differing numbers of values nested below them; "The Semantic Web" has no values below it, but "Wrapper Induction Systems" has ten. The N-LR wrapper class handles this problem by using the following algorithm:

Given wrapper(l1,r1,...,lk,rk) and page P
  While there is a next occurrence of l1 in P
    Extract from P between l1 and r1
    While there is a next occurrence of lk before l1 in P
      Extract from P between lk and rk
  Return all extracted tuples

Using the following page P:

h[A11i][A11ii](A12)[A21](A22)[A31](A32)t

Applying the following wrapper to page P:

{[,],(,)}

correctly extracts the information as follows:

{<A11i>,<A11ii,{<A12>}>,<A21,{<A22>}>,<A31,{<A32>}>}
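The nested inner loop is the essential difference from the LR sketch shown earlier. The following hypothetical Java sketch (names invented for illustration) handles the two-level case: one head attribute with a variable number of nested values, printed as they are found.

// Sketch of the N-LR algorithm for the two-level case: extract a head
// attribute between l1 and r1, then extract nested values between l2 and
// r2 until the next occurrence of l1 (or the end of the page) is reached.
public class NLRWrapper {

    public static void extract(String page, String l1, String r1,
                               String l2, String r2) {
        int pos = 0;
        while (true) {
            int start = page.indexOf(l1, pos);
            if (start == -1) break;
            int end = page.indexOf(r1, start + l1.length());
            if (end == -1) break;
            System.out.println("attribute: "
                + page.substring(start + l1.length(), end));
            pos = end + r1.length();
            int nextL1 = page.indexOf(l1, pos);
            int nested;
            // Nested values belong to this attribute only while they
            // occur before the next top-level delimiter.
            while ((nested = page.indexOf(l2, pos)) != -1
                    && (nextL1 == -1 || nested < nextL1)) {
                int nEnd = page.indexOf(r2, nested + l2.length());
                if (nEnd == -1) return;
                System.out.println("  value: "
                    + page.substring(nested + l2.length(), nEnd));
                pos = nEnd + r2.length();
            }
        }
    }

    public static void main(String[] args) {
        extract("h[A11i][A11ii](A12)[A21](A22)[A31](A32)t",
                "[", "]", "(", ")");
    }
}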

The advantage of having such a wrapper class is obvious when handling data that varies in structural content. Lists such as the one in figure 2.5a can be hard to traverse and extract information from without a wrapper class such as N-LR.

2.6 N-HLRT Wrapper Class

By combining the functionality of the N-LR wrapper class and the HLRT wrapper class, the N-HLRT wrapper class [3] attempts to avoid extracting irrelevant information from nested sources by using head and tail delimiters. The algorithm it uses is similar to that of the N-LR wrapper class, but with the added use of a head and tail, as follows:

Given wrapper(h,t,l1,r1,...,lk,rk) and page P
  Skip past the first occurrence of h in P
  While there is a next occurrence of l1 before t in P
    Extract from P between l1 and r1
    While there is a next occurrence of lk before l1 and t in P
      Extract from P between lk and rk
  Return all extracted tuples

Using the following page P:

h[A11i][A11ii](A12)[A21](A22)[A31](A32)t

Applying the following wrapper to page P:

{h,t,[,],(,)}

correctly extracts the information as follows:

{<A11i>,<A11ii,{<A12>}>,<A21,{<A22>}>,<A31,{<A32>}>}

2.7 Softmealy Wrapper Representation

This wrapper representation extracts tuples, where each tuple contains attributes [5]. It treats each distinct attribute permutation as an encoded path through an HTML-coded web page. In order to understand this theory fully, it is important to break the wrapper representation down into each of its components.

The Softmealy approach treats the HTML of a page as one continuous string of content, separated into tokens. Tokens are very important to the wrapper, as they are the base elements the HTML is broken down into. A token is written as t(v), where t is the token class and v is a string. The token classes that a token can belong to are shown in figure 2.7a.

CAlph(ASU) = ASU - all uppercase
C1Alph(Professor) = Professor - one uppercase, followed by at least one lowercase
0Alph(and) = and - one lowercase, then more than zero characters
Num(123) = 123 - numeric string
Html(<I>) = <I> - HTML tag
Punc(,) = , - punctuation symbol
Control characters, where the length is the argument, e.g. NL(1) = new line, Tab(4) = tab, Spc(3) = white space

Figure 2.7a Softmealy token classes

Softmealy uses separators. A separator is an invisible borderline between two adjacent tokens. A separator s is described by its context tokens: those to the left, sL, and those to the right, sR. In order to extract an attribute, the surrounding tokens are recognised. However, the separators for an attribute may not be the same for all the tuples in the web page. Contextual rules are therefore used to define a set of individual separators. A contextual rule can be expressed as a sequence of tokens, and can include the use of wildcard tokens to denote any token in any class.

<LI><A HREF= > mani Chandy</A>, <I>Professor of Computer Science</I> and <I>Executive Officer of Computer Science</I>
<LI><A HREF= > Jim Avro</A>, <I>Associate Professor of Computer Science</I>
<LI><A HREF= > David E. Breen</A>, <I>Assistant Director of Computer Graphic Lab</I>
<LI>John Tanner, <I>Visiting Associate of Computer Science</I>
<LI>Fred Thompson, <I>Professor Emeritus of Applied Philosophy and Computer Science</I>

Figure 2.7b Source code from computer science department at UTC

To show how the representation works with a web page, take figure 2.7b. This page contains details of members of the computer science department at the University of Technology in California. Consider the separator between <I> and Professor. For the first instance of this separator we can use the following:

sL ::= Punc(,)Spc(1)Html(<I>)
sR ::= C1Alph(Professor)Spc(1)0Alph(of)

Figure 2.7c First instance separator notation

Although this will work for the first instance of the separator, it will not work for all of those following it. Therefore, rather than using the definition of the one separator, the separator class S can be used to hold the various definitions:

SL ::= Html(</A>)Punc(,)Spc(_)Html(<I>)
     | Punc(_)NL(_)Spc(_)Html(<I>)
     | Punc(,)Spc(_)Html(<I>)
SR ::= C1Alph(_)

Figure 2.7d Separator class notation

In the notation from figure 2.7d, the | symbol refers to the logical OR operator: the use of optional units. The next stage of this representation is to create a Finite State Transducer (FST), which takes a sequence of separators as input rather than an HTML string. It then matches tokens from the separators with contextual rules and decides what the following transitions should be. One tuple is represented by one FST; a body string can be extracted from an HTML page and then fed into a tuple transducer, which can then iteratively extract attributes.

This form of wrapper expression is an extremely useful aspect of the Softmealy approach: it is very precise, and the use of options can be extremely useful when dealing with alternating formats throughout a page. The use of finite state transducers as a means of traversing a page and extracting content is an approach which could easily be applied to tabular information from semi-structured sources, the repetitive nature of its format helping with the iterative process.
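To make the token notation concrete, the sketch below classifies single strings into the token classes of figure 2.7a using regular expressions. It is a hypothetical illustration, not part of Softmealy; the control-character classes are omitted for brevity, and the exact character-class definitions are assumptions.

// Classify a token string into one of the Softmealy token classes from
// figure 2.7a (control characters such as NL, Tab and Spc omitted).
public class TokenClassifier {

    public static String classify(String v) {
        if (v.matches("<[^>]+>"))      return "Html(" + v + ")";
        if (v.matches("[A-Z]+"))       return "CAlph(" + v + ")";
        if (v.matches("[A-Z][a-z]+"))  return "C1Alph(" + v + ")";
        if (v.matches("[a-z]+"))       return "0Alph(" + v + ")";
        if (v.matches("[0-9]+"))       return "Num(" + v + ")";
        if (v.matches("\\p{Punct}"))   return "Punc(" + v + ")";
        return "Unknown(" + v + ")";
    }

    public static void main(String[] args) {
        String[] tokens = {"ASU", "Professor", "and", "123", "<I>", ","};
        for (String t : tokens) {
            System.out.println(classify(t));
        }
    }
}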

2.8 Wrapper Coverage

The wrappers that have been looked at so far can be assessed according to their coverage over a specific page. Kushmerick et al [3] used this methodology when deciding which wrapper class to implement for WIEN (a wrapper induction environment). By using the same page for the assessment of all wrapper classes, comparisons between wrapper classes can be made.

Wrapper Class   Coverage (%)
LR              53
HLRT            57
OCLR            53
HOCLRT          57
N-LR            13
N-HLRT          50

Figure 2.8a Table of wrapper class coverage

Wrapper class coverage is measured by the percentage of a page it covers; a more expressive wrapper is capable of a larger percentage of coverage. Wrapper classes can be combined to achieve a much larger area of coverage. If all the wrappers from figure 2.8a were combined, the union of all the wrappers would cover seventy percent of a page. If the union of only the LR, HLRT, OCLR and HOCLRT wrapper classes were used, then coverage of sixty percent would be achieved. Figure 2.8b demonstrates the ability to achieve a larger level of coverage through the union of wrapper classes. The relative expressiveness of each wrapper is shown by the size of its oval, indicating the amount of space it is able to cover. Each wrapper class is capable of covering a set area, but when the areas are combined, the amount of coverage increases.

[Figure 2.8b: four overlapping ovals labelled HLRT, LR, HOCLRT and OCLR]

Figure 2.8b Relative Expressiveness of LR, HLRT, OCLR and HOCLRT

3. Extraction Patterns and Rules

3.1 Embedded Catalog Formalism and Extraction Rules

Due to the hierarchical structure of many web pages of a semi-structured nature, it is possible to apply representation formalisms. One such formalism is the embedded catalog formalism. The EC (Embedded Catalog) description of a page [6] is a tree-like structure, where leaves represent data to be extracted. Each node in the tree is either a homogeneous list (items of the same kind, e.g. a list of books) or a heterogeneous tuple (items of different kinds, e.g. a book and its author).

LIST(LA-Weekly restaurants)
  TUPLE(LA-Weekly restaurants)
    name
    address
    phone
    review
    LIST(credit cards)
      credit card

Figure 3.1a Example of an EC description

As figure 3.1a shows, the EC description of a restaurant taken from a web page can be displayed relatively easily. The page itself comprises a list of restaurants, where each restaurant is a tuple. As described previously, a tuple can contain different kinds of items; in this case the tuple contains leaves and an embedded list. This formalism can then be combined with a query for the information to be extracted (e.g. the address) and extraction rules, which specify the correct way to go about retrieving the information.

The key theory behind extraction rules [7] used within the context of an EC description is to use landmarks to locate a specific item within the content of its parent item, e.g. find the address within a restaurant tuple. Figure 3.1b shows the extraction rule that will be applied to the HTML code of a given source (based on figure 3.1c).

Rule 1 = SkipTo(<BR> <BR>) SkipTo(<B>)

Figure 3.1b Extraction rule for restaurant name

<TITLE>LA Weekly Restaurants</TITLE><CENTER><BR>
Name: <B> Killer Shrimp </B> Location: <B><FONT> Any </FONT></B><BR><BR>
Cuisine: Any <B> KILLER SHRIMP </FONT></B><BR><B>
523 Washington, Marina del Rey <BR><p> (310) </B>
<BLOCKQUOTE> Lovely seafood. MC, V, AE, DIS.

Figure 3.1c Section of HTML code from restaurant lists web page

The extraction rule applied to the HTML code works by skipping all content until two consecutive <BR> tags are found, after which it skips until a <B> tag is found. The information following the second SkipTo() statement is the information to be extracted; in this case, the name of the restaurant. Using this rule syntax it is also possible to use wildcard elements such as _HTMLTag_ where it is unclear what the tag will be. A rule similar to the previous one can be created using the wildcard:

Rule 2 = SkipTo(<BR> _HTMLTag_) SkipTo(<B>)

Figure 3.1d Extraction rule for restaurant name using a wildcard

This rule would also extract the restaurant name, but would instead look for a <BR> tag followed consecutively by any HTML tag. Another example of a rule that could be created to extract the address of the restaurant from the code is as follows:

Rule 3 = SkipTo(<BR> <B>)

Figure 3.1e Extraction rule for restaurant address

This implies skipping all content until a <BR> and a <B> tag appear consecutively; the information to be extracted appears after these tags. One of the main advantages of using rules for extraction is the ability to apply alternative rules using a disjunctive; this caters for optional rules to be used depending on the content. For example, take figure 3.1f.

<TITLE>LA Weekly Restaurants</TITLE><CENTER><BR>
Name: <B> Killer Shrimp </B> Location: <B><FONT> LA </FONT></B><BR><BR>
Cuisine: Chinese <B> HOP LI </FONT></B><BR><B>
Pico Blvd., West L.A. <BR></B><i> No Phone Number </i>
<BLOCKQUOTE> Lovely Chinese food. V, AE.

Figure 3.1f Section of HTML code from restaurant lists web page

If the required information to be extracted were the telephone number from both figure 3.1c and figure 3.1f, there could be a problem with applying only one extraction rule. In figure 3.1c the telephone number is indicated by the <p> tag; in figure 3.1f, however, it is indicated by the <i> tag. This can be overcome by using a rule as follows:

Rule 4 = EITHER SkipTo(<p>) OR SkipTo(<i>)

Figure 3.1g Extraction rule for telephone number using a disjunctive

When processing this rule, the implication is to skip all content until a <p> tag is found; if no <p> tag is found, then skip all content until an <i> tag is found, hence covering both options.

Under an EC formalism, the tuple of restaurant details can contain a list of the credit cards accepted at the restaurant. In order to extract each credit card from the list, there must be some iteration rule capable of working through the list. The process entails an extraction rule that allows the system to get the list of credit cards, followed by the application of the following:

Rule 5 = iteration rule SkipTo(,)

Figure 3.1h Iterative extraction rule for list of credit cards

The rule uses the "," to reach the start of each credit card, and reapplies the iteration rule to the list to identify each element. When applied to figure 3.1c it would extract MC, V, AE and DIS.

One of the major advantages of using extraction rules with an embedded catalog formalism is the ability to navigate lists of items; depending on the nature of the target source for extraction, this feature can be very useful. Another advantage is the ability to include disjunctive rules, where options are available for extraction. This is extremely useful when slight variations in layout or syntax frequently occur on a page.
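SkipTo rules of this kind translate directly into code. The sketch below is a hypothetical illustration (not STALKER itself) that treats each landmark as a separate SkipTo step, a simplification of the consecutive-landmark form used in rule 3, and returns the content between the final landmark and a given end delimiter.

// Skip past each landmark in turn, then return the content between the
// last landmark and the end delimiter; null means the rule failed.
public class SkipToRule {

    public static String apply(String page, String[] landmarks, String end) {
        int pos = 0;
        for (String landmark : landmarks) {
            pos = page.indexOf(landmark, pos);   // SkipTo(landmark)
            if (pos == -1) return null;
            pos += landmark.length();
        }
        int stop = page.indexOf(end, pos);
        return stop == -1 ? null : page.substring(pos, stop).trim();
    }

    public static void main(String[] args) {
        String html = "Cuisine: Any <B> KILLER SHRIMP </FONT></B><BR><B> "
                    + "523 Washington, Marina del Rey <BR>";
        // Approximation of rule 3: skip to <BR>, then <B>, extract the address.
        System.out.println(apply(html, new String[]{"<BR>", "<B>"}, "<BR>"));
    }
}

Run against the fragment of figure 3.1c, this prints the address "523 Washington, Marina del Rey".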

3.2 Whisk Extraction Rules

This representation of a rule uses regular expression patterns that are able to identify extracts and areas of a given source, with delimiters within the expressions to find precise points in the source [7, 8]. Armed with these components, the rule can be applied to any source and return the required information.

Capitol Hill- 1 br twnhme. D/W W/D. pkg incl $675. 3BR upper flr no gar. $995. (206) <br>

Figure 3.2a Target source containing housing information

Pattern:: * (Digit) 'BR' * '$' (Number)

Figure 3.2b Extraction rule for number of bedrooms and prices

By looking at figure 3.2b and its application to the target source in figure 3.2a, it is possible to see how Whisk extraction patterns operate. The pattern reads as a regular expression, initially using the wildcard * to denote any character. The token within the brackets, Digit, matches a single digit in the source; Digit is a class built into the Whisk representation of the extraction pattern. The next part of the expression looks for the appearance of 'BR' in the source; the quotes denote a literal. The expression then matches any character until it reaches another literal, this time a '$'. Immediately following will be a Number token, also a built-in class. The information to be extracted is held inside the brackets; in the case of figure 3.2b, this is a Digit and a Number.

This relates to the unique way in which Whisk is able to extract information using predefined semantic classes. A semantic class can be created containing variations of a term, enabling the expression to check different terms from the source against those belonging to the semantic class. For example:

Bdrm = (brs|br|bds|bd|bedrooms|bedroom|bed)

Figure 3.2c Semantic class for Bdrm

The predefined class from figure 3.2c can then be used with the extraction pattern:

Pattern:: * (Nghbr) (Digit) Bdrm * '$' (Number)

Figure 3.2d Extraction rule using semantic classes Nghbr and Bdrm

This terminology is very useful when dealing with differing terms. It is able to handle the variations and extract information using simple notations, similar to the way wrappers look for delimiters in their input source. Semantic classes are also useful when creating generic rules for multiple sources, using the semantic classes to represent variations in terms between sources.
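Since Whisk patterns read as regular expressions, the rule in figure 3.2b can be approximated with a standard java.util.regex pattern. This is an illustrative translation, not Whisk's own matcher, and the exact expression is an assumption.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhiskStylePattern {
    public static void main(String[] args) {
        String source = "Capitol Hill- 1 br twnhme. D/W W/D. pkg incl $675.";
        // Rough equivalent of: * (Digit) 'BR' * '$' (Number)
        // Group 1 captures the bedroom count, group 2 the price.
        Pattern p = Pattern.compile("(\\d)\\s*br.*?\\$\\s*(\\d+)",
                                    Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(source);
        if (m.find()) {
            System.out.println("Bedrooms: " + m.group(1));  // prints 1
            System.out.println("Price: " + m.group(2));     // prints 675
        }
    }
}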

3.3 Information Extraction Patterns from Tabular Web Pages

Due to the uniform and regular design of a table in HTML, extraction patterns created to match its structure can be extremely accurate. Each row of a table follows a set pattern, which is in most cases identical to the rows occurring before and after it. Using this theory, Gao et al set out to build a language for expressing patterns, focusing its design on HTML documents with a tabular structure. The part of the page containing the information to be extracted is formatted into a sequence of rows, where each row contains information regarding a single entity; the extraction pattern must therefore match each row. Each row contains a sequence of HTML tags and strings between those tags, where the strings are in most cases the information to be extracted.

The pattern language [9] contains one basic unit: the abstract token. A pattern contains a sequence of these abstract tokens, where each abstract token is a generalisation of tokens in the page. This generalisation falls into three main categories: a generalised token, an optional token, or a disjunctive token. A generalised token is either a text token, which is a knowledge unit or piece of information, or a tag token, which is an HTML tag forming the row structure. An optional token is a generalised token that may or may not appear in the row, and a disjunctive token is used to match more complex elements that cannot be expressed in a single token and require a sequence of tokens.

Pattern ::= [AbstractToken,AbstractToken,...,AbstractToken]
AbstractToken ::= GeneralisedToken OR opt(GeneralisedToken) OR alt(GeneralisedToken,...,GeneralisedToken)
GeneralisedToken ::= Head(Parameter,Parameter,Parameter)

Figure 3.3a Syntax used by the pattern language

The generalised token, as shown in figure 3.3a, uses the term head to refer to the two types it can be: tag or text. The three parameters within the brackets are unique to each type. Looking firstly at the tag token, the first parameter refers to the level, or role, the tag plays; this can be one of five levels and refers to how the tag affects the page structure. The second parameter refers to whether the tag is an opening or closing tag, and the third parameter refers to the string content of the tag. The format of the tag token can be seen in figure 3.3b.

TagToken ::= tag(level,open/close,tag_string)
level ::= page | para | line | word | other
open/close ::= o | c
tag_string ::= character string

Figure 3.3b Syntax of the tag tokens

The text token also uses the generalised token representation, but with different values for the head and the parameters within the brackets. The head value is altered to text; the first parameter denotes the length of the text; the second parameter is a sequence of character category symbols denoting a combination of uppercase letters, lowercase letters, digits, punctuation or characters; and the third parameter contains the text string itself. The format of the text token can be seen in figure 3.3c.

TextToken ::= text(length,format,text_string)
length ::= number
format ::= CharCategory*
CharCategory ::= C | l | n | p | character
text_string ::= character string

Figure 3.3c Syntax of the text tokens

Another notion introduced in the representation is the use of wildcards. Within a generalised token, the word any denotes a wildcard. This can be very useful when certain parameters are not essential, or may differ between rows, even though the general tag or format is the same. For example, figure 3.3d demonstrates a generalised tag token using wildcards.

tag(line,any,any)

Figure 3.3d Tag token using wildcards

The tag token from figure 3.3d can be used on any HTML tag used to create a new line, such as <TR>, </TR> or <BR>.

This representation scheme is extremely useful when handling data that is stored in a tabular format. The use of patterns for every row containing data allows a set routine for information extraction, provided the HTML format is roughly the same throughout the rows. The inclusion of optional tags is helpful when dealing with row variations, and the use of disjunctive tags helps to overcome complicated pieces of HTML code that could cause problems.

4. Wrapper Induction

Systems have been created to automatically generate wrappers and extraction patterns. This chapter will focus on these systems and detail which wrappers and patterns they are capable of generating. In order to perform wrapper induction it is essential for a machine learning technique to be implemented. The implementation must be trained to induce wrappers based on examples, and then assess the wrappers by their accuracy over extracted information. The assessment of wrappers contributes to retraining the system and improving the accuracy of the produced wrappers.

4.1 WIEN

WIEN is a wrapper induction environment that allows users to view, in a standard web browser, an example web page from which information is to be extracted. The user is able to click on the web page and highlight the segments of data that are required for extraction. The system then learns from the highlighted data and generates a wrapper to extract the highlighted information. The system uses HLRT wrappers, as mentioned previously; according to Kushmerick [3], this wrapper class is the most effective on semi-structured web sources containing tabular information to be extracted. WIEN is an implementation where the user plays the role of the oracle (labelling the page), rather than using an automated oracle process for the task. Once the system has learned the wrapper, it applies it to all the information within the head and tail delimiters and displays what would be extracted. If there are any faults, the user is able to go back and rectify the problems, and the system re-trains the wrapper.

Figure 4.1 Screen shot of WIEN running

4.2 STALKER

STALKER [6] uses a set of training examples to generate extraction rules using an embedded catalog formalism [7]. It begins by creating a simple rule which covers as many examples as possible; it then creates a new rule for the remaining examples. This is where the influence of rule learning theory is applied. STALKER uses a sequential covering algorithm [#]: it tries to find a perfect disjunct rule, and continues to attempt to find one as long as some examples remain to be classified. It takes an initial set of candidates and selects and refines the best candidate until the perfect disjunct is found. This is the disjunct that accepts the largest number of positive examples; should there be a tie between two or more disjuncts, the disjunct chosen is the one with the fewest negative examples. STALKER can be used to extract information from web pages that are tabular and contain lists and tuples of information that are hierarchical [5], essentially any web pages that follow the embedded catalog formalism.

Figure 4.2 Structure of Stalker [6]

4.3 Soft Mealy

The methodology of Softmealy [9] is to quickly wrap a wide range of semi-structured web pages containing tuples with missing attributes, multiple attribute values, variant attribute permutations and typing mistakes or errors, using the wrapper representation format described in chapter 2. It is therefore able to handle diverse structure patterns. The generalisation rule used is similar to WIEN's: it induces contextual rules based on training examples of tuple extraction, comparable to WIEN's use of user-side training by highlighting the desired information to be learnt. The FST matches the context tokens of separators with the contextual rules to find the state transitions. However, it is possible for the contextual rules not to cover one another, in which case the FST becomes non-deterministic. To avoid this problem a rule priority policy is used; this policy prefers longer rules, or those with more ground tokens.

4.4 Whisk

Using training instances, Whisk [8] is able to automatically induce rules. The training instances must be hand-tagged, as Whisk uses a supervised learning algorithm. The algorithm goes through several iterations to produce rules. In order to create the hand-tagged training instances, the algorithm begins with a reservoir of untagged instances and an empty set of tagged instances. During each iteration, a set of untagged instances is selected from the reservoir and presented to the user for annotation. The user then adds a tag for every case frame to be extracted from the instance; this corresponds to the attribute of the tuple to extract, as mentioned in the previous section on Whisk extraction patterns. The tags are designed to aid the creation of the rules and test the performance of certain rules. We can therefore say that an instance is covered by a rule if the rule can be successfully applied to the instance; if the extracted phrases exactly match a tag, then the extraction is correct. Figure 4.4a and figure 4.4b show the effect of tagging on a piece of HTML code.

Capitol Hill 1 br twnhme fplc D/W W/D. Undrgrnd pkg incl $675. 3 BR, upper flr of turn of ctry HOME. Incl gar. grt N. Hill loc $995. (206) <br> <I> <font size=2> (This ad last ran on 08/03/97.) </font> </I> <hr>

Figure 4.4a Untagged HTML code

Capitol Hill 1 br twnhme fplc D/W W/D. Undrgrnd pkg incl $675. 3 BR, upper flr of turn of ctry HOME. Incl gar. grt N. Hill loc $995. (206) <br> <I> <font size=2> (This ad last ran on 08/03/97.) </font> </I> <hr>
@S Rental {Neighbourhood Capitol Hill} {Bedrooms 1} {Price 675}
@S Rental {Neighbourhood Capitol Hill} {Bedrooms 3} {Price 995}

Figure 4.4b Tagged HTML code

The rule induction process begins by creating a seed instance and then finding a most general rule which covers this instance. The general rule is then extended one term at a time to make it more specific; this makes use of a top-down induction process. However, this theory is ineffective for Whisk and its application of extraction rules, as the first rule must be able to extract some piece of information, and the most general extraction rule Whisk could create is shown in figure 4.4c.

Empty rule: *(*)*(*)*(*)*

Figure 4.4c Most general extraction rule

The most general rule would not be able to extract any information, as it contains too many wildcard operators. To overcome this, Whisk must attempt to anchor an extraction by considering terms added just within the extraction boundaries; these would be the closest terms relating to those areas. These are then added to the rule until the seed instance is covered and a correct extraction can be made. Terms can then be added to a proposed rule by considering the addition of each term and testing the performance of each proposed extension on the training set. Whisk uses a hill climbing methodology in its algorithm. This implies that if it makes a mistake or chooses a wrong term it does not backtrack; instead, it will add terms until the rule performs reliably.

5. Information Extraction

5.1 Armadillo

Developed by the Web Intelligence group at the computer science department of the University of Sheffield, Armadillo [10] is a system designed to semantically annotate information from large data sources. It works by extracting information from different sources and then integrating the retrieved knowledge into a repository. It identifies possible annotations based on an existing lexicon, and then identifies other annotations not provided by the lexicon based on the context in which the known annotations were identified. Any new annotations must be confirmed before they are added; following this, all the annotations are integrated and stored in a database. The system expands and adds to the initial lexicon by finding similarities in the repository and learning to recognise them. It also makes use of redundancy to bootstrap other recognisers that, when generalised, will retrieve other pieces of information. By using information from previously wrapped sources to provide an automatic annotation of examples, it is able to induce wrappers. Armadillo works like a glass box, where users are able to check and modify the output and the strategy. If a piece of information is missed by the system it can be added by the user, and the modules of the system are then re-run. The system has the ability to retrieve information from multiple sources using similar paradigms.

5.2 Amilcare

Amilcare [11] is an adaptive information extraction tool able to help with document annotation. By using machine learning it is able to induce rules capable of extracting information. To do this, training examples annotated with XML tags are used to generalise the rules. Amilcare can run in three different modes: training mode, test mode, and production mode.

5.2.1 Training mode

The role of this mode is to induce rules. Amilcare performs this operation by taking an ontology (a scenario) and an annotated corpus (training examples) as input. Rules capable of reproducing the annotations on a corpus similar to the input corpus are then induced and output.

5.2.2 Test mode

In order to test the rules, an unseen tagged corpus is used. Amilcare removes the tagging that it contains and then re-annotates the corpus using the rules. It then compares the original corpus to the re-annotated corpus, and outputs accuracy statistics measuring the precision and recall of the re-annotation. It is then possible to retrain the rules to improve the levels of precision and recall.

5.2.3 Production mode

The final mode is used by providing documents as input to Amilcare for them to be annotated. Once the annotated documents are returned, the user of the system is able to check the annotation; corrections can be made and the rules retrained accordingly.

5.3 Regular Expressions

Information extraction can be carried out on a source using the java.util.regex library package from the Java API [12]. Using this package, regular expressions can be written and matched against text. In order to create a regular expression in Java, a string representation of the regular expression must first be created. This is then compiled into a Pattern. A Matcher is then created from the pattern, taking as a parameter the text against which the regular expression will be matched.

Pattern p = Pattern.compile("a*b");
Matcher m = p.matcher("aaaaab");

Figure 5.3a Java source code to create a pattern matcher

The matcher can then be queried to indicate whether a match has taken place using m.matches(), which returns true if a match has occurred, or false if not. This is the most basic of the methods that can be performed using the matcher. To use regular expressions for information extraction, the use of groups is essential. This involves capturing a group using parentheses.

Pattern p = Pattern.compile(".*name:\\s([a-zA-Z]+).*");

Figure 5.3b Java source code to capture information using a group

The pattern from figure 5.3b looks for "name:" followed by a white space in the text, and then captures the word directly following it. The pattern could be applied to the matcher in figure 5.3c.

Matcher m = p.matcher("real name: John");

Figure 5.3c Java source code to apply a pattern to a matcher

Once a successful match has been found, the name string can be extracted using m.group(1). Multiple groups are supported, and the number used as a parameter to m.group() relates to the order in which the group appears in the regular expression. The only problem when using regular expressions for information extraction in this way is that the expression must match the whole text that is to be searched, since m.matches() requires a full match. When dealing with arbitrarily large web pages containing a large amount of HTML source, it can be difficult and time consuming to create a regular expression to match the entire content.
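Putting figures 5.3a to 5.3c together, a complete, runnable example of group-based extraction looks as follows (a minimal sketch; the class name is arbitrary):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtraction {
    public static void main(String[] args) {
        // The leading and trailing .* are needed because matches()
        // requires the pattern to cover the entire input string.
        Pattern p = Pattern.compile(".*name:\\s([a-zA-Z]+).*");
        Matcher m = p.matcher("real name: John");
        if (m.matches()) {
            System.out.println(m.group(1));  // prints "John"
        }
    }
}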

6. Requirements and Analysis

6.1 Motivation

Looking for the latest available house music records is a repetitive task, and it can take a long time to form a good picture of which records have recently been released. Should a user purchase their records from four different web sites, all of which stock slightly different labels, the process of creating a list of all the latest releases would take a long time. The user would be required to access the first web site and browse the available stock, listening to the various audio samples and deciding on the appropriate records to buy. However, a certain record may appear on several other web sites, and at a cheaper price, without the user knowing. Another possible scenario occurs when the user is browsing a particular web site and comes across a record which does not have an image of the sleeve. Without recognising the name, the user purchases the record, only to discover that they already own a copy from a previous order.

Figure 6.1a Screen shot of

Figure 6.1b Screen shot of

Figure 6.1c Screen shot of

Figure 6.1d Screen shot of

Checking the four sites from figures 6.1a to 6.1d [13-16] and their extensive lists of records could take a very long time. Therefore, by replicating this action of compiling a list of the currently available records, the proposed system would be able to store the details of each record in one database and allow users to browse its content. By providing an image of the sleeve and an audio sample of each record, the site would provide media content to the end user. This would be a very useful tool for any house music record enthusiast or collector.

6.2 Objectives

The main objective of this project is to implement an existing wrapper or extraction pattern technology to extract information from semi-structured information sources on the World Wide Web. The implementation would allow wrappers to be constructed by hand and then run on the system to extract information from their related sources. The creation of a generic wrapper processing system would allow wrappers to be written for other sources, or an existing wrapper to be modified when a site's structure changes, without affecting the processing system. Another objective is to present the extracted information in a useful way to the end user, so they are able to see the system working and make use of its functionality. Although this is not as important as the extraction of information, it is still an important aspect of the project to focus on.

6.3 System Structure

[Figure 6.3: diagram of the system components: Wrappers, Wrapper Processing System, Database, WWW and Presentation]

Figure 6.3 The structure of the system

6.3.1 Wrappers

A collection of wrappers will be used by the wrapper processing system to extract information from the World Wide Web. Each wrapper will be designed around the same rigid structure and will be processed in the same manner. It will contain the necessary information for processing, including delimiters and the web site URL. The collection of wrappers can be added to at any time, and the wrappers contained within the collection can be edited to alter the delimiters they use. A sketch of what such a wrapper might look like is given at the end of this section.

6.3.2 Wrapper Processing System

The system will take as input the collection of wrappers available, and use the World Wide Web to process them. The system will be designed in such a way that any wrapper designed around the structure set out for all wrappers can be processed. The system will run each wrapper in turn and extract the information from the corresponding web source. The information will then be passed to a database for storage.

6.3.3 Presentation

The presentation element from figure 6.3 relates to the displaying of data from the database to the end user. The information must be displayed neatly and precisely and contain all the required elements. The presentation element will be discussed in greater detail in chapter eight, relating to implementation and testing.
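As a rough indication of what the skeleton wrapper of section 6.3.1 might look like, the hypothetical class below holds the source URL and the HLRT delimiters that the processing system would consume. The field and class names are invented for illustration and do not reflect the final design, which is covered in chapter seven.

// Hypothetical skeleton wrapper: each concrete wrapper supplies the URL
// of its source plus the HLRT delimiters the processing system needs.
public class SkeletonWrapper {

    private final String sourceUrl;           // page the wrapper targets
    private final String head;                // start of the extraction region
    private final String tail;                // end of the extraction region
    private final String[][] delimiterPairs;  // one (left, right) pair per attribute

    public SkeletonWrapper(String sourceUrl, String head, String tail,
                           String[][] delimiterPairs) {
        this.sourceUrl = sourceUrl;
        this.head = head;
        this.tail = tail;
        this.delimiterPairs = delimiterPairs;
    }

    public String getSourceUrl()          { return sourceUrl; }
    public String getHead()               { return head; }
    public String getTail()               { return tail; }
    public String[][] getDelimiterPairs() { return delimiterPairs; }
}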

6.4 System Analysis

6.4.1 House Record Web Sites

There are many house record web sites based in the United Kingdom alone. They provide a service to the customer by offering up-to-date records, with the ability to preview each record before purchasing. This is done through the display of the record details relevant to the user, such as artist, title, label, price, sleeve image and an audio sample. The latter two are the key elements a user is interested in when browsing stocks of records. The ability to listen to a record before purchasing is a vital aspect of shopping for records on the World Wide Web. The format of house record web sites varies from site to site; some sites offer an image of the record sleeve, some do not. However, all major sources for house music records offer the ability to listen to a preview of a record before buying, because this influences the sale of records and indicates the professional level of the company to the consumer.

Figure 6.4.1a Screen shot of

Figure 6.4.1a shows a typical house record web site. It contains a list of the currently available records in the house genre. Each record is contained in its own row and has an image of the sleeve, the artist, title, label, price, and a link to an audio sample of the record.

Figure 6.4.1b Screen shot of

Many house music record web sites have no sleeve images [17]; instead, the records available are displayed in simple tables containing the remaining relevant information. There is a major correlation between many house record web sites in terms of their format. They all share the same tabular format, both because it is easier for the consumer to browse stock and navigate, and because all the stock is stored in databases, enabling a row to be generated for every record.

The general format of a house music record web site can be compared to the embedded catalog formalism mentioned in chapter three. This formalism is extremely representative of the style of presentation used throughout each site that has been researched. One page containing the latest house records can be thought of as a large list of house records. Each record is the equivalent of a tuple, where the tuple contains the artist, the title, the label, the price, and an embedded list of audio samples. Figure 6.4.1c shows this formalism.

List(House Records)
  Tuple(House Record)
    artist
    title
    label
    price
    List(audio clips)
      audio clip

Figure 6.4.1c Embedded Catalog Formalism of a house record web site

All the sites researched for this project contain pages that can be accessed easily and contain information regarding the same genres and sub-genres. However, over the research period of this project several have altered their format, which would cause hard-coded wrappers to fail due to the syntax alterations.

6.4.2 Wrapper Technology to be Implemented

The initial theory for the implementation of a wrapper technology was to make use of the embedded catalog formalism and extraction rules. Each formalism would relate to a web page containing a list of records, as shown previously in figure 6.4.1c. However, following further analysis of the extraction rules and their implementation, it was found that their structure was not best suited to all the web sites available. Therefore the HLRT wrapper class, as described in chapter two, was chosen for implementation. Following the analysis of the various house record web sites, the HLRT wrapper class best suited their rigid tabular nature; analysis of the HTML format of the many house record web sites showed that an HLRT wrapper could easily be applied, using distinct tags to separate the attributes of records and leading to the creation of an efficient and precise wrapper.

6.5 Requirements

6.5.1 Functional Requirements

The following functional requirements describe the features the components of the final system must provide:

Skeleton Wrapper
- Provide a rigid and logical format for the wrapper.
- Provide details of how to store the extracted information.
- Format complies with the HLRT wrapper class.
- Provide methods for manipulating the extracted content.

Wrapper Processing System
- Able to process a collection of wrappers.
- Will store all extracted content in a secure database.
- Process any wrapper which complies with the skeleton wrapper design.
- Able to alter extracted content into one format for consistency.
- Process wrappers using an implementation of the HLRT wrapper algorithm.
- Extract audio file URLs from large segments of extracted content.
- Alter extracted content to an SQL-safe format.

Presentation of the extracted information
- Display the extracted content to the user.
- Allow the user to search for a given record.
- Order records by their attributes.
- Offer media content associated with each record (sleeve image, audio sample).
- Allow individual records to be viewed.
- Offer links to the sites from which the records were extracted.
- Offer comparisons between records.
- Automatically update the database with the latest records.
- Search for a record by any associated attribute (artist, title, label).

6.5.2 Non-functional Requirements

The following non-functional requirements describe restrictions imposed on the final system:

- Portability: The system will be portable and can run on multiple platforms, providing a Java environment is present.
- Reliability: No problems or bugs will exist in the system that cause it to be unusable.
- Efficiency: Running the collection of wrappers will not change the performance of the system should any of the wrappers change due to alterations in site content.

- Usability: Creating new wrappers for the system will be a straightforward process.
- Scalability: The system will be able to extract information from any source, no matter how large or small the target source is.
- Security: The extracted information will be stored securely and will be inaccessible to hostile users.

6.5.3 Priority Matrix

The priority matrix takes the previously mentioned functional requirements and compiles them into a matrix. Each requirement is given a priority rating reflecting how important it is to the project system.

E = Essential, N = Necessary, D = Desirable, O = Optional

Skeleton Wrapper
  E  Provide a rigid and logical format for the wrapper.
  N  Provide details of how to store the extracted information.
  E  Format complies with the HLRT wrapper class.
  N  Provide any needed methods for manipulating the extracted content.

Wrapper Processing System
  N  Able to process a collection of wrappers.
  N  Will store all extracted content in a secure database.
  E  Process any wrapper which complies with the skeleton wrapper design.
  D  Able to alter extracted content into one format for consistency.
  E  Process wrappers using an implementation of the HLRT wrapper algorithm.
  D  Extract audio file URLs from large segments of extracted content.

Presentation of the extracted information
  E  Display the extracted content to the user.
  O  Allow the user to search for a given record.
  D  Order records by their attributes.
  D  Offer media content associated with each record.
  O  Allow individual records to be viewed.
  N  Offer links to the sites from which the records were extracted.
  O  Offer comparisons between records.
  O  Automatically update the database with the latest records.
  O  Search for a record using any attribute.

6.6 Evaluation Methodology

Evaluating the final system is an essential part of the project. There are two primary strategies of evaluation, and one strategy classed as secondary. The primary methods are more reliable and dependable, and focus on actual readings and tested solutions. The secondary strategy is more concerned with trial and error, and a less rigid format of evaluation.

6.6.1 Empirical Evaluation (primary)

According to the Machine Learning [18] and Text Processing [19, 20] literature, it is possible to assess information extraction systems using statistical measures. The efficiency of a system can be calculated from its precision and recall.

                Relevant   Non-relevant   Total
Retrieved          A            B          A+B
Not retrieved      C            D          C+D
Total             A+C          B+D      A+B+C+D

The following formulae are used for the evaluation of information retrieval systems, and their application is relevant to the information extraction nature of the project system. Recall deals with the proportion of relevant information that is retrieved, and precision with the proportion of retrieved information that is relevant:

Recall = A / (A+C) = Retrieved and Relevant / All Relevant

Precision = A / (A+B) = Retrieved and Relevant / All Retrieved

This kind of evaluation refers to the use of experimentation and observation of results. It has been performed as a method of evaluation on the wrapper induction systems mentioned so far. Therefore empirical evaluation will be used as a method for measuring the efficiency of the completed final system. The sketch below shows the calculation in code.
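As a worked illustration of the two measures (not part of the original methodology), the calculations reduce to a few lines of Java, using the A, B and C counts from the contingency table above:

// Recall = A / (A + C), Precision = A / (A + B); the counts are invented.
public class Metrics {
    static double recall(int a, int c)    { return (double) a / (a + c); }
    static double precision(int a, int b) { return (double) a / (a + b); }

    public static void main(String[] args) {
        // e.g. 90 relevant records extracted, 10 relevant records missed,
        // 30 irrelevant strings extracted (illustrative numbers only)
        System.out.println("Recall    = " + recall(90, 10));    // 0.9
        System.out.println("Precision = " + precision(90, 30)); // 0.75
    }
}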

6.6.2 Test Case Evaluation (primary)

In order to fully test and evaluate the functionality of the wrapper processing system, it is essential to create and use test cases. A single test case refers to the use of a web site for which a wrapper is written specifically to be processed on that site. First, the information to be extracted from the web site will be highlighted to show exactly what should be wrapped. The second stage will involve processing the wrapper using the wrapper processing system and then analysing the results by comparing the information intended for extraction with the actual information retrieved. It is also important to note that although the system will be extracting information from within a desired section of the source (due to the head and tail notion of the wrapper), it will be evaluated according to the success of extraction of the desired attributes.

6.6.3 Alpha Beta Testing (secondary)

User evaluation of the system is a concern when assessing the presentation of the extracted information to an end user. By producing an alpha test version of the proposed presentation unit, the system can be assessed for faults, and comments can be made regarding possible improvements and modifications. The beta test version will then demonstrate a rectified version of the system to the end user, and again encourage any criticisms to be made.

7. System Design

7.1 Choice of technologies

The wrapper processing system and the wrappers themselves are to be written in Java [12]. The object-oriented nature of Java enables the use of specialist packages to access the World Wide Web in order to extract information. The choice of data medium for storage was difficult; following research, MySQL [21] was chosen. XML [22] and RDF [23] were considered, and RDF appeared useful due to its ability to create relations between the elements it stores. MySQL was chosen because of its connectivity with Java, and because the presentation unit requires sorting of the extracted data. The presentation unit of the system is to be written in JSP [24] (Java Server Pages), although both ASP [25] (Active Server Pages) and PHP [26] were serious contenders in this decision.

7.2 System Structure

Recall the system structure diagram from the requirements and analysis chapter (figure 6.3). Although that diagram shows exactly how the system will be structured, the system is intended to have a 3-tier design encapsulating a client tier, a business tier, and a data tier.

[Presentation, Database, Wrapper Processing System, Wrappers, arranged across the client, business and data tiers]

Figure 7.2 3-Tier architecture of the proposed system

Figure 7.2 depicts how the system would be structured. This format is taken from e-commerce literature [27] and is suitable for the system. The presentation unit will simply contain JSP files able to connect to the database of extracted information and create dynamic content using the information retrieved.

7.2.1 Skeleton Wrapper

Several wrappers will be created and added to the collection used by the processing system. In order for the wrappers to adhere to the same structure, a skeleton wrapper must be produced. Based on the HLRT algorithm there are several parameters it must include: the head, the tail, and tuples of left and right delimiters. The number of tuples depends on the amount of information to be extracted.

Figure 7.2.1a Screenshot displaying information intended for extraction

The highlighted areas in figure 7.2.1a [28] depict the typical attributes to be extracted. These components would form the tuples of the HLRT wrapper, and the algorithm would have its parameters set according to the delimiters surrounding those pieces of information. The skeleton wrapper must therefore take three parameters: a head, a tail, and a list of delimiters, where one delimiter contains the left and right parameters for a given tuple attribute. It will then take the form shown in figure 7.2.1b.

Wrapper                             1 --- *   Delimiter
  setHead(String h)                             setLeft(String l)
  setTail(String t)                             setRight(String r)
  setDelimiters(Delimiter[] d)                  getLeft()
  getHead()                                     getRight()
  getTail()
  getDelimiters()

Figure 7.2.1b Class structure for the skeleton wrapper
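In Java, the class structure of figure 7.2.1b might be realised roughly as follows. This is a sketch derived from the figure; the convenience constructor on Delimiter is an assumption (it matches the wrapper-construction code shown in chapter nine), and in practice each class would live in its own source file.

// Sketch of the skeleton wrapper classes from figure 7.2.1b.
class Delimiter {
    private String left, right;
    Delimiter(String l, String r) { left = l; right = r; }  // convenience (assumed)
    void setLeft(String l)  { left = l; }
    void setRight(String r) { right = r; }
    String getLeft()  { return left; }
    String getRight() { return right; }
}

class Wrapper {
    private String head, tail;
    private Delimiter[] delimiters;   // one (left, right) pair per tuple attribute
    void setHead(String h) { head = h; }
    void setTail(String t) { tail = t; }
    void setDelimiters(Delimiter[] d) { delimiters = d; }
    String getHead() { return head; }
    String getTail() { return tail; }
    Delimiter[] getDelimiters() { return delimiters; }
}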

7.2.2 Wrapper Processing System

The wrapper processing system has the job of processing the wrappers fed into it and extracting the required information from the designated source; it must then output the information to the database for later use. The first mechanism that must be looked at is the HLRT wrapper class algorithm. The wrappers to be fed in do not contain the algorithm for processing; this will instead be located in the processing system.

[Input Wrapper -> Load Parameters (and, in parallel, Extract HTML) -> Extract Content -> Output Content]

Figure 7.2.2a Flow diagram of the wrapper processing system

Figure 7.2.2a shows how the wrapper processing system would operate. The wrapper is input into the system and its parameters are loaded, while at the same time the system extracts the HTML from the web site associated with the wrapper. Following this, the content is extracted from the HTML according to the parameters from the wrapper, and the content is then output.

ContentGrabber: getHtml(String URL), grabLabels(String URL, Wrapper w)
TupleGrabber: getContent(String h, Wrapper w), getLabel(String h, String l, String r), afterHead(String htm, String h), beforeTail(String htm, String t)
TailManip: leftBeforeTail(String l, String s)
NameHandler: handleName(String n), rearrangeName(String n)
SoundHandler: getSound(String s), filterSound(String s)
TidyTitle: tidy(String title)
UpdateSystem: update()
ResetTables: reset()
Source wrappers, each with updateDB(): WrapperBlackMarket, WrapperCoolWax, WrapperJuno, WrapperTuneInn, WrapperVinylAddiction

Figure 7.2.2b Class diagram of the wrapper processing system

The class diagram in figure 7.2.2b illustrates the structure the classes will take [29]. The content grabber will contain methods for extracting the HTML source into a string given the URL. The tuple grabber will also contain a method which takes a wrapper and the HTML source string as parameters. It will then be able to run the wrapper over the string using the implemented HLRT wrapper class algorithm and extract the required information. The content grabber will be responsible for instigating this method, and will then invoke the method found in the store content class, which writes the extracted information to a database or some other data storage medium. The data stored must be of a suitable format to avoid SQL exceptions [21], so a sound handler and a name handler will be used.

The tuple grabber class contains the HLRT wrapper class algorithm. To recap, it is as follows:

Given wrapper (h, t, l1, r1, ..., lk, rk) and page P
  Skip past the next occurrence of h in P
  While the next occurrence of l1 is before t in P
    For each (li, ri) in {(l1, r1), ..., (lk, rk)}
      Extract from P the value between li and ri
  Return all extracted tuples

To transfer this algorithm from its pseudo-code format into an implementation requires the use of several packages for handling strings, and particularly for matching string patterns. In the implementation, the head and tail will be specific delimiters and resemble actual HTML content. The tuple delimiters will often be in the form of regular expressions to match patterns in the source content; in certain cases actual HTML content will be appropriate, but this can be handled as a form of regular expression. A Java sketch of the algorithm is given after figure 7.2.2c.

When the wrappers are processed by the processing system they are put through the algorithm. To illustrate how this would operate, figure 7.2.2c [17] demonstrates the manner in which information would be retrieved from a page. The wrapper would be set to extract the artist, title, label, price and the MP3 by setting its parameters accordingly. The head would be set to where the table starts, and the tail would be set to where the table ends. Each record attribute is stored in the same cell on each row. As the arrows show, the method skips certain areas which contain redundant information.

Figure 7.2.2c Illustration of the HLRT wrapper running over a table of records
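To make the pseudo code concrete, a minimal Java transcription is sketched below. It is illustrative only: it assumes the Wrapper and Delimiter classes from section 7.2.1 and uses plain string matching, whereas the real system also needs the regular-expression handling discussed in chapter eight.

import java.util.ArrayList;
import java.util.List;

// Sketch of the HLRT algorithm: returns one String[] per extracted tuple.
class HlrtExtractor {
    static List<String[]> extract(Wrapper w, String page) {
        List<String[]> tuples = new ArrayList<String[]>();
        Delimiter[] d = w.getDelimiters();

        int pos = page.indexOf(w.getHead());           // skip past the head h
        if (pos < 0) return tuples;
        pos += w.getHead().length();
        int tail = page.indexOf(w.getTail(), pos);     // stop at the tail t
        if (tail < 0) tail = page.length();

        // while the next occurrence of l1 is before t in P...
        while (true) {
            int next = page.indexOf(d[0].getLeft(), pos);
            if (next < 0 || next >= tail) break;
            String[] tuple = new String[d.length];
            for (int k = 0; k < d.length; k++) {
                int l = page.indexOf(d[k].getLeft(), pos);
                if (l < 0) return tuples;
                l += d[k].getLeft().length();
                int r = page.indexOf(d[k].getRight(), l);
                if (r < 0) return tuples;
                tuple[k] = page.substring(l, r);       // value between lk and rk
                pos = r + d[k].getRight().length();    // discard consumed content
            }
            tuples.add(tuple);
        }
        return tuples;
    }
}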

7.2.3 Presentation

Presenting the extracted information through a web browser was chosen because of the nature of the extracted content and the multi-platform versatility of web sites. The information will be presented to the user using Java Server Pages [24]. This technology requires a servlet container to host the web pages. After looking into the connectivity between JSP and a MySQL database, it became apparent that there is a method for connecting the two components: the use of a driver [30] will enable the pages to access the database securely.

Although the priority of this project is to wrap information correctly, the presentation of the extracted information is also of concern. As indicated in the functional requirements from chapter six, there are many features of the system involving the presentation of the extracted information to the user. In order to display the extracted information, each JSP file will contain the code to access the MySQL database using a connection driver. The file will then query the database, and a ResultSet object will be returned containing the results of the query. The ResultSet object will then be iterated through, and a table will be created where rows are dynamically produced for each member of the result set. A search function can be implemented using HTML form elements and posting the data entered in the form's input text field. Once the form has been submitted, the JSP creates an SQL query containing the details from the form, and a ResultSet object is returned with the records matching the query. A similar procedure will be applied for ordering the records: drop-down lists allow users to select how they would like the records to be ordered, whether by artist, title, or label. The ordering works by dynamically creating an SQL query instructing the database to return the ResultSet ordered as specified by the user. The sketch after the figure illustrates this query-and-display logic.

index.jsp, allrecords.jsp, search.jsp, coolwax.jsp, vinyladdiction.jsp, tuneinn.jsp, juno.jsp, blackmarket.jsp, viewrecord.jsp

Figure: Structure of JSP files
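The per-page logic each JSP would embed can be sketched as plain JDBC. The driver class is the standard MySQL Connector/J driver [30]; the connection URL, credentials, table and column names below are assumptions for illustration only.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch of the per-page logic: query the records table and walk the
// ResultSet, producing one output row per record.
class RecordBrowser {
    static void listRecords(String orderBy) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");   // load the MySQL driver
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/extrax", "user", "password");
        Statement st = con.createStatement();
        // orderBy is taken from a fixed drop-down list (artist/title/label),
        // never from free-form user input
        ResultSet rs = st.executeQuery(
                "SELECT artist, title, label, price FROM records ORDER BY " + orderBy);
        while (rs.next()) {                       // one table row per record
            System.out.println(rs.getString("artist") + " - "
                    + rs.getString("title") + " [" + rs.getString("label") + "]");
        }
        rs.close(); st.close(); con.close();
    }
}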

7.3 Prospective Wrapping Sources

Due to the extensive number of companies selling records on the World Wide Web, there is an abundance of sources to extract content from. The final system's use relies on the comparison of web sources from companies based in the United Kingdom, for practical reasons (currency, purchasing, etc.). From the twenty web sites that were looked at throughout the research for this project, five have been chosen for wrapping. Of these web sites, four contain images of the record sleeves along with audio samples, and the remaining one contains only audio samples. All of the sites to be wrapped contain the remaining necessary record information such as artist, title, label and price. [13] [14] [15] [16] [31]

8. Extrax

Figure 8.1 The Extrax Logo

Extrax is the name given to the completed system, capable of wrapping content from various web sites using manually created wrappers and then dynamically displaying the results to the user. It uses the three-tier architecture mentioned in the previous chapter. This chapter focuses on the implementation of the system.

8.1 Component Implementation

8.1.1 Skeleton Wrapper Implementation

The basis for the system relies on the creation of a well-structured and easy-to-implement wrapper design. Implementing the HLRT wrapper algorithm within the wrapper processing system forced the creation of a skeleton wrapper class with a number of set features. The algorithm relies on the setting of a head and a tail, to mark the start and end of the area of content over which the wrapper will be run, and a list of delimiters, which will be applied to this content area. Based on these criteria the skeleton wrapper was created according to its design laid out in chapter seven.

The wrapper class contains a string representation of the head and a string representation of the tail. It also contains a list of delimiters. A delimiter is a separate class and can be thought of as a tuple containing a string representation of the left parameter and a string representation of the right parameter. Once the algorithm has identified the section of content from which the information is to be extracted, it loads the list of delimiters and iterates through them. Once it reaches the end of the list, the whole list is reapplied to the remaining content, continuously, until no content remains. Each full cycle through the list of delimiters represents a row of the target source's table, and a tuple of extracted information.

Each wrapper used by the system uses this skeleton wrapper as a basis for construction. However, the wrappers also include added functionality unique to their designated source: as each source has its own table in the database, the code for updating that table is kept in the relevant wrapper. This allows the system to choose which wrappers are run and therefore which tables will be updated in the database.

8.1.2 Wrapper Processing System

The first priority when implementing the wrapper processing system was to create an effective class structure and allocate the system's roles to methods. This meant logically dividing the process of extracting information into different classes. The system starts by running the collection of wrappers; each wrapper in turn calls grabLabels() from ContentGrabber, passing it a string representation of the source URL to be wrapped, and the wrapper itself. This method is responsible for extracting the various labels for all the tuples from the given source. It works by calling getHtml() from ContentGrabber; this method is passed a string representation of the URL as a parameter and returns a string containing all the HTML source code from that page (a sketch of such a fetch is given at the end of this section). The source code string is then passed to getContent() in TupleGrabber, along with the wrapper.

The getContent() method is responsible for processing the HLRT algorithm by running the wrapper over the source code string. In order to do this, it first removes all content in the source string until the head delimiter from the wrapper is found. It then removes all content in the source string following the tail delimiter from the wrapper. Once the source string has been stripped down, it can be iterated over using the list of delimiters from the wrapper. As mentioned previously, the list is continually applied until no content is left.

Figure: Implementation of the HLRT wrapper algorithm

The implementation of the HLRT algorithm is shown in the figure. This shows how getLabel() is called by passing the source code string and the list of delimiters. Once these are matched, the string between the left parameter and the right parameter of a delimiter is returned. The source code string is continually decreased in size as information is extracted from it, by removing the redundant, or previously wrapped, content.
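As an illustration, getHtml() could be realised along the following lines; a minimal sketch assuming a plain HTTP fetch, with error handling reduced to the bare minimum.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Sketch of getHtml(): fetch a page and return its source as one string.
class ContentFetcher {
    static String getHtml(String address) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(address).openStream()));
        StringBuilder html = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            html.append(line).append('\n');
        }
        in.close();
        return html.toString();
    }
}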

Several of the methods used to handle the content string are capable of handling regular expressions. When implementing these methods, several problems arose with the regular expressions package java.util.regex from the Java API. The system requires a regular expression to be matched and, following this match, the content to the left or right of the match to be returned or dealt with in some way. The java.util.regex package did not offer a convenient solution for this; the pattern had to match the entire string, which, when looking for a delimiter within the source code string, was not feasible. Therefore the com.stevesoft.pat [32] package was adopted; it offers an advanced regular expression handling toolkit capable of matching a regular expression against a string and then returning the content to the left or to the right of the matched area.

When implementing the storage of the extracted information, it was decided that the best method of storing the extracted records would be a table for each source, and one separate larger table to store all of the records. Once records had been extracted from their sources and were being stored in the database, complications arose where several of their attributes contained characters capable of causing exceptions when placed in SQL queries. This led to the creation of a name handler class capable of removing the harmful characters that caused exceptions. Another problem that surfaced was the different naming conventions used on different sites. On the Juno web site the artist name is written as the last name followed by the first name, whereas the remaining sources write the artist name as the first name followed by the last name. This problem was tackled by creating a method in the name handler to alter the formatting of names to one unified form for ease of use.
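The two name-handling jobs described above could be sketched as follows. The exact character set to escape and the comma-separated form of the Juno names are assumptions; the method names echo the class diagram of figure 7.2.2b.

// Sketch of the name handler: SQL-safe escaping and name unification.
class NameHandler {
    // Escape characters that would break a quoted SQL literal.
    static String handleName(String n) {
        return n.replace("\\", "\\\\").replace("'", "\\'");
    }

    // Rearrange a last-name-first form such as "SMITH, John" into "John SMITH".
    static String rearrangeName(String n) {
        int comma = n.indexOf(',');
        if (comma < 0) return n;                  // already first-name-first
        String last = n.substring(0, comma).trim();
        String first = n.substring(comma + 1).trim();
        return first + " " + last;
    }
}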

8.1.3 Presentation

As stated in the requirements put forward in chapter six, it is important to present the extracted information to the end user, and as stated in chapter seven, the presentation of the information was to be implemented using Java Server Pages. The presentation was therefore dealt with by implementing several JSP files capable of accessing the database where the information is stored. Each page executes an SQL query and displays the returned result set to the user in a clear and readable manner, and can be accessed using any web browser. There are several options available to the user once the site has been accessed.

Browse. Users are able to browse all the records extracted from all sources, or only individual sources. The viewed records can also be filtered; this allows records to be sorted alphabetically according to their artist, title or label.

Figure 8.1.3a Screen shot of Extrax displaying records from all sources

Search. Users can search for records by entering queries into the search tool using either the artist, title, or label. The results are then displayed on the page.

Figure 8.1.3b Screen shot of the search tool in Extrax

View. When looking at a page of records, each record is highlighted and contains a link to an individual page. This page contains information about that particular record. It also offers a larger view of the sleeve image.

Figure 8.1.3c Screen shot of Extrax displaying a single record

Media Content. As shown in figure 8.1.3d, the system is capable of displaying an image of the record sleeve extracted from the source site by storing the URL of that image. The system also displays the audio samples available for a given record. There are two types of audio available, each with a distinct icon: MP3 format and Real format.

Figure 8.1.3d Screen shot of Extrax offering different audio formats

The presentation format of the system is easy to use and read. Each record contains a link to the source from which it was extracted, allowing users of the system the opportunity to purchase any records they are interested in.

9. Evaluation

9.1 Requirements Evaluation

Recall the priority matrix from chapter six. The matrix contained a list of the various requirements associated with each component of the system, and each requirement was labelled with how important it was to the system. These requirements can now be assessed to find out whether they have been met.

Skeleton Wrapper
  E  Yes  Provide a rigid and logical format for the wrapper.
  N  Yes  Provide details of how to store the extracted information.
  E  Yes  Format complies with the HLRT wrapper class.
  N  Yes  Provide any needed methods for manipulating the extracted content.

Wrapper Processing System
  N  Yes  Able to process a collection of wrappers.
  N  Yes  Will store all extracted content in a secure database.
  E  Yes  Process any wrapper which complies with the skeleton wrapper design.
  D  Yes  Able to alter extracted content into one format for consistency.
  E  Yes  Process wrappers using an implementation of the HLRT wrapper algorithm.
  D  Yes  Extract audio file URLs from large segments of extracted content.

Presentation of the extracted information
  E  Yes  Display the extracted content to the user.
  O  Yes  Allow the user to search for a given record.
  D  Yes  Order records by their attributes.
  D  Yes  Offer media content associated with each record.
  O  Yes  Allow individual records to be viewed.
  N  Yes  Offer links to the sites from which the records were extracted.
  O  No   Offer comparisons between records.
  O  No   Automatically update the database with the latest records.
  O  Yes  Search for a record using any attribute.

The system has met the majority of the requirements from the matrix. All of the essential and necessary requirements have been met, and only two optional requirements have not. The comparison between records was omitted from the implementation; to implement this requirement the system would need some method of accumulating records from the site and then referencing them later. This could be included in future work. Automatic updating of the database was not implemented due to the nature of the system: it would require a thread running constantly to update the database automatically and frequently. Instead it became clear that it would be more feasible

for the user in charge of the system to run the update process manually.

9.2 Test Case Evaluation

Following the construction of the wrapper processing system and the skeleton wrapper, it was important to use test cases that would allow assessment of how the system functions when applied to a semi-structured web page containing tabular data.

9.2.1 Test Source 1: DCS Staff Page

Figure 9.2.1a Screen shot of the DCS staff page

The Department of Computer Science staff page [33] was chosen as an ideal source to act as a test case for the system (figure 9.2.1a). The page contains details about the members of staff working in the department. Each row contains a tuple of attributes including the staff member's name, extension number, and the room they are based in. The HTML format of the page allowed a wrapper to be written to extract the relevant information. The source code is written in such a way that each piece of information to be extracted is separated by an HTML tag, or part of one. Figure 9.2.1b shows a section of the HTML source of the page. The staff member attribute is surrounded by an open link tag <a href= > and a close link tag </a>. The extension number is surrounded by an open cell tag <td> and another open cell tag <td>, and the room number is surrounded by an open cell tag <td> and a close row tag </tr>.

<hr><tr><td>
<a href= >Dr. Jon Barker</a><td>21824<td>144</tr>
<tr><td>
<a href= >Dr. Kirill Bogdanov</a><td>21847<td>114</tr>
...
<hr>

Figure 9.2.1b Section of HTML code from the DCS staff page

The wrapper is created as follows (the href URLs are omitted here, as in the figure):

w.setHead("<hr>");
w.setTail("<hr");
Delimiter[] delims = new Delimiter[3];
delims[0] = new Delimiter("<a href=[^>]*>", "</a>");
delims[1] = new Delimiter("<td>", "<t");
delims[2] = new Delimiter("d>", "</tr>");
w.setDelimiters(delims);

The skeleton wrapper is created by setting the head to a unique delimiter before the start of the main content area to be extracted; the tail is then set to where the wrapper must stop running. A list of delimiters is then created and each element of the list is set. The first delimiter represents the staff member's name and contains a regular expression to mark the left of the name, with a simple close link tag as the right. The second delimiter extracts the extension number from the page using an opening cell tag as the left and the first half of an open cell tag ("<t") as the right. The room number is then extracted using the second half of the open cell tag ("d>") as the left and a close row tag as the right.

The reason for splitting the open cell HTML tag lies in the methodology of the algorithm: once the algorithm has found the delimiter it was looking for, it discards the content up to and including that delimiter. In the case of the staff page, if the extension number were extracted using an open cell tag as the left side and another open cell tag as the right side, then the room number would have no left delimiter, as figure 9.2.1b demonstrates.

Figure 9.2.1c DCS wrapper console output

9.2.2 Test Source 2: Juno

The second test case uses a house record web site [13] to test the implementation of the skeleton wrapper and the wrapper processing system. The chosen web site has an extensive list of house music records. Like many of the house record web sites researched, the records are stored alphabetically by artist and in a tabular manner. The rows are presented to show the sleeve image, artist, title, label, price, and a list of the various MP3s available.

The wrapper for this particular source extracts the link to the sound clip and the link that displays the image. It is often the case that sounds and images are stored locally to the page, and therefore the address that corresponds to them does not include the full URL. Instead the address is a simple local address, e.g. /images/1122.jpg. This can cause problems when the full address is required. Therefore the system implements a handler able to take the URL of the site, which is treated as the base, together with the local address, and create a full URL representation. For example, given the local address of an image mentioned previously, /images/1122.jpg, and the URL of the site, the sound handler is capable of taking both of those components and creating the full URL representation of the image. A sketch of this resolution step follows the figure.

Figure 9.2.2a Screen shot of
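The resolution step can be sketched with java.net.URL, which already implements base-relative resolution; the example addresses below are illustrative only, not the system's actual code.

import java.net.URL;

// Sketch of the handler that turns a page-local address into a full URL.
class AddressResolver {
    static String resolve(String baseUrl, String localAddress) throws Exception {
        return new URL(new URL(baseUrl), localAddress).toString();
    }
}

// e.g. resolve("http://www.example-store.co.uk/lists/house.html", "/images/1122.jpg")
// returns "http://www.example-store.co.uk/images/1122.jpg".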

Figure 9.2.2b HTML source code of

Figure 9.2.2c Juno wrapper console output


More information

Lexical Analysis. Introduction

Lexical Analysis. Introduction Lexical Analysis Introduction Copyright 2015, Pedro C. Diniz, all rights reserved. Students enrolled in the Compilers class at the University of Southern California have explicit permission to make copies

More information

LESSON 1: INTRODUCTION TO COUNTING

LESSON 1: INTRODUCTION TO COUNTING LESSON 1: INTRODUCTION TO COUNTING Counting problems usually refer to problems whose question begins with How many. Some of these problems may be very simple, others quite difficult. Throughout this course

More information

A new generation of tools for SGML

A new generation of tools for SGML Article A new generation of tools for SGML R. W. Matzen Oklahoma State University Department of Computer Science EMAIL rmatzen@acm.org Exceptions are used in many standard DTDs, including HTML, because

More information

APPLICATION OF A METASYSTEM IN UNIVERSITY INFORMATION SYSTEM DEVELOPMENT

APPLICATION OF A METASYSTEM IN UNIVERSITY INFORMATION SYSTEM DEVELOPMENT APPLICATION OF A METASYSTEM IN UNIVERSITY INFORMATION SYSTEM DEVELOPMENT Petr Smolík, Tomáš Hruška Department of Computer Science and Engineering, Faculty of Computer Science and Engineering, Brno University

More information

Oracle User Productivity Kit Content Player

Oracle User Productivity Kit Content Player Oracle User Productivity Kit Content Player Oracle User Productivity Kit Content Player Copyright 1998, 2012, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks

More information

EFFICIENT ATTACKS ON HOMOPHONIC SUBSTITUTION CIPHERS

EFFICIENT ATTACKS ON HOMOPHONIC SUBSTITUTION CIPHERS EFFICIENT ATTACKS ON HOMOPHONIC SUBSTITUTION CIPHERS A Project Report Presented to The faculty of the Department of Computer Science San Jose State University In Partial Fulfillment of the Requirements

More information

Unit title: Computing: Website Design and Development (SCQF level 5)

Unit title: Computing: Website Design and Development (SCQF level 5) National Unit Specification General information Unit code: HW52 45 Superclass: CB Publication date: February 2018 Source: Scottish Qualifications Authority Version: 02 Unit purpose The purpose of this

More information

Chapter 3. Describing Syntax and Semantics

Chapter 3. Describing Syntax and Semantics Chapter 3 Describing Syntax and Semantics Chapter 3 Topics Introduction The General Problem of Describing Syntax Formal Methods of Describing Syntax Attribute Grammars Describing the Meanings of Programs:

More information

Sourcing - How to Create a Negotiation

Sourcing - How to Create a Negotiation Martin Baker Secure Source-To-Pay Sourcing - How to Create a Negotiation December 07 Contents To Create a Project... To Create a Negotiation... 5 Attachments... 7 Private File Archive... 7 Creating Lines,

More information

Gestão e Tratamento da Informação

Gestão e Tratamento da Informação Gestão e Tratamento da Informação Web Data Extraction: Automatic Wrapper Generation Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2010/2011 Outline Automatic Wrapper Generation

More information

Appendix A: Syntax Diagrams

Appendix A: Syntax Diagrams A. Syntax Diagrams A-1 Appendix A: Syntax Diagrams References: Kathleen Jensen/Niklaus Wirth: PASCAL User Manual and Report, 4th Edition. Springer, 1991. Niklaus Wirth: Compilerbau (in German). Teubner,

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

Analysing Ferret XML Reports to Estimate the Density of Copied Code

Analysing Ferret XML Reports to Estimate the Density of Copied Code Analysing Ferret XML Reports to Estimate the Density of Copied Code Pam Green Peter C.R. Lane Austen Rainer Sven-Bodo Scholz April 2010 Technical Report No. 501 School of Computer Science University of

More information

Module 3: Introduction to JAPE

Module 3: Introduction to JAPE Module 3: Introduction to JAPE The University of Sheffield, 1995-2010 This work is licenced under the Creative Commons Attribution-NonCommercial-ShareAlike Licence About this tutorial As in previous modules,

More information

DCMI Abstract Model - DRAFT Update

DCMI Abstract Model - DRAFT Update 1 of 7 9/19/2006 7:02 PM Architecture Working Group > AMDraftUpdate User UserPreferences Site Page Actions Search Title: Text: AttachFile DeletePage LikePages LocalSiteMap SpellCheck DCMI Abstract Model

More information

19 Much that I bound, I could not free; Much that I freed returned to me. Lee Wilson Dodd

19 Much that I bound, I could not free; Much that I freed returned to me. Lee Wilson Dodd 19 Much that I bound, I could not free; Much that I freed returned to me. Lee Wilson Dodd Will you walk a little faster? said a whiting to a snail, There s a porpoise close behind us, and he s treading

More information