WEDKEX - Web-based Engineering Design Knowledge EXtraction


Frank Heyen, Janik M. Hager, and Steffen Schlinger

Figure 1: A visualization showing the path the different text types take from extraction to output. Tabular data is stored immediately in the database, while natural text is further processed by two different methods before storage.

Abstract — This paper addresses the automatic retrieval of product information, such as properties and attributes, from Internet-based data sources. The extracted information is then analyzed in order to obtain a result that is comprehensible to both human users and machines. Using a set of parsers and analysis software, the web site's content is processed and prepared for the subsequent steps. Thereafter, the programs of two diploma dissertations are applied, which examine continuous and semi-structured texts, respectively, and analyze and filter the previous results. Eventually, the findings are stored in a database and exported as an RSS feed and an XML file to facilitate further use. Through this automation, getting an overview of the hardware components available on the market, especially online, becomes a lot easier.

Index Terms — Internet-based text analysis, product information retrieval

1 INTRODUCTION

When searching for alternative hardware components that could improve a system already in use, users are currently forced to search manually. At the moment, the list of available technologies that do this automatically, such as SemProM [19], is rather short. Assume you are a system vendor with plenty of clients: it would be beneficial to your business if you were always informed about new and upcoming computer components. Having this information, assembling and delivering systems containing these new parts to your customers may become more efficient. Apparently, there is a need for a service that supports the user in performing this task.
2 PROBLEM

There is an enormous amount of hardware products available on the market, with new ones appearing continuously. Keeping an overview is almost impossible and requires at least some work done regularly. One would have to watch for updates on, for example, web shops in different categories, looking through texts and tables for the information that is relevant for a certain use case. We wanted to develop an automatic system that supports the user in solving this problem of getting data and sorting through it.

Frank Heyen inf88797@stud.uni-stuttgart.de
Janik M. Hager inf88808@stud.uni-stuttgart.de
Steffen Schlinger inf85307@stud.uni-stuttgart.de

3 RELATED WORK

Our work is partially based on two diploma dissertations written in 2013 at IRIS at the University of Stuttgart. They are presented in the following two subsections.

3.1 Internet-based Text Analysis for Extraction of Product Development Knowledge using semi-structured Documents

Published under the title "Internetgestützte Textanalyse zur Extraktion von Produktentwicklungswissen anhand von semi-strukturierten Dokumenten", Fan Zou's work performs Named Entity Recognition and Classification (NERC) [5], which consists of two major tasks, namely separation and classification of units [20], [1]. For these tasks, a semi-supervised machine learning method using Conditional Random Fields (CRF) [4] was chosen. The new technique was named Product Named Entity Recognition (PNER), and five properties of product named entities were established [20]:

1. Frequent changes and updates of product names occur.
2. Various name schemes are used, and the position of components can differ.
3. Product names contain letters and numbers and may also contain words of common language, leading to possible misinterpretations.
4. Names can be abbreviated, and multiple names can be used for a single product.

5. PNE boundaries are vague compared to traditional NER, since there are fewer so-called feature words.

Therefore, instead of the standard rule-based method, supervised machine learning based on stochastic methods is used. The obvious drawback of this decision is the required manual work, namely the tagging of the training corpus. On the other hand, this topic-specific training increases the accuracy of the results. Since product names often consist of certain components, the PNEs are considered to be divisible into the following three parts [20]: the product's brand, series and type. To avoid problems in cases where some of these components have been omitted, Zou proposes a rule for valid PNEs: they have to contain one to three of the components brand, series and type (see figure 2). Based on this decomposition, the tags for the several parts are introduced as shown in table 1. After the tagging has been completed, the recognized parts are assembled into PNEs following certain rules.

Figure 2: The PNEs are divided into the three components brand, series and type of the product. [20] (page 30)

tag    meaning
B-BRA  beginning of the brand
I-BRA  continuation of the brand
B-SER  beginning of the series
I-SER  continuation of the series
B-TYP  beginning of the type
I-TYP  continuation of the type
B-PRO  beginning of the product name
I-PRO  continuation of the product name
O      the tagged word is not a PNE

Table 1: Tags used for PNER [20] (page 31) (translated)

3.1.1 Part-of-Speech Tagging

As in Fan Zou's diploma thesis [20], the Stanford Part-Of-Speech Tagger [15], [13] is used to label each word of a text with the proper part-of-speech tag.

The/DT Canon/NNP PowerShot/NNP SX260/NNP HS/NN Digital/NNP Camera/NNP is/VBZ a/DT stunningly/RB powerful/JJ point-and-shoot/NN with/IN a/DT number/NN of/IN useful/JJ features/NNS ./.

Example text after applying POS tags. [20] (page 49)

3.1.2 PNE Tagging

The PNEs are tagged in the two phases seen in figure 3.
In the first phase, the words are labeled with brand, series and type tags. In the second phase, based on those tags, consecutive words that belong to one PNE are assembled into this single PNE. All other words are marked with a capital O; they are not considered relevant anymore.

word            POS  phase 1  phase 2
The             DT   O        O
Canon           NNP  B-BRA    B-PRO
PowerShot       NNP  B-SER    I-PRO
SX260           NNP  B-TYP    I-PRO
HS              NN   I-TYP    I-PRO
Digital         NNP  B-PRO    I-PRO
Camera          NNP  I-PRO    I-PRO
is              VBZ  O        O
a               DT   O        O
stunningly      RB   O        O
powerful        JJ   O        O
point-and-shoot NN   O        O
with            IN   O        O
a               DT   O        O
number          NN   O        O
of              IN   O        O
useful          JJ   O        O
features        NNS  O        O
.               .    O        O

Example text after applying PNE tags. [20] (page 49f)

3.1.3 CRF++

The PNE tagging is performed by a C++ open-source (New BSD License) software named CRF++ [8]. It has low time and memory requirements, is available for free and is platform independent. Further details and a guide for training are contained in [20].

Figure 3: Visualization of the two phases in which the tags are applied. [20] (page 37) (translated)

3.1.4 Post-Processing

After the text has been completely tagged with part-of-speech, brand, series, type and product tags, the latter are used to concatenate all words belonging to the same PNE into one string, which is then the output of this whole process.
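The assembly of consecutively tagged words into one product name string, as described above, can be sketched roughly as follows (a simplified illustration using our own class and method names, not Zou's actual code):

```java
import java.util.ArrayList;
import java.util.List;

public class PneAssembler {
    /**
     * Concatenates consecutive words tagged B-PRO/I-PRO into product
     * name strings; words tagged O are ignored. The two arrays are
     * parallel: tags[i] is the phase-2 tag of words[i].
     */
    public static List<String> assemble(String[] words, String[] tags) {
        List<String> products = new ArrayList<>();
        StringBuilder current = null;
        for (int i = 0; i < words.length; i++) {
            if ("B-PRO".equals(tags[i])) {            // a new PNE starts here
                if (current != null) products.add(current.toString());
                current = new StringBuilder(words[i]);
            } else if ("I-PRO".equals(tags[i]) && current != null) {
                current.append(' ').append(words[i]); // the PNE continues
            } else {                                  // tag O: outside any PNE
                if (current != null) products.add(current.toString());
                current = null;
            }
        }
        if (current != null) products.add(current.toString());
        return products;
    }
}
```

For the example sentence above, this yields the single string "Canon PowerShot SX260 HS Digital Camera".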

3.1.5 Summary

The input text is first tagged with the Stanford Part-Of-Speech Tagger. Then, using CRF++, the words are labeled according to the components of a product name, which are the brand, series and type names. In a second phase, CRF++ adds tags which mark the beginning and continuation of a PNE. In the post-processing step, word sequences with those tags are recomposed into one product name string.

3.2 Internet-based Text Analysis for Extraction of Product Development Knowledge using OntoUSP: A Feasibility Study

The second diploma dissertation, published by Chen Wang under the title "Internetgestützte Textanalyse zur Extraktion von Produktentwicklungswissen mittels OntoUSP: eine Machbarkeitsanalyse" [16], focuses mainly on the extraction of product properties and their values from continuous natural text. His work is based on syntax parsing performed by the Stanford Parser [14] and a successive semantic parsing using Unsupervised Semantic Parsing (USP). Despite its name, it runs with USP instead of Ontology USP (OntoUSP), because OntoUSP is mainly an expansion of USP and uses its output. After the analysis of the USP outcome, a hierarchy of words is created, which is filtered afterward. At the end, a graph is generated out of the filtered information, with three special properties:

1. The root is either a product name or a brand name.
2. The leaves are the values of the product properties.
3. The parents of the leaves are the corresponding properties the leaves belong to.

3.2.1 Stanford Parser

We use the Stanford Parser just like Chen Wang did. A natural language parser is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as phrases) and which words are the subject or object of a verb. Probabilistic parsers use knowledge of language gained from hand-parsed sentences to try to produce the most likely analysis of new sentences.
These statistical parsers still make some mistakes, but commonly work rather well. [14]

So the Stanford Parser works out the relations between the occurring words (see figure 4), together with their positions in the phrase. Additionally, the words are POS tagged (see above) and a list of the words' morphologies is created.

Figure 4: Example tree built out of a sentence tagged by the Stanford Parser. [16] (page 13)

3.2.2 USP

As a program which uses unsupervised learning, USP generates a model out of an input which allows a certain prediction. Out of the output of the Stanford Parser, a dependency tree is created with the words as nodes and their relations as edges (see figure 5). It processes the data with the help of semantic analysis. The output of USP is semantically parsed data and a Markov Logic Network (MLN). Further details and a guide are found in [6, 7].

Figure 5: Example graph of USP for the analyzed sentence above. [16] (page 32)

3.2.3 Post-Processing

Based on the analysis of the USP data, a hierarchy of all words is created. After some filtering of this hierarchy according to certain predefined rules, the product's name, its properties and its values are read out with the help of the previously mentioned conditions of the graph data.

3.2.4 Summary

First of all, the text is tagged and analyzed by the Stanford Parser, which generates an overview of the dependencies between the words. USP continues processing the data by creating a graph out of all words and their relations. After that, the whole data is filtered, and the product properties and their corresponding values are read out. A complete visualization is depicted in figure 6.

Figure 6: Summary of the main process to extract products, properties and their values from a text, from Wang. [16] (page 12) (translated)

4 SOLUTION

4.1 Overview

The input to our concept system will be a web site. Its text gets fetched, then analyzed to extract only relevant information.
After that, some post-processing will be required to improve the

result, which will then be the output. A simple sketch of this idea is presented in figure 7.

Figure 7: A coarse scheme showing which steps are necessary to get from a web site's text to the desired output.

4.2 Retrieval of Web Site Texts

So far, we support fetching from two differently laid-out web sites: the English version of the German store hardwareversand.de [18], which uses a tabular layout, and the web shop of CCL Computers, CCLonline.com [17], which uses natural texts. The fetching is done by an unlimited number of modules, where each module represents one data source. Data sources can be anything, like web sites or APIs of shops such as Amazon. Since most software used in our implementation is limited to processing English texts, the data sources need to be in English as well.

4.3 Extraction of Product Information

Web site texts containing product information, such as descriptions in web shops, can be roughly divided into two groups based on their structure: 1) information that is structured in a table with key and value pairs, and 2) information contained in a natural text with full, grammatically correct sentences. Due to these two different kinds of texts, it is more efficient to also use different strategies depending on the particular kind.

Natural Texts

On the one hand, there is the much more common kind of data: continuous texts are widely used to describe products. The difficulty with this kind of text is to extract only the relevant data without losing important information. Through the application of various analysis methods, the text gets converted stepwise from a natural text into the desired key-value pairs, which are easier to process by computers. Further details are to be found in the implementation section. An example is shown in figure 9.

Tabular Data

On the other hand, this kind of data is the easier one to process.
We can simply take the attribute name and value and save them as they are, resulting in perfect results (assuming that the web site's content is correct). An example is shown in figure 8.

Figure 8: An example of product information structured in a table. [3]

Figure 9: An example of product information embodied in natural text. [2]

4.4 Combining the Results

The attributes are saved to the database as attribute name and value pairs. Tabular data and natural texts return, after being processed, key-value pairs, which can be directly saved to the database. Furthermore, other information can be extracted from the data source directly, like the title of the product on web sites. These values can be saved with an appropriate attribute name.

4.5 Database Layout

The database, shown in figure 10, consists of two tables: 1) one table containing all products and their primary attributes, and 2) a second table containing all attributes of all products together with their values. Since each attribute data set also has a link to the ID of the product it belongs to, finding a product's attributes can be performed with a single SQL query. Although most field names should be self-explanatory, here is a short explanation: in the product table, the fields module name and source identifier contain information about which web site's module the text comes from and what the web site's internal identifier for this specific product was. The latter is especially important in order to avoid duplicate products, by only retrieving products with identifiers not yet occurring in the database. Additionally, the data

in url provides the direct origin of the text. To have the raw data available in case of later analysis improvements, the whole web page's HTML code is stored in source code.

Figure 10: The layout of the database used for storing all processed data. The respective primary keys are underlined. Attributes are connected to their product by the product id field storing the particular product's ID.

5 IMPLEMENTATION

In the following section, some details are explained concerning the implementation of our project and the software used. A more detailed version of our scheme is shown in figure 11.

Figure 11: A schematic diagram of our system's architecture. Using a specific module for each web site, the harvester retrieves text from the Internet. If all information is structured in tables, it is directly parsed into key-value pairs and saved in the database. Otherwise, if natural text occurs, it is processed by both diploma dissertations' algorithms and later combined, before being added to the database as well. From there, product information can be exported as XML and RSS.

5.1 General Considerations

The Wedkex software is written in Java, due to the availability of both diploma dissertations and most components as Java implementations, indeed all but CRF++. By using Java, nearly all code could be organized in one project, making it easier to move functionality between classes or remove unnecessary code. For the automatic organization of dependencies, Maven [10] and the Gradle [9] plug-in for Eclipse have been used as far as possible.

5.2 Retrieval of Web Site Texts

The data retrieval is managed by the harvester. The harvester takes modules as its input. Each of these modules implements the interface moduleinterface. The functions of this interface do the specific work, like fetching a list of new products and the products themselves from web sites or APIs, doing duplicate checking and saving the products to the database.
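Such a module interface might look roughly like the following sketch (interface and method names are our own illustration; the text does not spell out the actual signatures used in Wedkex):

```java
import java.util.List;

/** One implementation per data source (web shop, API, ...). */
interface HarvesterModule {
    /** Source-internal identifiers of products found at the data source. */
    List<String> listProductIdentifiers();

    /** Fetches the raw page content (HTML) for one product. */
    String fetchSourceCode(String identifier);

    /** True if this identifier already occurs in the database. */
    boolean isDuplicate(String identifier);
}

/** The harvester simply iterates over all enabled modules. */
class Harvester {
    private final List<HarvesterModule> modules;

    Harvester(List<HarvesterModule> modules) {
        this.modules = modules;
    }

    /** Returns how many new products were fetched across all modules. */
    int run() {
        int fetched = 0;
        for (HarvesterModule m : modules) {
            for (String id : m.listProductIdentifiers()) {
                if (m.isDuplicate(id)) continue; // skip known products
                m.fetchSourceCode(id);           // would be parsed and stored
                fetched++;
            }
        }
        return fetched;
    }
}
```

A new data source would then simply be a new class implementing HarvesterModule.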
The output of each module is exactly one database entry for each new product, containing the earlier introduced data structure, including the continuous description text and tabular data, if available. New data sources can easily be added by creating a new class which implements this interface. After being enabled in the configuration file, the module should start working.

5.3 Extraction of Product Information

This section gives a small overview of the use and implementation of all software components used for extracting the needed product information from a continuous text.

5.3.1 Tabular Data

In this rather trivial and therefore ideal case, the data already has a key-value-like form, making it easy to directly create product and attribute objects that are then saved to the database. The following steps, including the combination of recognition results, can therefore be skipped, drastically reducing the execution time.

5.3.2 Integration of Stanford POS Tagger and Stanford Parser

The Stanford POS Tagger [13] and the Stanford Parser [14] are both available as Java packages; therefore, loading pre-defined models and calling the corresponding methods is, generally speaking, all that is necessary. For the Stanford Parser, a few changes were needed because it originally read in text files. In order to use it properly in our project, the input had to be adjusted so that the Stanford Parser uses the whole text as a string as input for its methods. Additionally, we created a folder which should only contain temporary files created by the Stanford Parser and by USP, so the paths for the output files had to be changed.

5.3.3 Connecting to CRF++

The simplest way to use CRF++ is to write its input to a file, run it via batch command and wait for it to finish tagging, before reading the output files generated by it. This is, except for the usage of a batch script instead of PHP, analogous to the procedure used in [20].
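Writing the input file for CRF++ can be sketched as follows. CRF++ expects a column format with one token per line (here: word and POS tag separated by a tab) and a blank line after each sentence; the helper class is our own, and the exact feature columns used by Zou are described in [20]:

```java
public class CrfInput {
    /**
     * Formats tokenized sentences for CRF++: one token per line with
     * its feature columns separated by whitespace, and a blank line
     * ending each sentence. The phase-1 output file, extended by a
     * tag column, becomes the phase-2 input in the same format.
     */
    public static String format(String[][] words, String[][] posTags) {
        StringBuilder sb = new StringBuilder();
        for (int s = 0; s < words.length; s++) {
            for (int i = 0; i < words[s].length; i++) {
                sb.append(words[s][i]).append('\t')
                  .append(posTags[s][i]).append('\n');
            }
            sb.append('\n'); // blank line terminates the sentence
        }
        return sb.toString();
    }
}
```

The resulting string is written to the input file that the batch script hands to CRF++.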
The batch script receives the file paths for the input, temporary and output files via arguments and then runs CRF++ twice: in the first phase, as explained earlier, tags for brand, series and type are added to the words in the input file, and the new data is written to a temporary file. That file is the input for the second phase, in which the product tags are added. The output of this phase is written to the output file, from where the main program reads the information and continues with the post-processing step.

5.3.4 Post-Processing (1)

Since the whole architecture of the original software has been changed to better fit our approach of combining everything into one Java project, all code concerning post-processing was concentrated into a single method.

5.3.5 Using USP

The whole USP Java package has been imported into one project package in order to be able to use it properly. The only things that had to be changed manually at some points of USP were the input and output file paths. In general, therefore, the original code has been used completely except for this one adaptive change.

5.3.6 Post-Processing (2)

We used the original software, integrated into our Java project nearly unchanged. A few changes were made to use the right input paths, because of our folder containing the temporary files of the Stanford Parser and USP. Additionally, we adjusted the output by creating a list of attributes out of the resulting outcome of the software. For this purpose, we used the previously mentioned properties of the graph, with the product properties as parents of their

proper values as leaves of the graph. This list of attributes is then further used in the combination of the two processing results.

5.4 Combining the Results

All attributes are saved to the database, so merging the results is necessary. Each attribute contains a name, a value and the associated product ID. In the case of tabular data, the attribute name and the attribute value can be saved directly to the database, creating a perfect result. In the case of natural texts, the output of our analysis methods are key-value pairs, which represent attribute name-value pairs. The quality of the output may vary in this case, depending on the input. To improve the results, further checks on them should be done. Additional information from the data source, like web sites, is added to the database if the harvesting module provides it.

5.5 Database

For the database server, the open-source software MySQL [11] has been chosen. Compared to other available database software, MySQL is one of the easiest to use, and a great advantage is the availability of the MySQL Connector/J, the official JDBC driver for MySQL [12], via Maven.

5.6 Generated Output Files

When running the main program or using the corresponding API methods, the user is able to create parametrized output of two kinds:

5.6.1 XML Output

By exporting all information stored in the database for a selected number of products to an XML file, it is easier to use the data in other systems. The XML scheme allows for one or multiple products with an arbitrary number of attributes, which themselves may have any number of values. In this way, the XML structure becomes very flexible but still retains an ordered structure that can be understood by other software.

5.6.2 RSS Output

The RSS output mainly serves as a notification to the user. It can show the latest updates that were found in the last run or earlier; the time span and product number can be limited by arguments.
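The XML export described in section 5.6 could be produced along these lines (a sketch using the JDK's built-in StAX API; element and attribute names are our own guesses, and the actual Wedkex scheme also supports multiple values per attribute, which this simplified version omits):

```java
import java.io.StringWriter;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class XmlExport {
    /** Serializes products (product name -> attribute name/value map). */
    public static String export(Map<String, Map<String, String>> products) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter w = XMLOutputFactory.newInstance()
                    .createXMLStreamWriter(out);
            w.writeStartDocument();
            w.writeStartElement("products");
            for (Map.Entry<String, Map<String, String>> p : products.entrySet()) {
                w.writeStartElement("product");
                w.writeAttribute("name", p.getKey());
                for (Map.Entry<String, String> a : p.getValue().entrySet()) {
                    w.writeStartElement("attribute");
                    w.writeAttribute("name", a.getKey());
                    w.writeCharacters(a.getValue());
                    w.writeEndElement(); // </attribute>
                }
                w.writeEndElement(); // </product>
            }
            w.writeEndElement(); // </products>
            w.writeEndDocument();
            w.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Using the StAX writer instead of string concatenation guarantees well-formed output, including escaping of special characters in attribute values.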
5.7 User Interface

So far, the user interface is limited to a command line interface, allowing the input of optional arguments and following the progress of all steps, which is visualized by the textual output.

6 EVALUATION

While fetching content from web sites and extracting the raw texts works satisfactorily, the final result depends strongly on the kind of text structure. For this reason, the two kinds are evaluated separately in the following two subsections.

6.1 Tabular Data

As already mentioned earlier, in this trivial and ideal case, not only are the required time and memory fairly low, but the resulting data is also of high quality.

6.2 Natural Texts

In this case, the quality of the output depends strongly on the input. The analysis methods work unequally well on different input texts. There are always many good result pairs, but the number of bad result pairs varies considerably, often exceeding the good ones. It is also quite hard to further process the results to improve the quality, because we cannot automatically distinguish between good and bad result pairs. For the evaluation done for each of the two diploma dissertations' software, refer to [20] and [16].

7 CONCLUSION

The problem was finding an efficient way to support a user in getting information about new hardware components available on the market. Our solution to this problem is based on two diploma dissertations dealing with the extraction of product development knowledge from Internet-based documents using either semi-structured data or continuous texts. Those dissertations themselves employ a set of parsers and analysis methods, which all needed to be integrated into our software. First of all, the web sites' content is fetched by the harvesting modules. The tabular data is evaluated immediately due to its simplicity. On the other hand, any natural text needs some further processing and examination steps.
After some preparations, tagging and filtering, the refined data is recombined into a data set containing all important information in an organized manner. This data can be accessed by either API or file output. The results' quality differs by the type of text and the methods used. As already mentioned, the tabular data leads to a nearly ideal result, whereas the quality of the natural texts' outcome suffers slightly from the many sub-steps performed. Our main accomplishment was integrating the two dissertations' programs as well as all components into a single software service. Wedkex performs all necessary steps, starting with reading in texts from web sites, followed by various analysis phases, and concluding with merging their results, storing them in a database and exporting them as an RSS feed and an XML file.

8 FUTURE WORK

Although Wedkex is already a functional system, there are multiple possibilities to extend its abilities. In the following section, some suggestions for new or augmented functions are proposed.

8.1 Support of Additional Information Sources

One of the easiest possible expansions is the addition of modules for the harvester, enabling it to handle more web sites. As aforementioned, for each web site the creation of a proper module is required for the harvester to be able to read and process the web site's content. There are already modules for the two web pages supported so far, and because of the modules' fairly similar structure, it is rather simple to add new modules for different web pages. The search for and integration of new web pages is thus a small task; the only effort is creating the module. One has to pay attention, however, that these modules are kept up to date, because modules have to be adapted to every structural change of their web page.
Furthermore, a module can become useless when the structure of the given web page no longer fits the analysis method used before. Another obvious case of losing a module's functionality is when a web page ceases to exist. In this case, the module should be deleted, because it cannot receive and process any new data. It is therefore essential to check the used web pages and their modules regularly and maintain them if necessary. Here lies another possibility for improvement, namely an automatic routine check whether the modules are still up to date, combined with some kind of exception handling.

8.2 Improvement of the Used Analysis Methods

Another option is the extension of the used recognition procedures. Adding new methods to them would also be helpful. That means that methods to analyze and process different types of texts could be developed to extend the possibilities of information extraction from web site texts.

8.3 Introduction of Categories and Filters

To get a better overview of the already analyzed products, an automatic categorization of these products should be introduced. With this method, products in the database would be assigned to one or more categories, for example by their type of hardware, like hard disk, CPU or storage. It would then be possible to search in the

database just for a specific type of hardware component. This would also simplify the further usage of the system for clients, because it would be much easier for them to find their way around and only look for interesting products with the help of these categories.

Filtering and rating products after the analysis would be another useful functionality that could be integrated into Wedkex. Products that have already been analyzed and appear in the database would be filtered and then rated with the help of their properties and property values. This method would greatly improve the overview of all products in the database and create some kind of ranking out of their ratings. Thus, it would be much easier to distinguish between two similar products that differ in performance and choose the better one. An example of the usefulness of this method could be the case of a company that wants to upgrade some parts of its servers. The used hardware components of the servers could be looked up in the database, and with the help of a ranking, suggestions for better fitting parts could be made. This functionality would be especially reliable and useful if it were used together with the previously mentioned categorization of products.

8.4 Integration of User-based Information and Notification

To keep the clients always updated, it would be possible to complement the currently used RSS feed and XML output with an e-mail-based notification service. It would send a message to clients as soon as new products have been found and analyzed by Wedkex and inform them about their properties. This service would be especially useful if it were used with the additions explained above: the categorization, filtering and rating. A resulting possible feature for the clients would then be to create a configuration of their currently owned hardware components and only get notifications if newer or better hardware that also fits their configuration is found.
Another possibility would be to use just the categorization with the notification service and receive update messages only for some specific types of hardware. With this feature, it would be a lot easier for clients to stay up to date and be informed about newer and better hardware, without having to sort through all the information themselves. In conclusion, there are many possibilities to extend Wedkex with new features, to make Wedkex applicable in many different ways and to integrate it into larger systems.

ACKNOWLEDGMENTS

We would like to thank our tutors Julian Eichhoff and Akram Chamakh for their help and support. Thanks also to Fan Zou and Chen Wang for permitting the use of their source code and providing it to us.

REFERENCES

[1] N. Chinchor and P. Robinson. MUC-7 named entity task definition. In Proceedings of the 7th Conference on Message Understanding, 1998.
[2] (image) Sample text from cclonline.com, 11/18/, State-Drives-SSDs-/Samsung-840-EVOMZ-7TE GB-SSD-SATA-6Gb/s-2-5-inch-Internal-/HDD2065/.
[3] (image) Sample text from hardwareversand.de, 11/18/, el+core+i7-4790k+box.
[4] A. McCallum and W. Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4. Association for Computational Linguistics, 2003.
[5] D. Nadeau and S. Sekine. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3-26, 2007.
[6] H. Poon. Unsupervised semantic parsing (USP) user guide.
[7] H. Poon and P. Domingos. Unsupervised ontology induction from text. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2010.
[8] (software) CRF++.
[9] (software) Gradle.
[10] (software) Maven.
[11] (software) MySQL.
[12] (software) MySQL Connector/J.
[13] (software) Stanford Log-linear Part-of-Speech Tagger.
[14] (software) Stanford Parser.
[15] K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1. Association for Computational Linguistics, 2003.
[16] C. Wang. Internetgestützte Textanalyse zur Extraktion von Produktentwicklungswissen mittels OntoUSP: eine Machbarkeitsanalyse. Diploma dissertation, University of Stuttgart, 2013.
[17] (web site) CCLonline.com.
[18] (web site) hardwareversand.de.
[19] (web site) SemProM.org.
[20] F. Zou. Internetgestützte Textanalyse zur Extraktion von Produktentwicklungswissen anhand von semi-strukturierten Dokumenten. Diploma dissertation, University of Stuttgart, 2013.


More information

Text Mining: A Burgeoning technology for knowledge extraction

Text Mining: A Burgeoning technology for knowledge extraction Text Mining: A Burgeoning technology for knowledge extraction 1 Anshika Singh, 2 Dr. Udayan Ghosh 1 HCL Technologies Ltd., Noida, 2 University School of Information &Communication Technology, Dwarka, Delhi.

More information

A Short Introduction to CATMA

A Short Introduction to CATMA A Short Introduction to CATMA Outline: I. Getting Started II. Analyzing Texts - Search Queries in CATMA III. Annotating Texts (collaboratively) with CATMA IV. Further Search Queries: Analyze Your Annotations

More information

Reference Requirements for Records and Documents Management

Reference Requirements for Records and Documents Management Reference Requirements for Records and Documents Management Ricardo Jorge Seno Martins ricardosenomartins@gmail.com Instituto Superior Técnico, Lisboa, Portugal May 2015 Abstract When information systems

More information

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent

More information

Fast and Effective System for Name Entity Recognition on Big Data

Fast and Effective System for Name Entity Recognition on Big Data International Journal of Computer Sciences and Engineering Open Access Research Paper Volume-3, Issue-2 E-ISSN: 2347-2693 Fast and Effective System for Name Entity Recognition on Big Data Jigyasa Nigam

More information

Modeling the Evolution of Product Entities

Modeling the Evolution of Product Entities Modeling the Evolution of Product Entities by Priya Radhakrishnan, Manish Gupta, Vasudeva Varma in The 37th Annual ACM SIGIR CONFERENCE Gold Coast, Australia. Report No: IIIT/TR/2014/-1 Centre for Search

More information

TIPSTER Text Phase II Architecture Requirements

TIPSTER Text Phase II Architecture Requirements 1.0 INTRODUCTION TIPSTER Text Phase II Architecture Requirements 1.1 Requirements Traceability Version 2.0p 3 June 1996 Architecture Commitee tipster @ tipster.org The requirements herein are derived from

More information

Class #7 Guidebook Page Expansion. By Ryan Stevenson

Class #7 Guidebook Page Expansion. By Ryan Stevenson Class #7 Guidebook Page Expansion By Ryan Stevenson Table of Contents 1. Class Purpose 2. Expansion Overview 3. Structure Changes 4. Traffic Funnel 5. Page Updates 6. Advertising Updates 7. Prepare for

More information

TIC: A Topic-based Intelligent Crawler

TIC: A Topic-based Intelligent Crawler 2011 International Conference on Information and Intelligent Computing IPCSIT vol.18 (2011) (2011) IACSIT Press, Singapore TIC: A Topic-based Intelligent Crawler Hossein Shahsavand Baghdadi and Bali Ranaivo-Malançon

More information

Natural Language Processing. SoSe Question Answering

Natural Language Processing. SoSe Question Answering Natural Language Processing SoSe 2017 Question Answering Dr. Mariana Neves July 5th, 2017 Motivation Find small segments of text which answer users questions (http://start.csail.mit.edu/) 2 3 Motivation

More information

An Integrated Framework to Enhance the Web Content Mining and Knowledge Discovery

An Integrated Framework to Enhance the Web Content Mining and Knowledge Discovery An Integrated Framework to Enhance the Web Content Mining and Knowledge Discovery Simon Pelletier Université de Moncton, Campus of Shippagan, BGI New Brunswick, Canada and Sid-Ahmed Selouani Université

More information

NLP in practice, an example: Semantic Role Labeling

NLP in practice, an example: Semantic Role Labeling NLP in practice, an example: Semantic Role Labeling Anders Björkelund Lund University, Dept. of Computer Science anders.bjorkelund@cs.lth.se October 15, 2010 Anders Björkelund NLP in practice, an example:

More information

Iterative CKY parsing for Probabilistic Context-Free Grammars

Iterative CKY parsing for Probabilistic Context-Free Grammars Iterative CKY parsing for Probabilistic Context-Free Grammars Yoshimasa Tsuruoka and Jun ichi Tsujii Department of Computer Science, University of Tokyo Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0033 CREST, JST

More information

Final Project Discussion. Adam Meyers Montclair State University

Final Project Discussion. Adam Meyers Montclair State University Final Project Discussion Adam Meyers Montclair State University Summary Project Timeline Project Format Details/Examples for Different Project Types Linguistic Resource Projects: Annotation, Lexicons,...

More information

arxiv: v1 [cs.hc] 14 Nov 2017

arxiv: v1 [cs.hc] 14 Nov 2017 A visual search engine for Bangladeshi laws arxiv:1711.05233v1 [cs.hc] 14 Nov 2017 Manash Kumar Mandal Department of EEE Khulna University of Engineering & Technology Khulna, Bangladesh manashmndl@gmail.com

More information

CS101 Introduction to Programming Languages and Compilers

CS101 Introduction to Programming Languages and Compilers CS101 Introduction to Programming Languages and Compilers In this handout we ll examine different types of programming languages and take a brief look at compilers. We ll only hit the major highlights

More information

CS Reading Packet: "Database Processing and Development"

CS Reading Packet: Database Processing and Development CS 325 - Reading Packet: "Database Processing and Development" p. 1 CS 325 - Reading Packet: "Database Processing and Development" SOURCES: Kroenke, "Database Processing: Fundamentals, Design, and Implementation",

More information

CS 224N Assignment 2 Writeup

CS 224N Assignment 2 Writeup CS 224N Assignment 2 Writeup Angela Gong agong@stanford.edu Dept. of Computer Science Allen Nie anie@stanford.edu Symbolic Systems Program 1 Introduction 1.1 PCFG A probabilistic context-free grammar (PCFG)

More information

XP: Backup Your Important Files for Safety

XP: Backup Your Important Files for Safety XP: Backup Your Important Files for Safety X 380 / 1 Protect Your Personal Files Against Accidental Loss with XP s Backup Wizard Your computer contains a great many important files, but when it comes to

More information

DiskSavvy Disk Space Analyzer. DiskSavvy DISK SPACE ANALYZER. User Manual. Version Dec Flexense Ltd.

DiskSavvy Disk Space Analyzer. DiskSavvy DISK SPACE ANALYZER. User Manual. Version Dec Flexense Ltd. DiskSavvy DISK SPACE ANALYZER User Manual Version 10.3 Dec 2017 www.disksavvy.com info@flexense.com 1 1 Product Overview...3 2 Product Versions...7 3 Using Desktop Versions...8 3.1 Product Installation

More information

A web application serving queries on renewable energy sources and energy management topics database, built on JSP technology

A web application serving queries on renewable energy sources and energy management topics database, built on JSP technology International Workshop on Energy Performance and Environmental 1 A web application serving queries on renewable energy sources and energy management topics database, built on JSP technology P.N. Christias

More information

gsysc Visualization of SystemC-Projects (Extended abstract)

gsysc Visualization of SystemC-Projects (Extended abstract) gsysc Visualization of SystemC-Projects (Extended abstract) Christian J. Eibl Institute for Computer Engineering University of Lübeck February 16, 2005 Abstract SystemC is a C++ library for modeling of

More information

3 Publishing Technique

3 Publishing Technique Publishing Tool 32 3 Publishing Technique As discussed in Chapter 2, annotations can be extracted from audio, text, and visual features. The extraction of text features from the audio layer is the approach

More information

Using NLP and context for improved search result in specialized search engines

Using NLP and context for improved search result in specialized search engines Mälardalen University School of Innovation Design and Engineering Västerås, Sweden Thesis for the Degree of Bachelor of Science in Computer Science DVA331 Using NLP and context for improved search result

More information

NLP Chain. Giuseppe Castellucci Web Mining & Retrieval a.a. 2013/2014

NLP Chain. Giuseppe Castellucci Web Mining & Retrieval a.a. 2013/2014 NLP Chain Giuseppe Castellucci castellucci@ing.uniroma2.it Web Mining & Retrieval a.a. 2013/2014 Outline NLP chains RevNLT Exercise NLP chain Automatic analysis of texts At different levels Token Morphological

More information

code pattern analysis of object-oriented programming languages

code pattern analysis of object-oriented programming languages code pattern analysis of object-oriented programming languages by Xubo Miao A thesis submitted to the School of Computing in conformity with the requirements for the degree of Master of Science Queen s

More information

Applying Best Practices, QA, and Tips and Tricks to Our Reports

Applying Best Practices, QA, and Tips and Tricks to Our Reports Applying Best Practices, QA, and Tips and Tricks to Our Reports If we had to summarize all we have learned so far, put it into a nutshell, and squeeze in just the very best of everything, this is how that

More information

FmPro Migrator Developer Edition - Table Consolidation Procedure

FmPro Migrator Developer Edition - Table Consolidation Procedure FmPro Migrator Developer Edition - Table Consolidation Procedure FmPro Migrator Developer Edition - Table Consolidation Procedure 1 Installation 1.1 Installation Tips 5 2 Step 1 2.1 Step 1 - Import Table

More information

The Goal of this Document. Where to Start?

The Goal of this Document. Where to Start? A QUICK INTRODUCTION TO THE SEMILAR APPLICATION Mihai Lintean, Rajendra Banjade, and Vasile Rus vrus@memphis.edu linteam@gmail.com rbanjade@memphis.edu The Goal of this Document This document introduce

More information

Get the most value from your surveys with text analysis

Get the most value from your surveys with text analysis SPSS Text Analysis for Surveys 3.0 Specifications Get the most value from your surveys with text analysis The words people use to answer a question tell you a lot about what they think and feel. That s

More information

Chapter S:II. II. Search Space Representation

Chapter S:II. II. Search Space Representation Chapter S:II II. Search Space Representation Systematic Search Encoding of Problems State-Space Representation Problem-Reduction Representation Choosing a Representation S:II-1 Search Space Representation

More information

Field Types and Import/Export Formats

Field Types and Import/Export Formats Chapter 3 Field Types and Import/Export Formats Knowing Your Data Besides just knowing the raw statistics and capacities of your software tools ( speeds and feeds, as the machinists like to say), it s

More information

How to speed up a database which has gotten slow

How to speed up a database which has gotten slow Triad Area, NC USA E-mail: info@geniusone.com Web: http://geniusone.com How to speed up a database which has gotten slow hardware OS database parameters Blob fields Indices table design / table contents

More information

Text Mining. Representation of Text Documents

Text Mining. Representation of Text Documents Data Mining is typically concerned with the detection of patterns in numeric data, but very often important (e.g., critical to business) information is stored in the form of text. Unlike numeric data,

More information

LaSEWeb: Automating Search Strategies over Semi-Structured Web Data

LaSEWeb: Automating Search Strategies over Semi-Structured Web Data LaSEWeb: Automating Search Strategies over Semi-Structured Web Data Oleksandr Polozov University of Washington polozov@cs.washington.edu Sumit Gulwani Microsoft Research sumitg@microsoft.com KDD 2014 August

More information

Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras

Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 25 Tutorial 5: Analyzing text using Python NLTK Hi everyone,

More information

Chapter 2. Architecture of a Search Engine

Chapter 2. Architecture of a Search Engine Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them

More information

Supervised Ranking for Plagiarism Source Retrieval

Supervised Ranking for Plagiarism Source Retrieval Supervised Ranking for Plagiarism Source Retrieval Notebook for PAN at CLEF 2013 Kyle Williams, Hung-Hsuan Chen, and C. Lee Giles, Information Sciences and Technology Computer Science and Engineering Pennsylvania

More information

Tokenization and Sentence Segmentation. Yan Shao Department of Linguistics and Philology, Uppsala University 29 March 2017

Tokenization and Sentence Segmentation. Yan Shao Department of Linguistics and Philology, Uppsala University 29 March 2017 Tokenization and Sentence Segmentation Yan Shao Department of Linguistics and Philology, Uppsala University 29 March 2017 Outline 1 Tokenization Introduction Exercise Evaluation Summary 2 Sentence segmentation

More information

Ranking in a Domain Specific Search Engine

Ranking in a Domain Specific Search Engine Ranking in a Domain Specific Search Engine CS6998-03 - NLP for the Web Spring 2008, Final Report Sara Stolbach, ss3067 [at] columbia.edu Abstract A search engine that runs over all domains must give equal

More information

Toward Part-based Document Image Decoding

Toward Part-based Document Image Decoding 2012 10th IAPR International Workshop on Document Analysis Systems Toward Part-based Document Image Decoding Wang Song, Seiichi Uchida Kyushu University, Fukuoka, Japan wangsong@human.ait.kyushu-u.ac.jp,

More information

Best Practices for Loading Autodesk Inventor Data into Autodesk Vault

Best Practices for Loading Autodesk Inventor Data into Autodesk Vault AUTODESK INVENTOR WHITE PAPER Best Practices for Loading Autodesk Inventor Data into Autodesk Vault The most important item to address during the implementation of Autodesk Vault software is the cleaning

More information

Using Search-Logs to Improve Query Tagging

Using Search-Logs to Improve Query Tagging Using Search-Logs to Improve Query Tagging Kuzman Ganchev Keith Hall Ryan McDonald Slav Petrov Google, Inc. {kuzman kbhall ryanmcd slav}@google.com Abstract Syntactic analysis of search queries is important

More information

extensible Markup Language

extensible Markup Language extensible Markup Language XML is rapidly becoming a widespread method of creating, controlling and managing data on the Web. XML Orientation XML is a method for putting structured data in a text file.

More information

By Simplicity Software Technologies Inc.

By Simplicity Software Technologies Inc. Now Available in both SQL Server Express and Microsoft Access Editions By Simplicity Software Technologies Inc. Microsoft, Access and SQL Server Express are trademarks and or products of the Microsoft

More information

Accessible PDF Documents with Adobe Acrobat 9 Pro and LiveCycle Designer ES 8.2

Accessible PDF Documents with Adobe Acrobat 9 Pro and LiveCycle Designer ES 8.2 Accessible PDF Documents with Adobe Acrobat 9 Pro and LiveCycle Designer ES 8.2 Table of Contents Accessible PDF Documents with Adobe Acrobat 9... 3 Application...3 Terminology...3 Introduction...3 Word

More information

I Know Your Name: Named Entity Recognition and Structural Parsing

I Know Your Name: Named Entity Recognition and Structural Parsing I Know Your Name: Named Entity Recognition and Structural Parsing David Philipson and Nikil Viswanathan {pdavid2, nikil}@stanford.edu CS224N Fall 2011 Introduction In this project, we explore a Maximum

More information

The KNIME Text Processing Plugin

The KNIME Text Processing Plugin The KNIME Text Processing Plugin Kilian Thiel Nycomed Chair for Bioinformatics and Information Mining, University of Konstanz, 78457 Konstanz, Deutschland, Kilian.Thiel@uni-konstanz.de Abstract. This document

More information

Morpho-syntactic Analysis with the Stanford CoreNLP

Morpho-syntactic Analysis with the Stanford CoreNLP Morpho-syntactic Analysis with the Stanford CoreNLP Danilo Croce croce@info.uniroma2.it WmIR 2015/2016 Objectives of this tutorial Use of a Natural Language Toolkit CoreNLP toolkit Morpho-syntactic analysis

More information

Statistical parsing. Fei Xia Feb 27, 2009 CSE 590A

Statistical parsing. Fei Xia Feb 27, 2009 CSE 590A Statistical parsing Fei Xia Feb 27, 2009 CSE 590A Statistical parsing History-based models (1995-2000) Recent development (2000-present): Supervised learning: reranking and label splitting Semi-supervised

More information

Is SharePoint the. Andrew Chapman

Is SharePoint the. Andrew Chapman Is SharePoint the Andrew Chapman Records management (RM) professionals have been challenged to manage electronic data for some time. Their efforts have tended to focus on unstructured data, such as documents,

More information

Object-oriented Compiler Construction

Object-oriented Compiler Construction 1 Object-oriented Compiler Construction Extended Abstract Axel-Tobias Schreiner, Bernd Kühl University of Osnabrück, Germany {axel,bekuehl}@uos.de, http://www.inf.uos.de/talks/hc2 A compiler takes a program

More information

Heading-Based Sectional Hierarchy Identification for HTML Documents

Heading-Based Sectional Hierarchy Identification for HTML Documents Heading-Based Sectional Hierarchy Identification for HTML Documents 1 Dept. of Computer Engineering, Boğaziçi University, Bebek, İstanbul, 34342, Turkey F. Canan Pembe 1,2 and Tunga Güngör 1 2 Dept. of

More information

CRFVoter: Chemical Entity Mention, Gene and Protein Related Object recognition using a conglomerate of CRF based tools

CRFVoter: Chemical Entity Mention, Gene and Protein Related Object recognition using a conglomerate of CRF based tools CRFVoter: Chemical Entity Mention, Gene and Protein Related Object recognition using a conglomerate of CRF based tools Wahed Hemati, Alexander Mehler, and Tolga Uslu Text Technology Lab, Goethe Universitt

More information

0. Abstract. Acknowledgements:

0. Abstract. Acknowledgements: 1 0. Abstract To facilitate the development of the Semantic Web, we propose in this thesis a general automatic ontology building algorithm which, given a pool of potential terms and a set of relationships

More information

Objective: To learn meaning and concepts of programming. Outcome: By the end of this students should be able to describe the meaning of programming

Objective: To learn meaning and concepts of programming. Outcome: By the end of this students should be able to describe the meaning of programming 30 th September 2018 Objective: To learn meaning and concepts of programming Outcome: By the end of this students should be able to describe the meaning of programming Section 1: What is a programming

More information

Web Product Ranking Using Opinion Mining

Web Product Ranking Using Opinion Mining Web Product Ranking Using Opinion Mining Yin-Fu Huang and Heng Lin Department of Computer Science and Information Engineering National Yunlin University of Science and Technology Yunlin, Taiwan {huangyf,

More information

from Pavel Mihaylov and Dorothee Beermann Reviewed by Sc o t t Fa r r a r, University of Washington

from Pavel Mihaylov and Dorothee Beermann Reviewed by Sc o t t Fa r r a r, University of Washington Vol. 4 (2010), pp. 60-65 http://nflrc.hawaii.edu/ldc/ http://hdl.handle.net/10125/4467 TypeCraft from Pavel Mihaylov and Dorothee Beermann Reviewed by Sc o t t Fa r r a r, University of Washington 1. OVERVIEW.

More information

COMMUNICATION PROTOCOLS

COMMUNICATION PROTOCOLS COMMUNICATION PROTOCOLS Index Chapter 1. Introduction Chapter 2. Software components message exchange JMS and Tibco Rendezvous Chapter 3. Communication over the Internet Simple Object Access Protocol (SOAP)

More information

Getting Started With Syntax October 15, 2015

Getting Started With Syntax October 15, 2015 Getting Started With Syntax October 15, 2015 Introduction The Accordance Syntax feature allows both viewing and searching of certain original language texts that have both morphological tagging along with

More information

How to Clean Up Files for Better Information Management Brian Tuemmler. Network Shared Drives: RIM FUNDAMENTALS

How to Clean Up Files for Better Information Management Brian Tuemmler. Network Shared Drives: RIM FUNDAMENTALS Network Shared Drives: How to Clean Up Files for Better Information Management Brian Tuemmler 26 JANUARY/FEBRUARY 2012 INFORMATIONMANAGEMENT This article offers recommendations about what an organization

More information

UNIVERSITY OF EDINBURGH COLLEGE OF SCIENCE AND ENGINEERING SCHOOL OF INFORMATICS INFR08008 INFORMATICS 2A: PROCESSING FORMAL AND NATURAL LANGUAGES

UNIVERSITY OF EDINBURGH COLLEGE OF SCIENCE AND ENGINEERING SCHOOL OF INFORMATICS INFR08008 INFORMATICS 2A: PROCESSING FORMAL AND NATURAL LANGUAGES UNIVERSITY OF EDINBURGH COLLEGE OF SCIENCE AND ENGINEERING SCHOOL OF INFORMATICS INFR08008 INFORMATICS 2A: PROCESSING FORMAL AND NATURAL LANGUAGES Saturday 10 th December 2016 09:30 to 11:30 INSTRUCTIONS

More information

Influence of Word Normalization on Text Classification

Influence of Word Normalization on Text Classification Influence of Word Normalization on Text Classification Michal Toman a, Roman Tesar a and Karel Jezek a a University of West Bohemia, Faculty of Applied Sciences, Plzen, Czech Republic In this paper we

More information

Annotation by category - ELAN and ISO DCR

Annotation by category - ELAN and ISO DCR Annotation by category - ELAN and ISO DCR Han Sloetjes, Peter Wittenburg Max Planck Institute for Psycholinguistics P.O. Box 310, 6500 AH Nijmegen, The Netherlands E-mail: Han.Sloetjes@mpi.nl, Peter.Wittenburg@mpi.nl

More information

Information Extraction Techniques in Terrorism Surveillance

Information Extraction Techniques in Terrorism Surveillance Information Extraction Techniques in Terrorism Surveillance Roman Tekhov Abstract. The article gives a brief overview of what information extraction is and how it might be used for the purposes of counter-terrorism

More information

Categorizing Migrations

Categorizing Migrations What to Migrate? Categorizing Migrations A version control repository contains two distinct types of data. The first type of data is the actual content of the directories and files themselves which are

More information

What's New In Informatica Data Quality 9.0.1

What's New In Informatica Data Quality 9.0.1 What's New In Informatica Data Quality 9.0.1 2010 Abstract When you upgrade Informatica Data Quality to version 9.0.1, you will find multiple new features and enhancements. The new features include a new

More information

Ontology Extraction from Heterogeneous Documents

Ontology Extraction from Heterogeneous Documents Vol.3, Issue.2, March-April. 2013 pp-985-989 ISSN: 2249-6645 Ontology Extraction from Heterogeneous Documents Kirankumar Kataraki, 1 Sumana M 2 1 IV sem M.Tech/ Department of Information Science & Engg

More information

Information Retrieval

Information Retrieval Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,

More information

WYSIWON T The XML Authoring Myths

WYSIWON T The XML Authoring Myths WYSIWON T The XML Authoring Myths Tony Stevens Turn-Key Systems Abstract The advantages of XML for increasing the value of content and lowering production costs are well understood. However, many projects

More information

Towards Domain Independent Named Entity Recognition

Towards Domain Independent Named Entity Recognition 38 Computer Science 5 Towards Domain Independent Named Entity Recognition Fredrick Edward Kitoogo, Venansius Baryamureeba and Guy De Pauw Named entity recognition is a preprocessing tool to many natural

More information

STRUCTURE-BASED QUERY EXPANSION FOR XML SEARCH ENGINE

STRUCTURE-BASED QUERY EXPANSION FOR XML SEARCH ENGINE STRUCTURE-BASED QUERY EXPANSION FOR XML SEARCH ENGINE Wei-ning Qian, Hai-lei Qian, Li Wei, Yan Wang and Ao-ying Zhou Computer Science Department Fudan University Shanghai 200433 E-mail: wnqian@fudan.edu.cn

More information

Personal Health Assistant: Final Report Prepared by K. Morillo, J. Redway, and I. Smyrnow Version Date April 29, 2010 Personal Health Assistant

Personal Health Assistant: Final Report Prepared by K. Morillo, J. Redway, and I. Smyrnow Version Date April 29, 2010 Personal Health Assistant Personal Health Assistant Ishmael Smyrnow Kevin Morillo James Redway CSE 293 Final Report Table of Contents 0... 3 1...General Overview... 3 1.1 Introduction... 3 1.2 Goal...3 1.3 Overview... 3 2... Server

More information

Information Retrieval CSCI

Information Retrieval CSCI Information Retrieval CSCI 4141-6403 My name is Anwar Alhenshiri My email is: anwar@cs.dal.ca I prefer: aalhenshiri@gmail.com The course website is: http://web.cs.dal.ca/~anwar/ir/main.html 5/6/2012 1

More information

Technology in Action. Chapter Topics. Scope creep occurs when: 3/20/2013. Information Systems include all EXCEPT the following:

Technology in Action. Chapter Topics. Scope creep occurs when: 3/20/2013. Information Systems include all EXCEPT the following: Technology in Action Technology in Action Alan Evans Kendall Martin Mary Anne Poatsy Chapter 10 Behind the Scenes: Software Programming Ninth Edition Chapter Topics Understanding software programming Life

More information

TectoMT: Modular NLP Framework

TectoMT: Modular NLP Framework : Modular NLP Framework Martin Popel, Zdeněk Žabokrtský ÚFAL, Charles University in Prague IceTAL, 7th International Conference on Natural Language Processing August 17, 2010, Reykjavik Outline Motivation

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Jumpstarting the Semantic Web

Jumpstarting the Semantic Web Jumpstarting the Semantic Web Mark Watson. Copyright 2003, 2004 Version 0.3 January 14, 2005 This work is licensed under the Creative Commons Attribution-NoDerivs-NonCommercial License. To view a copy

More information

Quality of the Source Code Lexicon

Quality of the Source Code Lexicon Quality of the Source Code Lexicon Venera Arnaoudova Mining & Modeling Unstructured Data in Software Challenges for the Future NII Shonan Meeting, March 206 Psychological Complexity meaningless or incorrect

More information

EDMS. Architecture and Concepts

EDMS. Architecture and Concepts EDMS Engineering Data Management System Architecture and Concepts Hannu Peltonen Helsinki University of Technology Department of Computer Science Laboratory of Information Processing Science Abstract

More information

M359 Block5 - Lecture12 Eng/ Waleed Omar

M359 Block5 - Lecture12 Eng/ Waleed Omar Documents and markup languages The term XML stands for extensible Markup Language. Used to label the different parts of documents. Labeling helps in: Displaying the documents in a formatted way Querying

More information

Class Dependency Analyzer CDA Developer Guide

Class Dependency Analyzer CDA Developer Guide CDA Developer Guide Version 1.4 Copyright 2007-2017 MDCS Manfred Duchrow Consulting & Software Author: Manfred Duchrow Table of Contents: 1 Introduction 3 2 Extension Mechanism 3 1.1. Prerequisites 3 1.2.

More information

ELECTRONIC LOGBOOK BY USING THE HYPERTEXT PREPROCESSOR

ELECTRONIC LOGBOOK BY USING THE HYPERTEXT PREPROCESSOR 10th ICALEPCS Int. Conf. on Accelerator & Large Expt. Physics Control Systems. Geneva, 10-14 Oct 2005, PO2.086-5 (2005) ELECTRONIC LOGBOOK BY USING THE HYPERTEXT PREPROCESSOR C. J. Wang, Changhor Kuo,

More information

Week - 01 Lecture - 04 Downloading and installing Python

Week - 01 Lecture - 04 Downloading and installing Python Programming, Data Structures and Algorithms in Python Prof. Madhavan Mukund Department of Computer Science and Engineering Indian Institute of Technology, Madras Week - 01 Lecture - 04 Downloading and

More information

SINAMICS G/S: Tool for transforming Warning and Error Messages in CSV format

SINAMICS G/S: Tool for transforming Warning and Error Messages in CSV format Application example 03/2017 SINAMICS G/S: Tool for transforming Warning and Error Messages in CSV format https://support.industry.siemens.com/cs/ww/en/view/77467239 Copyright Siemens AG 2017 All rights

More information

Intro to XML. Borrowed, with author s permission, from:

Intro to XML. Borrowed, with author s permission, from: Intro to XML Borrowed, with author s permission, from: http://business.unr.edu/faculty/ekedahl/is389/topic3a ndroidintroduction/is389androidbasics.aspx Part 1: XML Basics Why XML Here? You need to understand

More information

CS224n: Natural Language Processing with Deep Learning 1 Lecture Notes: Part IV Dependency Parsing 2 Winter 2019

CS224n: Natural Language Processing with Deep Learning 1 Lecture Notes: Part IV Dependency Parsing 2 Winter 2019 CS224n: Natural Language Processing with Deep Learning 1 Lecture Notes: Part IV Dependency Parsing 2 Winter 2019 1 Course Instructors: Christopher Manning, Richard Socher 2 Authors: Lisa Wang, Juhi Naik,

More information

Best practices for OO 10 content structuring

Best practices for OO 10 content structuring Best practices for OO 10 content structuring With HP Operations Orchestration 10 two new concepts were introduced: Projects and Content Packs. Both contain flows, operations, and configuration items. Organizations

More information

DupScout DUPLICATE FILES FINDER

DupScout DUPLICATE FILES FINDER DupScout DUPLICATE FILES FINDER User Manual Version 10.3 Dec 2017 www.dupscout.com info@flexense.com 1 1 Product Overview...3 2 DupScout Product Versions...7 3 Using Desktop Product Versions...8 3.1 Product

More information

International Journal for Management Science And Technology (IJMST)

International Journal for Management Science And Technology (IJMST) Volume 4; Issue 03 Manuscript- 1 ISSN: 2320-8848 (Online) ISSN: 2321-0362 (Print) International Journal for Management Science And Technology (IJMST) GENERATION OF SOURCE CODE SUMMARY BY AUTOMATIC IDENTIFICATION

More information

OSDBQ: Ontology Supported RDBMS Querying

OSDBQ: Ontology Supported RDBMS Querying OSDBQ: Ontology Supported RDBMS Querying Cihan Aksoy 1, Erdem Alparslan 1, Selçuk Bozdağ 2, İhsan Çulhacı 3, 1 The Scientific and Technological Research Council of Turkey, Gebze/Kocaeli, Turkey 2 Komtaş

More information

DABYS: EGOS Generic Database System

DABYS: EGOS Generic Database System SpaceOps 2010 ConferenceDelivering on the DreamHosted by NASA Mars 25-30 April 2010, Huntsville, Alabama AIAA 2010-1949 DABYS: EGOS Generic base System Isabel del Rey 1 and Ramiro

More information

Taxonomies and controlled vocabularies best practices for metadata

Taxonomies and controlled vocabularies best practices for metadata Original Article Taxonomies and controlled vocabularies best practices for metadata Heather Hedden is the taxonomy manager at First Wind Energy LLC. Previously, she was a taxonomy consultant with Earley

More information