Ontology Augmentation Through Matching with Web Tables


Oliver Lehmberg¹ and Oktie Hassanzadeh²

¹ University of Mannheim, B6 26, Mannheim, Germany
² IBM Research, Yorktown Heights, New York, U.S.A.

Abstract. In this paper, we examine the possibility of using data collected from millions of tables on the Web to extend an ontology with new attributes. There are two major challenges in using such a large number of potentially noisy tables for this task. First, table columns need to be matched to create groups of columns that represent a new (or existing) attribute for a particular class in the ontology. Second, the column groups need to be ranked according to their usefulness in augmenting the ontology. We show several approaches to addressing these challenges and report on the results of our extensive experiments using Web Tables from the Web Data Commons corpus, with the DBpedia Ontology as our target ontology.

1 Introduction

The Web is a vast source of valuable knowledge that can be used to extend or augment a given ontology. Knowledge extraction from the Web is a well-studied problem and an active area of research [5, 7, 11]. While such knowledge is often extracted from textual (or semi-structured) content using information extraction and wrapper induction techniques, there have also been attempts to use the structured data that is exposed on web pages as HTML tables [5, 14, 13].

In this paper, we examine the possibility of using Web tables to augment a given ontology with a new set of attributes. Our hypothesis is that for each class in the given ontology, there are tables on the Web describing instances of the class and their various attributes. Further, not only are a large number of these attributes not yet captured in the ontology, but many of them are also not useful, i.e., they may be irrelevant, inaccurate, or redundant.

The approach we take in this work is a two-step process. First, tables are matched among each other and to the target ontology, to group columns that refer to the same attribute and to align them with classes and existing attributes in the target ontology. The second step ranks the column groups based on a measure of the quality or usefulness of each group in augmenting the existing attributes in the target ontology. We perform an empirical study of the performance of this approach, using Web Tables extracted from the Common Crawl to augment the properties in the DBpedia ontology.

2 Related Work

The pioneering work on using Web tables to discover new attributes was done by Cafarella et al. in 2008 [3]. They create the so-called attribute correlation statistics database (AcsDb), which contains attribute counts based on the column headers in a large corpus of Web tables. From these counts, they estimate attribute occurrence probabilities. Applications of this database are a schema auto-complete function, synonym generation, and a tool enabling easy join graph traversal for end users. We extend their approach by using clusters derived from matched columns, instead of column headers, as the basic unit for the statistics.

Das Sarma et al. [4] use label- and value-based schema matching methods to map Web tables to a given query table. For their Schema Complement operation, they consider all unmapped columns and rank them using the AcsDb and the entity coverage of the input table provided by the user. Their goal is to rank complete tables by their usefulness for the complement task. While they use a matching of Web table columns to the query table to rule out existing attributes, they fall back to the AcsDb approach when it comes to finding new attributes. In contrast, we calculate attribute statistics based on matched column clusters.

Lee et al. [9] extract attributes from Probase [15], Web documents, a search engine query log, and DBpedia [1], and estimate their typicality using frequencies of class/attribute and instance/attribute occurrences. The extraction process is completely label-based. For the merging of attributes, they use synonyms derived from Wikipedia.

Several systems have been proposed to extend a user-specified query table with content from a corpus of Web tables [2, 16, 10]. For the task of finding new attributes, the user can specify a keyword query which describes the new attribute, so no ranking is required.
Alternatively, the InfoGather system [16] and the Mannheim SearchJoin Engine [10] can generate additional attributes based on a schema matching, but neither system ranks the resulting attributes by a relevance score.

3 Approach

Our goal is to design an ontology augmentation solution that finds new attributes for an ontology using an external source of structured data, such as a corpus of web tables. The general idea followed by existing approaches is to count attribute occurrences in the table corpus and use them to estimate probabilities for encountering these attributes. Based on these probabilities, several different metrics can be defined to assess the value of adding an attribute to the ontology (see Section 3.3). These metrics measure how likely a new attribute is to co-occur with existing attributes (in the ontology) or how consistent the resulting schema would be if the new attribute were added to the ontology.

Existing methods often consider the use case of extending a user-provided data source in an ad-hoc setting. In the case of extending an ontology, however, a variety of matching methods can be used to align the schema of the Web tables with the ontology. We propose to incorporate the mapping created by such methods by calculating all co-occurrence frequencies based on the mapping.

The relevance of new attributes is measured based on how frequently they co-occur with known attributes. Using exact string matching, these frequencies can be obtained from a corpus of web tables by counting, as shown by the approaches using the AcsDb [3]. When using fuzzy matching methods, however, the attributes must first be mapped among each other and then be partitioned according to their similarity values. This results in attribute clusters whose frequency can be determined by adding up the frequencies of all attributes in the cluster.

3.1 Identifying Equal Attributes

We compare several different approaches to defining attribute similarity, which are introduced in the following.

Equality of Known Attributes. For the attributes that already exist in the ontology, we create a mapping from the web tables to the ontology. For the results in this paper, we use T2K Match [12] to map Web Tables to the DBpedia ontology. This mapping defines which columns in the web tables correspond to which property in the ontology. By transitivity, all attributes which correspond to the same property are equal.

Equality of Unknown Attributes. Based on the mapping produced by T2K Match, we can group the web tables by their class in the knowledge base (blocking step) and then match all unmapped attributes among each other. For attributes which do not exist in the ontology, we compare the following schema matching approaches:

Label-based Matching. Using the column headers of web tables as features, we evaluate using exact column header equality to find matching columns. We refer to this approach as Exact in our experiments.
We further evaluate String Similarity, which calculates the similarity of the column headers using the Generalised Jaccard Similarity with Edit Distance as the inner similarity function.

Instance-based Equality. We further evaluate similarities which are created by the instance-based schema matcher of the Helix System [6]. We refer to the configuration using cosine similarity as Helix Cosine and to the configuration using containment similarity as Helix Containment.

Key/Value-based Equality. The Key/Value Matching approach compares only those values of two columns that are mapped to the same instance in the ontology. That is, two columns are equivalent only if they contain similar values for the same instances. To obtain the similarity values, we use the value-based matching component of T2K Match.
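To make the String Similarity matcher more concrete, the following is a minimal sketch of a generalised Jaccard similarity with a normalised edit-distance inner similarity. The tokenisation (lower-casing and splitting on whitespace), the inner threshold of 0.5, the greedy pairing of tokens, and all function names are assumptions made for illustration; the text does not specify which variant was used in the experiments.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Normalise the edit distance into a similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def generalised_jaccard(header_a: str, header_b: str,
                        inner_threshold: float = 0.5) -> float:
    """Generalised Jaccard over header tokens (assumed tokenisation)."""
    tokens_a = header_a.lower().split()
    tokens_b = header_b.lower().split()
    if not tokens_a and not tokens_b:
        return 1.0
    # Score every token pair, best pairs first.
    scored = sorted(
        ((edit_similarity(x, y), i, j)
         for (i, x), (j, y) in product(enumerate(tokens_a), enumerate(tokens_b))),
        reverse=True,
    )
    used_a, used_b = set(), set()
    total, matches = 0.0, 0
    # Greedily keep the best non-conflicting pairs above the inner threshold.
    for sim, i, j in scored:
        if sim < inner_threshold:
            break
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        total += sim
        matches += 1
    # Matched pairs count once in the union; unmatched tokens contribute 0.
    return total / (len(tokens_a) + len(tokens_b) - matches)
```

With this sketch, headers that differ only in token order or by a small typo still receive a high score, while unrelated headers fall below typical thresholds.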

3.2 Similarity Graph Partitioning

After the calculation of the similarity values, we must decide which set of columns refers to the same attribute. For attributes that already exist in the ontology, all columns with a similarity value above a threshold are considered to be equal to the existing attribute. For attributes which do not exist in the ontology, however, there is no such central attribute. We hence evaluate different partitioning strategies [8] for the graph that is defined by the similarities among the columns of the web tables.

Connected Components. We calculate the connected components on the similarity graph. Each resulting component is a cluster.

Center. The Center algorithm uses the list of similarities sorted in descending order to create star-shaped clusters. The first time a node is encountered in the sorted list, it becomes the center of a cluster. Any other node appearing in a similarity pair with this node is then assigned to the cluster having the former node as center.

MergeCenter. The MergeCenter algorithm is similar to the Center algorithm, with one extension: if a node is similar to the centers of two different clusters, these clusters are merged.

3.3 Attribute Ranking

After defining attribute equality, we can now specify how the relevance of new attributes is determined. All compared ranking methods are defined based on attribute co-occurrence probabilities, which we define according to Cafarella et al. [3]. Let a schema $s \in S$ be a set of attributes, where $S$ is the set of all schemata. A table has this schema if its columns correspond to the attributes (based on the schema mapping), regardless of their order and column header.
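As a concrete illustration of the partitioning strategies from Section 3.2, the sketch below implements Center-style and MergeCenter-style clustering over a list of similarity pairs. It is one possible reading of the textual description, not the implementation used in the experiments: the function name `center_partition`, the `(node_a, node_b, similarity)` edge format, and the handling of ties and of nodes without any edge above the threshold are assumptions.

```python
from collections import defaultdict


def center_partition(edges, threshold, merge=False):
    """Partition a similarity graph into star-shaped clusters.

    edges: iterable of (node_a, node_b, similarity) tuples.
    merge=False follows the Center idea; merge=True additionally merges two
    clusters when a node turns out to be similar to both of their centers.
    Nodes that only occur in edges below the threshold are not returned.
    """
    cluster_of = {}   # node -> center of the cluster it was assigned to
    centers = set()
    merged_into = {}  # center -> center of the cluster it was merged into

    def root(center):
        # Follow merge links to the surviving cluster.
        while center in merged_into:
            center = merged_into[center]
        return center

    kept = [e for e in edges if e[2] >= threshold]
    for u, v, _ in sorted(kept, key=lambda e: e[2], reverse=True):
        for node, other in ((u, v), (v, u)):
            if node in cluster_of:
                continue                      # already assigned, skip
            if other in centers:
                cluster_of[node] = other      # join the star around `other`
            else:
                centers.add(node)             # first sighting: new center
                cluster_of[node] = node
        if merge and (u in centers or v in centers):
            root_u, root_v = root(cluster_of[u]), root(cluster_of[v])
            if root_u != root_v:
                merged_into[root_v] = root_u  # merge the two clusters

    clusters = defaultdict(set)
    for node, center in cluster_of.items():
        clusters[root(center)].add(node)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical column ids and similarity values, for illustration only.
example_edges = [
    ("a", "b", 0.9), ("b", "c", 0.8), ("a", "d", 0.7),
    ("x", "y", 0.85), ("d", "e", 0.6),
]
center_result = center_partition(example_edges, threshold=0.5)
merge_center_result = center_partition(example_edges, threshold=0.5, merge=True)
```

On this toy input, the Center variant leaves `c` and `e` in their own clusters because they are only similar to non-center members, whereas the MergeCenter variant folds them into the cluster around `a`.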
Let $\mathit{freq}(s)$ be the number of tables with schema $s$ in the corpus and $\mathit{schema\_freq}(a)$ be the number of tables that contain attribute $a$:

$$\mathit{schema\_freq}(a) = \sum_{\{s \,\mid\, s \in S \,\wedge\, a \in s\}} \mathit{freq}(s) \tag{1}$$

Then the probability of encountering $a$ in any table in the corpus is

$$p(a) = \frac{\mathit{schema\_freq}(a)}{\sum_{s \in S} \mathit{freq}(s)} \tag{2}$$

The number of tables that contain two attributes $a_1, a_2$ is defined analogously as $\mathit{schema\_freq}(a_1, a_2)$. The conditional probability of seeing attribute $a_1$ given $a_2$ is

$$p(a_1 \mid a_2) = \frac{\mathit{schema\_freq}(a_1, a_2)}{\mathit{schema\_freq}(a_2)} \tag{3}$$

and the joint probability is

$$p(a_1, a_2) = \frac{\mathit{schema\_freq}(a_1, a_2)}{\sum_{s \in S} \mathit{freq}(s)} \tag{4}$$

Attribute Ranking Methods. We now define the methods that are used to calculate a score for each attribute, which is then used to rank all unknown attributes. A higher score indicates a higher relevance of the attribute for the schema extension task.

Conditional Probability based on Class. This measure captures how likely it is to encounter the attribute $a$ given the class $C$ in the ontology [9]. If each schema is mapped to a class $C$ and $\mathit{schema\_freq}(a, C)$ is the number of tables mapped to $C$ that contain $a$, we can define the conditional probability of encountering an attribute based on the class as in Equation 5, where $S_C$ is the schema of class $C$. This measure only considers the class mapping of the web tables, irrespective of the presence of known attributes in the same web table.

$$p(a \mid C) = \frac{\mathit{schema\_freq}(a, C)}{\sum_{a_2 \in S_C} \mathit{schema\_freq}(a_2, C)} \tag{5}$$

Schema Consistency. This measure reflects the likelihood of seeing a new attribute together with the existing attributes [4]. It is based on the conditional probability derived from the co-occurrence statistics. This measure considers all known attributes which co-occur with the new attribute $a$, i.e., the more known attributes co-occur, the higher the score.

$$\mathit{SchemaConsistency}(a, s) = \frac{1}{|s|} \sum_{a_2 \in s} p(a \mid a_2) \tag{6}$$

Schema Coherency. Based on Point-wise Mutual Information (PMI), schema coherency is the average of the PMI scores of all possible attribute combinations [3]; Equation 7 uses the normalised variant $\mathit{npmi}$ defined in Equation 8. The PMI score of two attributes is positive if the attributes are correlated, zero if they are independent, and negative if they are negatively correlated.

$$\mathit{SchemaCoherency}(a, s) = \frac{1}{|s|} \sum_{a_1 \in s} \mathit{npmi}(a_1, a) \tag{7}$$

$$\mathit{npmi}(a_1, a_2) = \frac{1}{-\log p(a_1, a_2)} \, \log \frac{p(a_1, a_2)}{p(a_1)\, p(a_2)} \tag{8}$$
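The co-occurrence statistics and the two co-occurrence-based ranking measures (Equations 3, 6, 7 and 8) can be sketched in a few lines. This is only an illustration under stated assumptions: each table is assumed to be already reduced to the set of attribute (cluster) identifiers its columns map to, all function names are hypothetical, and the conventions of returning -1 for attribute pairs that never co-occur and 1 for pairs that co-occur in every table are choices made here to avoid undefined logarithms and division by zero.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_stats(tables):
    """Count single and pairwise attribute occurrences over all tables.

    tables: iterable of attribute collections, one per table, where each
    attribute is already an identifier of a known property or a cluster.
    """
    single, pair, total = Counter(), Counter(), 0
    for attrs in tables:
        schema = frozenset(attrs)  # order and column headers are irrelevant
        total += 1
        single.update(schema)
        pair.update(frozenset(p) for p in combinations(sorted(schema), 2))
    return single, pair, total


def p_cond(a, given, single, pair):
    """p(a | given), following Equation 3."""
    if single[given] == 0:
        return 0.0
    return pair[frozenset((a, given))] / single[given]


def npmi(a1, a2, single, pair, total):
    """Normalised PMI, following Equation 8 (edge cases are assumptions)."""
    joint = pair[frozenset((a1, a2))] / total
    if joint == 0:
        return -1.0  # assumed convention: the attributes never co-occur
    if joint == 1:
        return 1.0   # assumed convention: both attributes are in every table
    pmi = math.log(joint / ((single[a1] / total) * (single[a2] / total)))
    return pmi / -math.log(joint)


def schema_consistency(a, known, single, pair):
    """Average p(a | k) over the known attributes k (Equation 6)."""
    return sum(p_cond(a, k, single, pair) for k in known) / len(known)


def schema_coherency(a, known, single, pair, total):
    """Average npmi(k, a) over the known attributes k (Equation 7)."""
    return sum(npmi(k, a, single, pair, total) for k in known) / len(known)


# Toy corpus with hypothetical attribute identifiers.
toy_tables = [
    {"name", "population", "capital"},
    {"name", "population", "capital"},
    {"name", "population"},
    {"name", "fax"},
]
known_attrs = ["name", "population"]  # must be non-empty
single, pair, total = cooccurrence_stats(toy_tables)
```

In this toy corpus, `capital` co-occurs with both known attributes and therefore scores higher than `fax` under both measures, which only ever appears next to `name`.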

4 Experiments

4.1 Experiments on T2D Gold Standard

Our first set of experiments is performed on the T2D Gold Standard [12], which was originally developed to evaluate systems for the web table to knowledge base matching task (using DBpedia as the knowledge base).

Identifying Equal Attributes. We evaluate the different matching and partitioning approaches introduced in Sections 3.1 and 3.2. The gold standard contains mappings from the web table columns to properties in the ontology. As we are interested in finding partitions of columns which represent the same attribute, we create one partition for each property in the ontology, which contains all columns that are mapped to this property. We then apply the different methods to all columns of the web tables in the gold standard and measure the degree to which we can reconstruct these partitions.

Figure 1 shows a comparison of the different partitioning approaches. The x-axis depicts the similarity threshold and the y-axis shows the resulting F1-score. We can see that the best performance is achieved with rather low thresholds and the Center algorithm.

Fig. 1. Evaluation of similarity graph partitioning methods.

Figure 2 shows the quality of the clusterings using different matching approaches. Again, the x-axis depicts the similarity threshold and the y-axis shows the resulting F1-score. We see that the label-based matching with string similarity outperforms the instance-based approaches. The reason for the good performance of the label-based approach is that the web tables are grouped by the class in the ontology to which they are mapped, and hence column headers are in most cases not ambiguous. The rather poor performance of the instance-based approaches is explained by the fact that web tables usually have only very few rows, so there might not be enough overlap among the columns from different tables.

Fig. 2. Evaluation of matching methods.

Combining the instance-based and label-based approaches into a hybrid matcher did not significantly improve the performance compared to the label-based approach. Closer inspection of the results showed that this is due to the gold standard used, which contains mostly tables with column labels of high quality.

Attribute Ranking. For a subset of the T2D tables, those mapped to the country class, we manually label all columns as either useful or not useful. In total, this subset contains 207 columns, of which 86 are annotated as useful. We then evaluate the performance of the different ranking methods. Figure 3 shows the precision@k and recall@k achieved by the different ranking approaches.

In addition to the ranking methods described in Section 3.3, we further evaluate each of the ranking methods in a variant that is weighted by PageRank. The intuition is that web pages with a high PageRank likely contain useful content and hence the web tables on these pages also contain relevant attributes. The PageRank values used are obtained from the publicly available Common Crawl WWW Ranking. For each partition of columns, we use the maximum PageRank of all source web pages and multiply it with the score that was calculated by the ranking method.

Among the different ranking methods, schema consistency performs best, followed by schema coherency. The variations with PageRank perform worst, which might be caused by the rather small number of web sites in the gold standard.

Fig. 3. Precision@k and recall@k achieved by the different ranking methods using the Key/Value matcher.

Remove One Attribute Experiment. The assessment of the usefulness of an attribute can be subjective. Hence we design another experiment, in which we remove one existing attribute from the ontology for several classes. As this attribute already existed, we can objectively say that it is useful. We then measure the quality of the first cluster that resembles this attribute and also the rank at which we can find it in the output. We use the following classes and attributes in this experiment: Company (industry), Country (population), Film (year), Mountain (height), Plant (family), VideoGame (genre).

The left chart in Figure 4 shows the average rank of the first attribute cluster which matches the removed attribute over all ranking methods by matcher. The bar No Matching shows the result of using neither correspondences to the ontology nor any of the matching approaches, i.e., attributes are equal only if their column headers match exactly. The results without prior mapping knowledge show the importance of matching attributes before calculating the ranking functions. Without mapping knowledge, attribute frequencies are underestimated, and the respective attribute is ranked too low.

The right chart in Figure 4 shows the average rank over all matching methods by ranking method. Again, the schema consistency ranking performs best and the variations including PageRank consistently perform worse.

Fig. 4. Rank of the first cluster matching the removed attribute. Left: by matcher. Right: by ranking method.

4.2 Experiments on WDC Table Corpus

We now repeat our experiments on the WDC Web Tables Corpus, which contains 147 million relational web tables. To give an overall impression of the full corpus, Figure 5 (left) shows the number of new columns and clusters that we can generate for selected classes. These numbers show the large number of potentially new attributes that can be found in the corpus.

Attribute Ranking. As we have no gold standard for the full corpus, we manually annotate the top 15 ranked clusters of each ranking method for several classes as either useful or not useful. Figure 5 (right) shows the performance of each method averaged over all classes in terms of precision@15. The results show again that the schema coherency and consistency measures outperform the conditional measure. This indicates that attribute co-occurrence is a stronger signal than the pure frequency of attributes, even if conditioned on a class from the ontology.
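The precision@k and recall@k figures used in the ranking evaluation can be computed as in the following sketch. The function names and the toy ranking are hypothetical, and the choice to divide precision by k even when fewer than k items are ranked is an assumption, since the text does not state how short rankings were handled.

```python
def precision_at_k(ranked, useful, k):
    """Fraction of the top-k ranked items that are annotated as useful."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked[:k] if item in useful)
    return hits / k  # assumed convention: always divide by k


def recall_at_k(ranked, useful, k):
    """Fraction of all useful items that appear in the top-k ranked items."""
    if not useful:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in useful)
    return hits / len(useful)


# Hypothetical ranking of attribute clusters and manual annotations.
toy_ranking = ["capital", "fax", "currency", "id", "anthem"]
toy_useful = {"capital", "currency", "anthem"}
p_at_3 = precision_at_k(toy_ranking, toy_useful, 3)
r_at_3 = recall_at_k(toy_ranking, toy_useful, 3)
```

For the WDC experiments only the top 15 clusters per method are annotated, so only precision@15 can be reported there; recall would require annotations for the full output.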

Fig. 5. Left: Number of attributes and clusters that do not exist in the ontology. Right: Manual evaluation of the usefulness of new attributes.

Remove One Attribute Experiment. Again, to obtain a more objective view of the results, we remove one attribute from DBpedia as before and find the top-ranked attribute cluster which matches the removed attribute. The left chart in Figure 6 shows the rank of these clusters by matcher and the right chart by ranking method. Concerning the matching approach, we now find that the label-based and key/value-based methods achieve comparable results. The difference to the experiment on the gold standard is that we take into account a much larger number of tables and hence have more variety and a more realistic sample of the data quality. If we compare both matching approaches to a baseline approach (No Matching), which does not use the prior knowledge of the mappings to the ontology, we can again see that the ranking results of the baseline are worse.

Looking at the different ranking methods, we see a result that differs from the previous results: the Conditional Probability ranking now performs best. A possible explanation is that the attributes that we removed are quite common. Hence, many tables have such attributes and ranking by frequency is sufficient. Another interesting observation is that the PageRank weighting now makes a difference. Although the results are still worse with it than without, we can presume that a reasonable evaluation of a ranking method incorporating PageRank requires the use of a large corpus.

Fig. 6. Rank of the first cluster matching the removed attribute. Left: by matcher. Right: by ranking method.
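The evaluation measure of the remove-one-attribute experiment, together with the PageRank-weighted score variant, can be sketched as follows. How a cluster is judged to match the removed attribute is not specified in the text, so the sketch takes a caller-supplied predicate; the function names, cluster identifiers, scores, and PageRank values are all hypothetical.

```python
def pagerank_weighted(score, source_pageranks):
    """Multiply a ranking score by the maximum PageRank of the source pages."""
    return score * max(source_pageranks)


def rank_of_first_match(cluster_scores, is_match):
    """Return the 1-based rank of the best-scored matching cluster, or None.

    cluster_scores: dict mapping a cluster id to its ranking score.
    is_match: predicate deciding whether a cluster matches the removed
    attribute (assumed to be provided by the evaluation setup).
    """
    ordered = sorted(cluster_scores, key=cluster_scores.get, reverse=True)
    for rank, cluster in enumerate(ordered, start=1):
        if is_match(cluster):
            return rank
    return None


# Hypothetical scores for three clusters of the Country class.
toy_scores = {"c_area": 0.8, "c_population": 0.6, "c_misc": 0.3}
toy_pageranks = {"c_area": [0.1, 0.2], "c_population": [0.9], "c_misc": [0.5]}


def matches_removed(cluster):
    return cluster == "c_population"


plain_rank = rank_of_first_match(toy_scores, matches_removed)
weighted_scores = {
    cluster: pagerank_weighted(score, toy_pageranks[cluster])
    for cluster, score in toy_scores.items()
}
weighted_rank = rank_of_first_match(weighted_scores, matches_removed)
```

A lower rank is better: the removed attribute is known to be useful, so a good combination of matcher and ranking method should surface its cluster near the top of the output.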

5 Conclusion & Future Work

In summary, the results of our experiments show that:

- It is feasible to use a large corpus of structured data from the Web to augment an ontology. In particular, we are able to augment a general-domain ontology such as DBpedia with millions of Web Tables extracted from the Web. We manually verified that a number of attributes ranked highly by our algorithms were strong candidates for augmenting the DBpedia ontology, and such augmentations would enable new applications of the ontology.
- Our results comparing different algorithms were mixed, without a clear winner across all experiments. The size of the gold standard and the classes chosen for manual verification clearly affected the relative performance of the algorithms. This calls for larger benchmarks, more comprehensive evaluation, and hybrid/ensemble methods that effectively take advantage of the benefits of each of the algorithms.

Future work also includes:

1) extending our framework to include more advanced matching techniques, particularly from recent work in ontology matching;
2) evaluation on other sources of structured data (e.g., open data portals such as data.gov) and on other ontologies;
3) extending the augmentation to relations and classes of the ontology;
4) using the same quality metrics for ontology augmentation from textual and semi-structured sources, and evaluating how well structured data on the Web can contribute to building and augmenting an ontology in comparison with textual and semi-structured sources.

References

1. Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. DBpedia: A nucleus for a web of open data. Springer.
2. Michael J Cafarella, Alon Halevy, and Nodira Khoussainova. Data integration for the relational web. Proceedings of the VLDB Endowment, 2(1).
3. Michael J Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang. WebTables: Exploring the power of tables on the web. Proceedings of the VLDB Endowment, 1(1).
4. Anish Das Sarma, Lujun Fang, Nitin Gupta, Alon Halevy, Hongrae Lee, Fei Wu, Reynold Xin, and Cong Yu. Finding related tables. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM.
5. Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. Knowledge Vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.
6. Jason Ellis, Achille Fokoue, Oktie Hassanzadeh, Anastasios Kementsietsidis, Kavitha Srinivas, and Michael J Ward. Exploring big data with Helix: Finding needles in a big haystack. ACM SIGMOD Record, 43(4):43-54, 2015.
7. Rahul Gupta, Alon Halevy, Xuezhi Wang, Steven Euijong Whang, and Fei Wu. Biperpedia: An ontology for search applications. Proceedings of the VLDB Endowment, 7(7).
8. Oktie Hassanzadeh, Fei Chiang, Hyun Chul Lee, and Renée J Miller. Framework for evaluating clustering algorithms in duplicate detection. Proceedings of the VLDB Endowment, 2(1).
9. Taesung Lee, Zhongyuan Wang, Haixun Wang, and Seung-won Hwang. Attribute extraction and scoring: A probabilistic approach. In Data Engineering (ICDE), 2013 IEEE 29th International Conference on. IEEE.
10. Oliver Lehmberg, Dominique Ritze, Petar Ristoski, Robert Meusel, Heiko Paulheim, and Christian Bizer. The Mannheim Search Join Engine. Web Semantics: Science, Services and Agents on the World Wide Web.
11. Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11-33.
12. Dominique Ritze, Oliver Lehmberg, and Christian Bizer. Matching HTML tables to DBpedia. In Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics, page 10. ACM.
13. Dominique Ritze, Oliver Lehmberg, Yaser Oulabi, and Christian Bizer. Profiling the potential of web tables for augmenting cross-domain knowledge bases. In Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee.
14. Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Fei Wu, Gengxin Miao, and Chung Wu. Recovering semantics of tables on the web. Proceedings of the VLDB Endowment, 4(9).
15. Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q Zhu. Probase: A probabilistic taxonomy for text understanding. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM.
16. Mohamed Yakout, Kris Ganjam, Kaushik Chakrabarti, and Surajit Chaudhuri. InfoGather: Entity augmentation and attribute discovery by holistic matching with web tables. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 2012.


More information

Exploiting Semantics Where We Find Them

Exploiting Semantics Where We Find Them Vrije Universiteit Amsterdam 19/06/2018 Exploiting Semantics Where We Find Them A Bottom-up Approach to the Semantic Web Prof. Dr. Christian Bizer Bizer: Exploiting Semantics Where We Find Them. VU Amsterdam,

More information

Comprehensive and Progressive Duplicate Entities Detection

Comprehensive and Progressive Duplicate Entities Detection Comprehensive and Progressive Duplicate Entities Detection Veerisetty Ravi Kumar Dept of CSE, Benaiah Institute of Technology and Science. Nagaraju Medida Assistant Professor, Benaiah Institute of Technology

More information

arxiv: v1 [cs.ir] 13 May 2018

arxiv: v1 [cs.ir] 13 May 2018 arxiv:1805.04875v1 [cs.ir] 13 May 2018 ABSTRACT Shuo Zhang University of Stavanger shuo.zhang@uis.no Many information needs revolve around entities, which would be better answered by summarizing results

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Outline. Part I. Introduction Part II. ML for DI. Part III. DI for ML Part IV. Conclusions and research direction

Outline. Part I. Introduction Part II. ML for DI. Part III. DI for ML Part IV. Conclusions and research direction Outline Part I. Introduction Part II. ML for DI ML for entity linkage ML for data extraction ML for data fusion ML for schema alignment Part III. DI for ML Part IV. Conclusions and research direction Data

More information

On-the-fly Table Generation

On-the-fly Table Generation On-the-fly Table Generation ABSTRACT Shuo Zhang University of Stavanger shuo.zhang@uis.no Many information needs revolve around entities, which would be better answered by summarizing results in a tabular

More information

Structured Data on the Web

Structured Data on the Web Structured Data on the Web Alon Halevy Google Australasian Computer Science Week January, 2010 Structured Data & The Web Andree Hudson, 4 th of July Hard to find structured data via search engines

More information

Top-k Keyword Search Over Graphs Based On Backward Search

Top-k Keyword Search Over Graphs Based On Backward Search Top-k Keyword Search Over Graphs Based On Backward Search Jia-Hui Zeng, Jiu-Ming Huang, Shu-Qiang Yang 1College of Computer National University of Defense Technology, Changsha, China 2College of Computer

More information

Holistic and Compact Selectivity Estimation for Hybrid Queries over RDF Graphs

Holistic and Compact Selectivity Estimation for Hybrid Queries over RDF Graphs Holistic and Compact Selectivity Estimation for Hybrid Queries over RDF Graphs Authors: Andreas Wagner, Veli Bicer, Thanh Tran, and Rudi Studer Presenter: Freddy Lecue IBM Research Ireland 2014 International

More information

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN: IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T

More information

Ten Years of WebTables

Ten Years of WebTables Ten Years of WebTables Michael Cafarella Alon Halevy University of Michigan Megagon Labs michjc@umich.edu alon@megagon.ai Hongrae Lee Jayant Madhavan Cong Yu Google, Inc. {hrlee, jayant, congyu}@google.com

More information

DC Proposal: Graphical Models and Probabilistic Reasoning for Generating Linked Data from Tables

DC Proposal: Graphical Models and Probabilistic Reasoning for Generating Linked Data from Tables DC Proposal: Graphical Models and Probabilistic Reasoning for Generating Linked Data from Tables Varish Mulwad Computer Science and Electrical Engineering University of Maryland, Baltimore County varish1@cs.umbc.edu

More information

A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2

A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2 A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2 1 Student, M.E., (Computer science and Engineering) in M.G University, India, 2 Associate Professor

More information

A Survey on Information Extraction in Web Searches Using Web Services

A Survey on Information Extraction in Web Searches Using Web Services A Survey on Information Extraction in Web Searches Using Web Services Maind Neelam R., Sunita Nandgave Department of Computer Engineering, G.H.Raisoni College of Engineering and Management, wagholi, India

More information

Published by: PIONEER RESEARCH & DEVELOPMENT GROUP ( 1

Published by: PIONEER RESEARCH & DEVELOPMENT GROUP (  1 Cluster Based Speed and Effective Feature Extraction for Efficient Search Engine Manjuparkavi A 1, Arokiamuthu M 2 1 PG Scholar, Computer Science, Dr. Pauls Engineering College, Villupuram, India 2 Assistant

More information

Collaborative Filtering using a Spreading Activation Approach

Collaborative Filtering using a Spreading Activation Approach Collaborative Filtering using a Spreading Activation Approach Josephine Griffith *, Colm O Riordan *, Humphrey Sorensen ** * Department of Information Technology, NUI, Galway ** Computer Science Department,

More information

Tag Based Image Search by Social Re-ranking

Tag Based Image Search by Social Re-ranking Tag Based Image Search by Social Re-ranking Vilas Dilip Mane, Prof.Nilesh P. Sable Student, Department of Computer Engineering, Imperial College of Engineering & Research, Wagholi, Pune, Savitribai Phule

More information

1Table A System for Managing Structured Web Data. Yang Zhang with: Alon Halevy, Mike Cafarella, Nodira Khoussainova, Eugene Wu and Daisy Zhe

1Table A System for Managing Structured Web Data. Yang Zhang with: Alon Halevy, Mike Cafarella, Nodira Khoussainova, Eugene Wu and Daisy Zhe 1Table A System for Managing Structured Web Data Yang Zhang with: Alon Halevy, Mike Cafarella, Nodira Khoussainova, Eugene Wu and Daisy Zhe Structured Web Data No Web is more than just text tables, tags,

More information

Recovering Semantics of Tables on the Web

Recovering Semantics of Tables on the Web Recovering Semantics of Tables on the Web Petros Venetis Alon Halevy Jayant Madhavan Marius Paşca Stanford University Google Inc. Google Inc. Google Inc. venetis@cs.stanford.edu halevy@google.com jayant@google.com

More information

Entity Linking in Web Tables with Multiple Linked Knowledge Bases

Entity Linking in Web Tables with Multiple Linked Knowledge Bases Entity Linking in Web Tables with Multiple Linked Knowledge Bases Tianxing Wu, Shengjia Yan, Zhixin Piao, Liang Xu, Ruiming Wang, Guilin Qi School of Computer Science and Engineering, Southeast University,

More information

A Framework for Securing Databases from Intrusion Threats

A Framework for Securing Databases from Intrusion Threats A Framework for Securing Databases from Intrusion Threats R. Prince Jeyaseelan James Department of Computer Applications, Valliammai Engineering College Affiliated to Anna University, Chennai, India Email:

More information

Data Services Leveraging Bing s Data Assets

Data Services Leveraging Bing s Data Assets Data s Leveraging Bing s Data Assets Kaushik Chakrabarti, Surajit Chaudhuri, Zhimin Chen, Kris Ganjam, Yeye He Microsoft Research Redmond, WA {kaushik, surajitc, zmchen, krisgan, yeyehe}@microsoft.com

More information

Principles of Dataspaces

Principles of Dataspaces Principles of Dataspaces Seminar From Databases to Dataspaces Summer Term 2007 Monika Podolecheva University of Konstanz Department of Computer and Information Science Tutor: Prof. M. Scholl, Alexander

More information

TAIPAN: Automatic Property Mapping for Tabular Data

TAIPAN: Automatic Property Mapping for Tabular Data TAIPAN: Automatic Property Mapping for Tabular Data Ivan Ermilov and Axel-Cyrille Ngonga Ngomo University of Leipzig, Institute of Computer Science, Leipzig, Germany iermilov,ngonga@informatik.uni-leipzig.de

More information

Papers for comprehensive viva-voce

Papers for comprehensive viva-voce Papers for comprehensive viva-voce Priya Radhakrishnan Advisor : Dr. Vasudeva Varma Search and Information Extraction Lab, International Institute of Information Technology, Gachibowli, Hyderabad, India

More information

Improving Open Data Usability through Semantics

Improving Open Data Usability through Semantics Improving Open Data Usability through Semantics PhD research proposal Sebastian Neumaier Vienna University of Economics and Business, Vienna, Austria sebastian.neumaier@wu.ac.at Abstract. With the success

More information

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES Mu. Annalakshmi Research Scholar, Department of Computer Science, Alagappa University, Karaikudi. annalakshmi_mu@yahoo.co.in Dr. A.

More information

Doctoral Thesis Proposal Learning Semantics of WikiTables

Doctoral Thesis Proposal Learning Semantics of WikiTables Doctoral Thesis Proposal Learning Semantics of WikiTables Chandra Sekhar Bhagavatula Department of Electrical Engineering and Computer Science Northwestern University csbhagav@u.northwestern.edu December

More information

ImgSeek: Capturing User s Intent For Internet Image Search

ImgSeek: Capturing User s Intent For Internet Image Search ImgSeek: Capturing User s Intent For Internet Image Search Abstract - Internet image search engines (e.g. Bing Image Search) frequently lean on adjacent text features. It is difficult for them to illustrate

More information

HOLISTIC DATA INTEGRATION FOR BIG DATA

HOLISTIC DATA INTEGRATION FOR BIG DATA HOLISTIC DATA INTEGRATION FOR BIG DATA ERHARD RAHM, UNIVERSITY OF LEIPZIG, AUGUST 2016 www.scads.de GERMAN CENTERS FOR BIG DATA Two Centers of Excellence for Big Data in Germany ScaDS Dresden/Leipzig Berlin

More information

MEASURING SEMANTIC SIMILARITY BETWEEN WORDS AND IMPROVING WORD SIMILARITY BY AUGUMENTING PMI

MEASURING SEMANTIC SIMILARITY BETWEEN WORDS AND IMPROVING WORD SIMILARITY BY AUGUMENTING PMI MEASURING SEMANTIC SIMILARITY BETWEEN WORDS AND IMPROVING WORD SIMILARITY BY AUGUMENTING PMI 1 KAMATCHI.M, 2 SUNDARAM.N 1 M.E, CSE, MahaBarathi Engineering College Chinnasalem-606201, 2 Assistant Professor,

More information

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

More information

DBpedia-An Advancement Towards Content Extraction From Wikipedia

DBpedia-An Advancement Towards Content Extraction From Wikipedia DBpedia-An Advancement Towards Content Extraction From Wikipedia Neha Jain Government Degree College R.S Pura, Jammu, J&K Abstract: DBpedia is the research product of the efforts made towards extracting

More information

Synthesizing Union Tables from the Web

Synthesizing Union Tables from the Web Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence Synthesizing Union Tables from the Web Xiao Ling xiaoling@cs.washington.edu University of Washington Alon Halevy

More information

Context Sensitive Search Engine

Context Sensitive Search Engine Context Sensitive Search Engine Remzi Düzağaç and Olcay Taner Yıldız Abstract In this paper, we use context information extracted from the documents in the collection to improve the performance of the

More information

Tools and Infrastructure for Supporting Enterprise Knowledge Graphs

Tools and Infrastructure for Supporting Enterprise Knowledge Graphs Tools and Infrastructure for Supporting Enterprise Knowledge Graphs Sumit Bhatia, Nidhi Rajshree, Anshu Jain, and Nitish Aggarwal IBM Research sumitbhatia@in.ibm.com, {nidhi.rajshree,anshu.n.jain}@us.ibm.com,nitish.aggarwal@ibm.com

More information

TSS: A Hybrid Web Searches

TSS: A Hybrid Web Searches 410 TSS: A Hybrid Web Searches Li-Xin Han 1,2,3, Gui-Hai Chen 3, and Li Xie 3 1 Department of Mathematics, Nanjing University, Nanjing 210093, P.R. China 2 Department of Computer Science and Engineering,

More information

RiMOM Results for OAEI 2010

RiMOM Results for OAEI 2010 RiMOM Results for OAEI 2010 Zhichun Wang 1, Xiao Zhang 1, Lei Hou 1, Yue Zhao 2, Juanzi Li 1, Yu Qi 3, Jie Tang 1 1 Tsinghua University, Beijing, China {zcwang,zhangxiao,greener,ljz,tangjie}@keg.cs.tsinghua.edu.cn

More information

Computer-assisted Ontology Construction System: Focus on Bootstrapping Capabilities

Computer-assisted Ontology Construction System: Focus on Bootstrapping Capabilities Computer-assisted Ontology Construction System: Focus on Bootstrapping Capabilities Omar Qawasmeh 1, Maxime Lefranois 2, Antoine Zimmermann 2, Pierre Maret 1 1 Univ. Lyon, CNRS, Lab. Hubert Curien UMR

More information

Keywords Data alignment, Data annotation, Web database, Search Result Record

Keywords Data alignment, Data annotation, Web database, Search Result Record Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Annotating Web

More information

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How

More information

Quadrant-Based MBR-Tree Indexing Technique for Range Query Over HBase

Quadrant-Based MBR-Tree Indexing Technique for Range Query Over HBase Quadrant-Based MBR-Tree Indexing Technique for Range Query Over HBase Bumjoon Jo and Sungwon Jung (&) Department of Computer Science and Engineering, Sogang University, 35 Baekbeom-ro, Mapo-gu, Seoul 04107,

More information

Extraction of Web Image Information: Semantic or Visual Cues?

Extraction of Web Image Information: Semantic or Visual Cues? Extraction of Web Image Information: Semantic or Visual Cues? Georgina Tryfou and Nicolas Tsapatsoulis Cyprus University of Technology, Department of Communication and Internet Studies, Limassol, Cyprus

More information

IMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL

IMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL IMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL Lim Bee Huang 1, Vimala Balakrishnan 2, Ram Gopal Raj 3 1,2 Department of Information System, 3 Department

More information

Learning mappings and queries

Learning mappings and queries Learning mappings and queries Marie Jacob University Of Pennsylvania DEIS 2010 1 Schema mappings Denote relationships between schemas Relates source schema S and target schema T Defined in a query language

More information

Ad Hoc Table Retrieval using Semantic Similarity

Ad Hoc Table Retrieval using Semantic Similarity Ad Hoc Table Retrieval using Semantic Similarity ABSTRACT Shuo Zhang University of Stavanger shuo.zhang@uis.no We introduce and address the problem of ad hoc table retrieval: answering a keyword query

More information

Recommendation System for Location-based Social Network CS224W Project Report

Recommendation System for Location-based Social Network CS224W Project Report Recommendation System for Location-based Social Network CS224W Project Report Group 42, Yiying Cheng, Yangru Fang, Yongqing Yuan 1 Introduction With the rapid development of mobile devices and wireless

More information

Disambiguating Entities Referred by Web Endpoints using Tree Ensembles

Disambiguating Entities Referred by Web Endpoints using Tree Ensembles Disambiguating Entities Referred by Web Endpoints using Tree Ensembles Gitansh Khirbat Jianzhong Qi Rui Zhang Department of Computing and Information Systems The University of Melbourne Australia gkhirbat@student.unimelb.edu.au

More information

PRIOR System: Results for OAEI 2006

PRIOR System: Results for OAEI 2006 PRIOR System: Results for OAEI 2006 Ming Mao, Yefei Peng University of Pittsburgh, Pittsburgh, PA, USA {mingmao,ypeng}@mail.sis.pitt.edu Abstract. This paper summarizes the results of PRIOR system, which

More information

Using Linked Data to Reduce Learning Latency for e-book Readers

Using Linked Data to Reduce Learning Latency for e-book Readers Using Linked Data to Reduce Learning Latency for e-book Readers Julien Robinson, Johann Stan, and Myriam Ribière Alcatel-Lucent Bell Labs France, 91620 Nozay, France, Julien.Robinson@alcatel-lucent.com

More information

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures Springer Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web

More information

Ontology Based Prediction of Difficult Keyword Queries

Ontology Based Prediction of Difficult Keyword Queries Ontology Based Prediction of Difficult Keyword Queries Lubna.C*, Kasim K Pursuing M.Tech (CSE)*, Associate Professor (CSE) MEA Engineering College, Perinthalmanna Kerala, India lubna9990@gmail.com, kasim_mlp@gmail.com

More information

Data Cleansing. LIU Jingyuan, Vislab WANG Yilei, Theoretical group

Data Cleansing. LIU Jingyuan, Vislab WANG Yilei, Theoretical group Data Cleansing LIU Jingyuan, Vislab WANG Yilei, Theoretical group What is Data Cleansing Data cleansing (data cleaning) is the process of detecting and correcting (or removing) errors or inconsistencies

More information

A Korean Knowledge Extraction System for Enriching a KBox

A Korean Knowledge Extraction System for Enriching a KBox A Korean Knowledge Extraction System for Enriching a KBox Sangha Nam, Eun-kyung Kim, Jiho Kim, Yoosung Jung, Kijong Han, Key-Sun Choi KAIST / The Republic of Korea {nam.sangha, kekeeo, hogajiho, wjd1004109,

More information

Lessons Learned and Research Agenda for Big Data Integration of Product Specifications (Discussion Paper)

Lessons Learned and Research Agenda for Big Data Integration of Product Specifications (Discussion Paper) Lessons Learned and Research Agenda for Big Data Integration of Product Specifications (Discussion Paper) Luciano Barbosa 1, Valter Crescenzi 2, Xin Luna Dong 3, Paolo Merialdo 2, Federico Piai 2, Disheng

More information

Sentiment analysis under temporal shift

Sentiment analysis under temporal shift Sentiment analysis under temporal shift Jan Lukes and Anders Søgaard Dpt. of Computer Science University of Copenhagen Copenhagen, Denmark smx262@alumni.ku.dk Abstract Sentiment analysis models often rely

More information

Knowledge Based Systems Text Analysis

Knowledge Based Systems Text Analysis Knowledge Based Systems Text Analysis Dr. Shubhangi D.C 1, Ravikiran Mitte 2 1 H.O.D, 2 P.G.Student 1,2 Department of Computer Science and Engineering, PG Center Regional Office VTU, Kalaburagi, Karnataka

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

A Review on Identifying the Main Content From Web Pages

A Review on Identifying the Main Content From Web Pages A Review on Identifying the Main Content From Web Pages Madhura R. Kaddu 1, Dr. R. B. Kulkarni 2 1, 2 Department of Computer Scienece and Engineering, Walchand Institute of Technology, Solapur University,

More information

Mining Summaries for Knowledge Graph Search

Mining Summaries for Knowledge Graph Search Mining Summaries for Knowledge Graph Search Qi Song Yinghui Wu Xin Luna Dong 2 Washington State University 2 Google Inc. {qsong, yinghui}@eecs.wsu.edu lunadong@google.com Abstract Mining and searching

More information

Supporting Fuzzy Keyword Search in Databases

Supporting Fuzzy Keyword Search in Databases I J C T A, 9(24), 2016, pp. 385-391 International Science Press Supporting Fuzzy Keyword Search in Databases Jayavarthini C.* and Priya S. ABSTRACT An efficient keyword search system computes answers as

More information

Topic Classification in Social Media using Metadata from Hyperlinked Objects

Topic Classification in Social Media using Metadata from Hyperlinked Objects Topic Classification in Social Media using Metadata from Hyperlinked Objects Sheila Kinsella 1, Alexandre Passant 1, and John G. Breslin 1,2 1 Digital Enterprise Research Institute, National University

More information

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING Chapter 3 EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING 3.1 INTRODUCTION Generally web pages are retrieved with the help of search engines which deploy crawlers for downloading purpose. Given a query,

More information

Topic Diversity Method for Image Re-Ranking

Topic Diversity Method for Image Re-Ranking Topic Diversity Method for Image Re-Ranking D.Ashwini 1, P.Jerlin Jeba 2, D.Vanitha 3 M.E, P.Veeralakshmi M.E., Ph.D 4 1,2 Student, 3 Assistant Professor, 4 Associate Professor 1,2,3,4 Department of Information

More information

Automatically Generating Government Linked Data from Tables

Automatically Generating Government Linked Data from Tables Automatically Generating Government Linked Data from Tables Varish Mulwad, Tim Finin and Anupam Joshi Computer Science and Electrical Engineering University of Maryland, Baltimore County Baltimore, Maryland

More information

Context Based Indexing in Search Engines: A Review

Context Based Indexing in Search Engines: A Review International Journal of Computer (IJC) ISSN 2307-4523 (Print & Online) Global Society of Scientific Research and Researchers http://ijcjournal.org/ Context Based Indexing in Search Engines: A Review Suraksha

More information

DIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITY

DIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITY DIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITY Reham I. Abdel Monem 1, Ali H. El-Bastawissy 2 and Mohamed M. Elwakil 3 1 Information Systems Department, Faculty of computers and information,

More information

Top- k Entity Augmentation Using C onsistent Set C overing

Top- k Entity Augmentation Using C onsistent Set C overing Top- k Entity Augmentation Using C onsistent Set C overing J ulian Eberius*, Maik Thiele, Katrin Braunschweig and Wolfgang Lehner Technische Universität Dresden, Germany What is Entity Augmentation? ENTITY

More information

Introduction Entity Match Service. Step-by-Step Description

Introduction Entity Match Service. Step-by-Step Description Introduction Entity Match Service In order to incorporate as much institutional data into our central alumni and donor database (hereafter referred to as CADS ), we ve developed a comprehensive suite of

More information

Lily: Ontology Alignment Results for OAEI 2009

Lily: Ontology Alignment Results for OAEI 2009 Lily: Ontology Alignment Results for OAEI 2009 Peng Wang 1, Baowen Xu 2,3 1 College of Software Engineering, Southeast University, China 2 State Key Laboratory for Novel Software Technology, Nanjing University,

More information

Ontology Matching with CIDER: Evaluation Report for the OAEI 2008

Ontology Matching with CIDER: Evaluation Report for the OAEI 2008 Ontology Matching with CIDER: Evaluation Report for the OAEI 2008 Jorge Gracia, Eduardo Mena IIS Department, University of Zaragoza, Spain {jogracia,emena}@unizar.es Abstract. Ontology matching, the task

More information

HotMatch Results for OEAI 2012

HotMatch Results for OEAI 2012 HotMatch Results for OEAI 2012 Thanh Tung Dang, Alexander Gabriel, Sven Hertling, Philipp Roskosch, Marcel Wlotzka, Jan Ruben Zilke, Frederik Janssen, and Heiko Paulheim Technische Universität Darmstadt

More information

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015 University of Virginia Department of Computer Science CS 4501: Information Retrieval Fall 2015 2:00pm-3:30pm, Tuesday, December 15th Name: ComputingID: This is a closed book and closed notes exam. No electronic

More information

A Novel Approach for Restructuring Web Search Results by Feedback Sessions Using Fuzzy clustering

A Novel Approach for Restructuring Web Search Results by Feedback Sessions Using Fuzzy clustering A Novel Approach for Restructuring Web Search Results by Feedback Sessions Using Fuzzy clustering R.Dhivya 1, R.Rajavignesh 2 (M.E CSE), Department of CSE, Arasu Engineering College, kumbakonam 1 Asst.

More information