Demand-Weighted Completeness Prediction for a Knowledge Base

Andrew Hopkinson, Amit Gurdasani, Dave Palfrey and Arpit Mittal
Amazon Research Cambridge, Cambridge, UK
{hopkia, amitgurd, dpalfrey, mitarpit}@amazon.co.uk

Abstract

In this paper we introduce the notion of Demand-Weighted Completeness, allowing estimation of the completeness of a knowledge base with respect to how it is used. Defining an entity by its classes, we employ usage data to predict the distribution over relations for that entity. For example, instances of person in a knowledge base may require a birth date, name and nationality to be considered complete. These predicted relation distributions enable detection of important gaps in the knowledge base, and define the required facts for unseen entities. Such characterisation of the knowledge base can also quantify how usage and completeness change over time. We demonstrate a method to measure Demand-Weighted Completeness, and show that a simple neural network model performs well at this prediction task.

1 Introduction

Knowledge Bases (KBs) are widely used for representing information in a structured format. Such KBs, including Wikidata (Vrandečić and Krötzsch, 2014), Google Knowledge Vault (Dong et al., 2014), and YAGO (Suchanek et al., 2007), often store information as facts in the form of triples, consisting of two entities and a relation between them. KBs have many applications in fields such as machine translation, information retrieval and question answering (Ferrucci, 2012).

When considering a KB's suitability for a task, the primary considerations are the number of facts it contains (Färber et al., 2015) and the precision of those facts. One metric which is often overlooked is completeness. This can be defined as the proportion of facts about an entity that are present in the KB, as compared to an ideal KB which has every fact that can be known about that entity. For example, previous research (Suchanek et al., 2011; Min et al., 2013) has shown that between 69% and 99% of entities in popular KBs lack at least one relation that other entities in the same class have. As of 2016, Wikidata knows the father of only 2% of all people in the KB (Galárraga et al., 2017). Google found that 71% of people in Freebase have no known place of birth, and 75% have no known nationality (Dong et al., 2014).

Previous work has focused on a general concept of completeness, where all KB entities are expected to be fully complete, independent of how the KB is used (Motro, 1989; Razniewski et al., 2016; Zaveri et al., 2013). This is a problem because different use cases of a KB may have different completeness requirements. For this work, we were interested in determining a KB's completeness with respect to its query usage, which we term Demand-Weighted Completeness. For example, a relation used 100 times per day is more important than one used only twice per day.

1.1 Problem specification

We define our task as follows: given an entity E in a KB, and query usage data of the KB, predict the distribution of relations that E must have in order for 95% of queries about E to be answered successfully.

1.2 Motivation

Demand-Weighted Completeness allows us to predict both important missing relations for existing entities, and relations required for unseen entities. As a result we can target acquisition of sources to fill important KB gaps.
It is possible to be entirely reactive when addressing gaps in KB data. Failing queries can be examined and missing fields marked for investigation. However, this approach assumes that:

1. the same KB entity will be accessed again in future, making the data acquisition useful. This is far from guaranteed.
2. the KB already contains all entities needed. While this may hold for some use cases, the most useful KBs today grow and change to reflect a changing world.

Both assumptions become unnecessary with an abstract representation of entities, allowing generalization to predict usage. The appropriateness of the abstract representation can be measured by how well the model distinguishes different entity types, and how well the model predicts actual usage for a set of entities, either known or unknown.

Further, the Demand-Weighted Completeness of a KB with respect to a specific task can be used as a metric for system performance at that task. By identifying gaps in the KB, it allows targeting of specific improvements to achieve the greatest increase in completeness.

Our work is the first to consider KB completeness using the distribution of observed KB queries as a signal. This paper details a learning-based approach that predicts the required relation distributions for both seen and unseen class signatures (Section 3), and shows that a neural network model can generalize relation distributions efficiently and accurately compared to a baseline frequency-based approach (Section 6).

2 Related work

Previous work has studied the completeness of the individual properties or database tables over which queries are executed (Razniewski and Nutt, 2011; Razniewski et al., 2015). This approach is suitable for KBs or use cases where individual tables, and individual rows in those tables, are all of equal importance to the KB, or are queried separately.

Completeness of KBs has also been measured based on the cardinality of properties. Galárraga et al. (2017) and Mirza et al. (2016) estimated cardinality for several relations with respect to individual entities, yielding targeted completeness information for specific entities. This approach depends on the availability of relevant free text, and uses handcrafted regular expressions to extract the information, which can be noisy and doesn't scale to large numbers of relations.

    barackobama:
        person: 1
        politician: 1
        democrat: 1
        republican: 0
        writer: 1

Figure 1: Class signature for barackobama. Other entities with the same class membership will have the same signature.

The potential for metrics around completeness and dynamicity of a KB is explored in Zaveri et al. (2013), focusing on the task-independent idea of completeness, and on the temporal currency, volatility and timeliness of the KB contents. While their concept of timeliness has some similarities to demand-weighted completeness in its task-specific data currency, we focus more on how the demand varies over time, and how the completeness of the KB varies with respect to that change in demand.

3 Representing Entities

3.1 Class Distributions

The data for a single entity does not generalize on its own. In order to generalize from observed usage information to unseen entities and unseen usage, and to smooth out outliers, we need to combine data from similar entities. Such combination requires a shared entity representation, allowing combination of similar entities while preventing their confusion with dissimilar entities. For this work, an entity may be a member of multiple classes (or types).
We aggregate usage across multiple entities by abstracting to their classes. Membership of a class can be considered as a binary attribute for an entity, with the entity's membership of all the classes considered in the analysis forming a class signature. For example, the entity barackobama is a person, politician, democrat, and writer, among other classes. He is not a republican. Considering these five classes as our class space, the class signature for barackobama would look like Figure 1. Defining an entity by its classes has precedent in previous work (Galárraga et al., 2017; Razniewski et al., 2016). It allows consideration of entities and class combinations not yet seen in the KB (though not entirely new classes).
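As a minimal illustration of this representation (our own sketch, not the paper's implementation; the class vocabulary, entity data and helper names are hypothetical), a class signature is a binary vector over a fixed class space, and entities with identical vectors fall into the same group:

```python
from typing import Dict, List, Set, Tuple

# Hypothetical class vocabulary; the real KB contains far more classes.
CLASS_SPACE: List[str] = ["person", "politician", "democrat", "republican", "writer"]

def class_signature(entity_classes: Set[str]) -> Tuple[int, ...]:
    """Binary membership vector over CLASS_SPACE, as in Figure 1."""
    return tuple(1 if c in entity_classes else 0 for c in CLASS_SPACE)

def group_by_signature(entities: Dict[str, Set[str]]) -> Dict[Tuple[int, ...], List[str]]:
    """Group together entities whose class membership is identical."""
    groups: Dict[Tuple[int, ...], List[str]] = {}
    for entity, classes in entities.items():
        groups.setdefault(class_signature(classes), []).append(entity)
    return groups

if __name__ == "__main__":
    entities = {"barackobama": {"person", "politician", "democrat", "writer"}}
    print(class_signature(entities["barackobama"]))  # (1, 1, 1, 0, 1)
```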

3.2 Relation Distributions

KB queries can be considered as graph traversals, stepping through multiple edges of the knowledge graph to determine the result of a multi-clause query. For example, the query:

    ∃y : haspresident(usa, y) ∧ hasspouse(y, x)        (1)

determines the spouse of the president of the United States by composing two clauses, as shown in Figure 2.

Figure 2: Graph representation of the facts needed to solve the query in Equation 1. The path walked by the query can branch arbitrarily, but maintains a directionality from initial entities to result entities.

The demand-weighted importance of a relation R for an entity E is defined as the number of query clauses about E which contain R, as a fraction of the total number of clauses about E. For example, Equation 1 contains two clauses. As the first clause queries for the haspresident relation of the USA entity, we attribute this occurrence of haspresident to the USA entity. Aggregating the clauses for an entity gives a total entity usage of the form seen in Figure 3. Since the distribution of relation usage is dominated by a few high-value relations (see Figure 6), we only consider the relations required to satisfy 95% of queries.

    USA:
        haspresident: 13
        hascapital: 8
        haspopulation: 6
        ...

Figure 3: Absolute usage data for the entity USA.

3.3 Predicting Relations from Classes

Combining the two representation methods above, we aim to predict the relation distribution for a given entity (as in Figure 4) using the class membership for the entity (as in Figure 1). This provides the expected usage profile of an entity, potentially before it has seen any usage.

    barackobama:
        hasheight: 0.16
        hasbirthdate: 0.12
        hasbirthplace: 0.08
        hasspouse: 0.07
        haschild: 0.05

Figure 4: An example of a predicted relation distribution for an individual entity. The values represent the proportion of usage of the entity that requires the given relation.

4 Data and Models

4.1 Our knowledge base

We make use of a proprietary KB (Tunstall-Pedoe, 2010) constructed over several years, combining a hand-curated ontology with publicly available data from Wikipedia, Freebase, DBPedia, and other sources. However, the task can be applied to any KB with usage data, relations and classes. We use a subset of our KB for this analysis due to the limitation of model size as a function of the number of classes (input features) and the number of relations (output features). Our usage data is generated by our Natural Language Understanding system, which produces KB queries from text utterances. Though it is difficult to remove all biases and errors from the system when operated at industrial scale, we use a hybrid system of curated rules and statistical methods to reduce such problems to a minimum. Such errors should not impact the way we evaluate different models for their ability to model the data itself.

4.2 Datasets

To create a class signature, we first determine the binary class membership vector for every entity in the usage dataset. We then group entities by class signature, so that entities with identical class membership are grouped together. For each class signature, we generate the relation distribution from the usage data of the entities with that signature. In our case, this usage data is a random subset of query traffic against the KB taken from a specific period of time. The more usage a class signature has, the more fine-grained the distribution of relations becomes.
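To make this construction concrete, the sketch below (our own illustration, not the paper's code; the counts and helper names are hypothetical) normalises raw clause counts of the kind shown in Figure 3 into a relation distribution and truncates it to the relations needed to cover 95% of usage:

```python
from collections import Counter
from typing import List, Tuple

def relation_distribution(clause_counts: Counter) -> List[Tuple[str, float]]:
    """Normalise raw clause counts (Figure 3 style) into relation proportions."""
    total = sum(clause_counts.values())
    return sorted(((rel, count / total) for rel, count in clause_counts.items()),
                  key=lambda item: item[1], reverse=True)

def truncate_to_coverage(dist: List[Tuple[str, float]],
                         coverage: float = 0.95) -> List[Tuple[str, float]]:
    """Keep only the highest-weight relations needed to satisfy the coverage target."""
    kept, cumulative = [], 0.0
    for relation, weight in dist:
        kept.append((relation, weight))
        cumulative += weight
        if cumulative >= coverage:
            break
    return kept

if __name__ == "__main__":
    usa_usage = Counter({"haspresident": 13, "hascapital": 8, "haspopulation": 6})
    print(truncate_to_coverage(relation_distribution(usa_usage)))
```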

Table 1: Dataset statistics (number of classes, relations, and class signatures for the datasets D1_small, D2_medium and D3_large).

The data is divided into 10 cross-validation folds to ensure that no class signature appears in both the validation and training sets. We generate 3 different sizes of dataset for experimentation (see Table 1), to see how dataset size influences the models.

4.3 Relation prediction models

Baseline - Frequency-Based. In this approach, we compute the relation distribution for each individual class by summing the usage data for all entities of that class (see Section 3). This gives a combined raw relation usage as seen in Figure 5. For every class in the training set we store this raw relation distribution. At test time, we compute the predicted relation distribution for a class signature as the normalized sum of the raw distributions of all its classes.

    person:
        hasname: 31
        hasage: 18
        hasheight: ...

Figure 5: Aggregated usage data for the class person.

However, these single-class distributions do not capture the influence of class co-occurrence, where the presence of two classes together may have a stronger influence on the importance of a relation than each class on its own. Additionally, storing distributions for each class signature does not scale, and does not generalize to unseen class combinations.

Learning-Based Approaches. To investigate the impact of class co-occurrence, we use two different learning models to predict the relation distribution for a given set of input classes. The vector of classes comprising the class signature is used as input to the learned models.

Linear regression. Using the normalized relation distribution for each class signature, we trained a least-squares linear regression model to predict the relation distribution from a binary vector of classes. This model has (n × m) parameters, where n is the number of input classes and m is the number of relations. We implemented our linear regression model using the Scikit-learn toolkit (Pedregosa et al., 2011).

Neural network. We trained a feed-forward neural network using the binary class vector as the input layer, with a low-dimensional (h) hidden layer (with rectified linear units as activation) followed by a softmax output layer of the size of the relation set. This model has h(n + m) parameters, which, depending on the value of h, is significantly smaller than the linear regression model. The objective function used for training was Kullback-Leibler divergence. We chose Keras (Chollet, 2015) to implement the neural network model. The model had a single 10-node rectified linear unit hidden layer, with a softmax over the output.
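As a concrete sketch of the neural model just described (our own illustration, not the authors' code: it assumes the modern tf.keras API rather than 2015-era Keras, and the dataset sizes, optimizer choice and variable names are hypothetical), the binary class vector feeds a 10-unit ReLU hidden layer and a softmax over relations, trained with a KL-divergence objective:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical dimensions; the paper's class and relation vocabularies differ.
n_classes, n_relations, hidden_units = 5000, 1500, 10

# Binary class-signature vector in, relation distribution out (Section 4.3).
model = keras.Sequential([
    keras.Input(shape=(n_classes,)),
    layers.Dense(hidden_units, activation="relu"),    # single 10-node ReLU hidden layer
    layers.Dense(n_relations, activation="softmax"),  # distribution over relations
])
model.compile(optimizer="adam", loss=keras.losses.KLDivergence())

# Toy training data: X holds class signatures, Y holds observed relation
# distributions (each row sums to 1); stand-ins for the real usage-derived data.
X = np.random.randint(0, 2, size=(1000, n_classes)).astype("float32")
Y = np.random.dirichlet(np.ones(n_relations), size=1000).astype("float32")
model.fit(X, Y, epochs=5, batch_size=32, verbose=0)
```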

5 Evaluation

We compare the predicted relation distributions to those observed for the test examples in two ways:

Weighted Jaccard Index. We modified the Jaccard index (Jaccard, 1912) to include a weighting term, which weights every relation with the mean of its weight in the predicted and observed distributions (see Figure 6). This rewards a correctly predicted relation without focusing on the proportion predicted for that relation, and is sufficient to define a set of important relations for a class signature. This is given by:

    J = ( Σ_{R_i ∈ P ∩ O} W(R_i) ) / ( Σ_{R_i ∈ P ∪ O} W(R_i) )        (2)

where P is the predicted distribution, O is the observed distribution, and W(R_i) is the mean weight of relation R_i in P and O. We also calculate false negatives (observed but not predicted) and false positives (predicted but not observed), by replacing the intersection P ∩ O in the numerator of Equation 2 with P\O and O\P respectively.

Figure 6: Example histogram of the predicted (using a neural model) and observed relation distributions for a single class signature, showing the region of intersection in green and the weighted Jaccard index in black.

Intersection. We compute the intersection of the two distributions (see Figure 6). This is a stricter comparison between the distributions which penalizes differences in weight for individual relations. This is given by:

    I = Σ_i min(P(R_i), O(R_i))        (3)

5.1 Usage Weighted Evaluation

We also evaluated the models using the Weighted Jaccard index and Intersection methods, but weighting by the usage counts for each signature. This metric rewards the models more for correctly predicting relation distributions for common class signatures in the usage data. While unweighted analysis is useful to examine how the model covers the breadth of the problem space, weighted evaluation more closely reflects the model's utility on real usage data.

5.2 Temporal Prediction

Additionally, we evaluated the models on their ability to predict future usage. With an unchanging usage pattern, evaluation against future usage would be equivalent to cross-validation (assuming the same signature distribution in the folds). However, in many real-world cases, usage of a KB varies over time, seasonally or as a result of changing user requirements. Therefore we also evaluated a neural model against future usage data to measure how elapsed time affected model performance. The datasets T1, T2, and T3 each contain 3 datasets (of similar size to D1_small, D2_medium, and D3_large), and were created using usage data from time periods with a fixed offset Δt. The base set was created at time t_0, T1 at time t_0 + Δt, T2 at time t_0 + 2Δt, and T3 at time t_0 + 3Δt. A time interval was chosen that reflected the known variability of the usage data, such that we would expect the usage to not be the same.
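Equations 2 and 3 translate directly into code once the predicted and observed distributions are held as relation-to-weight mappings. The sketch below reflects our reading of the equations; the function names and example values are hypothetical:

```python
from typing import Dict

def weighted_jaccard(pred: Dict[str, float], obs: Dict[str, float]) -> float:
    """Equation 2: Jaccard index weighted by each relation's mean weight in P and O."""
    def mean_weight(rel: str) -> float:
        return (pred.get(rel, 0.0) + obs.get(rel, 0.0)) / 2.0
    intersection = set(pred) & set(obs)
    union = set(pred) | set(obs)
    return sum(mean_weight(r) for r in intersection) / sum(mean_weight(r) for r in union)

def intersection_score(pred: Dict[str, float], obs: Dict[str, float]) -> float:
    """Equation 3: sum of element-wise minima of the two distributions."""
    return sum(min(pred.get(r, 0.0), obs.get(r, 0.0)) for r in set(pred) | set(obs))

if __name__ == "__main__":
    p = {"hasheight": 0.16, "hasbirthdate": 0.12, "hasbirthplace": 0.08}
    o = {"hasbirthdate": 0.20, "hasbirthplace": 0.10, "hasspouse": 0.06}
    print(weighted_jaccard(p, o), intersection_score(p, o))
```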

6 Results

6.1 Cross-Validation

10-fold cross-validation results are shown in Table 2. The neural network model performs best, outperforming the baseline model by 6-8 percentage points. The regression model performs worst, trailing the baseline model by 4-8 percentage points.

Table 2: Unweighted results (Jaccard, false negatives, false positives) for the three models on the three datasets.

Baseline. The baseline model shows little improvement with increasing amounts of data: the results from D1_small to D3_large (3x more data points) improve by just over 1 percentage point. This suggests that this model is unable to generalise from the data, which is expected from the lack of class co-occurrence information in the model. Interestingly, the baseline model shows an increase in false negatives on the larger datasets, implying the lack of generalisation is more problematic for more fine-grained relation distributions.

Linear Regression. The linear regression model gives a much lower Jaccard measure than the baseline model. This is likely due to the number of parameters in the model relative to the number of examples. For D1_small, the model has approximately 6m parameters, with 12k training examples, making this an under-determined system. For D3_large the number of parameters rises to 20m, with 37k training examples, maintaining the poor example-to-parameter ratio. From this we might expect the performance of the model to be invariant with the amount of data. However, the larger datasets also have higher-resolution relation distributions, as they are aggregated from more individual examples. This has the effect of reducing the impact of outliers in the data, giving improved predictions when the model generalises. We do indeed see that the linear regression model improves notably with larger datasets, closing the gap to the baseline model from 8 percentage points to 4.

Neural Network. The neural network model shows much better performance than either of the other two methods. The Jaccard score is consistently 6-8% above the regression model, with far fewer false negatives and smaller numbers of false positives. This is likely to be due to the smaller number of parameters of the neural model versus the linear regression model. For D3_large, the 10-node hidden layer model amounts to 115k parameters with 37k training examples, a far better ratio (though still not ideal) than for the linear regression model.

Weighted Evaluation. We include in Table 3 the results using the weighted evaluation scheme described in Section 5.1. This gives a more usage-focused evaluation, emphasizing the non-uniform usage of different class signatures. The D3_large neural model achieves 85% precision with a weighted evaluation. With the low rate of false negatives, this indicates that a similar model could be used to predict the necessary relations for KB usage.

Table 3: Usage-weighted results (Jaccard, false negatives, false positives) for the three models on the three datasets.

6.2 Intersection

Table 4 gives measurements of the intersection metric. These show a similar trend to the Jaccard scores, with lower absolute values from the stricter evaluation metric. Although the Jaccard measure shows correct relation set prediction with a precision of 0.700, predicting the proportions for those relations accurately remains a difficult problem; the best value we achieved is given in Table 4.

Table 4: Intersection results for the three methods on the D3_large dataset. The difference between the methods is similar to the Jaccard measure above.

Table 5: Results of training a neural model on all available data for D1_small - D3_large, then evaluating on T1-T3. The values for D2_medium and D3_large are higher than cross-validation, as cross-validation never tests a model on examples used to train it. However, the T datasets contain all data from the specified period. The downward trend with increasing T is clear, but slight.

6.3 Unweighted Temporal Prediction

In addition to evaluating models on their ability to predict the behaviour of unseen class signatures, we also evaluated the neural model on its ability to predict future usage behaviour. The results of this experiment are given in Table 5. We observe a very slight downward trend in the precision of the model using all three base datasets (D1_small - D3_large), with a steeper (but still slight) downward trend for the larger datasets. This suggests that a model trained on usage data from one period of time will have significant predictive power on future datasets.

7 Measuring Completeness of a KB

Once we have a suitable model of the expected relation distributions for class combinations, we use the model to predict the expected relation distribution for specific entities in our KB. We then compare the predicted relation distribution to the observed relations for each specific entity. The completeness of an entity is given by the sum of the relation proportions for the predicted relations the entity has in the KB. Any gaps for an entity represent relations that, if added to the KB, would have a quantifiable positive impact on the performance of the KB. By focussing on the most important entities according to our usage, we can target fact addition to have the greatest impact on the usage the KB receives.

By aggregating the completeness values for a set of entities, we may estimate the completeness of subsets of the KB. This aggregation is weighted by the frequency with which each entity appears in the usage data, giving a usage-weighted measure of the subset's completeness. These subsets can represent individual topics, individual classes of entity, or overall information about the KB as a whole. For example, using the best neural model above on an unrepresentative subset of our KB, we evaluate the completeness of that subset at 58.3%. This not only implies that we are missing a substantial amount of necessary information for these entities with respect to the usage data chosen, but permits targeting of source acquisition to improve entity completeness in aggregate. For example, if we are missing a large number of hasbirthdate facts for people, we might locate a source that has that information. We can quantify the benefit of that effort in terms of improved usage performance.
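The entity-level and subset-level completeness measures described above reduce to a usage-weighted sum over predicted relation proportions. A minimal sketch, assuming hypothetical entity data and our own function names:

```python
from typing import Dict, Iterable, Set

def entity_completeness(predicted: Dict[str, float], present: Set[str]) -> float:
    """Sum of predicted relation proportions for relations the entity already has in the KB."""
    return sum(weight for relation, weight in predicted.items() if relation in present)

def subset_completeness(entities: Iterable[str],
                        predictions: Dict[str, Dict[str, float]],
                        kb_relations: Dict[str, Set[str]],
                        usage_counts: Dict[str, int]) -> float:
    """Aggregate completeness over a subset of entities, weighted by usage frequency."""
    entities = list(entities)
    weighted = sum(usage_counts[e] * entity_completeness(predictions[e], kb_relations[e])
                   for e in entities)
    return weighted / sum(usage_counts[e] for e in entities)

if __name__ == "__main__":
    predicted = {"hasbirthdate": 0.4, "hasbirthplace": 0.35, "hasspouse": 0.25}
    print(entity_completeness(predicted, {"hasbirthdate", "hasspouse"}))  # 0.65
```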
8 Conclusions and Future Work

We have introduced the notion of Demand-Weighted Completeness as a way of determining a KB's suitability by employing usage data. We have demonstrated a method to predict the distribution of relations needed in a KB for entities of a given class signature, and have compared three different models for predicting these distributions. Further, we have described a method to measure the completeness of a KB using these distributions. For future work we would like to try more complex neural network architectures, regularisation, and semantic embeddings or other abstracted relations to enhance the signatures. We would also like to investigate Good-Turing frequency estimation (Good, 1953).

References

François Chollet. 2015. Keras. https://github.com/fchollet/keras.

Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. 2014. Knowledge Vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14). ACM, New York, NY, USA.

Michael Färber, Basil Ell, Carsten Menne, and Achim Rettinger. 2015. A comparative survey of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Semantic Web Journal.

D. A. Ferrucci. 2012. Introduction to "This is Watson". IBM Journal of Research and Development 56(3).

Luis Galárraga, Simon Razniewski, Antoine Amarilli, and Fabian M. Suchanek. 2017. Predicting completeness in knowledge bases. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM '17). ACM, New York, NY, USA.

I. J. Good. 1953. The population frequencies of species and the estimation of population parameters. Biometrika 40(3-4).

Paul Jaccard. 1912. The distribution of the flora in the alpine zone. New Phytologist 11(2).

Bonan Min, Ralph Grishman, Li Wan, Chang Wang, and David Gondek. 2013. Distant supervision for relation extraction with an incomplete knowledge base. In HLT-NAACL.

Paramita Mirza, Simon Razniewski, and Werner Nutt. 2016. Expanding Wikidata's parenthood information by 178%, or how to mine relation cardinalities. In ISWC 2016 Posters & Demonstrations Track. CEUR-WS.org.

Amihai Motro. 1989. Integrity = validity + completeness. ACM Transactions on Database Systems 14(4).

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12.

Simon Razniewski, Flip Korn, Werner Nutt, and Divesh Srivastava. 2015. Identifying the extent of completeness of query answers over partially complete databases. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). ACM, New York, NY, USA.

Simon Razniewski and Werner Nutt. 2011. Completeness of queries over incomplete databases. VLDB 4.

Simon Razniewski, Fabian M. Suchanek, and Werner Nutt. 2016. But what do we actually know? In Proceedings of AKBC.

Fabian Suchanek, David Gross-Amblard, and Serge Abiteboul. 2011. Watermarking for ontologies. In The Semantic Web - ISWC 2011.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. YAGO: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web (WWW '07). ACM, New York, NY, USA.

William Tunstall-Pedoe. 2010. True Knowledge: Open-domain question answering using structured knowledge and inference. AI Magazine 31(3).

Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: A free collaborative knowledgebase. Communications of the ACM 57(10).

Amrapali Zaveri, Anisa Rula, Andrea Maurino, Ricardo Pietrobon, Jens Lehmann, and Sören Auer. 2013. Quality assessment for linked open data: A survey.
