Top- k Entity Augmentation Using C onsistent Set C overing
|
|
- Shavonne Crawford
- 5 years ago
- Views:
Transcription
1 Top- k Entity Augmentation Using C onsistent Set C overing J ulian Eberius*, Maik Thiele, Katrin Braunschweig and Wolfgang Lehner Technische Universität Dresden, Germany
2 What is Entity Augmentation? ENTITY AUGMENTATION QUERY (EAQ) Q uery of the form Q EE (a +, E a1,,a n ) Result: same set of entities with the augmented attribute added Example Dresden Web Table Corpus 1 : 125M tables Web Tables n_nationkey n_regionkey n_name 0 0 Algeria 1 1 Argentina 2 1 Brazil 3 1 Canada 4 4 Egypt Q EE population, = n_nationkey n_regionkey n_name population 0 0 Algeria pop ALGERIA 1 1 Argentina pop ARGENTINA 2 1 Brazil pop BRAZIL 3 1 Canada pop CANADA 4 4 Egypt pop EGYPT ENTITY AUGMENTATION SYSTEM (EAS) Manages a corpus of data sources Selects a subset D relevant to a query Q EA 1 dresden.de/misc/dwtc 2
3 Why C onsistency? Challenge 1: Information need expressed by a keyword underspecified Q EA ( revenue, E) Company Bank of China Banco do Brasil Rogers Communications China Mobile AT&T S 1 : revenues 2013 Revenue Agricultural Bank of China x 1 Banco do Brasil x 2 AT&T x 3 China Mobile x 4 S 4 : banking sector S 2 : US market share Revenue AT&T x C hallenge 2: Data sources unknown to the user difficult S 3 : telco to sector express precise queries Revenue 0.92 Revenue Bank of China x 1 Rogers x 1 S 6 : revenue changes Banco do Brasil x China Mobile x 2 Revenue S 5 : chinese market growth AT&T x Rogers x 1 Revenue Growth S 7 : revenues 2013 AT&T x 2 Bank of China x 1 y 1 Revenue Banco do Brasil x 3 China Mobile x 2 y 2 Rogers x
4 Why C onsistency? (2) Challenge 3: EAQ explorative in nature cover 1 Users may want to stumble on cover new aspects 2 of their information cover 3 S 3 : telco sector need (similar to document S 1 : revenues search 2013 engines) S 5 : chinese market growth Revenue Revenue Revenue Growth Rogers x 1 Agricultural Bank of China x 1 Bank of China x 1 y 1 China Mobile x 2 AT&T x 3 S 4 : banking sector 0.85 Revenue Bank of China x 1 Banco do Brasil x 2 AT&T x 3 China Mobile x 4 S 7 : revenues 2013 Revenue 0.69 China Mobile x 2 y 2 S 6 : revenue changes Revenue Rogers x Banco do Brasil x Rogers x AT&T x 2 Banco do Brasil x
5 O ur Top- k Approach UNCERTAINTY IN SOURCES AND UNCLEAR USER INTENT Challenge 1: Information need expressed by a keyword underspecified C hallenge 2: Data sources unknown to the user difficult to express precise queries Challenge 3: EAQ explorative in nature users may want to stumble on new aspects of their information need (similar to document search engines) WE SOLVE THIS PROBLEM by presenting not one, but multiple, ranked, alternative covers TOP-K 5
6 Top- 1 Entity Augmentation INITIAL EXAMPLE Entities c 11 c 10 c 9 c 8 c 7 c 6 c 5 c 4 c 3 c 2 c 1 data sources by rank 6
7 Top- 1 Entity Augmentation (3) STRATEGY 1: TOP-RANKED Entities c 11 c 10 c 9 c 8 C andidate Scoring Schema- /instance match using string distance functions Metadata match using set distance Popularity scores (Alexa Web information through AWS) c 7 Problem Large number of data sources c 5 picked C ombines data sources that c does not 4 fit to each other consistency!! c 6 c 3 cover c rank c 2 c 1 data sources by rank 7
8 Top- 1 Entity Augmentation (4) STRATEGY 2: MINIMAL NUMBER OF SOURCES Entities c 11 C onsistency is maximized Q uality according to the ranking function will be low c 10 c 9 c 8 c 7 c 6 cover c min c 5 c 4 c 3 c 2 c 1 data sources by rank 8
9 Top- 1 Entity Augmentation (5) STRATEGY 3: SIMILAR DATA SOURCES Entities c 11 c 10 Coherence Measures Similarity of attribute, values, metadata, domains, tags C ombined using a weighted sum c 9 c 8 c 7 c 6 c 5 c 4 c 3 c 2 cover c consistent c 1 data sources by rank 9
10 Top- k Entity Augmentation ORTHOGONAL STRATEGY: CREATE MULTIPLE (TOP-K) DIVERSE RESULTS Entities c 11 c 10 c 9 c 8 c 7 c 6 c 5 1. cover c 1 2. cover c 2 3. cover c 3... k. cover c k c 4 c 3 c 2 c 1 data sources by rank 10
11 Top- k C onsistent Entity Augmentation FOR A GIVEN entity set, an attribute and a large scale corpus of web tables WE SOLVE THE PROBLEM BY mapping the Top- k C onsistent Entity Augmentation to the weighted set cover problem WE PROPOSE NEW ENTITY AUGMENTATION ALGORITHMS THAT construct multiple (top- k), minimal and consistent S 2 (3) S 3 (5) S 4 (2) COVERS, THAT ARE complementary alternatives. S 1 (8) S 6 (7) S 5 (1) 11
12 Top- k C onsistent Entity Augmentation TOP-K CONSISTENT ENTITY AUGMENTATION Universe of elements U = Set of entities E WEIGHTED SET COVER PROBLEM Family of subsets of this universe S = Set of data sources D = d 1,, d n Associated with a weight w i = Data source quality sssss(d) with respect to a query Find a subset s of S whose union = Find a cover c = [d i,, d x ] with d c ccc(d) = E equals U, such that i S w i is minimized such that d c sssss(d) is maximized SO FAR WE COULD TRIVIALLY MAP OUR PROBLEM TO THE WEIGHTED SET COVER PROBLEM BUT THERE ARE DIFFERENCES 12
13 Top- k C onsistent Entity Augmentation ONE VERSUS MANY (TOP-K) O ne minimal cover c <> list of covers C= [c 1,, c n ] CONSISTENCY CONSTRAINT Covers consisting of similar datasets according to a function sss() c C mmmmmmmm sss(d i, d j ) d i,d j c DIVERSITY CONSTRAINT List of covers should not consist of the same or similar datasets, but be complementary alternatives miiimmmm sss(c i, c j ) c i,c j C 13
14 Greedy Top- k C onsistent EA GREEDY IMPLEMENTATION Initially empty cover c Free entity set F = E In each iteration pick the dataset that maximizes Until F = sssss(d) ccc(d) F + CONSISTENCY C onsidering documents already added to the partial cover result sssss(d) ccc(d) F sss F (d, c) with the special case of sss F d, = 1 Encourages picks of data sources that are similar to data sources that were already picked for the current cover 14
15 Greedy Top- k C onsistent EA (2) TOP-K Basic idea: perform k consecutive runs of the greedy algorithm using the same input datasets to create k covers CREATE K COMPLEMENTARY COVERS (= MAXIMIZE RESULT DIVERSITY) Extend the scoring function sssss(d) ccc(d) F sss F (d, c) rrrrrrrrrr(d, F, C) with rrrrrrrrrr d, F, C = sss F (d, ccccccccc(e, C)) e F ccc(d) Function ccccccccc(e, C): returns datasets that were used to augment an entity e in previously created covers C 15
16 Greedy Top- k C onsistent EA (3) 16
17 Greedy* Top- k C onsistent EA DRAWBACK OF GREEDY TOP-K CONSISTENT EA After the first cover is created the search is mainly guided by using different datasets for each cover New combinations of used data sets are often not considered in the basic greedy algorithm GREEDY*, CONSISTING OF TWO PHASE Phase 1: uses the Greedy Top- K C onsistent EA algorithm to create s time more covers than requested Phase 2: select the k best covers from the a pool of s k solutions, using a greedy approach - Pick the best solution by score and consistency - Pick further solutions that maximize these criteria while using dissimilar datasets (scoring function defined on the level of covers instead) 17
18 Genetic Top- k C onsistent EA POPULATION Initialized by creating covers with the Greedy algorithm Until all candidate sources have been used in at least one cover slightly modification of the Greedy algorithm to favor unused data sources GENES AND GENOME Gene = one data source Genome = set of data sources FITNESS FUNCTION Same as the scoring function in the Greedy algorithm sssss(d) ccc(d) F sss F (d, c) rrrrrrrrrr(d, F, C) 18
19 Genetic Top- k C onsistent EA CROSSOVER FUNCTION Combining two sets of data sources does not necessarily yield a set that is again a set cover - Uncovered entities - C over may not minimal (if too many genes are passed) Use the basic Greedy approach applied to the union of genes instead of the whole set D MUTATION STRATEGY Randomly flipping bits in the genome Fill and prune - Remove datasets that may have become redundant - Add random datasets to fill holes created by mutation 19
20 Genetic Top- K C onsistent EA 20
21 Evaluation SETUP All approaches 1 and baseline algorithms implemented in Scala version 2.11, executed on a version Oracle J VM Dresden Web Table Corpus (DWTC) 2 : 125 million web tables All queries are run against the full corpus - Companies (attributes: revenue, employees, founded) - C ountries (attributes: population, population growth, area) - Large cities (attribute: population) Gold standard: human judges evaluated for each entity in each result set the relevance of the data source and the correctness of the entity match INDICATORS C overage, minimality, consistency, score, diversity and precision 1 ulianeberius/rea 2 dresden.de/misc/dwtc 21
22 Baseline Algorithms VALUEGROUPING Generates one cover (Infogather approach) C ollects all values into per- entity group Clusters these groups according to their values Selects the highest scoring value from the cluster with the highest sum of scores Favoring high ranking data sources that are supported by many other sources with similar values TOPRANKED Generates k covers Selects the highest ranked data sources that can produce a value for each entity Next cover is constructed from each second best ranked value for each entity, until k results have been constructed C onsistency and diversity agnostic 22
23 Single Entity Precision 23
24 Inherent Score 24
25 C onistency and Minimality 25
26 Diversity 26
27 Relaxing C overage coverage = 1.0 coverage = 0.75 coverage =
28 Relaxing Coverage (2) coverage = 1.0 coverage = 0.75 coverage =
29 Relaxing Coverage (3) coverage = 1.0 coverage = 0.75 coverage =
30 Relaxing Coverage (4) coverage = 1.0 coverage = 0.75 coverage =
31 Runtime Performance 31
32 Summary EAQ BASED ON LARGE CORPORA OF DATA SOURCES SHOULD BE PROCESSED a st o p - k queries while ensuring consistency and minimal size of the individual augmentations and the diversity of the result as a whole THREE NEW ALGORITHMS FOR CONSISTENT MULTI-SOLUTION SET COVERING Greedy Greedy* and Genetic Top- k C onsistent EA EXPERIMENT SHOWED THAT Genetic set covering- based approach improves both consistency and minimality of the results significantly without loss of coverage while producing a diverse set of results 32
33 Backup
34 Why Web Tables VERY IMPORTANT SOURCE OF INFORMATION Used as a compact and efficient way to present relational information Have many applications - Knowledge management, e.g. ontology enrichment (Mulwad et al.) - Information retrieval, e.g. entity augmentation (Eberius et al.), factual search (Yin et al.), table search Dresden Web Table Corpus (DTWC) - 125M tables based on C ommon C rawl (3.6 billion web pages, 266TB) - dresden.de/misc/dwtc 34
35 Evaluation SETUP All approaches and baseline algorithms implemented in Scala version 2.11, executed on a version Oracle J VM Dresden Web Table Corpus (DWTC) 1 : 125M web tables All queries are run against the full corpus - Companies from the Forbes 2000 list (attributes: revenue, employees, founded) - C ountries (attributes: population, population growth, area) - Large cities from the Mondial database (attribute: population) Default query set - 4 queries for each domain- attribute pair, with 20 entities ( E = 20 ) each, and k = 10 - C ompany and city domain: top and tail queries, i.e. queries for the top 10 0 entities, queries with randomly picked entities Gold standard: human judges evaluated for each entity in each result set the relevance of the data source and the correctness of the entity match 1 dresden.de/misc/dwtc 35
36 Evaluation Indicators COVERAGE Percentage of the queried entities that are augmented with a value PRECISION Percentage of entities for which a relevant value was retrieved E.g. precision = 1.0 each entity was augmented with a value that was judged relevant with respect to the query keyword, but the augmentation is not necessary consistent CONSISTENCY Average similarity between the datasets that were used to create the respective cover Higher value cover was created from more similar datasets 36
37 Evaluation Indicators (2) MINIMALITY Number of datasets used in a cover in relation to the number of entities it covers 1 c d c ccc(d) or 1.0 i c = 1 High value cover was created from a small number of data sources Value of 0.0 each entity was covered using a different data source SCORE Average ranking score each individual dataset received from the scoring function DIVERSITY (FOR A SET OF COVERS) Average distance between all covers in the set (average distance between the datasets of two covers) 37
38 Baseline Algorithms VALUEGROUPING Generates one cover C ollects all values into per- entity group Clusters these groups according to their values Selects the highest scoring value from the cluster with the highest sum of scores Favoring high ranking data sources that are supported by many other sources with similar values TOPRANKED Generates k covers Selects the highest ranked data sources that can produce a value for each entity Next cover is constructed from each second best ranked value for each entity, until K results have been constructed C onsistency and diversity agnostic 39
39 Evaluating Entity C overage Results are comparable to Infogather+ (SIG MO D 13) 40
40 Evaluating Single Entity Precision (2) 41
41 Evaluating Single Entity Precision (3) PERCENTAGE OF QUERIES WITH BEST SOLUTIONS NOT ON RANK 1, AND IMPROVEMENTS OVER RANK 1 Best Result not on Rank 1 Avg. Improvement over Rank 1 Genetic 14% 12% Greedy 14% 10% Greedy* 23% 9% TopRanked 32% 7% 42
42 Evaluating C over Q uality (3) SETUP thc ov=1.0 (demanding full covers) and thc ons=0.0 43
43 Top- 1 Entity Augmentation (2) RELATED WORK: AGGREGATION Entities c 11 c 10 c 9 c 8 Aggregating all candidate values into one result (Infogather: Yakout et al. SIGMOD 12, Zhang et al. SIG MO D 13) Generating one augmentation c 7 c 6 c 5 c 4 c 3 c 2 c 1 data sources by rank 44
Top-k Entity Augmentation using Consistent Set Covering
Top-k Entity Augmentation using Consistent Set Covering Julian Eberius, Maik Thiele, Katrin Braunschweig, and Wolfgang Lehner Technische Universität Dresden [firstname.lastname]@tu-dresden.de ABSTRACT
More informationEvolutionary Approaches for Resilient Surveillance Management. Ruidan Li and Errin W. Fulp. U N I V E R S I T Y Department of Computer Science
Evolutionary Approaches for Resilient Surveillance Management Ruidan Li and Errin W. Fulp WAKE FOREST U N I V E R S I T Y Department of Computer Science BioSTAR Workshop, 2017 Surveillance Systems Growing
More informationFast Efficient Clustering Algorithm for Balanced Data
Vol. 5, No. 6, 214 Fast Efficient Clustering Algorithm for Balanced Data Adel A. Sewisy Faculty of Computer and Information, Assiut University M. H. Marghny Faculty of Computer and Information, Assiut
More informationPromoting Ranking Diversity for Biomedical Information Retrieval based on LDA
Promoting Ranking Diversity for Biomedical Information Retrieval based on LDA Yan Chen, Xiaoshi Yin, Zhoujun Li, Xiaohua Hu and Jimmy Huang State Key Laboratory of Software Development Environment, Beihang
More informationSearching the Deep Web
Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index
More informationTable Identification and Information extraction in Spreadsheets
Table Identification and Information extraction in Spreadsheets Elvis Koci 1,2, Maik Thiele 1, Oscar Romero 2, and Wolfgang Lehner 1 1 Technische Universität Dresden, Germany 2 Universitat Politècnica
More informationSearching the Deep Web
Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index
More informationEvolutionary Computation
Evolutionary Computation Lecture 9 Mul+- Objec+ve Evolu+onary Algorithms 1 Multi-objective optimization problem: minimize F(X) = ( f 1 (x),..., f m (x)) The objective functions may be conflicting or incommensurable.
More informationCrew Scheduling Problem: A Column Generation Approach Improved by a Genetic Algorithm. Santos and Mateus (2007)
In the name of God Crew Scheduling Problem: A Column Generation Approach Improved by a Genetic Algorithm Spring 2009 Instructor: Dr. Masoud Yaghini Outlines Problem Definition Modeling As A Set Partitioning
More informationCHAPTER 5 OPTIMAL CLUSTER-BASED RETRIEVAL
85 CHAPTER 5 OPTIMAL CLUSTER-BASED RETRIEVAL 5.1 INTRODUCTION Document clustering can be applied to improve the retrieval process. Fast and high quality document clustering algorithms play an important
More informationA Memetic Heuristic for the Co-clustering Problem
A Memetic Heuristic for the Co-clustering Problem Mohammad Khoshneshin 1, Mahtab Ghazizadeh 2, W. Nick Street 1, and Jeffrey W. Ohlmann 1 1 The University of Iowa, Iowa City IA 52242, USA {mohammad-khoshneshin,nick-street,jeffrey-ohlmann}@uiowa.edu
More informationGraph Mining and Social Network Analysis
Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References q Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann
More informationDATA MINING AND WAREHOUSING
DATA MINING AND WAREHOUSING Qno Question Answer 1 Define data warehouse? Data warehouse is a subject oriented, integrated, time-variant, and nonvolatile collection of data that supports management's decision-making
More informationover Multi Label Images
IBM Research Compact Hashing for Mixed Image Keyword Query over Multi Label Images Xianglong Liu 1, Yadong Mu 2, Bo Lang 1 and Shih Fu Chang 2 1 Beihang University, Beijing, China 2 Columbia University,
More informationHEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A PHARMACEUTICAL MANUFACTURING LABORATORY
Proceedings of the 1998 Winter Simulation Conference D.J. Medeiros, E.F. Watson, J.S. Carson and M.S. Manivannan, eds. HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A
More informationMidterm Examination CS540-2: Introduction to Artificial Intelligence
Midterm Examination CS540-2: Introduction to Artificial Intelligence March 15, 2018 LAST NAME: FIRST NAME: Problem Score Max Score 1 12 2 13 3 9 4 11 5 8 6 13 7 9 8 16 9 9 Total 100 Question 1. [12] Search
More informationLearning independent, diverse binary hash functions: pruning and locality
Learning independent, diverse binary hash functions: pruning and locality Ramin Raziperchikolaei and Miguel Á. Carreira-Perpiñán Electrical Engineering and Computer Science University of California, Merced
More informationArtificial Intelligence Application (Genetic Algorithm)
Babylon University College of Information Technology Software Department Artificial Intelligence Application (Genetic Algorithm) By Dr. Asaad Sabah Hadi 2014-2015 EVOLUTIONARY ALGORITHM The main idea about
More informationDynamic Feature Selection for Dependency Parsing
Dynamic Feature Selection for Dependency Parsing He He, Hal Daumé III and Jason Eisner EMNLP 2013, Seattle Structured Prediction in NLP Part-of-Speech Tagging Parsing N N V Det N Fruit flies like a banana
More informationGenetic Algorithms. PHY 604: Computational Methods in Physics and Astrophysics II
Genetic Algorithms Genetic Algorithms Iterative method for doing optimization Inspiration from biology General idea (see Pang or Wikipedia for more details): Create a collection of organisms/individuals
More informationJianyong Wang Department of Computer Science and Technology Tsinghua University
Jianyong Wang Department of Computer Science and Technology Tsinghua University jianyong@tsinghua.edu.cn Joint work with Wei Shen (Tsinghua), Ping Luo (HP), and Min Wang (HP) Outline Introduction to entity
More informationUniversity of Delaware at Diversity Task of Web Track 2010
University of Delaware at Diversity Task of Web Track 2010 Wei Zheng 1, Xuanhui Wang 2, and Hui Fang 1 1 Department of ECE, University of Delaware 2 Yahoo! Abstract We report our systems and experiments
More informationEvolutionary Algorithms. CS Evolutionary Algorithms 1
Evolutionary Algorithms CS 478 - Evolutionary Algorithms 1 Evolutionary Computation/Algorithms Genetic Algorithms l Simulate natural evolution of structures via selection and reproduction, based on performance
More informationAutomatic people tagging for expertise profiling in the enterprise
Automatic people tagging for expertise profiling in the enterprise Pavel Serdyukov * (Yandex, Moscow, Russia) Mike Taylor, Vishwa Vinay, Matthew Richardson, Ryen White (Microsoft Research, Cambridge /
More informationKEYWORD search is a well known method for extracting
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 7, JULY 2014 1657 Efficient Duplication Free and Minimal Keyword Search in Graphs Mehdi Kargar, Student Member, IEEE, Aijun An, Member,
More informationA Survey on Keyword Diversification Over XML Data
ISSN (Online) : 2319-8753 ISSN (Print) : 2347-6710 International Journal of Innovative Research in Science, Engineering and Technology An ISO 3297: 2007 Certified Organization Volume 6, Special Issue 5,
More informationCHAPTER 6 REAL-VALUED GENETIC ALGORITHMS
CHAPTER 6 REAL-VALUED GENETIC ALGORITHMS 6.1 Introduction Gradient-based algorithms have some weaknesses relative to engineering optimization. Specifically, it is difficult to use gradient-based algorithms
More informationThe k-means Algorithm and Genetic Algorithm
The k-means Algorithm and Genetic Algorithm k-means algorithm Genetic algorithm Rough set approach Fuzzy set approaches Chapter 8 2 The K-Means Algorithm The K-Means algorithm is a simple yet effective
More informationHOW BUILDING OUR OWN E-COMMERCE SEARCH CHANGED OUR APPROACH TO SEARCH QUALITY. 2018// Berlin
HOW BUILDING OUR OWN E-COMMERCE SEARCH CHANGED OUR APPROACH TO SEARCH QUALITY 13.06.18 1 Our Search Team @otto.de Search Team in 2017 Christine Bellstedt Business Designer Search www.otto.de Search Quality
More informationToday. CS 188: Artificial Intelligence Fall Example: Boolean Satisfiability. Reminder: CSPs. Example: 3-SAT. CSPs: Queries.
CS 188: Artificial Intelligence Fall 2007 Lecture 5: CSPs II 9/11/2007 More CSPs Applications Tree Algorithms Cutset Conditioning Today Dan Klein UC Berkeley Many slides over the course adapted from either
More informationA Genetic Algorithm Applied to Graph Problems Involving Subsets of Vertices
A Genetic Algorithm Applied to Graph Problems Involving Subsets of Vertices Yaser Alkhalifah Roger L. Wainwright Department of Mathematical Department of Mathematical and Computer Sciences and Computer
More informationData Preprocessing. Slides by: Shree Jaswal
Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data
More informationFinding Related Tables
Finding Related Tables Anish Das Sarma, Lujun Fang, Nitin Gupta, Alon Halevy, Hongrae Lee, Fei Wu, Reynold Xin, Cong Yu University of Michigan, Google Inc., University of California, Berkeley ljfang@umich.edu,
More informationPRISM: Concept-preserving Social Image Search Results Summarization
PRISM: Concept-preserving Social Image Search Results Summarization Boon-Siew Seah Sourav S Bhowmick Aixin Sun Nanyang Technological University Singapore Outline 1 Introduction 2 Related studies 3 Search
More informationPresented by: Dimitri Galmanovich. Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Gengxin Miao, Chung Wu
Presented by: Dimitri Galmanovich Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Gengxin Miao, Chung Wu 1 When looking for Unstructured data 2 Millions of such queries every day
More informationEstimating. Local Ancestry in admixed Populations (LAMP)
Estimating Local Ancestry in admixed Populations (LAMP) QIAN ZHANG 572 6/05/2014 Outline 1) Sketch Method 2) Algorithm 3) Simulated Data: Accuracy Varying Pop1-Pop2 Ancestries r 2 pruning threshold Number
More informationComparative Study of Subspace Clustering Algorithms
Comparative Study of Subspace Clustering Algorithms S.Chitra Nayagam, Asst Prof., Dept of Computer Applications, Don Bosco College, Panjim, Goa. Abstract-A cluster is a collection of data objects that
More informationGenetic Algorithms for Vision and Pattern Recognition
Genetic Algorithms for Vision and Pattern Recognition Faiz Ul Wahab 11/8/2014 1 Objective To solve for optimization of computer vision problems using genetic algorithms 11/8/2014 2 Timeline Problem: Computer
More informationContext Management in Database Systems with Word Embeddings. Michael Günther
Context Management in Database Systems with Word Embeddings Michael Günther Introduction Language learning methods Word embedding operations Inception: Shutter_Island:. [0.54, -0.71, 0.11, ] [0.31, -0.59,
More informationLiterature Review On Implementing Binary Knapsack problem
Literature Review On Implementing Binary Knapsack problem Ms. Niyati Raj, Prof. Jahnavi Vitthalpura PG student Department of Information Technology, L.D. College of Engineering, Ahmedabad, India Assistant
More informationNetwork Routing Protocol using Genetic Algorithms
International Journal of Electrical & Computer Sciences IJECS-IJENS Vol:0 No:02 40 Network Routing Protocol using Genetic Algorithms Gihan Nagib and Wahied G. Ali Abstract This paper aims to develop a
More informationEntity Linking at TAC Task Description
Entity Linking at TAC 2013 Task Description Version 1.0 of April 9, 2013 1 Introduction The main goal of the Knowledge Base Population (KBP) track at TAC 2013 is to promote research in and to evaluate
More informationWikipedia and the Web of Confusable Entities: Experience from Entity Linking Query Creation for TAC 2009 Knowledge Base Population
Wikipedia and the Web of Confusable Entities: Experience from Entity Linking Query Creation for TAC 2009 Knowledge Base Population Heather Simpson 1, Stephanie Strassel 1, Robert Parker 1, Paul McNamee
More informationThe Curse of Dimensionality
The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more
More informationCreating Large-scale Training and Test Corpora for Extracting Structured Data from the Web
Creating Large-scale Training and Test Corpora for Extracting Structured Data from the Web Robert Meusel and Heiko Paulheim University of Mannheim, Germany Data and Web Science Group {robert,heiko}@informatik.uni-mannheim.de
More informationRegression Test Case Prioritization using Genetic Algorithm
9International Journal of Current Trends in Engineering & Research (IJCTER) e-issn 2455 1392 Volume 2 Issue 8, August 2016 pp. 9 16 Scientific Journal Impact Factor : 3.468 http://www.ijcter.com Regression
More informationResPubliQA 2010
SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first
More informationData Analytics for. Transmission Expansion Planning. Andrés Ramos. January Estadística II. Transmission Expansion Planning GITI/GITT
Data Analytics for Andrés Ramos January 2018 1 1 Introduction 2 Definition Determine which lines and transformers and when to build optimizing total investment and operation costs 3 Challenges for TEP
More informationCS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University
CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and
More informationData Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality
Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., occupation = noisy: containing
More informationA GENETIC ALGORITHM FOR CLUSTERING ON VERY LARGE DATA SETS
A GENETIC ALGORITHM FOR CLUSTERING ON VERY LARGE DATA SETS Jim Gasvoda and Qin Ding Department of Computer Science, Pennsylvania State University at Harrisburg, Middletown, PA 17057, USA {jmg289, qding}@psu.edu
More informationLocal Search. CS 486/686: Introduction to Artificial Intelligence Winter 2016
Local Search CS 486/686: Introduction to Artificial Intelligence Winter 2016 1 Overview Uninformed Search Very general: assumes no knowledge about the problem BFS, DFS, IDS Informed Search Heuristics A*
More informationEvaluation of Retrieval Systems
Performance Criteria Evaluation of Retrieval Systems. Expressiveness of query language Can query language capture information needs? 2. Quality of search results Relevance to users information needs 3.
More informationApproximation Algorithms
Approximation Algorithms Prof. Tapio Elomaa tapio.elomaa@tut.fi Course Basics A 4 credit unit course Part of Theoretical Computer Science courses at the Laboratory of Mathematics There will be 4 hours
More informationFeature Subset Selection using Clusters & Informed Search. Team 3
Feature Subset Selection using Clusters & Informed Search Team 3 THE PROBLEM [This text box to be deleted before presentation Here I will be discussing exactly what the prob Is (classification based on
More informationGenetic Algorithms for Classification and Feature Extraction
Genetic Algorithms for Classification and Feature Extraction Min Pei, Erik D. Goodman, William F. Punch III and Ying Ding, (1995), Genetic Algorithms For Classification and Feature Extraction, Michigan
More information3/3/2008. Announcements. A Table with a View (continued) Fields (Attributes) and Primary Keys. Video. Keys Primary & Foreign Primary/Foreign Key
Announcements Quiz will cover chapter 16 in Fluency Nothing in QuickStart Read Chapter 17 for Wednesday Project 3 3A due Friday before 11pm 3B due Monday, March 17 before 11pm A Table with a View (continued)
More informationOptimizing Testing Performance With Data Validation Option
Optimizing Testing Performance With Data Validation Option 1993-2016 Informatica LLC. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording
More informationApproximation Algorithms
Approximation Algorithms Prof. Tapio Elomaa tapio.elomaa@tut.fi Course Basics A new 4 credit unit course Part of Theoretical Computer Science courses at the Department of Mathematics There will be 4 hours
More informationGRASP. Greedy Randomized Adaptive. Search Procedure
GRASP Greedy Randomized Adaptive Search Procedure Type of problems Combinatorial optimization problem: Finite ensemble E = {1,2,... n } Subset of feasible solutions F 2 Objective function f : 2 Minimisation
More informationOn Demand Phenotype Ranking through Subspace Clustering
On Demand Phenotype Ranking through Subspace Clustering Xiang Zhang, Wei Wang Department of Computer Science University of North Carolina at Chapel Hill Chapel Hill, NC 27599, USA {xiang, weiwang}@cs.unc.edu
More informationSequence clustering. Introduction. Clustering basics. Hierarchical clustering
Sequence clustering Introduction Data clustering is one of the key tools used in various incarnations of data-mining - trying to make sense of large datasets. It is, thus, natural to ask whether clustering
More informationInformation Retrieval and Web Search Engines
Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 4th, 2014 Wolf-Tilo Balke and José Pinto Institut für Informationssysteme Technische Universität Braunschweig The Cluster
More informationAnnouncements. CS 188: Artificial Intelligence Fall Reminder: CSPs. Today. Example: 3-SAT. Example: Boolean Satisfiability.
CS 188: Artificial Intelligence Fall 2008 Lecture 5: CSPs II 9/11/2008 Announcements Assignments: DUE W1: NOW P1: Due 9/12 at 11:59pm Assignments: UP W2: Up now P2: Up by weekend Dan Klein UC Berkeley
More informationCS 188: Artificial Intelligence Fall 2008
CS 188: Artificial Intelligence Fall 2008 Lecture 5: CSPs II 9/11/2008 Dan Klein UC Berkeley Many slides over the course adapted from either Stuart Russell or Andrew Moore 1 1 Assignments: DUE Announcements
More informationMatching Web Tables To DBpedia - A Feature Utility Study
Matching Web Tables To DBpedia - A Feature Utility Study Dominique Ritze, Christian Bizer Data and Web Science Group, University of Mannheim, B6, 26 68159 Mannheim, Germany {dominique,chris}@informatik.uni-mannheim.de
More informationA Genetic Algorithm-Based Approach for Building Accurate Decision Trees
A Genetic Algorithm-Based Approach for Building Accurate Decision Trees by Z. Fu, Fannie Mae Bruce Golden, University of Maryland S. Lele,, University of Maryland S. Raghavan,, University of Maryland Edward
More informationCS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks
CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks Archana Sulebele, Usha Prabhu, William Yang (Group 29) Keywords: Link Prediction, Review Networks, Adamic/Adar,
More informationSql Server Syllabus. Overview
Sql Server Syllabus Overview This SQL Server training teaches developers all the Transact-SQL skills they need to create database objects like Tables, Views, Stored procedures & Functions and triggers
More informationAutomatic Generation of Test Case based on GATS Algorithm *
Automatic Generation of Test Case based on GATS Algorithm * Xiajiong Shen and Qian Wang Institute of Data and Knowledge Engineering Henan University Kaifeng, Henan Province 475001, China shenxj@henu.edu.cn
More informationA Learning Automata-based Memetic Algorithm
A Learning Automata-based Memetic Algorithm M. Rezapoor Mirsaleh and M. R. Meybodi 2,3 Soft Computing Laboratory, Computer Engineering and Information Technology Department, Amirkabir University of Technology,
More informationExploring a Few Good Tuples From Text Databases
Exploring a Few Good Tuples From Text Databases Alpa Jain, Divesh Srivastava Columbia University, AT&T Labs-Research Abstract Information extraction from text databases is a useful paradigm to populate
More informationNortheastern University in TREC 2009 Web Track
Northeastern University in TREC 2009 Web Track Shahzad Rajput, Evangelos Kanoulas, Virgil Pavlu, Javed Aslam College of Computer and Information Science, Northeastern University, Boston, MA, USA Information
More informationGenetic Algorithms. Kang Zheng Karl Schober
Genetic Algorithms Kang Zheng Karl Schober Genetic algorithm What is Genetic algorithm? A genetic algorithm (or GA) is a search technique used in computing to find true or approximate solutions to optimization
More informationDETERMINING MAXIMUM/MINIMUM VALUES FOR TWO- DIMENTIONAL MATHMATICLE FUNCTIONS USING RANDOM CREOSSOVER TECHNIQUES
DETERMINING MAXIMUM/MINIMUM VALUES FOR TWO- DIMENTIONAL MATHMATICLE FUNCTIONS USING RANDOM CREOSSOVER TECHNIQUES SHIHADEH ALQRAINY. Department of Software Engineering, Albalqa Applied University. E-mail:
More informationClustering. CS294 Practical Machine Learning Junming Yin 10/09/06
Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,
More informationGreedy Algorithms. Algorithms
Greedy Algorithms Algorithms Greedy Algorithms Many algorithms run from stage to stage At each stage, they make a decision based on the information available A Greedy algorithm makes decisions At each
More informationOpen Data Integration. Renée J. Miller
Open Data Integration Renée J. Miller miller@northeastern.edu !2 Open Data Principles Timely & Comprehensive Accessible and Usable Complete - All public data is made available. Public data is data that
More informationAdministrative. Local Search!
Administrative Local Search! CS311 David Kauchak Spring 2013 Assignment 2 due Tuesday before class Written problems 2 posted Class participation http://www.youtube.com/watch? v=irhfvdphfzq&list=uucdoqrpqlqkvctckzqa
More informationCHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES
CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES 6.1 INTRODUCTION The exploration of applications of ANN for image classification has yielded satisfactory results. But, the scope for improving
More informationFeature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani
Feature Selection CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Dimensionality reduction Feature selection vs. feature extraction Filter univariate
More informationInformation Retrieval and Web Search Engines
Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework
More informationCAD Algorithms. Placement and Floorplanning
CAD Algorithms Placement Mohammad Tehranipoor ECE Department 4 November 2008 1 Placement and Floorplanning Layout maps the structural representation of circuit into a physical representation Physical representation:
More informationIdentifying Web Spam With User Behavior Analysis
Identifying Web Spam With User Behavior Analysis Yiqun Liu, Rongwei Cen, Min Zhang, Shaoping Ma, Liyun Ru State Key Lab of Intelligent Tech. & Sys. Tsinghua University 2008/04/23 Introduction simple math
More information10601 Machine Learning. Model and feature selection
10601 Machine Learning Model and feature selection Model selection issues We have seen some of this before Selecting features (or basis functions) Logistic regression SVMs Selecting parameter value Prior
More informationQuery Expansion using Wikipedia and DBpedia
Query Expansion using Wikipedia and DBpedia Nitish Aggarwal and Paul Buitelaar Unit for Natural Language Processing, Digital Enterprise Research Institute, National University of Ireland, Galway firstname.lastname@deri.org
More informationMidterm Examination CS 540-2: Introduction to Artificial Intelligence
Midterm Examination CS 54-2: Introduction to Artificial Intelligence March 9, 217 LAST NAME: FIRST NAME: Problem Score Max Score 1 15 2 17 3 12 4 6 5 12 6 14 7 15 8 9 Total 1 1 of 1 Question 1. [15] State
More informationImproving the Performance of OLAP Queries Using Families of Statistics Trees
Improving the Performance of OLAP Queries Using Families of Statistics Trees Joachim Hammer Dept. of Computer and Information Science University of Florida Lixin Fu Dept. of Mathematical Sciences University
More informationGrid-Based Genetic Algorithm Approach to Colour Image Segmentation
Grid-Based Genetic Algorithm Approach to Colour Image Segmentation Marco Gallotta Keri Woods Supervised by Audrey Mbogho Image Segmentation Identifying and extracting distinct, homogeneous regions from
More informationTotal Recall: Automatic Query Expansion with a Generative Feature Model for Object Retrieval O. Chum, et al.
Total Recall: Automatic Query Expansion with a Generative Feature Model for Object Retrieval O. Chum, et al. Presented by Brandon Smith Computer Vision Fall 2007 Objective Given a query image of an object,
More informationEffect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching
Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching Wolfgang Tannebaum, Parvaz Madabi and Andreas Rauber Institute of Software Technology and Interactive Systems, Vienna
More informationEvaluation of Retrieval Systems
Performance Criteria Evaluation of Retrieval Systems 1 1. Expressiveness of query language Can query language capture information needs? 2. Quality of search results Relevance to users information needs
More informationTime Complexity Analysis of the Genetic Algorithm Clustering Method
Time Complexity Analysis of the Genetic Algorithm Clustering Method Z. M. NOPIAH, M. I. KHAIRIR, S. ABDULLAH, M. N. BAHARIN, and A. ARIFIN Department of Mechanical and Materials Engineering Universiti
More informationGenetic-Algorithm-Based Construction of Load-Balanced CDSs in Wireless Sensor Networks
Genetic-Algorithm-Based Construction of Load-Balanced CDSs in Wireless Sensor Networks Jing He, Shouling Ji, Mingyuan Yan, Yi Pan, and Yingshu Li Department of Computer Science Georgia State University,
More informationCHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM. Please purchase PDF Split-Merge on to remove this watermark.
119 CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM 120 CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM 5.1. INTRODUCTION Association rule mining, one of the most important and well researched
More informationThe Binary Genetic Algorithm. Universidad de los Andes-CODENSA
The Binary Genetic Algorithm Universidad de los Andes-CODENSA 1. Genetic Algorithms: Natural Selection on a Computer Figure 1 shows the analogy between biological i l evolution and a binary GA. Both start
More informationTowards a Distributed Web Search Engine
Towards a Distributed Web Search Engine Ricardo Baeza-Yates Yahoo! Labs Barcelona, Spain Joint work with many people, most at Yahoo! Labs, In particular Berkant Barla Cambazoglu A Research Story Architecture
More informationDynamic Programming Course: A structure based flexible search method for motifs in RNA. By: Veksler, I., Ziv-Ukelson, M., Barash, D.
Dynamic Programming Course: A structure based flexible search method for motifs in RNA By: Veksler, I., Ziv-Ukelson, M., Barash, D., Kedem, K Outline Background Motivation RNA s structure representations
More informationEffective Searching of RDF Knowledge Bases
Effective Searching of RDF Knowledge Bases Shady Elbassuoni Joint work with: Maya Ramanath and Gerhard Weikum RDF Knowledge Bases Annie Hall is a 1977 American romantic comedy directed by Woody Allen and
More informationInformation Fusion Dr. B. K. Panigrahi
Information Fusion By Dr. B. K. Panigrahi Asst. Professor Department of Electrical Engineering IIT Delhi, New Delhi-110016 01/12/2007 1 Introduction Classification OUTLINE K-fold cross Validation Feature
More information