Graph-Based Weakly-Supervised Methods for Information Extraction & Integration. Partha Pratim Talukdar, University of Pennsylvania
1 Graph-Based Weakly-Supervised Methods for Information Extraction & Integration. Partha Pratim Talukdar, University of Pennsylvania. Dissertation Defense, February 24, 2010.
2 End Goal: We should be able to answer any question for which data exists in the dataset.
3 Query: alma maters of US mayors. There is probably no single page which can answer this query exactly.
6 Google Squared? 28 mayors listed out of thousands; alma mater of only four mayors found. An important first step!
9 Often, users need information that combines data from multiple sites (pages), e.g. "... holds a bachelor of science degree from the University of Alabama." Information Extraction (IE) produces per-source tables:

Mayor | City | State
Bill Ham Jr. | Auburn | AL
Edward May | Bessemer | AL
Loretta Spencer | Huntsville | AL

Person | Alma mater
Loretta Spencer | Univ. of Alabama

Information Integration (II) then combines them:

Mayor | City | State | Alma mater
Bill Ham Jr. | Auburn | AL | ...
Edward May | Bessemer | AL | ...
Loretta Spencer | Huntsville | AL | Univ. of Alabama
13 ... or from tables, as in the Life Sciences. Example user keyword query: "genes proteins malaria", answered by combining unstructured sources (e.g. research papers) with structured sources (Gene DB1, Gene DB2, Protein DB1, Protein DB2, Disease DB1, Disease DB2, Disease DB3).
14 Current Solution in Life Sciences: Hand-Programmed WebForms, with Small Number of Sources. Human-written SQL powering a WebForm (over structured data):

    SELECT distinct cast(aseq.assembly_na_sequence_id as varchar2(32)) as na_sequence_id,
           '@PROJECT_ID@' as project_id,
           count(distinct aseq.na_sequence_id) as libcount
    FROM DoTS.EST@MUS_LINK@ est,
         DoTS.Library@MUS_LINK@ lib,
         DoTS.AssemblySequence@MUS_LINK@ aseq,
         epcondata.isexpressed ie
    WHERE lib.dbest_id = $$panclibraryp$$
      AND lib.library_id = est.library_id
      AND est.na_sequence_id = aseq.na_sequence_id
      AND aseq.assembly_na_sequence_id is not NULL
      AND aseq.assembly_na_sequence_id = ie.na_sequence_id
    GROUP BY aseq.assembly_na_sequence_id

Requires access to programmers: expensive and not scalable. Exploits only a small subset of available data sources. Not suitable for discovery mode!
17 What is Needed to Satisfy User Information Need? Take standard keyword queries, but exploit semantic information to: combine data from within (IE) and across (II) sources (documents and tables); take user information need (personalization/context) into account. Existing approaches require extensive human input (e.g., annotations, mediated schemas): doesn't scale. My thesis addresses the challenges of doing these at scale by: learning from small amounts of human annotation, specification, or feedback; generalizing to large numbers of data items and schemas... through the use of graph-based methods.
22 Thesis Statement: Graph-based representation of data and learning over such graphs result in effective and scalable methods for Information Extraction (IE) and Integration (II).
23 This Talk: Two Parts. 1. Information Extraction (IE): class-instance acquisition at large scale using graph-based methods, and their comparison. 2. Information Integration (II): search- and feedback-driven information integration; automatically adding new sources, and feedback-based association correction. System proposed in my thesis: Q.
27 Q: Overall Architecture. Unstructured and structured data sources feed IE and association (edge) discovery, which together build a graph over data items; ranking and source relevance then operate over that graph.
30 Next: Information Extraction (Class-Instance Acquisition).
31 Class-Instance Acquisition. Unlabeled data (Medline, newswire, Web) plus partial instance lists, e.g.: Car Company: Toyota, Honda, Ford, ...; Volcano: Kilauea, Mt. Fuji, Mt. Andrus; US Cities: New York, Philadelphia, Boston. Can we combine all these sources to build a large repository of class-instance pairs?
35 State-of-the-Art. Several approaches for class-instance acquisition exist, over: unstructured data (A8 [Van Durme and Pasca, 2008]); semi-structured data ([Wang and Cohen, 2007]); structured data (WebTables (WT) [Cafarella et al., 2008]). A particular extraction might be easier in one data source than in another. Can we combine extractions from different sources (and methods) and learn from the combined extractions to improve coverage?
42 Our Approach: Graph-Based Expansion. First-phase extractors produce cluster IDs with extraction confidences, e.g. WT cluster "Musician": Billy Joel (0.75), Johnny Cash (0.73); A8 cluster "Singer": Bob Dylan (0.95), Johnny Cash (0.87), Billy Joel (0.82). We build a graph connecting cluster-ID nodes (Singer, Musician) to instance nodes (Bob Dylan, Johnny Cash, Billy Joel) by edges weighted with extraction confidence, and inject seed class labels (e.g. Musician, 1.0) at a few instance nodes (Johnny Cash, Billy Joel). Can we infer that Bob Dylan is also a Musician, as that pair is missing in the current extractions?
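The graph construction above can be sketched in a few lines. This is a minimal illustration using the slide's example extractions; the dict-of-dicts representation and the "source:cluster" node naming are assumptions for this sketch, not the thesis implementation.

```python
# Sketch: build the extraction graph (cluster-ID nodes <-> instance nodes,
# edges weighted by extraction confidence). Node names are illustrative.
from collections import defaultdict

def build_graph(extractions):
    """extractions: (cluster_id, instance, confidence) triples from
    first-phase extractors (e.g. A8, WT)."""
    graph = defaultdict(dict)  # node -> {neighbor: edge weight}
    for cluster_id, instance, conf in extractions:
        graph[cluster_id][instance] = conf  # undirected: store both ways
        graph[instance][cluster_id] = conf
    return graph

extractions = [
    ("WT:Musician", "Billy Joel", 0.75),
    ("WT:Musician", "Johnny Cash", 0.73),
    ("A8:Singer", "Bob Dylan", 0.95),
    ("A8:Singer", "Johnny Cash", 0.87),
    ("A8:Singer", "Billy Joel", 0.82),
]
graph = build_graph(extractions)
# Bob Dylan is linked to Musician only indirectly, via the Singer cluster
# node: exactly the gap that label propagation is meant to fill.
```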
50 Observations on the Constructed Graph. Smoothness: nodes connected by an edge should be assigned similar classes, as enforced by the edge weight. Coupling nodes, i.e. the nodes corresponding to clusters extracted by the first-phase extractors, softly force all instance nodes connected to them to have similar class labels, exploiting the smoothness requirement. Seed classes can be different from the cluster IDs of the first-phase extractors (A8, WT, etc.).
55 Our Approach: Graph-Based Expansion [Talukdar et al., EMNLP 2008]. We use Adsorption [Baluja et al., 2008] for label propagation (more details shortly). Initialization: seed labels (Musician, 1.0) are placed on Johnny Cash and Billy Joel. In each iteration, labels propagate along weighted edges: after iteration 1 the Singer cluster node picks up Musician (e.g. 0.8), and after iteration 2 Bob Dylan acquires the derived label Musician (e.g. 0.6) through the Singer node.
61 Class Assignment for Fixed Instances. A8: 924k (class, instance) pairs extracted from 100M web documents. WebTables: 74M (class, instance) pairs extracted from the WebTables dataset. Adsorption: graph with 1.4M nodes and 75M edges. Evaluation against a WordNet dataset (38 classes, 8,910 instances), measuring Mean Reciprocal Rank (MRR) against recall: Adsorption is able to assign better class labels to more instances.
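The MRR metric used in the evaluation above can be made concrete with a short sketch: for each gold (class, instance) pair, take the reciprocal of the rank at which the gold class appears in the system's ranked class list for that instance, and average. The toy data below is made up for illustration.

```python
# Sketch: Mean Reciprocal Rank over class assignments for instances.
def mrr(gold, predictions):
    """gold: {instance: gold class};
    predictions: {instance: ranked list of predicted classes}."""
    total = 0.0
    for instance, gold_class in gold.items():
        ranked = predictions.get(instance, [])
        if gold_class in ranked:
            total += 1.0 / (ranked.index(gold_class) + 1)
        # instances with no (correct) prediction contribute 0
    return total / len(gold)

gold = {"Bob Dylan": "musician", "Kilauea": "volcano"}
preds = {"Bob Dylan": ["singer", "musician"], "Kilauea": ["volcano"]}
score = mrr(gold, preds)  # (1/2 + 1/1) / 2 = 0.75
```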
67 Can We Improve Class-Instance Acquisition with Additional Semantic Constraints? Example: Johnny Cash and Bob Dylan share the attribute "albums" (has_attribute: albums), while Isaac Newton does not; this suggests the former two belong to a class such as film-music_contributor-name rather than people-person-name. Instances with shared attributes are likely to be from the same class. The graph-based representation makes it easy to incorporate such constraints!
71 Improving Class-Instance Acquisition with YAGO Attributes. Setup: 170 WordNet classes, 10 seeds per class, using Adsorption; metric: Mean Reciprocal Rank (MRR). Graphs compared: the TextRunner graph (constructed from TextRunner (UWash) output; 175k nodes, 529k edges), the YAGO graph (constructed from the YAGO knowledge base; 142k nodes, 777k edges), and the combined TextRunner + YAGO graph (237k nodes, 1.3M edges). Additional semantic constraints help improve performance significantly. This further demonstrates the benefit of combining information from multiple sources.
79 Class-Instance Acquisition: Recap. Showed benefits of Adsorption, a highly scalable (parallelizable) graph-based semi-supervised learning (SSL) method, to aggregate extractions from different sources (and methods), resulting in better classes for more instances. Demonstrated improved performance through additional semantic constraints. Next: a modification to Adsorption, and a comparison of different graph-based SSL methods.
83 Adsorption & Its Extension. Adsorption uses the following update at iteration (t+1):

\hat{Y}_v^{(t+1)} \leftarrow p_v^{inj} Y_v + p_v^{cont} B_v^{(t)} + p_v^{abnd} r, \quad \text{where} \quad B_v^{(t)} = \sum_u \frac{W_{uv}}{\sum_{u'} W_{u'v}} \hat{Y}_u^{(t)}

Here Y_v are the seed scores, \hat{Y}_v the estimated scores, and r the label prior (modeling label uncertainty). The node-specific random-walk probabilities p_v^{inj}, p_v^{cont}, p_v^{abnd} control how much information passes through each node, and B_v^{(t)} is the weighted neighborhood class score after iteration (t). Adsorption's key drawback: it is not optimizing any well-defined objective.
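The update above can be sketched as a runnable loop. For simplicity this sketch uses uniform injection/continuation/abandonment probabilities for every node (the actual algorithm derives node-specific values from a random-walk analysis) and takes the prior r to be zero, so the p^abnd term drops out; the graph and seeds reuse the Singer/Musician example.

```python
# Sketch of the Adsorption update, with uniform p_inj/p_cont/p_abnd
# (an assumption; the real algorithm uses node-specific probabilities).
def adsorption(graph, seeds, n_iters=10, p_inj=0.2, p_cont=0.7, p_abnd=0.1):
    """graph: {node: {neighbor: weight}}, symmetric;
    seeds: {node: {label: score}}. Returns estimated label scores."""
    y_hat = {v: dict(seeds.get(v, {})) for v in graph}
    for _ in range(n_iters):
        new_y = {}
        for v, nbrs in graph.items():
            if not nbrs:
                new_y[v] = dict(seeds.get(v, {}))
                continue
            total_w = sum(nbrs.values())
            b_v = {}  # B_v: weighted neighborhood class scores from step t
            for u, w in nbrs.items():
                for label, score in y_hat[u].items():
                    b_v[label] = b_v.get(label, 0.0) + (w / total_w) * score
            # p_abnd * r is omitted: the prior r is assumed zero here.
            new_y[v] = {
                label: p_inj * seeds.get(v, {}).get(label, 0.0)
                       + p_cont * b_v.get(label, 0.0)
                for label in set(b_v) | set(seeds.get(v, {}))
            }
        y_hat = new_y
    return y_hat

graph = {
    "Singer": {"Bob Dylan": 0.95, "Johnny Cash": 0.87, "Billy Joel": 0.82},
    "Musician": {"Johnny Cash": 0.73, "Billy Joel": 0.75},
    "Bob Dylan": {"Singer": 0.95},
    "Johnny Cash": {"Singer": 0.87, "Musician": 0.73},
    "Billy Joel": {"Singer": 0.82, "Musician": 0.75},
}
seeds = {"Johnny Cash": {"Musician": 1.0}, "Billy Joel": {"Musician": 1.0}}
scores = adsorption(graph, seeds)
# Bob Dylan now carries a derived "Musician" score via the Singer node.
```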
89 Modified Adsorption (MAD) [Talukdar and Crammer, ECML 2009]. MAD effectively re-weights the graph input to Adsorption and minimizes the following objective, while retaining all of Adsorption's desirable properties (e.g., iterative update, parallelizability):

\min_{\hat{Y}} \sum_l \left[ \mu_1 (Y_l - \hat{Y}_l)^\top S (Y_l - \hat{Y}_l) + \mu_2 \hat{Y}_l^\top L \hat{Y}_l + \mu_3 \|\hat{Y}_l - R_l\|^2 \right]

The first term softly matches the given seed scores Y_l (S is the seed-indicator matrix), the second enforces smoothness over the graph (L is the Laplacian), and the third matches the label priors R_l (a regularizer). For comparison, the LP-ZGL [Zhu et al., 2003] objective is:

\min_{\hat{Y}_l} \hat{Y}_l^\top L \hat{Y}_l, \quad \text{s.t.} \quad S Y_l = S \hat{Y}_l

i.e. smoothness with hard seed matching; LP-ZGL can be considered as MAD without regularization.
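To make the three terms of the MAD objective concrete, here is a numpy sketch that evaluates it for a single label on a tiny made-up graph; the chain graph, the mu values, and writing the third term as a plain squared distance to the prior are all illustrative assumptions.

```python
# Sketch: evaluate the MAD objective (one label) on a toy graph.
import numpy as np

def mad_objective(W, y_seed, y_hat, seed_mask, r, mu1, mu2, mu3):
    """W: symmetric edge-weight matrix; y_seed, y_hat, r: per-node score
    vectors for one label; seed_mask: 1.0 at seed nodes, 0.0 elsewhere."""
    S = np.diag(seed_mask)             # seed indicator
    L = np.diag(W.sum(axis=1)) - W     # unnormalized graph Laplacian
    seed_term = mu1 * (y_seed - y_hat) @ S @ (y_seed - y_hat)  # match seeds (soft)
    smooth_term = mu2 * y_hat @ L @ y_hat                      # smoothness
    prior_term = mu3 * np.sum((y_hat - r) ** 2)                # match priors
    return seed_term + smooth_term + prior_term

# 3-node chain 0-1-2; node 0 is the only seed for this label.
W = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
y_seed = np.array([1.0, 0.0, 0.0])
mask = np.array([1.0, 0.0, 0.0])
r = np.zeros(3)
obj_smooth = mad_objective(W, y_seed, np.array([1.0, 1.0, 1.0]), mask, r, 1.0, 1.0, 0.1)
obj_spiky = mad_objective(W, y_seed, np.array([1.0, 0.0, 0.0]), mask, r, 1.0, 1.0, 0.1)
# A labeling that varies smoothly along edges scores lower than one that
# drops abruptly at the seed's neighbors.
```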
97 Graph-Based SSL Comparisons. TextRunner graph (175k nodes, 529k edges), 170 WordNet classes: MRR of LP-ZGL, Adsorption, and MAD plotted against varying amounts of supervision.
99 Graph-Based SSL Comparisons. Freebase-2 graph (303k nodes, 2.3M edges), 192 WordNet classes: MRR of LP-ZGL, Adsorption, and MAD plotted against varying amounts of supervision.
100 When is MAD Most Effective? Relative increase in MRR by MAD over LP-ZGL, plotted against average graph degree: MAD seems to be more effective in graphs with high average degree, where there is greater need for regularization.
102 Next: Integrating Data across Sources to Answer Queries.
103 Integrating Data Across Sources. Sources (P, b, c, d, G, M) form a graph whose edges are discovered associations, e.g. P (PRO_ID, GENE_NAME) joins to b (SPECIES, GENE_NAME) under the join condition P.GENE_NAME = b.GENE_NAME. Edges carry costs (e.g. 0.07, 0.04), where lower cost reflects user preference for the join. Information need: find Protein, Gene, and disease info on Malaria; the query keywords match the nodes Protein (P), Genes (G), and Malaria (Disease, M).
108 Main Questions. Given the keyword-matching nodes for an information need (find Protein, Gene, disease info on Malaria): 1. How do we determine which edges to include? 2. How do we adjust edge costs to reflect user preferences (i.e., personalization)?
111 Our Approach: Learn the Queries to Integrate Data [Talukdar et al., VLDB 2008]. 1. How do we determine which edges to include? Inference: K-best Steiner tree generation. 2. How do we adjust edge costs to reflect user preferences (i.e., personalization)? Learn from user feedback over answers.
112 Steiner Trees: Finding Lowest-Cost Queries. A Steiner tree is a tree of minimal cost (sum of edge costs) in a graph G which includes all the required nodes S. The Steiner tree problem is a generalization of the Minimum Spanning Tree (MST): the two are equivalent when S = all vertices in G.
115 Inference: K-Best Steiner Tree Generation. Given the schema graph, find Steiner trees connecting the keyword-matching (red) tables, ranked by cost: e.g. Q1 (rank 1, cost 0.4) connects P, b, d, G, M through one set of edges, and Q2 (rank 2, cost 0.41) connects them through a slightly costlier alternative.
119 Our K-Best Steiner Tree Algorithms. Exact inference: an Integer Linear Program (ILP) formulation, using ideas from multi-commodity network flows; contribution: extending 1-best to K-best. Approximate inference: a shortest-paths complete-subgraph heuristic; reduce the problem size by pruning the graph, and then apply the ILP on the reduced graph. Significantly faster; in practice, often gives the optimal solution.
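The flavor of a shortest-paths Steiner heuristic can be sketched as follows. This is not the thesis algorithm (which prunes the graph and then runs the ILP); it is a related classic approximation in the spirit of Kou, Markowsky, and Berman: build the metric closure over the terminals, greedily span it, and expand each closure edge back into its shortest path. The schema graph and edge costs below are hypothetical.

```python
# Sketch: metric-closure Steiner approximation over a toy schema graph.
import heapq

def dijkstra(graph, src):
    """graph: {node: {neighbor: cost}}. Returns (dist, parent) maps."""
    dist, parent, pq = {src: 0.0}, {src: None}, [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], parent[v] = nd, u
                heapq.heappush(pq, (nd, v))
    return dist, parent

def approx_steiner_edges(graph, terminals):
    """Greedily span the terminal metric closure, expanding each closure
    edge into its underlying shortest path; returns a set of graph edges."""
    sp = {t: dijkstra(graph, t) for t in terminals}
    connected, edges = {terminals[0]}, set()
    while len(connected) < len(terminals):
        _, a, b = min(((sp[a][0][b], a, b) for a in connected
                       for b in terminals if b not in connected),
                      key=lambda x: x[0])
        connected.add(b)
        node = b
        while node != a:  # walk the shortest path from b back to a
            p = sp[a][1][node]
            edges.add(tuple(sorted((p, node))))
            node = p
    return edges

schema = {
    "P": {"b": 0.1, "c": 0.07},
    "b": {"P": 0.1, "d": 0.2},
    "c": {"P": 0.07, "d": 0.04},
    "d": {"b": 0.2, "c": 0.04, "G": 0.3, "M": 0.3},
    "G": {"d": 0.3},
    "M": {"d": 0.3},
}
tree_edges = approx_steiner_edges(schema, ["P", "G", "M"])
total_cost = sum(schema[a][b] for a, b in tree_edges)
# The heuristic routes P to d through c (0.07 + 0.04), not b (0.3).
```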
127 Exact vs. Approximate Inference. Larger schema graph of size (408 nodes, 1366 edges) from real sources: GUS, GO, BioSQL. Measuring speedup and error for varying K: it is possible to do K-best inference in larger graphs quickly and with little or no loss (none in this case).
130 Query Formulation & Execution. Trees can be easily written as executable queries: the Steiner tree over P, b, d, G, M (with join conditions such as P.y = b.y) corresponds to the conjunctive query P(x,y) & b(y,z) & d(z,w) & M(w,u) & G(w,v). We can use Orchestra [Ives+ 05] to execute queries and record provenance.
133 Our Approach: Learn the Queries to Integrate Data [Talukdar et al., VLDB 2008]. 1. How do we determine which edges to include? Inference: K-best Steiner tree generation. 2. How do we adjust edge costs to reflect user preferences (i.e., personalization)? Learn from user feedback over answers.
134 Learning New Edge Costs. Each candidate query (Steiner tree), from the top-ranked to the bottom-ranked, is executed to produce tuples; the user gives feedback on answers, which is what the user cares about (rather than on queries or edge costs directly). Feedback identifying a tuple from a lower-ranked query (Query*) updates the edge costs, promoting the tree that produced it to the top.
139 Learning: Cost Model Components. Each edge cost is a weighted sum of feature values: Edge Cost = wDB1 + wDB2 + wdef, with binary features such as "Is this edge incident on DB1?" (value 1, coefficient wDB1), "Is this edge incident on DB2?" (value 1, coefficient wDB2), and a default feature (value 1, coefficient wdef); the coefficient values are learned.
141-145 Learning: Incorporating User Feedback. Model feedback incorporation as a constrained optimization problem, solved with the MIRA algorithm (Crammer et al., 2006): choose new model parameters as close as possible to the current model parameters, subject to the constraint that the cost of the tree the user doesn't like exceed the cost of the tree the user likes by at least the loss.
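The update the slides describe has a closed form for a single constraint. This sketch assumes tree costs are linear in the features; the feature vectors below are illustrative, not the thesis's exact formulation:

```python
def mira_update(w, feat_liked, feat_disliked, loss):
    """Find new weights close to w such that
    cost(disliked tree) - cost(liked tree) >= loss."""
    delta = [fd - fl for fd, fl in zip(feat_disliked, feat_liked)]
    margin = sum(wi * d for wi, d in zip(w, delta))
    violation = loss - margin
    norm_sq = sum(d * d for d in delta)
    if violation <= 0 or norm_sq == 0:
        return list(w)                 # constraint already satisfied
    tau = violation / norm_sq          # smallest step that fixes it
    return [wi + tau * d for wi, d in zip(w, delta)]

w = mira_update([1.0, 1.0], feat_liked=[1, 0], feat_disliked=[0, 1], loss=1.0)
print(w)
```

After the update, the disliked tree costs exactly `loss` more than the liked one, while the weights move as little as possible from their current values.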
146-147 Results: Learning Expert Ranking. Graph: start with the BioGuide [Cohen-Boulakia+07] bio sources, with 28 vertices and 96 edges. Goal: learn the BioGuide expert's rankings. Methodology: all weights are set to default; over a sequence of 25 queries, user feedback on each identifies and promotes a tuple from the gold-standard answer. [Plot: error vs. total queries seen, settings G1 and P3.] After 40-60% of the searches, Q finds the top query immediately; for each search, a single piece of feedback is enough to learn the top query.
148 Our Approach: Learn the Queries to Integrate Data [Talukdar et al., VLDB 2008] 1. How do we determine which edges to include? Inference: K-Best Steiner Tree Generation 2. How do we adjust edge costs to reflect user preferences (i.e., personalization)? Learn from user feedback over answers 41
149 Next: Combining and Adding Sources. [Figure: pipeline from unstructured data, via IE, to structured data, combining association (edge) discovery and ranking sources by relevance over a graph of sources A-G.]
150-155 Automatically Adding New Sources in Q [Talukdar, Ives, and Pereira, SIGMOD 2010]. [Figure: schema graph with relations P, b, c, d, M, G and existing edge costs such as 0.07 and 0.04; a new source n arrives, and the costs of its edges are unknown.] Two questions: how to discover new associations automatically, and how to correct mistakes made during automatic association discovery?
156-162 Discovering New Associations. Any off-the-shelf schema matcher may be used: COMA++ (metadata level) [Do and Rahm, 2007], which needs pairwise comparisons, or label propagation (instance level), proposed in this thesis, for which pairwise comparisons are not necessary. How to correct automatic schema matching errors? By exploiting the end user's expertise in the data, flagging bad answers, without requiring administrator-provided knowledge of the metadata, since such mappings don't take user context into account and are often expensive to obtain.
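A much-simplified stand-in for instance-level matching: score a new column against known columns by value overlap. The thesis's label propagation generalizes this to multi-step propagation through a column/value graph; the column names and values below are illustrative:

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of column values."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Known columns with sample instance values (illustrative data).
known = {"GO.acc": ["GO:0016020", "GO:0005634"],
         "Entry.name": ["Kringle", "PF2R"]}
# Values sampled from a column of a newly added source.
new_column = ["GO:0016020", "GO:0003677", "GO:0005634"]

scores = {col: jaccard(vals, new_column) for col, vals in known.items()}
best = max(scores, key=scores.get)
print(best, scores[best])
```

A high-scoring pair becomes a candidate association edge, whose weight Q can later raise or lower from feedback on answers.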
163-164 Using Q to Correct Alignment Errors. Edge Cost = wdb1 + wdb2 + wdef + 0.90 wcoma++ + 0.7 wlp, extending the cost model with alignment feature weights:

Feature                          Value   Coefficient (learned)
Is this edge incident on DB1?    1       wdb1
Is this edge incident on DB2?    1       wdb2
Default                          1       wdef
COMA++ aligned                   0.90    wcoma++
LabelProp aligned                0.7     wlp
165-166 Correcting Schema Matching Errors with Q. Learning with Q helps correct schema matching errors.
167-170 Reducing Pairwise Comparisons during Association Discovery. [Figure: a schema graph with 5 sources (GO, InterPro2GO, InterPro Entry, InterPro Pub, Pub) and 2 keywords, term and plasma membrane; the shaded oval includes all nodes reachable with cost 2 from at least one of the keywords, and a new source need only be compared against nodes inside that neighborhood.] View-Based Aligner: prune comparisons based on whether they are likely to affect query results, as otherwise there will be no feedback from the user.
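The pruning idea can be sketched with a multi-source Dijkstra that stops at the cost radius; a new source's columns are then only compared against nodes inside the returned neighborhood. The node names and edge costs loosely mirror the slide but are illustrative:

```python
import heapq

def within_radius(adj, sources, radius):
    """All nodes reachable with cost <= radius from any source node."""
    dist = {s: 0.0 for s in sources}
    pq = [(0.0, s) for s in sources]
    heapq.heapify(pq)
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj.get(u, []):
            if d + w <= radius and d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(pq, (d + w, v))
    return set(dist)

# Toy version of the slide's graph: the keywords reach part of the
# schema graph within cost 2, and Pub falls outside the neighborhood.
adj = {
    "term": [("GO", 0.0)],
    "plasma membrane": [("GO", 0.25)],
    "GO": [("InterPro2GO", 1.0)],
    "InterPro2GO": [("InterProEntry", 1.0)],
    "InterProEntry": [("Pub", 2.0)],
}
hood = within_radius(adj, ["term", "plasma membrane"], radius=2)
print(sorted(hood))
```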
171 Reducing Pairwise Comparisons during Association Discovery. [Chart: number of pairwise comparisons vs. number of tables in the schema graph, for Exhaustive vs. ViewBasedAligner.]
172-175 Summary of Contributions. Weakly-supervised acquisition of class-instance pairs from unstructured and structured sources; a scalable method, suitable for large data volumes. A method for learning data-integrating queries that takes the user's information need into account, removing the need for expert input or heavy human supervision, runs at interactive speed, automatically incorporates new sources, and corrects wrong associations.
176 Related Work. Extractions from unstructured data: [Etzioni et al., 2005, Van Durme and Paşca, 2008], ... Extractions from semi-structured data: SEAL [Wang and Cohen, 2007] and its extensions. Graph-based SSL methods: LP-ZGL [Zhu et al., 2003], LGC [Bengio et al., 2007], Adsorption [Baluja et al., 2008], ... Keyword search over databases: BANKS [Bhalotia et al., 2002], BLINKS [He et al., 2007], BioGuide [Cohen-Boulakia+07].
177-179 Future Work: Complete the Loop. [Figure: feedback loop linking IE, association (edge) discovery, and ranking sources by relevance.] Correct extraction errors based on their effect on the final answers, as measured by user feedback over those answers.
180-184 More Future Work. Incorporation of other types of semantic constraints in class-instance acquisition. Graph-based SSL methods for other (non-IS-A) types of relation extraction. Roll Q out to life scientists and get their feedback, and also apply Q to non-life-science datasets. Investigate user adaptation in Q: use a model trained for one user to initialize another, exploiting any available user-similarity information.
185 Acknowledgements Advisors: Zack Ives, Mark Liberman, and Fernando Pereira Committee: William Cohen, Aravind Joshi, Ben Taskar, and Lyle Ungar My Co-authors: Rahul Bhagat, Thorsten Brants, Koby Crammer, Sudipto Guha, Marie Jacob, Salman Mehmood, Marius Pasca, Deepak Ravichandran, Joseph Reisinger 53 DARPA, Google, NSF grant #IIS
186
187 Thank You!
188
189-191 Current Approaches. Supervised Named Entity Recognition (NER): deals with only a limited number of coarse classes, and is very resource-intensive; labeled data is expensive! Pattern-based extraction: textual patterns ("analyst at <ENT> .") are effective only in repetitive contexts [Bellare et al., 2007]; extractions are usually high-precision but low-recall!
192 Context Pattern-based Extraction [Talukdar et al., CoNLL 2006]. Partial entity lists are extended into longer lists using context patterns induced from unstructured text; the extended lists are then used as features in a supervised tagger, improving its performance. Example patterns: "analyst at -ENT- .", "series against the -ENT- tonight", "Today 's Schaeffer 's Option Activity Watch features -ENT- (". Example extensions: Boston Red Sox, St. Louis Cardinals, Chicago Cubs, Florida Marlins.
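A minimal version of pattern application: instantiate the -ENT- slot as a capitalized-phrase capture and scan raw text. The sentence and pattern here are illustrative, and real systems induce the patterns from seed lists rather than hand-writing them:

```python
import re

def extract(pattern, text):
    """Turn a context pattern with an -ENT- slot into a regex and
    return all captured candidate entities."""
    regex = (pattern.replace(".", r"\.")
                    .replace("-ENT-", r"([A-Z][\w.]*(?: [A-Z][\w.]*)*)"))
    return re.findall(regex, text)

text = ("She was an analyst at Morgan Stanley. "
        "He was an analyst at Goldman Sachs.")
print(extract("analyst at -ENT-.", text))
```

Entities harvested this way extend the partial seed list, and the extended list is then fed to the tagger as a feature.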
193 New Extractions Found by Adsorption (total classes: 9081).

Class                A few non-seed instances found by Adsorption
Scientific Journals  Journal of Physics, Nature, Structural and Molecular Biology, Sciences Sociales et sante, Kidney and Blood Pressure Research, American Journal of Physiology - Cell Physiology
NFL Players          Tony Gonzales, Thabiti Davis, Taylor Stubblefield, Ron Dixon, Rodney Hannan
Book Publishers      Small Night Shade Books, House of Ansari Press, Highwater Books, Distributed Art Publishers, Cooper Canyon Press
194 Graph Stats. [Table: statistics of graphs used in the class-instance acquisition experiments; values not recovered in the transcription.]
195-196 Improving Class-Instance Acquisition with Additional Attributes. 170 WordNet classes, 10 seeds per class. [Chart: Mean Reciprocal Rank (MRR) vs. amount of supervision for LP-ZGL, Adsorption, and MAD on the TextRunner graph, the YAGO graph, and the combined TextRunner + YAGO graph.] Additional semantic constraints in the form of (instance, attribute) edges from YAGO help improve performance significantly!
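The graph-SSL machinery behind these comparisons can be sketched as plain label propagation in the spirit of LP-ZGL [Zhu et al., 2003]: seed nodes are clamped to their class labels, and every other node repeatedly averages its neighbors' label scores. Adsorption and MAD add injection/abandonment probabilities and regularization on top. The tiny instance-context graph below is illustrative:

```python
def label_prop(adj, seeds, classes, iters=50):
    """Iterative label propagation with clamped seeds."""
    scores = {n: {c: 0.0 for c in classes} for n in adj}
    for n, c in seeds.items():
        scores[n][c] = 1.0
    for _ in range(iters):
        for n in adj:
            if n in seeds:
                continue                       # seeds stay clamped
            total = sum(w for _, w in adj[n])
            for c in classes:
                scores[n][c] = sum(w * scores[m][c] for m, w in adj[n]) / total
    return scores

# Instances linked through a shared context pattern (symmetric, weighted).
adj = {
    "Nature": [("published in X", 1.0)],
    "published in X": [("Nature", 1.0), ("Cell", 1.0)],
    "Cell": [("published in X", 1.0)],
}
scores = label_prop(adj, seeds={"Nature": "journal"}, classes=["journal"])
print(scores["Cell"]["journal"])
```

Adding YAGO's (instance, attribute) edges simply adds more nodes and edges to `adj`, giving labels extra paths along which to propagate.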
197-198 Effect of Class Similarity Constraints. TextRunner graph, 170 WordNet classes (175k nodes, 529k edges). [Chart: MRR vs. amount of supervision for LP-ZGL, Adsorption, MAD, and MADDL.] Class similarity constraints are helpful; more investigation is necessary.
199 Effect of Class Sparsity Constraints. [Chart: MRR vs. maximum allowed classes per node.]
200-201 SVM Comparison. [Charts: MRR vs. amount of supervision for LP-ZGL, Adsorption, MAD, and SVM, on the Freebase-2 graph (192 WordNet classes; 303k nodes, 2.3m edges) and on the TextRunner graph (170 WordNet classes; 175k nodes, 529k edges).]
202-203 Results: Time to Generate K-best Queries. Schema graph with 28 nodes and 96 edges from BioGuide (Cohen-Boulakia et al., 2007). [Table: K vs. time in seconds; values not recovered in the transcription.] It is possible to generate the top queries in the interactive range; query execution is pipelined.
204-207 Discovering New Associations. COMA++ (metadata level) [Do and Rahm, 2007] needs pairwise comparisons; with label propagation (instance level), pairwise comparisons are not necessary. [Figure: GO identifier values shared between the Interpro2GO.go_id column and the GO term.acc column; label propagation spreads the attribute labels through the shared values, leaving each unlabeled column with a score distribution over labels, e.g. go_id 0.8 / acc 0.2.]
208 Reusing Feedback Helps! [Chart: precision-recall plots for Q with different levels of feedback, Q (1x1), Q (10x1), Q (10x2), Q (10x4), compared against Adsorption and COMA++; x-axis is averaged recall.]
209 Reducing Pairwise Comparisons during Association Discovery. [Chart: number of pairwise comparisons by alignment strategy (Exhaustive vs. ViewBasedAligner), with no additional filter vs. a value-overlap filter, over a schema graph with 18 tables in total.]
210 Reducing Pairwise Comparisons during Association Discovery. [Chart: number of pairwise column comparisons for increasing schema graph size, comparing Exhaustive, ViewBasedAligner, and PreferentialAligner against the number of existing sources.] Caption: number of pairwise attribute comparisons as we scale the size of the search graph (averaged over the introduction of 40 new sources).
Data Cleansing LIU Jingyuan, Vislab WANG Yilei, Theoretical group What is Data Cleansing Data cleansing (data cleaning) is the process of detecting and correcting (or removing) errors or inconsistencies
More informationGraph based machine learning with applications to media analytics
Graph based machine learning with applications to media analytics Lei Ding, PhD 9-1-2011 with collaborators at Outline Graph based machine learning Basic structures Algorithms Examples Applications in
More informationNERD workshop. Luca ALMAnaCH - Inria Paris. Berlin, 18/09/2017
NERD workshop Luca Foppiano @ ALMAnaCH - Inria Paris Berlin, 18/09/2017 Agenda Introducing the (N)ERD service NERD REST API Usages and use cases Entities Rigid textual expressions corresponding to certain
More informationClustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York
Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity
More informationLeveraging Data and Structure in Ontology Integration
Leveraging Data and Structure in Ontology Integration O. Udrea L. Getoor R.J. Miller Group 15 Enrico Savioli Andrea Reale Andrea Sorbini DEIS University of Bologna Searching Information in Large Spaces
More information9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology
9/9/ I9 Introduction to Bioinformatics, Clustering algorithms Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Outline Data mining tasks Predictive tasks vs descriptive tasks Example
More informationPresented by: Dimitri Galmanovich. Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Gengxin Miao, Chung Wu
Presented by: Dimitri Galmanovich Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Gengxin Miao, Chung Wu 1 When looking for Unstructured data 2 Millions of such queries every day
More informationInformation Retrieval and Web Search Engines
Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework
More informationFlat Clustering. Slides are mostly from Hinrich Schütze. March 27, 2017
Flat Clustering Slides are mostly from Hinrich Schütze March 7, 07 / 79 Overview Recap Clustering: Introduction 3 Clustering in IR 4 K-means 5 Evaluation 6 How many clusters? / 79 Outline Recap Clustering:
More informationEfficient Iterative Semi-supervised Classification on Manifold
. Efficient Iterative Semi-supervised Classification on Manifold... M. Farajtabar, H. R. Rabiee, A. Shaban, A. Soltani-Farani Sharif University of Technology, Tehran, Iran. Presented by Pooria Joulani
More informationAn Overview of various methodologies used in Data set Preparation for Data mining Analysis
An Overview of various methodologies used in Data set Preparation for Data mining Analysis Arun P Kuttappan 1, P Saranya 2 1 M. E Student, Dept. of Computer Science and Engineering, Gnanamani College of
More informationLightly-Supervised Attribute Extraction
Lightly-Supervised Attribute Extraction Abstract We introduce lightly-supervised methods for extracting entity attributes from natural language text. Using those methods, we are able to extract large number
More informationInformatica Enterprise Information Catalog
Data Sheet Informatica Enterprise Information Catalog Benefits Automatically catalog and classify all types of data across the enterprise using an AI-powered catalog Identify domains and entities with
More informationShrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India
International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent
More informationPapers for comprehensive viva-voce
Papers for comprehensive viva-voce Priya Radhakrishnan Advisor : Dr. Vasudeva Varma Search and Information Extraction Lab, International Institute of Information Technology, Gachibowli, Hyderabad, India
More informationSemantic Interoperability. Being serious about the Semantic Web
Semantic Interoperability Jérôme Euzenat INRIA & LIG France Natasha Noy Stanford University USA 1 Being serious about the Semantic Web It is not one person s ontology It is not several people s common
More informationMulti-Stage Rocchio Classification for Large-scale Multilabeled
Multi-Stage Rocchio Classification for Large-scale Multilabeled Text data Dong-Hyun Lee Nangman Computing, 117D Garden five Tools, Munjeong-dong Songpa-gu, Seoul, Korea dhlee347@gmail.com Abstract. Large-scale
More informationAdvanced Databases. Lecture 4 - Query Optimization. Masood Niazi Torshiz Islamic Azad university- Mashhad Branch
Advanced Databases Lecture 4 - Query Optimization Masood Niazi Torshiz Islamic Azad university- Mashhad Branch www.mniazi.ir Query Optimization Introduction Transformation of Relational Expressions Catalog
More informationQuery Optimization. Shuigeng Zhou. December 9, 2009 School of Computer Science Fudan University
Query Optimization Shuigeng Zhou December 9, 2009 School of Computer Science Fudan University Outline Introduction Catalog Information for Cost Estimation Estimation of Statistics Transformation of Relational
More informationTheme Identification in RDF Graphs
Theme Identification in RDF Graphs Hanane Ouksili PRiSM, Univ. Versailles St Quentin, UMR CNRS 8144, Versailles France hanane.ouksili@prism.uvsq.fr Abstract. An increasing number of RDF datasets is published
More informationDeep Web Content Mining
Deep Web Content Mining Shohreh Ajoudanian, and Mohammad Davarpanah Jazi Abstract The rapid expansion of the web is causing the constant growth of information, leading to several problems such as increased
More informationTop-k Keyword Search Over Graphs Based On Backward Search
Top-k Keyword Search Over Graphs Based On Backward Search Jia-Hui Zeng, Jiu-Ming Huang, Shu-Qiang Yang 1College of Computer National University of Defense Technology, Changsha, China 2College of Computer
More informationSelecting Topics for Web Resource Discovery: Efficiency Issues in a Database Approach +
Selecting Topics for Web Resource Discovery: Efficiency Issues in a Database Approach + Abdullah Al-Hamdani, Gultekin Ozsoyoglu Electrical Engineering and Computer Science Dept, Case Western Reserve University,
More informationDocument Retrieval using Predication Similarity
Document Retrieval using Predication Similarity Kalpa Gunaratna 1 Kno.e.sis Center, Wright State University, Dayton, OH 45435 USA kalpa@knoesis.org Abstract. Document retrieval has been an important research
More informationEXTRACT THE TARGET LIST WITH HIGH ACCURACY FROM TOP-K WEB PAGES
EXTRACT THE TARGET LIST WITH HIGH ACCURACY FROM TOP-K WEB PAGES B. GEETHA KUMARI M. Tech (CSE) Email-id: Geetha.bapr07@gmail.com JAGETI PADMAVTHI M. Tech (CSE) Email-id: jageti.padmavathi4@gmail.com ABSTRACT:
More informationNatural Language Processing. SoSe Question Answering
Natural Language Processing SoSe 2017 Question Answering Dr. Mariana Neves July 5th, 2017 Motivation Find small segments of text which answer users questions (http://start.csail.mit.edu/) 2 3 Motivation
More informationUpdate Exchange with Mappings and Provenance
Update Exchange with Mappings and Provenance Todd J. Green with Grigoris Karvounarakis, Zachary G. Ives, and Val Tannen CSE 455 / CIS 555: Internet and Web Systems April 18, 2007 Challenge: data sharing
More informationPERSONALIZED TAG RECOMMENDATION
PERSONALIZED TAG RECOMMENDATION Ziyu Guan, Xiaofei He, Jiajun Bu, Qiaozhu Mei, Chun Chen, Can Wang Zhejiang University, China Univ. of Illinois/Univ. of Michigan 1 Booming of Social Tagging Applications
More informationCS 4460 Intro. to Information Visualization September 15, 2017 John Stasko
Case Study: Jigsaw CS 4460 Intro. to Information Visualization September 15, 2017 John Stasko Learning Objectives Become familiar with investigative analysis process carried out by various types of analysts
More informationInteractive Data Exploration Related works
Interactive Data Exploration Related works Ali El Adi Bruno Rubio Deepak Barua Hussain Syed Databases and Information Retrieval Integration Project Recap Smart-Drill AlphaSum: Size constrained table summarization
More informationSnowball : Extracting Relations from Large Plain-Text Collections. Eugene Agichtein Luis Gravano. Department of Computer Science Columbia University
Snowball : Extracting Relations from Large Plain-Text Collections Luis Gravano Department of Computer Science 1 Extracting Relations from Documents Text documents hide valuable structured information.
More informationJianyong Wang Department of Computer Science and Technology Tsinghua University
Jianyong Wang Department of Computer Science and Technology Tsinghua University jianyong@tsinghua.edu.cn Joint work with Wei Shen (Tsinghua), Ping Luo (HP), and Min Wang (HP) Outline Introduction to entity
More informationGraph Mining and Social Network Analysis
Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References q Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann
More informationMaximizing the Value of STM Content through Semantic Enrichment. Frank Stumpf December 1, 2009
Maximizing the Value of STM Content through Semantic Enrichment Frank Stumpf December 1, 2009 What is Semantics and Semantic Processing? Content Knowledge Framework Technology Framework Search Text Images
More informationSemi-Supervised Abstraction-Augmented String Kernel for bio-relationship Extraction
Semi-Supervised Abstraction-Augmented String Kernel for bio-relationship Extraction Pavel P. Kuksa, Rutgers University Yanjun Qi, Bing Bai, Ronan Collobert, NEC Labs Jason Weston, Google Research NY Vladimir
More informationText Analytics (Text Mining)
CSE 6242 / CX 4242 Apr 1, 2014 Text Analytics (Text Mining) Concepts and Algorithms Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer,
More information