Graph-Based Weakly-Supervised Methods for Information Extraction & Integration. Partha Pratim Talukdar, University of Pennsylvania

1 Graph-Based Weakly-Supervised Methods for Information Extraction & Integration. Partha Pratim Talukdar, University of Pennsylvania. Dissertation Defense, February 24, 2010

2 End Goal: We should be able to answer any question for which data exists in the dataset.

5 Query: alma maters of US mayors. There is probably no single page that can answer this query exactly.

8 Google Squared?
- 28 mayors listed out of thousands
- Alma mater of only four mayors found
- An important first step!

12 Often, users need information that combines data from multiple sites (pages).

Example source text: "... holds a bachelor of science degree from the University of Alabama."

Information Extraction (IE) yields:

Mayor            City        State
Bill Ham Jr.     Auburn      AL
Edward May       Bessemer    AL
Loretta Spencer  Huntsville  AL

Person           Alma mater
Loretta Spencer  Univ. of Alabama
...

Information Integration (II) combines them:

Mayor            City        State  Alma mater
Bill Ham Jr.     Auburn      AL     ...
Edward May       Bessemer    AL     ...
Loretta Spencer  Huntsville  AL     Univ. of Alabama

13 ... or from tables, as in the life sciences.
Example user keyword query: genes proteins malaria
[Figure: one unstructured source (e.g., a research paper) and several structured sources: Disease DB1, Disease DB2, Disease DB3, Gene DB1, Gene DB2, Protein DB1, Protein DB2.]

16 Current Solution in Life Sciences: Hand-Programmed WebForms, with a Small Number of Sources

Human-written SQL powering a WebForm (over structured data):

    SELECT distinct cast(aseq.assembly_na_sequence_id as varchar2 (32)) as na_sequence_id,
           '@PROJECT_ID@' as project_id,
           count (distinct aseq.na_sequence_id) as libcount
    FROM DoTS.EST@MUS_LINK@ est,
         DoTS.Library@MUS_LINK@ lib,
         DoTS.AssemblySequence@MUS_LINK@ aseq,
         epcondata.isexpressed ie
    WHERE lib.dbest_id = $$panclibraryp$$
      AND lib.library_id = est.library_id
      AND est.na_sequence_id = aseq.na_sequence_id
      AND aseq.assembly_na_sequence_id is not NULL
      AND aseq.assembly_na_sequence_id = ie.na_sequence_id
    GROUP BY aseq.assembly_na_sequence_id

- Requires access to programmers
- Expensive and not scalable
- Exploits only a small subset of available data sources
- Not suitable for discovery mode!

21 What is Needed to Satisfy User Information Need?
- Take standard keyword queries, but exploit semantic information to:
  - combine data from within (IE) and across (II) sources (documents and tables)
  - take user information need (personalization/context) into account
- Existing approaches require extensive human input (e.g., annotations, mediated schemas): this doesn't scale
- My thesis addresses the challenges of doing these at scale, by:
  - learning from small amounts of human annotation, specification, or feedback
  - generalizing to a large number of data items and schemas
  ... through the use of graph-based methods.

22 Thesis Statement: Graph-based representation of data and learning over such graphs result in effective and scalable methods for Information Extraction (IE) and Integration (II).

26 This Talk: Two Parts
1. Information Extraction (IE)
   - Class-instance acquisition on a large scale using graph-based methods, and their comparisons
2. Information Integration (II)
   - Search and feedback driven information integration
   - Automatically adding new sources, and feedback-based association correction
System proposed in my thesis: Q

29 Q: Overall Architecture
[Figure: system diagram with the components IE, Association (Edge) Discovery, Ranking, and Sources Relevance, operating on unstructured and structured data represented as a graph over nodes A-G.]

30 Next: Information Extraction (Class Instance Acquisition)
[Figure: Q architecture diagram.]

34 Class Instance Acquisition
Unlabeled data: Medline, Newswire, Web
Partial instance lists:
- Car Company: Toyota, Honda, Ford, ...
- US Cities: New York, Philadelphia, Boston
- Volcano: Kilauea, Mt. Fuji, Mt. Andrus
Can we combine all these sources to build a large repository of class-instance pairs?

41 State-of-the-art
Several approaches for class instance acquisition exist:
- Unstructured data (A8 [van Durme and Pasca, 2008])
- Semi-structured data ([Wang and Cohen, 2007])
- Structured data (WebTables (WT) [Cafarella et al., 2008])
A particular extraction might be easier in one data source than in another.
Can we combine extractions from different sources (and methods) and learn from the combined extractions to improve coverage?

49 Our Approach: Graph-based Expansion
Extractions from first-phase extractors, given as cluster ID followed by instances with extraction confidence:
- WT: Musician: Billy Joel (0.75), Johnny Cash (0.73)
- A8: Singer: Bob Dylan (0.95), Johnny Cash (0.87), Billy Joel (0.82)
[Figure: graph with cluster nodes Singer and Musician connected to instance nodes Bob Dylan, Johnny Cash, and Billy Joel; edges are weighted by extraction confidence (e.g., 0.87), and Johnny Cash and Billy Joel carry the seed class Musician 1.0.]
Can we infer that Bob Dylan is also a Musician, as that is missing in current extractions?
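The graph construction on this slide can be sketched in Python. This is a minimal illustration, assuming each extractor emits (cluster ID, instance, confidence) triples; the function name `build_graph` and the tuple-based node naming are hypothetical and not taken from the thesis.

```python
from collections import defaultdict

# (cluster ID, instance, extraction confidence), copied from the slide.
EXTRACTIONS = [
    ("Musician", "Billy Joel", 0.75),
    ("Musician", "Johnny Cash", 0.73),
    ("Singer", "Bob Dylan", 0.95),
    ("Singer", "Johnny Cash", 0.87),
    ("Singer", "Billy Joel", 0.82),
]


def build_graph(extractions):
    """Return an undirected weighted graph as {node: {neighbor: weight}}.

    Cluster and instance nodes are tagged so that a cluster ID can never
    collide with an instance name (an assumption of this sketch).
    """
    graph = defaultdict(dict)
    for cluster, instance, confidence in extractions:
        cluster_node = ("cluster", cluster)
        instance_node = ("instance", instance)
        # One edge per extraction, weighted by the extraction confidence.
        graph[cluster_node][instance_node] = confidence
        graph[instance_node][cluster_node] = confidence
    return dict(graph)


graph = build_graph(EXTRACTIONS)
```

In this toy graph Bob Dylan is linked only to the Singer cluster node, which is why the Musician label has to reach him indirectly through Johnny Cash and Billy Joel.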

54 Observations on the Constructed Graph
- Cluster nodes (e.g., Singer, Musician) correspond to clusters extracted by first-phase extractors.
- Smoothness: nodes connected by an edge should be assigned similar classes, as enforced by edge weight.
- Coupling node: forces (softly) all instance nodes connected to it to have similar class labels, exploiting the smoothness requirement.
- Seed classes (e.g., Musician 1.0) can be different from the cluster IDs of first-phase extractors (A8, WT, etc.).

60 Our Approach: Graph-based Expansion [Talukdar et al., EMNLP 2008]
We use Adsorption [Baluja et al., 2008] for label propagation (more details shortly).
[Figure: label propagation on the Singer/Musician graph. Initialization: seed labels Musician 1.0 on Johnny Cash and Billy Joel. Later iterations add derived labels: Musician 0.8 and, from iteration 2, Musician 0.6 (shown next to Bob Dylan).]

66 Class Assignment for Fixed Instances
- A8: 924k (class, instance) pairs extracted from 100m web documents.
- WebTables: 74m (class, instance) pairs extracted from the WebTables dataset.
- Graph with 1.4m nodes, 75m edges used.
- Evaluation against WordNet dataset (38 classes, 8910 instances).
[Figure: plot with axes Mean Reciprocal Rank (MRR) and Recall, comparing A8, Adsorption, and WebTables.]
Adsorption is able to assign better class labels to more instances.
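For readers unfamiliar with the metric, Mean Reciprocal Rank can be sketched as follows. This is the standard definition (average of 1/rank of the first correct label per instance); whether the thesis evaluation uses exactly this variant is an assumption, and the function and argument names are hypothetical.

```python
def mean_reciprocal_rank(ranked_labels, gold_labels):
    """Compute MRR over the instances in gold_labels.

    ranked_labels: {instance: [label, ...]} with the best label first.
    gold_labels:   {instance: set of correct labels}.
    An instance with no correct label in its ranking contributes 0.
    """
    if not gold_labels:
        return 0.0
    total = 0.0
    for instance, gold in gold_labels.items():
        for rank, label in enumerate(ranked_labels.get(instance, []), start=1):
            if label in gold:
                total += 1.0 / rank  # reciprocal rank of the first hit
                break
    return total / len(gold_labels)


ranked = {
    "Bob Dylan": ["Singer", "Musician"],  # correct label at rank 2
    "Johnny Cash": ["Musician"],          # correct label at rank 1
    "Kilauea": [],                        # no label assigned
}
gold = {
    "Bob Dylan": {"Musician"},
    "Johnny Cash": {"Musician"},
    "Kilauea": {"Volcano"},
}
mrr = mean_reciprocal_rank(ranked, gold)
```

Here the three reciprocal ranks are 1/2, 1, and 0, so the toy MRR is 0.5.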

70 Can We Improve Class-Instance Acquisition with Additional Semantic Constraints?
[Figure: instance nodes Isaac Newton, Johnny Cash, and Bob Dylan linked to attribute nodes people-person-name, filmmusic_contributor-name, and has_attribute:albums.]
- Instances with shared attributes are likely to be from the same class.
- Graph-based representation makes it easy to incorporate such constraints!

78 Improving Class-Instance Acquisition with YAGO Attributes
Setup: 170 WordNet classes, 10 seeds per class, using Adsorption.
[Figure: Mean Reciprocal Rank (MRR) for three graphs.]
- TextRunner Graph: constructed from TextRunner (UWash) output, 175k nodes, 529k edges
- YAGO Graph: constructed from output of YAGO Knowledge Base, 142k nodes, 777k edges
- TextRunner + YAGO Graph: combined graph, with 237k nodes, 1.3m edges
Additional semantic constraints help improve performance significantly. This further demonstrates the benefit of combining information from multiple sources.

82 Class Instance Acquisition: Recap
- Showed benefits of Adsorption, a highly scalable (parallelizable) graph-based semi-supervised learning (SSL) method, to aggregate extractions from different sources (and methods), resulting in better classes for more instances
- Demonstrated improved performance through additional semantic constraints
Next: Modification to Adsorption and comparison of different graph-based SSL methods.

88 Adsorption & Its Extension
Adsorption uses the following update at iteration (t+1):

\[ \hat{Y}_v^{(t+1)} \leftarrow p_v^{inj} \, Y_v + p_v^{cont} \, B_v^{(t)} + p_v^{abnd} \, r \]

where

\[ B_v^{(t)} = \sum_u \frac{W_{uv}}{\sum_{u'} W_{u'v}} \, \hat{Y}_u^{(t)} \]

- $Y_v$: seed scores; $r$: label priors; $\hat{Y}_v$: estimated scores
- $p_v^{inj}$, $p_v^{cont}$, $p_v^{abnd}$: node-specific random walk probabilities used to control information passing through the node
- $B_v^{(t)}$: weighted neighborhood class scores after iteration (t)
- $p_v^{abnd} \, r$: label uncertainty
Adsorption's key drawback: it is not optimizing any well-defined objective.
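The Adsorption update on this slide can be sketched as a synchronous iteration in Python. This is a minimal illustration under stated assumptions, not the thesis implementation: the prior vector r is assumed to put all its mass on a dummy "none of the above" label, the per-node probabilities are supplied by the caller rather than derived from the graph, and the toy graph, seeds, and probability values are hypothetical.

```python
def adsorption(graph, seeds, probs, iterations=10, dummy="<none>"):
    """Run the Adsorption update for a fixed number of iterations.

    graph: {v: {u: weight}}, symmetric.
    seeds: {v: {label: score}} for seed nodes only.
    probs: {v: (p_inj, p_cont, p_abnd)} for every node.
    """
    labels = {label for scores in seeds.values() for label in scores} | {dummy}
    estimates = {v: dict(seeds.get(v, {})) for v in graph}
    for _ in range(iterations):
        updated = {}
        for v, neighbors in graph.items():
            p_inj, p_cont, p_abnd = probs[v]
            total_weight = sum(neighbors.values())
            scores = {}
            for label in labels:
                # B_v: weight-normalized average of neighbors' current scores.
                b = 0.0
                if total_weight > 0:
                    b = sum(w * estimates[u].get(label, 0.0)
                            for u, w in neighbors.items()) / total_weight
                y = seeds.get(v, {}).get(label, 0.0)  # seed score Y_v
                r = 1.0 if label == dummy else 0.0    # assumed prior r
                scores[label] = p_inj * y + p_cont * b + p_abnd * r
            updated[v] = scores
        estimates = updated
    return estimates


# Toy graph from the Singer/Musician example; weights are the confidences.
EDGES = [
    ("cluster:Musician", "Billy Joel", 0.75),
    ("cluster:Musician", "Johnny Cash", 0.73),
    ("cluster:Singer", "Bob Dylan", 0.95),
    ("cluster:Singer", "Johnny Cash", 0.87),
    ("cluster:Singer", "Billy Joel", 0.82),
]
graph = {}
for u, v, w in EDGES:
    graph.setdefault(u, {})[v] = w
    graph.setdefault(v, {})[u] = w

seeds = {"Johnny Cash": {"Musician": 1.0}, "Billy Joel": {"Musician": 1.0}}
# Hypothetical probabilities: seeds inject and continue, others continue/abandon.
probs = {v: (0.5, 0.5, 0.0) if v in seeds else (0.0, 0.9, 0.1) for v in graph}
scores = adsorption(graph, seeds, probs)
```

After a couple of iterations the Musician label reaches Bob Dylan through the Singer cluster node, which is the inference the earlier slides ask about.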

96 Modified Adsorption (MAD) [Talukdar and Crammer, ECML 2009]
MAD effectively re-weights the graph input to Adsorption and minimizes the following objective, while retaining all of Adsorption's desirable properties (e.g., iterative update, parallelizability, etc.):

MAD objective:
\[ \min_{\hat{Y}} \sum_l \left[ \mu_1 (Y_l - \hat{Y}_l)^\top S (Y_l - \hat{Y}_l) + \mu_2 \, \hat{Y}_l^\top L \hat{Y}_l + \mu_3 \lVert \hat{Y}_l - R_l \rVert_2^2 \right] \]
- First term: match seeds (soft); $Y_l$ are the given seed scores, $\hat{Y}_l$ the estimated scores, and $S$ the seed indicator
- Second term: smooth; $L$ is the Laplacian
- Third term: match priors (regularizer); $R_l$ are the prior scores

LP-ZGL [Zhu et al., 2003] objective:
\[ \min_{\hat{Y}} \sum_l \hat{Y}_l^\top L \hat{Y}_l \quad \text{s.t.} \quad S Y_l = S \hat{Y}_l \]
- Objective: smooth
- Constraint: match seeds (hard)

LP-ZGL can be considered as MAD without regularization.
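To make the three terms of the MAD objective concrete, the sketch below evaluates it for given scores. It is an illustration under assumptions: the third term is taken to be the squared distance between estimated and prior scores, the Laplacian is the plain unnormalized L = D - W of a symmetric weight matrix (MAD itself uses a re-weighted graph), and the function name and argument layout are hypothetical.

```python
def mad_objective(W, S, Y, Y_hat, R, mu1=1.0, mu2=1.0, mu3=1.0):
    """Evaluate the three-term objective for dense list-of-lists inputs.

    W:     n x n symmetric edge weights.
    S:     length-n seed indicator (1 for seed nodes, 0 otherwise).
    Y:     n x m seed scores; Y_hat: n x m estimates; R: n x m priors.
    """
    n = len(W)
    num_labels = len(Y[0])
    total = 0.0
    for l in range(num_labels):
        # Soft seed matching, counted only on seed nodes.
        seed_term = sum(S[v] * (Y[v][l] - Y_hat[v][l]) ** 2 for v in range(n))
        # Y_hat^T L Y_hat equals half the weighted sum of squared differences.
        smooth_term = 0.5 * sum(
            W[u][v] * (Y_hat[u][l] - Y_hat[v][l]) ** 2
            for u in range(n) for v in range(n)
        )
        # Regularizer pulling estimates toward the prior scores.
        prior_term = sum((Y_hat[v][l] - R[v][l]) ** 2 for v in range(n))
        total += mu1 * seed_term + mu2 * smooth_term + mu3 * prior_term
    return total


# Two nodes joined by one edge, one label; node 0 is a seed.
W = [[0.0, 1.0], [1.0, 0.0]]
S = [1, 0]
Y = [[1.0], [0.0]]
Y_hat = [[1.0], [0.5]]
R = [[0.0], [0.0]]
value = mad_objective(W, S, Y, Y_hat, R)
```

With these inputs the seed term is 0, the smoothness term is 0.25, and the prior term is 1.25. Setting `mu3` to 0 drops the regularizer, which mirrors the slide's remark relating LP-ZGL to MAD (LP-ZGL additionally makes the seed match a hard constraint).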

98 Graph-based SSL Comparisons
[Figure: Mean Reciprocal Rank (MRR) versus amount of supervision for LP-ZGL, Adsorption, and MAD on the TextRunner graph (170 WordNet classes; 175k nodes, 529k edges).]

99 Graph-based SSL Comparisons
[Figure: Mean Reciprocal Rank (MRR) versus amount of supervision for LP-ZGL, Adsorption, and MAD on the Freebase-2 graph (192 WordNet classes; 303k nodes, 2.3m edges).]

101 When is MAD most effective?
[Figure: relative increase in MRR by MAD over LP-ZGL, plotted against average degree.]
MAD seems to be more effective in graphs with high average degree, where there is greater need for regularization.

102 Next: Integrating Data across Sources to Answer Queries
[Figure: Q architecture diagram.]

107 Integrating Data Across Sources
[Figure: schema graph over tables P, b, c, d, G, and M, with edge costs such as 0.07 and 0.04.]
- Edges carry join conditions. For example, table P (columns PRO_ID, GENE_NAME) and table b (columns SPECIES, GENE_NAME) are linked by the join condition P.GENE_NAME = b.gene_name.
- Lower cost reflects user preference for the join.
- Query keywords match nodes: Protein, Genes, Malaria (Disease).
- Information need: find protein, gene, and disease information on malaria.

110 Main Questions
[Figure: schema graph with the keyword-matching nodes marked; information need: find protein, gene, and disease information on malaria.]
1. How do we determine which edges to include?
2. How do we adjust edge costs to reflect user preferences (i.e., personalization)?

111 Our Approach: Learn the Queries to Integrate Data [Talukdar et al., VLDB 2008]
1. How do we determine which edges to include?
   - Inference: K-best Steiner tree generation
2. How do we adjust edge costs to reflect user preferences (i.e., personalization)?
   - Learn from user feedback over answers

114 Steiner Trees: Finding Lowest-Cost Queries
- A Steiner tree is a tree of minimal cost (sum of edge costs) in a graph (G) which includes all the required nodes (S).
- Steiner tree is a generalization of Minimum Spanning Tree (MST) [equivalent when S = all vertices in G].

118 Inference: K-Best Steiner Tree Generation
Find Steiner trees connecting the red tables in the schema graph.
[Figure: schema graph over P, b, c, d, G, and M, and two result trees: Q1 (rank 1, cost = 0.4) and Q2 (rank 2, cost = 0.41).]
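K-best Steiner tree generation can be illustrated on a graph this small by brute-force enumeration. The sketch below is not the ILP formulation from the thesis; it simply enumerates edge subsets, which is exponential and only viable for toy graphs. The edge costs of 0.1, the choice of P, G, and M as required nodes, and the rule that every leaf must be a required node are assumptions made for illustration; only the costs 0.07 and 0.04 appear on the slides.

```python
from itertools import combinations


def _connected(subset, nodes):
    """Check that the edges in subset connect all of nodes."""
    adjacency = {n: set() for n in nodes}
    for u, v in subset:
        adjacency[u].add(v)
        adjacency[v].add(u)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for nxt in adjacency[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen == nodes


def k_best_steiner_trees(edges, terminals, k):
    """Return the k cheapest trees covering terminals as (cost, edges)."""
    terminals = set(terminals)
    edge_list = sorted(edges)
    results = []
    for size in range(1, len(edge_list) + 1):
        for subset in combinations(edge_list, size):
            nodes = {n for edge in subset for n in edge}
            # A tree on |V| nodes has exactly |V| - 1 edges and is connected.
            if not terminals <= nodes or len(nodes) != size + 1:
                continue
            if not _connected(subset, nodes):
                continue
            degree = {n: 0 for n in nodes}
            for u, v in subset:
                degree[u] += 1
                degree[v] += 1
            # Skip trees padded with a dangling non-required leaf.
            if any(d == 1 and n not in terminals for n, d in degree.items()):
                continue
            results.append((sum(edges[e] for e in subset), subset))
    results.sort(key=lambda item: item[0])
    return results[:k]


# Hypothetical costs: 0.1 everywhere except the two costs shown on the slide.
EDGES = {
    ("P", "b"): 0.1, ("b", "d"): 0.1, ("d", "G"): 0.1, ("d", "M"): 0.1,
    ("b", "c"): 0.07, ("c", "d"): 0.04,
}
trees = k_best_steiner_trees(EDGES, {"P", "G", "M"}, k=2)
```

Under these assumed costs the two results cost about 0.4 and 0.41, with the second tree routing through c instead of using the direct b-d edge.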

126 Our K-Best Steiner Tree Algorithms
- Exact inference
  - Integer Linear Program (ILP) based formulation, using ideas from multi-commodity network flows
  - Contribution: extending 1-best to K-best
- Approximate inference
  - Shortest Paths Complete Subgraph Heuristic
  - Reduce problem size by pruning the graph, and then apply ILP on the reduced graph
  - Significantly faster; in practice, often gives the optimal solution

129 Exact vs Approximate Inference
Larger schema graph of size (408, 1366) from real sources: GUS, GO, BioSQL.
[Table: K, speedup, error.]
It is possible to do K-best inference in larger graphs quickly and with little or no loss (none in this case).

132 Query Formulation & Execution
Trees can be easily written as executable queries.
- Steiner tree: P, b, d, G, M
- Join condition (example): P.y = b.y
- Conjunctive query: P(x,y) & b(y,z) & d(z,w) & M(w,u) & G(w,v)
We can use Orchestra [Ives+05] to execute queries and record provenance.
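The step from a tree to an executable query can be sketched as string generation. This is a minimal illustration, assuming each tree edge carries an equality join condition and that columns are named after the shared variables of the conjunctive query on the slide; the helper `tree_to_sql` is hypothetical and is not how Orchestra is invoked.

```python
def tree_to_sql(tables, joins):
    """Build a SELECT statement from tree nodes and equality joins.

    tables: table names in the tree.
    joins:  list of ((table, column), (table, column)) pairs, one per edge.
    """
    from_clause = ", ".join(tables)
    where_clause = " AND ".join(
        f"{left_table}.{left_col} = {right_table}.{right_col}"
        for (left_table, left_col), (right_table, right_col) in joins
    )
    return f"SELECT * FROM {from_clause} WHERE {where_clause}"


# Tree P - b - d, with M and G both attached to d, as in
# P(x,y) & b(y,z) & d(z,w) & M(w,u) & G(w,v).
tables = ["P", "b", "d", "M", "G"]
joins = [
    (("P", "y"), ("b", "y")),
    (("b", "z"), ("d", "z")),
    (("d", "w"), ("M", "w")),
    (("d", "w"), ("G", "w")),
]
sql = tree_to_sql(tables, joins)
```

Each shared variable in the conjunctive query becomes one equality predicate, so a tree with four edges yields four join conditions.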

133 Our Approach: Learn the Queries to Integrate Data [Talukdar et al., VLDB 2008]
1. How do we determine which edges to include?
   - Inference: K-best Steiner tree generation
2. How do we adjust edge costs to reflect user preferences (i.e., personalization)?
   - Learn from user feedback over answers

138 Learning New Edge Costs
[Figure: a ranked list of queries (trees over P, b, c, d, G, and M), from top to bottom, each producing answer tuples. The user gives feedback on answers, which is what the user cares about, and the feedback results in an updated edge cost.]

140 Learning: Cost Model Components. Edge Cost = wdb1 + wdb2 + wdef, for an edge between DB1 and DB2.
Feature Name                     Feature Value   Coefficient (Values Learned)
Is this edge incident on DB1?    1               wdb1
Is this edge incident on DB2?    1               wdb2
Default                          1               wdef
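The cost model above is a weighted sum of feature values, which a few lines of code make concrete. The weight values and the idea that a tree's cost is the sum of its edge costs are assumptions for this sketch; only the three feature names come from the slide.

```python
def edge_cost(features, weights):
    """Cost of one edge: sum of feature value times learned coefficient."""
    return sum(value * weights[name] for name, value in features.items())


def tree_cost(edges, weights):
    """Cost of a tree, assumed here to be the sum of its edge costs."""
    return sum(edge_cost(f, weights) for f in edges)


# Hypothetical coefficient values; in Q these would be learned from feedback.
weights = {"wdb1": 0.2, "wdb2": 0.3, "wdef": 0.5}

# An edge between DB1 and DB2 fires all three features with value 1.
edge_db1_db2 = {"wdb1": 1, "wdb2": 1, "wdef": 1}
# An edge that only touches DB1 fires the DB1 and default features.
edge_db1_only = {"wdb1": 1, "wdef": 1}

cost = edge_cost(edge_db1_db2, weights)
total = tree_cost([edge_db1_db2, edge_db1_only], weights)
print(cost, total)
```

Because the cost is linear in the coefficients, changing one coefficient shifts the cost of every edge that shares the feature, which is what lets a single piece of feedback generalize.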

145 Learning: Incorporating User Feedback. Model feedback incorporation as a constrained optimization problem, solved with the MIRA Algorithm (Crammer et al., 2006). [Equation not recovered; its annotated parts are: new model parameters, current model parameters, tree cost, loss, the tree that the user likes, and the tree that the user doesn't like.]
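A simplified version of such an update can be sketched as follows. It assumes a single constraint (the disliked tree must cost at least `loss` more than the liked tree) and a linear tree cost, and uses the standard closed-form solution for that case. The slide's actual formulation is not recoverable from the text and may use several constraints or bounds on the weights, so treat the function, feature names, and numbers here as illustrative.

```python
def dot(w, f):
    """Linear cost of a feature vector f under weights w."""
    return sum(w.get(k, 0.0) * v for k, v in f.items())


def mira_update(w, f_good, f_bad, loss):
    """Smallest change to w such that cost(bad) - cost(good) >= loss."""
    keys = set(f_good) | set(f_bad)
    delta = {k: f_bad.get(k, 0.0) - f_good.get(k, 0.0) for k in keys}
    norm_sq = sum(v * v for v in delta.values())
    if norm_sq == 0.0:
        return dict(w)  # identical feature vectors: nothing to separate
    margin = dot(w, delta)  # current cost(bad) - cost(good)
    tau = max(0.0, (loss - margin) / norm_sq)
    new_w = dict(w)
    for k, v in delta.items():
        new_w[k] = new_w.get(k, 0.0) + tau * v
    return new_w


# Hypothetical feature totals for two trees (summed over their edges).
w = {"wdb1": 1.0, "wdb2": 1.0, "wdef": 1.0}
f_good = {"wdb2": 2, "wdef": 2}   # tree the user likes, currently cost 4
f_bad = {"wdb1": 1, "wdef": 1}    # tree the user doesn't like, currently cost 2

new_w = mira_update(w, f_good, f_bad, loss=1.0)
print(dot(new_w, f_good), dot(new_w, f_bad))
```

After the update the liked tree is cheaper than the disliked one by exactly the requested loss, while the weights move no further than needed.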

147 Results: Learning Expert Ranking. Graph: start with the BioGuide [Cohen-Boulakia+07] bio sources, with 28 vertices and 96 edges. Goal: learn BioGuide's expert's rankings. Methodology: all weights are set to default; a sequence of 25 queries is issued; for each, user feedback identifies and promotes a tuple from the gold standard answer. [Plot: error vs. total queries seen; labels G1, P3.] After 40-60% of the searches, Q finds the top query immediately. For each search, a single feedback is enough to learn the top query.

148 Our Approach: Learn the Queries to Integrate Data [Talukdar et al., VLDB 2008]. 1. How do we determine which edges to include? Inference: K-Best Steiner Tree Generation. 2. How do we adjust edge costs to reflect user preferences (i.e., personalization)? Learn from user feedback over answers.

149 Next: Combining and Adding Sources. [Figure: overview diagram with small example graphs over nodes A to G; labels: Unstructured Data, Structured Data, IE, Association (Edge) Discovery, Ranking, Sources, Relevance.]

155 Automatically Adding New Sources in Q [Talukdar, Ives, and Pereira, SIGMOD 2010]. [Figure: schema graph over P, b, c, d, G, M with edge costs such as 0.07 and 0.04; a new source n is added, with its candidate edges to existing nodes marked "?".] How to discover new associations automatically? How to correct mistakes made during automatic association discovery?

162 Discovering New Associations. Any off-the-shelf schema matcher may be used. COMA++ (metadata level) [Do and Rahm, 2007]: needs pairwise comparisons. Label Propagation (instance level), proposed in the thesis: pairwise comparisons are not necessary. How to correct automatic schema matching errors? By exploiting the end user's expertise in the data, through flagging bad answers, without requiring administrator-based knowledge of metadata: administrators don't take user context into account, and their input is often expensive to obtain.

164 Using Q to Correct Alignment Errors. Edge Cost = wdb1 + wdb2 + wdef + 0.90 * wcoma++ + 0.7 * wlp, for an edge between DB1 and DB2. The last two rows are the alignment feature weights.
Feature Name                     Feature Value   Coefficient (Values Learned)
Is this edge incident on DB1?    1               wdb1
Is this edge incident on DB2?    1               wdb2
Default                          1               wdef
COMA++ Aligned                   0.90            wcoma++
LabelProp Aligned                0.7             wlp

166 Correcting Schema Matching Errors with Q. Learning with Q helps correct schema matching errors.

170 Reducing Pairwise Comparisons during Association Discovery. View Based Aligner. [Figure: Keyword Cost Neighborhood, with a candidate new source marked "New Source?"; node labels include GO, InterPro2GO, InterPro Entry 2 Pub, InterPro Entry, and InterPro Pub, with attributes acc, term_id, go_id, entry_ac, pub_id, name, and title.] A schema graph with 5 sources and 2 keywords: "term" and "plasma membrane". The shaded oval includes all nodes reachable with cost 2 from at least one of the keywords. Prune comparisons based on whether they are likely to affect query results, as otherwise there will be no feedback from the user.
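The neighborhood idea can be sketched with a cost-bounded, multi-source shortest-path search. This assumes that "reachable with cost 2" means a cheapest path of total cost at most 2, and that a new source's columns are then compared only against tables inside the neighborhood. The graph, its edge costs, the column assignments, and the function name are hypothetical stand-ins for the figure.

```python
import heapq


def cost_neighborhood(graph, keywords, max_cost):
    """Map each node within max_cost of any keyword to its cheapest cost."""
    best = {k: 0.0 for k in keywords}
    heap = [(0.0, k) for k in keywords]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > best.get(u, float("inf")):
            continue  # stale heap entry
        for v, c in graph.get(u, {}).items():
            nd = d + c
            if nd <= max_cost and nd < best.get(v, float("inf")):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return best


# Hypothetical schema graph loosely modeled on the slide's five sources.
graph = {
    "kw:term": {"GO_term": 1.0},
    "kw:plasma membrane": {"GO_term.name": 0.25},
    "GO_term.name": {"GO_term": 0.0, "kw:plasma membrane": 0.25},
    "GO_term": {"kw:term": 1.0, "GO_term.name": 0.0, "InterPro2GO": 1.0},
    "InterPro2GO": {"GO_term": 1.0, "InterPro_Entry": 1.0},
    "InterPro_Entry": {"InterPro2GO": 1.0, "InterPro_Entry2Pub": 1.0},
    "InterPro_Entry2Pub": {"InterPro_Entry": 1.0, "InterPro_Pub": 1.0},
    "InterPro_Pub": {"InterPro_Entry2Pub": 1.0},
}
hood = cost_neighborhood(graph, ["kw:term", "kw:plasma membrane"], max_cost=2.0)

# Hypothetical columns per table, and columns of a new source to be aligned.
tables = {
    "GO_term": ["acc", "name"],
    "InterPro2GO": ["go_id", "entry_ac"],
    "InterPro_Entry": ["entry_ac", "name"],
    "InterPro_Entry2Pub": ["entry_ac", "pub_id"],
    "InterPro_Pub": ["pub_id", "title"],
}
new_cols = ["id", "label"]

exhaustive = [(n, f"{t}.{c}") for t, cols in tables.items() for c in cols for n in new_cols]
pruned = [(n, f"{t}.{c}") for t, cols in tables.items() if t in hood
          for c in cols for n in new_cols]
print(len(exhaustive), len(pruned))
```

Only tables that a query for these keywords could plausibly touch are compared, which is the sense in which comparisons unlikely to affect query results are skipped.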

171 Reducing Pairwise Comparisons during Association Discovery. [Plot: number of pairwise comparisons vs. number of tables in the schema graph, for Exhaustive and ViewBasedAligner.]

175 Summary of Contributions. Weakly-supervised acquisition of class-instance pairs from unstructured and structured sources: a scalable method, suitable for large data volumes. A method for learning data-integrating queries that takes the user's information need into account: it removes the need for expert input or heavy human supervision, runs at interactive speed, incorporates new sources automatically, and corrects wrong associations.

176 Related Work. Extractions from unstructured data: [Etzioni et al., 2005; Van Durme and Paşca, 2008], ... Extractions from semi-structured data: SEAL [Wang and Cohen, 2007], and its extensions. Graph-based SSL methods: LP-ZGL [Zhu et al., 2003], LGC [Bengio et al., 2007], Adsorption [Baluja et al., 2008], ... Keyword search over databases: BANKS [Bhalotia et al., 2002], BLINKS [He et al., 2007], BioGuide [Cohen-Boulakia+07].

179 Future Work: Complete the Loop. [Figure: loop connecting IE, Association (Edge) Discovery, and Ranking; labels: Sources, Relevance.] Correct extraction errors based on their effect on the final answers, as measured by user feedback over those answers.

184 More Future Work. Incorporation of other types of semantic constraints in class-instance acquisition. Graph-based SSL methods for other types (non IS-A) of relation extraction. Roll Q out to life scientists and get their feedback, and also apply Q to non-life-science datasets. Investigate user adaptation in Q: use a model trained for one user to initialize another user's model, exploiting any available user similarity information.

185 Acknowledgements. Advisors: Zack Ives, Mark Liberman, and Fernando Pereira. Committee: William Cohen, Aravind Joshi, Ben Taskar, and Lyle Ungar. My co-authors: Rahul Bhagat, Thorsten Brants, Koby Crammer, Sudipto Guha, Marie Jacob, Salman Mehmood, Marius Pasca, Deepak Ravichandran, Joseph Reisinger. DARPA, Google, NSF grant #IIS

187 Thank You!

191 Current Approaches. Supervised Named Entity Recognition (NER): deals with only a limited number of coarse classes; very resource intensive, since labeled data is expensive! Pattern-based extraction: textual patterns ("analyst at <ENT>.") are effective only in repetitive contexts [Bellare et al., 2007]; extractions are usually high-precision, low-recall!

192 Context Pattern based Extraction [Talukdar et al., CoNLL 2006]. Partial entity lists are extended into longer lists using context patterns induced from unstructured text. The extended lists are used as features in a supervised tagger, improving its performance. Example patterns: "analyst at -ENT- ."; "series against the -ENT- tonight"; "Today's Schaeffer's Option Activity Watch features -ENT- (". Example entities: Boston Red Sox, St. Louis Cardinals, Chicago Cubs, Florida Marlins.
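A deliberately naive sketch of the list-extension loop is shown below. It assumes a pattern is just the single token to the left and right of a seed mention, which is far simpler than the pattern induction in the cited work; the sentences, function names, and thresholds are made up for illustration.

```python
from collections import Counter


def induce_patterns(sentences, seeds, min_count=2):
    """Collect (left token, right token) contexts seen around seed mentions."""
    counts = Counter()
    for sent in sentences:
        tokens = sent.split()
        for seed in seeds:
            seed_toks = seed.split()
            n = len(seed_toks)
            # Require one token on each side of the mention.
            for i in range(1, len(tokens) - n):
                if tokens[i:i + n] == seed_toks:
                    counts[(tokens[i - 1], tokens[i + n])] += 1
    return [p for p, c in counts.items() if c >= min_count]


def apply_patterns(sentences, patterns, max_len=4):
    """Extract token spans of up to max_len tokens that sit inside a pattern."""
    found = set()
    for sent in sentences:
        tokens = sent.split()
        for left, right in patterns:
            for i in range(1, len(tokens)):
                if tokens[i - 1] != left:
                    continue
                for j in range(i + 1, min(i + max_len, len(tokens) - 1) + 1):
                    if tokens[j] == right:
                        found.add(" ".join(tokens[i:j]))
                        break
    return found


# Hypothetical corpus built around the slide's example pattern and entities.
sentences = [
    "the series against the Boston Red Sox tonight was",
    "a series against the Chicago Cubs tonight at",
    "the series against the Florida Marlins tonight is",
]
seeds = ["Boston Red Sox", "Chicago Cubs"]

patterns = induce_patterns(sentences, seeds)
new_entities = apply_patterns(sentences, patterns) - set(seeds)
print(patterns, new_entities)
```

Two seeds are enough to induce the "the ... tonight" context here, and applying it recovers a third team that was not in the seed list.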

193 New Extractions found by Adsorption. A few non-seed instances found by Adsorption, by class (total classes: 9081):
Scientific Journals: Journal of Physics, Nature, Structural and Molecular Biology, Sciences Sociales et sante, Kidney and Blood Pressure Research, American Journal of Physiology-Cell Physiology, ...
NFL Players: Tony Gonzales, Thabiti Davis, Taylor Stubblefield, Ron Dixon, Rodney Hannan, ...
Book Publishers: Small Night Shade Books, House of Ansari Press, Highwater Books, Distributed Art Publishers, Cooper Canyon Press, ...

194 Graph Stats. Statistics of graphs used in class-instance acquisition experiments. [Table not recovered.]

196 Improving Class-Instance Acquisition with Additional Attributes. 170 WordNet classes, 10 seeds per class. [Plot: Mean Reciprocal Rank (MRR) for LP-ZGL, Adsorption, and MAD on three graphs: TextRunner, YAGO, and TextRunner + YAGO; axis label: Amount of Supervision.] Additional semantic constraints in the form of (instance, attribute) edges from YAGO help improve performance significantly!
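For readers unfamiliar with the metric on these plots, the following sketch computes Mean Reciprocal Rank in its standard form: the reciprocal of the rank of the first correct label, averaged over test items, with 0 when no correct label is ranked. The exact variant used in the experiments may differ, and the rankings and gold labels below are invented.

```python
def mean_reciprocal_rank(rankings, gold):
    """Average over items of 1 / rank of the first correct label (0 if none)."""
    total = 0.0
    for item, ranked_labels in rankings.items():
        for rank, label in enumerate(ranked_labels, start=1):
            if label in gold[item]:
                total += 1.0 / rank
                break
    return total / len(rankings)


# Hypothetical ranked class predictions for three instances.
rankings = {
    "Nature": ["scientific_journal", "book_publisher"],
    "Tony Gonzales": ["book_publisher", "nfl_player"],
    "Highwater Books": ["scientific_journal", "nfl_player"],
}
gold = {
    "Nature": {"scientific_journal"},
    "Tony Gonzales": {"nfl_player"},
    "Highwater Books": {"book_publisher"},
}

mrr = mean_reciprocal_rank(rankings, gold)
print(mrr)
```

The three items contribute 1, 1/2, and 0 respectively, so a higher MRR means correct classes tend to appear nearer the top of each ranking.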

198 Effect of Class Similarity Constraints. TextRunner graph, 170 WordNet classes; graph with 175k nodes, 529k edges. [Plot: Mean Reciprocal Rank (MRR) vs. amount of supervision for LP-ZGL, Adsorption, MAD, and MADDL.] Class similarity constraints are helpful; more investigation is necessary!

199 Effect of Class Sparsity Constraints. [Plot: Effect of Per-node Sparsity Constraint; Mean Reciprocal Rank (MRR) vs. maximum allowed classes per node.]

200 SVM Comparison. Freebase-2 graph, 192 WordNet classes; graph with 303k nodes, 2.3m edges. [Plot: Mean Reciprocal Rank (MRR) vs. amount of supervision for LP-ZGL, Adsorption, MAD, and SVM.]

201 SVM Comparison. TextRunner graph, 170 WordNet classes; graph with 175k nodes, 529k edges. [Plot: Mean Reciprocal Rank (MRR) vs. amount of supervision for LP-ZGL, Adsorption, MAD, and SVM.]

203 Results: Time to generate K-best Queries. Schema graph of size (28, 96) from BioGuide (Cohen-Boulakia et al., 2007). [Table: K vs. Time (s); values not recovered.] It is possible to generate the top queries in interactive range. Query execution is pipelined.

207 Discovering New Associations. COMA++ (metadata level) [Do and Rahm, 2007]: pairwise comparisons. Using Label Propagation (instance level): pairwise comparisons not necessary. [Figure: label propagation over a graph that links the columns Interpro2GO.go_id and GO term.acc through shared "GO:" values. Each column starts with its own label at score 1.0 (go_id and acc respectively); after propagation, nodes carry mixed label scores such as go_id 0.8 / acc 0.2, go_id 0.51 / acc 0.49, and go_id 0.25 / acc 0.75.]
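The instance-level idea in the figure can be approximated with a small propagation loop. This is a simplified scheme written for illustration, not the algorithm from the thesis: each column node keeps injecting its own label with a fixed weight, every node otherwise averages its neighbors' scores, and columns that share values end up carrying each other's labels. The value identifiers, the `injection` parameter, and the function name are assumptions.

```python
def propagate_labels(edges, seeds, injection=0.5, iterations=20):
    """edges: undirected (u, v) pairs; seeds: {node: label}.
    Returns {node: {label: score}} after synchronous averaging updates."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    scores = {n: ({seeds[n]: 1.0} if n in seeds else {}) for n in nbrs}
    for _ in range(iterations):
        new_scores = {}
        for n, ns in nbrs.items():
            avg = {}
            for m in ns:
                for label, s in scores[m].items():
                    avg[label] = avg.get(label, 0.0) + s / len(ns)
            if n in seeds:
                # Seeded nodes blend their own label with the neighbor average.
                mixed = {l: (1 - injection) * s for l, s in avg.items()}
                mixed[seeds[n]] = mixed.get(seeds[n], 0.0) + injection
                new_scores[n] = mixed
            else:
                new_scores[n] = avg
        scores = new_scores
    return scores


# Column nodes linked to the data values they contain (placeholder values).
edges = [
    ("Interpro2GO.go_id", "GO:A"),
    ("Interpro2GO.go_id", "GO:B"),
    ("GO_term.acc", "GO:A"),
    ("GO_term.acc", "GO:B"),
    ("GO_term.acc", "GO:C"),
]
seeds = {"Interpro2GO.go_id": "go_id", "GO_term.acc": "acc"}

scores = propagate_labels(edges, seeds)
print(scores["Interpro2GO.go_id"])
print(scores["GO_term.acc"])
```

A column that ends with a sizable score for another column's label is a candidate alignment, and no explicit column-by-column comparison was needed to find it.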

208 Reusing Feedback Helps! [Plot: precision-recall curves for Q with different levels of feedback; series: Q (1x1), Q (10x1), Q (10x2), Q (10x4), and "Adsorption and COMA++ Averaged"; axes: precision vs. recall.]

209 Reducing Pairwise Comparisons during Association Discovery. Total of 18 tables in the schema graph. [Plot: number of pairwise comparisons by alignment strategy (Exhaustive, ViewBasedAligner), with no additional filter and with a value overlap filter.]

210 Reducing Pairwise Comparisons during Association Discovery. [Plot: number of pairwise column comparisons for increasing schema graph size, vs. number of sources in the schema graph, for Exhaustive, ViewBasedAligner, and PreferentialAligner.] Number of pairwise attribute comparisons as we scale the size of the search graph (avg. over the introduction of 40 new sources).

Graph-based Semi-Supervised Learning as Optimization. Partha Pratim Talukdar, CMU. Machine Learning with Large Datasets (10-605), April 3, 2012.