Graph-Based Weakly-Supervised Methods for Information Extraction & Integration. Partha Pratim Talukdar, University of Pennsylvania

1 Graph-Based Weakly-Supervised Methods for Information Extraction & Integration. Partha Pratim Talukdar, University of Pennsylvania. Dissertation Defense, February 24, 2010

2 End Goal: We should be able to answer any question for which data exists in the dataset.

5 Query: alma maters of US mayors. There is probably no single page that can answer this query exactly.

8 Google Squared? 28 mayors listed out of thousands, and the alma mater of only four mayors found. An important first step!

12 Often, users need information that combines data from multiple sites (pages).
Example source text: "... holds a bachelor of science degree from the University of Alabama."
Information Extraction (IE) turns such pages into tables.
Table 1 (mayors):
  Mayor            City        State
  Bill Ham Jr.     Auburn      AL
  Edward May       Bessemer    AL
  Loretta Spencer  Huntsville  AL
Table 2 (alma maters):
  Person           Alma mater
  Loretta Spencer  Univ. of Alabama
  ...
Information Integration (II) combines the extracted tables:
  Mayor            City        State  Alma mater
  Bill Ham Jr.     Auburn      AL     ...
  Edward May       Bessemer    AL     ...
  Loretta Spencer  Huntsville  AL     Univ. of Alabama

13 ... or from tables, as in the life sciences. Example user keyword query: genes, proteins, malaria.
[Figure: one unstructured source (e.g., a research paper) and several structured sources: Disease DB1, Disease DB2, Disease DB3, Gene DB1, Gene DB2, Protein DB1, Protein DB2.]

16 Current Solution in Life Sciences: Hand-Programmed WebForms (over structured data), with a Small Number of Sources
Human-written SQL powering a WebForm:
  SELECT distinct cast(aseq.assembly_na_sequence_id as varchar2(32)) as na_sequence_id,
         '@PROJECT_ID@' as project_id,
         count(distinct aseq.na_sequence_id) as libcount
  FROM DoTS.EST@MUS_LINK@ est,
       DoTS.Library@MUS_LINK@ lib,
       DoTS.AssemblySequence@MUS_LINK@ aseq,
       epcondata.isexpressed ie
  WHERE lib.dbest_id = $$panclibraryp$$
    AND lib.library_id = est.library_id
    AND est.na_sequence_id = aseq.na_sequence_id
    AND aseq.assembly_na_sequence_id is not NULL
    AND aseq.assembly_na_sequence_id = ie.na_sequence_id
  GROUP BY aseq.assembly_na_sequence_id
Drawbacks:
- Requires access to programmers: expensive and not scalable
- Exploits only a small subset of available data sources
- Not suitable for discovery mode!

21 What is Needed to Satisfy User Information Need?
Take standard keyword queries, but exploit semantic information to:
- combine data from within (IE) and across (II) sources (documents and tables)
- take the user's information need (personalization/context) into account
Existing approaches require extensive human input (e.g., annotations, mediated schemas), which doesn't scale.
My thesis addresses the challenges of doing these at scale, by:
- learning from small amounts of human annotation, specification, or feedback
- generalizing to a large number of data items and schemas
... through the use of graph-based methods.

22 Thesis Statement: Graph-based representation of data and learning over such graphs result in effective and scalable methods for Information Extraction (IE) and Integration (II).

26 This Talk: Two Parts
1. Information Extraction (IE): class-instance acquisition on a large scale using graph-based methods, and their comparisons
2. Information Integration (II): search- and feedback-driven information integration; automatically adding new sources, and feedback-based association correction
System proposed in my thesis: Q

29 Q: Overall Architecture
[Figure: the Q system over unstructured and structured data. Labeled components: IE, Association (Edge) Discovery, Ranking, and Sources Relevance, drawn over a graph with nodes A to G.]

30 Next: Information Extraction (Class Instance Acquisition)
[Figure: the Q overall architecture diagram, repeated.]

34 Class Instance Acquisition
Inputs: unlabeled data (Medline, newswire, Web) and partial instance lists, for example:
- Car Company: Toyota, Honda, Ford, ...
- Volcano: Kilauea, Mt. Fuji, Mt. Andrus
- US Cities: New York, Philadelphia, Boston
Can we combine all these sources to build a large repository of class-instance pairs?

41 State-of-the-art
Several approaches for class instance acquisition exist:
- unstructured data (A8 [van Durme and Pasca, 2008])
- semi-structured data ([Wang and Cohen, 2007])
- structured data (WebTables (WT) [Cafarella et al., 2008])
A particular extraction might be easier in one data source than in another.
Can we combine extractions from different sources (and methods) and learn from the combined extractions to improve coverage?

49 Our Approach: Graph-based Expansion
First-phase extractions, listed as cluster ID: instance (extraction confidence):
- WT, Musician: Billy Joel (0.75), Johnny Cash (0.73)
- A8, Singer: Bob Dylan (0.95), Johnny Cash (0.87), Billy Joel (0.82)
[Figure: graph with cluster nodes Singer and Musician connected to instance nodes Bob Dylan, Johnny Cash, and Billy Joel. Edges are weighted by extraction confidence (e.g., 0.87). Seed class Musician 1.0 is attached to Johnny Cash and Billy Joel.]
Can we infer that Bob Dylan is also a Musician, given that this is missing from the current extractions?

54 Observations on the Constructed Graph
[Figure: cluster nodes Singer and Musician; instance nodes Bob Dylan, Johnny Cash, and Billy Joel; seed class Musician 1.0.]
- Cluster nodes correspond to clusters extracted by the first-phase extractors.
- Smoothness: nodes connected by an edge should be assigned similar classes, as enforced by the edge weight.
- Coupling node: softly forces all instance nodes connected to it to have similar class labels, exploiting the smoothness requirement.
- Seed classes can be different from the cluster IDs of the first-phase extractors (A8, WT, etc.).
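
The graph construction described on these slides can be sketched in a few lines of Python. This is only an illustration under assumptions: the tuple layout, the `build_graph` helper, and the attribution of each cluster to WT or A8 are choices made here, not details confirmed by the talk. Confidence values are the ones shown on the slides.

```python
from collections import defaultdict

# Hypothetical first-phase extractions: (extractor, cluster ID, instance, confidence).
extractions = [
    ("WT", "Musician", "Billy Joel", 0.75),
    ("WT", "Musician", "Johnny Cash", 0.73),
    ("A8", "Singer", "Bob Dylan", 0.95),
    ("A8", "Singer", "Johnny Cash", 0.87),
    ("A8", "Singer", "Billy Joel", 0.82),
]


def build_graph(rows):
    """Build an undirected weighted graph as {node: {neighbor: weight}}.

    Each cluster becomes a coupling node linked to its instances, and the
    extraction confidence is used as the edge weight.
    """
    graph = defaultdict(dict)
    for extractor, cluster, instance, confidence in rows:
        cluster_node = ("cluster", extractor, cluster)
        instance_node = ("instance", instance)
        graph[cluster_node][instance_node] = confidence
        graph[instance_node][cluster_node] = confidence
    return dict(graph)


graph = build_graph(extractions)

# Seed classes are attached to instance nodes and need not match cluster IDs.
seeds = {
    ("instance", "Johnny Cash"): {"Musician": 1.0},
    ("instance", "Billy Joel"): {"Musician": 1.0},
}
```

Because Johnny Cash and Billy Joel appear in both clusters, each of their instance nodes has two neighbors, which is what lets labels flow from the Musician seeds to Bob Dylan through the Singer coupling node.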

60 Our Approach: Graph-based Expansion [Talukdar et al., EMNLP 2008]
We use Adsorption [Baluja et al., 2008] for label propagation (more details shortly).
[Figure: label propagation on the graph of Singer and Musician cluster nodes and the instance nodes Bob Dylan, Johnny Cash, and Billy Joel. Initialization: seed labels Musician 1.0. Iteration 1: a derived label Musician 0.8 appears. Iteration 2: a derived label Musician 0.6 appears.]

66 Class Assignment for Fixed Instances
- A8: 924k (class, instance) pairs extracted from 100m web documents.
- WebTables: 74m (class, instance) pairs extracted from the WebTables dataset.
- Adsorption: run on a graph with 1.4m nodes and 75m edges.
Evaluation against a WordNet dataset (38 classes, 8910 instances).
[Figure: Mean Reciprocal Rank (MRR) versus Recall for A8, WebTables, and Adsorption.]
Adsorption is able to assign better class labels to more instances.
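
For readers unfamiliar with the metric, here is a minimal sketch of Mean Reciprocal Rank under a common definition: the average over gold instances of 1/rank of the correct class, counting 0 when the correct class is not returned. The slides do not spell out the exact variant used in the evaluation, so treat the function and the toy data as assumptions.

```python
def mean_reciprocal_rank(ranked_classes, gold):
    """Average of 1/rank of the gold class over all gold instances.

    ranked_classes: {instance: [class, ...]} with the best class first.
    gold: {instance: correct class}.
    An instance whose gold class is missing from its ranking contributes 0.
    """
    if not gold:
        return 0.0
    total = 0.0
    for instance, gold_class in gold.items():
        ranking = ranked_classes.get(instance, [])
        if gold_class in ranking:
            total += 1.0 / (ranking.index(gold_class) + 1)
    return total / len(gold)


# Toy data (hypothetical rankings, not results from the talk).
ranked = {
    "Bob Dylan": ["Singer", "Musician"],   # gold class at rank 2
    "Johnny Cash": ["Musician"],           # gold class at rank 1
    "Kilauea": [],                         # no class assigned
}
gold = {"Bob Dylan": "Musician", "Johnny Cash": "Musician", "Kilauea": "Volcano"}

mrr = mean_reciprocal_rank(ranked, gold)  # (0.5 + 1.0 + 0.0) / 3
```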

70 Can We Improve Class-Instance Acquisition with Additional Semantic Constraints?
[Figure: instance nodes Isaac Newton, Johnny Cash, and Bob Dylan; class nodes people-person-name and filmmusic_contributor-name; an attribute node (has_attribute: albums).]
Instances with shared attributes are likely to be from the same class.
Graph-based representation makes it easy to incorporate such constraints!

78 Improving Class-Instance Acquisition with YAGO Attributes
Setup: 170 WordNet classes, 10 seeds per class, using Adsorption.
Graphs compared:
- TextRunner graph: constructed from TextRunner (UWash) output, 175k nodes, 529k edges
- YAGO graph: constructed from output of the YAGO knowledge base, 142k nodes, 777k edges
- TextRunner + YAGO graph: combined graph, 237k nodes, 1.3m edges
[Figure: Mean Reciprocal Rank (MRR) for the three graphs.]
Additional semantic constraints help improve performance significantly. This further demonstrates the benefit of combining information from multiple sources.

82 Class Instance Acquisition: Recap
- Showed benefits of Adsorption, a highly scalable (parallelizable) graph-based semi-supervised learning (SSL) method, for aggregating extractions from different sources (and methods), resulting in better classes for more instances
- Demonstrated improved performance through additional semantic constraints
Next: Modification to Adsorption and comparison of different graph-based SSL methods.

88 Adsorption & Its Extension
[Figure: a node v with its seed scores, label priors, and estimated scores.]
Adsorption uses the following update at iteration (t+1):
  $\hat{Y}_v^{(t+1)} \leftarrow p_v^{inj} \, Y_v + p_v^{cont} \, B_v^{(t)} + p_v^{abnd} \, r$
where
  $B_v^{(t)} = \sum_u \frac{W_{uv}}{\sum_{u'} W_{u'v}} \, \hat{Y}_u^{(t)}$
- $p_v^{inj}$, $p_v^{cont}$, $p_v^{abnd}$: node-specific random walk probabilities used to control information passing through the node.
- $B_v^{(t)}$: weighted neighborhood class scores after iteration t.
- $r$: label uncertainty.
Adsorption's key drawback: it is not optimizing any well-defined objective.
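
The update rule on this slide can be sketched as a short synchronous loop. This is an illustrative sketch, not the implementation from the talk: the probabilities p_inj, p_cont, and p_abnd are passed in as plain dictionaries (the slides do not say how they are set), the dummy label name is made up, and r is assumed to put all of its mass on that dummy label.

```python
def adsorption(graph, seeds, labels, p_inj, p_cont, p_abnd, iterations=10):
    """Run the Adsorption update for a fixed number of iterations.

    graph: {node: {neighbor: weight}} (undirected, so W_uv == W_vu).
    seeds: {node: {label: score}} for seed nodes only.
    Returns {node: {label: estimated score}}, including a dummy label.
    """
    dummy = "<dummy>"  # hypothetical name for the "none of the above" label
    all_labels = list(labels) + [dummy]
    # r: label uncertainty, assumed here to be all mass on the dummy label.
    r = {l: (1.0 if l == dummy else 0.0) for l in all_labels}
    Y = {v: {l: seeds.get(v, {}).get(l, 0.0) for l in all_labels} for v in graph}
    Y_hat = {v: dict(Y[v]) for v in graph}

    for _ in range(iterations):
        updated = {}
        for v, neighbors in graph.items():
            total = sum(neighbors.values())
            # B_v: weight-normalized average of the neighbors' current scores.
            B = {l: 0.0 for l in all_labels}
            for u, w in neighbors.items():
                for l in all_labels:
                    B[l] += (w / total) * Y_hat[u][l]
            updated[v] = {
                l: p_inj[v] * Y[v][l] + p_cont[v] * B[l] + p_abnd[v] * r[l]
                for l in all_labels
            }
        Y_hat = updated
    return Y_hat


# Tiny example: one coupling node (Singer) and two instances, one of them seeded.
graph = {
    "Singer": {"Bob Dylan": 0.95, "Johnny Cash": 0.87},
    "Bob Dylan": {"Singer": 0.95},
    "Johnny Cash": {"Singer": 0.87},
}
seeds = {"Johnny Cash": {"Musician": 1.0}}
# Hypothetical probabilities: seed nodes inject, other nodes mostly continue.
p_inj = {v: (0.5 if v in seeds else 0.0) for v in graph}
p_cont = {v: (0.5 if v in seeds else 0.9) for v in graph}
p_abnd = {v: (0.0 if v in seeds else 0.1) for v in graph}

scores = adsorption(graph, seeds, ["Musician"], p_inj, p_cont, p_abnd)
```

After a few iterations the unseeded node Bob Dylan receives a non-zero Musician score through the shared Singer node, while the seeded node keeps the higher score.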

96 Modified Adsorption (MAD) [Talukdar and Crammer, ECML 2009]
MAD effectively re-weights the graph input to Adsorption and minimizes the following objective, while retaining all of Adsorption's desirable properties (e.g., iterative update, parallelizability, etc.).
MAD objective:
  $\min_{\hat{Y}} \sum_l \left[ \mu_1 \, (Y_l - \hat{Y}_l)^\top S \, (Y_l - \hat{Y}_l) + \mu_2 \, \hat{Y}_l^\top L \, \hat{Y}_l + \mu_3 \, \lVert \hat{Y}_l - R_l \rVert^2 \right]$
- First term, match seeds (soft): $Y_l$ are the given seed scores, $\hat{Y}_l$ the estimated scores, and $S$ the seed indicator.
- Second term, smoothness: $L$ is the graph Laplacian.
- Third term, match priors (regularizer): $R_l$ are the prior scores.
LP-ZGL [Zhu et al., 2003] objective:
  $\min_{\hat{Y}} \sum_l \hat{Y}_l^\top L \, \hat{Y}_l \quad \text{s.t.} \quad S Y_l = S \hat{Y}_l$
- Objective: smoothness. Constraint: match seeds (hard).
LP-ZGL can be considered as MAD without regularization.
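
To make the three terms of the MAD objective concrete, the sketch below evaluates them for given score matrices. It is a sketch under assumptions: it uses the unnormalized Laplacian L = D - W of whatever symmetric weight matrix is passed in (MAD itself works on a re-weighted graph), it takes the prior term to be a squared Euclidean distance, and the function name and argument layout are made up for illustration.

```python
def mad_objective(W, Y, Y_hat, R, seed, mu1=1.0, mu2=1.0, mu3=1.0):
    """Evaluate seed-fit, smoothness, and prior terms summed over labels.

    W: symmetric n x n weight matrix (list of lists).
    Y, Y_hat, R: n x m matrices of seed, estimated, and prior scores.
    seed: length-n 0/1 list, the diagonal of the seed indicator S.
    """
    n = len(W)
    m = len(Y[0])
    total = 0.0
    for l in range(m):
        # (Y_l - Yhat_l)^T S (Y_l - Yhat_l): squared error on seed nodes only.
        fit = sum(seed[v] * (Y[v][l] - Y_hat[v][l]) ** 2 for v in range(n))
        # Yhat_l^T L Yhat_l with L = D - W equals
        # 0.5 * sum_{u,v} W_uv * (Yhat_u - Yhat_v)^2 for symmetric W.
        smooth = 0.5 * sum(
            W[u][v] * (Y_hat[u][l] - Y_hat[v][l]) ** 2
            for u in range(n)
            for v in range(n)
        )
        # ||Yhat_l - R_l||^2: squared distance to the prior scores.
        prior = sum((Y_hat[v][l] - R[v][l]) ** 2 for v in range(n))
        total += mu1 * fit + mu2 * smooth + mu3 * prior
    return total


# Two connected nodes, one label; node 0 is a seed with score 1.
W = [[0.0, 1.0], [1.0, 0.0]]
Y = [[1.0], [0.0]]
R = [[0.0], [0.0]]
seed = [1, 0]
value = mad_objective(W, Y, [[0.5], [0.5]], R, seed)  # 0.25 + 0.0 + 0.5
```

Setting mu3 to 0 and letting mu1 grow large pushes the minimizer toward the hard-constrained LP-ZGL solution, which matches the remark that LP-ZGL can be seen as MAD without regularization.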

98 Graph-based SSL Comparisons
TextRunner graph, 170 WordNet classes; graph with 175k nodes, 529k edges.
[Figure: Mean Reciprocal Rank (MRR) versus amount of supervision for LP-ZGL, Adsorption, and MAD.]

99 Graph-based SSL Comparisons
Freebase-2 graph, 192 WordNet classes; graph with 303k nodes, 2.3m edges.
[Figure: Mean Reciprocal Rank (MRR) versus amount of supervision for LP-ZGL, Adsorption, and MAD.]

101 When is MAD most effective?
[Figure: relative increase in MRR by MAD over LP-ZGL versus average degree.]
MAD seems to be more effective in graphs with high average degree, where there is a greater need for regularization.

102 Next: Integrating Data across Sources to Answer Queries
[Figure: the Q overall architecture diagram, repeated.]

107 Integrating Data Across Sources
[Figure: schema graph over tables P, b, c, d, G, and M, with edge costs such as 0.07 and 0.04.]
- Each edge carries a join condition, e.g., P.GENE_NAME = b.gene_name, where P has columns (PRO_ID, GENE_NAME) with sample row (p12, g) and b has columns (SPECIES, GENE_NAME) with sample row (s1, g).
- Lower cost reflects user preference for the join.
- Information need: find protein, gene, and disease info on malaria. The query keywords (Protein, Genes, Malaria (Disease)) are matched to nodes of the graph.

110 Main Questions
[Figure: schema graph over P, b, c, d, G, and M with the keyword-matching nodes. Information need: find protein, gene, and disease info on malaria.]
1. How do we determine which edges to include?
2. How do we adjust edge costs to reflect user preferences (i.e., personalization)?

111 Our Approach: Learn the Queries to Integrate Data [Talukdar et al., VLDB 2008]
1. How do we determine which edges to include? Inference: K-best Steiner tree generation.
2. How do we adjust edge costs to reflect user preferences (i.e., personalization)? Learn from user feedback over answers.

114 Steiner Trees: Finding Lowest-Cost Queries
Steiner tree: a tree of minimal cost (sum of edge costs) in a graph G that includes all the required nodes S.
The Steiner tree problem is a generalization of the Minimum Spanning Tree (MST) problem; the two are equivalent when S is the set of all vertices in G.
[Figure: schema graph over P, b, c, d, G, and M, with edge costs such as 0.07 and 0.04.]

118 Inference: K-Best Steiner Tree Generation
Given the schema graph (tables P, b, c, d, G, M), find Steiner trees connecting the tables marked red in the figure.
[Figure: two result trees. Q1 (rank 1, cost 0.4) uses P, b, d, G, and M. Q2 (rank 2, cost 0.41) additionally goes through c, using the edges with costs 0.07 and 0.04.]
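
To illustrate what K-best Steiner tree generation returns, the sketch below enumerates all edge subsets of a tiny graph by brute force and keeps the K cheapest trees whose leaves are all required nodes. This is emphatically not the ILP formulation from the talk and only works on toy graphs. The edge costs of 0.1 are hypothetical values chosen so that the two trees cost 0.4 and 0.41 as on the slide; only 0.07 and 0.04 appear in the source.

```python
from itertools import combinations


def is_steiner_tree(edge_subset, terminals):
    """True if the edges form a tree that spans all terminals and has no
    non-terminal leaf (so no edge is useless)."""
    nodes, degree = set(), {}
    for u, v, _ in edge_subset:
        nodes.update((u, v))
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    if not set(terminals) <= nodes or len(edge_subset) != len(nodes) - 1:
        return False
    if any(d == 1 and n not in terminals for n, d in degree.items()):
        return False
    # Union-find cycle check; |E| = |V| - 1 without cycles implies a tree.
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v, _ in edge_subset:
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            return False
        parent[root_u] = root_v
    return True


def k_best_steiner_trees(edges, terminals, k):
    """Return the k cheapest Steiner trees as (cost, edges) pairs."""
    found = []
    for size in range(1, len(edges) + 1):
        for subset in combinations(edges, size):
            if is_steiner_tree(subset, terminals):
                found.append((sum(c for _, _, c in subset), subset))
    found.sort(key=lambda pair: pair[0])
    return found[:k]


# Schema graph from the slides; 0.1 costs are assumptions for illustration.
edges = [
    ("P", "b", 0.1), ("b", "d", 0.1), ("d", "G", 0.1), ("d", "M", 0.1),
    ("b", "c", 0.07), ("c", "d", 0.04),
]
trees = k_best_steiner_trees(edges, {"P", "G", "M"}, k=2)
```

With these costs the cheapest tree joins b to d directly, and the second-best tree routes through c, mirroring Q1 and Q2 on the slide.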

126 Our K-Best Steiner Tree Algorithms
Exact inference:
- Integer Linear Program (ILP) based formulation, using ideas from multi-commodity network flows
- Contribution: extending 1-best to K-best
Approximate inference:
- Shortest Paths Complete Subgraph Heuristic
- Reduce problem size by pruning the graph, and then apply the ILP on the reduced graph
- Significantly faster; in practice, often gives the optimal solution
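
The slides do not give the details of the pruning step, so the following is only one plausible reading of it: keep the edges that lie on a cheapest path between some pair of required nodes, and hand the smaller graph to the exact solver. The function names, the example costs of 0.1, and the choice of a single shortest path per pair are all assumptions made for this sketch.

```python
import heapq
from itertools import combinations


def shortest_path(adj, source, target):
    """Dijkstra's algorithm; returns the node list of a cheapest path or None."""
    dist, prev, done = {source: 0.0}, {}, set()
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, cost in adj[u].items():
            candidate = d + cost
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                prev[v] = u
                heapq.heappush(heap, (candidate, v))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def prune_to_terminal_paths(adj, terminals):
    """Keep only edges on a cheapest path between some pair of terminals."""
    kept = set()
    for s, t in combinations(sorted(terminals), 2):
        path = shortest_path(adj, s, t)
        if path is not None:
            kept.update(frozenset(pair) for pair in zip(path, path[1:]))
    return kept


# Same toy schema graph as before; the 0.1 costs are hypothetical.
adj = {
    "P": {"b": 0.1},
    "b": {"P": 0.1, "d": 0.1, "c": 0.07},
    "c": {"b": 0.07, "d": 0.04},
    "d": {"b": 0.1, "c": 0.04, "G": 0.1, "M": 0.1},
    "G": {"d": 0.1},
    "M": {"d": 0.1},
}
kept_edges = prune_to_terminal_paths(adj, {"P", "G", "M"})
```

On this example the pruned graph drops table c entirely, so the best tree is preserved but the second-best one is lost. That trade-off is why such pruning is an approximation; a K-best variant would presumably need to retain more than one path per pair.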

129 Exact vs Approximate Inference
Larger schema graph of size (408, 1366) from real sources: GUS, GO, BioSQL.
[Table: K, Speedup, Error.]
It is possible to do K-best inference in larger graphs quickly and with little or no loss (none in this case).

132 Query Formulation & Execution
Trees can be easily written as executable queries.
Steiner tree over P, b, d, G, and M, with a join condition on each edge (e.g., P.y = b.y).
Conjunctive query: P(x,y) & b(y,z) & d(z,w) & M(w,u) & G(w,v)
We can use Orchestra [Ives+05] to execute queries and record provenance.
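
As a rough illustration of how a tree becomes an executable query, the sketch below turns a list of tree edges annotated with join columns into a SQL string. The column names follow the variables of the conjunctive query on the slide; the `tree_to_sql` helper and the choice of `SELECT *` are assumptions, and the actual system may generate queries quite differently.

```python
def tree_to_sql(tables, joins):
    """Render a Steiner tree as a SQL join query.

    tables: table names appearing in the tree.
    joins: one (left_table, left_col, right_table, right_col) per tree edge.
    """
    from_clause = ", ".join(tables)
    where_clause = " AND ".join(
        f"{lt}.{lc} = {rt}.{rc}" for lt, lc, rt, rc in joins
    )
    return f"SELECT * FROM {from_clause} WHERE {where_clause}"


# Tree P - b - d - {M, G}; shared variables y, z, w become join columns.
tables = ["P", "b", "d", "M", "G"]
joins = [
    ("P", "y", "b", "y"),
    ("b", "z", "d", "z"),
    ("d", "w", "M", "w"),
    ("d", "w", "G", "w"),
]
sql = tree_to_sql(tables, joins)
```

Each tree edge contributes exactly one equality predicate, so a tree with n tables always yields n - 1 join conditions.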

133 Our Approach: Learn the Queries to Integrate Data [Talukdar et al., VLDB 2008]
1. How do we determine which edges to include? Inference: K-best Steiner tree generation.
2. How do we adjust edge costs to reflect user preferences (i.e., personalization)? Learn from user feedback over answers.

138 Learning New Edge Costs
[Figure: queries (Steiner trees over P, b, c, d, G, M) ranked from top to bottom, together with their answer tuples. After feedback, edge costs are updated and the marked query (Query *) is re-ranked.]
The user gives feedback on answers, which is what the user cares about.

140 Learning: Cost Model Components [Figure: an edge between DB1 and DB2.]
Edge Cost = wdb1 + wdb2 + wdef
Feature Name | Feature Value | Coefficient (Values Learned)
Is this edge incident on DB1? | 1 | wdb1
Is this edge incident on DB2? | 1 | wdb2
Default | 1 | wdef
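The cost model on this slide is a weighted sum of feature values. A short Python sketch follows, with made-up weight values (the real coefficients are learned) and hypothetical feature and function names:

```python
def edge_cost(features, weights):
    """Edge cost = sum over features of (feature value * learned coefficient)."""
    return sum(value * weights[name] for name, value in features.items())


def tree_cost(tree_edges, weights):
    """Cost of a tree is the sum of the costs of its edges."""
    return sum(edge_cost(features, weights) for features in tree_edges)


# Hypothetical coefficient values; in Q these would be learned from feedback.
weights = {"incident_on_DB1": 0.25, "incident_on_DB2": 0.25, "default": 0.5}

# The edge from the slide: incident on both DB1 and DB2, plus the default feature.
edge = {"incident_on_DB1": 1, "incident_on_DB2": 1, "default": 1}
cost = edge_cost(edge, weights)  # wdb1 + wdb2 + wdef
```

Because the cost is linear in the weights, changing one coefficient shifts the cost of every edge that shares the feature, which is what lets feedback on one answer generalize to other queries.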

145 Learning: Incorporating User Feedback Model feedback incorporation as a constrained optimization problem, using the MIRA Algorithm (Crammer et al., 2006). [Equation not recovered; its annotated parts are: New Model Parameters, Current Model Parameters, Tree Cost, Loss, the tree that the user doesn't like, and the tree that the user likes.]
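The annotations above suggest a margin constraint between two trees. As a hedged illustration, the Python sketch below implements a simplified, single-constraint update in the MIRA / passive-aggressive style: find the weights closest to the current ones such that the disliked tree costs at least `loss` more than the liked tree. The function name, feature names, and numbers are hypothetical, and any further constraints of the full method (for example over several trees at once) are omitted.

```python
def mira_update(weights, feats_liked, feats_disliked, loss):
    """Smallest change to weights so that cost(disliked) - cost(liked) >= loss.

    Tree cost is assumed linear: cost(T) = sum_f weights[f] * feats_T[f].
    """
    names = set(feats_liked) | set(feats_disliked)
    delta = {
        n: feats_disliked.get(n, 0.0) - feats_liked.get(n, 0.0) for n in names
    }
    margin = sum(weights.get(n, 0.0) * d for n, d in delta.items())
    norm_sq = sum(d * d for d in delta.values())
    if margin >= loss or norm_sq == 0.0:
        return dict(weights)  # constraint already holds, or trees look identical
    tau = (loss - margin) / norm_sq  # closed-form step size for one constraint
    updated = dict(weights)
    for n, d in delta.items():
        updated[n] = updated.get(n, 0.0) + tau * d
    return updated


def linear_cost(feats, weights):
    return sum(weights.get(n, 0.0) * v for n, v in feats.items())


# Hypothetical aggregated tree features: counts of edges carrying each feature.
current = {"db1": 1.0, "db2": 1.0, "default": 1.0}
liked = {"db1": 2, "default": 2}
disliked = {"db2": 2, "default": 2}
new = mira_update(current, liked, disliked, loss=1.0)
```

With these numbers both trees start at the same cost; after the update the disliked tree is exactly `loss` more expensive, and the shared `default` weight is untouched because it cannot separate the two trees.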

147 Results: Learning Expert Ranking Graph: Start with the BioGuide [Cohen-Boulakia+07] bio sources, with 28 vertices and 96 edges. Goal: Learn the rankings of BioGuide's expert. Methodology: All weights are set to default. Sequence of 25 queries. For each, user feedback identifies & promotes a tuple from the gold standard answer. [Plot: Error against Total queries seen; labels G1 and P3.] After 40-60% of the searches, Q finds the top query immediately. For each search, a single feedback is enough to learn the top query.

149 Next: Combining and Adding Sources [Figure: a graph of sources A to G, with components labeled IE, Association (Edge) Discovery, and Ranking Sources Relevance, spanning Unstructured Data and Structured Data.]

155 Automatically Adding New Sources in Q [Talukdar, Ives, and Pereira, SIGMOD 2010] [Figure, built up over several steps: the schema graph over P, b, c, d, G, and M (edge labels 0.07 and 0.04) gains a New Source n, whose associations to the existing nodes are unknown and marked "?".] How to discover new associations automatically? How to correct mistakes made during automatic association discovery?

162 Discovering New Associations Any off-the-shelf Schema Matcher may be used: COMA++ (metadata level) [Do and Rahm, 2007], which needs pairwise comparisons; or Label Propagation (instance level), proposed in the thesis, for which pairwise comparisons are not necessary. How to correct automatic schema matching errors? By exploiting the end user's expertise in the data, through flagging bad answers, without requiring administrator-based knowledge of metadata, which doesn't take user context into account and is often expensive to obtain.

164 Using Q to Correct Alignment Errors [Figure: an edge between DB1 and DB2.]
Edge Cost = wdb1 + wdb2 + wdef + 0.90 * wcoma++ + 0.7 * wlp
Feature Name | Feature Value | Coefficient (Values Learned)
Is this edge incident on DB1? | 1 | wdb1
Is this edge incident on DB2? | 1 | wdb2
Default | 1 | wdef
COMA++ Aligned | 0.90 | wcoma++
LabelProp Aligned | 0.7 | wlp
The last two coefficients, wcoma++ and wlp, are the Alignment Feature Weights.

166 Correcting Schema Matching Errors with Q Learning with Q helps correct schema matching errors.

170 Reducing Pairwise Comparisons during Association Discovery [Figure: Keyword Cost Neighborhood. A schema graph with 5 sources and 2 keywords: "term" and "plasma membrane". The shaded oval includes all nodes reachable with cost 2 from at least one of the keywords. Labels in the figure include GO, InterPro2GO, InterPro Entry2Pub, InterPro Entry, and InterPro Pub; the attributes acc, term_id, go_id, entry_ac, pub_id, name, and title; and a candidate "New Source?".] View Based Aligner: prune comparisons based on whether they are likely to affect query results, as otherwise there will be no feedback from the user.
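The cost neighborhood described in the caption can be computed with a budget-limited, multi-source Dijkstra search. The Python sketch below is one way to do it; the graph, edge costs, and node names are hypothetical stand-ins for the figure, and `cost_neighborhood` is not an API of Q.

```python
import heapq


def undirected(edges):
    """Build an adjacency dict {node: {neighbor: cost}} from (u, v, cost) triples."""
    graph = {}
    for u, v, cost in edges:
        graph.setdefault(u, {})[v] = cost
        graph.setdefault(v, {})[u] = cost
    return graph


def cost_neighborhood(graph, keywords, budget):
    """Return {node: cheapest cost} for nodes within `budget` of any keyword."""
    best = {}
    heap = [(0.0, k) for k in keywords if k in graph]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if u in best:
            continue  # already settled with a cost that is no larger
        best[u] = d
        for v, cost in graph[u].items():
            if v not in best and d + cost <= budget:
                heapq.heappush(heap, (d + cost, v))
    return best


# Hypothetical schema graph loosely modeled on the figure; costs are made up.
schema = undirected([
    ("kw:plasma membrane", "GO", 0.25),
    ("GO", "InterPro2GO", 1.0),
    ("InterPro2GO", "InterPro Entry", 1.0),
    ("InterPro Entry", "InterPro Pub", 2.0),
])
nearby = cost_neighborhood(schema, ["kw:plasma membrane"], budget=2.0)
```

Under this sketch, a new source would only be aligned against the tables in `nearby`, since alignments outside the neighborhood are unlikely to change the answers the user sees.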

171 Reducing Pairwise Comparisons during Association Discovery [Plot: # Pairwise Comparisons against Number of Tables in the Schema Graph, for Exhaustive and ViewBasedAligner.]

175 Summary of Contributions Weakly-supervised acquisition of class-instance pairs from unstructured and structured sources: a scalable method, suitable for large data volumes. A method for learning data-integrating queries that takes the user's information need into account: removes the need for expert input or heavy human supervision; runs at interactive speed; automatically incorporates new sources; corrects wrong associations.

176 Related Work Extractions from unstructured data: [Etzioni et al., 2005, Van Durme and Paşca, 2008], ... Extractions from semi-structured data: SEAL [Wang and Cohen, 2007], and its extensions. Graph-based SSL methods: LP-ZGL [Zhu et al., 2003], LGC [Bengio et al., 2007], Adsorption [Baluja et al., 2008], ... Keyword Search over Databases: BANKS [Bhalotia et al., 2002], BLINKS [He et al., 2007], BioGuide [Cohen-Boulakia+07].

179 Future Work: Complete the Loop [Figure: a loop connecting IE, Association (Edge) Discovery, and Ranking Sources Relevance.] Correct extraction errors based on their effect on the final answers, as measured by user feedback over those answers.

184 More Future Work Incorporation of other types of semantic constraints in class-instance acquisition. Graph-based SSL methods for other types (non IS-A) of relation extraction. Roll Q out to life scientists and get their feedback, and also apply Q to non-life-science datasets. Investigate user adaptation in Q: use a model trained for one user to initialize another, exploiting any available user similarity information.

185 Acknowledgements Advisors: Zack Ives, Mark Liberman, and Fernando Pereira. Committee: William Cohen, Aravind Joshi, Ben Taskar, and Lyle Ungar. My Co-authors: Rahul Bhagat, Thorsten Brants, Koby Crammer, Sudipto Guha, Marie Jacob, Salman Mehmood, Marius Pasca, Deepak Ravichandran, Joseph Reisinger. DARPA, Google, NSF grant #IIS

187 Thank You!

191 Current Approaches Supervised Named Entity Recognition (NER): deals with only a limited number of coarse classes; very resource intensive, as labeled data is expensive! Pattern based Extraction: textual patterns ("analyst at <ENT>.") are effective only in repetitive contexts [Bellare et al., 2007]; extractions are usually high-precision, low-recall!

192 Context Pattern based Extraction [Talukdar et al., CoNLL 2006] Partial entity lists extended into longer lists using context patterns induced from unstructured text. Extended lists used as features in a supervised tagger, improving its performance. Example patterns: "analyst at -ENT-."; "series against the -ENT- tonight"; "Today's Schaeffer's Option Activity Watch features -ENT- (". Example entities: Boston Red Sox, St. Louis Cardinals, Chicago Cubs, Florida Marlins.

193 New Extractions found by Adsorption
Class | A few non-seed Instances found by Adsorption
Scientific Journals | Journal of Physics, Nature, Structural and Molecular Biology, Sciences Sociales et sante, Kidney and Blood Pressure Research, American Journal of Physiology-Cell Physiology, ...
NFL Players | Tony Gonzales, Thabiti Davis, Taylor Stubblefield, Ron Dixon, Rodney Hannan, ...
Book Publishers | Small Night Shade Books, House of Ansari Press, Highwater Books, Distributed Art Publishers, Cooper Canyon Press, ...
Total classes: 9081

194 Graph Stats Statistics of Graphs used in Class-Instance Acquisition Experiments. [Table not recovered.]

196 Improving Class-Instance Acquisition with Additional Attributes [Plot: Mean Reciprocal Rank (MRR) for LP-ZGL, Adsorption, and MAD on the TextRunner Graph, the YAGO Graph, and the TextRunner + YAGO Graph; 170 WordNet Classes, 10 Seeds per Class.] Additional semantic constraints in the form of (instance, attribute) edges from YAGO help improve performance significantly!
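For readers unfamiliar with the metric on these plots, the Python sketch below computes Mean Reciprocal Rank in its generic form: for each test item, take the reciprocal of the rank at which the correct class appears, counting 0 when it is absent, then average. The exact evaluation protocol of the experiments is not given on the slide, and the items, class names, and rankings here are invented for illustration.

```python
def mean_reciprocal_rank(ranked_labels, gold):
    """Average of 1/rank of the gold label per item (0 if the label is missing)."""
    if not gold:
        return 0.0
    total = 0.0
    for item, true_label in gold.items():
        ranking = ranked_labels.get(item, [])
        if true_label in ranking:
            total += 1.0 / (ranking.index(true_label) + 1)
    return total / len(gold)


# Hypothetical system output: classes ranked from most to least likely.
ranked = {
    "Nature": ["scientific_journal", "book_publisher"],   # gold at rank 1
    "Highwater Books": ["nfl_player", "book_publisher"],  # gold at rank 2
    "Ron Dixon": ["book_publisher"],                      # gold missing
}
gold = {
    "Nature": "scientific_journal",
    "Highwater Books": "book_publisher",
    "Ron Dixon": "nfl_player",
}
mrr = mean_reciprocal_rank(ranked, gold)  # (1 + 1/2 + 0) / 3
```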

198 Effect of Class Similarity Constraints [Plot: Mean Reciprocal Rank (MRR) against Amount of Supervision for LP-ZGL, Adsorption, MAD, and MADDL; TextRunner Graph, 170 WordNet Classes. Graph with 175k nodes, 529k edges.] Class similarity constraints are helpful; more investigation is necessary!

199 Effect of Class Sparsity Constraints [Plot: Effect of Per-node Sparsity Constraint; Mean Reciprocal Rank (MRR) against Maximum Allowed Classes per Node.]

200 SVM Comparison [Plot: Mean Reciprocal Rank (MRR) against Amount of Supervision for LP-ZGL, Adsorption, MAD, and SVM; Freebase-2 Graph, 192 WordNet Classes. Graph with 303k nodes, 2.3m edges.]

201 SVM Comparison [Plot: Mean Reciprocal Rank (MRR) against Amount of Supervision for LP-ZGL, Adsorption, MAD, and SVM; TextRunner Graph, 170 WordNet Classes. Graph with 175k nodes, 529k edges.]

203 Results: Time to Generate K-best Queries Schema graph of size (28, 96) from BioGuide (Boulakia et al., 2007). [Table with columns K and Time (s); values not recovered.] It is possible to generate the top queries in interactive range. Query execution is pipelined.

207 Discovering New Associations COMA++ (metadata level) [Do and Rahm, 2007]: pairwise comparisons. Using Label Propagation (instance level): pairwise comparisons not necessary. [Figure, built up over several steps: the column nodes Interpro2GO go_id and GO term acc are connected through shared GO: value nodes. The columns start with the labels go_id 1.0 and acc 1.0; after propagation, nodes carry mixed label scores such as go_id 0.8 / acc 0.2, go_id 0.51 / acc 0.49, and go_id 0.25 / acc 0.75.]
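A rough Python sketch of instance-level label propagation for column matching follows. It uses the simplest clamped-seed scheme (in the spirit of LP-ZGL: seed nodes keep their labels, every other node repeatedly takes the average of its neighbors), which is not necessarily the variant used in the thesis. Column names, value nodes, and the graph itself are hypothetical.

```python
def build_neighbors(edges):
    """Symmetric adjacency lists from (column_node, value_node) pairs."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, []).append(v)
        neighbors.setdefault(v, []).append(u)
    return neighbors


def propagate_labels(neighbors, seeds, iterations=200):
    """Clamped-seed label propagation; returns {node: {label: score}}."""
    labels = sorted({l for dist in seeds.values() for l in dist})
    scores = {
        n: {l: seeds.get(n, {}).get(l, 0.0) for l in labels} for n in neighbors
    }
    for _ in range(iterations):
        updated = {}
        for n, nbrs in neighbors.items():
            if n in seeds or not nbrs:
                updated[n] = scores[n]  # seeds stay clamped
            else:
                updated[n] = {
                    l: sum(scores[m][l] for m in nbrs) / len(nbrs)
                    for l in labels
                }
        scores = updated
    return scores


# Columns are linked to the data values they contain (all names hypothetical).
edges = [
    ("Interpro2GO.go_id", "v1"), ("Interpro2GO.go_id", "v2"),
    ("Other.col", "v3"),
    ("GO_term.acc", "v1"), ("GO_term.acc", "v2"), ("GO_term.acc", "v3"),
]
seeds = {
    "Interpro2GO.go_id": {"go_id": 1.0},
    "Other.col": {"other": 1.0},
}
scores = propagate_labels(build_neighbors(edges), seeds)
acc_scores = scores["GO_term.acc"]  # go_id dominates: two of three values shared
```

The unlabeled column ends up with a higher score for the label of the column it shares more values with, and no column-to-column comparison is ever performed; the value nodes carry the evidence.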

208 Reusing Feedback Helps! [Plot: Precision-Recall Plots for Q with Different Levels of Feedback; Precision against Recall for Q (1x1), Q (10x1), Q (10x2), and Q (10x4), with Adsorption and COMA++ averaged.]

209 Reducing Pairwise Comparisons during Association Discovery [Plot: Number of Pairwise Comparisons by Alignment Strategy (Exhaustive, ViewBasedAligner), with No Additional Filter and with Value Overlap Filter. Total 18 Tables in Schema Graph.]

210 Reducing Pairwise Comparisons during Association Discovery [Plot: Number of Pairwise Column Comparisons for Increasing Schema Graph Size; Number of Pairwise Column Comparisons against Existing Number of Sources in Schema Graph, for Exhaustive, ViewBasedAligner, and PreferentialAligner.] Number of pairwise attribute comparisons as we scale the size of the search graph (avg. over the introduction of 40 new sources).


More information

Overview Citation. ML Introduction. Overview Schedule. ML Intro Dataset. Introduction to Semi-Supervised Learning Review 10/4/2010

Overview Citation. ML Introduction. Overview Schedule. ML Intro Dataset. Introduction to Semi-Supervised Learning Review 10/4/2010 INFORMATICS SEMINAR SEPT. 27 & OCT. 4, 2010 Introduction to Semi-Supervised Learning Review 2 Overview Citation X. Zhu and A.B. Goldberg, Introduction to Semi- Supervised Learning, Morgan & Claypool Publishers,

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

What is Text Mining? Sophia Ananiadou National Centre for Text Mining University of Manchester

What is Text Mining? Sophia Ananiadou National Centre for Text Mining   University of Manchester National Centre for Text Mining www.nactem.ac.uk University of Manchester Outline Aims of text mining Text Mining steps Text Mining uses Applications 2 Aims Extract and discover knowledge hidden in text

More information

Improving the Performance of OLAP Queries Using Families of Statistics Trees

Improving the Performance of OLAP Queries Using Families of Statistics Trees Improving the Performance of OLAP Queries Using Families of Statistics Trees Joachim Hammer Dept. of Computer and Information Science University of Florida Lixin Fu Dept. of Mathematical Sciences University

More information

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How

More information

Extending Functional Dependency to Detect Abnormal Data in RDF Graphs

Extending Functional Dependency to Detect Abnormal Data in RDF Graphs Extending Functional Dependency to Detect Abnormal Data in RDF Graphs Yang Yu, Jeff Heflin SWAT Lab Department of Computer Science and Engineering Lehigh University PA, USA Outline Semantic Web data and

More information

SEMANTIC WEB POWERED PORTAL INFRASTRUCTURE

SEMANTIC WEB POWERED PORTAL INFRASTRUCTURE SEMANTIC WEB POWERED PORTAL INFRASTRUCTURE YING DING 1 Digital Enterprise Research Institute Leopold-Franzens Universität Innsbruck Austria DIETER FENSEL Digital Enterprise Research Institute National

More information

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering Erik Velldal University of Oslo Sept. 18, 2012 Topics for today 2 Classification Recap Evaluating classifiers Accuracy, precision,

More information

Sharing Work in Keyword Search Over Databases

Sharing Work in Keyword Search Over Databases University of Pennsylvania ScholarlyCommons Departmental Papers (CIS) Department of Computer & Information Science 2011 Sharing Work in Keyword Search Over Databases Marie Jacobs University of Pennsylvania

More information

Keywords Data alignment, Data annotation, Web database, Search Result Record

Keywords Data alignment, Data annotation, Web database, Search Result Record Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Annotating Web

More information

Information Integration of Partially Labeled Data

Information Integration of Partially Labeled Data Information Integration of Partially Labeled Data Steffen Rendle and Lars Schmidt-Thieme Information Systems and Machine Learning Lab, University of Hildesheim srendle@ismll.uni-hildesheim.de, schmidt-thieme@ismll.uni-hildesheim.de

More information

Learning mappings and queries

Learning mappings and queries Learning mappings and queries Marie Jacob University Of Pennsylvania DEIS 2010 1 Schema mappings Denote relationships between schemas Relates source schema S and target schema T Defined in a query language

More information

Query Processing & Optimization

Query Processing & Optimization Query Processing & Optimization 1 Roadmap of This Lecture Overview of query processing Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Introduction

More information

Data Cleansing. LIU Jingyuan, Vislab WANG Yilei, Theoretical group

Data Cleansing. LIU Jingyuan, Vislab WANG Yilei, Theoretical group Data Cleansing LIU Jingyuan, Vislab WANG Yilei, Theoretical group What is Data Cleansing Data cleansing (data cleaning) is the process of detecting and correcting (or removing) errors or inconsistencies

More information

Graph based machine learning with applications to media analytics

Graph based machine learning with applications to media analytics Graph based machine learning with applications to media analytics Lei Ding, PhD 9-1-2011 with collaborators at Outline Graph based machine learning Basic structures Algorithms Examples Applications in

More information

NERD workshop. Luca ALMAnaCH - Inria Paris. Berlin, 18/09/2017

NERD workshop. Luca ALMAnaCH - Inria Paris. Berlin, 18/09/2017 NERD workshop Luca Foppiano @ ALMAnaCH - Inria Paris Berlin, 18/09/2017 Agenda Introducing the (N)ERD service NERD REST API Usages and use cases Entities Rigid textual expressions corresponding to certain

More information

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity

More information

Leveraging Data and Structure in Ontology Integration

Leveraging Data and Structure in Ontology Integration Leveraging Data and Structure in Ontology Integration O. Udrea L. Getoor R.J. Miller Group 15 Enrico Savioli Andrea Reale Andrea Sorbini DEIS University of Bologna Searching Information in Large Spaces

More information

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology 9/9/ I9 Introduction to Bioinformatics, Clustering algorithms Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Outline Data mining tasks Predictive tasks vs descriptive tasks Example

More information

Presented by: Dimitri Galmanovich. Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Gengxin Miao, Chung Wu

Presented by: Dimitri Galmanovich. Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Gengxin Miao, Chung Wu Presented by: Dimitri Galmanovich Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Gengxin Miao, Chung Wu 1 When looking for Unstructured data 2 Millions of such queries every day

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework

More information

Flat Clustering. Slides are mostly from Hinrich Schütze. March 27, 2017

Flat Clustering. Slides are mostly from Hinrich Schütze. March 27, 2017 Flat Clustering Slides are mostly from Hinrich Schütze March 7, 07 / 79 Overview Recap Clustering: Introduction 3 Clustering in IR 4 K-means 5 Evaluation 6 How many clusters? / 79 Outline Recap Clustering:

More information

Efficient Iterative Semi-supervised Classification on Manifold

Efficient Iterative Semi-supervised Classification on Manifold . Efficient Iterative Semi-supervised Classification on Manifold... M. Farajtabar, H. R. Rabiee, A. Shaban, A. Soltani-Farani Sharif University of Technology, Tehran, Iran. Presented by Pooria Joulani

More information

An Overview of various methodologies used in Data set Preparation for Data mining Analysis

An Overview of various methodologies used in Data set Preparation for Data mining Analysis An Overview of various methodologies used in Data set Preparation for Data mining Analysis Arun P Kuttappan 1, P Saranya 2 1 M. E Student, Dept. of Computer Science and Engineering, Gnanamani College of

More information

Lightly-Supervised Attribute Extraction

Lightly-Supervised Attribute Extraction Lightly-Supervised Attribute Extraction Abstract We introduce lightly-supervised methods for extracting entity attributes from natural language text. Using those methods, we are able to extract large number

More information

Informatica Enterprise Information Catalog

Informatica Enterprise Information Catalog Data Sheet Informatica Enterprise Information Catalog Benefits Automatically catalog and classify all types of data across the enterprise using an AI-powered catalog Identify domains and entities with

More information

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent

More information

Papers for comprehensive viva-voce

Papers for comprehensive viva-voce Papers for comprehensive viva-voce Priya Radhakrishnan Advisor : Dr. Vasudeva Varma Search and Information Extraction Lab, International Institute of Information Technology, Gachibowli, Hyderabad, India

More information

Semantic Interoperability. Being serious about the Semantic Web

Semantic Interoperability. Being serious about the Semantic Web Semantic Interoperability Jérôme Euzenat INRIA & LIG France Natasha Noy Stanford University USA 1 Being serious about the Semantic Web It is not one person s ontology It is not several people s common

More information

Multi-Stage Rocchio Classification for Large-scale Multilabeled

Multi-Stage Rocchio Classification for Large-scale Multilabeled Multi-Stage Rocchio Classification for Large-scale Multilabeled Text data Dong-Hyun Lee Nangman Computing, 117D Garden five Tools, Munjeong-dong Songpa-gu, Seoul, Korea dhlee347@gmail.com Abstract. Large-scale

More information

Advanced Databases. Lecture 4 - Query Optimization. Masood Niazi Torshiz Islamic Azad university- Mashhad Branch

Advanced Databases. Lecture 4 - Query Optimization. Masood Niazi Torshiz Islamic Azad university- Mashhad Branch Advanced Databases Lecture 4 - Query Optimization Masood Niazi Torshiz Islamic Azad university- Mashhad Branch www.mniazi.ir Query Optimization Introduction Transformation of Relational Expressions Catalog

More information

Query Optimization. Shuigeng Zhou. December 9, 2009 School of Computer Science Fudan University

Query Optimization. Shuigeng Zhou. December 9, 2009 School of Computer Science Fudan University Query Optimization Shuigeng Zhou December 9, 2009 School of Computer Science Fudan University Outline Introduction Catalog Information for Cost Estimation Estimation of Statistics Transformation of Relational

More information

Theme Identification in RDF Graphs

Theme Identification in RDF Graphs Theme Identification in RDF Graphs Hanane Ouksili PRiSM, Univ. Versailles St Quentin, UMR CNRS 8144, Versailles France hanane.ouksili@prism.uvsq.fr Abstract. An increasing number of RDF datasets is published

More information

Deep Web Content Mining

Deep Web Content Mining Deep Web Content Mining Shohreh Ajoudanian, and Mohammad Davarpanah Jazi Abstract The rapid expansion of the web is causing the constant growth of information, leading to several problems such as increased

More information

Top-k Keyword Search Over Graphs Based On Backward Search

Top-k Keyword Search Over Graphs Based On Backward Search Top-k Keyword Search Over Graphs Based On Backward Search Jia-Hui Zeng, Jiu-Ming Huang, Shu-Qiang Yang 1College of Computer National University of Defense Technology, Changsha, China 2College of Computer

More information

Selecting Topics for Web Resource Discovery: Efficiency Issues in a Database Approach +

Selecting Topics for Web Resource Discovery: Efficiency Issues in a Database Approach + Selecting Topics for Web Resource Discovery: Efficiency Issues in a Database Approach + Abdullah Al-Hamdani, Gultekin Ozsoyoglu Electrical Engineering and Computer Science Dept, Case Western Reserve University,

More information

Document Retrieval using Predication Similarity

Document Retrieval using Predication Similarity Document Retrieval using Predication Similarity Kalpa Gunaratna 1 Kno.e.sis Center, Wright State University, Dayton, OH 45435 USA kalpa@knoesis.org Abstract. Document retrieval has been an important research

More information

EXTRACT THE TARGET LIST WITH HIGH ACCURACY FROM TOP-K WEB PAGES

EXTRACT THE TARGET LIST WITH HIGH ACCURACY FROM TOP-K WEB PAGES EXTRACT THE TARGET LIST WITH HIGH ACCURACY FROM TOP-K WEB PAGES B. GEETHA KUMARI M. Tech (CSE) Email-id: Geetha.bapr07@gmail.com JAGETI PADMAVTHI M. Tech (CSE) Email-id: jageti.padmavathi4@gmail.com ABSTRACT:

More information

Natural Language Processing. SoSe Question Answering

Natural Language Processing. SoSe Question Answering Natural Language Processing SoSe 2017 Question Answering Dr. Mariana Neves July 5th, 2017 Motivation Find small segments of text which answer users questions (http://start.csail.mit.edu/) 2 3 Motivation

More information

Update Exchange with Mappings and Provenance

Update Exchange with Mappings and Provenance Update Exchange with Mappings and Provenance Todd J. Green with Grigoris Karvounarakis, Zachary G. Ives, and Val Tannen CSE 455 / CIS 555: Internet and Web Systems April 18, 2007 Challenge: data sharing

More information

PERSONALIZED TAG RECOMMENDATION

PERSONALIZED TAG RECOMMENDATION PERSONALIZED TAG RECOMMENDATION Ziyu Guan, Xiaofei He, Jiajun Bu, Qiaozhu Mei, Chun Chen, Can Wang Zhejiang University, China Univ. of Illinois/Univ. of Michigan 1 Booming of Social Tagging Applications

More information

CS 4460 Intro. to Information Visualization September 15, 2017 John Stasko

CS 4460 Intro. to Information Visualization September 15, 2017 John Stasko Case Study: Jigsaw CS 4460 Intro. to Information Visualization September 15, 2017 John Stasko Learning Objectives Become familiar with investigative analysis process carried out by various types of analysts

More information

Interactive Data Exploration Related works

Interactive Data Exploration Related works Interactive Data Exploration Related works Ali El Adi Bruno Rubio Deepak Barua Hussain Syed Databases and Information Retrieval Integration Project Recap Smart-Drill AlphaSum: Size constrained table summarization

More information

Snowball : Extracting Relations from Large Plain-Text Collections. Eugene Agichtein Luis Gravano. Department of Computer Science Columbia University

Snowball : Extracting Relations from Large Plain-Text Collections. Eugene Agichtein Luis Gravano. Department of Computer Science Columbia University Snowball : Extracting Relations from Large Plain-Text Collections Luis Gravano Department of Computer Science 1 Extracting Relations from Documents Text documents hide valuable structured information.

More information

Jianyong Wang Department of Computer Science and Technology Tsinghua University

Jianyong Wang Department of Computer Science and Technology Tsinghua University Jianyong Wang Department of Computer Science and Technology Tsinghua University jianyong@tsinghua.edu.cn Joint work with Wei Shen (Tsinghua), Ping Luo (HP), and Min Wang (HP) Outline Introduction to entity

More information

Graph Mining and Social Network Analysis

Graph Mining and Social Network Analysis Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References q Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann

More information

Maximizing the Value of STM Content through Semantic Enrichment. Frank Stumpf December 1, 2009

Maximizing the Value of STM Content through Semantic Enrichment. Frank Stumpf December 1, 2009 Maximizing the Value of STM Content through Semantic Enrichment Frank Stumpf December 1, 2009 What is Semantics and Semantic Processing? Content Knowledge Framework Technology Framework Search Text Images

More information

Semi-Supervised Abstraction-Augmented String Kernel for bio-relationship Extraction

Semi-Supervised Abstraction-Augmented String Kernel for bio-relationship Extraction Semi-Supervised Abstraction-Augmented String Kernel for bio-relationship Extraction Pavel P. Kuksa, Rutgers University Yanjun Qi, Bing Bai, Ronan Collobert, NEC Labs Jason Weston, Google Research NY Vladimir

More information

Text Analytics (Text Mining)

Text Analytics (Text Mining) CSE 6242 / CX 4242 Apr 1, 2014 Text Analytics (Text Mining) Concepts and Algorithms Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer,

More information