Algorithms and Software for Collaborative Discovery from Autonomous, Semantically Heterogeneous, Distributed Information Sources

1 Algorithms and Software for Collaborative Discovery from Autonomous, Semantically Heterogeneous, Distributed Information Sources Vasant Honavar Bioinformatics and Computational Biology Graduate Program Center for Computational Intelligence, Learning, & Discovery Iowa State University

2 Coauthors Doina Caragea Jie Bao Jyotishman Pathak Jun Zhang

3 Outline Background and motivation Learning from data revisited Learning classifiers from distributed data Learning classifiers from semantically heterogeneous data Current Status and Summary of Results

4 Background Data revolution Bioinformatics Over 200 data repositories of interest to molecular biologists alone (Discala, 2000) Environmental Informatics Enterprise Informatics Medical Informatics Social Informatics... Connectivity revolution (Internet and the web) Integration revolution Need to understand the elephant as opposed to examining the trunk, the tail, etc. Need infrastructure to support collaborative, integrative analysis of data

5 Infrastructure for scientific computing As part of efforts to build the scientific computing infrastructure in the US, Europe, and Japan Many hundreds of millions of dollars have been spent Many high end computers built Many large databases constructed High speed networks installed Many hundreds of computers bought Many thousands of software applications developed But... have we succeeded in changing the nature of the practice of science?

6 Motivating Application Bioinformatics Discovery of functionally important sequence and structural features of proteins Prediction of protein-protein, protein-DNA, and protein-RNA interfaces Discovery of genetic regulatory networks

7 Representative application: Discovery of Functionally Important Sequence and Structural Features of Proteins Figure 3a: The 3-dimensional structure of human Caspase-1 (MEROPS family C14), corresponding to PDB entry 1BMQ. The four labeled residues Arg 179, His 237, Cys 285, and Arg 341 are known to form the substrate binding pocket of the Caspase-1 enzyme [Wilson, et al., 1994 Nature 370: ]. Three of these residues (Arg 179, His 237, and Cys 285) are located within the MEME-generated motifs frequently used by the decision tree classifier for the MEROPS family C14. These motifs correspond to residues (red), (yellow), (green). Figure 3b: The 3-dimensional structure of Astacin (MEROPS family M12) from A. astacus, corresponding to PDB entry 1QJJ. Five MEME-generated motifs selected by the decision tree algorithm for the MEROPS family M12 correspond to residues (red), (yellow), and (green). The five labeled residues -- His 92, His 96, Glu 93, His 102, and Tyr 149 -- that appear within the motifs have been shown to form the zinc binding pocket of the enzyme [Bond and Beynon, 1995, Protein Science 4: ]. [Wang et al., 2003]

8 Motivating Application Bioinformatics applications of machine learning require integrated analysis of data from multiple sources Solution 1 Assemble a data set using special purpose scripts to extract data sets from different sources and then apply standard algorithms to the assembled data set time consuming, not scalable, and does not handle partially specified data Solution 2 Understand, state, and design algorithms to solve the problem of learning from semantically heterogeneous, distributed data sources

9 Representative application scenario Learning sequence and structural correlates of protein function

10 Acquiring knowledge from data Most current machine learning algorithms assume centralized access to a semantically homogeneous data set [Diagram: Assumptions + Data -> Learner L -> hypothesis h -> Knowledge]

11 Challenges Gleaning useful knowledge from data requires tools for analysis of data from autonomous sources Large, distributed, data sources Semantic (ontological) gap Partially specified data Multiple points of view Access constraints...

12 Towards an infrastructure for collaborative discovery Building an effective infrastructure for scientific discovery requires coming to terms with How scientists communicate discipline-specific jargon versus common terms How scientists process information role of background knowledge, assumptions, points of view (ontological commitments) How scientists work capture and analyze data from multiple points of view, at multiple levels of abstraction Distributed, often massive data sources Autonomy of data sources (access restrictions, query capabilities) Semantic gaps between data sources, and between a data source and the user's point of view in a given context

13 Challenge: Distributed Data Sources Large Growing at an exponential rate Centralized access not feasible Can we learn without centralized access to data? How? How do the results compare with centralized setting?

14 Challenge: Semantic heterogeneity Sub-disciplines limited by their instruments of observation Stumbling block to scientific understanding: Blind men and the elephant syndrome

15 Semantic Gap Structural Genomics, Functional Genomics, Tissue, Genome, Sequence, Disease, Clinical Trials, Clinical Data Countries separated by a common language! [Shaw, 1942, after Wilde, 1887]

16 Ontological differences? Temperature : Celsius Outlook : {Sunny, Rainy} Temp : Fahrenheit Precipitation : {NoPrec, Rain} Different terms, same meaning: Outlook vs. Precipitation Same term, different meaning: Wind (speed) vs. Wind (direction) Different domains of values for semantically equivalent attributes Different units: 32 deg. C vs. 75 deg. F

17 Challenge: Data source autonomy Access restrictions Privacy constraints Data source capabilities Queries Execution of user-supplied procedures Storage of partial results or indices Computing, memory and bandwidth limitations

18 Enabling technologies World wide web Knowledge representation; Description Logics, Ontology languages (OWL) Languages for making data sources, resources, and services self-describing (XML, RDF, WSDL) Service oriented computing (Web services)

19 Steps Towards the Semantic Web Early Web (1990): HyperText Markup Language, HyperText Transfer Protocol, Documents Self-Describing Documents (2000): eXtensible Markup Language, Resource Description Framework Web of Knowledge (2010): Ontology, Knowledge, Inference, Services; Data and Programs; Machine-Machine, Human-Machine and Machine-Human communication [Berners-Lee, Hendler; Nature, 2001]

20 Towards an infrastructure for collaborative discovery Building an effective infrastructure for scientific discovery requires coming to terms with How scientists communicate discipline-specific jargon versus common terms How scientists process information role of background knowledge, assumptions, points of view (ontological commitments) How scientists work capture and analyze data from multiple points of view, at multiple levels of abstraction

21 Solution: INDUS for Learning from Semantically Heterogeneous Distributed Autonomous Data Sources

22 Outline Background and motivation Learning from data revisited Learning classifiers from distributed data Learning classifiers from semantically heterogeneous data Current Status and Summary of Results

23 Learning Classifiers from Data Learning: Labeled Examples -> Learner -> Classifier Classification: Unlabeled Instance -> Classifier -> Class Standard learning algorithms assume centralized access to data

24 Example: Learning decision tree classifiers
Day  Outlook   Temp.  Humidity  Wind    Play Tennis
1    Sunny     Hot    High      Weak    No
2    Sunny     Hot    High      Strong  No
3    Overcast  Hot    High      Weak    Yes
4    Overcast  Cold   Normal    Weak    No
The data set {1, 2, 3, 4} is shown split into the fragments {1, 2} and {3, 4}. Resulting tree: Outlook = Sunny -> No ({1, 2}); Outlook = Overcast -> split on Temp.: Hot -> Yes ({3}), Cold -> No ({4}).

25 Example: Learning decision tree classifiers A decision tree is constructed by recursively (and greedily) choosing the attribute that provides the greatest estimated information about the class label What do we need to choose a split at each step? Information gain Estimated probability distribution resulting from each candidate split Proportion of instances of each class along each branch of each candidate split If we have the relevant counts, we have no need for the data!
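As a concrete illustration of the point that only counts are needed, here is a minimal Python sketch that scores a candidate split from count dictionaries alone; the counts below are taken from the small PlayTennis example above, and the function names are illustrative.

```python
import math

def entropy(class_counts):
    """Entropy of a class distribution given as a dict {class: count}."""
    total = sum(class_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts.values() if c > 0)

def information_gain(class_counts, value_class_counts):
    """Information gain of a candidate split, computed from counts alone.

    class_counts:       {class: count} at the current node
    value_class_counts: {attribute_value: {class: count}}
    """
    total = sum(class_counts.values())
    remainder = sum(sum(vc.values()) / total * entropy(vc)
                    for vc in value_class_counts.values())
    return entropy(class_counts) - remainder

# Counts at the root node of the PlayTennis example (days 1-4)
class_counts = {"No": 3, "Yes": 1}
outlook_counts = {"Sunny": {"No": 2}, "Overcast": {"No": 1, "Yes": 1}}
print(information_gain(class_counts, outlook_counts))  # ~0.311 bits
```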

26 Learning from data reexamined Hypothesis construction: h_{i+1} = C(h_i, s(h_i -> h_{i+1}, D)) Statistical query generation: the query s(h_i -> h_{i+1}, D) is posed against the data D Learning = Sufficient statistics extraction + Hypothesis construction

27 Learning from data reexamined Sufficient statistics A statistic f_θ(D) is a sufficient statistic for θ if it contains all the information that is needed for estimating the parameter θ from the data D. The sample mean is a sufficient statistic for the mean of a distribution. We have no use for the data once we have a sufficient statistic for the parameter of interest Note: The classical definition of a sufficient statistic is not constructive
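A minimal Python sketch of the idea, with made-up numbers: the pair (sum, count) is a sufficient statistic for the sample mean, and sufficient statistics from separate data fragments combine without revisiting the raw values.

```python
def mean_sufficient_statistic(data):
    """Return the sufficient statistic (sum, count) for the sample mean."""
    return (sum(data), len(data))

def combine(stat1, stat2):
    """Sufficient statistics from two fragments combine by simple addition."""
    return (stat1[0] + stat2[0], stat1[1] + stat2[1])

fragment1, fragment2 = [1.0, 2.0, 3.0], [4.0, 5.0]
s = combine(mean_sufficient_statistic(fragment1),
            mean_sufficient_statistic(fragment2))
print(s[0] / s[1])   # 3.0, the mean of the union, computed without pooling the data
```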

28 Sufficient statistics for learning classifiers By drawing an analogy between a hypothesis h and a parameter estimate: an estimator A maps data D to an estimate θ ∈ Θ, just as a learner L maps data D to a hypothesis h ∈ H

29 Sufficient statistics for learning classifiers A statistic s_L(D, h) is called a sufficient statistic for learning a hypothesis h produced by the learning algorithm L when L is applied to a data set D if there exists an algorithm that takes s_L(D, h) as input and outputs h. [Caragea, Silvescu, and Honavar, 2004]. We typically want minimal sufficient statistics and efficient algorithms for computing such statistics Trivially, D is an s_L(D, h) and so is h. Typically it helps to break down the computation of s_L(D, h) into smaller steps queries to data D and computation on the results of the query

30 Sufficient statistic for learning a hypothesis A statistic s_L(D, h_i -> h_{i+1}) is called a sufficient statistic for the refinement of h_i into h_{i+1} if there exists an algorithm R that takes h_i and s_L(D, h_i -> h_{i+1}) as inputs and outputs h_{i+1}. A statistic s_L(D, h) is a sufficient statistic for learning a hypothesis h using the algorithm L applied to the data D if h can be obtained from h_0 = Ø through a sequence of refinement and composition operations. [Caragea, Silvescu, and Honavar, 2004]

31 Example: Learning decision tree classifiers The same training data (days 1-4) and its horizontal fragments {1, 2} and {3, 4} as in the earlier example; the resulting tree over {1, 2, 3, 4}: Outlook = Sunny -> No ({1, 2}); Outlook = Overcast -> split on Temp.: Hot -> Yes ({3}), Cold -> No ({4}).

32 Sufficient statistics for refining a decision tree Entropy H(D) = - Σ_i (|D_i| / |D|) log_2 (|D_i| / |D|) Sufficient statistics for refining a partially constructed decision tree: count(attribute, class | path) and count(class | path)
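A minimal sketch (hypothetical attribute names) of the query side: the refinement statistics count(attribute, class | path) and count(class | path) are obtained by a single counting pass over the examples that satisfy the path constraints.

```python
from collections import Counter

def refinement_counts(examples, path, attribute, class_attr="PlayTennis"):
    """count(attribute, class | path) and count(class | path) for one candidate split.

    examples: list of dicts mapping attribute names to values
    path:     attribute tests already on the path, e.g. {"Outlook": "Overcast"}
    """
    matching = [e for e in examples if all(e[a] == v for a, v in path.items())]
    attr_class = Counter((e[attribute], e[class_attr]) for e in matching)
    class_only = Counter(e[class_attr] for e in matching)
    return attr_class, class_only

examples = [
    {"Outlook": "Sunny",    "Temp": "Hot",  "PlayTennis": "No"},
    {"Outlook": "Overcast", "Temp": "Hot",  "PlayTennis": "Yes"},
    {"Outlook": "Overcast", "Temp": "Cold", "PlayTennis": "No"},
]
print(refinement_counts(examples, {"Outlook": "Overcast"}, "Temp"))
```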

33 Decision Tree Learning = Statistical Query Answering + Hypothesis refinement The root split on Outlook requires Counts(Attribute, Class) and Counts(Class); the Outlook = Sunny branch splits on Humidity using Counts(Humidity, Class | Outlook) and Counts(Class | Outlook); the Outlook = Overcast branch is a Yes leaf; the Outlook = Rain branch splits on Wind using Counts(Wind, Class | Outlook) and Counts(Class | Outlook); each set of counts is obtained by a statistical query against the data

34 Decision Tree Learning = Statistical Query Answering + Hypothesis refinement Joint count sufficient statistic over attributes A_{i1}, ..., A_{im}: count(A_{i1}, ..., A_{im}) Joint count sufficient statistics provide all the information needed for learning Naïve Bayes, Bayesian Network (when the structure is known), Decision Tree and many other classifiers We can define refinement sufficient statistics for algorithms for SVM, logistic regression, etc. [Caragea, Silvescu, and Honavar, 2004; Caragea, Caragea, and Honavar, 2005]
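For example, Naïve Bayes parameters can be read directly off the per-attribute (value, class) counts; a minimal sketch with illustrative counts, not the authors' implementation.

```python
def naive_bayes_parameters(class_counts, value_class_counts):
    """Estimate P(class) and P(value | class) per attribute from counts alone.

    class_counts:       {class: count}
    value_class_counts: {attribute: {(value, class): count}}
    """
    n = sum(class_counts.values())
    priors = {c: k / n for c, k in class_counts.items()}
    likelihoods = {
        attr: {vc: k / class_counts[vc[1]] for vc, k in counts.items()}
        for attr, counts in value_class_counts.items()
    }
    return priors, likelihoods

priors, likelihoods = naive_bayes_parameters(
    {"Yes": 1, "No": 3},
    {"Outlook": {("Sunny", "No"): 2, ("Overcast", "No"): 1, ("Overcast", "Yes"): 1}},
)
print(priors, likelihoods)
```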

35 Learning from Data Reexamined Identification of minimal or near minimal sufficient statistics for different classes of learning algorithms Design of effective procedures for computing minimal or near minimal sufficient statistics or their efficient approximations Separation of concerns between hypothesis construction (through successive refinement and composition operations) and statistical query answering

36 Outline Background and motivation Learning from data revisited Learning classifiers from distributed data Learning classifiers from semantically heterogeneous data Current Status and Summary of Results

37 Learning Classifiers from Distributed Data Learning from distributed data requires learning from dataset fragments without gathering all of the data in a central location Assuming that the data set is represented in tabular form, data fragmentation can be horizontal, vertical, or more general (e.g. multi-relational)

38 Horizontal Data Fragmentation Example: Autonomously maintained data for different organisms in comparative genomics Data set fragments are distributed across multiple data repositories Complete data set is the union of data set fragments D 1 D 2 D 3 D 4

39 Vertical Data Fragmentation Example: data gathered by multiple laboratories about outcomes of different sets of clinical tests on a patient Each data set fragment contains sub-tuples of data tuples Sub-tuples of a tuple can be associated with each other using a unique key (e.g., patient's social security number) Complete data set is the join of data set fragments D 1 D 2 D 3 D 4
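A minimal sketch of the vertical case, with hypothetical laboratory records: sub-tuples held by different sources are recovered into full tuples by joining on the shared key.

```python
def join_fragments(fragments, key="patient_id"):
    """Join vertically fragmented records on a shared unique key."""
    joined = {}
    for fragment in fragments:
        for record in fragment:
            joined.setdefault(record[key], {}).update(record)
    return list(joined.values())

lab_a = [{"patient_id": 7, "glucose": 5.4}]
lab_b = [{"patient_id": 7, "cholesterol": 4.1}]
print(join_fragments([lab_a, lab_b]))
# [{'patient_id': 7, 'glucose': 5.4, 'cholesterol': 4.1}]
```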

40 Multi-Relational Data Fragmentation Data are stored in a set of relational database tables that can be conceptually tied together by a (global) schema

41 Learning from distributed data The Learner poses the statistical query s(D, h_i -> h_{i+1}); Query Decomposition sends sub-queries q_1, q_2, q_3 to the data sources D_1, D_2, D_3; Answer Composition assembles the local answers into s(D, h_i -> h_{i+1})

42 Learning from distributed data Learning classifiers from distributed data reduces to statistical query answering from distributed data A sound and complete procedure for answering the desired class of statistical queries from distributed data under Different types of data fragmentation Different constraints on access and query capabilities Different bandwidth and resource constraints [Caragea, Silvescu, and Honavar, 2004, also work in progress]
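Under horizontal fragmentation, the query decomposition and answer composition steps amount to count aggregation; a minimal sketch with an illustrative data source interface, not the INDUS API.

```python
from collections import Counter

def answer_count_query(fragments, attribute, class_attr):
    """Answer count(attribute, class) over horizontally fragmented data.

    Each fragment answers the same local query; answer composition adds the
    local counts, so no raw examples ever leave a data source.
    """
    total = Counter()
    for fragment in fragments:                       # one local query per data source
        local = Counter((e[attribute], e[class_attr]) for e in fragment)
        total.update(local)                          # answer composition
    return total

d1 = [{"Outlook": "Sunny", "Play": "No"}, {"Outlook": "Sunny", "Play": "No"}]
d2 = [{"Outlook": "Overcast", "Play": "Yes"}]
print(answer_count_query([d1, d2], "Outlook", "Play"))
```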

43 How can we evaluate algorithms for learning from distributed data? Compare with their batch counterparts Exactness guarantee that the learned hypothesis is the same as or equivalent to that obtained by the batch counterpart Approximation guarantee that the learned hypothesis is an approximation (in a quantifiable sense) of the hypothesis obtained in the batch setting Communication, memory, and processing requirements [Caragea, Silvescu, and Honavar, 2003, 2004]

44 Exact Learning of decision tree classifiers from distributed data under horizontal fragmentation The query answering engine decomposes each statistical query, e.g. Counts(Wind, Class | Outlook) and Counts(Class | Outlook), into local queries against D 1, D 2, D 3; answer composition simply adds up the local counts

45 Time and communication complexity: centralized versus distributed case C is the number of classes (e.g., C = 10), V is the maximum number of values of an attribute (e.g., V = 10), |D| is the size of the data (number of examples) (e.g., |D| = 1,000,000), T is the size of the tree (number of nodes) (e.g., T = 100), K is the number of data sources (e.g., K = 10). Theorem (Time): The algorithm for learning from horizontally fragmented distributed data is K times faster than the algorithm for learning from centralized data, if parallel access to the data sources is allowed. Theorem (Communication): If C x V x T x K < |D|, then the algorithm for learning from horizontally fragmented distributed data is preferred to the algorithm for learning from centralized data, under the assumption that each data source allows both shipping raw data and computation of sufficient statistics Example: 10 x 10 x 100 x 10 = 100,000 < 1,000,000. [Caragea et al., 2003]
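The communication condition of the theorem is easy to check for a given setting; a small sketch using the example figures from the slide.

```python
def prefer_distributed(num_classes, max_values, tree_size, num_sources, data_size):
    """True when shipping sufficient statistics (roughly C*V*T*K numbers)
    is cheaper than shipping the raw data (|D| examples)."""
    return num_classes * max_values * tree_size * num_sources < data_size

print(prefer_distributed(10, 10, 100, 10, 1_000_000))  # 100,000 < 1,000,000 -> True
```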

46 Some Results on Learning from Distributed Data Provably exact algorithms for learning decision trees, SVM, Naïve Bayes, Neural Network, and Bayesian network classifiers from distributed data Positive and negative results concerning efficiency (bandwidth, memory, computation) of learning from distributed data without retrieving raw data relative to its centralized counterpart [Caragea, Silvescu, and Honavar, 2004] A theoretical framework based on sufficient statistics for analysis and design of efficient, exact algorithms for learning classifiers from distributed data

47 Related work learning from distributed data Parallel distributed learning: [Provost and Kolluri, 1999; Grossman and Guo, 2001] Ensemble approach: [Domingos, 1997; Prodromidis et al., 2000] Cooperation-based approach: [Provost and Henessy, 1996; Leckie and Kotagiri, 2002] Learning from vertically fragmented data: [Kargupta et al., 1999, 2001; Park and Kargupta, 2002] Relational learning: [Knobbe et al., 1999; Getoor et al., 2001; Atramentov et al., 2003] Privacy preserving data mining: [Lindell and Pinkas, 2002; Clifton et al., 2002] Attribute noise tolerant PAC Learning [Kearns, 1999]

48 Our approach Works for any learning algorithm Works for different types of data fragmentation Works for some scenarios where privacy preservation is required Yields algorithms that are provably exact with respect to their corresponding batch counterparts Lends itself to adaptation to learning from semantically heterogeneous data

49 Outline Background and motivation Learning from data revisited Learning classifiers from distributed data Learning classifiers from semantically heterogeneous data Current Status and Summary of Results

50 Learning from Semantically Heterogeneous Data The user ontology O and the mappings M(O, O_1..O_N) from O to O_1..O_N are supplied to the query answering engine; the Learner poses the statistical query s_O(D, h_i -> h_{i+1}); Query Decomposition sends sub-queries q_1, q_2, q_3 to the ontology-extended data sources (D_1, O_1), (D_2, O_2), (D_3, O_3); Answer Composition assembles the local answers into s_O(D, h_i -> h_{i+1})

51 Learning from semantically heterogeneous data Requires solving the data integration problem: Given a set of autonomous, heterogeneous information sources, each with its own associated schema and ontology, answer statistical queries from a user's perspective (user schema and ontology)

52 Semantically heterogeneous data D 1: Day, Temperature (C), Wind Speed (km/h), Outlook with values {Cloudy, Sunny, Rainy} D 2: Day, Temp (F), Wind (mph), Precipitation with values {Rain, Light Rain, No Prec}

53 Making Data Sources Self Describing Exposing the schema structure of data Specification of the attributes of the data and their types D 1 Day: day Temperature: deg C Wind Speed: kmh Outlook: outlook D 2 Day: day Temp: deg F Wind: mph Precipitation: prec Exposing the ontology conceptualization of semantics of data e.g., domains of attributes and relationships between values

54 Ontologies Partial order ontology (DAG structured) is-a hierarchies part-of hierarchies Attribute value taxonomy (AVT)

55 Ontology Extended Data Sources Expose the data source schema structure of data specification of the attributes of the data and their types Expose the data source ontology conceptualization of semantics of data domains of attributes and relationships between values attribute value hierarchies Ontology extended data source = Data Source Schema + Data Source Ontology + Data
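A minimal sketch of what an ontology-extended data source descriptor might contain (the names are illustrative, not the INDUS API), using the D 2 schema from the earlier slide.

```python
from dataclasses import dataclass, field

@dataclass
class OntologyExtendedDataSource:
    """Data source = schema (attributes and types) + ontology (value taxonomies) + data."""
    name: str
    schema: dict             # attribute name -> type, e.g. {"Temp": "deg F"}
    ontology: dict           # attribute name -> attribute value taxonomy (child -> parent)
    records: list = field(default_factory=list)

d2 = OntologyExtendedDataSource(
    name="D2",
    schema={"Day": "day", "Temp": "deg F", "Wind": "mph", "Precipitation": "prec"},
    ontology={"Precipitation": {               # child value -> parent value (is-a)
        "LightRain": "Rain", "Rain": "Precipitation", "NoPrec": "Precipitation"}},
)
```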

56 Mappings Querying data sources from a user s perspective is facilitated by specifying mappings at: Schema Level: from attributes from different data source schemas to attributes in the user schema Ontology Level: between values of the attributes from different data source ontologies to values of the corresponding attributes in the user ontology [Caragea, Pathak, and Honavar; 2004]

57 Mappings between schema D 1 Day: day Temperature: deg C Wind Speed: kmh Outlook: outlook D 2 Day: day Temp: deg F Wind: mph Precipitation: prec D U Day: day Temp: deg F Wind: kmh Outlook: outlook Schema mappings: Day:D 1 -> Day:D U, Day:D 2 -> Day:D U, Temperature:D 1 -> Temp:D U, Temp:D 2 -> Temp:D U
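Schema-level mappings can be represented as attribute-to-attribute correspondences from each data source schema to the user schema; a sketch using the attributes above (the Precipitation -> Outlook correspondence is the "different terms, same meaning" case from the earlier slide).

```python
# Mapping from (data source, source attribute) to the user-schema attribute.
schema_mappings = {
    ("D1", "Day"): "Day",
    ("D2", "Day"): "Day",
    ("D1", "Temperature"): "Temp",
    ("D2", "Temp"): "Temp",
    ("D1", "Wind Speed"): "Wind",
    ("D2", "Wind"): "Wind",
    ("D1", "Outlook"): "Outlook",
    ("D2", "Precipitation"): "Outlook",
}

def to_user_attribute(source, attribute):
    """Translate a data-source attribute name into the user schema."""
    return schema_mappings[(source, attribute)]

print(to_user_attribute("D2", "Precipitation"))   # 'Outlook'
```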

58 Mappings between Ontologies The is-a hierarchies H 1, H 2, and H U; the white nodes represent the values used to describe data

59 Data sources from a user's perspective Mappings between the is-a hierarchies H 1 and H U: Rainy:H 1 = Rain:H U, Snow:H 1 = Snow:H U, NoPrec:H 1 < Outlook:H U, {Sunny, Cloudy}:H 1 = NoPrec:H U Conversion functions are used to map units (e.g. degrees F to degrees C) [Caragea, Pathak, and Honavar, 2004]

60 Conversion functions A total function τ_1 2τ_2 : dom(τ_1) -> dom(τ_2) that maps values of τ_1 to values of τ_2 is called a conversion function from τ_1 to τ_2. For any two types τ_1, τ_2 ∈ Γ there exists at most one conversion function τ_1 2τ_2 For every type τ ∈ Γ, τ 2τ exists (identity) If τ_i 2τ_j and τ_j 2τ_k exist, then τ_i 2τ_k exists and equals the composition of τ_i 2τ_j followed by τ_j 2τ_k
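A minimal sketch of a conversion function registry, including the identity and one step of composition; the registered functions are illustrative.

```python
conversions = {
    ("deg F", "deg C"): lambda f: (f - 32.0) * 5.0 / 9.0,
    ("deg C", "deg K"): lambda c: c + 273.15,
}

def convert(value, src, dst):
    """Apply a registered conversion function, the identity, or a two-step composition."""
    if src == dst:
        return value                                  # identity conversion always exists
    if (src, dst) in conversions:
        return conversions[(src, dst)](value)
    for (a, b), f in conversions.items():             # try one step of composition
        if a == src and (b, dst) in conversions:
            return conversions[(b, dst)](f(value))
    raise ValueError(f"no conversion from {src} to {dst}")

print(convert(75.0, "deg F", "deg C"))   # ~23.9
print(convert(75.0, "deg F", "deg K"))   # composed via deg C
```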

61 Integration ontology An ontology (O_U, ≤) is called an integration ontology of a set of data source ontologies O_1, ..., O_K if there exist K partial injective mappings Φ_1, ..., Φ_K from O_1, ..., O_K, respectively, to O_U that satisfy: Order preservation: x ≤ y implies Φ_i(x) ≤ Φ_i(y), for all x, y ∈ O_i Semantics preservation: if (x:O_i op y:O_U) ∈ IC, then (Φ_i(x) op y), for all x ∈ O_i and y ∈ O_U

62 Semantic heterogeneity leads to Partially Specified Data Different data sources may describe data at different levels of abstraction Different users may want to view data at a certain level of abstraction H 1 (is-a) O U H U (is-a) Snow is under-specified in H 1 relative to user ontology H U Making D 1 partially specified from the user perspective [Zhang and Honavar, 2003; 2004]

63 Learning from Semantically Heterogeneous Data The user ontology O and the mappings M(O, O_1..O_N) between O_1..O_N and O are supplied to the query answering engine; the Learner poses the statistical query s_O(h_i -> h_{i+1}, D); Query Decomposition sends sub-queries q_1, q_2, q_3 to the ontology-extended data sources (D_1, O_1), (D_2, O_2), (D_3, O_3); Answer Composition assembles the local answers into s_O(h_i -> h_{i+1}, D)

64 Learning Classifiers from Attribute Value Taxonomies (AVT) and Partially Specified Data Given a taxonomy over values of each attribute, and data specified in terms of values at different levels of abstraction, learn a concise and accurate hypothesis Example AVTs: Student Status: Undergraduate {Freshman, Sophomore, Junior, Senior}, Graduate {Master, Ph.D}; Work Status: On-Campus {TA, RA, AA}, Off-Campus {Government {Federal, State, Local}, Private {Org, Com}} Hypotheses h(γ_0), h(γ_1), ..., h(γ_k) correspond to different cuts through the taxonomies [Zhang and Honavar, 2003; 2004; 2005]

65 Learning Classifiers from (AVT) and Partially Specified Data Cuts through AVT induce a partial order over instance representations Classifiers AVT-DTL and AVT-NBL Show how to learn classifiers from partially specified data Estimate sufficient statistics from partially specified data under specific statistical assumptions Use CMDL score to trade off classifier complexity against accuracy [Zhang and Honavar, 2003; 2004; 2005]
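A sketch of how a cut through an AVT abstracts attribute values (the taxonomy and cut below are illustrative, based on the Work Status example): each fully specified value is mapped up to its ancestor on the cut, while a value coarser than the cut is partially specified relative to it.

```python
# Attribute value taxonomy as child -> parent links (root: "Work Status").
avt = {
    "On-Campus": "Work Status", "Off-Campus": "Work Status",
    "TA": "On-Campus", "RA": "On-Campus",
    "Government": "Off-Campus", "Private": "Off-Campus",
    "Federal": "Government", "State": "Government",
}

def abstract_to_cut(value, cut, avt):
    """Walk up the taxonomy until a node on the chosen cut is reached."""
    while value not in cut:
        value = avt[value]   # KeyError above the root: value is coarser than the cut,
                             # i.e. partially specified relative to it
    return value

cut = {"On-Campus", "Government", "Private"}
print(abstract_to_cut("Federal", cut, avt))   # 'Government'
print(abstract_to_cut("TA", cut, avt))        # 'On-Campus'
```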

66 AVT-NBL for Learning Classifiers from Partially Specified Data
% error rates on data with different percentages of partially or totally missing values, based on 10-fold cross validation with 90% confidence intervals:
Data set  % missing  NBL            Prop-NBL       AVT-NBL
Mushroom  10%        4.65 (±1.33)   4.69 (±1.34)   0.30 (±0.30)
Mushroom  30%        5.28 (±1.41)   4.84 (±1.36)   0.64 (±0.50)
Mushroom  50%        6.63 (±1.57)   5.82 (±1.48)   1.24 (±0.70)
Nursery   10%        15.27 (±1.81)  15.50 (±1.82)  12.85 (±1.67)
Nursery   30%        26.84 (±2.23)  26.25 (±2.21)  21.19 (±2.05)
Nursery   50%        36.96 (±2.43)  35.88 (±2.41)  29.34 (±2.29)
Soybean   10%        8.76 (±1.76)   9.08 (±1.79)   6.75 (±1.57)
Soybean   30%        12.45 (±2.07)  11.54 (±2.00)  10.32 (±1.90)
Soybean   50%        19.39 (±2.47)  16.91 (±2.34)  16.93 (±2.34)
[Zhang and Honavar, 2004]

67 Learning decision tree classifiers from semantically heterogeneous data Schema and ontology level mappings and conversion functions relate the user ontology O U to the data source ontologies O 1 and O 2; the query engine decomposes a query such as Counts(Wind, Class | Outlook), Counts(Class | Outlook) posed against the user's view (Outlook: Sunny, Overcast, Rain), translates it for D 1 and D 2, and adds up the translated counts
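Putting the pieces together for a count query at a tree node: each data source answers in its own vocabulary, the answers are translated through the value-level mappings into the user ontology, and then added up; a minimal sketch with illustrative mappings and counts.

```python
from collections import Counter

# Value-level mappings from each source ontology to the user ontology.
value_mappings = {
    "O1": {"Rainy": "Rain", "Sunny": "NoPrec", "Cloudy": "NoPrec"},
    "O2": {"Rain": "Rain", "LightRain": "Rain", "NoPrec": "NoPrec"},
}

def translated_counts(local_counts, source_ontology):
    """Translate local count(value, class) answers into user-ontology terms."""
    mapping = value_mappings[source_ontology]
    out = Counter()
    for (value, cls), k in local_counts.items():
        out[(mapping[value], cls)] += k
    return out

c1 = Counter({("Rainy", "No"): 3, ("Sunny", "Yes"): 5})
c2 = Counter({("LightRain", "Yes"): 2, ("NoPrec", "Yes"): 4})
print(translated_counts(c1, "O1") + translated_counts(c2, "O2"))
```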

68 Related work Information integration [Levy, 1998; Davidson et al., 2001; Ekman, 2003] Ontology-extended relational algebra [Bonatti et al., 2003] Ontology and mappings editors [Noy et al., 2000; Eckman et al., 2002] Statistical databases [McClean et al., 2002] Learning from ontologies and fully specified data [Han and Fu, 1996; Koller and Sahami, 1997; Pazzani et al., 1997]

69 Our approach to learning classifiers from semantically heterogeneous data Is based on a separation of concerns between querying for sufficient statistics and hypothesis construction Supports learning from semantically heterogeneous data from a user perspective Offers a theoretically well founded solution to the problem of learning classifiers from semantically heterogeneous data

70 Outline of the talk Background and motivation Learning from data revisited Learning classifiers from distributed data Learning classifiers from semantically heterogeneous data Current Status and Summary of Results

71 Ontology-based information integration in INDUS

72 Results -- INDUS tools for collaborative knowledge acquisition Algorithms for learning classifiers from distributed data with provable performance guarantees relative to their centralized or batch counterparts Algorithms for answering statistical queries from semantically heterogeneous data Algorithms for learning classifiers from partial order ontologies and partially specified data Modular ontologies, inter-ontology mappings, and inference to support collaborative ontology development and reuse Implementation of INDUS software Applications in bioinformatics classifiers for protein function annotation, classifiers for binding site identification

73 Capabilities of INDUS INDUS provides support for: Specification and update of schemas and ontologies Specification of mappings between ontologies Registration of new data sources Specification of user views Specification and execution of queries across distributed, semantically heterogeneous data sources Learning classifiers from semantically heterogeneous data

74 INDUS Tools Ontology Editor for specifying or modifying ontologies Schema Editor for specifying or modifying data source schemas Mapping Editor for specifying mappings between ontologies and between schemas Data Editor for registering data sources with INDUS View Editor for defining user views Query Interface for formulating queries and displaying results

75 INDUS Users: Domain Ontologists A domain ontologist can specify or update: ontologies schemas mappings between ontologies mappings between schemas

76 INDUS Users: Data Providers A data provider can: Associate a predefined schema and ontology with a data source Specify data source location, type and access procedures Register a data source Act as a domain ontologist

77 INDUS Users: Domain Experts A domain expert can specify an application view select data sources of interest in an application domain an application specific schema an application specific ontology relevant mappings A domain expert can serve as Domain ontologist Data provider

78 INDUS Users: Analysis Tool Providers An analysis tool provider can: Register a tool (e.g., learning algorithm) Act as a data source provider Act as a domain ontologist Act as a domain expert

79 INDUS Users: Domain Scientists A domain scientist can Select an application view Formulate and execute queries Select and execute learning algorithms A domain scientist can act as Domain ontologist Data provider Domain expert Analysis tool provider

80 Some features of INDUS Clear distinction between structure and semantics of data Data integration from a user perspective - User-specifiable ontologies and mappings (no single global ontology) Semantic integrity of queries ensured by means of semantics-preserving mappings

81 Current Directions Further development of the open source INDUS tools for collaborative discovery Algorithms for learning classifiers from semantically heterogeneous multi-relational data Modular collaborative ontology development Ontology-extended workflows and services Applications in bioinformatics, security informatics, medical informatics, social informatics

82 Current Ph.D. Students: C. Andorf, J. Bao, C. Caragea, J. Pathak, T. Alcon, O. Yakhnenko, A. Silvescu, F. Wu, O. Kohutyuk, M. Brathwaite, F. Vasile, D-K. Kang, Y. El-Manzalawi, P. Zaback Postdoctoral Fellows, Recent Ph.D. grads, and Collaborating Ph.D. Students: D. Caragea, B. Olson, C. Yan, J. Zhang, K. Vander Velden, T. Dunn, O. Couture, M. Terribilini

83 Thank you! Vasant Honavar Bioinformatics and Computational Biology Program Center for Computational Intelligence, Learning, & Discovery Iowa State University


More information

Data Engineering. Data preprocessing and transformation

Data Engineering. Data preprocessing and transformation Data Engineering Data preprocessing and transformation Just apply a learner? NO! Algorithms are biased No free lunch theorem: considering all possible data distributions, no algorithm is better than another

More information

Distributed KIDS Labs 1

Distributed KIDS Labs 1 Distributed Databases @ KIDS Labs 1 Distributed Database System A distributed database system consists of loosely coupled sites that share no physical component Appears to user as a single system Database

More information

DESIGN AND IMPLEMENTATION OF BUILDING DECISION TREE USING C4.5 ALGORITHM

DESIGN AND IMPLEMENTATION OF BUILDING DECISION TREE USING C4.5 ALGORITHM 1 Proceedings of SEAMS-GMU Conference 2007 DESIGN AND IMPLEMENTATION OF BUILDING DECISION TREE USING C4.5 ALGORITHM KUSRINI Abstract. Decision tree is one of data mining techniques that is applied in classification

More information

Tillämpad Artificiell Intelligens Applied Artificial Intelligence Tentamen , , MA:8. 1 Search (JM): 11 points

Tillämpad Artificiell Intelligens Applied Artificial Intelligence Tentamen , , MA:8. 1 Search (JM): 11 points Lunds Tekniska Högskola EDA132 Institutionen för datavetenskap VT 2017 Tillämpad Artificiell Intelligens Applied Artificial Intelligence Tentamen 2016 03 15, 14.00 19.00, MA:8 You can give your answers

More information

Enterprise Miner Software: Changes and Enhancements, Release 4.1

Enterprise Miner Software: Changes and Enhancements, Release 4.1 Enterprise Miner Software: Changes and Enhancements, Release 4.1 The correct bibliographic citation for this manual is as follows: SAS Institute Inc., Enterprise Miner TM Software: Changes and Enhancements,

More information

Semantic Web. Dr. Philip Cannata 1

Semantic Web. Dr. Philip Cannata 1 Semantic Web Dr. Philip Cannata 1 Dr. Philip Cannata 2 Dr. Philip Cannata 3 Dr. Philip Cannata 4 See data 14 Scientific American.sql on the class website calendar SELECT strreplace(x, 'sa:', '') "C" FROM

More information

On the use of Abstract Workflows to Capture Scientific Process Provenance

On the use of Abstract Workflows to Capture Scientific Process Provenance On the use of Abstract Workflows to Capture Scientific Process Provenance Paulo Pinheiro da Silva, Leonardo Salayandia, Nicholas Del Rio, Ann Q. Gates The University of Texas at El Paso CENTER OF EXCELLENCE

More information

Evaluation of different biological data and computational classification methods for use in protein interaction prediction.

Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Yanjun Qi, Ziv Bar-Joseph, Judith Klein-Seetharaman Protein 2006 Motivation Correctly

More information

CS423: Data Mining. Introduction. Jakramate Bootkrajang. Department of Computer Science Chiang Mai University

CS423: Data Mining. Introduction. Jakramate Bootkrajang. Department of Computer Science Chiang Mai University CS423: Data Mining Introduction Jakramate Bootkrajang Department of Computer Science Chiang Mai University Jakramate Bootkrajang CS423: Data Mining 1 / 29 Quote of the day Never memorize something that

More information

Opus: University of Bath Online Publication Store

Opus: University of Bath Online Publication Store Patel, M. (2004) Semantic Interoperability in Digital Library Systems. In: WP5 Forum Workshop: Semantic Interoperability in Digital Library Systems, DELOS Network of Excellence in Digital Libraries, 2004-09-16-2004-09-16,

More information

A Multi-Relational Decision Tree Learning Algorithm - Implementation and Experiments

A Multi-Relational Decision Tree Learning Algorithm - Implementation and Experiments A Multi-Relational Decision Tree Learning Algorithm - Implementation and Experiments Anna Atramentov, Hector Leiva and Vasant Honavar Artificial Intelligence Research Laboratory, Computer Science Department

More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

Knowledge Discovery in Data Bases

Knowledge Discovery in Data Bases Knowledge Discovery in Data Bases Chien-Chung Chan Department of CS University of Akron Akron, OH 44325-4003 2/24/99 1 Why KDD? We are drowning in information, but starving for knowledge John Naisbett

More information