Algorithms and Software for Collaborative Discovery from Autonomous, Semantically Heterogeneous, Distributed Information Sources

1 Algorithms and Software for Collaborative Discovery from Autonomous, Semantically Heterogeneous, Distributed Information Sources Vasant Honavar Bioinformatics and Computational Biology Graduate Program Center for Computational Intelligence, Learning, & Discovery Iowa State University

2 Coauthors Doina Caragea Jie Bao Jyotishman Pathak Jun Zhang

3 Outline Background and motivation Learning from data revisited Learning classifiers from distributed data Learning classifiers from semantically heterogeneous data Current Status and Summary of Results

4 Background Data revolution Bioinformatics Over 200 data repositories of interest to molecular biologists alone (Discala, 2000) Environmental Informatics Enterprise Informatics Medical Informatics Social Informatics... Connectivity revolution (Internet and the web) Integration revolution Need to understand the elephant as opposed to examining the trunk, the tail, etc. Need infrastructure to support collaborative, integrative analysis of data

5 Infrastructure for scientific computing As part of efforts to build the scientific computing infrastructure in the US, Europe, and Japan Many hundreds of millions of dollars have been spent Many high end computers built Many large databases constructed High speed networks installed Many hundreds of computers bought Many thousands of software applications developed But... have we succeeded in changing the nature of the practice of science?

6 Motivating Application Bioinformatics Discovery of functionally important sequence and structural features of proteins Prediction of protein-protein, protein-DNA, and protein-RNA interfaces Discovery of genetic regulatory networks

7 Representative application: Discovery of Functionally Important Sequence and Structural Features of Proteins Figure 3a: The 3-dimensional structure of human Caspase-1 (MEROPS family C14), corresponding to PDB entry 1BMQ. The four labeled residues Arg 179, His 237, Cys 285, and Arg 341 are known to form the substrate binding pocket of the Caspase-1 enzyme [Wilson, et al., 1994 Nature 370: ]. Three of these residues (Arg 179, His 237, and Cys 285) are located within the MEME-generated motifs frequently used by the decision tree classifier for the MEROPS family C14. These motifs correspond to residues (red), (yellow), (green). Figure 3b: The 3-dimensional structure of Astacin (MEROPS family M12) from A. astacus, corresponding to PDB entry 1QJJ. Five MEME-generated motifs selected by the decision tree algorithm for the MEROPS family M12 correspond to residues (red), (yellow), and (green). The five labeled residues -- His 92, His 96, Glu 93, His 102, and Tyr 149 -- that appear within the motifs have been shown to form the zinc binding pocket of the enzyme [Bond and Beynon, 1995, Protein Science 4: ]. [Wang et al., 2003]

8 Motivating Application Bioinformatics applications of machine learning require integrated analysis of data from multiple sources Solution 1 Assemble a data set using special purpose scripts to extract data sets from different sources and then apply standard algorithms to the assembled data set time consuming, not scalable, and does not handle partially specified data Solution 2 Understand, state, and design algorithms to solve the problem of learning from semantically heterogeneous, distributed data sources

9 Representative application scenario Learning sequence and structural correlates of protein function

10 Acquiring knowledge from data Most current machine learning algorithms assume centralized access to a semantically homogeneous data set [Diagram: Assumptions + Data -> Learner L -> hypothesis h -> Knowledge]

11 Challenges Gleaning useful knowledge from data requires tools for analysis of data from autonomous sources Large, distributed, data sources Semantic (ontological) gap Partially specified data Multiple points of view Access constraints...

12 Towards an infrastructure for collaborative discovery Building an effective infrastructure for scientific discovery requires coming to terms with How scientists communicate discipline-specific jargon versus common terms How scientists process information role of background knowledge, assumptions, points of view (ontological commitments) How scientists work capture and analyze data from multiple points of view, at multiple levels of abstraction Distributed, often massive data sources Autonomy of data sources (access restrictions, query capabilities) Semantic gaps between data sources, and between a data source and the user's point of view in a given context

13 Challenge: Distributed Data Sources Large Growing at an exponential rate Centralized access not feasible Can we learn without centralized access to data? How? How do the results compare with centralized setting?

14 Challenge: Semantic heterogeneity Sub-disciplines limited by their instruments of observation Stumbling block to scientific understanding: Blind men and the elephant syndrome

15 Semantic Gap Structural Genomics, Functional Genomics, Tissue, Genome, Sequence, Disease, Clinical Trials, Clinical Data Countries separated by a common language! [Shaw, 1942, after Wilde, 1887]

16 Ontological differences? Temperature : Celsius Outlook : {Sunny, Rainy} Temp : Fahrenheit Precipitation : {NoPrec, Rain} Different terms, same meaning: Outlook vs. Precipitation Same term, different meaning: Wind (speed) vs. Wind (direction) Different domains of values for semantically equivalent attributes Different units: 32 deg. C vs. 75 deg. F

17 Challenge: Data source autonomy Access restrictions Privacy constraints Data source capabilities Queries Execution of user-supplied procedures Storage of partial results or indices Computing, memory and bandwidth limitations

18 Enabling technologies World wide web Knowledge representation; Description Logics, Ontology languages (OWL) Languages for making data sources, resources, and services self-describing (XML, RDF, WSDL) Service oriented computing (Web services)

19 Steps Towards the Semantic Web Early Web (1990): HyperText Markup Language, HyperText Transfer Protocol, Documents Self-Describing Documents (2000): eXtensible Markup Language, Resource Description Framework Web of Knowledge (2010): Ontology, Knowledge, Inference, Services; Data and Programs; Machine-Machine, Human-Machine and Machine-Human communication [Berners-Lee, Hendler; Nature, 2001]

20 Towards an infrastructure for collaborative discovery Building an effective infrastructure for scientific discovery requires coming to terms with How scientists communicate discipline-specific jargon versus common terms How scientists process information role of background knowledge, assumptions, points of view (ontological commitments) How scientists work capture and analyze data from multiple points of view, at multiple levels of abstraction

21 Solution: INDUS for Learning from Semantically Heterogeneous Distributed Autonomous Data Sources

22 Outline Background and motivation Learning from data revisited Learning classifiers from distributed data Learning classifiers from semantically heterogeneous data Current Status and Summary of Results

23 Learning Classifiers from Data Learning: Labeled Examples -> Learner -> Classifier Classification: Unlabeled Instance -> Classifier -> Class Standard learning algorithms assume centralized access to data

24 Example: Learning decision tree classifiers
Day  Outlook   Temp.  Humidity  Wind    Play Tennis
1    Sunny     Hot    High      Weak    No
2    Sunny     Hot    High      Strong  No
3    Overcast  Hot    High      Weak    Yes
4    Overcast  Cold   Normal    Weak    No
The data set {1, 2, 3, 4} is shown split into the fragments {1, 2} and {3, 4}. Resulting tree: Outlook = Sunny -> No ({1, 2}); Outlook = Overcast -> split on Temp.: Hot -> Yes ({3}), Cold -> No ({4}).

25 Example: Learning decision tree classifiers A decision tree is constructed by recursively (and greedily) choosing the attribute that provides the greatest estimated information about the class label What do we need to choose a split at each step? Information gain Estimated probability distribution resulting from each candidate split Proportion of instances of each class along each branch of each candidate split If we have the relevant counts, we have no need for the data!
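As a concrete illustration of the point that only counts are needed, here is a minimal Python sketch that scores a candidate split from count dictionaries alone; the counts below are taken from the small PlayTennis example above, and the function names are illustrative.

```python
import math

def entropy(class_counts):
    """Entropy of a class distribution given as a dict {class: count}."""
    total = sum(class_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts.values() if c > 0)

def information_gain(class_counts, value_class_counts):
    """Information gain of a candidate split, computed from counts alone.

    class_counts:       {class: count} at the current node
    value_class_counts: {attribute_value: {class: count}}
    """
    total = sum(class_counts.values())
    remainder = sum(sum(vc.values()) / total * entropy(vc)
                    for vc in value_class_counts.values())
    return entropy(class_counts) - remainder

# Counts at the root node of the PlayTennis example (days 1-4)
class_counts = {"No": 3, "Yes": 1}
outlook_counts = {"Sunny": {"No": 2}, "Overcast": {"No": 1, "Yes": 1}}
print(information_gain(class_counts, outlook_counts))  # ~0.311 bits
```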

26 Learning from data reexamined Hypothesis construction: h_{i+1} = C(h_i, s(h_i -> h_{i+1}, D)) Statistical query generation: the query s(h_i -> h_{i+1}, D) is posed against the data D Learning = Sufficient statistics extraction + Hypothesis construction

27 Learning from data reexamined Sufficient statistics A statistic f_θ(D) is a sufficient statistic for θ if it contains all the information that is needed for estimating the parameter θ from the data D. The sample mean is a sufficient statistic for the mean of a distribution. We have no use for the data once we have a sufficient statistic for the parameter of interest Note: The classical definition of a sufficient statistic is not constructive
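A minimal Python sketch of the idea, with made-up numbers: the pair (sum, count) is a sufficient statistic for the sample mean, and sufficient statistics from separate data fragments combine without revisiting the raw values.

```python
def mean_sufficient_statistic(data):
    """Return the sufficient statistic (sum, count) for the sample mean."""
    return (sum(data), len(data))

def combine(stat1, stat2):
    """Sufficient statistics from two fragments combine by simple addition."""
    return (stat1[0] + stat2[0], stat1[1] + stat2[1])

fragment1, fragment2 = [1.0, 2.0, 3.0], [4.0, 5.0]
s = combine(mean_sufficient_statistic(fragment1),
            mean_sufficient_statistic(fragment2))
print(s[0] / s[1])   # 3.0, the mean of the union, computed without pooling the data
```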

28 Sufficient statistics for learning classifiers By drawing an analogy between a hypothesis h and a parameter estimate: an estimator A maps data D to an estimate θ ∈ Θ, just as a learner L maps data D to a hypothesis h ∈ H

29 Sufficient statistics for learning classifiers A statistic s_L(D, h) is called a sufficient statistic for learning a hypothesis h produced by the learning algorithm L when L is applied to a data set D if there exists an algorithm that takes s_L(D, h) as input and outputs h. [Caragea, Silvescu, and Honavar, 2004]. We typically want minimal sufficient statistics and efficient algorithms for computing such statistics Trivially, D is an s_L(D, h) and so is h. Typically it helps to break down the computation of s_L(D, h) into smaller steps queries to data D and computation on the results of the query

30 Sufficient statistic for learning a hypothesis A statistic s_L(D, h_i -> h_{i+1}) is called a sufficient statistic for the refinement of h_i into h_{i+1} if there exists an algorithm R that takes h_i and s_L(D, h_i -> h_{i+1}) as inputs and outputs h_{i+1}. A statistic s_L(D, h) is a sufficient statistic for learning a hypothesis h using the algorithm L applied to the data D if h can be obtained from h_0 = Ø through a sequence of refinement and composition operations. [Caragea, Silvescu, and Honavar, 2004]

31 Example: Learning decision tree classifiers The same training data (days 1-4) and its horizontal fragments {1, 2} and {3, 4} as in the earlier example; the resulting tree over {1, 2, 3, 4}: Outlook = Sunny -> No ({1, 2}); Outlook = Overcast -> split on Temp.: Hot -> Yes ({3}), Cold -> No ({4}).

32 Sufficient statistics for refining a decision tree Entropy H(D) = - Σ_i (|D_i| / |D|) log_2 (|D_i| / |D|) Sufficient statistics for refining a partially constructed decision tree: count(attribute, class | path) and count(class | path)
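A minimal sketch (hypothetical attribute names) of the query side: the refinement statistics count(attribute, class | path) and count(class | path) are obtained by a single counting pass over the examples that satisfy the path constraints.

```python
from collections import Counter

def refinement_counts(examples, path, attribute, class_attr="PlayTennis"):
    """count(attribute, class | path) and count(class | path) for one candidate split.

    examples: list of dicts mapping attribute names to values
    path:     attribute tests already on the path, e.g. {"Outlook": "Overcast"}
    """
    matching = [e for e in examples if all(e[a] == v for a, v in path.items())]
    attr_class = Counter((e[attribute], e[class_attr]) for e in matching)
    class_only = Counter(e[class_attr] for e in matching)
    return attr_class, class_only

examples = [
    {"Outlook": "Sunny",    "Temp": "Hot",  "PlayTennis": "No"},
    {"Outlook": "Overcast", "Temp": "Hot",  "PlayTennis": "Yes"},
    {"Outlook": "Overcast", "Temp": "Cold", "PlayTennis": "No"},
]
print(refinement_counts(examples, {"Outlook": "Overcast"}, "Temp"))
```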

33 Decision Tree Learning = Statistical Query Answering + Hypothesis refinement The root split on Outlook requires Counts(Attribute, Class) and Counts(Class); the Outlook = Sunny branch splits on Humidity using Counts(Humidity, Class | Outlook) and Counts(Class | Outlook); the Outlook = Overcast branch is a Yes leaf; the Outlook = Rain branch splits on Wind using Counts(Wind, Class | Outlook) and Counts(Class | Outlook); each set of counts is obtained by a statistical query against the data

34 Decision Tree Learning = Statistical Query Answering + Hypothesis refinement Joint count sufficient statistic over attributes A_{i1}, ..., A_{im}: count(A_{i1}, ..., A_{im}) Joint count sufficient statistics provide all the information needed for learning Naïve Bayes, Bayesian Network (when the structure is known), Decision Tree and many other classifiers We can define refinement sufficient statistics for algorithms for SVM, logistic regression, etc. [Caragea, Silvescu, and Honavar, 2004; Caragea, Caragea, and Honavar, 2005]
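For example, Naïve Bayes parameters can be read directly off the per-attribute (value, class) counts; a minimal sketch with illustrative counts, not the authors' implementation.

```python
def naive_bayes_parameters(class_counts, value_class_counts):
    """Estimate P(class) and P(value | class) per attribute from counts alone.

    class_counts:       {class: count}
    value_class_counts: {attribute: {(value, class): count}}
    """
    n = sum(class_counts.values())
    priors = {c: k / n for c, k in class_counts.items()}
    likelihoods = {
        attr: {vc: k / class_counts[vc[1]] for vc, k in counts.items()}
        for attr, counts in value_class_counts.items()
    }
    return priors, likelihoods

priors, likelihoods = naive_bayes_parameters(
    {"Yes": 1, "No": 3},
    {"Outlook": {("Sunny", "No"): 2, ("Overcast", "No"): 1, ("Overcast", "Yes"): 1}},
)
print(priors, likelihoods)
```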

35 Learning from Data Reexamined Identification of minimal or near minimal sufficient statistics for different classes of learning algorithms Design of effective procedures for computing minimal or near minimal sufficient statistics or their efficient approximations Separation of concerns between hypothesis construction (through successive refinement and composition operations) and statistical query answering

36 Outline Background and motivation Learning from data revisited Learning classifiers from distributed data Learning classifiers from semantically heterogeneous data Current Status and Summary of Results

37 Learning Classifiers from Distributed Data Learning from distributed data requires learning from dataset fragments without gathering all of the data in a central location Assuming that the data set is represented in tabular form, data fragmentation can be horizontal, vertical, or more general (e.g. multi-relational)

38 Horizontal Data Fragmentation Example: Autonomously maintained data for different organisms in comparative genomics Data set fragments are distributed across multiple data repositories Complete data set is the union of data set fragments D 1 D 2 D 3 D 4

39 Vertical Data Fragmentation Example: data gathered by multiple laboratories about outcomes of different sets of clinical tests on a patient Each data set fragment contains sub-tuples of data tuples Sub-tuples of a tuple can be associated with each other using a unique key (e.g., patient's social security number) Complete data set is the join of data set fragments D 1 D 2 D 3 D 4
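A minimal sketch of the vertical case, with hypothetical laboratory records: sub-tuples held by different sources are recovered into full tuples by joining on the shared key.

```python
def join_fragments(fragments, key="patient_id"):
    """Join vertically fragmented records on a shared unique key."""
    joined = {}
    for fragment in fragments:
        for record in fragment:
            joined.setdefault(record[key], {}).update(record)
    return list(joined.values())

lab_a = [{"patient_id": 7, "glucose": 5.4}]
lab_b = [{"patient_id": 7, "cholesterol": 4.1}]
print(join_fragments([lab_a, lab_b]))
# [{'patient_id': 7, 'glucose': 5.4, 'cholesterol': 4.1}]
```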

40 Multi-Relational Data Fragmentation Data are stored in a set of relational database tables that can be conceptually tied together by a (global) schema

41 Learning from distributed data The Learner poses the statistical query s(D, h_i -> h_{i+1}); Query Decomposition sends sub-queries q_1, q_2, q_3 to the data sources D_1, D_2, D_3; Answer Composition assembles the local answers into s(D, h_i -> h_{i+1})

42 Learning from distributed data Learning classifiers from distributed data reduces to statistical query answering from distributed data A sound and complete procedure for answering the desired class of statistical queries from distributed data under Different types of data fragmentation Different constraints on access and query capabilities Different bandwidth and resource constraints [Caragea, Silvescu, and Honavar, 2004, also work in progress]
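Under horizontal fragmentation, the query decomposition and answer composition steps amount to count aggregation; a minimal sketch with an illustrative data source interface, not the INDUS API.

```python
from collections import Counter

def answer_count_query(fragments, attribute, class_attr):
    """Answer count(attribute, class) over horizontally fragmented data.

    Each fragment answers the same local query; answer composition adds the
    local counts, so no raw examples ever leave a data source.
    """
    total = Counter()
    for fragment in fragments:                       # one local query per data source
        local = Counter((e[attribute], e[class_attr]) for e in fragment)
        total.update(local)                          # answer composition
    return total

d1 = [{"Outlook": "Sunny", "Play": "No"}, {"Outlook": "Sunny", "Play": "No"}]
d2 = [{"Outlook": "Overcast", "Play": "Yes"}]
print(answer_count_query([d1, d2], "Outlook", "Play"))
```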

43 How can we evaluate algorithms for learning from distributed data? Compare with their batch counterparts Exactness guarantee that the learned hypothesis is the same as or equivalent to that obtained by the batch counterpart Approximation guarantee that the learned hypothesis is an approximation (in a quantifiable sense) of the hypothesis obtained in the batch setting Communication, memory, and processing requirements [Caragea, Silvescu, and Honavar, 2003, 2004]

44 Exact Learning of decision tree classifiers from distributed data under horizontal fragmentation The query answering engine decomposes each statistical query, e.g. Counts(Wind, Class | Outlook) and Counts(Class | Outlook), into local queries against D 1, D 2, D 3; answer composition simply adds up the local counts

45 Time and communication complexity: centralized versus distributed case C is the number of classes (e.g., C = 10), V is the maximum number of values of an attribute (e.g., V = 10), |D| is the size of the data (number of examples) (e.g., |D| = 1,000,000), T is the size of the tree (number of nodes) (e.g., T = 100), K is the number of data sources (e.g., K = 10). Theorem (Time): The algorithm for learning from horizontally fragmented distributed data is K times faster than the algorithm for learning from centralized data, if parallel access to the data sources is allowed. Theorem (Communication): If C x V x T x K < |D|, then the algorithm for learning from horizontally fragmented distributed data is preferred to the algorithm for learning from centralized data, under the assumption that each data source allows both shipping raw data and computation of sufficient statistics Example: 10 x 10 x 100 x 10 = 100,000 < 1,000,000. [Caragea et al., 2003]
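The communication condition of the theorem is easy to check for a given setting; a small sketch using the example figures from the slide.

```python
def prefer_distributed(num_classes, max_values, tree_size, num_sources, data_size):
    """True when shipping sufficient statistics (roughly C*V*T*K numbers)
    is cheaper than shipping the raw data (|D| examples)."""
    return num_classes * max_values * tree_size * num_sources < data_size

print(prefer_distributed(10, 10, 100, 10, 1_000_000))  # 100,000 < 1,000,000 -> True
```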

46 Some Results on Learning from Distributed Data Provably exact algorithms for learning decision trees, SVM, Naïve Bayes, Neural Network, and Bayesian network classifiers from distributed data Positive and negative results concerning efficiency (bandwidth, memory, computation) of learning from distributed data without retrieving raw data relative to its centralized counterpart [Caragea, Silvescu, and Honavar, 2004] A theoretical framework based on sufficient statistics for analysis and design of efficient, exact algorithms for learning classifiers from distributed data

47 Related work learning from distributed data Parallel distributed learning: [Provost and Kolluri, 1999; Grossman and Guo, 2001] Ensemble approach: [Domingos, 1997; Prodromidis et al., 2000] Cooperation-based approach: [Provost and Henessy, 1996; Leckie and Kotagiri, 2002] Learning from vertically fragmented data: [Kargupta et al., 1999, 2001; Park and Kargupta, 2002] Relational learning: [Knobbe et al., 1999; Getoor et al., 2001; Atramentov et al., 2003] Privacy preserving data mining: [Lindell and Pinkas, 2002; Clifton et al., 2002] Attribute noise tolerant PAC Learning [Kearns, 1999]

48 Our approach Works for any learning algorithm Works for different types of data fragmentation Works for some scenarios where privacy preservation is required Yields algorithms that are provably exact with respect to their corresponding batch counterparts Lends itself to adaptation to learning from semantically heterogeneous data

49 Outline Background and motivation Learning from data revisited Learning classifiers from distributed data Learning classifiers from semantically heterogeneous data Current Status and Summary of Results

50 Learning from Semantically Heterogeneous Data The user ontology O and the mappings M(O, O_1..O_N) from O to O_1..O_N are supplied to the query answering engine; the Learner poses the statistical query s_O(D, h_i -> h_{i+1}); Query Decomposition sends sub-queries q_1, q_2, q_3 to the ontology-extended data sources (D_1, O_1), (D_2, O_2), (D_3, O_3); Answer Composition assembles the local answers into s_O(D, h_i -> h_{i+1})

51 Learning from semantically heterogeneous data Requires solving the data integration problem: Given a set of autonomous, heterogeneous information sources, each with its own associated schema and ontology, answer statistical queries from a user's perspective (user schema and ontology)

52 Semantically heterogeneous data D 1: Day, Temperature (C), Wind Speed (km/h), Outlook with values {Cloudy, Sunny, Rainy} D 2: Day, Temp (F), Wind (mph), Precipitation with values {Rain, Light Rain, No Prec}

53 Making Data Sources Self Describing Exposing the schema structure of data Specification of the attributes of the data and their types D 1 Day: day Temperature: deg C Wind Speed: kmh Outlook: outlook D 2 Day: day Temp: deg F Wind: mph Precipitation: prec Exposing the ontology conceptualization of semantics of data e.g., domains of attributes and relationships between values

54 Ontologies Partial order ontology (DAG structured) is-a hierarchies part-of hierarchies Attribute value taxonomy (AVT)

55 Ontology Extended Data Sources Expose the data source schema structure of data specification of the attributes of the data and their types Expose the data source ontology conceptualization of semantics of data domains of attributes and relationships between values attribute value hierarchies Ontology extended data source = Data Source Schema + Data Source Ontology + Data
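A minimal sketch of what an ontology-extended data source descriptor might contain (the names are illustrative, not the INDUS API), using the D 2 schema from the earlier slide.

```python
from dataclasses import dataclass, field

@dataclass
class OntologyExtendedDataSource:
    """Data source = schema (attributes and types) + ontology (value taxonomies) + data."""
    name: str
    schema: dict             # attribute name -> type, e.g. {"Temp": "deg F"}
    ontology: dict           # attribute name -> attribute value taxonomy (child -> parent)
    records: list = field(default_factory=list)

d2 = OntologyExtendedDataSource(
    name="D2",
    schema={"Day": "day", "Temp": "deg F", "Wind": "mph", "Precipitation": "prec"},
    ontology={"Precipitation": {               # child value -> parent value (is-a)
        "LightRain": "Rain", "Rain": "Precipitation", "NoPrec": "Precipitation"}},
)
```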

56 Mappings Querying data sources from a user s perspective is facilitated by specifying mappings at: Schema Level: from attributes from different data source schemas to attributes in the user schema Ontology Level: between values of the attributes from different data source ontologies to values of the corresponding attributes in the user ontology [Caragea, Pathak, and Honavar; 2004]

57 Mappings between schema D 1 Day: day Temperature: deg C Wind Speed: kmh Outlook: outlook D 2 Day: day Temp: deg F Wind: mph Precipitation: prec D U Day: day Temp: deg F Wind: kmh Outlook: outlook Schema mappings: Day:D 1 -> Day:D U, Day:D 2 -> Day:D U, Temperature:D 1 -> Temp:D U, Temp:D 2 -> Temp:D U
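Schema-level mappings can be represented as attribute-to-attribute correspondences from each data source schema to the user schema; a sketch using the attributes above (the Precipitation -> Outlook correspondence is the "different terms, same meaning" case from the earlier slide).

```python
# Mapping from (data source, source attribute) to the user-schema attribute.
schema_mappings = {
    ("D1", "Day"): "Day",
    ("D2", "Day"): "Day",
    ("D1", "Temperature"): "Temp",
    ("D2", "Temp"): "Temp",
    ("D1", "Wind Speed"): "Wind",
    ("D2", "Wind"): "Wind",
    ("D1", "Outlook"): "Outlook",
    ("D2", "Precipitation"): "Outlook",
}

def to_user_attribute(source, attribute):
    """Translate a data-source attribute name into the user schema."""
    return schema_mappings[(source, attribute)]

print(to_user_attribute("D2", "Precipitation"))   # 'Outlook'
```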

58 Mappings between Ontologies The is-a hierarchies H 1, H 2, and H U; the white nodes represent the values used to describe data

59 Data sources from a user's perspective Mappings between the is-a hierarchies H 1 and H U: Rainy:H 1 = Rain:H U, Snow:H 1 = Snow:H U, NoPrec:H 1 < Outlook:H U, {Sunny, Cloudy}:H 1 = NoPrec:H U Conversion functions are used to map units (e.g. degrees F to degrees C) [Caragea, Pathak, and Honavar, 2004]

60 Conversion functions A total function τ_1 2τ_2 : dom(τ_1) -> dom(τ_2) that maps values of τ_1 to values of τ_2 is called a conversion function from τ_1 to τ_2. For any two types τ_1, τ_2 ∈ Γ there exists at most one conversion function τ_1 2τ_2 For every type τ ∈ Γ, τ 2τ exists (identity) If τ_i 2τ_j and τ_j 2τ_k exist, then τ_i 2τ_k exists and equals the composition of τ_i 2τ_j followed by τ_j 2τ_k
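A minimal sketch of a conversion function registry, including the identity and one step of composition; the registered functions are illustrative.

```python
conversions = {
    ("deg F", "deg C"): lambda f: (f - 32.0) * 5.0 / 9.0,
    ("deg C", "deg K"): lambda c: c + 273.15,
}

def convert(value, src, dst):
    """Apply a registered conversion function, the identity, or a two-step composition."""
    if src == dst:
        return value                                  # identity conversion always exists
    if (src, dst) in conversions:
        return conversions[(src, dst)](value)
    for (a, b), f in conversions.items():             # try one step of composition
        if a == src and (b, dst) in conversions:
            return conversions[(b, dst)](f(value))
    raise ValueError(f"no conversion from {src} to {dst}")

print(convert(75.0, "deg F", "deg C"))   # ~23.9
print(convert(75.0, "deg F", "deg K"))   # composed via deg C
```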

61 Integration ontology An ontology (O_U, ≤) is called an integration ontology of a set of data source ontologies O_1, ..., O_K if there exist K partial injective mappings Φ_1, ..., Φ_K from O_1, ..., O_K, respectively, to O_U that satisfy: Order preservation: x ≤ y implies Φ_i(x) ≤ Φ_i(y), for all x, y ∈ O_i Semantics preservation: if (x:O_i op y:O_U) ∈ IC, then (Φ_i(x) op y), for all x ∈ O_i and y ∈ O_U

62 Semantic heterogeneity leads to Partially Specified Data Different data sources may describe data at different levels of abstraction Different users may want to view data at a certain level of abstraction H 1 (is-a) O U H U (is-a) Snow is under-specified in H 1 relative to user ontology H U Making D 1 partially specified from the user perspective [Zhang and Honavar, 2003; 2004]

63 Learning from Semantically Heterogeneous Data The user ontology O and the mappings M(O, O_1..O_N) between O_1..O_N and O are supplied to the query answering engine; the Learner poses the statistical query s_O(h_i -> h_{i+1}, D); Query Decomposition sends sub-queries q_1, q_2, q_3 to the ontology-extended data sources (D_1, O_1), (D_2, O_2), (D_3, O_3); Answer Composition assembles the local answers into s_O(h_i -> h_{i+1}, D)

64 Learning Classifiers from Attribute Value Taxonomies (AVT) and Partially Specified Data Given a taxonomy over values of each attribute, and data specified in terms of values at different levels of abstraction, learn a concise and accurate hypothesis Example AVTs: Student Status: Undergraduate {Freshman, Sophomore, Junior, Senior}, Graduate {Master, Ph.D}; Work Status: On-Campus {TA, RA, AA}, Off-Campus {Government {Federal, State, Local}, Private {Org, Com}} Hypotheses h(γ_0), h(γ_1), ..., h(γ_k) correspond to different cuts through the taxonomies [Zhang and Honavar, 2003; 2004; 2005]

65 Learning Classifiers from (AVT) and Partially Specified Data Cuts through AVT induce a partial order over instance representations Classifiers AVT-DTL and AVT-NBL Show how to learn classifiers from partially specified data Estimate sufficient statistics from partially specified data under specific statistical assumptions Use CMDL score to trade off classifier complexity against accuracy [Zhang and Honavar, 2003; 2004; 2005]
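A sketch of how a cut through an AVT abstracts attribute values (the taxonomy and cut below are illustrative, based on the Work Status example): each fully specified value is mapped up to its ancestor on the cut, while a value coarser than the cut is partially specified relative to it.

```python
# Attribute value taxonomy as child -> parent links (root: "Work Status").
avt = {
    "On-Campus": "Work Status", "Off-Campus": "Work Status",
    "TA": "On-Campus", "RA": "On-Campus",
    "Government": "Off-Campus", "Private": "Off-Campus",
    "Federal": "Government", "State": "Government",
}

def abstract_to_cut(value, cut, avt):
    """Walk up the taxonomy until a node on the chosen cut is reached."""
    while value not in cut:
        value = avt[value]   # KeyError above the root: value is coarser than the cut,
                             # i.e. partially specified relative to it
    return value

cut = {"On-Campus", "Government", "Private"}
print(abstract_to_cut("Federal", cut, avt))   # 'Government'
print(abstract_to_cut("TA", cut, avt))        # 'On-Campus'
```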

66 AVT-NBL for Learning Classifiers from Partially Specified Data
% error rates on data with different percentages of partially or totally missing values, based on 10-fold cross validation with 90% confidence intervals:
Data set  % missing  NBL            Prop-NBL       AVT-NBL
Mushroom  10%        4.65 (±1.33)   4.69 (±1.34)   0.30 (±0.30)
Mushroom  30%        5.28 (±1.41)   4.84 (±1.36)   0.64 (±0.50)
Mushroom  50%        6.63 (±1.57)   5.82 (±1.48)   1.24 (±0.70)
Nursery   10%        15.27 (±1.81)  15.50 (±1.82)  12.85 (±1.67)
Nursery   30%        26.84 (±2.23)  26.25 (±2.21)  21.19 (±2.05)
Nursery   50%        36.96 (±2.43)  35.88 (±2.41)  29.34 (±2.29)
Soybean   10%        8.76 (±1.76)   9.08 (±1.79)   6.75 (±1.57)
Soybean   30%        12.45 (±2.07)  11.54 (±2.00)  10.32 (±1.90)
Soybean   50%        19.39 (±2.47)  16.91 (±2.34)  16.93 (±2.34)
[Zhang and Honavar, 2004]

67 Learning decision tree classifiers from semantically heterogeneous data Schema and ontology level mappings and conversion functions relate the user ontology O U to the data source ontologies O 1 and O 2; the query engine decomposes a query such as Counts(Wind, Class | Outlook), Counts(Class | Outlook) posed against the user's view (Outlook: Sunny, Overcast, Rain), translates it for D 1 and D 2, and adds up the translated counts
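Putting the pieces together for a count query at a tree node: each data source answers in its own vocabulary, the answers are translated through the value-level mappings into the user ontology, and then added up; a minimal sketch with illustrative mappings and counts.

```python
from collections import Counter

# Value-level mappings from each source ontology to the user ontology.
value_mappings = {
    "O1": {"Rainy": "Rain", "Sunny": "NoPrec", "Cloudy": "NoPrec"},
    "O2": {"Rain": "Rain", "LightRain": "Rain", "NoPrec": "NoPrec"},
}

def translated_counts(local_counts, source_ontology):
    """Translate local count(value, class) answers into user-ontology terms."""
    mapping = value_mappings[source_ontology]
    out = Counter()
    for (value, cls), k in local_counts.items():
        out[(mapping[value], cls)] += k
    return out

c1 = Counter({("Rainy", "No"): 3, ("Sunny", "Yes"): 5})
c2 = Counter({("LightRain", "Yes"): 2, ("NoPrec", "Yes"): 4})
print(translated_counts(c1, "O1") + translated_counts(c2, "O2"))
```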

68 Related work Information integration [Levy, 1998; Davidson et al., 2001; Ekman, 2003] Ontology-extended relational algebra [Bonatti et al., 2003] Ontology and mappings editors [Noy et al., 2000; Eckman et al., 2002] Statistical databases [McClean et al., 2002] Learning from ontologies and fully specified data [Han and Fu, 1996; Koller and Sahami, 1997; Pazzani et al., 1997]

69 Our approach to learning classifiers from semantically heterogeneous data Is based on a separation of concerns between querying for sufficient statistics and hypothesis construction Supports learning from semantically heterogeneous data from a user perspective Offers a theoretically well founded solution to the problem of learning classifiers from semantically heterogeneous data

70 Outline of the talk Background and motivation Learning from data revisited Learning classifiers from distributed data Learning classifiers from semantically heterogeneous data Current Status and Summary of Results

71 Ontology-based information integration in INDUS

72 Results -- INDUS tools for collaborative knowledge acquisition Algorithms for learning classifiers from distributed data with provable performance guarantees relative to their centralized or batch counterparts Algorithms for answering statistical queries from semantically heterogeneous data Algorithms for learning classifiers from partial order ontologies and partially specified data Modular ontologies, inter-ontology mappings, and inference to support collaborative ontology development and reuse Implementation of INDUS software Applications in bioinformatics classifiers for protein function annotation, classifiers for binding site identification

73 Capabilities of INDUS INDUS provides support for: Specification and update of schemas and ontologies Specification of mappings between ontologies Registration of new data sources Specification of user views Specification and execution of queries across distributed, semantically heterogeneous data sources Learning classifiers from semantically heterogeneous data

74 INDUS Tools Ontology Editor for specifying or modifying ontologies Schema Editor for specifying or modifying data source schemas Mapping Editor for specifying mappings between ontologies and between schemas Data Editor for registering data sources with INDUS View Editor for defining user views Query Interface for formulating queries and displaying results

75 INDUS Users: Domain Ontologists A domain ontologist can specify or update: ontologies schemas mappings between ontologies mappings between schemas

76 INDUS Users: Data Providers A data provider can: Associate a predefined schema and ontology with a data source Specify data source location, type and access procedures Register a data source Act as a domain ontologist

77 INDUS Users: Domain Experts A domain expert can specify an application view select data sources of interest in an application domain an application specific schema an application specific ontology relevant mappings A domain expert can serve as Domain ontologist Data provider

78 INDUS Users: Analysis Tool Providers An analysis tool provider can: Register a tool (e.g., learning algorithm) Act as a data source provider Act as a domain ontologist Act as a domain expert

79 INDUS Users: Domain Scientists A domain scientist can Select an application view Formulate and execute queries Select and execute learning algorithms A domain scientist can act as Domain ontologist Data provider Domain expert Analysis tool provider

80 Some features of INDUS Clear distinction between structure and semantics of data Data integration from a user perspective - User-specifiable ontologies and mappings (no single global ontology) Semantic integrity of queries ensured by means of semantics-preserving mappings

81 Current Directions Further development of the open source INDUS tools for collaborative discovery Algorithms for learning classifiers from semantically heterogeneous multi-relational data Modular collaborative ontology development Ontology-extended workflows and services Applications in bioinformatics, security informatics, medical informatics, social informatics

82 Current Ph.D. Students: C. Andorf, J. Bao, C. Caragea, J. Pathak, T. Alcon, O. Yakhnenko, A. Silvescu, F. Wu, O. Kohutyuk, M. Brathwaite, F. Vasile, D-K. Kang, Y. El-Manzalawi, P. Zaback Postdoctoral Fellows, Recent Ph.D. grads, and Collaborating Ph.D. Students: D. Caragea, B. Olson, C. Yan, J. Zhang, K. Vander Velden, T. Dunn, O. Couture, M. Terribilini

83 Thank you! Vasant Honavar Bioinformatics and Computational Biology Program Center for Computational Intelligence, Learning, & Discovery Iowa State University


More information

Data Engineering. Data preprocessing and transformation

Data Engineering. Data preprocessing and transformation Data Engineering Data preprocessing and transformation Just apply a learner? NO! Algorithms are biased No free lunch theorem: considering all possible data distributions, no algorithm is better than another

More information

Distributed KIDS Labs 1

Distributed KIDS Labs 1 Distributed Databases @ KIDS Labs 1 Distributed Database System A distributed database system consists of loosely coupled sites that share no physical component Appears to user as a single system Database

More information

DESIGN AND IMPLEMENTATION OF BUILDING DECISION TREE USING C4.5 ALGORITHM

DESIGN AND IMPLEMENTATION OF BUILDING DECISION TREE USING C4.5 ALGORITHM 1 Proceedings of SEAMS-GMU Conference 2007 DESIGN AND IMPLEMENTATION OF BUILDING DECISION TREE USING C4.5 ALGORITHM KUSRINI Abstract. Decision tree is one of data mining techniques that is applied in classification

More information

Tillämpad Artificiell Intelligens Applied Artificial Intelligence Tentamen , , MA:8. 1 Search (JM): 11 points

Tillämpad Artificiell Intelligens Applied Artificial Intelligence Tentamen , , MA:8. 1 Search (JM): 11 points Lunds Tekniska Högskola EDA132 Institutionen för datavetenskap VT 2017 Tillämpad Artificiell Intelligens Applied Artificial Intelligence Tentamen 2016 03 15, 14.00 19.00, MA:8 You can give your answers

More information

Enterprise Miner Software: Changes and Enhancements, Release 4.1

Enterprise Miner Software: Changes and Enhancements, Release 4.1 Enterprise Miner Software: Changes and Enhancements, Release 4.1 The correct bibliographic citation for this manual is as follows: SAS Institute Inc., Enterprise Miner TM Software: Changes and Enhancements,

More information

Semantic Web. Dr. Philip Cannata 1

Semantic Web. Dr. Philip Cannata 1 Semantic Web Dr. Philip Cannata 1 Dr. Philip Cannata 2 Dr. Philip Cannata 3 Dr. Philip Cannata 4 See data 14 Scientific American.sql on the class website calendar SELECT strreplace(x, 'sa:', '') "C" FROM

More information

On the use of Abstract Workflows to Capture Scientific Process Provenance

On the use of Abstract Workflows to Capture Scientific Process Provenance On the use of Abstract Workflows to Capture Scientific Process Provenance Paulo Pinheiro da Silva, Leonardo Salayandia, Nicholas Del Rio, Ann Q. Gates The University of Texas at El Paso CENTER OF EXCELLENCE

More information

Evaluation of different biological data and computational classification methods for use in protein interaction prediction.

Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Yanjun Qi, Ziv Bar-Joseph, Judith Klein-Seetharaman Protein 2006 Motivation Correctly

More information

CS423: Data Mining. Introduction. Jakramate Bootkrajang. Department of Computer Science Chiang Mai University

CS423: Data Mining. Introduction. Jakramate Bootkrajang. Department of Computer Science Chiang Mai University CS423: Data Mining Introduction Jakramate Bootkrajang Department of Computer Science Chiang Mai University Jakramate Bootkrajang CS423: Data Mining 1 / 29 Quote of the day Never memorize something that

More information

Opus: University of Bath Online Publication Store

Opus: University of Bath Online Publication Store Patel, M. (2004) Semantic Interoperability in Digital Library Systems. In: WP5 Forum Workshop: Semantic Interoperability in Digital Library Systems, DELOS Network of Excellence in Digital Libraries, 2004-09-16-2004-09-16,

More information

A Multi-Relational Decision Tree Learning Algorithm - Implementation and Experiments

A Multi-Relational Decision Tree Learning Algorithm - Implementation and Experiments A Multi-Relational Decision Tree Learning Algorithm - Implementation and Experiments Anna Atramentov, Hector Leiva and Vasant Honavar Artificial Intelligence Research Laboratory, Computer Science Department

More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

Knowledge Discovery in Data Bases

Knowledge Discovery in Data Bases Knowledge Discovery in Data Bases Chien-Chung Chan Department of CS University of Akron Akron, OH 44325-4003 2/24/99 1 Why KDD? We are drowning in information, but starving for knowledge John Naisbett

More information