Algorithms and Software for Collaborative Discovery from Autonomous, Semantically Heterogeneous, Distributed, Information Sources
1 Algorithms and Software for Collaborative Discovery from Autonomous, Semantically Heterogeneous, Distributed, Information Sources Vasant Honavar Bioinformatics and Computational Biology Graduate Program Center for Computational Intelligence, Learning, & Discovery Iowa State University
2 Coauthors Doina Caragea Jie Bao Jyotishman Pathak Jun Zhang
3 Outline Background and motivation Learning from data revisited Learning classifiers from distributed data Learning classifiers from semantically heterogeneous data Current Status and Summary of Results
4 Background Data revolution: Bioinformatics (over 200 data repositories of interest to molecular biologists alone [Discala, 2000]), Environmental Informatics, Enterprise Informatics, Medical Informatics, Social Informatics... Connectivity revolution (Internet and the web) Integration revolution: need to understand the elephant as opposed to examining the trunk, the tail, etc. Need an infrastructure to support collaborative, integrative analysis of data
5 Infrastructure for scientific computing As part of efforts to build the scientific computing infrastructure in the US, Europe, and Japan: many hundreds of millions of dollars have been spent, many high end computers built, many large databases constructed, high speed networks installed, many hundreds of computers bought, many thousands of software applications developed. But... have we succeeded in changing the nature of the practice of science?
6 Motivating Application Bioinformatics Discovery of functionally important sequence and structural features of proteins Prediction of protein-protein, protein-dna and protein-rna interfaces Discovery of genetic regulatory networks
7 Representative application: Discovery of Functionally Important Sequence and Structural Features of Proteins Figure 3a: The 3-dimensional structure of human Caspase-1 (MEROPS family C14), corresponding to PDB entry 1BMQ. The four labeled residues Arg 179, His 237, Cys 285, and Arg 341 are known to form the substrate binding pocket of the Caspase-1 enzyme [Wilson et al., 1994, Nature 370: ]. Three of these residues (Arg 179, His 237, and Cys 285) are located within the MEME-generated motifs frequently used by the decision tree classifier for the MEROPS family C14. These motifs correspond to residues (red), (yellow), (green). Figure 3b: The 3-dimensional structure of Astacin (MEROPS family M12) from A. astacus, corresponding to PDB entry 1QJJ. Five MEME-generated motifs selected by the decision tree algorithm for the MEROPS family M12 correspond to residues (red), (yellow), and (green). The five labeled residues His 92, His 96, Glu 93, His 102, and Tyr 149 that appear within the motifs have been shown to form the zinc binding pocket of the enzyme [Bond and Beynon, 1995, Protein Science 4: ]. [Wang et al., 2003]
8 Motivating Application Bioinformatics applications of machine learning require integrated analysis of data from multiple sources Solution 1 Assemble a data set using special purpose scripts to extract data sets from different sources and then apply standard algorithms to the assembled data set time consuming, not scalable, and does not handle partially specified data Solution 2 Understand, state, and design algorithms to solve the problem of learning from semantically heterogeneous, distributed data sources
9 Representative application scenario Learning sequence and structural correlates of protein function
10 Acquiring knowledge from data Most current machine learning algorithms assume centralized access to a semantically homogeneous data set [Diagram: Data → Learner L → hypothesis h (Knowledge), given the learner's assumptions]
11 Challenges Gleaning useful knowledge from data requires tools for analysis of data from autonomous sources Large, distributed, data sources Semantic (ontological) gap Partially specified data Multiple points of view Access constraints...
12 Towards an infrastructure for collaborative discovery Building an effective infrastructure for scientific discovery requires coming to terms with How scientists communicate discipline-specific jargon versus common terms How scientists process information role of background knowledge, assumptions, points of view (ontological commitments) How scientists work capture and analyze data from multiple points of view, at multiple levels of abstraction Distributed, often massive data sources Autonomy of data sources (access restrictions, query capabilities) Semantic gaps between data sources, and between a data source and the user's point of view in a given context
13 Challenge: Distributed Data Sources Large Growing at an exponential rate Centralized access not feasible Can we learn without centralized access to data? How? How do the results compare with centralized setting?
14 Challenge: Semantic heterogeneity Sub-disciplines limited by their instruments of observation Stumbling block to scientific understanding: Blind men and the elephant syndrome
15 Semantic Gap Structural Genomics Functional Genomics Tissue Genome Sequence Disease Clinical Trials Clinical Data Countries separated by a common language! [Shaw, 1942, after Wilde, 1887]
16 Ontological differences? Temperature : Celsius Outlook : {Sunny, Rainy} Temp : Fahrenheit Precipitation : {NoPrec, Rain} Different terms, same meaning: Outlook vs. Precipitation Same term, different meaning: Wind (speed) vs. Wind (direction) Different domains of values for semantically equivalent attributes Different units: 32 deg. C vs. 75 deg. F
17 Challenge: Data source autonomy Access restrictions Privacy constraints Data source capabilities Queries Execution of user-supplied procedures Storage of partial results or indices Computing, memory and bandwidth limitations
18 Enabling technologies World wide web Knowledge representation; Description Logics, Ontology languages (OWL) Languages for making data sources and services self-describing (XML, RDF, WSDL) Service oriented computing (Web services)
19 Steps Towards the Semantic Web 1990 Early Web: Documents (HyperText Markup Language, HyperText Transfer Protocol) 2000 Self-Describing Documents: Data and Programs (eXtensible Markup Language, Resource Description Framework) 2010 Web of Knowledge: Ontology, Knowledge, Inference, Services; Machine-Machine, Human-Machine and Machine-Human communication [Berners-Lee and Hendler; Nature, 2001]
20 Towards an infrastructure for collaborative discovery Building an effective infrastructure for scientific discovery requires coming to terms with How scientists communicate discipline-specific jargon versus common terms How scientists process information role of background knowledge, assumptions, points of view (ontological commitments) How scientists work capture and analyze data from multiple points of view, at multiple levels of abstraction
21 Solution: INDUS for Learning from Semantically Heterogeneous Distributed Autonomous Data Sources
22 Outline Background and motivation Learning from data revisited Learning classifiers from distributed data Learning classifiers from semantically heterogeneous data Current Status and Summary of Results
23 Learning Classifiers from Data [Diagram: Learning: Labeled Examples → Learner → Classifier; Classification: Unlabeled Instance → Classifier → Class] Standard learning algorithms assume centralized access to data
24 Example: Learning decision tree classifiers

Day | Outlook  | Temp | Humidity | Wind   | Play Tennis
----|----------|------|----------|--------|------------
1   | Sunny    | Hot  | High     | Weak   | No
2   | Sunny    | Hot  | High     | Strong | No
3   | Overcast | Hot  | High     | Weak   | Yes
4   | Overcast | Cold | Normal   | Weak   | No

Resulting tree over {1, 2, 3, 4}: split on Outlook; Sunny → No ({1, 2}); Overcast → split on Temp.: Hot → Yes ({3}), Cold → No ({4})
25 Example: Learning decision tree classifiers Decision tree is constructed by recursively (and greedily) choosing the attribute that provides the greatest estimated information about the class label What do we need to choose a split at each step? Information gain Estimated probability distribution resulting from each candidate split Proportion of instances of each class along each branch of each candidate split If we have the relevant counts, we have no need for the data!
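The split-selection step above uses only counts. As a minimal sketch (the four-day play-tennis data come from the previous slide; the function names are my own), information gain can be computed directly from class-count vectors, with no access to the underlying examples:

```python
from math import log2

def entropy(class_counts):
    """Entropy of a class distribution given as a list of counts."""
    total = sum(class_counts)
    return -sum(c / total * log2(c / total) for c in class_counts if c > 0)

def information_gain(parent_counts, branch_counts):
    """Information gain of a candidate split.

    parent_counts: class counts at the current node, e.g. [3, 1]
    branch_counts: per-branch class counts, e.g. [[2, 0], [1, 1]]
    """
    total = sum(parent_counts)
    remainder = sum(sum(b) / total * entropy(b) for b in branch_counts)
    return entropy(parent_counts) - remainder

# Play = [No, No, Yes, No]; splitting on Outlook sends days {1, 2}
# (both No) down the Sunny branch and {3, 4} (Yes, No) down Overcast
print(information_gain([3, 1], [[2, 0], [1, 1]]))  # about 0.311 bits
```

Note that the counts, not the rows, are the inputs: this is exactly the "no need for the data" observation on the slide.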
26 Learning from data reexamined [Diagram: the learner alternates between Statistical Query Generation, which poses a query s(h_i -> h_{i+1}, D) to the data D, and Hypothesis Construction, which computes h_{i+1} = C(h_i, s(h_i -> h_{i+1}, D)) from the answer] Learning = Sufficient statistics extraction + Hypothesis construction
27 Learning from data reexamined Sufficient statistics: f_θ(D) is a sufficient statistic for θ if it contains all the information that is needed for estimating the parameter θ from the data D. The sample mean is a sufficient statistic for the mean of a distribution. We have no use for the data once we have a sufficient statistic for the parameter of interest. Note: the classical definition of a sufficient statistic is not constructive
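The sample-mean example can be made concrete: the pair (sum, count) is a sufficient statistic for the mean, and statistics from disjoint data sets combine without revisiting the raw data (a sketch; the names are illustrative):

```python
def mean_statistic(data):
    """(sum, count) is a sufficient statistic for the mean: once it is
    computed, the raw data can be discarded."""
    return (sum(data), len(data))

def combine(s1, s2):
    """Statistics of disjoint data sets combine by componentwise addition."""
    return (s1[0] + s2[0], s1[1] + s2[1])

def estimate_mean(stat):
    total, n = stat
    return total / n

d1, d2 = [1.0, 2.0, 3.0], [4.0, 5.0]
s = combine(mean_statistic(d1), mean_statistic(d2))
print(estimate_mean(s))  # 3.0, the mean of the pooled data
```

The combine step is what later makes the distributed setting work: each site ships its statistic, not its data.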
28 Sufficient statistics for learning classifiers By drawing an analogy between a hypothesis h and a parameter estimate: an estimator A maps data D to a parameter θ ∈ Θ, just as a learner L maps data D to a hypothesis h ∈ H
29 Sufficient statistics for learning classifiers A statistic s_L(D, h) is called a sufficient statistic for learning a hypothesis h produced by the learning algorithm L when L is applied to a data set D if there exists an algorithm that takes s_L(D, h) as input and outputs h [Caragea, Silvescu, and Honavar, 2004]. We typically want minimal sufficient statistics and efficient algorithms for computing such statistics. Trivially, D is an s_L(D, h), and so is h itself. Typically it helps to break down the computation of s_L(D, h) into smaller steps: queries to the data D and computation on the results of the queries
30 Sufficient statistic for learning a hypothesis A statistic s_L(D, h_i -> h_{i+1}) is called a sufficient statistic for the refinement of h_i into h_{i+1} if there exists an algorithm R that takes h_i and s_L(D, h_i -> h_{i+1}) as inputs and outputs h_{i+1}. A statistic s_L(D, h) is a sufficient statistic for learning a hypothesis h using the algorithm L applied to the data D if h can be obtained from h_0 = ∅ through a sequence of refinement and composition operations. [Caragea, Silvescu, and Honavar, 2004]
31 Example: Learning decision tree classifiers (the play-tennis data of slide 24, with the tree constructed from its counts) Tree over {1, 2, 3, 4}: split on Outlook; Sunny → No ({1, 2}); Overcast → split on Temp.: Hot → Yes ({3}), Cold → No ({4})
32 Sufficient statistics for refining a decision tree Entropy: H(D) = - Σ_i (|D_i| / |D|) log₂ (|D_i| / |D|), where D_i is the subset of examples in D with class label i. Sufficient statistics for refining a partially constructed decision tree: count(attribute value, class | path) and count(class | path)
33 Decision Tree Learning = Statistical Query Answering + Hypothesis refinement [Diagram: the root split on Outlook (Sunny, Overcast, Rain) is chosen from Counts(Attribute, Class) and Counts(Class); the subtrees Wind (Strong → No, Weak → Yes) and Humidity (High → No, Normal → Yes) are chosen from Counts(Wind, Class | Outlook), Counts(Humidity, Class | Outlook), and Counts(Class | Outlook), each query answered against the data]
34 Decision Tree Learning = Statistical Query Answering + Hypothesis refinement Joint count sufficient statistics count(A_i1 = a_i1, ..., A_im = a_im) over subsets of attributes provide all the information needed for learning Naïve Bayes, Bayesian network (when the structure is known), decision tree, and many other classifiers. We can define refinement sufficient statistics for algorithms such as SVM, logistic regression, etc. [Caragea, Silvescu, and Honavar, 2004; Caragea, Caragea, and Honavar, 2005]
35 Learning from Data Reexamined Identification of minimal or near minimal sufficient statistics for different classes of learning algorithms Design of effective procedures for computing minimal or near minimal sufficient statistics or their efficient approximations Separation of concerns between hypothesis construction (through successive refinement and composition operations) and statistical query answering
36 Outline Background and motivation Learning from data revisited Learning classifiers from distributed data Learning classifiers from semantically heterogeneous data Current Status and Summary of Results
37 Learning Classifiers from Distributed Data Learning from distributed data requires learning from data set fragments without gathering all of the data in a central location. Assuming that the data set is represented in tabular form, data fragmentation can be horizontal, vertical, or more general (e.g., multi-relational)
38 Horizontal Data Fragmentation Example: autonomously maintained data for different organisms in comparative genomics. Data set fragments D_1, ..., D_4 are distributed across multiple data repositories. The complete data set is the union of the data set fragments
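Under horizontal fragmentation, the count statistics of the previous slides simply add across fragments. A minimal sketch (the fragments and attribute names are hypothetical):

```python
from collections import Counter

def local_counts(fragment):
    """Each repository computes count(attribute value, class) locally."""
    return Counter((x["Outlook"], x["Play"]) for x in fragment)

# Hypothetical horizontal fragments at two sites; the complete data
# set is their union
d1 = [{"Outlook": "Sunny", "Play": "No"},
      {"Outlook": "Overcast", "Play": "Yes"}]
d2 = [{"Outlook": "Sunny", "Play": "No"}]

# Counter addition yields exactly the counts over the union
global_counts = local_counts(d1) + local_counts(d2)
print(global_counts[("Sunny", "No")])  # 2
```

Since only the count dictionaries cross the network, no raw data leaves a repository.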
39 Vertical Data Fragmentation Example: data gathered by multiple laboratories about outcomes of different sets of clinical tests on a patient. Each data set fragment D_1, ..., D_4 contains sub-tuples of data tuples. Sub-tuples of a tuple can be associated with each other using a unique key (e.g., the patient's social security number). The complete data set is the join of the data set fragments
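The join described above can be sketched as follows, with hypothetical clinical fragments keyed by a shared patient identifier:

```python
def join_fragments(frag1, frag2):
    """Reassemble full tuples from two vertical fragments; each fragment
    maps a shared unique key to its sub-tuple of attributes."""
    return {k: {**frag1[k], **frag2[k]} for k in frag1.keys() & frag2.keys()}

# Hypothetical fragments held by two laboratories
labs = {"p1": {"glucose": 110}, "p2": {"glucose": 95}}
vitals = {"p1": {"bp": "120/80"}, "p2": {"bp": "130/85"}}

patients = join_fragments(labs, vitals)
print(patients["p1"])  # {'glucose': 110, 'bp': '120/80'}
```

In practice the join is computed (or avoided, by shipping per-fragment statistics) at the query answering layer rather than by materializing the full table.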
40 Multi-Relational Data Fragmentation Data are stored in a set of relational database tables that can be conceptually tied together by a (global) schema
41 Learning from distributed data [Diagram: the learner poses a statistical query s(D, h_i -> h_{i+1}); Query Decomposition splits it into sub-queries q_1, q_2, q_3 against data sources D_1, D_2, D_3; Answer Composition assembles the partial answers into the answer to s(D, h_i -> h_{i+1})]
42 Learning from distributed data Learning classifiers from distributed data reduces to statistical query answering from distributed data A sound and complete procedure for answering the desired class of statistical queries from distributed data under Different types of data fragmentation Different constraints on access and query capabilities Different bandwidth and resource constraints [Caragea, Silvescu, and Honavar, 2004, also work in progress]
43 How can we evaluate algorithms for learning from distributed data? Compare with their batch counterparts Exactness guarantee that the learned hypothesis is the same as or equivalent to that obtained by the batch counterpart Approximation guarantee that the learned hypothesis is an approximation (in a quantifiable sense) of the hypothesis obtained in the batch setting Communication, memory, and processing requirements [Caragea, Silvescu, and Honavar, 2003; 2004]
44 Exact Learning of decision tree classifiers from distributed data under horizontal fragmentation [Diagram: the learner requests Counts(Wind, Class | Outlook) and Counts(Class | Outlook); the query answering engine decomposes the query across data sources D_1, D_2, D_3, collects the local counts, and composes the answer by adding up the counts]
45 Time and communication complexity: centralized versus distributed case C is the number of classes (e.g., C = 10), V is the maximum number of values of an attribute (e.g., V = 10), D is the size of the data, i.e., the number of examples (e.g., D = 1,000,000), T is the size of the tree, i.e., the number of nodes (e.g., T = 100), and K is the number of data sources (e.g., K = 10). Theorem (Time): the algorithm for learning from horizontally fragmented distributed data is K times faster than the algorithm for learning from centralized data, if parallel access to the data sources is allowed. Theorem (Communication): if C V T K < D, then the algorithm for learning from horizontally fragmented distributed data is preferred to the algorithm for learning from centralized data, under the assumption that each data source allows both shipping raw data and computation of sufficient statistics. Example: 10 × 10 × 100 × 10 = 100,000 < 1,000,000. [Caragea et al., 2003]
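The communication condition in the theorem is a one-line arithmetic check, shown here with the slide's example numbers (the function name is my own):

```python
def prefer_statistics(c, v, t, k, d):
    """Communication test from the theorem: shipping sufficient
    statistics beats shipping the raw data when C * V * T * K < D."""
    return c * v * t * k < d

# The slide's example: 10 * 10 * 100 * 10 = 100,000 < 1,000,000
print(prefer_statistics(c=10, v=10, t=100, k=10, d=1_000_000))  # True
```

Intuitively, C·V·T·K bounds the total count traffic (one count table per tree node per source), while D bounds the cost of shipping the examples themselves.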
46 Some Results on Learning from Distributed Data Provably exact algorithms for learning decision tree, SVM, Naïve Bayes, neural network, and Bayesian network classifiers from distributed data Positive and negative results concerning the efficiency (bandwidth, memory, computation) of learning from distributed data without retrieving raw data, relative to its centralized counterpart [Caragea, Silvescu, and Honavar, 2004] A theoretical framework based on sufficient statistics for the analysis and design of efficient, exact algorithms for learning classifiers from distributed data
47 Related work learning from distributed data Parallel distributed learning: [Provost and Kolluri, 1999; Grossman and Guo, 2001] Ensemble approach: [Domingos, 1997; Prodromidis et al., 2000] Cooperation-based approach: [Provost and Henessy, 1996; Leckie and Kotagiri, 2002] Learning from vertically fragmented data: [Kargupta et al., 1999, 2001; Park and Kargupta, 2002] Relational learning: [Knobbe et al., 1999; Getoor et al., 2001; Atramentov et al., 2003] Privacy preserving data mining: [Lindell and Pinkas, 2002; Clifton et al., 2002] Attribute noise tolerant PAC Learning [Kearns, 1999]
48 Our approach Works for any learning algorithm Works for different types of data fragmentation Works for some scenarios where privacy preservation is required Yields algorithms that are provably exact with respect to their corresponding batch counterparts Lends itself to adaptation to learning from semantically heterogeneous data
49 Outline Background and motivation Learning from data revisited Learning classifiers from distributed data Learning classifiers from semantically heterogeneous data Current Status and Summary of Results
50 Learning from Semantically Heterogeneous Data [Diagram: the learner poses statistical queries S_O(D, h_i -> h_{i+1}) in terms of the user ontology O; Query Decomposition uses the mappings M(O, O_1 ... O_N) from O to the data source ontologies O_1 ... O_N to produce sub-queries q_1, q_2, q_3 against (D_1, O_1), (D_2, O_2), (D_3, O_3); Answer Composition assembles the results]
51 Learning from semantically heterogeneous data Requires solving the data integration problem: Given a set of autonomous, heterogeneous information sources, each with its own associated schema and ontology answer statistical queries from a user s perspective (user schema and ontology)
52 Semantically heterogeneous data D_1: Day, Temperature (C), Wind Speed (km/h), Outlook ∈ {Cloudy, Sunny, Rainy} D_2: Day, Temp (F), Wind (mph), Precipitation ∈ {Rain, Light Rain, No Prec}
53 Making Data Sources Self Describing Exposing the schema structure of data Specification of the attributes of the data and their types D 1 Day: day Temperature: deg C Wind Speed: kmh Outlook: outlook D 2 Day: day Temp: deg F Wind: mph Precipitation: prec Exposing the ontology conceptualization of semantics of data e.g., domains of attributes and relationships between values
54 Ontologies Partial order ontology (DAG structured): is-a hierarchies, part-of hierarchies, attribute value taxonomies (AVT)
55 Ontology Extended Data Sources Expose the data source schema structure of data specification of the attributes of the data and their types Expose the data source ontology conceptualization of semantics of data domains of attributes and relationships between values attribute value hierarchies Ontology extended data source = Data Source Schema + Data Source Ontology + Data
56 Mappings Querying data sources from a user s perspective is facilitated by specifying mappings at: Schema Level: from attributes from different data source schemas to attributes in the user schema Ontology Level: between values of the attributes from different data source ontologies to values of the corresponding attributes in the user ontology [Caragea, Pathak, and Honavar; 2004]
57 Mappings between schema D_1: Day: day, Temperature: deg C, Wind Speed: kmh, Outlook: outlook D_2: Day: day, Temp: deg F, Wind: mph, Precipitation: prec D_U: Day: day, Temp: deg F, Wind: kmh, Outlook: outlook Schema mappings: Day:D_1 → Day:D_U; Day:D_2 → Day:D_U; Temperature:D_1 → Temp:D_U; Temp:D_2 → Temp:D_U
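One way to realize such schema-level mappings is a per-source lookup table from source attribute names to user-schema names. A sketch, assuming the D_1/D_2/D_U schemas of this slide (the encoding is my own):

```python
# Hypothetical encoding of the schema-level mappings: for each source,
# a table from source attribute names to user-schema attribute names
SCHEMA_MAP = {
    "D1": {"Day": "Day", "Temperature": "Temp",
           "Wind Speed": "Wind", "Outlook": "Outlook"},
    "D2": {"Day": "Day", "Temp": "Temp",
           "Wind": "Wind", "Precipitation": "Outlook"},
}

def to_user_schema(source, record):
    """Rename a source record's attributes into the user schema D_U.
    (Value-level mappings and unit conversions are handled separately.)"""
    return {SCHEMA_MAP[source][attr]: value for attr, value in record.items()}

print(to_user_schema("D2", {"Day": 1, "Temp": 75,
                            "Wind": 10, "Precipitation": "Rain"}))
```

Renaming attributes is only half the story; the ontology-level mappings and conversion functions on the following slides handle the values.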
58 Mappings between Ontologies [Figure: is-a hierarchies H_1 and H_2 of the data sources and H_U of the user; the white nodes represent the values used to describe data]
59 Data sources from a user's perspective [Figure: is-a hierarchies H_1 and H_U] Value mappings: Rainy:H_1 = Rain:H_U; Snow:H_1 = Snow:H_U; NoPrec:H_1 < Outlook:H_U; {Sunny, Cloudy}:H_1 = NoPrec:H_U. Conversion functions are used to map units (e.g., degrees F to degrees C) [Caragea, Pathak, and Honavar, 2004]
60 Conversion functions A total function τ_12τ_2 : dom(τ_1) → dom(τ_2) that maps values of τ_1 to values of τ_2 is called a conversion function from τ_1 to τ_2. For any two types τ_1, τ_2 ∈ Γ there exists at most one conversion function τ_12τ_2. For every type τ ∈ Γ, τ2τ exists (identity). If τ_i2τ_j and τ_j2τ_k exist, then τ_i2τ_k exists and is obtained by composing τ_i2τ_j with τ_j2τ_k
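The composition property can be illustrated directly: given conversion functions F→C and C→K (the Kelvin type is my own addition, for illustration), the F→K conversion is their composition:

```python
def f2c(f):
    """Conversion function: deg F -> deg C."""
    return (f - 32.0) * 5.0 / 9.0

def c2k(c):
    """Conversion function: deg C -> Kelvin (illustrative extra type)."""
    return c + 273.15

def compose(ti_to_tj, tj_to_tk):
    """If tau_i -> tau_j and tau_j -> tau_k exist, their composition
    is the (unique) conversion tau_i -> tau_k."""
    return lambda x: tj_to_tk(ti_to_tj(x))

f2k = compose(f2c, c2k)   # deg F -> Kelvin, by composition
print(f2k(32.0))          # 273.15: the freezing point of water
```

Uniqueness of the conversion between any two types is what makes such compositions well defined regardless of the intermediate type chosen.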
61 Integration ontology An ontology (O_U, ⪯) is called an integration ontology of a set of data source ontologies O_1, ..., O_K if there exist K partial injective mappings Φ_1, ..., Φ_K from O_1, ..., O_K, respectively, to O_U that satisfy: Order preservation: x ⪯_i y implies Φ_i(x) ⪯ Φ_i(y), for all x, y ∈ O_i Semantics preservation: if (x:O_i op y:O_U) ∈ IC, then (Φ_i(x) op y), for all x ∈ O_i and y ∈ O_U
62 Semantic heterogeneity leads to Partially Specified Data Different data sources may describe data at different levels of abstraction. Different users may want to view data at a certain level of abstraction. [Figure: is-a hierarchies H_1 and H_U] Snow is under-specified in H_1 relative to the user ontology H_U, making D_1 partially specified from the user perspective [Zhang and Honavar, 2003; 2004]
63 Learning from Semantically Heterogeneous Data [Diagram, repeated: the learner poses queries S_O(h_i -> h_{i+1}, D) under the user ontology O; Query Decomposition applies the mappings between O_1 ... O_N and O; sub-queries q_1, q_2, q_3 go to (D_1, O_1), (D_2, O_2), (D_3, O_3); Answer Composition returns the answer]
64 Learning Classifiers from Attribute Value Taxonomies (AVT) and Partially Specified Data Given a taxonomy over the values of each attribute, and data specified in terms of values at different levels of abstraction, learn a concise and accurate hypothesis. Example taxonomies: Student Status: Undergraduate (Freshman, Sophomore, Junior, Senior), Graduate (Master, Ph.D.); Work Status: On-Campus (TA, RA, AA), Off-Campus (Government (Federal, State, Local), Private (Org, Com)). Each cut γ through the taxonomies induces a hypothesis: h(γ_0), h(γ_1), ..., h(γ_k). [Zhang and Honavar, 2003; 2004; 2005]
65 Learning Classifiers from AVT and Partially Specified Data Cuts through an AVT induce a partial order over instance representations Classifiers: AVT-DTL and AVT-NBL Show how to learn classifiers from partially specified data Estimate sufficient statistics from partially specified data under specific statistical assumptions Use a CMDL score to trade off classifier complexity against accuracy [Zhang and Honavar, 2003; 2004; 2005]
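Counting partially specified instances against a node of an AVT can be sketched as follows: an observation, possibly recorded at an abstract level, is consistent with a node if the node lies on the observation's path to the root. The taxonomy fragment is the Work Status AVT from the previous slide; the functions are illustrative, not the AVT-NBL algorithm itself:

```python
# The Work Status AVT from the earlier slide, as child -> parent links
PARENT = {
    "TA": "On-Campus", "RA": "On-Campus", "AA": "On-Campus",
    "Government": "Off-Campus", "Private": "Off-Campus",
    "On-Campus": "Work Status", "Off-Campus": "Work Status",
}

def ancestors(value):
    """The value together with its increasingly abstract generalizations."""
    chain = [value]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def count_at(node, observations):
    """Count observations consistent with a taxonomy node: an observation
    matches if the node lies on its path to the root, i.e. the observation
    is the node itself or one of its descendants."""
    return sum(node in ancestors(obs) for obs in observations)

# Partially specified data: one instance is recorded only as "On-Campus"
obs = ["TA", "RA", "On-Campus", "Private"]
print(count_at("On-Campus", obs))  # 3
```

An observation recorded above the node (e.g. only "Work Status" when the node is "On-Campus") is not counted by this rule; handling such under-specified cases is exactly where the statistical assumptions mentioned above come in.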
66 AVT-NBL for Learning Classifiers from Partially Specified Data

Dataset  | % missing | NBL           | Prop-NBL      | AVT-NBL
---------|-----------|---------------|---------------|--------------
Mushroom | 10%       | 4.65 (±1.33)  | 4.69 (±1.34)  | 0.30 (±0.30)
Mushroom | 30%       | 5.28 (±1.41)  | 4.84 (±1.36)  | 0.64 (±0.50)
Mushroom | 50%       | 6.63 (±1.57)  | 5.82 (±1.48)  | 1.24 (±0.70)
Nursery  | 10%       | 15.27 (±1.81) | 15.50 (±1.82) | 12.85 (±1.67)
Nursery  | 30%       | 26.84 (±2.23) | 26.25 (±2.21) | 21.19 (±2.05)
Nursery  | 50%       | 36.96 (±2.43) | 35.88 (±2.41) | 29.34 (±2.29)
Soybean  | 10%       | 8.76 (±1.76)  | 9.08 (±1.79)  | 6.75 (±1.57)
Soybean  | 30%       | 12.45 (±2.07) | 11.54 (±2.00) | 10.32 (±1.90)
Soybean  | 50%       | 19.39 (±2.47) | 16.91 (±2.34) | 16.93 (±2.34)

% error rates on data with different percentages of partially or totally missing values, based on 10-fold cross validation with 90% confidence intervals [Zhang and Honavar, 2004]
67 Learning decision tree classifiers from semantically heterogeneous data [Diagram: the query engine uses schema- and ontology-level mappings and conversion functions to answer Counts(Wind, Class | Outlook) and Counts(Class | Outlook) against data sources (D_1, O_1) and (D_2, O_2): it decomposes the query, converts and adds up the local counts, and composes the answer under the user ontology O_U, from which the tree (Outlook: Sunny → Yes; Rain → Wind: Strong → No, Weak → Yes) is refined]
68 Related work Information integration [Levy, 1998; Davidson et al., 2001; Ekman, 2003] Ontology-extended relational algebra [Bonatti et al., 2003] Ontology and mappings editors [Noy et al., 2000; Eckman et al., 2002] Statistical databases [McClean et al., 2002] Learning from ontologies and fully specified data [Han and Fu, 1996; Koller and Sahami, 1997; Pazzani et al., 1997]
69 Our approach to learning classifiers from semantically heterogeneous data Is based on a separation of concerns between querying for sufficient statistics and hypothesis construction Supports learning from semantically heterogeneous data from a user perspective Offers a theoretically well founded solution to the problem of learning classifiers from semantically heterogeneous data
70 Outline of the talk Background and motivation Learning from data revisited Learning classifiers from distributed data Learning classifiers from semantically heterogeneous data Current Status and Summary of Results
71 Ontology-based information integration in INDUS
72 Results -- INDUS tools for collaborative knowledge acquisition Algorithms for learning classifiers from distributed data with provable performance guarantees relative to their centralized or batch counterparts Algorithms for answering statistical queries from semantically heterogeneous data Algorithms for learning classifiers from partial order ontologies and partially specified data Modular ontologies, inter-ontology mappings, and inference to support collaborative ontology development and reuse Implementation of INDUS software Applications in bioinformatics classifiers for protein function annotation, classifiers for binding site identification
73 Capabilities of INDUS INDUS provides support for: Specification and update of schemas and ontologies Specification of mappings between ontologies Registration of new data sources Specification of user views Specification and execution of queries across distributed, semantically heterogeneous data sources Learning classifiers from semantically heterogeneous data
74 INDUS Tools Ontology Editor for specifying or modifying ontologies Schema Editor for specifying or modifying data source schemas Mapping Editor for specifying mappings between ontologies and between schemas Data Editor for registering data sources with INDUS View Editor for defining user views Query Interface for formulating queries and displaying results
75 INDUS Users: Domain Ontologists A domain ontologist can specify or update: ontologies schemas mappings between ontologies mappings between schemas
76 INDUS Users: Data Providers A data provider can: Associate a predefined schema and ontology with a data source Specify data source location, type and access procedures Register a data source Act as a domain ontologist
77 INDUS Users: Domain Experts A domain expert can specify an application view select data sources of interest in an application domain an application specific schema an application specific ontology relevant mappings A domain expert can serve as Domain ontologist Data provider
78 INDUS Users: Analysis Tool Providers An analysis tool provider can: Register a tool (e.g., learning algorithm) Act as a data source provider Act as a domain ontologist Act as a domain expert
79 INDUS Users: Domain Scientists A domain scientist can Select an application view Formulate and execute queries Select and execute learning algorithms A domain scientist can act as Domain ontologist Data provider Domain expert Analysis tool provider
80 INDUS Some features of INDUS Clear distinction between structure and semantics of data Data integration from a user perspective - User-specifiable ontologies and mappings (no single global ontology) Semantic integrity of queries ensured by means of semantics preserving mappings
81 Current Directions Further development of the open source INDUS tools for collaborative discovery Algorithms for learning classifiers from semantically heterogeneous multi-relational data Modular collaborative ontology development Ontology-extended workflows and services Applications in bioinformatics, security informatics, medical informatics, social informatics
82 Current Ph.D. Students: C. Andorf, J. Bao, C. Caragea, J. Pathak, T. Alcon, O. Yakhnenko, A. Silvescu, F. Wu, O. Kohutyuk, M. Brathwaite, F. Vasile, D-K. Kang, Y. El-Manzalawi, P. Zaback. Postdoctoral Fellows, Recent Ph.D. grads, and Collaborating Ph.D. Students: D. Caragea, B. Olson, C. Yan, J. Zhang, K. Vander Velden, T. Dunn, O. Couture, M. Terribilini
83 Thank you! Vasant Honavar Bioinformatics and Computational Biology Program Center for Computational Intelligence, Learning, & Discovery Iowa State University
More informationNominal Data. May not have a numerical representation Distance measures might not make sense. PR and ANN
NonMetric Data Nominal Data So far we consider patterns to be represented by feature vectors of real or integer values Easy to come up with a distance (similarity) measure by using a variety of mathematical
More informationSupervised Learning for Image Segmentation
Supervised Learning for Image Segmentation Raphael Meier 06.10.2016 Raphael Meier MIA 2016 06.10.2016 1 / 52 References A. Ng, Machine Learning lecture, Stanford University. A. Criminisi, J. Shotton, E.
More informationQuery Translation for Ontology-extended Data Sources
Query Translation for Ontology-extended Data Sources Jie Bao 1, Doina Caragea 2, Vasant Honavar 1 1 Artificial Intelligence Research Laboratory, Department of Computer Science, Iowa State University, Ames,
More informationWhat Is Data Mining? CMPT 354: Database I -- Data Mining 2
Data Mining What Is Data Mining? Mining data mining knowledge Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data CMPT
More informationPowering Knowledge Discovery. Insights from big data with Linguamatics I2E
Powering Knowledge Discovery Insights from big data with Linguamatics I2E Gain actionable insights from unstructured data The world now generates an overwhelming amount of data, most of it written in natural
More informationLecture 5. Functional Analysis with Blast2GO Enriched functions. Kegg Pathway Analysis Functional Similarities B2G-Far. FatiGO Babelomics.
Lecture 5 Functional Analysis with Blast2GO Enriched functions FatiGO Babelomics FatiScan Kegg Pathway Analysis Functional Similarities B2G-Far 1 Fisher's Exact Test One Gene List (A) The other list (B)
More informationCONCEPT FORMATION AND DECISION TREE INDUCTION USING THE GENETIC PROGRAMMING PARADIGM
1 CONCEPT FORMATION AND DECISION TREE INDUCTION USING THE GENETIC PROGRAMMING PARADIGM John R. Koza Computer Science Department Stanford University Stanford, California 94305 USA E-MAIL: Koza@Sunburn.Stanford.Edu
More informationMulti-label classification using rule-based classifier systems
Multi-label classification using rule-based classifier systems Shabnam Nazmi (PhD candidate) Department of electrical and computer engineering North Carolina A&T state university Advisor: Dr. A. Homaifar
More informationEnhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques
24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE
More informationThe NeuroLOG Platform Federating multi-centric neuroscience resources
Software technologies for integration of process and data in medical imaging The Platform Federating multi-centric neuroscience resources Johan MONTAGNAT Franck MICHEL Vilnius, Apr. 13 th 2011 ANR-06-TLOG-024
More informationData mining fundamentals
Data mining fundamentals Elena Baralis Politecnico di Torino Data analysis Most companies own huge bases containing operational textual documents experiment results These bases are a potential source of
More informationFault Identification from Web Log Files by Pattern Discovery
ABSTRACT International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2017 IJSRCSEIT Volume 2 Issue 2 ISSN : 2456-3307 Fault Identification from Web Log Files
More informationOntology-driven information extraction and integration from heterogeneous distributed autonomous data sources: A federated query centric approach.
Ontology-driven information extraction and integration from heterogeneous distributed autonomous data sources: A federated query centric approach. by Jaime A. Reinoso-Castillo A thesis submitted to the
More informationNominal Data. May not have a numerical representation Distance measures might not make sense PR, ANN, & ML
Decision Trees Nominal Data So far we consider patterns to be represented by feature vectors of real or integer values Easy to come up with a distance (similarity) measure by using a variety of mathematical
More informationInternational Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 4, Jul-Aug 2015
RESEARCH ARTICLE OPEN ACCESS Multi-Lingual Ontology Server (MOS) For Discovering Web Services Abdelrahman Abbas Ibrahim [1], Dr. Nael Salman [2] Department of Software Engineering [1] Sudan University
More informationA Semantic Web Approach to Integrative Biosurveillance. Narendra Kunapareddy, UTHSC Zhe Wu, Ph.D., Oracle
A Semantic Web Approach to Integrative Biosurveillance Narendra Kunapareddy, UTHSC Zhe Wu, Ph.D., Oracle This talk: Translational BioInformatics and Information Integration Dilemma Case Study: Public Health
More informationCS570: Introduction to Data Mining
CS570: Introduction to Data Mining Classification Advanced Reading: Chapter 8 & 9 Han, Chapters 4 & 5 Tan Anca Doloc-Mihu, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei. Data Mining.
More informationCOSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor
COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality
More informationPart 12: Advanced Topics in Collaborative Filtering. Francesco Ricci
Part 12: Advanced Topics in Collaborative Filtering Francesco Ricci Content Generating recommendations in CF using frequency of ratings Role of neighborhood size Comparison of CF with association rules
More informationIntroduction to GE Microarray data analysis Practical Course MolBio 2012
Introduction to GE Microarray data analysis Practical Course MolBio 2012 Claudia Pommerenke Nov-2012 Transkriptomanalyselabor TAL Microarray and Deep Sequencing Core Facility Göttingen University Medical
More informationarxiv: v1 [cs.ai] 12 Jul 2015
A Probabilistic Approach to Knowledge Translation Shangpu Jiang and Daniel Lowd and Dejing Dou Computer and Information Science University of Oregon, USA {shangpu,lowd,dou}@cs.uoregon.edu arxiv:1507.03181v1
More informationExecutive Summary for deliverable D6.1: Definition of the PFS services (requirements, initial design)
Electronic Health Records for Clinical Research Executive Summary for deliverable D6.1: Definition of the PFS services (requirements, initial design) Project acronym: EHR4CR Project full title: Electronic
More informationMachine Learning Chapter 2. Input
Machine Learning Chapter 2. Input 2 Input: Concepts, instances, attributes Terminology What s a concept? Classification, association, clustering, numeric prediction What s in an example? Relations, flat
More informationData Mining. Decision Tree. Hamid Beigy. Sharif University of Technology. Fall 1396
Data Mining Decision Tree Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 1 / 24 Table of contents 1 Introduction 2 Decision tree
More informationBusiness Club. Decision Trees
Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building
More informationSOCIAL MEDIA MINING. Data Mining Essentials
SOCIAL MEDIA MINING Data Mining Essentials Dear instructors/users of these slides: Please feel free to include these slides in your own material, or modify them as you see fit. If you decide to incorporate
More informationPrediction. What is Prediction. Simple methods for Prediction. Classification by decision tree induction. Classification and regression evaluation
Prediction Prediction What is Prediction Simple methods for Prediction Classification by decision tree induction Classification and regression evaluation 2 Prediction Goal: to predict the value of a given
More informationQuestion Bank. 4) It is the source of information later delivered to data marts.
Question Bank Year: 2016-2017 Subject Dept: CS Semester: First Subject Name: Data Mining. Q1) What is data warehouse? ANS. A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
More informationNaïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others
Naïve Bayes Classification Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others Things We d Like to Do Spam Classification Given an email, predict
More informationk-nearest Neighbor (knn) Sept Youn-Hee Han
k-nearest Neighbor (knn) Sept. 2015 Youn-Hee Han http://link.koreatech.ac.kr ²Eager Learners Eager vs. Lazy Learning when given a set of training data, it will construct a generalization model before receiving
More informationINTRO TO RANDOM FOREST BY ANTHONY ANH QUOC DOAN
INTRO TO RANDOM FOREST BY ANTHONY ANH QUOC DOAN MOTIVATION FOR RANDOM FOREST Random forest is a great statistical learning model. It works well with small to medium data. Unlike Neural Network which requires
More informationPV211: Introduction to Information Retrieval
PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/pv211 IIR 15-1: Support Vector Machines Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University,
More informationUSC Viterbi School of Engineering
Introduction to Computational Thinking and Data Science USC Viterbi School of Engineering http://www.datascience4all.org Term: Fall 2016 Time: Tues- Thur 10am- 11:50am Location: Allan Hancock Foundation
More informationData Mining Technologies for Bioinformatics Sequences
Data Mining Technologies for Bioinformatics Sequences Deepak Garg Computer Science and Engineering Department Thapar Institute of Engineering & Tecnology, Patiala Abstract Main tool used for sequence alignment
More informationEpitopes Toolkit (EpiT) Yasser EL-Manzalawy August 30, 2016
Epitopes Toolkit (EpiT) Yasser EL-Manzalawy http://www.cs.iastate.edu/~yasser August 30, 2016 What is EpiT? Epitopes Toolkit (EpiT) is a platform for developing epitope prediction tools. An EpiT developer
More informationInformation Management Fundamentals by Dave Wells
Information Management Fundamentals by Dave Wells All rights reserved. Reproduction in whole or part prohibited except by written permission. Product and company names mentioned herein may be trademarks
More informationData Mining Algorithms: Basic Methods
Algorithms: The basic methods Inferring rudimentary rules Data Mining Algorithms: Basic Methods Chapter 4 of Data Mining Statistical modeling Constructing decision trees Constructing rules Association
More informationArmy Data Services Layer (ADSL) Data Mediation Providing Data Interoperability and Understanding in a
Army Data Services Layer (ADSL) Data Mediation Providing Data Interoperability and Understanding in a SOA Environment Michelle Dirner Army Net-Centric t Data Strategy t (ANCDS) Center of Excellence (CoE)
More informationa paradigm for the Introduction to Semantic Web Semantic Web Angelica Lo Duca IIT-CNR Linked Open Data:
Introduction to Semantic Web Angelica Lo Duca IIT-CNR angelica.loduca@iit.cnr.it Linked Open Data: a paradigm for the Semantic Web Course Outline Introduction to SW Give a structure to data (RDF Data Model)
More informationData Preprocessing. Slides by: Shree Jaswal
Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data
More informationDecision Trees: Discussion
Decision Trees: Discussion Machine Learning Spring 2018 The slides are mainly from Vivek Srikumar 1 This lecture: Learning Decision Trees 1. Representation: What are decision trees? 2. Algorithm: Learning
More information8/19/13. Computational problems. Introduction to Algorithm
I519, Introduction to Introduction to Algorithm Yuzhen Ye (yye@indiana.edu) School of Informatics and Computing, IUB Computational problems A computational problem specifies an input-output relationship
More informationData Mining and Analytics
Data Mining and Analytics Aik Choon Tan, Ph.D. Associate Professor of Bioinformatics Division of Medical Oncology Department of Medicine aikchoon.tan@ucdenver.edu 9/22/2017 http://tanlab.ucdenver.edu/labhomepage/teaching/bsbt6111/
More informationCSCE 478/878 Lecture 6: Bayesian Learning and Graphical Models. Stephen Scott. Introduction. Outline. Bayes Theorem. Formulas
ian ian ian Might have reasons (domain information) to favor some hypotheses/predictions over others a priori ian methods work with probabilities, and have two main roles: Optimal Naïve Nets (Adapted from
More informationData Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A.
Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Input: Concepts, instances, attributes Terminology What s a concept?
More informationSTATISTICS (STAT) Statistics (STAT) 1
Statistics (STAT) 1 STATISTICS (STAT) STAT 2013 Elementary Statistics (A) Prerequisites: MATH 1483 or MATH 1513, each with a grade of "C" or better; or an acceptable placement score (see placement.okstate.edu).
More informationClustering Analysis Basics
Clustering Analysis Basics Ke Chen Reading: [Ch. 7, EA], [5., KPM] Outline Introduction Data Types and Representations Distance Measures Major Clustering Methodologies Summary Introduction Cluster: A collection/group
More informationNaïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others
Naïve Bayes Classification Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others Things We d Like to Do Spam Classification Given an email, predict
More informationData Mining in Bioinformatics Day 1: Classification
Data Mining in Bioinformatics Day 1: Classification Karsten Borgwardt February 18 to March 1, 2013 Machine Learning & Computational Biology Research Group Max Planck Institute Tübingen and Eberhard Karls
More informationThe Comparative Study of Machine Learning Algorithms in Text Data Classification*
The Comparative Study of Machine Learning Algorithms in Text Data Classification* Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification
More informationClassification with Decision Tree Induction
Classification with Decision Tree Induction This algorithm makes Classification Decision for a test sample with the help of tree like structure (Similar to Binary Tree OR k-ary tree) Nodes in the tree
More informationNaïve Bayes for text classification
Road Map Basic concepts Decision tree induction Evaluation of classifiers Rule induction Classification using association rules Naïve Bayesian classification Naïve Bayes for text classification Support
More informationComputer-based Tracking Protocols: Improving Communication between Databases
Computer-based Tracking Protocols: Improving Communication between Databases Amol Deshpande Database Group Department of Computer Science University of Maryland Overview Food tracking and traceability
More information60-538: Information Retrieval
60-538: Information Retrieval September 7, 2017 1 / 48 Outline 1 what is IR 2 3 2 / 48 Outline 1 what is IR 2 3 3 / 48 IR not long time ago 4 / 48 5 / 48 now IR is mostly about search engines there are
More information2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data.
Code No: M0502/R05 Set No. 1 1. (a) Explain data mining as a step in the process of knowledge discovery. (b) Differentiate operational database systems and data warehousing. [8+8] 2. (a) Briefly discuss
More informationPart I: Data Mining Foundations
Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?
More informationEUDAT B2FIND A Cross-Discipline Metadata Service and Discovery Portal
EUDAT B2FIND A Cross-Discipline Metadata Service and Discovery Portal Heinrich Widmann, DKRZ DI4R 2016, Krakow, 28 September 2016 www.eudat.eu EUDAT receives funding from the European Union's Horizon 2020
More informationSELF-SERVICE SEMANTIC DATA FEDERATION
SELF-SERVICE SEMANTIC DATA FEDERATION WE LL MAKE YOU A DATA SCIENTIST Contact: IPSNP Computing Inc. Chris Baker, CEO Chris.Baker@ipsnp.com (506) 721 8241 BIG VISION: SELF-SERVICE DATA FEDERATION Biomedical
More informationDetecting Network Intrusions
Detecting Network Intrusions Naveen Krishnamurthi, Kevin Miller Stanford University, Computer Science {naveenk1, kmiller4}@stanford.edu Abstract The purpose of this project is to create a predictive model
More informationA Multi-Analyzer Machine Learning Model for Marine Heterogeneous Data Schema Mapping
A Multi-Analyzer Machine Learning Model for Marine Heterogeneous Data Schema Mapping Wang Yan 1, 2 Le Jiajin 3, Zhang Yun 2 1 Glorious Sun School of Business and Management Donghua University 2 College
More informationIntroduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.
Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How
More informationThe Emerging Data Lake IT Strategy
The Emerging Data Lake IT Strategy An Evolving Approach for Dealing with Big Data & Changing Environments bit.ly/datalake SPEAKERS: Thomas Kelly, Practice Director Cognizant Technology Solutions Sean Martin,
More informationSemantic Web Mining and its application in Human Resource Management
International Journal of Computer Science & Management Studies, Vol. 11, Issue 02, August 2011 60 Semantic Web Mining and its application in Human Resource Management Ridhika Malik 1, Kunjana Vasudev 2
More informationDATA MINING TRANSACTION
DATA MINING Data Mining is the process of extracting patterns from data. Data mining is seen as an increasingly important tool by modern business to transform data into an informational advantage. It is
More informationSemantics in the Financial Industry: the Financial Industry Business Ontology
Semantics in the Financial Industry: the Financial Industry Business Ontology Ontolog Forum 17 November 2016 Mike Bennett Hypercube Ltd.; EDM Council Inc. 1 Network of Financial Exposures Financial exposure
More informationDatabase and Knowledge-Base Systems: Data Mining. Martin Ester
Database and Knowledge-Base Systems: Data Mining Martin Ester Simon Fraser University School of Computing Science Graduate Course Spring 2006 CMPT 843, SFU, Martin Ester, 1-06 1 Introduction [Fayyad, Piatetsky-Shapiro
More informationIntegrating large, fast-moving, and heterogeneous data sets in biology.
Integrating large, fast-moving, and heterogeneous data sets in biology. C. Titus Brown Asst Prof, CSE and Microbiology; BEACON NSF STC Michigan State University ctb@msu.edu Introduction Background: Modeling
More informationData Engineering. Data preprocessing and transformation
Data Engineering Data preprocessing and transformation Just apply a learner? NO! Algorithms are biased No free lunch theorem: considering all possible data distributions, no algorithm is better than another
More informationDistributed KIDS Labs 1
Distributed Databases @ KIDS Labs 1 Distributed Database System A distributed database system consists of loosely coupled sites that share no physical component Appears to user as a single system Database
More informationDESIGN AND IMPLEMENTATION OF BUILDING DECISION TREE USING C4.5 ALGORITHM
1 Proceedings of SEAMS-GMU Conference 2007 DESIGN AND IMPLEMENTATION OF BUILDING DECISION TREE USING C4.5 ALGORITHM KUSRINI Abstract. Decision tree is one of data mining techniques that is applied in classification
More informationTillämpad Artificiell Intelligens Applied Artificial Intelligence Tentamen , , MA:8. 1 Search (JM): 11 points
Lunds Tekniska Högskola EDA132 Institutionen för datavetenskap VT 2017 Tillämpad Artificiell Intelligens Applied Artificial Intelligence Tentamen 2016 03 15, 14.00 19.00, MA:8 You can give your answers
More informationEnterprise Miner Software: Changes and Enhancements, Release 4.1
Enterprise Miner Software: Changes and Enhancements, Release 4.1 The correct bibliographic citation for this manual is as follows: SAS Institute Inc., Enterprise Miner TM Software: Changes and Enhancements,
More informationSemantic Web. Dr. Philip Cannata 1
Semantic Web Dr. Philip Cannata 1 Dr. Philip Cannata 2 Dr. Philip Cannata 3 Dr. Philip Cannata 4 See data 14 Scientific American.sql on the class website calendar SELECT strreplace(x, 'sa:', '') "C" FROM
More informationOn the use of Abstract Workflows to Capture Scientific Process Provenance
On the use of Abstract Workflows to Capture Scientific Process Provenance Paulo Pinheiro da Silva, Leonardo Salayandia, Nicholas Del Rio, Ann Q. Gates The University of Texas at El Paso CENTER OF EXCELLENCE
More informationEvaluation of different biological data and computational classification methods for use in protein interaction prediction.
Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Yanjun Qi, Ziv Bar-Joseph, Judith Klein-Seetharaman Protein 2006 Motivation Correctly
More informationCS423: Data Mining. Introduction. Jakramate Bootkrajang. Department of Computer Science Chiang Mai University
CS423: Data Mining Introduction Jakramate Bootkrajang Department of Computer Science Chiang Mai University Jakramate Bootkrajang CS423: Data Mining 1 / 29 Quote of the day Never memorize something that
More informationOpus: University of Bath Online Publication Store
Patel, M. (2004) Semantic Interoperability in Digital Library Systems. In: WP5 Forum Workshop: Semantic Interoperability in Digital Library Systems, DELOS Network of Excellence in Digital Libraries, 2004-09-16-2004-09-16,
More informationA Multi-Relational Decision Tree Learning Algorithm - Implementation and Experiments
A Multi-Relational Decision Tree Learning Algorithm - Implementation and Experiments Anna Atramentov, Hector Leiva and Vasant Honavar Artificial Intelligence Research Laboratory, Computer Science Department
More informationContents. Preface to the Second Edition
Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................
More informationKnowledge Discovery in Data Bases
Knowledge Discovery in Data Bases Chien-Chung Chan Department of CS University of Akron Akron, OH 44325-4003 2/24/99 1 Why KDD? We are drowning in information, but starving for knowledge John Naisbett
More information