Orange Documentation
Release 3.0
Biolab
November 10, 2014

Contents

1 Data model (data)
    1.1 Data Storage (storage)
    1.2 Data Table (table)
    1.3 SQL table (data.sql)
    1.4 Domain description (domain)
    1.5 Variable Descriptors (variable)
    1.6 Values (value)
    1.7 Data Instance (instance)
    1.8 Data Filters (filter)
    1.9 Continuization and normalization (continuizer)
    1.10 Data discretization (discretization)
2 Feature (feature)
    2.1 Scoring (scoring)
    2.2 Feature discretization (discretization)
3 Classification (classification)
    3.1 Majority (majority)
    3.2 Naive Bayes (naive_bayes)
    3.3 Support Vector Machines (svm)
4 Evaluation (evaluation)
    4.1 Sampling procedures for testing models (testing)
5 Miscellaneous (misc)
    5.1 Symbolic constants (enum)
6 Development
    6.1 Widget development
7 Indices and tables
Bibliography


CHAPTER 1
Data model (data)

Orange stores data in Orange.data.Storage classes. The most commonly used storage is Orange.data.Table, which stores all data in two-dimensional numpy arrays. Each row of the data represents a data instance.

Individual data instances are represented as instances of Orange.data.Instance. Different storage classes may derive subclasses of Instance to represent the retrieved rows in the data more efficiently and to allow modifying the data through modifying the data instance. For example, if table is an Orange.data.Table, table[0] returns the row as an Orange.data.RowInstance.

Every storage class and data instance has an associated domain description, domain (an instance of Orange.data.Domain), that stores descriptions of the data columns. Every column is described by an instance of a class derived from Orange.data.Variable. The subclasses correspond to continuous variables (Orange.data.ContinuousVariable), discrete variables (Orange.data.DiscreteVariable) and string variables (Orange.data.StringVariable). These descriptors contain the variable's name, symbolic values, the number of decimals in printouts and similar.

The data is divided into attributes (features, independent variables), class variables (classes, targets, outcomes, dependent variables) and meta attributes. This division applies to domain descriptions, to data storages that contain separate arrays for each of the three parts of the data, and to data instances. Attributes and classes are represented with numeric values and are used in modelling. Meta attributes contain additional data which may be of any type. (Currently, only string values are supported in addition to continuous and discrete.)

In indexing, columns can be referred to by their names, descriptors or an integer index. For example, if inst is a data instance and var is a descriptor of type ContinuousVariable referring to the first column in the data, which is also named "petal length", then inst[var], inst[0] and inst["petal length"] all refer to the first value of the instance. Negative indices are used for meta attributes, starting with -1.

Continuous and discrete values can be represented by any numerical type; by default, Orange uses double precision (64-bit) floats. Discrete values are represented by whole numbers.

1.1 Data Storage (storage)

Orange.data.storage.Storage is an abstract class representing a data object in which rows represent data instances (examples, in machine learning terminology) and columns represent variables (features, attributes, classes, targets, meta attributes).

Data is divided into three parts that represent independent variables (X), dependent variables (Y) and meta data (metas). If practical, the class should expose those parts as properties. In the associated domain (Orange.data.Domain), the three parts correspond to the lists of variable descriptors attributes, class_vars and metas.
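These conventions can be illustrated with a short, hedged sketch (it assumes the iris data set that ships with Orange):

import Orange

data = Orange.data.Table("iris")

# The three parts of the data and the matching descriptor lists.
print(data.X.shape, data.Y.shape, data.metas.shape)
print(data.domain.attributes, data.domain.class_vars)

# A single row is a data instance; a column can be indexed by position,
# by name or by descriptor, all referring to the same value.
inst = data[0]
var = data.domain["petal length"]   # descriptor lookup by name
print(inst[var], inst["petal length"], inst[2])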

Any of those parts may be missing, dense, sparse or sparse boolean. The difference between the latter two is that sparse data can be seen as a list of pairs (variable, value), while in sparse boolean data the variable (item) is simply present or absent, like in market basket analysis. The actual storage of sparse data depends upon the storage type.

There is no uniform constructor signature: every derived class provides one or more specific constructors. There are currently two derived classes, Orange.data.Table and Orange.data.sql.Table, the former storing the data in-memory, in numpy objects, and the latter in SQL (currently, only PostgreSQL is supported).

Derived classes must implement at least the methods for getting rows and the number of instances (__getitem__ and __len__). To make a storage fast enough to be practically useful, it must also reimplement a number of filters, preprocessors and aggregators. For instance, the method _filter_values(self, filter) returns a new storage which only contains the rows that match the criteria given in the filter. Orange.data.Table implements an efficient method based on numpy indexing, and Orange.data.sql.Table, which stores a table as an SQL query, converts the filter into a WHERE clause.

Orange.data.storage.domain (Orange.data.Domain)
    The domain describing the columns of the data.

Data access

Orange.data.storage.__getitem__(self, index)
    Return one or more rows of the data.

    If the index is an int, e.g. data[7], the corresponding row is returned as an instance of Instance. Concrete implementations of Storage use specific derived classes for instances.

    If the index is a slice or a sequence of ints (e.g. data[7:10] or data[[7, 42, 15]]), indexing returns a new storage with the selected rows.

    If there are two indices, where the first is an int (a row number) and the second can be interpreted as a column, e.g. data[3, 5] or data[3, "gender"] or data[3, y] (where y is an instance of Variable), a single value is returned as an instance of Value.

    In all other cases, the first index should be a row index, a slice or a sequence, and the second index, which represents a set of columns, should be an int, a slice, a sequence or a numpy array. The result is a new storage with a new domain.

Orange.data.storage.__len__(self)
    Return the number of data instances (rows).

Inspection

Storage.X_density, Storage.Y_density, Storage.metas_density
    Indicate whether the attributes, classes and meta attributes are dense (Storage.DENSE) or sparse (Storage.SPARSE). If they are sparse and all values are 0 or 1, they are marked as sparse boolean (Storage.SPARSE_BOOL). The Storage class provides a default of DENSE. If the data has no attributes, classes or meta attributes, the corresponding method should return Storage.MISSING.

Filters

Storage should define the following methods to optimize the filtering operations as allowed by the underlying data structure. Orange.data.Table executes them directly through numpy (or bottleneck or related) methods, while Orange.data.sql.Table appends them to the WHERE clause of the query that defines the data.
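Before the individual methods, a minimal, hedged sketch of how filtering looks from user code (it assumes the bundled iris data and the IsDefined and SameValue wrapper classes from Orange.data.filter):

import Orange

data = Orange.data.Table("iris")

# Rows with no undefined values; falls back to a slow generic path if the
# storage does not implement _filter_is_defined.
complete = Orange.data.filter.IsDefined()(data)

# Rows whose class equals a given value; dispatches to _filter_same_value
# when the storage provides it.
setosa = Orange.data.filter.SameValue(data.domain.class_var, "Iris-setosa")(data)

print(len(complete), len(setosa))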

These methods should not be called directly but through the classes defined in Orange.data.filter. Methods in Orange.data.filter also provide slower fallback functions for the functions not defined in the storage.

Orange.data.storage._filter_is_defined(self, columns=None, negate=False)
    Extract rows without undefined values.

    Parameters:
        columns (sequence of ints, variable names or descriptors) - optional list of columns that are checked for unknowns
        negate (bool) - invert the selection
    Returns: a new storage of the same type or Table
    Return type: Orange.data.storage.Storage

Orange.data.storage._filter_has_class(self, negate=False)
    Return rows with a known value of the target attribute. If there are multiple classes, all must be defined.

    Parameters:
        negate (bool) - invert the selection
    Returns: a new storage of the same type or Table
    Return type: Orange.data.storage.Storage

Orange.data.storage._filter_same_value(self, column, value, negate=False)
    Select rows based on a value of the given variable.

    Parameters:
        column (int, str or Orange.data.Variable) - the column that is checked
        value (int, float or str) - the value of the variable
        negate (bool) - invert the selection
    Returns: a new storage of the same type or Table
    Return type: Orange.data.storage.Storage

Orange.data.storage._filter_values(self, filter)
    Apply the given filter to the data.

    Parameters:
        filter (Orange.data.Filter) - a filter for selecting the rows
    Returns: a new storage of the same type or Table
    Return type: Orange.data.storage.Storage

Aggregators

Similarly to filters, storage classes should provide several methods for fast computation of statistics. These methods are not called directly but by modules within Orange.statistics.

Orange.data.storage._compute_basic_stats(self, columns=None, include_metas=False, compute_variance=False)
    Compute basic statistics for the specified variables: the minimal and maximal value, the mean and the variance (or a zero placeholder), and the number of missing and defined values.

    Parameters:
        columns (list of ints, variable names or descriptors of type Orange.data.Variable) - a list of columns for which the statistics are computed; if None, the function computes the statistics for all variables

        include_metas (bool) - a flag which tells whether to include meta attributes (applicable only if columns is None)
        compute_variance (bool) - a flag which tells whether to compute the variance
    Returns: a list with a tuple (min, max, mean, variance, #nans, #non-nans) for each variable
    Return type: list

Orange.data.storage._compute_distributions(self, columns=None)
    Compute the distribution for the specified variables. The result is a list of pairs containing the distribution and the number of rows for which the variable value was missing.

    For discrete variables, the distribution is represented as a vector with the absolute frequency of each value. For continuous variables, the result is a 2-d array of shape (2, number-of-distinct-values); the first row contains the (distinct) values of the variable and the second their absolute frequencies.

    Parameters:
        columns (list of ints, variable names or descriptors of type Orange.data.Variable) - a list of columns for which the distributions are computed; if None, the function runs over all variables
    Returns: a list of distributions
    Return type: list of numpy arrays

1.2 Data Table (table)

Constructors

The preferred way to construct a table is to invoke a named constructor.

Inspection

Row manipulation

Weights

1.3 SQL table (data.sql)

1.4 Domain description (domain)

1.5 Variable Descriptors (variable)

Every variable is associated with a descriptor that stores its name, type and other properties, and takes care of the conversion of values from textual format to floats and back.

Derived classes store lists of existing descriptors for variables of each particular type. They provide a supplementary constructor, make, which returns an existing variable instead of creating a new one, when possible.

Descriptors facilitate computation of variables from other variables. This is used in domain conversion: if the destination domain contains a variable that is not present in the original domain, the variable's value is computed by calling its method compute_value, passing the data instance from the original domain as an argument.
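A short, hedged sketch of such a conversion (it assumes the bundled iris data; all variables here are shared between the two domains, so values are copied rather than computed):

import Orange

data = Orange.data.Table("iris")

# A new domain that reuses two of the original descriptors and the class.
subdomain = Orange.data.Domain(data.domain.attributes[:2], data.domain.class_var)

# Converting the table to the new domain; a variable missing from the
# source domain would have its values computed through compute_value.
subdata = Orange.data.Table(subdomain, data)
print(subdata[0])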

Method compute_value calls get_value_from if defined and returns its result. If the variable does not have get_value_from, compute_value returns Unknown.

1.6 Values (value)

1.7 Data Instance (instance)

Class Instance represents a data instance. The base class stores its own data, and derived classes are used to represent views into various data storages.

Like data tables, every data instance is associated with a domain, and its data is split into attributes, classes, meta attributes and the weight. The instance's data can be retrieved through the attributes x, y and metas. For derived classes, changing this data does not necessarily modify the corresponding data tables or other structures. Use indexing to modify the data.

Rows of Data Tables

1.8 Data Filters (filter)

Instances of classes derived from Filter are used for filtering the data.

When called with an individual data instance (Orange.data.Instance), they accept or reject the instance by returning either True or False.

When called with a data storage (e.g. an instance of Orange.data.Table), they check whether the corresponding class provides the method that implements the particular filter. If so, the method is called and the result should be of the same type as the storage; e.g., filter methods of Orange.data.Table return new instances of Orange.data.Table, and filter methods of SQL proxies return new SQL proxies.

If the class corresponding to the storage does not implement a particular filter, the fallback computes the indices of the rows to be selected and returns data[indices].

1.9 Continuization and normalization (continuizer)

class Orange.data.continuizer.DomainContinuizer

Construct a domain in which discrete attributes are replaced by continuous ones; existing continuous attributes can be normalized.

The attributes are treated according to their types:

- binary variables are transformed into 0.0/1.0 or -1.0/1.0 indicator variables;
- multinomial variables are treated according to the flag multinomial_treatment;
- discrete attributes with less than two possible values are removed;
- continuous variables can be normalized or left unchanged.

The typical use of the class is as follows:

continuizer = Orange.data.continuizer.DomainContinuizer()
continuizer.multinomial_treatment = continuizer.LowestIsBase
domain0 = continuizer(data)
data0 = Orange.data.Table(domain0, data)

Domain continuizers can be given either a data set or a domain, and return a new domain. When given only the domain, they cannot normalize continuous attributes or use the most frequent value as the base value, and will raise an exception with such settings.

The constructor can also be passed data or a domain, in which case the constructed continuizer is immediately applied to the data or domain, and the transformed domain is returned instead of the continuizer instance.

By default, the class does not change continuous and class attributes, while discrete attributes are replaced with N attributes (NValues) with values 0 and 1.

zero_based
    Determines the value used as the low value of the variable. When binary variables are transformed into continuous ones, or when a multivalued variable is transformed into multiple variables, the transformed variable can either have values 0.0 and 1.0 (default, zero_based is True) or -1.0 and 1.0 (zero_based is False). This attribute also determines the interval for normalized continuous variables (either [-1, 1] or [0, 1]). The following text assumes the default case. (Default: True)

multinomial_treatment
    Decides the treatment of multinomial variables. Let N be the number of the variable's values. (Default: NValues)

    DomainContinuizer.NValues
        The variable is replaced by N indicator variables, each corresponding to one value of the original variable. For each value of the original attribute, only the corresponding new attribute will have a value of one and the others will be zero. Note that these variables are not independent, so they cannot be used (directly) in, for instance, linear or logistic regression.

        For example, the data set bridges has a feature RIVER with values M, A, O and Y, in that order. Its value for the 15th row is M. Continuization replaces the variable with variables RIVER=M, RIVER=A, RIVER=O and RIVER=Y. For the 15th row, the first has value 1 and the others are 0.

    DomainContinuizer.LowestIsBase
        Similar to the above, except that it creates only N-1 variables. The missing indicator belongs to the lowest value: when the original variable has the lowest value, all indicators are 0. If the variable descriptor has base_value defined, the specified value is used as the base instead of the lowest one.

        Continuizing the variable RIVER gives similar results as above, except that it would omit RIVER=M; all three variables would be zero for the 15th data instance.

    DomainContinuizer.FrequentIsBase
        Like above, except that the most frequent value is used as the base. If there are multiple most frequent values, the one with the lowest index is used. The frequency of values is extracted from the data, so this option cannot be used if the constructor is given only a domain.

        The variable RIVER would be continuized similarly to above, except that it omits RIVER=A, which is the most frequent value.

    DomainContinuizer.Ignore
        Discrete variables are omitted.

    DomainContinuizer.IgnoreMulti
        Discrete variables with more than two values are omitted; two-valued ones are treated the same as in LowestIsBase.

    DomainContinuizer.ReportError
        Raise an error if there are any multinomial variables in the data.

    DomainContinuizer.AsOrdinal
        Multivalued variables are treated as ordinal and replaced by a continuous variable with the value's index, e.g. 0, 1, 2, 3...

    DomainContinuizer.AsNormalizedOrdinal
        As above, except that the resulting continuous value will be in the range from 0 to 1, e.g. 0, 0.25, 0.5, 0.75, 1 for a five-valued variable.

normalize_continuous
    If None, continuous variables are left unchanged. If DomainContinuizer.NormalizeBySD, they are replaced with standardized values obtained by subtracting the average value and dividing by the standard deviation. The attribute zero_based has no effect on this standardization. If DomainContinuizer.NormalizeBySpan, they are replaced with normalized values obtained by subtracting the minimum value of the data and dividing by the span (max - min). Statistics are computed from the data, so the constructor must be given data, not just a domain. (Default: None)

transform_class
    If True, the class is replaced by continuous attributes or normalized as well. Multiclass problems are thus transformed to multitarget ones. (Default: False)

1.10 Data discretization (discretization)

Continuous features in the data can be discretized using a uniform discretization method. Discretization replaces continuous features with the corresponding categorical features:

import Orange
iris = Orange.data.Table("iris.tab")
disc_iris = Orange.data.discretization.DiscretizeTable(
    iris, method=Orange.feature.discretization.EqualFreq(n=3))

print("Original data set:")
for e in iris[:3]:
    print(e)
print("Discretized data set:")
for e in disc_iris[:3]:
    print(e)

Discretization introduces new categorical features with discretized values:

Original data set:
[5.1, 3.5, 1.4, 0.2 Iris-setosa]
[4.9, 3.0, 1.4, 0.2 Iris-setosa]
[4.7, 3.2, 1.3, 0.2 Iris-setosa]
Discretized data set:
[< , >= , < , < Iris-setosa]
[< , [ , ), < , < Iris-setosa]
[< , >= , < , < Iris-setosa]

Data discretization uses the feature discretization classes from Orange.feature.discretization and applies them on the entire data set. The default discretization method (equal frequency with four intervals) can be replaced with other discretization approaches, as demonstrated below:

disc = Orange.data.discretization.DiscretizeTable()
disc.method = Orange.feature.discretization.EqualFreq(n=2)
disc_iris = disc(iris)

Data discretization classes

CHAPTER 2
Feature (feature)

2.1 Scoring (scoring)

A feature score is an assessment of the usefulness of a feature for prediction of the dependent (class) variable. Orange provides classes that compute the common feature scores for classification and regression.

The code below computes the information gain of the feature tear_rate in the lenses data set:

>>> data = Orange.data.Table("lenses")
>>> Orange.feature.scoring.InfoGain("tear_rate", data)

An alternative way of invoking the scorers is to construct the scoring object, as in the following example:

>>> gain = Orange.feature.scoring.InfoGain()
>>> for att in data.domain.attributes:
...     print("%s %.3f" % (att.name, gain(att, data)))
age ...
prescription ...
astigmatic ...
tear_rate ...

Feature scoring in classification problems

class Orange.feature.scoring.InfoGain
    Information gain is the expected decrease of entropy. See the Wikipedia entry on information gain.

class Orange.feature.scoring.GainRatio
    Information gain ratio is the ratio between the information gain and the entropy of the feature's value distribution. The score was introduced in [Quinlan1986] to alleviate overestimation for multi-valued features. See the Wikipedia entry on gain ratio.

class Orange.feature.scoring.Gini
    The Gini index is the probability that two randomly chosen instances will have different classes. See the Wikipedia entry on Gini index.

Feature scoring in regression problems

TBD.

2.2 Feature discretization (discretization)

This module transforms continuous features into discrete ones. To discretize a whole data set, one would usually use the Orange.data.discretization module instead; this module is limited to single features.

Discretization algorithms return a discretized variable (with fixed parameters) that can transform either the learning or the testing data. Parameter learning is separate from the transformation, as in machine learning only the training set should be used to induce parameters.

To obtain discretized features, call a discretization algorithm with the data and the feature to discretize. The feature can be given as an index, a name or an Orange.data.Variable. The following example creates a discretized feature:

import Orange
data = Orange.data.Table("iris.tab")
disc = Orange.feature.discretization.EqualFreq(n=4)
disc_var = disc(data, 0)

The values of the first attribute will be discretized when the data is transformed to an Orange.data.Domain that includes disc_var. In the example below we add the discretized first attribute to the original domain.

ndomain = Orange.data.Domain([disc_var] + list(data.domain.attributes), data.domain.class_vars)
ndata = Orange.data.Table(ndomain, data)
print(ndata)

The printout:

[[< , 5.1, 3.5, 1.4, 0.2 Iris-setosa],
 [< , 4.9, 3.0, 1.4, 0.2 Iris-setosa],
 [< , 4.7, 3.2, 1.3, 0.2 Iris-setosa],
 [< , 4.6, 3.1, 1.5, 0.2 Iris-setosa],
 [< , 5.0, 3.6, 1.4, 0.2 Iris-setosa],
 ...
]

Discretization Algorithms
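As a hedged illustration, here is an equal-width alternative to EqualFreq; the EqualWidth name is an assumption about this module's contents:

import Orange

data = Orange.data.Table("iris.tab")
# Equal-width binning: splits the attribute's range into n equal intervals.
disc = Orange.feature.discretization.EqualWidth(n=3)
disc_var = disc(data, 0)
print(disc_var.values)   # the three interval labels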

CHAPTER 3
Classification (classification)

3.1 Majority (majority)

The accuracy of models is often compared with the default accuracy, that is, the accuracy of a model which classifies all instances into the majority class. Fitting such a model consists of computing the class distribution and its mode. The model is represented as an instance of Orange.classification.majority.ConstantClassifier.

Example

>>> iris = Orange.data.Table('iris')
>>> maj = Orange.classification.majority.MajorityLearner()
>>> model = maj(iris[30:110])
>>> print(model)
ConstantClassifier [...]
>>> model(iris[0])
Value('iris', Iris-versicolor)

3.2 Naive Bayes (naive_bayes)

The module for Naive Bayes (NB) classification contains two implementations:

- Orange.classification.naive_bayes.BayesClassifier is based on the popular scikit-learn package.
- Orange.classification.naive_bayes.BayesStorageClassifier demonstrates the use of the contingency table provided by the underlying storage. It returns only the predicted value.

Example

>>> from Orange.data import Table
>>> from Orange.classification.naive_bayes import *
>>> iris = Table('iris')
>>> fitter = NaiveBayes()
>>> model = fitter(iris[:-1])
>>> model(iris[-1])
Value('iris', Iris-virginica)

3.3 Support Vector Machines (svm)

The module for Support Vector Machine (SVM) classification is based on the popular scikit-learn package, which uses the LibSVM and LIBLINEAR libraries. It provides several learning algorithms:

- Orange.classification.svm.SVMLearner, a general SVM learner;
- Orange.classification.svm.LinearSVMLearner, a fast learner useful for data sets with a large number of features;
- Orange.classification.svm.NuSVMLearner, a general SVM learner that uses a parameter to control the number of support vectors;
- Orange.classification.svm.SVRLearner, a general learner for support vector regression;
- Orange.classification.svm.NuSVRLearner, which is similar to NuSVMLearner, for regression;
- Orange.classification.svm.OneClassSVMLearner, an unsupervised learner useful for novelty and outlier detection.

Example

>>> from Orange.data import Table
>>> from Orange.classification.svm import SVMLearner
>>> from Orange.evaluation import CrossValidation, CA
>>> iris = Table('iris')
>>> fitter = SVMLearner()
>>> cross = CrossValidation(iris, fitter)
>>> prediction = cross.kfold(10)
>>> CA(iris, prediction[0])
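A hedged sketch of the outlier-detection learner listed above; the assumption, borrowed from the underlying scikit-learn estimator, is that the model labels inliers 1 and outliers -1 and that the learner accepts a nu parameter:

from Orange.data import Table
from Orange.classification.svm import OneClassSVMLearner

iris = Table('iris')
fitter = OneClassSVMLearner(nu=0.1)   # nu bounds the fraction of outliers
model = fitter(iris)
print([model(inst) for inst in iris[:5]])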

CHAPTER 4
Evaluation (evaluation)

4.1 Sampling procedures for testing models (testing)
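A minimal, hedged sketch of these procedures, inferred solely from the CrossValidation and CA usage in the SVM example above, so the exact names and signatures are assumptions:

from Orange.data import Table
from Orange.classification.majority import MajorityLearner
from Orange.evaluation import CrossValidation, CA

iris = Table('iris')
cross = CrossValidation(iris, MajorityLearner())
results = cross.kfold(10)        # 10-fold cross-validation
print(CA(iris, results[0]))      # classification accuracy of the predictions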


CHAPTER 5
Miscellaneous (misc)

5.1 Symbolic constants (enum)


CHAPTER 6
Development

6.1 Widget development

Library of Common GUI Controls

gui is a library of functions for constructing a control (like a check box, line edit or combo box), inserting it into the parent's layout, setting tooltips and callbacks, and establishing synchronization with a Python object's attribute (including saving and retrieving the value when the widget is closed and reopened), all in a single call.

Almost all functions need three arguments:

- the widget into which the control is inserted,
- the master widget, with whose attributes the control's value is synchronized,
- the name of that attribute (value).

All other arguments should be given as keyword arguments, for clarity and also to allow potential future compatibility-breaking changes in the module. Several arguments that are common to all functions must always be given as keyword arguments.

Common options

All controls accept the following arguments that can only be given as keyword arguments.

tooltip
    A string for a tooltip that appears when the mouse is over the control.

disabled
    Tells whether the control should be disabled upon initialization.

addSpace
    Gives the amount of space that is inserted after the control (or the box that encloses it). If True, a space of 8 pixels is inserted. Default is 0.

addToLayout
    The control is added to the parent's layout unless this flag is set to False.

stretch
    The stretch factor for this widget, used when adding to the layout. Default is 0.

sizePolicy
    The size policy for the box or the control.

Common Arguments

Many functions share common arguments.

widget
    The widget on which the control will be drawn.

master
    The object which includes the attribute that is used to store the control's value; most often the self of the widget into which the control is inserted.

value
    A string with the name of the master's attribute that synchronizes with the control's value.

box
    Indicates whether there should be a box around the control. If box is False (the default), no box is drawn; if it is a string, it is used as the label for the box's name; if box is any other true value (such as True), an unlabeled box is drawn.

callback
    A function that is called when the state of the control is changed. This can be a single function or a list of functions that will be called in the given order. The callback function should not change the value of the controlled attribute (the one given as the value argument described above) to avoid a cycle (a workaround is shown in the description of the listbox function).

label
    A string that is displayed as the control's label.

labelwidth
    The width of the label. This is useful for aligning the controls.

orientation
    When a label is given, this argument determines the relative placement of the label and the control. The label can be above the control ("vertical" or True, the default) or in the same line with the control ("horizontal" or False). Orientation can also be an instance of QLayout.

Common Attributes

box
    If the constructed widget is enclosed into a box, the attribute box refers to the box.

Widgets

This section describes the wrappers for controls like check boxes, buttons and similar. Using them is preferred over calling Qt directly, for convenience, readability, ease of porting to newer versions of Qt and, in particular, because they set up a lot of things that happen behind the scenes.

Other widgets

Utility functions

Internal functions and classes

This part of the documentation describes some classes and functions that are used internally. The classes will likely maintain compatibility in the future, while the functions may be changed.

Wrappers for Qt classes

Wrappers for Python classes

Other functions
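To tie this section together, here is a minimal, hedged sketch of the calling convention described above; the module path Orange.widgets.gui and the widgetBox/checkBox names are assumptions about the library's layout:

from Orange.widgets import gui, widget

class OWDemo(widget.OWWidget):
    name = "Demo"
    want_main_area = False

    # the master's attribute synchronized with the check box below
    normalize = True

    def __init__(self):
        super().__init__()
        box = gui.widgetBox(self.controlArea, "Options")
        # widget, master and value are positional; the rest are keywords
        gui.checkBox(box, self, "normalize",
                     label="Normalize data",
                     tooltip="Rescale values before processing",
                     callback=self.setting_changed)

    def setting_changed(self):
        print("normalize is now", self.normalize)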

CHAPTER 7
Indices and tables

- genindex
- modindex
- search


Bibliography

[Kononenko2007] Igor Kononenko, Matjaz Kukar: Machine Learning and Data Mining, Woodhead Publishing, 2007.

[Quinlan1986] J. R. Quinlan: Induction of Decision Trees, Machine Learning, 1986.

[Breiman1984] L. Breiman et al.: Classification and Regression Trees, Chapman and Hall, 1984.

[Kononenko1995] I. Kononenko: On biases in estimating multi-valued attributes, International Joint Conference on Artificial Intelligence, 1995.




More information

Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3

Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3 Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3 January 25, 2007 CSE-4412: Data Mining 1 Chapter 6 Classification and Prediction 1. What is classification? What is prediction?

More information

arulescba: Classification for Factor and Transactional Data Sets Using Association Rules

arulescba: Classification for Factor and Transactional Data Sets Using Association Rules arulescba: Classification for Factor and Transactional Data Sets Using Association Rules Ian Johnson Southern Methodist University Abstract This paper presents an R package, arulescba, which uses association

More information

CONCEPT FORMATION AND DECISION TREE INDUCTION USING THE GENETIC PROGRAMMING PARADIGM

CONCEPT FORMATION AND DECISION TREE INDUCTION USING THE GENETIC PROGRAMMING PARADIGM 1 CONCEPT FORMATION AND DECISION TREE INDUCTION USING THE GENETIC PROGRAMMING PARADIGM John R. Koza Computer Science Department Stanford University Stanford, California 94305 USA E-MAIL: Koza@Sunburn.Stanford.Edu

More information

The Warhol Language Reference Manual

The Warhol Language Reference Manual The Warhol Language Reference Manual Martina Atabong maa2247 Charvinia Neblett cdn2118 Samuel Nnodim son2105 Catherine Wes ciw2109 Sarina Xie sx2166 Introduction Warhol is a functional and imperative programming

More information

A Two-level Learning Method for Generalized Multi-instance Problems

A Two-level Learning Method for Generalized Multi-instance Problems A wo-level Learning Method for Generalized Multi-instance Problems Nils Weidmann 1,2, Eibe Frank 2, and Bernhard Pfahringer 2 1 Department of Computer Science University of Freiburg Freiburg, Germany weidmann@informatik.uni-freiburg.de

More information

Study on Classifiers using Genetic Algorithm and Class based Rules Generation

Study on Classifiers using Genetic Algorithm and Class based Rules Generation 2012 International Conference on Software and Computer Applications (ICSCA 2012) IPCSIT vol. 41 (2012) (2012) IACSIT Press, Singapore Study on Classifiers using Genetic Algorithm and Class based Rules

More information

Development of Prediction Model for Linked Data based on the Decision Tree for Track A, Task A1

Development of Prediction Model for Linked Data based on the Decision Tree for Track A, Task A1 Development of Prediction Model for Linked Data based on the Decision Tree for Track A, Task A1 Dongkyu Jeon and Wooju Kim Dept. of Information and Industrial Engineering, Yonsei University, Seoul, Korea

More information

DATA STRUCTURES CHAPTER 1

DATA STRUCTURES CHAPTER 1 DATA STRUCTURES CHAPTER 1 FOUNDATIONAL OF DATA STRUCTURES This unit introduces some basic concepts that the student needs to be familiar with before attempting to develop any software. It describes data

More information

CISC 4631 Data Mining

CISC 4631 Data Mining CISC 4631 Data Mining Lecture 03: Introduction to classification Linear classifier Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside) 1 Classification:

More information

Babu Madhav Institute of Information Technology, UTU 2015

Babu Madhav Institute of Information Technology, UTU 2015 Five years Integrated M.Sc.(IT)(Semester 5) Question Bank 060010502:Programming in Python Unit-1:Introduction To Python Q-1 Answer the following Questions in short. 1. Which operator is used for slicing?

More information

2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data.

2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data. Code No: M0502/R05 Set No. 1 1. (a) Explain data mining as a step in the process of knowledge discovery. (b) Differentiate operational database systems and data warehousing. [8+8] 2. (a) Briefly discuss

More information

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Output: Knowledge representation Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter of Data Mining by I. H. Witten and E. Frank Decision tables Decision trees Decision rules

More information

Variable and Data Type I

Variable and Data Type I Islamic University Of Gaza Faculty of Engineering Computer Engineering Department Lab 2 Variable and Data Type I Eng. Ibraheem Lubbad September 24, 2016 Variable is reserved a location in memory to store

More information

Data Mining Algorithms: Basic Methods

Data Mining Algorithms: Basic Methods Algorithms: The basic methods Inferring rudimentary rules Data Mining Algorithms: Basic Methods Chapter 4 of Data Mining Statistical modeling Constructing decision trees Constructing rules Association

More information