Prototyping DM Techniques with WEKA and YALE Open-Source Software


TIES443, Tutorial 1
Department of Mathematical Information Technology, University of Jyväskylä
Mykola Pechenizkiy
Course webpage:
November 7

Contents
- Brief review of DM software
- Commercial
- Open-source: WEKA, YALE, the R Project for Statistical Computing, Pentaho (whole BI solutions)
- Matlab (Sami will tell you more during the 2nd tutorial)
- WEKA vs. YALE comparison: exploration, experimentation, visualization
- 1st assignment

Data Mining Software
- Many providers of commercial DM software: SAS Enterprise Miner, SPSS Clementine, Statistica Data Miner, MS SQL Server, Polyanalyst, KnowledgeSTUDIO, IBM Intelligent Miner.
- Universities can now receive free copies of DB2 and Intelligent Miner for educational or research purposes.
- See for a list
- Open source: WEKA (Waikato Environment for Knowledge Analysis), YALE (Yet Another Learning Environment), and many others: MLC++, Minitab, AlphaMiner, Rattle, KNIME.
- The Pentaho BI project: a pioneering initiative by the open-source development community to provide organizations with a comprehensive set of BI capabilities that enable them to radically improve business performance, efficiency, and effectiveness.

Data Mining with WEKA
The following slides are by Eibe Frank. Copyright: Martin Kramer (mkramer@wxs.nl)

WEKA: the software
- Machine learning/data mining software written in Java (distributed under the GNU Public License)
- Used for research, education, and applications
- Complements the Data Mining book by Witten & Frank
- Main features: a comprehensive set of data pre-processing tools, learning algorithms and evaluation methods; graphical user interfaces (incl. data visualization); an environment for comparing learning algorithms

WEKA only deals with flat files
Example file (the attribute declarations were partly lost in extraction and are shown as far as they survive):
age
sex { female, ...
chest_pain_type { typ_angina, asympt, non_anginal, ...
cholesterol
exercise_induced_angina { no, ...
class { present, ...
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present

Command line tutorial

Explorer: Pre-processing the Data
- Data can be imported from a file in various formats: ARFF, CSV, C4.5, binary.
- Data can also be read from a URL or from an SQL database (using JDBC).
- Pre-processing tools in WEKA are called filters.
- WEKA contains filters for: discretization, normalization, resampling, attribute selection, transforming and combining attributes, ...
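To make the flat-file idea and the notion of a filter more concrete, here is a minimal Python sketch (it is not WEKA code). It reads the four data rows shown above, treats `?` as a missing value, and rescales the cholesterol column to [0, 1], which is roughly what a normalization filter does. The names `parse_rows` and `normalize` are illustrative assumptions, not part of WEKA.

```python
# Illustrative sketch only: plain Python, not the WEKA API.
ROWS = """\
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
"""

def parse_rows(text):
    """Split comma-separated rows; '?' marks a missing value (stored as None)."""
    table = []
    for line in text.strip().splitlines():
        table.append([None if v == "?" else v for v in line.split(",")])
    return table

def normalize(values):
    """Min-max scale numeric values to [0, 1]; missing values stay None."""
    known = [v for v in values if v is not None]
    lo, hi = min(known), max(known)
    span = hi - lo
    scaled = []
    for v in values:
        if v is None:
            scaled.append(None)          # keep the gap, do not invent a value
        elif span == 0:
            scaled.append(0.0)           # constant column
        else:
            scaled.append((v - lo) / span)
    return scaled

table = parse_rows(ROWS)
# Column 3 is cholesterol; convert known entries to integers.
cholesterol = [None if row[3] is None else int(row[3]) for row in table]
scaled = normalize(cholesterol)
```

The missing cholesterol value of the last instance is carried through unchanged, so a later step can decide how to handle it.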


Explorer: building classifiers
- Classifiers in WEKA are models for predicting nominal or numeric quantities.
- Implemented learning schemes include: decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes nets, ...
- Meta-classifiers include: bagging, boosting, stacking, error-correcting output codes, locally weighted learning, ...
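As an illustration of one of the scheme families listed above, the following is a minimal sketch of an instance-based (k-nearest-neighbour) classifier in plain Python. It is not WEKA's implementation; the function name `knn_predict` and the toy training points are invented for this example.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict the majority label among the k training instances nearest to query.

    train is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data, invented for illustration: two well-separated groups.
train = [
    ((1.0, 1.0), "a"),
    ((1.2, 0.8), "a"),
    ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"),
    ((5.2, 4.9), "b"),
]

near_a = knn_predict(train, (1.1, 1.0), k=3)
near_b = knn_predict(train, (5.1, 5.0), k=3)
```

No model is built in advance: all work happens at prediction time, which is what distinguishes instance-based schemes from, say, decision trees.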


Explorer: clustering data
- WEKA contains clusterers for finding groups of similar instances in a dataset.
- Implemented schemes are: k-means, EM, Cobweb, X-means, FarthestFirst.
- Clusters can be visualized and compared to true clusters (if given).
- Evaluation is based on log-likelihood if the clustering scheme produces a probability distribution.
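The k-means scheme named above can be sketched in a few lines of Python. This is a textbook version of Lloyd's iteration, not WEKA's implementation; the naive initialization (the first k points) and the toy points are assumptions made to keep the example deterministic.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain k-means: alternate nearest-centroid assignment and centroid update."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points (assumption)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assignment step: each point goes to its nearest centroid.
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, clusters

# Toy data, invented for illustration.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
```

With a poor initialization k-means can settle in a bad local optimum, which is one reason to compare the result against true clusters when they are available.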


Explorer: finding associations
- WEKA contains an implementation of the Apriori algorithm for learning association rules.
- Works only with discrete data.
- Can identify statistical dependencies between groups of attributes, e.g. milk, butter => bread, eggs (with confidence 0.9 and support 2000).
- Apriori can compute all rules that have a given minimum support and exceed a given confidence.
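The two quantities attached to a rule above, support and confidence, can be computed directly. The sketch below is plain Python and covers only these two measures, not the Apriori candidate generation itself; support is counted here as the number of matching transactions, and the toy transactions are invented for illustration.

```python
def support(transactions, itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(transactions, lhs, rhs):
    """Fraction of transactions containing lhs that also contain rhs.

    Assumes lhs occurs at least once; otherwise the ratio is undefined.
    """
    return support(transactions, lhs | rhs) / support(transactions, lhs)

# Toy transactions, invented for illustration.
transactions = [
    {"milk", "butter", "bread", "eggs"},
    {"milk", "butter", "bread", "eggs"},
    {"milk", "butter"},
    {"bread", "eggs"},
    {"milk", "butter", "bread", "eggs"},
]

lhs = {"milk", "butter"}
rhs = {"bread", "eggs"}
rule_support = support(transactions, lhs | rhs)
rule_confidence = confidence(transactions, lhs, rhs)
```

Here the rule milk, butter => bread, eggs holds in 3 of the 4 transactions that contain milk and butter, so a minimum confidence of 0.9 would reject it.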

Explorer: attribute selection
- Panel that can be used to investigate which (subsets of) attributes are the most predictive ones.
- Attribute selection methods contain two parts:
- A search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking
- An evaluation method: correlation-based, wrapper, information gain, chi-squared, ...
- Very flexible: WEKA allows (almost) arbitrary combinations of these two.
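One of the combinations listed above, ranking as the search method with information gain as the evaluation method, can be sketched as follows. This is plain Python, not WEKA code; the rows reuse the nominal columns of the four heart-disease instances shown earlier in the tutorial, and the function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attr_index, class_index=-1):
    """Class entropy minus the weighted entropy after splitting on one attribute."""
    labels = [r[class_index] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[attr_index], []).append(r[class_index])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Columns: sex, chest_pain_type, exercise_induced_angina, class
rows = [
    ("male", "typ_angina", "no", "not_present"),
    ("male", "asympt", "yes", "present"),
    ("male", "asympt", "yes", "present"),
    ("female", "non_anginal", "no", "not_present"),
]

gains = [information_gain(rows, i) for i in range(3)]
# Ranking search: order attribute indices by decreasing gain.
ranking = sorted(range(3), key=lambda i: gains[i], reverse=True)
```

On four instances the numbers mean little, but the mechanics are the same on real data: the evaluator scores each attribute and the ranking search simply sorts by that score.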

Explorer: Data Visualization
- Visualization is very useful in practice: e.g. it helps to determine the difficulty of the learning problem.
- WEKA can visualize single attributes (1-d) and pairs of attributes (2-d).
- To do: rotating 3-d visualizations (Xgobi-style)
- Color-coded class values
- Jitter option to deal with nominal attributes (and to detect hidden data points)
- Zoom-in function


Performing Experiments
- The Experimenter makes it easy to compare the performance of different learning schemes.
- For classification and regression problems.
- Results can be written into a file or database.
- Evaluation options: cross-validation, learning curve, holdout.
- Can also iterate over different parameter settings.
- Significance testing is built in!
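The cross-validation option mentioned above rests on a simple partitioning of the instances. The following Python sketch shows one way to generate the train/test index sets for k folds; it does no shuffling or stratification, and the function name `k_fold_indices` is an assumption for illustration, not a WEKA call.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over n instances."""
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
test_sizes = [len(test) for _, test in folds]
```

Every instance is used for testing exactly once and for training in the remaining folds, so the averaged score uses all the data without testing on anything the model was trained on.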


The Knowledge Flow GUI
- New graphical user interface for WEKA.
- Java-Beans-based interface for setting up and running machine learning experiments.
- Data sources, classifiers, etc. are beans and can be connected graphically.
- Data flows through components: e.g., data source -> filter -> classifier -> evaluator.
- Layouts can be saved and loaded again later.
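The flow data source -> filter -> classifier -> evaluator described above can be mimicked with four small Python functions chained together. This is only a conceptual sketch: it does not use Java Beans or any WEKA class, all function names and the toy rows are invented, and the "classifier" is a trivial majority-class predictor evaluated on its own training data purely to keep the example short.

```python
from collections import Counter

def data_source():
    """Toy rows (invented): (value, label); None marks a missing value."""
    return [(1.0, "a"), (None, "a"), (3.0, "b"), (2.0, "a")]

def drop_missing(rows):
    """Filter component: remove rows whose value is missing."""
    return [r for r in rows if r[0] is not None]

def train_majority(rows):
    """Classifier component: always predict the most frequent training label."""
    majority = Counter(label for _, label in rows).most_common(1)[0][0]
    return lambda value: majority

def evaluate(model, rows):
    """Evaluator component: fraction of rows whose label is predicted correctly."""
    correct = sum(1 for value, label in rows if model(value) == label)
    return correct / len(rows)

# Connect the components in the order of the flow.
rows = drop_missing(data_source())
model = train_majority(rows)
accuracy = evaluate(model, rows)
```

Each component only sees the output of the previous one, which is what makes such flows easy to rearrange graphically.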


Conclusion: try it yourself!
- WEKA is available at its project website, which also has a list of projects based on WEKA.
- YALE has different interfaces and different ideas behind it, but it also integrates all available DM techniques from WEKA.

The following slides are compiled from screenshots and related descriptions available from the YALE pages.

YALE: Yet Another Learning Environment
Developed at the Artificial Intelligence Unit of the University of Dortmund.

Features of YALE
- Freely available open-source knowledge discovery environment.
- 100% pure Java (runs on every major platform and operating system).
- KD processes are modeled as simple operator trees, which is both intuitive and powerful.
- Operator trees or subtrees can be saved as building blocks for later re-use.
- Internal XML representation ensures a standardized interchange format for data mining experiments.
- Simple scripting language allowing for automatic large-scale experiments.
- Multi-layered data view concept ensures efficient and transparent data handling.

Features of YALE: flexibility in using YALE
- Graphical user interface (GUI) for interactive prototyping.
- Command line mode (batch mode) for automated large-scale applications.
- Java API to ease usage of YALE from your own programs.
- Simple plugin and extension mechanisms; some plugins already exist and you can easily add your own.
- Powerful plotting facility offering a large set of sophisticated high-dimensional visualization techniques for data and models.
- More than 350 machine learning, evaluation, input and output, pre- and post-processing, and visualization operators, plus numerous meta optimization schemes.
- Machine learning library WEKA fully integrated.
- YALE's potential applications include text mining, multimedia mining, feature engineering, data stream mining and tracking drifting concepts, development of ensemble methods, and distributed data mining.

Experiment Setup
- The initial operator tree consists only of a root node.
- The "Tree View" tab is the most often used editor for YALE experiments. Left: the current operator tree. Right: a table with the parameters of the currently selected operator.
- The lower part of the YALE main frame displays log and error messages.
- After the learning operator "J48", a breakpoint indicates that the intermediate results can be inspected. Due to the modular concept of YALE, it is always possible to inspect and save intermediate results, e.g. the results for each individual run in a cross-validation.
- New operators can be added to the experiment directly from the context menu of the parent operator, or through the new operator dialog shown in this screenshot. Several search constraints exist and a short description of each operator is shown.
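The idea of an experiment as a nested operator tree with an XML representation, as described above, can be illustrated with Python's standard `xml.etree.ElementTree` module. The element name `operator`, the attribute `name`, and the tree layout below are hypothetical choices made for this sketch; they are not YALE's actual file format.

```python
import xml.etree.ElementTree as ET

def to_xml(node):
    """Convert a (name, children) tuple into a nested XML element.

    The tag 'operator' and attribute 'name' are hypothetical, not YALE's schema.
    """
    name, children = node
    elem = ET.Element("operator", {"name": name})
    for child in children:
        elem.append(to_xml(child))
    return elem

# A root operator containing a cross-validation that wraps a J48 learner
# and an evaluator (the "Evaluator" operator name is invented).
tree = ("Root", [("CrossValidation", [("J48", []), ("Evaluator", [])])])

xml_text = ET.tostring(to_xml(tree), encoding="unicode")
parsed = ET.fromstring(xml_text)
```

Because the whole experiment is one tree, any subtree (here, the cross-validation with its two children) can be serialized on its own and reused as a building block.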

- The operator trees are coded and represented in a simple XML format. The XML editor tab allows fast and direct manipulation of the current experiment.
- All views can also be printed and exported to a wide range of graphic formats, including jpg, png, ps and pdf.
- The "Box View" is another viewer for YALE experiments. The box format is an intuitive way of representing the nesting of the operators, but editing is not possible in this view.
- The "Monitor" tab provides an overview of the memory currently in use and is an important tool for large-scale data mining tasks on huge data sets. The amount of memory used during an experiment run can even be logged in the same way as all other provided logging values.
- Data can be imported from several file formats with the attribute editor. Other file formats such as Arff, C45, csv, and dbase can be loaded with specialized operators.
- The Attribute Editor can be used to create meta data descriptions from almost arbitrary file formats. These meta data descriptions can then be used by an input operator which actually loads the data.
- Additional attributes (features) can easily be constructed from your data. YALE provides several approaches to construct the best feature space automatically. These approaches range from feature space transformations like PCA, GHA, ICA or the kernel versions, to standard feature selection techniques, to several evolutionary approaches for feature construction and extraction.
- Help features ease the learning phase for new users: an online tutorial, tool tip texts, a beginner and an expert mode, operator info screens, a GUI manual, and the YALE tutorial.

Data Visualization
Each time a data set is presented in the results tab (e.g. after loading it), several views appear: a meta data view describing all attributes, a data view showing the actual data, and a plot view providing a large set of (high-dimensional) plotters for the data set at hand.
- The basic scatter plotter: two of the attributes are used as axes and the class label attribute is used for colorization. The legend at the top maps the colors used to the classes or, in the case of a real-valued color plot column, to the corresponding real values.
- The standard scatter plotter even allows jittering, zooming, and displaying example ids. Double-clicking a data point opens a visualizer. The standard example visualizer is presented here.
- 2D scatter plots can be put together into a scatter plot matrix, where a usual scatter plot is drawn for all pairs of dimensions. This plotter is only available for fewer than 10 dimensions. For a higher number of dimensions, one of the other high-dimensional data plotters presented below should be used.
- A 3D scatter plot exists, similar to the colorized 2D scatter plot discussed above. The viewport can be rotated and zoomed to fit your needs.
- The built-in 2D and 3D plotters are a quick and easy way to view your numerical and nominal results, even as an online plot at experiment runtime!
- SOM (Self-Organizing Map) plotter, which uses a Kohonen net for dimensionality reduction. Plotting of the U-, the P-, and the U*-Matrix is supported with different color schemes. The data points can be colorized by one of the data columns, e.g. with the prediction label.

- SOM (Self-Organizing Map) plotter, which uses a Kohonen net for dimensionality reduction. Here a gray scale color scheme was used to plot the U-Matrix.
- The parallel plotter prints the axes of all dimensions parallel to each other. This is the natural visualization technique for series data but can also be useful for other types of data. The main advantage of parallel plots is that a very high number of dimensions can be visualized with this technique. The dimensions are colorized with the feature weights: the more yellow a dimension is marked, the more important this column is.
- Quartile plots (also known as box plots) are often used for experiment results like performance values, but this type of plot can also summarize the statistical properties of data columns in general.
- Histogram plots (also known as distribution plots).
- RadViz is another high-dimensional data plotter where the data columns are placed as radial dimension anchors. Each data point is connected to each anchor with a spring corresponding to the feature values. This leads to a fixed position in the two-dimensional plane. Again, weights are used to mark the more important columns.
- A survey plot is a sort of vertical histogram matrix, also suitable for a large number of dimensions. Each line corresponds to one data point and can be colorized by one of the columns. The length of each section corresponds to the value of the data point for that dimension. For up to three dimensions the order of the histograms can be selected.
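The spring analogy used for RadViz above has a simple closed form in its common textbook formulation: with the anchors spaced evenly on the unit circle and the feature values scaled to [0, 1], a point comes to rest at the value-weighted average of the anchor positions. The Python sketch below implements that formulation; whether YALE's plotter uses exactly this variant is an assumption, and the function name `radviz_point` is illustrative.

```python
import math

def radviz_point(values):
    """Project one data point onto the plane, RadViz-style.

    values: feature values already scaled to [0, 1], one per dimension anchor.
    """
    m = len(values)
    # Anchors evenly spaced on the unit circle, one per data column.
    anchors = [
        (math.cos(2 * math.pi * j / m), math.sin(2 * math.pi * j / m))
        for j in range(m)
    ]
    total = sum(values)
    if total == 0:
        return (0.0, 0.0)  # no spring pulls at all: rest at the centre
    # Equilibrium of the springs: value-weighted mean of the anchor positions.
    x = sum(v * ax for v, (ax, _) in zip(values, anchors)) / total
    y = sum(v * ay for v, (_, ay) in zip(values, anchors)) / total
    return (x, y)

pulled_to_first = radviz_point([1.0, 0.0, 0.0, 0.0])
balanced = radviz_point([1.0, 1.0, 1.0, 1.0])
```

A point dominated by one column lands near that column's anchor, while a point with equal values in all columns lands in the centre, which is why the order of the anchors affects how readable the plot is.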

- Andrews curves are another way of visualizing high-dimensional data. Each data point is projected onto a set of orthogonal trigonometric functions and displayed as a curve. It is known that Andrews curves preserve distances, so they have many uses for data analysis and exploration. Outliers and hidden patterns can often be detected well in these plots.

Visualization of Models and other Results
- The result of a learning step is called a model. Some models provide a graphical representation of the learned hypothesis. This screenshot presents a learned decision tree for the widely known "labor negotiations" data set from the UCI repository.
- Results like learned models, performance values, data sets or selected attributes are displayed when the experiment is completed or a breakpoint is reached.
- In cases where no graphical representation of a learned model is available, at least a textual description of the learned model is presented. In this screenshot you see a Stacking model consisting of a rule model (the upper half) and a neural network (starting in the lower half). Both base models are described by simple and understandable texts.
- This is a density plot (similar to a contour plot) of the decision function of a Support Vector Machine (SVM). Almost all SVM implementations in YALE provide a table and a plot view of the learned model. In this screenshot, red points refer to support vectors, blue points to normal training examples. Bluish regions will be predicted negative, reddish regions will be predicted positive.
- Only the support vectors are shown, colorized by the predicted function value for the corresponding data point. Examples on the red side will be predicted positive; examples on the blue side will be predicted negative. There is a perfectly linear separation in two of the dimensions, and it seems that the parameters were not chosen optimally, since the number of support vectors is rather high.
- The alpha values (Lagrange multipliers) of the SVM are plotted against the function values and colorized with the true label. We applied a slight jittering to make more points visible. This model seems to be "well-learned", since only a few points have an alpha value not equal to zero and these are the points with function values approximately
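The Andrews curves mentioned above have a compact definition in their usual textbook form: a data point x = (x1, x2, x3, ...) is mapped to the function f_x(t) = x1/sqrt(2) + x2 sin t + x3 cos t + x4 sin 2t + x5 cos 2t + ... on the interval [-pi, pi]. The Python sketch below evaluates this function and numerically illustrates the distance-preservation property (the integral of the squared difference of two curves equals pi times the squared Euclidean distance of the points). It is a generic illustration, not YALE's plotting code, and the function names are assumptions.

```python
import math

def andrews_curve(x, t):
    """Evaluate f_x(t) = x1/sqrt(2) + x2 sin t + x3 cos t + x4 sin 2t + ..."""
    value = x[0] / math.sqrt(2)
    for i, xi in enumerate(x[1:], start=1):
        harmonic = (i + 1) // 2          # 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            value += xi * math.sin(harmonic * t)
        else:
            value += xi * math.cos(harmonic * t)
    return value

def curve_distance_sq(x, y, steps=2000):
    """Riemann-sum approximation of the integral of (f_x - f_y)^2 over [-pi, pi]."""
    dt = 2 * math.pi / steps
    return sum(
        (andrews_curve(x, -math.pi + k * dt) - andrews_curve(y, -math.pi + k * dt)) ** 2
        for k in range(steps)
    ) * dt

x = [1.0, 2.0, 3.0]
y = [0.0, 0.0, 1.0]
euclidean_sq = sum((a - b) ** 2 for a, b in zip(x, y))   # squared distance of the points
integral = curve_distance_sq(x, y)                       # squared distance of the curves
```

Because the two quantities differ only by the constant factor pi, points that are far apart in the original space yield curves that are far apart, which is what makes outliers stand out in such plots.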

- This surface plot presents the result of a meta optimization experiment in which the parameters of one of the operators are optimized. The plot can be rotated and zoomed.

WEKA & YALE Comparison
- You tell me in your report.
- Now let's go through the first assignment.

1st Assignment
- nment1.pdf
- My advice is to come back to this assignment, and to the WEKA and YALE tools, after each forthcoming lecture to see how things are implemented and can be used in practice.

An Introduction to WEKA Explorer. In part from: Yizhou Sun 2008

An Introduction to WEKA Explorer. In part from: Yizhou Sun 2008 An Introduction to WEKA Explorer In part from: Yizhou Sun 2008 What is WEKA? Waikato Environment for Knowledge Analysis It s a data mining/machine learning tool developed by Department of Computer Science,,

More information

Classification using Weka (Brain, Computation, and Neural Learning)

Classification using Weka (Brain, Computation, and Neural Learning) LOGO Classification using Weka (Brain, Computation, and Neural Learning) Jung-Woo Ha Agenda Classification General Concept Terminology Introduction to Weka Classification practice with Weka Problems: Pima

More information

DATA ANALYSIS WITH WEKA. Author: Nagamani Mutteni Asst.Professor MERI

DATA ANALYSIS WITH WEKA. Author: Nagamani Mutteni Asst.Professor MERI DATA ANALYSIS WITH WEKA Author: Nagamani Mutteni Asst.Professor MERI Topic: Data Analysis with Weka Course Duration: 2 Months Objective: Everybody talks about Data Mining and Big Data nowadays. Weka is

More information

Data Mining With Weka A Short Tutorial

Data Mining With Weka A Short Tutorial Data Mining With Weka A Short Tutorial Dr. Wenjia Wang School of Computing Sciences University of East Anglia (UEA), Norwich, UK Content 1. Introduction to Weka 2. Data Mining Functions and Tools 3. Data

More information

Tutorial on Machine Learning Tools

Tutorial on Machine Learning Tools Tutorial on Machine Learning Tools Yanbing Xue Milos Hauskrecht Why do we need these tools? Widely deployed classical models No need to code from scratch Easy-to-use GUI Outline Matlab Apps Weka 3 UI TensorFlow

More information

Tanagra: An Evaluation

Tanagra: An Evaluation Tanagra: An Evaluation Jessica Enright Jonathan Klippenstein November 5th, 2004 1 Introduction to Tanagra Tanagra was written as an aid to education and research on data mining by Ricco Rakotomalala [1].

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

Jue Wang (Joyce) Department of Computer Science, University of Massachusetts, Boston Feb Outline

Jue Wang (Joyce) Department of Computer Science, University of Massachusetts, Boston Feb Outline Learn to Use Weka Jue Wang (Joyce) Department of Computer Science, University of Massachusetts, Boston Feb-09-2010 Outline Introduction of Weka Explorer Filter Classify Cluster Experimenter KnowledgeFlow

More information

RAPIDMINER FREE SOFTWARE FOR DATA MINING, ANALYTICS AND BUSINESS INTELLIGENCE

RAPIDMINER FREE SOFTWARE FOR DATA MINING, ANALYTICS AND BUSINESS INTELLIGENCE RAPIDMINER FREE SOFTWARE FOR DATA MINING, ANALYTICS AND BUSINESS INTELLIGENCE Luigi Grimaudo (luigi.grimaudo@polito.it) DataBase And Data Mining Research Group (DBDMG) Summary RapidMiner project Strengths

More information

Summary. RapidMiner Project 12/13/2011 RAPIDMINER FREE SOFTWARE FOR DATA MINING, ANALYTICS AND BUSINESS INTELLIGENCE

Summary. RapidMiner Project 12/13/2011 RAPIDMINER FREE SOFTWARE FOR DATA MINING, ANALYTICS AND BUSINESS INTELLIGENCE RAPIDMINER FREE SOFTWARE FOR DATA MINING, ANALYTICS AND BUSINESS INTELLIGENCE Luigi Grimaudo (luigi.grimaudo@polito.it) DataBase And Data Mining Research Group (DBDMG) Summary RapidMiner project Strengths

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

What is KNIME? workflows nodes standard data mining, data analysis data manipulation

What is KNIME? workflows nodes standard data mining, data analysis data manipulation KNIME TUTORIAL What is KNIME? KNIME = Konstanz Information Miner Developed at University of Konstanz in Germany Desktop version available free of charge (Open Source) Modular platform for building and

More information

Data Mining. Introduction. Piotr Paszek. (Piotr Paszek) Data Mining DM KDD 1 / 44

Data Mining. Introduction. Piotr Paszek. (Piotr Paszek) Data Mining DM KDD 1 / 44 Data Mining Piotr Paszek piotr.paszek@us.edu.pl Introduction (Piotr Paszek) Data Mining DM KDD 1 / 44 Plan of the lecture 1 Data Mining (DM) 2 Knowledge Discovery in Databases (KDD) 3 CRISP-DM 4 DM software

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Engineering the input and output Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 7 of Data Mining by I. H. Witten and E. Frank Attribute selection z Scheme-independent, scheme-specific

More information

WEKA Explorer User Guide for Version 3-4

WEKA Explorer User Guide for Version 3-4 WEKA Explorer User Guide for Version 3-4 Richard Kirkby Eibe Frank July 28, 2010 c 2002-2010 University of Waikato This guide is licensed under the GNU General Public License version 2. More information

More information

Community edition(open-source) Enterprise edition

Community edition(open-source) Enterprise edition Suseela Bhaskaruni Rapid Miner is an environment for machine learning and data mining experiments. Widely used for both research and real-world data mining tasks. Software versions: Community edition(open-source)

More information

8. MINITAB COMMANDS WEEK-BY-WEEK

8. MINITAB COMMANDS WEEK-BY-WEEK 8. MINITAB COMMANDS WEEK-BY-WEEK In this section of the Study Guide, we give brief information about the Minitab commands that are needed to apply the statistical methods in each week s study. They are

More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

9. Conclusions. 9.1 Definition KDD

9. Conclusions. 9.1 Definition KDD 9. Conclusions Contents of this Chapter 9.1 Course review 9.2 State-of-the-art in KDD 9.3 KDD challenges SFU, CMPT 740, 03-3, Martin Ester 419 9.1 Definition KDD [Fayyad, Piatetsky-Shapiro & Smyth 96]

More information

CHAPTER 6 EXPERIMENTS

CHAPTER 6 EXPERIMENTS CHAPTER 6 EXPERIMENTS 6.1 HYPOTHESIS On the basis of the trend as depicted by the data Mining Technique, it is possible to draw conclusions about the Business organization and commercial Software industry.

More information

WEKA homepage.

WEKA homepage. WEKA homepage http://www.cs.waikato.ac.nz/ml/weka/ Data mining software written in Java (distributed under the GNU Public License). Used for research, education, and applications. Comprehensive set of

More information

> Data Mining Overview with Clementine

> Data Mining Overview with Clementine > Data Mining Overview with Clementine This two-day course introduces you to the major steps of the data mining process. The course goal is for you to be able to begin planning or evaluate your firm s

More information

Using Decision Boundary to Analyze Classifiers

Using Decision Boundary to Analyze Classifiers Using Decision Boundary to Analyze Classifiers Zhiyong Yan Congfu Xu College of Computer Science, Zhejiang University, Hangzhou, China yanzhiyong@zju.edu.cn Abstract In this paper we propose to use decision

More information

Contents. ACE Presentation. Comparison with existing frameworks. Technical aspects. ACE 2.0 and future work. 24 October 2009 ACE 2

Contents. ACE Presentation. Comparison with existing frameworks. Technical aspects. ACE 2.0 and future work. 24 October 2009 ACE 2 ACE Contents ACE Presentation Comparison with existing frameworks Technical aspects ACE 2.0 and future work 24 October 2009 ACE 2 ACE Presentation 24 October 2009 ACE 3 ACE Presentation Framework for using

More information

1. Basic Steps for Data Analysis Data Editor. 2.4.To create a new SPSS file

1. Basic Steps for Data Analysis Data Editor. 2.4.To create a new SPSS file 1 SPSS Guide 2009 Content 1. Basic Steps for Data Analysis. 3 2. Data Editor. 2.4.To create a new SPSS file 3 4 3. Data Analysis/ Frequencies. 5 4. Recoding the variable into classes.. 5 5. Data Analysis/

More information

Evaluation Report on PolyAnalyst 4.6

Evaluation Report on PolyAnalyst 4.6 1. INTRODUCTION CMPUT695: Assignment#2 Evaluation Report on PolyAnalyst 4.6 Hongqin Fan and Yunping Wang PolyAnalyst 4.6 professional edition (PA) is a commercial data mining tool developed by Megaputer

More information

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa KNIME TUTORIAL Anna Monreale KDD-Lab, University of Pisa Email: annam@di.unipi.it Outline Introduction on KNIME KNIME components Exercise: Data Understanding Exercise: Market Basket Analysis Exercise:

More information

KNIME for the life sciences Cambridge Meetup

KNIME for the life sciences Cambridge Meetup KNIME for the life sciences Cambridge Meetup Greg Landrum, Ph.D. KNIME.com AG 12 July 2016 What is KNIME? A bit of motivation: tool blending, data blending, documentation, automation, reproducibility More

More information

Introduction to Data Science. Introduction to Data Science with Python. Python Basics: Basic Syntax, Data Structures. Python Concepts (Core)

Introduction to Data Science. Introduction to Data Science with Python. Python Basics: Basic Syntax, Data Structures. Python Concepts (Core) Introduction to Data Science What is Analytics and Data Science? Overview of Data Science and Analytics Why Analytics is is becoming popular now? Application of Analytics in business Analytics Vs Data

More information

Weka ( )

Weka (  ) Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) The phases in which classifier s design can be divided are reflected in WEKA s Explorer structure: Data pre-processing (filtering) and representation Supervised

More information

Application of Data Mining in Manufacturing Industry

Application of Data Mining in Manufacturing Industry International Journal of Information Sciences and Application. ISSN 0974-2255 Volume 3, Number 2 (2011), pp. 59-64 International Research Publication House http://www.irphouse.com Application of Data Mining

More information

Weka: Practical machine learning tools and techniques with Java implementations

Weka: Practical machine learning tools and techniques with Java implementations Weka: Practical machine learning tools and techniques with Java implementations AI Tools Seminar University of Saarland, WS 06/07 Rossen Dimov 1 Supervisors: Michael Feld, Dr. Michael Kipp, Dr. Alassane

More information

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Exploratory data analysis tasks Examine the data, in search of structures

More information

Points Lines Connected points X-Y Scatter. X-Y Matrix Star Plot Histogram Box Plot. Bar Group Bar Stacked H-Bar Grouped H-Bar Stacked

Points Lines Connected points X-Y Scatter. X-Y Matrix Star Plot Histogram Box Plot. Bar Group Bar Stacked H-Bar Grouped H-Bar Stacked Plotting Menu: QCExpert Plotting Module graphs offers various tools for visualization of uni- and multivariate data. Settings and options in different types of graphs allow for modifications and customizations

More information

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1 Preface to the Second Edition Preface to the First Edition vii xi 1 Introduction 1 2 Overview of Supervised Learning 9 2.1 Introduction... 9 2.2 Variable Types and Terminology... 9 2.3 Two Simple Approaches

More information

Artificial Neural Networks (Feedforward Nets)

Artificial Neural Networks (Feedforward Nets) Artificial Neural Networks (Feedforward Nets) y w 03-1 w 13 y 1 w 23 y 2 w 01 w 21 w 22 w 02-1 w 11 w 12-1 x 1 x 2 6.034 - Spring 1 Single Perceptron Unit y w 0 w 1 w n w 2 w 3 x 0 =1 x 1 x 2 x 3... x

More information

PROJECT 1 DATA ANALYSIS (KR-VS-KP)

PROJECT 1 DATA ANALYSIS (KR-VS-KP) PROJECT 1 DATA ANALYSIS (KR-VS-KP) Author: Tomáš Píhrt (xpiht00@vse.cz) Date: 12. 12. 2015 Contents 1 Introduction... 1 1.1 Data description... 1 1.2 Attributes... 2 1.3 Data pre-processing & preparation...

More information

Enterprise Miner Version 4.0. Changes and Enhancements

Enterprise Miner Version 4.0. Changes and Enhancements Enterprise Miner Version 4.0 Changes and Enhancements Table of Contents General Information.................................................................. 1 Upgrading Previous Version Enterprise Miner

More information

Gain Greater Productivity in Enterprise Data Mining

Gain Greater Productivity in Enterprise Data Mining Clementine 9.0 Specifications Gain Greater Productivity in Enterprise Data Mining Discover patterns and associations in your organization s data and make decisions that lead to significant, measurable

More information

Now, Data Mining Is Within Your Reach

Now, Data Mining Is Within Your Reach Clementine Desktop Specifications Now, Data Mining Is Within Your Reach Data mining delivers significant, measurable value. By uncovering previously unknown patterns and connections in data, data mining

More information

Data Science. Data Analyst. Data Scientist. Data Architect
