Data Mining Tools. Jean-Gabriel Ganascia LIP6 University Pierre et Marie Curie 4, place Jussieu, Paris, Cedex 05


Data Mining Tools Jean-Gabriel Ganascia LIP6 University Pierre et Marie Curie 4, place Jussieu, 75252 Paris, Cedex 05 Jean-Gabriel.Ganascia@lip6.fr

[Figure: the data mining process, from databases (DB) to evaluation. Stage labels: selection, pre-treatment, reformulation (domain knowledge, reducing dimensions), data mining and extraction (supervised and non-supervised; symbolic data and sequences), interpretation/visualization (graphs, rules, 3D, AR, VR), evaluation. Data access labels: SQL/OQL, ad hoc, Google, Yahoo, AltaVista. Systems named: Wspot, ID3, C4.5, CHARADE, Cobweb, FLEXPAT, FOIL, REMO, COING. Credit: Equipe ACASA, LIP6, UPMC, Sorbonne Universités.]

Free Tools
  R-project: statistical library
  TANAGRA, Sipina (Lyon): http://eric.univ-lyon2.fr/~ricco/tanagra/fr/tanagra.html
  Weka: New Zealand (Java language)
  Orange: Slovenia (Python language)
  RapidMiner (Yale)
  AlphaMiner
  Mallet: Machine Learning for Language Toolkit (Java language), University of Massachusetts, http://mallet.cs.umass.edu

What do those tools contain?
  Input file
  File format: .tab, ARFF, etc.

Input type: .tab
  Line 1: attribute names
  Line 2: attribute types
  Line 3: class
  Separator: tab

  Example file lenses.tab
    age         prescription  astigmatic  tear_rate  lenses
    discrete    discrete      discrete    discrete   discrete
                                                     class
    young       myope         no          reduced    none
    young       myope         no          normal     soft
    presbyopic  hypermetrope  yes         normal     none
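The three header lines described above can be read with the standard csv module. The sketch below is only an illustration: the function name read_tab is hypothetical, and it assumes that the word "class" on line 3 sits in the column of the class attribute.

```python
import csv
import io

# Tab-separated sample in the three-header-line layout (names, types, class flag).
LENSES_TAB = (
    "age\tprescription\tastigmatic\ttear_rate\tlenses\n"
    "discrete\tdiscrete\tdiscrete\tdiscrete\tdiscrete\n"
    "\t\t\t\tclass\n"
    "young\tmyope\tno\treduced\tnone\n"
    "young\tmyope\tno\tnormal\tsoft\n"
    "presbyopic\thypermetrope\tyes\tnormal\tnone\n"
)


def read_tab(stream):
    """Return (names, types, class_name, rows) read from a .tab stream."""
    reader = csv.reader(stream, delimiter="\t")
    names = next(reader)   # line 1: attribute names
    types = next(reader)   # line 2: attribute types
    flags = next(reader)   # line 3: "class" marks the class column (assumption)
    flags = flags + [""] * (len(names) - len(flags))
    class_name = next(
        (name for name, flag in zip(names, flags) if flag.strip() == "class"),
        None,
    )
    rows = [dict(zip(names, row)) for row in reader if row]
    return names, types, class_name, rows


names, types, class_name, rows = read_tab(io.StringIO(LENSES_TAB))
print(class_name, len(rows))
```

For a file on disk, the same function can be called on an object returned by open(path, newline="").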

ARFF input
  Attribute-Relation File Format
  Header
    Comments are preceded by %
    @RELATION <relation name> (1 line)
    @ATTRIBUTE <attribute name> <attribute type> (list of all attributes, 1 per line)
  @DATA
    <val A1>, <val A2>, ... (list of all examples, 1 per line)
  Attribute types
    numeric
    <nominal-specification>: set of values
    string (in quotes if the string contains blanks)
    date [<date format>]

Example ARFF header
  % 1. Title: Iris Plants Database
  %
  % 2. Sources:
  %      (A) Creator: R.A. Fisher
  %      (B) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
  %      (C) Date: July, 1988
  %
  @RELATION iris
  @ATTRIBUTE sepallength NUMERIC
  @ATTRIBUTE sepalwidth NUMERIC
  @ATTRIBUTE petallength NUMERIC
  @ATTRIBUTE petalwidth NUMERIC
  @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

Example ARFF data
  @DATA
  5.1,3.5,1.4,0.2,Iris-setosa
  4.9,3.0,1.4,0.2,Iris-setosa
  4.7,3.2,1.3,0.2,Iris-setosa
  4.6,3.1,1.5,0.2,Iris-setosa
  5.0,3.6,1.4,0.2,Iris-setosa
  5.4,3.9,1.7,0.4,Iris-setosa
  4.6,3.4,1.4,0.3,Iris-setosa
  5.0,3.4,1.5,0.2,Iris-setosa
  4.4,2.9,1.4,0.2,Iris-setosa
  4.9,3.1,1.5,0.1,Iris-setosa
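To make the ARFF layout concrete, here is a minimal reader sketch in Python. It is not Weka's parser: the name parse_arff is hypothetical, and the code assumes unquoted attribute names, dense data rows, and values without embedded commas.

```python
SAMPLE_ARFF = """\
% Iris sample
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
"""


def parse_arff(text):
    """Return (relation, attributes, data) from dense ARFF text."""
    relation, attributes, data = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            # One example per line, values separated by commas.
            data.append([value.strip() for value in line.split(",")])
            continue
        keyword = line.split(None, 1)[0].upper()  # keywords are case-insensitive
        if keyword == "@RELATION":
            relation = line.split(None, 1)[1]
        elif keyword == "@ATTRIBUTE":
            _, name, attr_type = line.split(None, 2)
            attributes.append((name, attr_type))
        elif keyword == "@DATA":
            in_data = True
    return relation, attributes, data


relation, attributes, data = parse_arff(SAMPLE_ARFF)
print(relation, len(attributes), len(data))
```

The values are kept as strings; converting the NUMERIC columns to float is left to the caller.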

Sparse ARFF
  Useful when there are many null (0) values
  Same format as ARFF, except for the data section
  Non-null values are identified by the index of their attribute (starting from 0)

  Example ARFF
    @data
    0, X, 0, Y, "class A"
    0, 0, W, 0, "class B"

  Example Sparse ARFF
    @data
    {1 X, 3 Y, 4 "class A"}
    {2 W, 4 "class B"}

  Remark: values that are absent correspond to 0; missing values are identified with ?
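The correspondence between a dense row and a sparse row can be sketched in a few lines of Python. The helper names to_sparse and to_dense are hypothetical, values are handled as strings, and the sketch assumes that no value contains a comma.

```python
def to_sparse(row, zero="0"):
    """Keep only the non-zero values, each prefixed by its 0-based attribute index."""
    pairs = [f"{index} {value}" for index, value in enumerate(row) if value != zero]
    return "{" + ", ".join(pairs) + "}"


def to_dense(sparse, n_attributes, zero="0"):
    """Rebuild the full row: every index that is absent gets the value 0."""
    row = [zero] * n_attributes
    body = sparse.strip()[1:-1].strip()  # drop the surrounding braces
    if body:
        for pair in body.split(","):
            index, value = pair.strip().split(None, 1)
            row[int(index)] = value
    return row


dense_row = ["0", "X", "0", "Y", '"class A"']
sparse_row = to_sparse(dense_row)
print(sparse_row)
print(to_dense(sparse_row, 5))
```

A missing value written as ? is not equal to "0", so it is kept explicitly in the sparse row rather than being dropped.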

Other steps
  Data preparation
    Feature selection
    Data selection
    Digitalization
    Sampling
    Outliers
    File merging (join)
    Concatenation
  Data visualization
  Classification
  Regression
  Evaluation
  Non supervised learning
  Association rules
  Text mining

Data visualization
  Exploratory Data Analysis
  Distributions
  Linear projection
  Attribute statistics
  Correspondence analysis
  Mosaic diagrams

Classification
  Bayesian classification
  Logistic regression
  K nearest neighbor
  Trees
  C4.5
  CN2
  SVM
  Visualization of the classification
    Trees
    CN2 rules
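As an illustration of one classifier from this list, here is a minimal k nearest neighbor sketch in plain Python. The function name knn_predict and the toy data are assumptions made for the example; the tools presented here provide their own implementations.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority vote among the k training examples closest to the query.

    train is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional toy data with two well separated classes.
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]
print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.0, 4.8)))
```

There is no training phase: all the work happens at prediction time, which is why the cost grows with the size of the training set.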

Non supervised learning
  Distance matrix between examples
  Distance matrix between attributes
  Dendrograms
  K-means
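Two of these items, the distance matrix between examples and K-means, can be sketched as follows. This is a simplified illustration: the function names are hypothetical, the distance is Euclidean, and the centroids are naively initialized with the first k points, whereas real tools usually use random or smarter initialization.

```python
import math


def distance_matrix(points):
    """Symmetric matrix of Euclidean distances between all pairs of examples."""
    return [[math.dist(p, q) for q in points] for p in points]


def k_means(points, k, iterations=10):
    """Alternate assignment and centroid update steps for a fixed number of rounds."""
    centroids = [tuple(p) for p in points[:k]]  # naive initialization (assumption)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # New centroid = mean of the cluster; an empty cluster keeps its centroid.
        centroids = [
            tuple(sum(column) / len(cluster) for column in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    assignments = [
        min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
    ]
    return centroids, assignments


points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
centroids, assignments = k_means(points, k=2)
print(centroids, assignments)
```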

Evaluation of supervised learning
  Separation
    Random
    Leave one out
    Cross validation
  Indices
    Precision-recall
    ROC
  Test: training set / test set
  Confusion matrix
  ROC analysis
  Prediction
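The separation schemes and indices listed above can be illustrated with a short Python sketch. The function names are hypothetical. The folds are built without shuffling so that the example is reproducible; a random separation would shuffle the indices first.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Yield (train, test) index lists; with k == n this is leave one out."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def confusion_matrix(actual, predicted):
    """Count each (actual class, predicted class) pair."""
    return Counter(zip(actual, predicted))


def precision_recall(matrix, positive):
    """Precision and recall of one class, read from the confusion matrix."""
    tp = matrix[(positive, positive)]
    fp = sum(n for (a, p), n in matrix.items() if p == positive and a != positive)
    fn = sum(n for (a, p), n in matrix.items() if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels, using class names from the lenses example.
actual = ["soft", "none", "none", "soft", "none"]
predicted = ["soft", "none", "soft", "none", "none"]
matrix = confusion_matrix(actual, predicted)
print(precision_recall(matrix, "soft"))
print(list(k_fold_indices(6, 3)))
```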

Association rules
  Extraction of association rules
  Visualization of association rules
  Frequent sets
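To show what frequent sets and rules look like, here is a brute-force sketch in Python. It enumerates every candidate itemset, which is only workable on tiny data; it is not the Apriori algorithm or the method used by any particular tool. The function names and the transactions are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Map each itemset whose support reaches min_support to that support."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            # Support = fraction of transactions that contain the whole itemset.
            support = sum(1 for t in transactions if set(candidate) <= t) / n
            if support >= min_support:
                result[frozenset(candidate)] = support
    return result


def confidence(itemsets, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    both = frozenset(antecedent) | frozenset(consequent)
    return itemsets[both] / itemsets[frozenset(antecedent)]


transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
itemsets = frequent_itemsets(transactions, min_support=0.5)
print(len(itemsets))
print(confidence(itemsets, {"bread"}, {"milk"}))
```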

Specialized applications
  Bioinformatics
    Genome databases
    Gene selection
    Profiles
  Text mining
    Text file
    Preprocessing (TF.IDF, lemmatization, stemming, ...)
    Bags of words
    N-grams of characters
    N-grams of words
    Feature extraction
    Distance
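The text mining items can be illustrated with a small Python sketch covering character n-grams, word n-grams, and bags of words weighted by TF.IDF. The function names are hypothetical, tokenization is a plain whitespace split, and the weighting shown (relative term frequency times log(N / document frequency)) is only one common variant.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """All substrings of n consecutive characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """All sequences of n consecutive lowercased words."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def tf_idf(documents):
    """One {word: weight} dictionary per document."""
    bags = [Counter(doc.lower().split()) for doc in documents]  # bags of words
    n_docs = len(bags)
    doc_freq = Counter(word for bag in bags for word in bag)
    return [
        {
            word: (count / sum(bag.values())) * math.log(n_docs / doc_freq[word])
            for word, count in bag.items()
        }
        for bag in bags
    ]


documents = ["data mining tools", "data mining with weka", "orange data library"]
weights = tf_idf(documents)
print(char_ngrams("weka", 2))
print(word_ngrams("Data Mining Tools", 2))
print(weights[1])
```

A word that occurs in every document, such as "data" here, gets weight 0, which is the intended effect of the IDF factor.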

SPMF: An Open-Source Data Mining Library
  http://www.philippe-fournier-viger.com/spmf/
  Pattern Mining
  Sequential Rule Mining
  Itemset Mining

Weka Written in Java

Weka http://www.cs.waikato.ac.nz/ml/weka/

Orange
  University of Ljubljana, Slovenia
  Programmed with Python
  http://www.ailab.si/orange/
  Machine ARI: orange-canvas