Data Mining. Introduction. Piotr Paszek. (Piotr Paszek) Data Mining DM KDD 1 / 44

Data Mining: Introduction
Piotr Paszek, piotr.paszek@us.edu.pl

Plan of the lecture
1. Data Mining (DM)
2. Knowledge Discovery in Databases (KDD)
3. CRISP-DM
4. DM software
5. DM tasks (algorithms)
6. DM fields of use

Recommended Reference Books
1. J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed., 2011 (2nd ed., 2006).
2. I. Witten, E. Frank, and M. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 3rd ed., 2011.
3. P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Wiley, 2005 (2nd ed., 2016).

Data Mining Course (Module) Information
Fall 2017
Class schedule (see http://plan.ii.us.edu.pl, search: MinCS)
Lecture: Tuesday (even weeks), room B219, 17.45-19.15.
Exercises (laboratory): Tuesday, room B203, 16.15-17.45.
Materials for the lecture: slides (PDF files) at http://zsi.tech.us.edu.pl/~ppaszek/pliki/dm

Data Mining Etymology
In the 1960s, statisticians used terms like data fishing or data dredging to refer to what they considered the bad practice of analysing data without an a priori hypothesis.
The term data mining appeared around 1990 in the database community.
Other terms used include data archaeology, information harvesting, information discovery, and knowledge extraction.
Gregory Piatetsky-Shapiro coined the term knowledge discovery in databases, which became more popular in the AI and machine learning communities.
The term data mining, however, became more popular in the business and press communities.
Currently, the terms data mining and knowledge discovery are often used interchangeably.

Data mining (definition?)
Data mining is the computing process of discovering patterns in large data sets, involving methods at the intersection of machine learning, statistics, and database systems.
It is an essential process in which intelligent methods are applied to extract data patterns.
It is an interdisciplinary subfield of computer science.
The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.
https://en.wikipedia.org/wiki/data_mining

Data Mining: Confluence of Multiple Disciplines

Data mining
Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. The idea is to build computer programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if found, will likely generalize to make accurate predictions on future data. ...
Machine learning provides the technical basis for data mining. It is used to extract information from the raw data in databases. ...
Data mining is defined as the process of discovering patterns in data. The process must be automatic or semiautomatic. The patterns discovered must be meaningful in that they lead to some advantage, usually an economic one.
Ian Witten, Eibe Frank, Mark Hall. Data Mining: Practical Machine Learning Tools and Techniques. Third Edition. Morgan Kaufmann Publishers, 2011.

Data mining
Data mining, also popularly referred to as Knowledge Discovery in Databases (KDD), is the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, other massive information repositories, or data streams.
Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques. Second Edition. Morgan Kaufmann Publishers, 2006.

Knowledge Discovery in Databases (KDD)
The KDD field is concerned with the development of methods and techniques for making sense of data. ...
At the core of the process is the application of specific data-mining methods for pattern discovery and extraction. ...
KDD refers to the overall process of discovering useful knowledge from data, and data mining refers to a particular step in this process. Data mining is the application of specific algorithms for extracting patterns from data.
Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth. From Data Mining to Knowledge Discovery in Databases. AI Magazine, 17(3): 37-54, 1996.

KDD process
1. Understand the application domain and the goal of the process.
2. Create the target dataset as a subset of all the available data.
3. Clean and preprocess the data: remove noise and handle missing data and outliers.
4. Reduce and project the data in order to focus on the features that are relevant to the problem.
5. Match the goals of the process to a data mining method: decide the purpose of the model, such as summarization or classification.
6. Choose the data mining algorithms that match the purpose of the model (from step 5).
7. Data mining: run the algorithms on the data.
8. Interpret the mined patterns to make them understandable to the user, for example through summarization and visualization.
9. Act on the discovered knowledge, for example by reporting or making decisions.
U. Fayyad, G. Piatetsky-Shapiro, P. Smyth. The KDD Process for Extracting Useful Knowledge from Volumes of Data. Communications of the ACM, 39(11): 27-34, 1996.

KDD
U. Fayyad, G. Piatetsky-Shapiro, P. Smyth. From Data Mining to Knowledge Discovery in Databases. AI Magazine, 17(3): 37-54, 1996.

KDD Process
1. Data cleaning, to remove noise and inconsistent data.
2. Data integration, where multiple data sources may be combined.
3. Data selection, where data relevant to the analysis task are retrieved from the database.
4. Data transformation, where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations.
5. Data mining, an essential process where intelligent methods are applied to extract data patterns.
6. Pattern evaluation, to identify the truly interesting patterns representing knowledge, based on interestingness measures.
7. Knowledge presentation, where visualization and knowledge representation techniques are used to present mined knowledge to users.
Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques. Second Edition. Morgan Kaufmann Publishers, 2006.
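
The seven steps above can be sketched as a chain of small functions. This is only a toy illustration, not code from the lecture or from Han and Kamber: the records, the field names (`age`, `spend`), the age bands, and the `min_gap` threshold are all hypothetical, and the "mining" step is reduced to a per-group mean.

```python
from statistics import mean

# Hypothetical toy sources standing in for two databases.
source_a = [
    {"id": 1, "age": 34, "spend": 120.0},
    {"id": 2, "age": None, "spend": 80.0},   # missing value
    {"id": 3, "age": 51, "spend": 300.0},
]
source_b = [
    {"id": 4, "age": 29, "spend": 95.0},
    {"id": 5, "age": 47, "spend": -10.0},    # inconsistent value
]

def clean(records):
    """Step 1: drop records with missing or inconsistent values."""
    return [r for r in records if r["age"] is not None and r["spend"] >= 0]

def integrate(*sources):
    """Step 2: combine several data sources into one list."""
    return [r for src in sources for r in src]

def select(records, fields):
    """Step 3: keep only the attributes relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Step 4: consolidate ages into two bands (an assumed encoding)."""
    return [
        {"band": "under40" if r["age"] < 40 else "40plus", "spend": r["spend"]}
        for r in records
    ]

def mine(records):
    """Step 5: a trivial 'pattern': mean spend per age band."""
    groups = {}
    for r in records:
        groups.setdefault(r["band"], []).append(r["spend"])
    return {band: mean(values) for band, values in groups.items()}

def evaluate(patterns, min_gap=50.0):
    """Step 6: keep the pattern only if the bands differ by at least min_gap."""
    gap = max(patterns.values()) - min(patterns.values())
    return patterns if gap >= min_gap else {}

def present(patterns):
    """Step 7: render the patterns as readable lines."""
    return [f"{band}: mean spend {value:.1f}" for band, value in sorted(patterns.items())]

def run_pipeline():
    data = integrate(clean(source_a), clean(source_b))
    data = transform(select(data, ["age", "spend"]))
    return present(evaluate(mine(data)))

for line in run_pipeline():
    print(line)
```

In a real project each step would be far richer, and the steps are usually revisited several times rather than run once in sequence.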

KDD Process
Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques. Second Edition. Morgan Kaufmann Publishers, 2006.

Data Mining in Business Intelligence

KDD Process
This is a view from the typical machine learning and statistics communities.

CRISP-DM
Cross-Industry Standard Process for Data Mining (CRISP-DM) is a data mining process model that describes the approaches data mining experts commonly use to tackle problems.

Phases of the CRISP-DM reference model
P. Chapman, J. Clinton et al. CRISP-DM 1.0: Step-by-step data mining guide. 2000.

CRISP-DM major phases
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
P. Chapman, J. Clinton et al. CRISP-DM 1.0: Step-by-step data mining guide. 2000.

CRISP-DM: Business Understanding
This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.
A decision model, especially one built using the Decision Model and Notation standard, can be used.

CRISP-DM: Data Understanding
The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.

CRISP-DM: Data Preparation
The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modelling tool(s)) from the initial raw data.
Data preparation tasks are likely to be performed multiple times, and not in any prescribed order.
Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modelling tools.

CRISP-DM: Modeling
In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values.
Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed.
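
As a rough illustration of what "calibrating parameters" can mean, the sketch below picks a decision threshold for a one-feature classifier by trying a few candidate values on held-out validation data. The data, the candidate grid, and the function names are all hypothetical; CRISP-DM itself does not prescribe any particular calibration method.

```python
def predict(x, threshold):
    """Classify as 1 when the feature reaches the threshold, else 0."""
    return 1 if x >= threshold else 0

def accuracy(data, threshold):
    """Fraction of (feature, label) pairs classified correctly."""
    correct = sum(1 for x, y in data if predict(x, threshold) == y)
    return correct / len(data)

def calibrate(validation, candidates):
    """Return the candidate threshold with the best validation accuracy."""
    return max(candidates, key=lambda t: accuracy(validation, t))

# Hypothetical validation set: (feature value, true class).
validation = [(0.1, 0), (0.3, 0), (0.4, 0), (0.6, 1), (0.7, 1), (0.9, 1)]
candidates = [0.2, 0.5, 0.8]

best = calibrate(validation, candidates)
print(best, accuracy(validation, best))
```

The same loop generalizes to any model parameter: fit or configure the model for each candidate value, score it on data not used for fitting, and keep the best.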

CRISP-DM: Evaluation
At this stage in the project you have built a model (or models) that appears to have high quality from a data analysis perspective.
Before proceeding to final deployment of the model, it is important to evaluate the model more thoroughly and review the steps executed to construct it, to be certain it properly achieves the business objectives.
A key objective is to determine whether there is some important business issue that has not been sufficiently considered.
At the end of this phase, a decision on the use of the data mining results should be reached.

CRISP-DM: Deployment
Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that is useful to the customer.
Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data scoring (e.g. segment allocation) or data mining process.
In many cases it will be the customer, not the data analyst, who will carry out the deployment steps. Even if the analyst deploys the model, it is important for the customer to understand up front the actions that will need to be carried out in order to actually make use of the created models.

DM Software
Best free DM software (alphabetical order):
KNIME Analytics Platform
Orange Data mining
R Software Environment, Rattle GUI
RapidMiner Studio
Rough Set Exploration System
Weka Data Mining

KNIME Analytics Platform
The Konstanz Information Miner (KNIME) is an open source data analytics, reporting, and integration platform.
KNIME integrates various components for machine learning and data mining through its modular data pipelining concept. Its graphical user interface allows the assembly of nodes for data preprocessing, modelling, data analysis, and visualization.
KNIME Analytics Platform provides over 1000 data analytic routines, either natively or through R and Weka.
KNIME is written in Java and based on Eclipse; it uses the Eclipse extension mechanism to add plugins that provide additional functionality.

KNIME (screenshot)

Orange Data mining
Orange is an open source data visualization and analysis tool.
Orange is developed at the University of Ljubljana, Slovenia, together with the open source community.
Data mining is done through visual programming or Python scripting.
Orange is also a Python library.
Orange consists of a canvas interface onto which the user places widgets and creates a data analysis workflow.
In Orange, the data analysis process can be designed through visual programming.
Orange runs on many platforms (Windows, Mac OS X, Linux).
Orange can read files in its native format and in other data formats.
Orange is devoted to machine learning methods for classification, or supervised data mining.

ORANGE (screenshot)

R Software Environment
R is a free software environment for statistical computing and graphics.
It compiles and runs on a wide variety of UNIX platforms, Windows, and MacOS.
R is an integrated suite of software facilities for data manipulation, calculation, and graphical display.
The R language is widely used among statisticians and data miners for developing statistical software and data analysis.
R provides a wide variety of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others.

R (screenshot)

Rattle GUI
The R Analytical Tool To Learn Easily (Rattle) is a popular GUI for data mining using R. It is Free Open Source Software.
Rattle runs on many platforms (Windows, Mac OS X, Linux).
It presents statistical and visual summaries of data, transforms data so that it can be readily modeled, builds both unsupervised and supervised models from the data, presents the performance of models graphically, and scores new datasets.
One of the most important features is that all of the user's interactions through the graphical user interface are captured as an R script that can be readily executed in R independently of the Rattle interface.
Through a simple and logical graphical user interface based on Gnome, Rattle can be used by itself to deliver data mining projects.

Rattle (screenshot)

RapidMiner Studio
RapidMiner Studio is a visual design environment for machine learning, data mining, text mining, predictive analytics, and business analytics.
It provides a deep library of machine learning algorithms, data preparation and exploration functions, and model validation tools to support data science projects and use cases.
Data science teams can easily re-use existing R and Python code, and add new functionality via a large marketplace of pre-built extensions.
RapidMiner supports all steps of the data mining process, including results visualization, validation, and optimization.
RapidMiner is written in the Java programming language.
RapidMiner provides learning schemes, models, and algorithms from Weka and R scripts that can be used through extensions.

RapidMiner (screenshot)

Rough Set Exploration System
Rough Set Exploration System (RSES) is a toolset for analysing data with methods from Rough Set Theory.
It is a graphical, user-friendly front-end that runs under Windows and provides access to the methods of the RSESlib library.
RSESlib is the core of the RSES computational kernel. Both the library and the GUI were designed and implemented at Warsaw University.
RSESlib is a library of functions for performing various data exploration tasks, such as calculation of reducts, generation of decision rules, classification, discretization, decomposition, search for patterns in data, and data manipulation.
The library is implemented in Java.
The first version of the library was included in the computational kernel of the ROSETTA system.

RSES (screenshot)

Weka Data Mining
Weka is a collection of machine learning algorithms for data mining tasks.
The algorithms can either be applied directly to a dataset or called from your own Java code.
Weka features include machine learning, data mining, preprocessing, classification, regression, clustering, association rules, attribute selection, experiments, workflow, and visualization.
Weka is written in Java and was developed at the University of Waikato, New Zealand.
It runs on many platforms (Windows, Mac OS X, Linux).
Weka is open source software issued under the GNU General Public License.

Weka (screenshot)

Data Mining Tasks
Anomaly or Outlier Detection (descriptive or predictive)
Association Rule (descriptive)
Clustering (descriptive)
Classification (predictive)
Regression (predictive)
Summarization (descriptive)

Data Mining - tasks I
Anomaly detection (outlier/change/deviation detection)
The identification of unusual data records that might be interesting, or of data errors that require further investigation.
Association rule learning (dependency modelling)
Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits and use association rule learning to determine which products are frequently bought together.
Clustering
Discovering groups and structures in the data that are in some way or another similar, without using known structures in the data.
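
To make the supermarket example concrete, here is a minimal sketch of the two standard measures used to score an association rule, support and confidence. The baskets are invented for illustration, and the helper names are hypothetical; real tools (for example the Apriori implementations in Weka or R) also search for candidate rules, which this sketch does not do.

```python
# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions with the antecedent, the fraction that also
    contain the consequent (an estimate of P(consequent | antecedent))."""
    both = count(antecedent | consequent, transactions)
    return both / count(antecedent, transactions)

# Score the rule {bread} -> {butter}.
print(support({"bread", "butter"}, baskets))
print(confidence({"bread"}, {"butter"}, baskets))
```

A rule is usually reported only when both measures exceed user-chosen minimum thresholds.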

Data Mining - tasks II
Classification
Building a model that describes how to classify (assign) data items into one of a set of predefined classes. For example, an e-mail program might attempt to classify an e-mail as legitimate or as spam.
Regression
Predicting the value of a given (continuous) feature based on the values of other features in the data, assuming a linear or non-linear model of dependency.
Summarization
Providing a more compact representation of the data set, including visualization and report generation.
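
As a small illustration of the regression task, the sketch below fits a straight line to one input feature with ordinary least squares and uses it to predict a new value. The data points are made up (they lie exactly on y = 2x + 1), and `fit_line` is a hypothetical helper, not a function from any of the tools listed earlier.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predict the target for x = 5
print(slope, intercept, prediction)
```

Classification differs only in the kind of target: the model outputs one of a fixed set of classes (such as spam or legitimate) instead of a continuous value.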

Data mining - fields of application
medicine
banking
finance
insurance
stock exchange
e-commerce
customer segmentation
crime detection, fraud detection