Machine Learning Track
- Drusilla Harper
Transcription
1 Intel HPC Developer Convention, Salt Lake City 2016. Machine Learning Track. Data Analytics, Machine Learning and HPC in today's changing application environment. Franz J. Király
2 An overview of data analytics
[Diagram; labels: Statistical Programming (practical): R, python. DATA. Scientific Questions. Exploration. Statistical Questions. Quantitative Modelling. Methods. The Scientific Method. Descriptive/Explanatory. Predictive/Inferential. Scientific and Statistical Validation. Knowledge.]
3 Data analytics and data science in a broader context
Raw data -> Clean data -> Data analytics (statistics, modelling, data mining, machine learning) -> Knowledge
- Raw data and clean data: a lot of problems and subtleties arise at these stages already. Often, most of the manpower in a data project needs to go here first, before one can attempt reliable data analytics.
- Knowledge: relevant findings and the underlying arguments need to be explained well and properly.
4 Big Data?
5 What Big Data may mean in practice
[Chart axes: number of features; number of data samples.]
Strategies that stop working in reasonable time: manual exploratory data analysis; kernel methods, OLS; random forests; L1, LASSO (around the same order); super-linear algorithms; linear algorithms, including reading in all the data.
Solution strategies: feature extraction; feature selection; large-scale strategies for super-linear algorithms; on-line models; distributed computing; sub-sampling.
6 Large-scale motifs in data science = where high-performance computing is helpful/impactful
- Big models: not necessarily a lot of data, but computationally intensive models. Classical example: finite elements and other numerical models. New fancy example: large neural networks, aka "deep learning". Common HPC motif: divide/conquer in parts-of-model, e.g. neurons/nodes.
- Big data (the classic, beloved by everyone) = what it says, a lot of data (ca. 1 million samples or more). The computational challenge arises from processing all of the data. Example: histogram or linear regression with huge amounts of data. Common HPC motif: divide/conquer training/fitting of the model, e.g. batchwise/epoch fitting.
- Model validation and model selection = this talk's focus. Answers the question: which model is best for your data? Demanding even for simple models and small amounts of data! Example: is deep learning better than logistic regression, or guessing?
7 Meta-modelling: stylized case studies
- Customer: hospital specializing in the treatment of patients with a certain disease. Patients with this disease are at risk of experiencing an adverse event (e.g. death). Scientific question: depending on patient characteristics, predict the event risk. Data set: complete clinical records of patients, including the event if it occurred.
- Customer: retailer who wants to accurately model the behaviour of customers. Customers can buy (or not buy) any of a number of products, or churn. Scientific question: predict future customer behaviour given past behaviour. Data set: complete customer and purchase records of customers.
- Customer: manufacturer who wishes to find the best parameter setting for machines. Parameters influence the amount/quality of product (or whether the machine breaks). Scientific question: find the parameter settings which optimize the above. Data set: outcomes for parameter settings on those machines.
Of interest: model interpretability; how accurate the predictions are expected to be; whether the algorithm/model is (easily) deployable in the real world.
Not of interest: which algorithm/strategy, out of many, exactly solves the task.
8 Model validation and model selection = data-centric and data-dependent modelling
A scientific necessity implied by the scientific method and the following:
1. There is no model that is good for all data. (Otherwise the concept of a model would be unnecessary.)
2. For given data, there is no a-priori reason to believe that a certain type of model will be the best one. (Any such belief is not empirically justified, hence pseudoscientific.)
3. No model can be trusted unless its validity has been verified by a model-independent argument. (Otherwise the justification of validity is circular, hence faulty.)
Machine learning provides algorithms & theory for meta-modelling, and powerful algorithms motivated by meta-modelling optimality.
9 Machine Learning and Meta-Modelling in a Nutshell
10 Leitmotifs of Machine Learning
from the intersection of engineering, statistics and computer science
- Engineering & statistics idea: statistical models are objects in their own right (learning machines; modelling strategy).
- Engineering & computer science idea: any abstract algorithm can be a modelling strategy/learning machine, possibly non-explicit (computational learning).
- Computer science & statistics idea: the future performance of an algorithm/learning machine can (and should) be estimated (model validation, model selection).
11 Problem types in Machine Learning
Supervised learning: some data is labelled by an expert/oracle.
Task: predict the label from covariates. Statistical models are usually discriminative.
Examples: regression, classification
12 Problem types in Machine Learning
Unsupervised learning: the training data is not pre-labelled.
Task: find structure or pattern in data. Statistical models are usually generative.
Examples: clustering, dimension reduction
13 Advanced learning tasks
Complications in the labelling:
- Semi-supervised learning: some training data are labelled, some are not.
- Reinforcement learning: data are not directly labelled, only indirect gain/loss.
- Anomaly detection: all or most data are positive examples; the task is to flag test negatives.
Complications through correlated data and/or time:
- On-line learning: the data is revealed with time, models need to update.
- Forecasting: each data point has a time stamp; predict the temporal future.
- Transfer learning: the data comes in dissimilar batches; train and test may be distinct.
14 What is a Learning Machine?
An algorithm that solves, e.g., the previous tasks.
Illustration: supervised learning machine
[Diagram: training data -> model fitting ("learning") -> fitted model; new data observations -> prediction -> predictions (e.g., to base decisions on); model tuning parameters are an input to model fitting.]
Examples: generalized linear model, linear regression, support vector machine, neural networks (= "deep learning"), random forests, gradient boosting, ...
15 Example: Linear Regression
[Diagram: training data -> model fitting ("learning") -> fitted model; new data observations -> prediction -> predictions.]
Tuning parameter: fit intercept or not?
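The fit/predict flow of this slide can be sketched in a few lines. This is a minimal illustration, not code from the talk: the class name SimpleLinearRegression, the one-covariate restriction and the toy data are assumptions made for the example. The fit_intercept flag plays the role of the tuning parameter "fit intercept or not?".

```python
# Hypothetical minimal learning machine: least squares with one covariate.
class SimpleLinearRegression:
    def __init__(self, fit_intercept=True):
        # Tuning parameter from the slide: fit an intercept or not.
        self.fit_intercept = fit_intercept
        self.slope_ = None
        self.intercept_ = 0.0

    def fit(self, x, y):
        # "Model fitting": estimate the coefficients from the training data.
        if self.fit_intercept:
            mx = sum(x) / len(x)
            my = sum(y) / len(y)
        else:
            mx = my = 0.0  # regression through the origin
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        self.slope_ = sxy / sxx
        self.intercept_ = my - self.slope_ * mx
        return self

    def predict(self, x):
        # "Prediction": apply the fitted model to new data observations.
        return [self.intercept_ + self.slope_ * a for a in x]


train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]  # toy data lying exactly on y = 1 + 2x

model = SimpleLinearRegression(fit_intercept=True).fit(train_x, train_y)
predictions = model.predict([4.0, 5.0])

no_intercept = SimpleLinearRegression(fit_intercept=False).fit(train_x, train_y)
```

With the intercept the toy line is recovered exactly; without it the slope is biased upwards, which is the kind of difference a validation step is meant to detect.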
16 Model validation: does the model make sense?
[Diagram: training data -> model learning (e.g. regression, GLM, advanced methods) -> learnt model -> prediction on the hold-out test data (e.g. evaluating the regression model) -> predictions; compare & quantify the predictions against the test labels ("the truth"). Labels: in-sample; out-of-sample; prediction strategy / learning machine.]
Predictive models need to be validated on unseen data!
The only (general) way to test goodness of prediction is actually observing prediction! This means the part of the data used for testing must not have been seen by the algorithm before!
(Note: this includes the case where machine = linear regression, deep learning, etc.)
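A hold-out validation along these lines might look as follows. The data, the 40/10 split and the helper names fit_line and rmse are assumptions for illustration only; the point is that the test indices are never passed to the fitting step.

```python
import math
import random


def fit_line(x, y):
    """Least-squares line through (x, y); returns (intercept, slope)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def rmse(y_true, y_pred):
    """Root mean squared error between labels and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Synthetic data (assumption): y = 1 + 2x plus Gaussian noise.
rng = random.Random(0)
x = [i / 10 for i in range(50)]
y = [1.0 + 2.0 * a + rng.gauss(0, 0.5) for a in x]

# Hold-out split: 10 random points are set aside and never used for fitting.
indices = list(range(len(x)))
rng.shuffle(indices)
test_idx, train_idx = indices[:10], indices[10:]

intercept, slope = fit_line([x[i] for i in train_idx], [y[i] for i in train_idx])


def predict(xs):
    return [intercept + slope * a for a in xs]


# In-sample error (seen data) versus out-of-sample error (unseen data).
in_sample_rmse = rmse([y[i] for i in train_idx], predict([x[i] for i in train_idx]))
out_of_sample_rmse = rmse([y[i] for i in test_idx], predict([x[i] for i in test_idx]))
```

Only out_of_sample_rmse estimates how the model would do on new data; in_sample_rmse is reported here just to make the contrast visible.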
17 Re-sampling
[Diagram: all data is split several times into training data and test data (splits 1, 2, 3); Predictors 1, 2, 3 are fitted on each training set, their errors are computed on the matching test set, and the errors are aggregated for comparison.]
Multiple algorithms are compared on multiple data splits/sub-datasets. State-of-the-art principle in model validation, model comparison and meta-modelling.
Type of re-sampling; how to obtain training/test splits; pros/cons:
- k-fold cross-validation (often k=5). Splits: 1. divide the data into k (almost) equal parts; 2. obtain k train/test splits: each part is test data exactly once, the rest of the data is the training set. Pros/cons: good compromise between runtime and accuracy when k is small compared to the data size.
- leave-one-out = [number of data points]-fold c.v. Pros/cons: very accurate, high run-time.
- repeated sub-sampling (parameters: training/test size, number of repetitions). Splits: 1. obtain a random sub-sample of training/test data of the specified sizes (train/test need not cover all data); 2. repeat step 1 the desired number of times. Pros/cons: can be arbitrarily quick; can be arbitrarily inaccurate (depending on parameter choice); can be combined with k-fold.
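The k-fold recipe above can be written as a small generator. This is a sketch under simplifying assumptions: the function name k_fold_splits is made up for the example, and the data is split in its given order without shuffling (in practice one would usually shuffle the indices first).

```python
def k_fold_splits(n_samples, k=5):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    The data is divided into k (almost) equal parts; each part is the test
    set exactly once and the rest of the data is the training set.
    """
    indices = list(range(n_samples))
    # The first (n_samples % k) folds get one extra point.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 5-fold cross-validation on 12 data points.
splits = list(k_fold_splits(12, k=5))

# Leave-one-out is the special case k = number of data points.
leave_one_out = list(k_fold_splits(4, k=4))
```

Setting k to the number of data points gives leave-one-out, which makes the runtime cost visible: one model fit per data point.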
18 Quantitative model comparison
A benchmarking experiment results in a table like this:
[Table: columns model, RMSE, MAE; entries of the form estimate ± error. Surviving fragments: 15.3 ±, ± 0.8, 20.1 ±, ± 1.1.]
Confidence regions (or paired tests) are used to compare models to each other: A is better than B / B is better than A / A and B are equally good.
An uninformed model (stupid model/random guess) needs to be included, otherwise the statement "is better than an uninformed guess" cannot be made.
Useful model = (significantly) better than the uninformed baseline.
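A paired comparison against an uninformed baseline could be sketched as below. All numbers are made up for illustration, and the "two standard errors" threshold is a rough stand-in for a proper paired test, not the procedure used in the talk.

```python
import math
import statistics


def mean_and_stderr(values):
    """Mean and standard error of the mean for a list of per-point errors."""
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(len(values))


# Hypothetical test labels and model predictions (illustrative numbers only).
y_test = [3.1, 4.9, 7.2, 8.8, 11.1, 13.0]
model_pred = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]

# Uninformed baseline: always predict the mean of the training labels.
train_mean = 8.0  # assumed value, standing in for a mean computed on training data
baseline_pred = [train_mean] * len(y_test)

model_abs_err = [abs(t - p) for t, p in zip(y_test, model_pred)]
baseline_abs_err = [abs(t - p) for t, p in zip(y_test, baseline_pred)]

model_mae, model_se = mean_and_stderr(model_abs_err)
baseline_mae, baseline_se = mean_and_stderr(baseline_abs_err)

# Paired comparison: difference of errors on the same test points.
paired_diff = [b - m for b, m in zip(baseline_abs_err, model_abs_err)]
diff_mean, diff_se = mean_and_stderr(paired_diff)

# Rough rule of thumb: call the model useful if it beats the baseline
# by more than two standard errors of the paired difference.
useful = diff_mean - 2 * diff_se > 0

print(f"model    MAE {model_mae:.2f} ± {model_se:.2f}")
print(f"baseline MAE {baseline_mae:.2f} ± {baseline_se:.2f}")
```

Pairing the errors point by point removes the variation that both models share, which is why paired tests are usually sharper than comparing two separate confidence intervals.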
19 Meta-model: automated parameter tuning
Re-sampling is used to determine the best parameter setting. For validation, new unseen data needs to be used.
[Diagram: all data is split into training data and test data. The training data is re-sampled into tuning-train/tuning-test splits; Parameters 1, 2, 3 are each scored for model goodness (estimate ± error) and the best parameters are selected. The model with the best parameters is fit to the whole training data, then used to predict & quantify on the real test data.]
Multi-fold schemes are nested: splits within splits.
Important caveat: the inner training/test splits need to be part of any outer training set, otherwise validation is not out-of-sample!
New tuning parameters introduced by the meta-model: which measure of predictive goodness; which inner re-sampling scheme. Methods are usually less sensitive to these new tuning parameters.
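The nesting can be sketched with a toy tuning problem. Everything specific here is an assumption for illustration: the one-covariate ridge model, the candidate penalties, the synthetic data and the helper names. The structural point is that inner_cv_error only ever receives the outer training data.

```python
import random


def fit_ridge_slope(x, y, penalty):
    """Slope of a no-intercept ridge regression with one covariate."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + penalty)


def mse(x, y, slope):
    """Mean squared error of the prediction slope * x."""
    return sum((b - slope * a) ** 2 for a, b in zip(x, y)) / len(x)


def inner_cv_error(x, y, penalty, k=3):
    """Mean tuning-test MSE over k inner folds, using only the data passed in."""
    errors = []
    for fold in range(k):
        test = [i for i in range(len(x)) if i % k == fold]
        train = [i for i in range(len(x)) if i % k != fold]
        slope = fit_ridge_slope([x[i] for i in train], [y[i] for i in train], penalty)
        errors.append(mse([x[i] for i in test], [y[i] for i in test], slope))
    return sum(errors) / k


# Synthetic data (assumption): y = 3x plus noise.
rng = random.Random(1)
x = [rng.uniform(-1, 1) for _ in range(60)]
y = [3.0 * a + rng.gauss(0, 0.3) for a in x]

# Outer split: the last 15 points are the real test data, never used for tuning.
x_train, y_train = x[:45], y[:45]
x_test, y_test = x[45:], y[45:]

# Inner re-sampling on the training data picks the best parameter setting.
candidate_penalties = [0.0, 0.1, 1.0, 10.0]
best_penalty = min(candidate_penalties, key=lambda p: inner_cv_error(x_train, y_train, p))

# Refit with the best parameter on the whole training data, then validate.
final_slope = fit_ridge_slope(x_train, y_train, best_penalty)
outer_test_mse = mse(x_test, y_test, final_slope)
```

The inner cross-validation error of the winning parameter is optimistically biased, because it was used for selection; outer_test_mse is the number to report.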
20 Meta-Strategies in ML
Model tuning: model with tuning parameters -> the best tuning parameters are determined using a data-driven tuning algorithm.
Ensemble learning: a number of (possibly "weak") models (A, B, C, D) -> "strong" ensemble model.
21 Object dependencies in the ML workflow
One interesting dataset (all data: N data points, "small data") is re-sampled into multiple train/test splits, on each of which the strategies (1, 2, ..., M) are compared, most of which are parameter-tuned by the same principle.
Typical numbers: 5-10 outer splits; M = 5-20 strategies; 3-5 nested splits; parameter combinations; base learners (ensembles: further nesting).
Runtime = 10 x 10 x 5 x (x 100) x one run on N samples (usually O(N²) or O(N³))
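The runtime product can be worked through numerically. The assignment of the factors below (10 outer splits, 10 strategies, 5 nested splits, 100 parameter combinations) is one reading of the slide, not something the transcript states explicitly, and the cost function is a hypothetical stand-in for "one run on N samples".

```python
# One reading of "10 x 10 x 5 x (x 100)"; the factor names are assumptions.
outer_splits = 10
strategies = 10
nested_splits = 5
parameter_combinations = 100

# Number of single model fits in the full benchmarking experiment.
total_fits = outer_splits * strategies * nested_splits * parameter_combinations


def relative_cost(n_samples, exponent):
    """Hypothetical cost of one run on n_samples for an O(N^exponent) method."""
    return n_samples ** exponent


# Doubling the data size multiplies the cost of every one of those fits.
quadratic_factor = relative_cost(2000, 2) / relative_cost(1000, 2)
cubic_factor = relative_cost(2000, 3) / relative_cost(1000, 3)

print(total_fits, quadratic_factor, cubic_factor)
```

Under this reading a single benchmark amounts to tens of thousands of model fits, which is why validation is demanding even for small N.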
22 Machine Learning Toolboxes
23 An incomplete list of influential toolboxes
scikit-learn is perhaps the most widely used ML toolbox.
[Table; the toolbox names did not survive transcription. Columns: Language; Modular API (e.g., methods); GUI; Common models; Model tuning, meta-methods; Model validation and comparison. Surviving cell fragments: R; python; caret; R, python, multi-interface; Not entirely; mostly kernels; some; Java; 3rd party wrappers; python; Few, mostly classifiers; few.]
24 The object-oriented ML Toolbox API
as found in the R/mlr or scikit-learn packages
Leading principles: encapsulation, modularization (modular structure, object orientation).
Learning machine object, e.g. linear regression: fit(traindata), predict(testdata), plus metadata & model info.
Abstraction models objects with a unified API:
Concept abstracted; public interface; in R/mlr; in sklearn
- Learning machines; fitting, predicting, set parameters; Learner; estimator
- Re-sampling schemes; sample, apply & get results; ResampleDesc; splitter classes in model_selection
- Evaluation metrics; compute from results, tabulate; Measure; metrics classes in metrics
- Meta-modelling (tuning, ensembling, pipelining); wrapping machines by strategy; various wrappers; fused classes, various wrappers, Pipeline
- Learning task; benchmark, list strategies/measures; Task; implicit, not encapsulated
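The unified interface can be illustrated with a toy benchmark. The classes and functions below (MeanBaseline, NearestNeighbour, holdout, benchmark) are hypothetical names invented for this sketch and are not the mlr or scikit-learn API; they only mirror the idea that every learner exposes fit and predict, so that re-sampling and evaluation code can treat learners interchangeably.

```python
class MeanBaseline:
    """Uninformed learner: always predicts the mean training label."""

    def fit(self, x, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, x):
        return [self.mean_] * len(x)


class NearestNeighbour:
    """1-nearest-neighbour regression on a single covariate."""

    def fit(self, x, y):
        self.x_, self.y_ = list(x), list(y)
        return self

    def predict(self, x):
        def nearest(q):
            return min(range(len(self.x_)), key=lambda i: abs(self.x_[i] - q))
        return [self.y_[nearest(q)] for q in x]


def mean_absolute_error(y_true, y_pred):
    """Evaluation metric: computed from labels and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def holdout(n_samples, n_test):
    """Re-sampling scheme: the last n_test points form the test set."""
    return list(range(n_samples - n_test)), list(range(n_samples - n_test, n_samples))


def benchmark(learners, x, y, split, measure):
    """Fit each learner on the training part and score it on the test part."""
    train, test = split
    results = {}
    for name, learner in learners.items():
        learner.fit([x[i] for i in train], [y[i] for i in train])
        predictions = learner.predict([x[i] for i in test])
        results[name] = measure([y[i] for i in test], predictions)
    return results


x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 1, 4, 9, 16, 25, 36, 49]
results = benchmark(
    {"baseline": MeanBaseline(), "1-nn": NearestNeighbour()},
    x, y, holdout(8, 2), mean_absolute_error,
)
```

Because benchmark only relies on fit and predict, adding another strategy, measure or re-sampling scheme does not require changing the benchmarking code.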
25 HPC for benchmarking/validation today
scikit-learn: joblib. mlr: parallelMap.
At the selected level (one of 1-4): distribute to clusters/cores.
[Diagram: the benchmarking workflow with all data (N data points, "small data"), training/test splits and strategies 1, 2, ..., M; levels: outer splits, nested splits, parameter combinations, base learners. The typical numbers did not survive transcription.]
Plus algorithm-specific HPC interfaces, e.g. deep learning (mutually exclusive).
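Distributing one level of the workflow, here the outer splits, can be sketched with the Python standard library. This uses concurrent.futures as a stand-in and is not how joblib or parallelMap are invoked; the toy evaluate_split function and the data are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_split(split):
    """Fit a mean predictor on the training part and return its test MAE."""
    train, test = split
    mean = sum(train) / len(train)
    return sum(abs(v - mean) for v in test) / len(test)


data = [float(v) for v in range(20)]

# Four train/test splits: every fourth point goes to the test set.
splits = [
    (
        [v for i, v in enumerate(data) if i % 4 != fold],
        [v for i, v in enumerate(data) if i % 4 == fold],
    )
    for fold in range(4)
]

# The splits are independent of each other, so they can be mapped to workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_errors = list(pool.map(evaluate_split, splits))

# Same computation done serially, for comparison.
serial_errors = [evaluate_split(split) for split in splits]
```

Threads keep the sketch simple, but in standard CPython they do not speed up CPU-bound pure-Python code because of the global interpreter lock; a process pool or a library such as joblib would be the more realistic choice for actual model fitting.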
26 HPC support tomorrow?
Layer 1: full graph of dependencies: re-samples, algorithms (1, 2, ..., M), parameters
Layer 2: scheduler for algorithms and meta-algorithms. Combining (?) MapReduce, DAAL, dask, joblib -> TBB? (image source: continuum analytics)
[Diagram labels: DATA (e.g. Hadoop); data/task pipeline]
Layer 3: optimized primitives: linear systems, convex optimization, stoch. gradient descent; e.g. MKL, CUDA, BLAS (image source: Intel math kernel library)
Layer 4: hardware API, e.g. distributed, multi-core, multi-type/heterogeneous
27 Challenges in ML APIs and HPC
Surprisingly few resources have been invested in ML toolboxes. The most advanced toolboxes are currently open-source & academic.
Features that would be desirable to the practitioner but are not available without mid-scale software development:
- Integration of (a) data management, (b) exploration and (c) modelling. Especially challenging: integration in large-scale scenarios, e.g. MapReduce for divide/conquer over data, model parts, and models.
- Full HPC integration on a granular level for distributed ML benchmarking: making full use of parallelism for nesting and computational redundancies; a complete HPC architecture for the whole model benchmarking workflow.
- Non-standard modelling tasks, structured data (incl. time series): data heterogeneity, multiple datasets, time series, spatial features, images, etc.; forecasting, on-line learning, anomaly detection, change point detection. Meta-modelling and re-sampling for these is an order of magnitude more costly.
More informationBig Data Using Hadoop
IEEE 2016-17 PROJECT LIST(JAVA) Big Data Using Hadoop 17ANSP-BD-001 17ANSP-BD-002 Hadoop Performance Modeling for JobEstimation and Resource Provisioning MapReduce has become a major computing model for
More informationAccelerated Machine Learning Algorithms in Python
Accelerated Machine Learning Algorithms in Python Patrick Reilly, Leiming Yu, David Kaeli reilly.pa@husky.neu.edu Northeastern University Computer Architecture Research Lab Outline Motivation and Goals
More informationMachine Learning: An Applied Econometric Approach Online Appendix
Machine Learning: An Applied Econometric Approach Online Appendix Sendhil Mullainathan mullain@fas.harvard.edu Jann Spiess jspiess@fas.harvard.edu April 2017 A How We Predict In this section, we detail
More informationOracle9i Data Mining. Data Sheet August 2002
Oracle9i Data Mining Data Sheet August 2002 Oracle9i Data Mining enables companies to build integrated business intelligence applications. Using data mining functionality embedded in the Oracle9i Database,
More informationMachine Learning in the Process Industry. Anders Hedlund Analytics Specialist
Machine Learning in the Process Industry Anders Hedlund Analytics Specialist anders@binordic.com Artificial Specific Intelligence Artificial General Intelligence Strong AI Consciousness MEDIA, NEWS, CELEBRITIES
More informationBig Data and FrameWorks; Perspectives to Applied Machine Learning
Big Data and FrameWorks; Perspectives to Applied Machine Learning Mehdi Habibzadeh PhD in Computer Science Outlines (Oct 2016) : Big Data and Challenges Review and Trends Math and Probability Concepts
More informationSparkling Water. August 2015: First Edition
Sparkling Water Michal Malohlava Alex Tellez Jessica Lanford http://h2o.gitbooks.io/sparkling-water-and-h2o/ August 2015: First Edition Sparkling Water by Michal Malohlava, Alex Tellez & Jessica Lanford
More informationKNIME for the life sciences Cambridge Meetup
KNIME for the life sciences Cambridge Meetup Greg Landrum, Ph.D. KNIME.com AG 12 July 2016 What is KNIME? A bit of motivation: tool blending, data blending, documentation, automation, reproducibility More
More informationParallel Methods for Convex Optimization. A. Devarakonda, J. Demmel, K. Fountoulakis, M. Mahoney
Parallel Methods for Convex Optimization A. Devarakonda, J. Demmel, K. Fountoulakis, M. Mahoney Problems minimize g(x)+f(x; A, b) Sparse regression g(x) =kxk 1 f(x) =kax bk 2 2 mx Sparse SVM g(x) =kxk
More informationImproving the Efficiency of Fast Using Semantic Similarity Algorithm
International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year
More informationS8873 GBM INFERENCING ON GPU. Shankara Rao Thejaswi Nanditale, Vinay Deshpande
S8873 GBM INFERENCING ON GPU Shankara Rao Thejaswi Nanditale, Vinay Deshpande Introduction AGENDA Objective Experimental Results Implementation Details Conclusion 2 INTRODUCTION 3 BOOSTING What is it?
More informationPreface to the Second Edition. Preface to the First Edition. 1 Introduction 1
Preface to the Second Edition Preface to the First Edition vii xi 1 Introduction 1 2 Overview of Supervised Learning 9 2.1 Introduction... 9 2.2 Variable Types and Terminology... 9 2.3 Two Simple Approaches
More informationLecture on Modeling Tools for Clustering & Regression
Lecture on Modeling Tools for Clustering & Regression CS 590.21 Analysis and Modeling of Brain Networks Department of Computer Science University of Crete Data Clustering Overview Organizing data into
More informationFraud Detection Using Random Forest Algorithm
Fraud Detection Using Random Forest Algorithm Eesha Goel Computer Science Engineering and Technology, GZSCCET, Bhatinda, India eesha1992@rediffmail.com Abhilasha Computer Science Engineering and Technology,
More informationUsing Machine Learning to Optimize Storage Systems
Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation
More informationHow Learning Differs from Optimization. Sargur N. Srihari
How Learning Differs from Optimization Sargur N. srihari@cedar.buffalo.edu 1 Topics in Optimization Optimization for Training Deep Models: Overview How learning differs from optimization Risk, empirical
More informationData Mining: STATISTICA
Outline Data Mining: STATISTICA Prepare the data Classification and regression (C & R, ANN) Clustering Association rules Graphic user interface Prepare the Data Statistica can read from Excel,.txt and
More informationContents. Foreword to Second Edition. Acknowledgments About the Authors
Contents Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments About the Authors xxxi xxxv Chapter 1 Introduction 1 1.1 Why Data Mining? 1 1.1.1 Moving toward the Information Age 1
More informationMLeap: Release Spark ML Pipelines
MLeap: Release Spark ML Pipelines Hollin Wilkins & Mikhail Semeniuk SATURDAY Web Dev @ Cornell Studied some General Biology Rails Consulting for TrueCar and other companies Implement ML model for ClearBook
More informationCPSC 340: Machine Learning and Data Mining. Probabilistic Classification Fall 2017
CPSC 340: Machine Learning and Data Mining Probabilistic Classification Fall 2017 Admin Assignment 0 is due tonight: you should be almost done. 1 late day to hand it in Monday, 2 late days for Wednesday.
More informationWhat is Data Mining? Data Mining. Data Mining Architecture. Illustrative Applications. Pharmaceutical Industry. Pharmaceutical Industry
Data Mining Andrew Kusiak Intelligent Systems Laboratory 2139 Seamans Center The University of Iowa Iowa City, IA 52242-1527 andrew-kusiak@uiowa.edu http://www.icaen.uiowa.edu/~ankusiak Tel. 319-335 5934
More informationTransforming Transport Infrastructure with GPU- Accelerated Machine Learning Yang Lu and Shaun Howell
Transforming Transport Infrastructure with GPU- Accelerated Machine Learning Yang Lu and Shaun Howell 11 th Oct 2018 2 Contents Our Vision Of Smarter Transport Company introduction and journey so far Advanced
More informationIntroduction to Machine Learning Prof. Anirban Santara Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur
Introduction to Machine Learning Prof. Anirban Santara Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture 14 Python Exercise on knn and PCA Hello everyone,
More informationHyperparameters and Validation Sets. Sargur N. Srihari
Hyperparameters and Validation Sets Sargur N. srihari@cedar.buffalo.edu 1 Topics in Machine Learning Basics 1. Learning Algorithms 2. Capacity, Overfitting and Underfitting 3. Hyperparameters and Validation
More informationEvaluation. Evaluate what? For really large amounts of data... A: Use a validation set.
Evaluate what? Evaluation Charles Sutton Data Mining and Exploration Spring 2012 Do you want to evaluate a classifier or a learning algorithm? Do you want to predict accuracy or predict which one is better?
More informationScalable Ensemble Learning and Computationally Efficient Variance Estimation. Erin E. LeDell. A dissertation submitted in partial satisfaction of the
Scalable Ensemble Learning and Computationally Efficient Variance Estimation by Erin E. LeDell A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy
More informationSOCIAL MEDIA MINING. Data Mining Essentials
SOCIAL MEDIA MINING Data Mining Essentials Dear instructors/users of these slides: Please feel free to include these slides in your own material, or modify them as you see fit. If you decide to incorporate
More information