Python for Data Science by Luca Massaron and John Paul Mueller

Table of Contents

Introduction
  About This Book
  Foolish Assumptions
  Icons Used in This Book
  Beyond the Book
  Where to Go from Here
Part I: Getting Started with Python for Data Science
  Chapter 1: Discovering the Match between Data Science and Python
    Defining the Sexiest Job of the 21st Century
      Considering the emergence of data science
      Outlining the core competencies of a data scientist
      Linking data science and big data
      Understanding the role of programming
    Creating the Data Science Pipeline
      Preparing the data
      Performing exploratory data analysis
      Learning from data
      Visualizing
      Obtaining insights and data products
    Understanding Python's Role in Data Science
      Considering the shifting profile of data scientists
      Working with a multipurpose, simple, and efficient language
    Learning to Use Python Fast
      Loading data
      Training a model
      Viewing a result
  Chapter 2: Introducing Python's Capabilities and Wonders
    Why Python?
      Grasping Python's core philosophy
      Discovering present and future development goals
    Working with Python
      Getting a taste of the language
      Understanding the need for indentation
      Working at the command line or in the IDE
    Performing Rapid Prototyping and Experimentation
    Considering Speed of Execution
    Visualizing Power
    Using the Python Ecosystem for Data Science
      Accessing scientific tools using SciPy
      Performing fundamental scientific computing using NumPy
      Performing data analysis using pandas
      Implementing machine learning using Scikit-learn
      Plotting the data using matplotlib
      Parsing HTML documents using Beautiful Soup
  Chapter 3: Setting Up Python for Data Science
    Considering the Off-the-Shelf Cross-Platform Scientific Distributions
      Getting Continuum Analytics Anaconda
      Getting Enthought Canopy Express
      Getting pythonxy
      Getting WinPython
    Installing Anaconda on Windows
    Installing Anaconda on Linux
    Installing Anaconda on Mac OS X
    Downloading the Datasets and Example Code
      Using IPython Notebook
      Defining the code repository
      Understanding the datasets used in this book
  Chapter 4: Reviewing Basic Python
    Working with Numbers and Logic
      Performing variable assignments
      Doing arithmetic
      Comparing data using Boolean expressions
    Creating and Using Strings
    Interacting with Dates
    Creating and Using Functions
      Creating reusable functions
      Calling functions in a variety of ways
    Using Conditional and Loop Statements
      Making decisions using the if statement
      Choosing between multiple options using nested decisions
      Performing repetitive tasks using for
      Using the while statement
    Storing Data Using Sets, Lists, and Tuples
      Performing operations on sets
      Working with lists
      Creating and using Tuples
    Defining Useful Iterators
    Indexing Data Using Dictionaries
Part II: Getting Your Hands Dirty with Data
  Chapter 5: Working with Real Data
    Uploading, Streaming, and Sampling Data
      Uploading small amounts of data into memory
      Streaming large amounts of data into memory
      Sampling data
    Accessing Data in Structured Flat-File Form
      Reading from a text file
      Reading CSV delimited format
      Reading Excel and other Microsoft Office files
    Sending Data in Unstructured File Form
    Managing Data from Relational Databases
    Interacting with Data from NoSQL Databases
    Accessing Data from the Web
  Chapter 6: Conditioning Your Data
    Juggling between NumPy and pandas
      Knowing when to use NumPy
      Knowing when to use pandas
    Validating Your Data
      Figuring out what's in your data
      Removing duplicates
      Creating a data map and data plan
    Manipulating Categorical Variables
      Creating categorical variables
      Renaming levels
      Combining levels
    Dealing with Dates in Your Data
      Formatting date and time values
      Using the right time transformation
    Dealing with Missing Data
      Finding the missing data
      Encoding missingness
      Imputing missing data
    Slicing and Dicing: Filtering and Selecting Data
      Slicing rows
      Slicing columns
      Dicing
    Concatenating and Transforming
      Adding new cases and variables
      Removing data
      Sorting and shuffling
    Aggregating Data at Any Level
  Chapter 7: Shaping Data
    Working with HTML Pages
      Parsing XML and HTML
      Using XPath for data extraction
    Working with Raw Text
      Dealing with Unicode
      Stemming and removing stop words
      Introducing regular expressions
    Using the Bag of Words Model and Beyond
      Understanding the bag of words model
      Working with n-grams
      Implementing TF-IDF transformations
    Working with Graph Data
      Understanding the adjacency matrix
      Using NetworkX basics
  Chapter 8: Putting What You Know in Action
    Contextualizing Problems and Data
      Evaluating a data science problem
      Researching solutions
      Formulating a hypothesis
      Preparing your data
    Considering the Art of Feature Creation
      Defining feature creation
      Combining variables
      Understanding binning and discretization
      Using indicator variables
      Transforming distributions
    Performing Operations on Arrays
      Using vectorization
      Performing simple arithmetic on vectors and matrices
      Performing matrix vector multiplication
      Performing matrix multiplication
Part III: Visualizing the Invisible
  Chapter 9: Getting a Crash Course in MatPlotLib
    Starting with a Graph
      Defining the plot
      Drawing multiple lines and plots
      Saving your work
    Setting the Axis, Ticks, Grids
      Getting the axes
      Formatting the axes
      Adding grids
    Defining the Line Appearance
      Working with line styles
      Using colors
      Adding markers
    Using Labels, Annotations, and Legends
      Adding labels
      Annotating the chart
      Creating a legend
  Chapter 10: Visualizing the Data
    Choosing the Right Graph
      Showing parts of a whole with pie charts
      Creating comparisons with bar charts
      Showing distributions using histograms
      Depicting groups using box plots
      Seeing data patterns using scatterplots
    Creating Advanced Scatterplots
      Depicting groups
      Showing correlations
    Plotting Time Series
      Representing time on axes
      Plotting trends over time
    Plotting Geographical Data
    Visualizing Graphs
      Developing undirected graphs
      Developing directed graphs
  Chapter 11: Understanding the Tools
    Using the IPython Console
      Interacting with screen text
      Changing the window appearance
      Getting Python help
      Getting IPython help
      Using magic functions
      Discovering objects
    Using IPython Notebook
      Working with styles
      Restarting the kernel
      Restoring a checkpoint
    Performing Multimedia and Graphic Integration
      Embedding plots and other images
      Loading examples from online sites
      Obtaining online graphics and multimedia
Part IV: Wrangling Data
  Chapter 12: Stretching Python's Capabilities
    Playing with Scikit-learn
      Understanding classes in Scikit-learn
      Defining applications for data science
    Performing the Hashing Trick
      Using hash functions
      Demonstrating the hashing trick
      Working with deterministic selection
    Considering Timing and Performance
      Benchmarking with timeit
      Working with the memory profiler
    Running in Parallel
      Performing multicore parallelism
      Demonstrating multiprocessing
  Chapter 13: Exploring Data Analysis
    The EDA Approach
    Defining Descriptive Statistics for Numeric Data
      Measuring central tendency
      Measuring variance and range
      Working with percentiles
      Defining measures of normality
    Counting for Categorical Data
      Understanding frequencies
      Creating contingency tables
    Creating Applied Visualization for EDA
      Inspecting boxplots
      Performing t-tests after boxplots
      Observing parallel coordinates
      Graphing distributions
      Plotting scatterplots
    Understanding Correlation
      Using covariance and correlation
      Using nonparametric correlation
      Considering chi-square for tables
    Modifying Data Distributions
      Using the normal distribution
      Creating a Z-score standardization
      Transforming other notable distributions
  Chapter 14: Reducing Dimensionality
    Understanding SVD
      Looking for dimensionality reduction
      Using SVD to measure the invisible
    Performing Factor and Principal Component Analysis
      Considering the psychometric model
      Looking for hidden factors
      Using components, not factors
      Achieving dimensionality reduction
    Understanding Some Applications
      Recognizing faces with PCA
      Extracting Topics with NMF
      Recommending movies
  Chapter 15: Clustering
    Clustering with K-means
      Understanding centroid-based algorithms
      Creating an example with image data
      Looking for optimal solutions
      Clustering big data
    Performing Hierarchical Clustering
    Moving Beyond the Round-Shaped Clusters: DBScan
  Chapter 16: Detecting Outliers in Data
    Considering Detection of Outliers
      Finding more things that can go wrong
      Understanding anomalies and novel data
    Examining a Simple Univariate Method
      Leveraging on the Gaussian distribution
      Making assumptions and checking out
    Developing a Multivariate Approach
      Using principal component analysis
      Using cluster analysis
      Automating outliers detection with SVM
Part V: Learning from Data
  Chapter 17: Exploring Four Simple and Effective Algorithms
    Guessing the Number: Linear Regression
      Defining the family of linear models
      Using more variables
      Understanding limitations and problems
    Moving to Logistic Regression
      Applying logistic regression
      Considering when classes are more
    Making Things as Simple as Naive Bayes
      Finding out that Naive Bayes isn't so naive
      Predicting text classifications
    Learning Lazily with Nearest Neighbors
      Predicting after observing neighbors
      Choosing your k parameter wisely
  Chapter 18: Performing Cross-Validation, Selection, and Optimization
    Pondering the Problem of Fitting a Model
      Understanding bias and variance
      Defining a strategy for picking models
      Dividing between training and test sets
    Cross-Validating
      Using cross-validation on k folds
      Sampling stratifications for complex data
    Selecting Variables Like a Pro
      Selecting by univariate measures
      Using a greedy search
    Pumping Up Your Hyperparameters
      Implementing a grid search
      Trying a randomized search
  Chapter 19: Increasing Complexity with Linear and Nonlinear Tricks
    Using Nonlinear Transformations
      Doing variable transformations
      Creating interactions between variables
    Regularizing Linear Models
      Relying on Ridge regression (L2)
      Using the Lasso (L1)
      Leveraging regularization
      Combining L1 & L2: Elasticnet
    Fighting with Big Data Chunk by Chunk
      Determining when there is too much data
      Implementing Stochastic Gradient Descent
    Understanding Support Vector Machines
      Relying on a computational method
      Fixing many new parameters
      Classifying with SVC
      Going nonlinear is easy
      Performing regression with SVR
      Creating a stochastic solution with SVM
  Chapter 20: Understanding the Power of the Many
    Starting with a Plain Decision Tree
      Understanding a decision tree
      Creating classification and regression trees
    Making Machine Learning Accessible
      Working with a Random Forest classifier
      Working with a Random Forest regressor
      Optimizing a Random Forest
    Boosting Predictions
      Knowing that many weak predictors win
      Creating a gradient boosting classifier
      Creating a gradient boosting regressor
      Using GBM hyper-parameters
Part VI: The Part of Tens
  Chapter 21: Ten Essential Data Science Resource Collections
    Gaining Insights with Data Science Weekly
    Obtaining a Resource List at U Climb Higher
    Getting a Good Start with KDnuggets
    Accessing the Huge List of Resources on Data Science Central
    Obtaining the Facts of Open Source Data Science from Masters
    Locating Free Learning Resources with Quora
    Receiving Help with Advanced Topics at Conductrics
    Learning New Tricks from the Aspirational Data Scientist
    Finding Data Intelligence and Analytics Resources at AnalyticBridge
    Zeroing In on Developer Resources with Jonathan Bower
  Chapter 22: Ten Data Challenges You Should Take
    Meeting the Data Science London + Scikit-learn Challenge
    Predicting Survival on the Titanic
    Finding a Kaggle Competition that Suits Your Needs
    Honing Your Overfit Strategies
    Trudging Through the MovieLens Dataset
    Getting Rid of Spam Emails
    Working with Handwritten Information
    Working with Pictures
    Analyzing Amazon.com Reviews
    Interacting with a Huge Graph
Index

Python for Data Science For Dummies. Published by: John Wiley & Sons, Inc., 111

Introduction to Data Science. Introduction to Data Science with Python. Python Basics: Basic Syntax, Data Structures. Python Concepts (Core)

Introduction to Data Science: What is Analytics and Data Science? Overview of Data Science and Analytics. Why is Analytics becoming popular now? Application of Analytics in business. Analytics Vs Data

Python With Data Science

Course Overview: This course covers theoretical and technical aspects of using Python in Applied Data Science projects and Data Logistics use cases. Who Should Attend: Data Scientists, Software Developers,

Certified Data Science with Python Professional VS-1442

Certification Code VS-1442. Data science has become

DATA SCIENCE INTRODUCTION QSHORE TECHNOLOGIES. About the Course:

DATA SCIENCE. About the Course: In this course you will get an introduction to the main tools and ideas which are required for Data Scientist/Business Analyst/Data Analyst/Analytics Manager/Actuarial Scientist/Business

Data Science with Python Course Catalog

Enhance Your Contribution to the Business, Earn Industry-recognized Accreditations, and Develop Skills that Help You Advance in Your Career. March 2018. www.iotintercon.com. Table of Contents: Syllabus Overview

Data Analyst Nanodegree Syllabus

Discover Insights from Data with Python, R, SQL, and Tableau. Before You Start. Prerequisites: In order to succeed in this program, we recommend having experience working

Python for Data Analysis

Wes McKinney. O'Reilly: Beijing, Cambridge, Farnham, Köln, Sebastopol, Tokyo. Table of Contents: Preface. 1. Preliminaries. What Is This Book About? Why Python for Data Analysis?

Machine Learning: Think Big and Parallel

Day 1. Inderjit S. Dhillon, Dept of Computer Science, UT Austin. CS395T: Topics in Multicore Programming, Oct 1, 2013. Outline: Scikit-learn: Machine Learning in Python. Supervised Learning day1 Regression: Least

Data Science Bootcamp Curriculum. NYC Data Science Academy

100+ hours free, self-paced online course. Access to part-time in-person courses hosted at NYC campus. Machine Learning with R and Python Foundations

Data Science. Data Analyst. Data Scientist. Data Architect

Data Science. Data Analyst: Data Analysis in Excel, Programming in R, Introduction to Python/SQL/Tableau, Data Visualization in R / Tableau, Exploratory Data Analysis. Data Scientist: Inferential Statistics &

Preface to the Second Edition. Preface to the First Edition. 1 Introduction

Preface to the Second Edition. Preface to the First Edition. 1 Introduction. 2 Overview of Supervised Learning. 2.1 Introduction. 2.2 Variable Types and Terminology. 2.3 Two Simple Approaches

DSC 201: Data Analysis & Visualization

Exploratory Data Analysis. Dr. David Koop. What is Exploratory Data Analysis? "Detective work" to summarize and explore datasets. Includes: - Data acquisition and input

SCIENCE. An Introduction to Python Brief History Why Python Where to use

DATA SCIENCE. Python is a general-purpose, interpreted, interactive, object-oriented, high-level programming language. Currently Python is the most popular language in IT. Python adopted as a language

Contents. Preface to the Second Edition

Preface to the Second Edition. 1 Introduction. 1.1 What Is Data Mining? 1.2 Motivating Challenges. 1.3 The Origins of Data Mining

Introducing Categorical Data/Variables (pp )

Notation: Means pencil-and-paper QUIZ. Means coding QUIZ. Definition: Feature Engineering (FE) = the process of transforming the data to an optimal representation for a given application. Scaling (see Chs.

Specialist ICT Learning

APPLIED DATA SCIENCE AND BIG DATA ANALYTICS GTBD7. Course Description: This intensive training course provides theoretical and technical aspects of Data Science and Business Analytics.

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

CS 189 Spring 2015 Introduction to Machine Learning Final. You have 2 hours 50 minutes for the exam. The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. No calculators or

CS229 Final Project: Predicting Expected Email Response Times

Laura Cruz-Albrecht (lcruzalb), Kevin Khieu (kkhieu). December 15, 2017. 1 Introduction: Each day, countless emails are sent out, yet the time

Using Existing Numerical Libraries on Spark

Brian Spector. Chicago Spark Users Meetup, June 24th, 2015. Experts in numerical algorithms and HPC services. How to use existing libraries on Spark. Call algorithm

Python Certification Training

Introduction To Python. Goal: Give brief idea of what Python is and touch on basics. Define Python. Know why Python is popular. Setup Python environment. Discuss flow control

90 Hours for online Live Training

Online Live Course. Course Name: Data Science. Course Objective: 1. To make the learner identify potential zones of uses of Data Science. 2. Providing experience of working with real time applications of Data

Machine Learning in Action

Peter Harrington. Manning, Shelter Island. Brief contents: Part 1 Classification. 1 Machine learning basics. 2 Classifying with k-nearest Neighbors. 3 Splitting

Learning Objectives for Data Concept and Visualization

Assignment 1: Data Quality Concept and Impact of Data Quality. Summarize concepts of data quality. Understand and describe the impact of data on actuarial

Coding All-in-One

by Nikhil Abraham, Andy Harris, Eva Holland, Joris Meys, Luca Massaron, Chris Minnick, John Paul Mueller, and Andrie de Vries. Coding All-in-One For Dummies. Published

Pre-Requisites: CS2510. NU Core Designations: AD

DS4100: Data Collection, Integration and Analysis. Teaches how to collect data from multiple sources and integrate them into consistent data sets. Explains how to use semi-automated and automated classification

About Intellipaat. About the Course. Why Take This Course?

About Intellipaat: Intellipaat is a fast-growing professional training provider that is offering training in over 150 most sought-after tools and technologies. We have a learner base of 700,000 in over

Automation.

www.austech.edu.au. WHAT IS AUTOMATION? Automation testing is a technique that uses an application to implement the entire life cycle of the software in less time and provides efficiency and effectiveness

ML Programming (Supplement): Scikit-Learn

2017.5. Scikit-Learn? Features: a Python module integrating classic machine learning algorithms in the tightly-knit world of scientific Python packages (NumPy, SciPy, matplotlib).

SAS (Statistical Analysis Software/System)

SAS Adv. Analytics or Predictive Modelling: Class Room Training Fee & Duration: 30K & 3 Months. Online Training Fee & Duration: 33K & 3 Months. Learning SAS:

ARTIFICIAL INTELLIGENCE AND PYTHON

DAY 1. STANLEY LIANG, LASSONDE SCHOOL OF ENGINEERING, YORK UNIVERSITY. WHAT IS PYTHON: An interpreted high-level programming language for general-purpose programming. Python

JMP Book Descriptions

The collection of JMP documentation is available in the JMP Help > Books menu. This document describes each title to help you decide which book to explore. Each book title is linked

Kaggle See Click Fix Model Description

BY: Miroslaw Horbal & Bryan Gregory. LOCATION: Waterloo, Ont, Canada & Dallas, TX. CONTACT: miroslaw@gmail.com & bryan.gregory1@gmail.com. CONTEST: See Click Predict

Louis Fourrier Fabien Gaie Thomas Rolf

CS 229 Stay Alert! The Ford Challenge. Louis Fourrier, Fabien Gaie, Thomas Rolf. 1. Problem description. a. Goal: Our final project is a recent Kaggle competition submitted

Event: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect

BEOP.CTO.TP4. Owner: OCTO. Revision: 0001. Approved by: JAT. Effective: 08/30/2018. Buchanan & Edwards Proprietary: Printed copies of

Clustering and Visualisation of Data

Hiroshi Shimodaira. January-March 28. Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

Applied Regression Modeling: A Business Approach

Computer software help: SAS. SAS (originally Statistical Analysis Software) is a commercial statistical software package based on a powerful programming

BIG DATA SCIENTIST Certification. Big Data Scientist

Big Data Science Professional (BDSCP) certifications are formal accreditations that prove proficiency in specific areas of Big Data. To obtain a certification,

Python for Data Analysis. Prof.Sushila Aghav-Palwe Assistant Professor MIT

Four steps to apply data analytics: 1. Define your Objective: What are you trying to achieve? What could the result look like? 2.

Machine Learning Part 1

Data Science Weekend. KMK Online Analytic Team. Fajri Koto, Data Scientist, fajri.koto@kmklabs.com. Outline: 1. Machine Learning at glance 2. Vector Representation

CIS192 Python Programming

Machine Learning in Python. Robert Rand, University of Pennsylvania. October 22, 2015. Outline: 1 Machine Learning

10 things I wish I knew about Machine Learning Competitions

Introduction. Theoretical competition run-down. The list of things I wish I knew. Code samples for a running competition. Kaggle the platform. Reasons

M. Sc. (Artificial Intelligence and Machine Learning)

Course Name: Advanced Python. Course Code: MSCAI 122. This course will introduce students to advanced Python implementations and the latest Machine Learning and Deep Learning libraries, Scikit-Learn and

Predict Outcomes and Reveal Relationships in Categorical Data

PASW Categories 18 Specifications. Unleash the full potential of your data through predictive analysis, statistical learning, perceptual mapping,

Predicting Popular Xbox games based on Search Queries of Users

Chinmoy Mandayam and Saahil Shenoy. I. INTRODUCTION: This project is based on a completed Kaggle competition. Our goal is to predict which

Applying Supervised Learning

When to Consider Supervised Learning: A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA

May 18, 2017. About the Presenters: Tom Kolde, FCAS, MAAA, Consulting Actuary, Chicago,

Practical Guidance for Machine Learning Applications

Brett Wujek. About the authors. Material from SGF Paper SAS2360-2016. Brett Wujek, Senior Data Scientist, Advanced Analytics R&D. ~20 years developing engineering

Data Analytics Training Program

In exclusive association with. 1200+ Trainings, 20,000+ Participants, 10,000+ Brands, 45+ Countries [Since 2009]. Training partner for. Who Is This Course For? Programmers Willing

ADVANCED ANALYTICS USING SAS ENTERPRISE MINER RENS FEENSTRA

INSIGHTS@SAS: ADVANCED ANALYTICS USING SAS ENTERPRISE MINER. RENS FEENSTRA. AGENDA: 09.00-09.15 Intro. 09.15-10.30 Analytics using SAS Enterprise Guide, Ellen Lokollo. 10.45-12.00 Advanced Analytics using SAS

Data Mining: Exploring Data. Lecture Notes for Chapter 3

What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include

Otto Group Product Classification Challenge

Hoang Duong. May 19, 2015. 1 Introduction: The Otto Group Product Classification Challenge is the biggest Kaggle competition to date with 3590 participating teams.

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

Outline: I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality

Final Report: Kaggle Soil Property Prediction Challenge

Saurabh Verma (verma076@umn.edu, (612)598-1893). 1 Project Goal: Low cost and rapid analysis of soil samples using infrared spectroscopy provide new

DSC 201: Data Analysis & Visualization

Exploratory Data Analysis. Dr. David Koop. Python Support for Time. The datetime package: - Has date, time, and datetime classes - .now() method: the current datetime

scikit-learn (Machine Learning in Python)

(PB13007115). 2016-07-12. Outline: 1 Introduction 2 scikit-learn examples 3 Captcha recognize

Supervised Learning Classification Algorithms Comparison

Aditya Singh Rathore, B.Tech, J.K. Lakshmipat University

Machine Learning. Chao Lan

Machine Learning. Chao Lan Machine Learning Chao Lan Machine Learning Prediction Models Regression Model - linear regression (least square, ridge regression, Lasso) Classification Model - naive Bayes, logistic regression, Gaussian

More information

SUPERVISED LEARNING METHODS. Stanley Liang, PhD Candidate, Lassonde School of Engineering, York University Helix Science Engagement Programs 2018

SUPERVISED LEARNING METHODS. Stanley Liang, PhD Candidate, Lassonde School of Engineering, York University Helix Science Engagement Programs 2018 SUPERVISED LEARNING METHODS Stanley Liang, PhD Candidate, Lassonde School of Engineering, York University Helix Science Engagement Programs 2018 2 CHOICE OF ML You cannot know which algorithm will work

More information

Python Certification Training

Python Certification Training About Intellipaat Intellipaat is a fast-growing professional training provider that is offering training in over 150 most sought-after tools and technologies. We have a learner base of 600,000 in over

More information

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /10/2017

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /10/2017 3/0/207 Neural Networks Emily Fox University of Washington March 0, 207 Slides adapted from Ali Farhadi (via Carlos Guestrin and Luke Zettlemoyer) Single-layer neural network 3/0/207 Perceptron as a neural

More information

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar What is data exploration? A preliminary exploration of the data to better understand its characteristics.

More information

Applied Regression Modeling: A Business Approach

Applied Regression Modeling: A Business Approach i Applied Regression Modeling: A Business Approach Computer software help: SPSS SPSS (originally Statistical Package for the Social Sciences ) is a commercial statistical software package with an easy-to-use

More information

Data Preprocessing. Javier Béjar. URL - Spring 2018 CS - MAI 1/78 BY: $\

Data Preprocessing. Javier Béjar. URL - Spring 2018 CS - MAI 1/78 BY: $\ Data Preprocessing Javier Béjar BY: $\ URL - Spring 2018 C CS - MAI 1/78 Introduction Data representation Unstructured datasets: Examples described by a flat set of attributes: attribute-value matrix Structured

More information

Lecture 27: Review. Reading: All chapters in ISLR. STATS 202: Data mining and analysis. December 6, 2017

Lecture 27: Review. Reading: All chapters in ISLR. STATS 202: Data mining and analysis. December 6, 2017 Lecture 27: Review Reading: All chapters in ISLR. STATS 202: Data mining and analysis December 6, 2017 1 / 16 Final exam: Announcements Tuesday, December 12, 8:30-11:30 am, in the following rooms: Last

More information

Clustering algorithms and autoencoders for anomaly detection

Clustering algorithms and autoencoders for anomaly detection Clustering algorithms and autoencoders for anomaly detection Alessia Saggio Lunch Seminars and Journal Clubs Université catholique de Louvain, Belgium 3rd March 2017 a Outline Introduction Clustering algorithms

More information

HANDS ON DATA MINING. By Amit Somech. Workshop in Data-science, March 2016

HANDS ON DATA MINING. By Amit Somech. Workshop in Data-science, March 2016 HANDS ON DATA MINING By Amit Somech Workshop in Data-science, March 2016 AGENDA Before you start TextEditors Some Excel Recap Setting up Python environment PIP ipython Scientific computation in Python

More information

Using Numerical Libraries on Spark

Using Numerical Libraries on Spark Using Numerical Libraries on Spark Brian Spector London Spark Users Meetup August 18 th, 2015 Experts in numerical algorithms and HPC services How to use existing libraries on Spark Call algorithm with

More information

Regression on SAT Scores of 374 High Schools and K-means on Clustering Schools

Regression on SAT Scores of 374 High Schools and K-means on Clustering Schools Regression on SAT Scores of 374 High Schools and K-means on Clustering Schools Abstract In this project, we study 374 public high schools in New York City. The project seeks to use regression techniques

More information

Visual Analytics. Visualizing multivariate data:

Visual Analytics. Visualizing multivariate data: Visual Analytics 1 Visualizing multivariate data: High density time-series plots Scatterplot matrices Parallel coordinate plots Temporal and spectral correlation plots Box plots Wavelets Radar and /or

More information

On Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions

On Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions On Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions CAMCOS Report Day December 9th, 2015 San Jose State University Project Theme: Classification The Kaggle Competition

More information

Random Forest A. Fornaser

Random Forest A. Fornaser Random Forest A. Fornaser alberto.fornaser@unitn.it Sources Lecture 15: decision trees, information theory and random forests, Dr. Richard E. Turner Trees and Random Forests, Adele Cutler, Utah State University

More information

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. CS 189 Spring 2015 Introduction to Machine Learning Final You have 2 hours 50 minutes for the exam. The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. No calculators or

More information

Ch.1 Introduction. Why Machine Learning (ML)? manual designing of rules requires knowing how humans do it.

Ch.1 Introduction. Why Machine Learning (ML)? manual designing of rules requires knowing how humans do it. Ch.1 Introduction Syllabus, prerequisites Notation: Means pencil-and-paper QUIZ Means coding QUIZ Code respository for our text: https://github.com/amueller/introduction_to_ml_with_python Why Machine Learning

More information

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Exploratory data analysis tasks Examine the data, in search of structures

More information

Brief Guide on Using SPSS 10.0

Brief Guide on Using SPSS 10.0 Brief Guide on Using SPSS 10.0 (Use student data, 22 cases, studentp.dat in Dr. Chang s Data Directory Page) (Page address: http://www.cis.ysu.edu/~chang/stat/) I. Processing File and Data To open a new

More information

CSE 158. Web Mining and Recommender Systems. Midterm recap

CSE 158. Web Mining and Recommender Systems. Midterm recap CSE 158 Web Mining and Recommender Systems Midterm recap Midterm on Wednesday! 5:10 pm 6:10 pm Closed book but I ll provide a similar level of basic info as in the last page of previous midterms CSE 158

More information

Bar Charts and Frequency Distributions

Bar Charts and Frequency Distributions Bar Charts and Frequency Distributions Use to display the distribution of categorical (nominal or ordinal) variables. For the continuous (numeric) variables, see the page Histograms, Descriptive Stats

More information

An Introduction to Preparing Data for Analysis with JMP. Full book available for purchase here. About This Book... ix About The Author...

An Introduction to Preparing Data for Analysis with JMP. Full book available for purchase here. About This Book... ix About The Author... An Introduction to Preparing Data for Analysis with JMP. Full book available for purchase here. Contents About This Book... ix About The Author... xiii Chapter 1: Data Management in the Analytics Process...

More information

Data Mining: Exploring Data. Lecture Notes for Data Exploration Chapter. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Data Exploration Chapter. Introduction to Data Mining Data Mining: Exploring Data Lecture Notes for Data Exploration Chapter Introduction to Data Mining by Tan, Steinbach, Karpatne, Kumar 02/03/2018 Introduction to Data Mining 1 What is data exploration?

More information

DATA STRUCTURE AND ALGORITHM USING PYTHON

DATA STRUCTURE AND ALGORITHM USING PYTHON DATA STRUCTURE AND ALGORITHM USING PYTHON Common Use Python Module II Peter Lo Pandas Data Structures and Data Analysis tools 2 What is Pandas? Pandas is an open-source Python library providing highperformance,

More information

SAS High-Performance Analytics Products

SAS High-Performance Analytics Products Fact Sheet What do SAS High-Performance Analytics products do? With high-performance analytics products from SAS, you can develop and process models that use huge amounts of diverse data. These products

More information

Ch.1 Introduction. Why Machine Learning (ML)?

Ch.1 Introduction. Why Machine Learning (ML)? Syllabus, prerequisites Ch.1 Introduction Notation: Means pencil-and-paper QUIZ Means coding QUIZ Why Machine Learning (ML)? Two problems with conventional if - else decision systems: brittleness: The

More information

book 2014/5/6 15:21 page v #3 List of figures List of tables Preface to the second edition Preface to the first edition

book 2014/5/6 15:21 page v #3 List of figures List of tables Preface to the second edition Preface to the first edition book 2014/5/6 15:21 page v #3 Contents List of figures List of tables Preface to the second edition Preface to the first edition xvii xix xxi xxiii 1 Data input and output 1 1.1 Input........................................

More information

Partitioning Data. IRDS: Evaluation, Debugging, and Diagnostics. Cross-Validation. Cross-Validation for parameter tuning

Partitioning Data. IRDS: Evaluation, Debugging, and Diagnostics. Cross-Validation. Cross-Validation for parameter tuning Partitioning Data IRDS: Evaluation, Debugging, and Diagnostics Charles Sutton University of Edinburgh Training Validation Test Training : Running learning algorithms Validation : Tuning parameters of learning

More information

Table of Contents. Introduction.*.. 7. Part /: Getting Started With MATLAB 5. Chapter 1: Introducing MATLAB and Its Many Uses 7

Table of Contents. Introduction.*.. 7. Part /: Getting Started With MATLAB 5. Chapter 1: Introducing MATLAB and Its Many Uses 7 MATLAB Table of Contents Introduction.*.. 7 About This Book 1 Foolish Assumptions 2 Icons Used in This Book 3 Beyond the Book 3 Where to Go from Here 4 Part /: Getting Started With MATLAB 5 Chapter 1:

More information

Intel Distribution for Python* и Intel Performance Libraries

Intel Distribution for Python* и Intel Performance Libraries Intel Distribution for Python* и Intel Performance Libraries 1 Motivation * L.Prechelt, An empirical comparison of seven programming languages, IEEE Computer, 2000, Vol. 33, Issue 10, pp. 23-29 ** RedMonk

More information

Contents. Foreword to Second Edition. Acknowledgments About the Authors

Contents. Foreword to Second Edition. Acknowledgments About the Authors Contents Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments About the Authors xxxi xxxv Chapter 1 Introduction 1 1.1 Why Data Mining? 1 1.1.1 Moving toward the Information Age 1

More information

CS 229 Project Report:

CS 229 Project Report: CS 229 Project Report: Machine learning to deliver blood more reliably: The Iron Man(drone) of Rwanda. Parikshit Deshpande (parikshd) [SU ID: 06122663] and Abhishek Akkur (abhakk01) [SU ID: 06325002] (CS

More information

Hal Varian, Google s Chief Economist The McKinsey Quarterly, Jan 2009

Hal Varian, Google s Chief Economist The McKinsey Quarterly, Jan 2009 The ability to take data to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it that s going to be a hugely important skill in the next decades, because

More information

Fathom Dynamic Data TM Version 2 Specifications

Fathom Dynamic Data TM Version 2 Specifications Data Sources Fathom Dynamic Data TM Version 2 Specifications Use data from one of the many sample documents that come with Fathom. Enter your own data by typing into a case table. Paste data from other

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

Data Mining with SPSS Modeler

Data Mining with SPSS Modeler Tilo Wendler Soren Grottrup Data Mining with SPSS Modeler Theory, Exercises and Solutions Springer 1 Introduction 1 1.1 The Concept of the SPSS Modeler 2 1.2 Structure and Features of This Book 5 1.2.1

More information

CAMCOS Report Day. December 9 th, 2015 San Jose State University Project Theme: Classification

CAMCOS Report Day. December 9 th, 2015 San Jose State University Project Theme: Classification CAMCOS Report Day December 9 th, 2015 San Jose State University Project Theme: Classification On Classification: An Empirical Study of Existing Algorithms based on two Kaggle Competitions Team 1 Team 2

More information

Python Certification Training

Python Certification Training About Intellipaat Intellipaat is a fast-growing professional training provider that is offering training in over 150 most sought-after tools and technologies. We have a learner base of 600,000 in over

More information

COPYRIGHT DATASHEET

COPYRIGHT DATASHEET Your Path to Enterprise AI To succeed in the world s rapidly evolving ecosystem, companies (no matter what their industry or size) must use data to continuously develop more innovative operations, processes,

More information

Facial Expression Classification with Random Filters Feature Extraction

Facial Expression Classification with Random Filters Feature Extraction Facial Expression Classification with Random Filters Feature Extraction Mengye Ren Facial Monkey mren@cs.toronto.edu Zhi Hao Luo It s Me lzh@cs.toronto.edu I. ABSTRACT In our work, we attempted to tackle

More information

Intel Distribution For Python*

Intel Distribution For Python* Intel Distribution For Python* Intel Distribution for Python* 2017 Advancing Python performance closer to native speeds Easy, out-of-the-box access to high performance Python High performance with multiple

More information

The exam is closed book, closed notes except your one-page cheat sheet.

The exam is closed book, closed notes except your one-page cheat sheet. CS 189 Fall 2015 Introduction to Machine Learning Final Please do not turn over the page before you are instructed to do so. You have 2 hours and 50 minutes. Please write your initials on the top-right

More information