FAQs. Topics. This Material is Built Based on, Analytics Process Model. 8/22/2018 Week 1-B Sangmi Lee Pallickara



CS435 Introduction to Big Data, Week 1-B

W1.B.0 CS435 Introduction to Big Data
  PART 0. INTRODUCTION TO BIG DATA
  Computer Science, Colorado State University

W1.B.1 FAQs
  PA0 has been posted
    August 31, 5:00PM via Canvas
    Individual submission (no team submission)
  No cell phones in the class.
  If you need to use a laptop, please sit in the back row.
    I will ask you to turn off your laptop if it seems to be distracting to others.
  Accommodation request, honor student
    Contact me by August 31, 2018
  Readings
    Reading research papers
      Keshav's "How to Read a Paper"
      "How to Read and Understand a Scientific Paper: A Step-by-Step Guide for Non-Scientists"

W1.B.2 W1.B.3
Topics
  Introduction to Big Data Analytics
  Data Collection, Sampling, and Preprocessing
  Introduction to MapReduce
Part 0. Introduction Big Data Analytics - Data Collection, Sampling, and Preprocessing

W1.B.4 W1.B.5
This Material is Built Based on
  Analytics in a Big Data World: The Essential Guide to Data Science and Its Applications, Bart Baesens, 2014, Wiley
Analytics Process Model
  The most time-consuming step is the data selection and preprocessing step
    This is usually around 80% of the total time needed to build an analytical model
  Source: Analytics in a Big Data World: The Essential Guide to Data Science and Its Applications, Bart Baesens, 2014, Wiley


W1.B.6 Types of Analytics
  Analytics is a term that is often used interchangeably with
    Data science
    Data mining
    Knowledge discovery
  Predictive analytics
    A target variable is typically available
    e.g. linear/logistic regression, decision trees, neural networks, support vector machines
  Descriptive analytics
    No target variable
    e.g. clustering, association rules

W1.B.7 Types of Data Sources
  Transactions
    Structured, low-level, detailed information
    Customer transactions
      Purchase, claim, cash transfer, credit card payment
    Stored in massive online transaction processing (OLTP) relational databases
    Can be summarized over longer time horizons (e.g. averages, relative trends, max/min values)
  Unstructured data embedded in text documents
    Emails, web pages, claim forms
    Requires extensive preprocessing
  Qualitative, expert-based data
    Requires subject matter expert (SME) analysis
  Scientific data

W1.B.8 Type of Data Consumers
  New types of data consumers
    Gather data in a particular setting (credit risk, marketing)
    Build models
    Sell outputs
    e.g. Dun & Bradstreet, Bureau Van Dijk, Thomson Reuters

W1.B.9 Sampling
  Taking a subset of data for analytics
    Generating hypothesis
    Model selection
    Feature selection
    Speculative process
    Building analytics model
  Stratified sampling
    Taking samples according to predefined strata
    e.g. fraud detection with very skewed records (99 percent non-fraud customers, 1 percent fraud customers)
    The sample should contain the same percentage of fraud customers as the original data

W1.B.10 Types of Data Elements
  Continuous
    Data elements that are defined on an interval that can be limited or unlimited
    e.g. income, sales, temperature
  Categorical
    Nominal
      Data elements that can only take on a limited set of values with no meaningful ordering between them
      e.g. marital status, profession, purpose of loan
    Ordinal
      Data elements that can only take on a limited set of values with a meaningful ordering between them
      e.g. credit rating, age coded as young, middle age and old
    Binary
      Data elements that can only take on two values
      e.g. having a child, allowed to drive

W1.B.11 Missing Values
  Missing values can occur for various reasons
    The information can be non-applicable
    The information can be undisclosed
    The information can be unavailable
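The stratified sampling idea from W1.B.9 can be sketched in a few lines of Java. This is only an illustration under stated assumptions: each record is represented by its stratum label alone, and the class name `StratifiedSample`, the method `sample`, the 10% sampling fraction, and the fixed random seed are all hypothetical choices, not part of the lecture material.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class StratifiedSample {

    /**
     * Draws roughly `fraction` of the records from every stratum, so each
     * stratum keeps its original share in the sample.
     * Assumption: a record is identified by its stratum label only.
     */
    static List<String> sample(List<String> labels, double fraction, long seed) {
        // Group the records by stratum.
        Map<String, List<String>> strata = new LinkedHashMap<>();
        for (String label : labels) {
            strata.computeIfAbsent(label, k -> new ArrayList<>()).add(label);
        }
        Random rng = new Random(seed);
        List<String> result = new ArrayList<>();
        for (List<String> stratum : strata.values()) {
            // Shuffle a copy, then keep the first `take` records of this stratum.
            List<String> shuffled = new ArrayList<>(stratum);
            Collections.shuffle(shuffled, rng);
            int take = (int) Math.round(stratum.size() * fraction);
            result.addAll(shuffled.subList(0, take));
        }
        return result;
    }

    public static void main(String[] args) {
        // Skewed data as on the slide: 99 percent non-fraud, 1 percent fraud.
        List<String> customers = new ArrayList<>();
        for (int i = 0; i < 990; i++) customers.add("non-fraud");
        for (int i = 0; i < 10; i++) customers.add("fraud");

        List<String> s = sample(customers, 0.10, 42L);
        long fraud = s.stream().filter("fraud"::equals).count();
        System.out.println(s.size() + " sampled, " + fraud + " fraud");
    }
}
```

With 990 non-fraud and 10 fraud records, a 10% draw per stratum yields 99 and 1 records respectively, so the sample keeps the 1 percent fraud rate. A plain random sample of the same size could easily contain no fraud records at all.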

W1.B.12 Missing Values --continued
  Replace (impute)
    Replaces the missing value with a computed/selected value
    Imputation algorithm examples
      Hot-deck: replaces with a randomly selected similar record
      Cold-deck: selects the replacement from another dataset
      Mean substitution: replaces with the mean of that variable over all other cases
      Regression: predicts missing values of a variable based on other variables
  Delete
    Deletes observations with lots of missing values
    This assumes that information is missing at random and has no meaningful interpretation and/or relationship to the target
  Keep
    Missing values can be meaningful
    e.g. a customer did not disclose the income for current condition

W1.B.13 Outliers of Dataset
  Outliers are extreme observations that are very dissimilar to the rest of the population
    Valid observation: salary of the boss
    Invalid observation: age is 300
  Multivariate outliers
    Observations that are outlying in multiple dimensions
    e.g. the temperature in Fort Collins is 100 degrees, but at midnight in December

W1.B.14 Identifying Outliers using Box Plots
  A box plot represents three key quartiles of the data
    Q1: 25% of the observations have a lower value
    Q2: 50% of the observations have a lower value (the median, M)
    Q3: 75% of the observations have a lower value
  The minimum and maximum values are added
  "Too far away" is now quantified as more than 1.5 x the interquartile range (IQR = Q3 - Q1)
  [Figure: box plot marking Min, Q1, M, Q3, the 1.5 x IQR distance, and outliers]

W1.B.15 Identifying Outliers using Z-Score
  Measuring how many standard deviations an observation is away from the mean
    $z_i = \frac{x_i - \mu}{\sigma}$
    where μ represents the average of the variable and σ its standard deviation
  A practical rule of thumb defines outliers as observations whose absolute z-score |z| is bigger than 3

    ID   Age   Z-Score
    1    30    (30-40)/10 = -1
    2    50    (50-40)/10 = +1
    3    10    (10-40)/10 = -3
    4    40    (40-40)/10 = 0
    5    60    (60-40)/10 = +2
    6    80    (80-40)/10 = +4
         μ = 40, σ = 10    μ = 0, σ = 1

W1.B.16 W1.B.17
Dealing with Outliers
  Treat outliers as missing values
  Popular schemes
    Truncation
      Taking only values that are within the limits
    Winsorizing
      Limiting extreme values to reduce the effect of possibly spurious outliers
      e.g. 90% winsorization
        The data below the 5th percentile and above the 95th percentile are replaced with the neighboring values
        {92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, -40, 101, 86, 85, 15, 89, 89, 28, -5, 41} (N = 20, mean = 101.5)
        -> {92, 19, 101, 58, 101, 91, 26, 78, 10, 13, -5, 101, 86, 85, 15, 89, 89, 28, -5, 41} (N = 20, mean = 55.65)
  Using the z-scores for truncation
Part 0. Introduction Big Data Analytics - Big Data Technology Stack

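Looking back at the outlier slides (W1.B.15 and W1.B.16), the z-score rule and the 90% winsorization example can be checked with a short Java sketch. The class and method names (`OutlierSketch`, `zScores`, `winsorize`, `mean`) are hypothetical. The percentile convention is also an assumption: with N values and a tail fraction t, the k = round(t x N) smallest and largest values are pulled in to their nearest remaining neighbors. The slide's μ = 40 and σ = 10 are passed in directly rather than estimated from the listed ages.

```java
import java.util.Arrays;

public class OutlierSketch {

    /** z = (x - mu) / sigma for every value; mu and sigma are supplied by the caller. */
    static double[] zScores(double[] x, double mu, double sigma) {
        double[] z = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            z[i] = (x[i] - mu) / sigma;
        }
        return z;
    }

    /**
     * Winsorizes both tails: values below the (k+1)-th smallest are raised to it,
     * values above the (k+1)-th largest are lowered to it, with
     * k = round(tailFraction * n). This percentile convention is an assumption.
     */
    static double[] winsorize(double[] x, double tailFraction) {
        double[] sorted = x.clone();
        Arrays.sort(sorted);
        int k = (int) Math.round(tailFraction * x.length);
        double lo = sorted[k];
        double hi = sorted[x.length - 1 - k];
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            out[i] = Math.min(hi, Math.max(lo, x[i]));
        }
        return out;
    }

    static double mean(double[] x) {
        return Arrays.stream(x).sum() / x.length;
    }

    public static void main(String[] args) {
        // Ages from the z-score table, with the slide's mu = 40 and sigma = 10.
        double[] ages = {30, 50, 10, 40, 60, 80};
        double[] z = zScores(ages, 40, 10);
        for (int i = 0; i < ages.length; i++) {
            if (Math.abs(z[i]) > 3) {
                System.out.println("age " + ages[i] + " is an outlier, z = " + z[i]);
            }
        }

        // Data from the 90% winsorization example (5% trimmed from each tail).
        double[] data = {92, 19, 101, 58, 1053, 91, 26, 78, 10, 13,
                         -40, 101, 86, 85, 15, 89, 89, 28, -5, 41};
        System.out.println("mean before: " + mean(data));
        System.out.println("mean after:  " + mean(winsorize(data, 0.05)));
    }
}
```

Under these assumptions only age 80 (z = 4) is flagged, since age 10 has |z| exactly 3 and the rule asks for a value bigger than 3. The winsorized data replaces 1053 with 101 and -40 with -5, which moves the mean from 101.5 to 55.65 as on the slide.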

W1.B.18 W1.B.19 W1.B.20 W1.B.21
In a nutshell
  Security and Governance
  Data Presentation Layer
    Apache Kibana
  Data Integration Layer
    Apache Flume, Apache Kafka, Apache Sqoop
  Operations and Scheduling Layer
    Apache Ambari, Apache Oozie, Apache Zookeeper
  Data Processing Layer
    Apache Hadoop MapReduce, Pig, Apache Spark, Cassandra, Storm, Mahout, MLLib
  Data Layer
    Apache HDFS, Amazon AWS's S3, IBM GPFS, Microsoft Azure

W1.B.22 W1.B.23
Part 1. Large Scale Data Analytics: Introduction to MapReduce
This material is developed based on
  Anand Rajaraman, Jure Leskovec, and Jeffrey Ullman, Mining of Massive Datasets, Cambridge University Press, Chapter 2
    Download this chapter from the CS435 schedule page
  Hadoop: The Definitive Guide, Tom White, O'Reilly, 3rd Edition, 2014
  MapReduce Design Patterns, Donald Miner and Adam Shook, O'Reilly

W1.B.24 W1.B.25
What is MapReduce?
MapReduce [1/2]
  MapReduce is inspired by the concepts of map and reduce in Lisp.
  Modern MapReduce
    Developed within Google as a mechanism for processing large amounts of raw data
      Crawled documents or web request logs
    Distributes these data across thousands of machines
    The same computations are performed on each CPU with a different dataset

W1.B.26 MapReduce [2/2]
  MapReduce provides an abstraction that allows engineers to perform simple computations while hiding the details of parallelization, data distribution, load balancing and fault tolerance

W1.B.27 Mapper
  Mapper maps input key/value pairs to a set of intermediate key/value pairs
  Maps are the individual tasks that transform input records into intermediate records
    The transformed intermediate records do not need to be of the same type as the input records
    A given input pair may map to zero or many output pairs
  The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job

W1.B.28 W1.B.29
Reducer
  Reducer reduces a set of intermediate values which share a key to a smaller set of values
  Reducer has 3 primary phases: shuffle, sort and reduce
  Shuffle
    Input to the reducer is the sorted output of the mappers
    The framework fetches the relevant partition of the output of all the mappers via HTTP
  Sort
    The framework groups input to the reducer by keys
MapReduce Example 1
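To make the shuffle and sort phases described for the Reducer more concrete, here is a small in-memory Java sketch that does not use Hadoop at all. It assumes that keys are assigned to reducers by hashing the key modulo the number of reducers (the slides only say "the relevant partition", so this rule is an assumption), and the names `ShuffleSketch`, `partition`, and `shuffleAndSort` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleSketch {

    /** Picks a reducer for a key (assumption: non-negative hash modulo reducer count). */
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /**
     * Routes every intermediate <key, value> pair to its reducer and groups the
     * values by key. A TreeMap keeps each reducer's keys in sorted order.
     */
    static List<TreeMap<String, List<Integer>>> shuffleAndSort(
            List<Map.Entry<String, Integer>> pairs, int numReducers) {
        List<TreeMap<String, List<Integer>>> reducers = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            reducers.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> pair : pairs) {
            reducers.get(partition(pair.getKey(), numReducers))
                    .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                    .add(pair.getValue());
        }
        return reducers;
    }

    public static void main(String[] args) {
        // Intermediate pairs as a mapper might emit them for a word count.
        List<Map.Entry<String, Integer>> pairs = List.of(
                Map.entry("Hello", 1), Map.entry("World", 1),
                Map.entry("Hello", 1), Map.entry("Bye", 1));

        List<TreeMap<String, List<Integer>>> reducers = shuffleAndSort(pairs, 2);
        for (int r = 0; r < reducers.size(); r++) {
            System.out.println("reducer " + r + " receives " + reducers.get(r));
        }
    }
}
```

The point of the sketch is the invariant rather than the exact assignment: all values that share a key end up at the same reducer, grouped under that key, which is what lets the reduce phase aggregate them in one call.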

W1.B.30 Example 1: WordCount [1/5]
  For text files stored under /usr/joe/wordcount/input, count the number of occurrences of each word
  What do the files and directory look like?
    $ bin/hadoop dfs -ls /usr/joe/wordcount/input/
    /usr/joe/wordcount/input/file01
    /usr/joe/wordcount/input/file02
    $ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
    Hello World, Bye World!
    $ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
    Hello Hadoop, Goodbye to hadoop.

W1.B.31 Example 1: WordCount [2/5]
  Run the MapReduce application
    $ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
    $ bin/hadoop dfs -cat /usr/joe/wordcount/output/part
    Bye 1
    Goodbye 1
    Hadoop, 1
    Hello 2
    World! 1
    World, 1
    hadoop. 1
    to 1

W1.B.32 Example 1: WordCount [3/5]
  Mappers
    1. Read a line
    2. Tokenize the string
    3. Pass the <key,value> output to the reducer
  What do you have to pass from the Mappers?
  Reducers
    1. Collect <key,value> pairs sharing the same key
    2. Aggregate the total number of occurrences

W1.B.33 Example 1: WordCount [4/5]
    // Imports needed by Map and Reduce (top of the enclosing source file)
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emits <word, 1> for every whitespace-separated token of the input line
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

W1.B.34 W1.B.35
Example 1: WordCount [5/5]
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        // Sums the counts of all values that share the same word
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

Questions?
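The WordCount logic above can also be tried without a Hadoop cluster. The sketch below is a plain-Java stand-in, not the Hadoop job itself: it tokenizes with `StringTokenizer` as the Mapper does and sums counts per word as the Reducer does, using a `TreeMap` to mimic sorted keys. The class name `LocalWordCount` and the method `count` are hypothetical, and the two input strings are the contents of file01 and file02 from the slides.

```java
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class LocalWordCount {

    /** Counts whitespace-separated tokens; TreeMap keeps the words in sorted order. */
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // "Map" step: split the line into tokens, one <word, 1> per token.
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                // "Reduce" step folded in: add 1 to the running total for this word.
                counts.merge(tokenizer.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "Hello World, Bye World!",
                "Hello Hadoop, Goodbye to hadoop.");
        count(lines).forEach((word, n) -> System.out.println(word + " " + n));
    }
}
```

Because tokens are split on whitespace only, punctuation stays attached and case is preserved, which is why "World," and "World!" (and "Hadoop," and "hadoop.") are counted as different words in the output listed on the slide.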
