CS435 Introduction to Big Data. FAQs. Readings (1/22/2018)


Computer Science, Colorado State University
1/17/2018

FAQs
- PA0 has been posted
  - Feb. 6, 5:00PM via Canvas
  - Individual submission (no team submission)
- Accommodation request, honor student
  - Contact me by Jan

Readings
PART 0. INTRODUCTION TO BIG DATA
- Reading research papers
  - Keshav's "How to Read a Paper"
  - "How to Read and Understand a Scientific Paper: A Step-by-Step Guide for Non-Scientists"

Topics
- Introduction to Big Data Analytics
- Data Collection, Sampling, and Preprocessing
- Introduction to MapReduce

Part 0. Introduction to Big Data Analytics: Data Collection, Sampling, and Preprocessing

This material is built based on
- Analytics in a Big Data World: The Essential Guide to Data Science and Its Applications, Bart Baesens, 2014, Wiley

Analytics Process Model
- The most time-consuming step is the data selection and preprocessing step
  - This is usually around 80% of the total time needed to build an analytical model

Types of Analytics
- "Analytics" is a term that is often used interchangeably with
  - Data science
  - Data mining
  - Knowledge discovery
- Predictive analytics
  - A target variable is typically available
  - e.g. linear/logistic regression, decision trees, neural networks, support vector machines
- Descriptive analytics
  - No target variable
  - e.g. clustering, association rules

Types of Data Sources
- Transactions
  - Structured, low-level, detailed information
  - Customer transactions: purchase, claim, cash transfer, credit card payment
  - Stored in massive online transaction processing (OLTP) relational databases
  - Can be summarized over longer time horizons (e.g. averages, relative trends, max/min values)
- Unstructured data embedded in text documents
  - e.g. emails, web pages, claim forms
  - Requires extensive preprocessing
- Qualitative, expert-based data
  - Requires analysis by subject matter experts (SMEs)
- Scientific data

Sampling
- Taking a subset of data for analytics
  - Generating hypotheses
  - Model selection
  - Feature selection
  - Speculative process
  - Building an analytics model
- Stratified sampling
  - Taking samples according to predefined strata
  - e.g. fraud detection with a very skewed data set (99 percent non-fraud customers, 1 percent fraud customers)
  - The sample should contain the same percentage of fraud customers as the original data

Types of Data Elements
- Continuous
  - Data elements that are defined on an interval that can be limited or unlimited
  - e.g. income, sales, temperature
- Categorical
  - Nominal
    - Data elements that can only take on a limited set of values with no meaningful ordering between them
    - e.g. marital status, profession, purpose of loan
  - Ordinal
    - Data elements that can only take on a limited set of values with a meaningful ordering between them
    - e.g. credit rating, age coded as young, middle age, and old
  - Binary
    - Data elements that can only take on two values
    - e.g. having a child, allowed to drive

Missing Values
- Missing values can occur for various reasons
  - The information can be non-applicable
  - The information can be undisclosed
  - The information can be unavailable

Missing Values (continued)
- Replace (impute)
  - Replaces the missing value with a computed/selected value
  - Imputation algorithm examples
    - Hot-deck: replaces with a randomly selected similar record
    - Cold-deck: selects the replacement from another dataset
    - Mean substitution: replaces with the mean of that variable for all other cases
    - Regression: predicts missing values of a variable based on other variables
- Delete
  - Deletes observations with lots of missing values
  - This assumes that information is missing at random and has no meaningful interpretation and/or relationship to the target
- Keep
  - Missing values can be meaningful
  - e.g. a customer did not disclose their income because of their current condition
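Mean substitution, as listed under the imputation examples, can be sketched in a few lines. This is a minimal illustration rather than a recommended implementation: the class name MeanImputation, the method impute, and the choice of NaN to encode a missing value are all assumptions made for this sketch.

```java
import java.util.Arrays;

/** Mean substitution: fill each missing value with the mean of the observed values. */
final class MeanImputation {

    /**
     * Returns a copy of values in which every missing entry is replaced by the
     * mean of the non-missing entries. Missing entries are encoded as NaN,
     * which is an assumption of this sketch.
     */
    static double[] impute(double[] values) {
        double sum = 0.0;
        int observed = 0;
        for (double v : values) {
            if (!Double.isNaN(v)) {
                sum += v;
                observed++;
            }
        }
        if (observed == 0) {
            throw new IllegalArgumentException("no observed values to average");
        }
        double mean = sum / observed;

        // Copy first so the caller's array is left untouched.
        double[] result = values.clone();
        for (int i = 0; i < result.length; i++) {
            if (Double.isNaN(result[i])) {
                result[i] = mean;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical income column with one undisclosed value.
        double[] income = {30.0, Double.NaN, 50.0, 40.0};
        System.out.println(Arrays.toString(impute(income)));
    }
}
```

Mean substitution keeps the variable's mean unchanged but shrinks its variance, which is one reason the hot-deck and regression schemes are listed as alternatives.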


Outliers of Dataset
- Outliers are extreme observations that are very dissimilar to the rest of the population
  - Valid observation: salary of the boss
  - Invalid observation: age is 300
- Multivariate outliers
  - Observations that are outlying in multiple dimensions
  - e.g. the temperature in Fort Collins is 100 degrees at midnight in December

Identifying Outliers using Box Plots
- A box plot represents three key quartiles of the data
  - Q1: 25% of the observations have a lower value
  - Q2: 50% of the observations have a lower value (the median, M)
  - Q3: 75% of the observations have a lower value
- The minimum and maximum values are added
- "Too far away" is now quantified as more than 1.5 x the interquartile range (IQR)

  $$\mathrm{IQR} = Q_3 - Q_1$$

[Figure: box plot marking Min, Q1, M, Q3, the 1.5 x IQR distance, and outliers]

Identifying Outliers using Z-Score
- Measures how many standard deviations an observation is away from the mean

  $$z_i = \frac{x_i - \mu}{\sigma}$$

  where μ represents the average of the variable and σ its standard deviation
- A practical rule of thumb defines outliers as observations whose absolute z-score |z| is bigger than 3

  ID   Age   Z-Score
  1    30    (30 - 40)/10 = -1
  2    50    (50 - 40)/10 = +1
  3    10    (10 - 40)/10 = -3
  4    40    (40 - 40)/10 =  0
  5    60    (60 - 40)/10 = +2
  6    80    (80 - 40)/10 = +4

  Age: μ = 40, σ = 10. Z-Score: μ = 0, σ = 1.

Dealing with Outliers
- Treat outliers as missing values
- Popular schemes
  - Truncation
    - Taking only values that are within the limits
  - Winsorizing
    - Limiting extreme values to reduce the effect of possibly spurious outliers
    - {92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, -40, 101, 86, 85, 15, 89, 89, 28, -5, 41} (N = 20, mean = 101.5)
      -> {92, 19, 101, 58, 101, 91, 26, 78, 10, 13, -5, 101, 86, 85, 15, 89, 89, 28, -5, 41} (N = 20, mean = 55.65)
  - Using the z-scores for truncation

Standardizing Data
- Scaling variables to a similar range
  - e.g. two variables: education and income
    - Education: elementary school (1), middle school (2), high school (3), college (4), graduate school (5)
    - Income: 0 ~ $5M
  - When building logistic regression models, the coefficient for education might become very small

Standardizing Data (continued)
- Min/max standardization

  $$X_{new} = \frac{X_{old} - \min(X_{old})}{\max(X_{old}) - \min(X_{old})} \, (newmax - newmin) + newmin$$

  where newmax and newmin are the newly imposed maximum and minimum (e.g. 1 and 0)
- Z-score based
  - Calculate the z-scores
- Decimal scaling

  $$X_{new} = \frac{X_{old}}{10^{n}}$$

  - Dividing by a power of 10
- Standardization is useful for regression-based approaches
  - It is not needed for decision trees
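The two main standardization schemes can be sketched in code. This is a minimal illustration, assuming the min/max formula X_new = (X_old - min) / (max - min) * (newmax - newmin) + newmin and the z-score z = (x - μ) / σ, with the rule of thumb |z| > 3 for outliers. The class name Standardize and its method names are hypothetical.

```java
/** Min/max and z-score standardization for a single value. */
final class Standardize {

    /** Rescales x from the range [oldMin, oldMax] to the range [newMin, newMax]. */
    static double minMax(double x, double oldMin, double oldMax,
                         double newMin, double newMax) {
        if (oldMax == oldMin) {
            throw new IllegalArgumentException("old range must not be empty");
        }
        return (x - oldMin) / (oldMax - oldMin) * (newMax - newMin) + newMin;
    }

    /** Number of standard deviations that x lies away from the mean. */
    static double zScore(double x, double mean, double stdDev) {
        return (x - mean) / stdDev;
    }

    /** Rule of thumb: an observation is an outlier when |z| is bigger than 3. */
    static boolean isOutlier(double x, double mean, double stdDev) {
        return Math.abs(zScore(x, mean, stdDev)) > 3.0;
    }

    public static void main(String[] args) {
        // Education coded 1..5, rescaled to [0, 1].
        for (int level = 1; level <= 5; level++) {
            System.out.println(level + " -> " + minMax(level, 1, 5, 0, 1));
        }
        // Ages standardized with mean 40 and standard deviation 10.
        double[] ages = {30, 50, 10, 40, 60, 80};
        for (double age : ages) {
            System.out.println(age + " -> z = " + zScore(age, 40, 10)
                    + ", outlier = " + isOutlier(age, 40, 10));
        }
    }
}
```

With μ = 40 and σ = 10, only the age of 80 (z = 4) exceeds the threshold. The age of 10 has z = -3 exactly and is not flagged, because the rule requires |z| to be strictly bigger than 3.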


Part 0. Introduction to Big Data Analytics: Big Data Technology Stack

In a nutshell
- Security and Governance
- Data Presentation Layer: Apache Kibana
- Data Integration Layer: Apache Flume, Apache Kafka, Apache Sqoop
- Data Processing Layer: Apache Hadoop MapReduce, Pig, Apache Spark, Cassandra, Storm, Mahout, MLLib
- Operations and Scheduling Layer: Apache Ambari, Apache Oozie, Apache Zookeeper
- Data Layer: Apache HDFS, Amazon AWS S3, IBM GPFS, Microsoft Azure

Part 1. Large Scale Data Analytics: Introduction to MapReduce

This material is developed based on
- Anand Rajaraman, Jure Leskovec, and Jeffrey Ullman, Mining of Massive Datasets, Cambridge University Press, Chapter 2
  - Download this chapter from the CS435 schedule page
- Hadoop: The Definitive Guide, Tom White, O'Reilly, 3rd Edition, 2014
- MapReduce Design Patterns, Donald Miner and Adam Shook, O'Reilly

What is MapReduce?

MapReduce [1/2]
- MapReduce is inspired by the concepts of map and reduce in Lisp
- "Modern" MapReduce
  - Developed within Google as a mechanism for processing large amounts of raw data
    - Crawled documents or web request logs
  - Distributes these data across thousands of machines
  - The same computations are performed on each CPU with a different dataset

MapReduce [2/2]
- MapReduce provides an abstraction that allows engineers to perform simple computations while hiding the details of parallelization, data distribution, load balancing, and fault tolerance

Mapper
- A Mapper maps input key/value pairs to a set of intermediate key/value pairs
  - Maps are the individual tasks that transform input records into intermediate records
  - The transformed intermediate records do not need to be of the same type as the input records
  - A given input pair may map to zero or many output pairs
- The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job

Reducer
- A Reducer reduces a set of intermediate values which share a key to a smaller set of values
- A Reducer has 3 primary phases: shuffle, sort, and reduce
  - Shuffle
    - Input to the reducer is the sorted output of the mappers
    - The framework fetches the relevant partition of the output of all the mappers via HTTP
  - Sort
    - The framework groups input to the reducer by keys

MapReduce Example 1

Example 1: WordCount [1/5]
- For text files stored under /usr/joe/wordcount/input, count the number of occurrences of each word
- How do the files and directory look?

  $ bin/hadoop dfs -ls /usr/joe/wordcount/input/
  /usr/joe/wordcount/input/file01
  /usr/joe/wordcount/input/file02
  $ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
  Hello World, Bye World!
  $ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
  Hello Hadoop, Goodbye to hadoop.

Example 1: WordCount [2/5]
- Run the MapReduce application

  $ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
  $ bin/hadoop dfs -cat /usr/joe/wordcount/output/part
  Bye 1
  Goodbye 1
  Hadoop, 1
  Hello 2
  World! 1
  World, 1
  hadoop. 1
  to 1

Example 1: WordCount [3/5]
- Mappers
  1. Read a line
  2. Tokenize the string
  3. Pass the <key,value> output to the reducer
- What do you have to pass from the Mappers?
- Reducers
  1. Collect <key,value> pairs sharing the same key
  2. Aggregate the total number of occurrences

Example 1: WordCount [5/5]

  package org.myorg;

  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class WordCount {
      // Mapper: emits <word, 1> for every whitespace-separated token in a line.
      public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              String line = value.toString();
              StringTokenizer tokenizer = new StringTokenizer(line);
              while (tokenizer.hasMoreTokens()) {
                  word.set(tokenizer.nextToken());
                  context.write(word, one);
              }
          }
      }

      // Reducer: sums the counts of all values that share a key.
      public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                  sum += val.get();
              }
              context.write(key, new IntWritable(sum));
          }
      }
  }

Questions?
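The data flow of the WordCount example (map, then shuffle/sort, then reduce) can be traced without a cluster. The sketch below imitates the three phases in memory with plain Java collections. It is not Hadoop code and does not use the Hadoop API: the class WordCountSketch and its methods are hypothetical names, and using a TreeMap to stand in for the framework's grouping and sorting by key is an assumption of this illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** In-memory imitation of the WordCount data flow (not Hadoop code). */
final class WordCountSketch {

    /** Map phase: emit one (word, 1) pair per whitespace-separated token. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    /** Shuffle/sort phase: group intermediate values by key, in sorted key order. */
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: sum the values that share a key. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Runs map over every line, then shuffle, then reduce for each key. */
    static TreeMap<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(map(line));
        }
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(intermediate).entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Contents of file01 and file02 from the example.
        List<String> lines = Arrays.asList(
                "Hello World, Bye World!",
                "Hello Hadoop, Goodbye to hadoop.");
        wordCount(lines).forEach((word, count) -> System.out.println(word + " " + count));
    }
}
```

Because tokens are split on whitespace only, "World," and "World!" remain distinct keys, and "Hadoop," and "hadoop." differ by case and punctuation, which matches the output listing in the example.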
