Data Platforms and Pattern Mining

1 Data Platforms and Pattern Mining
Morteza Zihayat
IBM Corporation

2 About Myself (IBM Software Group)
- Big Data Scientist
  - Platform Computing, IBM (2014-Now)
- PhD Candidate (2011-Now)
  - Lassonde School of Engineering, York University
- Main research interests
  - Big data mining and engineering
    - Finding meaningful patterns from structured and unstructured big data
    - Parallel and distributed data mining
  - Mining dynamic and stream data
  - Social network analysis

3 Agenda
- How Big is this Big Data?
- Big Data and Apache Hadoop
- Apache Spark: Lightning-fast cluster computing
- IBM Platform Conductor for Spark
- A Case Study
  - Mining High Utility Sequential Patterns
  - User behavioral analysis: Globe and Mail newspaper

4 How Big is this Big Data?
- 2.9 million: number of mails sent every second
- 375 megabytes: data consumed by households each day
- 20 hours: video uploaded to YouTube every minute
- 24 petabytes: data per day processed by Google
- 50 million: tweets per day
- 700 billion: total minutes spent on Facebook each month
- 1.3 exabytes: data sent and received by mobile internet users
- 398 items: products ordered on Amazon per second

5 What is Big Data?

6 Challenges
- Reading from and writing to disk is slow
  - Use multiple disks for parallel reads
- Hardware failure
  - Single machine/disk failure
  - Keep multiple copies of data
- How do we merge data from different reads?
  - Distributed processing, or Hadoop MapReduce

7 Apache Hadoop
- Hadoop is an open-source software framework written in Java
  - Distributed storage
  - Distributed processing of very large data sets
  - Computer clusters
- Two main components
  - HDFS (Hadoop Distributed File System): provides distributed storage
  - MapReduce (distributed data processing model): provides distributed processing

8 HDFS
- Based on Google's GFS (Google File System)
- Data is distributed across all nodes
- Built around the idea of write-once, read-many-times (WORM)
- Files are stored as blocks
  - 64 MB per block
  - Different blocks of the same file are stored on different nodes
  - The same block is replicated across several nodes for redundancy
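
As a rough illustration of the block idea on the HDFS slide, the sketch below computes how many 64 MB blocks a file needs and spreads each block's replicas over several nodes. The replication factor of 3, the node names, and the round-robin placement are assumptions made for this example; real HDFS placement is rack-aware and is decided by the NameNode.

```python
import math

BLOCK_SIZE_MB = 64  # block size quoted on the slide


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed to store a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def place_blocks(file_size_mb, nodes, replication=3):
    """Assign every block to `replication` distinct nodes (toy round-robin)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(block_count(file_size_mb)):
        # Consecutive offsets modulo len(nodes) are distinct, so replicas
        # of one block never land on the same node.
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement


nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical DataNodes
layout = place_blocks(200, nodes)  # a 200 MB file needs 4 blocks
for block, replicas in layout.items():
    print(f"block {block}: {replicas}")
```

Losing any single node in this layout still leaves two copies of every block, which is the point of replicating blocks across nodes.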

9 HDFS Architecture
- NameNode
  - Stores metadata (number of blocks, on which rack and DataNode each block is stored, and other details)
- DataNode
  - Stores the actual data
- There is only one NameNode in a cluster, and many DataNodes

10 MapReduce
- MapReduce is a method for distributing a task across multiple nodes
- Consists of two phases
  - Map
  - Reduce
- The data is processed in the form of <key,value> pairs
- Each map task processes a discrete portion of the overall data
- After all map tasks are complete, the system distributes the intermediate data to the nodes that perform the Reduce phase (aggregation)

11 Example: Word counting
- Consider the problem of counting the number of occurrences of each word in a large collection of documents
  - Input: documents
  - Output: <word,frequency>
- Divide the collection of documents among the class
- Each student counts the individual words in a document, and repeats this for an assigned quota of documents
- Sum up the counts from all the documents to give the final answer

12 Word Count Execution
- Input, counted per student
  - Student 1, "the quick brown fox": (the, 1), (brown, 1), (fox, 1), (quick, 1)
  - Student 2, "the fox ate the cow": (the, 1), (fox, 1), (the, 1), (ate, 1), (cow, 1)
  - Student 3, "how now brown cow": (how, 1), (now, 1), (brown, 1), (cow, 1)
- Output, summed per student
  - Student 1: (brown, 2), (fox, 2), (how, 1), (now, 1), (the, 3)
  - Student 2: (ate, 1), (cow, 2), (quick, 1)

13 Word Count Execution
- Input, one map task per document
  - Map, "the quick brown fox": (the, 1), (brown, 1), (fox, 1), (quick, 1)
  - Map, "the fox ate the cow": (the, 1), (fox, 1), (the, 1), (ate, 1), (cow, 1)
  - Map, "how now brown cow": (how, 1), (now, 1), (brown, 1), (cow, 1)
- Output, one list per reduce task
  - Reduce: (brown, 2), (fox, 2), (how, 1), (now, 1), (the, 3)
  - Reduce: (ate, 1), (cow, 2), (quick, 1)
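
The word-count flow on these slides can be mimicked in a single process. The following is a minimal sketch in plain Python, not Hadoop code: the function names `map_phase`, `shuffle`, and `reduce_phase` are made up for this example, and everything runs sequentially rather than on a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.split()]


def shuffle(mapped_outputs):
    """Shuffle: group all emitted values by key across the map outputs."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for word, one in pairs:
            groups[word].append(one)
    return groups


def reduce_phase(groups):
    """Reduce: sum the grouped values for each word."""
    return {word: sum(ones) for word, ones in groups.items()}


# The three input documents from the slide.
documents = ["the quick brown fox", "the fox ate the cow", "how now brown cow"]

mapped = [map_phase(doc) for doc in documents]  # one map task per document
counts = reduce_phase(shuffle(mapped))

for word in sorted(counts):
    print(word, counts[word])
```

In a real cluster the shuffle step also decides which reducer receives which keys; here a single dictionary plays the role of all reducers.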

14 MapReduce: Execution overview
- The master server distributes M map tasks to machines and monitors their progress.
- Each map task reads its allocated data and saves the map results in a local buffer.
- The shuffle phase assigns reducers to these buffers, which are remotely read and processed by the reducers.
- Reducers write the results to stable storage.

15 MapReduce on a Cluster with Hadoop HDFS

16 Problem #1
- MapReduce I/O sandbags runtime for advanced analytics.
  - Must persist results after each pass through the data
  - Advanced analytics often requires multiple passes through the data
- [Diagram: Hadoop storage -> compute -> store -> Hadoop storage]

17 Example: Logistic Regression
- Goal: find the best line separating two sets of points
- [Figure: random initial line and target line]
- [Plot: running time (s) vs. number of iterations, Hadoop]

18 Problem #2
- Many point solutions for advanced analytics in Hadoop
  - Queries: Apache Drill, ...
  - Machine learning: Mahout, H2O, ...
  - Graph analytics: GraphLab, ...
  - Streaming analytics: S4, ...

19 Apache Spark
- Developed in response to limitations of MapReduce
- Originally developed in 2009 at UC Berkeley's AMPLab
- Fully open sourced in 2010
- Now a top-level project at the Apache Software Foundation

20 Easy and Fast Big Data
- Easy to develop
  - Rich APIs in Java, Scala, Python
  - Interactive shell
  - 2-5x less code
- Fast to run
  - In-memory storage
  - Up to 10x faster on disk, 100x in memory

21 Resilient Distributed Datasets (RDD)
- Spark revolves around RDDs
- An RDD is a fault-tolerant collection of elements that can be operated on in parallel
  - Parallelized collection: a Scala collection that is processed in parallel
  - Hadoop dataset: records of files supported by Hadoop
  - The data flows from one iteration to the next through memory, without touching the disk
  - When memory is not sufficient for the data to fit, the data can either be spilled to disk or left to be recreated when it is requested again

22 RDD Operations
- Transformations
  - Create a new dataset from an existing one
  - map, filter, distinct, union, sample, groupByKey, join, etc.
- Actions
  - Return a value after running a computation
  - collect, count, first, takeSample, foreach, etc.
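
The split between transformations and actions is easiest to see in a toy model. The sketch below is plain Python and is not the Spark API: `MiniRDD` is a hypothetical class written for this example, borrowing only the method names `map`, `filter`, `collect`, and `count` from the slide. It shows that transformations merely record a recipe and that work happens only when an action is called.

```python
class MiniRDD:
    """Toy stand-in for an RDD: it stores a recipe, not the data."""

    def __init__(self, make_iterator):
        # make_iterator is a zero-argument callable that replays the pipeline.
        self._make_iterator = make_iterator

    @classmethod
    def parallelize(cls, items):
        data = list(items)
        return cls(lambda: iter(data))

    # Transformations: return a new MiniRDD, compute nothing yet.
    def map(self, func):
        return MiniRDD(lambda: (func(x) for x in self._make_iterator()))

    def filter(self, predicate):
        return MiniRDD(lambda: (x for x in self._make_iterator() if predicate(x)))

    # Actions: run the whole pipeline and return a value.
    def collect(self):
        return list(self._make_iterator())

    def count(self):
        return sum(1 for _ in self._make_iterator())


evaluated = []  # records every call to square()


def square(x):
    evaluated.append(x)
    return x * x


odd_squares = MiniRDD.parallelize(range(1, 6)).map(square).filter(lambda v: v % 2 == 1)
evaluated_before_action = list(evaluated)  # still empty: nothing has run

result = odd_squares.collect()  # first action runs the pipeline
total = odd_squares.count()     # second action runs it again from scratch

print(evaluated_before_action, result, total, len(evaluated))
```

Each action replays the full recipe here, which is why repeated actions over the same dataset motivate persistence.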

23 RDD Persistence / Caching
- Variety of storage levels
  - MEMORY_ONLY (default), MEMORY_AND_DISK, etc.
- API calls
  - persist(StorageLevel)
  - cache(): shorthand for persist(StorageLevel.MEMORY_ONLY)
- Considerations
  - Read from disk vs. recompute (MEMORY_AND_DISK)
  - Total memory storage size (MEMORY_ONLY_SER)
  - Replicate to a second node for faster fault recovery (MEMORY_ONLY_2)
    - Think about this option if supporting a web application
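
To make the recompute-versus-cache trade-off concrete, here is a small sketch in plain Python. It is not Spark code: `ToyDataset`, `compute_runs`, and `expensive_load` are names invented for this example, and the only behaviour modelled is that an uncached dataset is recomputed on every action while a cached one is materialized once and then reused.

```python
class ToyDataset:
    """Toy dataset that is recomputed on every action unless cached."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument callable producing the data
        self._persist = False
        self._cached = None
        self.compute_runs = 0     # how many times the data was rebuilt

    def cache(self):
        """Mark the dataset for in-memory reuse (loosely like cache())."""
        self._persist = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        self.compute_runs += 1
        data = list(self._compute())
        if self._persist:
            self._cached = data   # keep the result for later actions
        return data

    def count(self):
        return len(self._materialize())

    def collect(self):
        return list(self._materialize())


def expensive_load():
    # Stand-in for reading and parsing a large input.
    return (x * x for x in range(5))


plain = ToyDataset(expensive_load)
cached = ToyDataset(expensive_load).cache()

for _ in range(3):  # e.g. three iterations of an iterative algorithm
    plain.count()
    cached.count()

print(plain.compute_runs, cached.compute_runs)
```

Iterative jobs such as the logistic regression example touch the same data on every pass, so avoiding the repeated rebuild is where most of the speed-up comes from.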

24 Cache Scaling Matters
- [Plot: execution time (s) vs. % of working set in cache (cache disabled, 25%, 50%, 75%, fully cached)]

25 Logistic Regression Performance
- [Plot: running time (s) vs. number of iterations, Hadoop vs. Spark]

26 A single integrated platform for advanced analytics

27 Spark Performance
- Machine learning
  - 100x faster than MapReduce
- Queries (Shark)
  - 100x faster than Hive
- Streaming
  - 2x throughput of Storm
- Graph (GraphX)
  - 10x faster than MapReduce

28 IBM Platform Conductor for Spark
- Addresses the following challenges
  - Integrating Spark into existing environments
  - Managing Spark lifecycles in the face of frequent updates to open-source Spark distributions
  - Managing numerous Spark deployments within the organization
- Benefits
  - Cuts costs by using a service orchestration framework to maximize resource utilization
  - Improves performance and efficiency
  - Simplifies application management

29 A Case Study: High Utility Sequential Pattern Mining in Big Data

30 Frequent Pattern Mining
- Frequent Pattern Mining (FPM)
  - FPM is a fundamental research topic in data mining
  - Example application: discover sets of items (i.e., itemsets) that are frequently purchased together by customers
- Minimum support threshold: 60%
TID  Transaction
T1   {Bread, Milk}
T2   {Bread, Milk}
T3   {Bread, Milk, Diaper, Beer}
T4   {Bread, Milk, Diaper, Beer}
T5   {Diamond, Necklace}
T6   {Diamond, Necklace}
- Sup({Bread, Milk}) = 4/6 = 66.7%, which is above the 60% threshold, so {Bread, Milk} is a frequent itemset
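
The support calculation on this slide can be checked with a few lines of Python. This is a minimal sketch over the six transactions shown; the helper names `support` and `is_frequent` are chosen for this example.

```python
# The six transactions from the slide, as sets of items.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Milk"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Diamond", "Necklace"},
    {"Diamond", "Necklace"},
]

MIN_SUPPORT = 0.60  # minimum support threshold from the slide


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def is_frequent(itemset, transactions, min_support=MIN_SUPPORT):
    return support(itemset, transactions) >= min_support


for itemset in ({"Bread", "Milk"}, {"Diaper", "Beer"}, {"Diamond", "Necklace"}):
    print(sorted(itemset), round(support(itemset, transactions), 3),
          is_frequent(itemset, transactions))
```

Only {Bread, Milk} clears the 60% threshold; the other two itemsets each appear in 2 of the 6 transactions.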

31 Domain Driven Actionable Knowledge Discovery
- The identified patterns are handed over to business people
- They cannot interpret the patterns for business use
  - There are too many patterns, and many are not informative
  - The patterns are not relevant to business needs
  - It is unclear how to turn the patterns into business actions

32 Insufficiency of Frequent Pattern Mining
- In market analysis
  - Business objective: increase revenue
    - Which patterns can bring high profits and make high revenue?
  - May lose infrequent but valuable patterns
  - May present too many frequent but unprofitable patterns
  - Cannot find patterns having high profits

33 A Motivation Example
TID  Transaction, as item(quantity)
T1   {Bread(1), Milk(1)}
T2   {Bread(1), Milk(1)}
T3   {Bread(1), Milk(1), Diaper(3), Beer(6)}
T4   {Bread(1), Milk(1), Diaper(3), Beer(6)}
T5   {Diamond(1), Necklace(1)}
T6   {Diamond(1), Necklace(1)}
Item      Unit profit
Bread     20
Milk      30
Diamond   1,000
Necklace  300
Diaper    300
Beer      70
- {Bread, Milk}: $200
- {Diamond, Necklace}: $2600
- {Diaper, Beer}: $
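
The profit figures in the motivation example follow from multiplying quantities by unit profits and summing over the transactions that contain the whole itemset. The sketch below reproduces that arithmetic from the slide's two tables; `total_profit` is a helper name chosen for this example.

```python
# Unit profit per item, from the slide.
UNIT_PROFIT = {
    "Bread": 20, "Milk": 30, "Diamond": 1000,
    "Necklace": 300, "Diaper": 300, "Beer": 70,
}

# Transactions T1..T6 as {item: quantity}.
TRANSACTIONS = [
    {"Bread": 1, "Milk": 1},
    {"Bread": 1, "Milk": 1},
    {"Bread": 1, "Milk": 1, "Diaper": 3, "Beer": 6},
    {"Bread": 1, "Milk": 1, "Diaper": 3, "Beer": 6},
    {"Diamond": 1, "Necklace": 1},
    {"Diamond": 1, "Necklace": 1},
]


def total_profit(itemset, transactions):
    """Sum quantity x unit profit over transactions containing the itemset."""
    total = 0
    for transaction in transactions:
        if all(item in transaction for item in itemset):
            total += sum(transaction[item] * UNIT_PROFIT[item] for item in itemset)
    return total


for itemset in (("Bread", "Milk"), ("Diamond", "Necklace"), ("Diaper", "Beer")):
    print(itemset, total_profit(itemset, TRANSACTIONS))
```

The frequent itemset {Bread, Milk} turns out to be the least profitable of the three, which is the gap that utility-based mining is meant to close.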

34 High Utility Sequential Pattern Mining
- Given a set of sequences, find all sequences whose utility is greater than a user-specified minimum threshold
  - Each item has a quantity in a transaction
  - Each item has a value (e.g., price)
Item           Profit
Milk           $3
Egg            $2
Birthday Cake  $20
Birthday Card  $10
Bread          $1
CID  TID  Transaction, as (item, quantity)
C1   T1   {(Bread,2), (Milk,6)}
C1   T2   {(Birthday Card,2)}
C1   T3   {(Birthday Cake,2), (Egg,3)}
C2   T3   {(Bread,2), (Milk,4), (Yoghurt,3), (Tuna,5)}
C2   T4   {(Egg,5), (Pizza,4), (Juice,2)}
C3   T5   {(Bread,2), (Yoghurt,4), (Milk,3)}
C3   T6   {(Milk,1), (Cheese,2)}

35 What is utility?
- Utility of an item in a transaction = internal utility (quantity of the item in the transaction) x external utility (profit of the item)
  - U(Milk, T1) = 3 x 6 = 18
- Utility of an itemset in a sequence = sum of the utilities of its items
  - U({Bread, Milk}, C1) = 2 + 18 = 20
- Utility of a sub-sequence in a sequence = sum of the utilities of its itemsets
  - U(<{Milk}{Egg}>, C1) = 18 + 6 = 24
  - If there is more than one occurrence, take the maximum value among the occurrences
- The profit table and the sequence database are the same as on slide 34
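
The three utility definitions can be exercised on sequence C1 with a short Python sketch. It is only an illustration of the definitions, not the mining algorithm used in the case study: the function names are chosen for this example, "Egg" is normalized to one spelling, and the maximum over occurrences is computed with a simple dynamic program.

```python
PROFIT = {"Milk": 3, "Egg": 2, "Birthday Cake": 20, "Birthday Card": 10, "Bread": 1}

# Sequence C1: an ordered list of transactions, each {item: quantity}.
C1 = [
    {"Bread": 2, "Milk": 6},          # T1
    {"Birthday Card": 2},             # T2
    {"Birthday Cake": 2, "Egg": 3},   # T3
]


def item_utility(item, transaction):
    """Quantity in the transaction times the item's profit."""
    return transaction[item] * PROFIT[item]


def itemset_utility(itemset, transaction):
    """Utility of an itemset in one transaction, or None if not contained."""
    if not all(item in transaction for item in itemset):
        return None
    return sum(item_utility(item, transaction) for item in itemset)


def sequence_utility(pattern, sequence):
    """Maximum utility over all occurrences of `pattern` (a list of itemsets)
    in `sequence`; None if the pattern does not occur."""
    # best[k] = best utility of matching the first k itemsets so far.
    best = [0] + [None] * len(pattern)
    for transaction in sequence:
        # Go from the last itemset backwards so one transaction is matched
        # to at most one itemset within a single occurrence.
        for k in range(len(pattern), 0, -1):
            if best[k - 1] is None:
                continue
            utility = itemset_utility(pattern[k - 1], transaction)
            if utility is None:
                continue
            candidate = best[k - 1] + utility
            if best[k] is None or candidate > best[k]:
                best[k] = candidate
    return best[len(pattern)]


print(item_utility("Milk", C1[0]))                    # item in a transaction
print(sequence_utility([{"Bread", "Milk"}], C1))      # itemset in a sequence
print(sequence_utility([{"Milk"}, {"Egg"}], C1))      # sub-sequence in a sequence
```

A pattern is then a high utility sequential pattern when its utility, accumulated over the sequences of the database, exceeds the user-specified threshold.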

36 A Rule-Based News Recommendation System

37 Globe and Mail Dataset
- Globe and Mail Inc.
  - Corpus of news (142,163,909 articles)
  - Web clickstream
    - 2 billion records
    - Each record: a sequence of visited news articles
    - 245 attributes
    - 4 TB

38 A Rule-Based Recommendation System
- Input: user log
- Phase 1 (offline): discover frequent patterns from user logs, e.g. <News1, News2>
- Phase 2 (offline): extract rules from the patterns, e.g. News1 -> News2
- Phase 3 (online)

39 Goal
- A rule-based news recommendation engine
  - News1 -> News2
- It is quite probable that not all the pages visited by a user are of interest to him/her
- The value or importance of a news article changes over time

40 A Utility Based News Recommendation System
- [Diagram labels: cursor movement, time spent, editorial board, freshness; rule-based recommendation system; navigational rules]

41 The Proposed Framework
- Input: user log
- Phase 1 (offline): discover high utility patterns from user logs, e.g. <News1, News2>
- Phase 2 (offline): extract utility-based rules from the patterns, e.g. News1 -> News2
- Phase 3 (online): recommend articles based on current user activities

42 Preliminary Experiment
- Only 120,000 records
  - Over 32 GB memory usage on average
  - Around 55 mins run time
- With the exponential growth of data
  - It is impossible to execute the algorithms on a single machine

43 BigHUSP

44 BigHUSP

45 BigHUSP
- Partition the data
- In each mapper
  - Scan the input sequences
  - Construct UMatrices
  - Calculate the required parameters: threshold, ...
  - Remove all blocks for the sequences from memory
  - Store the UMatrices in memory/disk for re-use

46 BigHUSP
- Call USpan to mine local HUSPs using the local UMatrices and threshold
- Output: local HUSPs and their utilities

47 BigHUSP
- Aggregate the local HUSPs
- If a pattern does not exist, estimate its utility
- Output: global candidates and their approximate utilities

48 BigHUSP

49 Our recommendations keep users longer
- 2,537 minutes spent by 107 users
- 380 minutes spent by 254 users

50 Experimental Results

51 Summary
- What Big Data is
- What Hadoop is and why it is important
- Hadoop main components
- What Spark is
- Why Spark is popular
- High Utility Pattern Mining
- Rule-Based News Recommendation System
- Big Data Analysis Using IBM Conductor for Spark

52 THANKS!

53 Q1
According to Spark advocates, how much faster than MapReduce can Apache Spark potentially run batch-processing programs when they are processed in memory?
- 10 times faster
- 20 times faster
- 100 times faster
- 200 times faster

54 Q2
What is the name of the programming framework originally developed by Google that supports the development of applications for processing large data sets in a distributed computing environment?
- MapReduce
- Hive
- Spark

55 Q3
True or false? A big data analytics strategy is often defined by the three V's (volume, variety and velocity), which is helpful but ignores other commonly cited characteristics, such as complexity and variability.
- True
- False

56 Q4
Just collecting and storing information isn't enough to produce real business value. Big data analytics technologies are necessary to:
- Formulate eye-catching charts and graphs
- Extract valuable insights from the data
- Integrate data from internal and external sources

57 Q5
Which of the following application types can Spark run in addition to batch-processing jobs?
- Stream processing
- Machine learning
- Graph processing
- All of the above

58 Q6
Which of the following is NOT a characteristic shared by Hadoop and Spark?
- Both are data processing platforms
- Both are cluster computing environments
- Both have their own file system
- Both use open source APIs to link between different tools

1 Apache Spark is a fast and general-purpose engine for large-scale data processing
- Spark aims at achieving the following goals in the Big Data context
  - Generality: diverse workloads, operators, job sizes