Data. Big: TiB - PiB. Small: MiB - GiB. Supervised Classification Regression Recommender. Learning. Model

Domenic Price
5 years ago

3 Data (Big: TiB - PiB) -> Learning -> Model (Small: MiB - GiB)
Supervised: Classification, Regression, Recommender
Unsupervised: Clustering, Dimensionality reduction, Topic modeling

4 Example Formation -> Examples -> Modeling -> Model -> Evaluation

5 Example Formation with Data Parallel Functions
Feature Extraction -> (ID, Bag of Words)
Click Log -> Label Extraction -> (ID, Label)
(Large Scale) Join -> Example: (ID, Bag of Words, Label)
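The join on this slide can be sketched in a few lines of Java. This is a minimal single-machine illustration, not the deck's implementation: the class `ExampleFormationSketch`, its `Example` type, and the toy documents and click log are hypothetical names and data invented for the sketch. It assumes features and labels share a string ID and that an example is formed only when both sides are present.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: join extracted features with extracted labels on a shared ID. */
final class ExampleFormationSketch {

  /** One training example: ID, bag of words, and label. */
  static final class Example {
    final String id;
    final Map<String, Integer> bagOfWords;
    final int label;

    Example(String id, Map<String, Integer> bagOfWords, int label) {
      this.id = id;
      this.bagOfWords = bagOfWords;
      this.label = label;
    }
  }

  /** Feature extraction: turn raw text into lower-cased word counts. */
  static Map<String, Integer> bagOfWords(String text) {
    Map<String, Integer> counts = new HashMap<>();
    for (String word : text.toLowerCase().split("\\s+")) {
      if (!word.isEmpty()) {
        counts.merge(word, 1, Integer::sum);
      }
    }
    return counts;
  }

  /** Hash join: keep only the IDs that have both a document and a label. */
  static List<Example> join(Map<String, String> documents, Map<String, Integer> labels) {
    List<Example> examples = new ArrayList<>();
    for (Map.Entry<String, String> doc : documents.entrySet()) {
      Integer label = labels.get(doc.getKey());
      if (label != null) {
        examples.add(new Example(doc.getKey(), bagOfWords(doc.getValue()), label));
      }
    }
    return examples;
  }

  public static void main(String[] args) {
    Map<String, String> documents = new HashMap<>();
    documents.put("d1", "cheap flights cheap hotels");
    documents.put("d2", "weather today");
    Map<String, Integer> clickLog = new HashMap<>();
    clickLog.put("d1", 1); // toy label: the document was clicked
    for (Example e : join(documents, clickLog)) {
      System.out.println(e.id + " " + e.bagOfWords + " label=" + e.label);
    }
  }
}
```

At cluster scale the same logic would run as a partitioned join, with both inputs shuffled by ID.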

6 Step II: Modeling. Step III: Evaluation. [Pipeline: Example Formation -> Modeling -> Evaluation]

7 Step I: Example Formation (Feature and Label Extraction). Step III: Evaluation. [Pipeline: Example Formation -> Modeling -> Evaluation]

8 Modeling loop: Apply Model to Data -> Observe Errors -> Update Model -> (repeat)
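As one concrete instance of this loop, here is a small perceptron sketch. The perceptron is a stand-in chosen for brevity; the slide does not name a specific learner. The class `ModelingLoopSketch` and the toy data are hypothetical.

```java
/** Hypothetical sketch of the apply / observe / update loop, using a perceptron. */
final class ModelingLoopSketch {

  /** Apply the model: predict +1 or -1 from the sign of the dot product. */
  static int predict(double[] weights, double[] features) {
    double score = 0.0;
    for (int i = 0; i < weights.length; i++) {
      score += weights[i] * features[i];
    }
    return score >= 0 ? 1 : -1;
  }

  /** Repeat passes over the data until a pass sees no errors or maxPasses is reached. */
  static double[] train(double[][] xs, int[] ys, int maxPasses) {
    double[] weights = new double[xs[0].length];
    for (int pass = 0; pass < maxPasses; pass++) {
      int errors = 0;
      for (int n = 0; n < xs.length; n++) {
        if (predict(weights, xs[n]) != ys[n]) {       // observe an error
          errors++;
          for (int i = 0; i < weights.length; i++) {  // update the model
            weights[i] += ys[n] * xs[n][i];
          }
        }
      }
      if (errors == 0) {
        break;
      }
    }
    return weights;
  }

  public static void main(String[] args) {
    // Toy data: first feature is a constant bias term.
    double[][] xs = {{1, 2, 2}, {1, 3, 3}, {1, -1, -2}, {1, -2, -1}};
    int[] ys = {1, 1, -1, -1};
    double[] w = train(xs, ys, 100);
    System.out.println(java.util.Arrays.toString(w));
  }
}
```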

9 Sample Features Copy Model Example Formation Modeling Evaluation

10 Hadoop Abuse
+ MapReduce model fits statistical query model learning
- Hadoop MR does not support iterations (30x slowdown compared to others)
- Hadoop MR does not match other forms of algorithms
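To make the fit between MapReduce and statistical-query learning concrete, the sketch below computes a gradient as a sum of per-example statistics (map, then an associative reduce) and wraps that pass in an iteration loop. It uses Java parallel streams on one machine as a stand-in for a cluster; the class `StatisticalQuerySketch`, the least-squares objective, and the toy data are assumptions made for illustration. Each loop iteration corresponds to what would be a separate job in Hadoop MR, which is where the lack of iteration support hurts.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

/** Hypothetical sketch: batch gradient descent expressed as map + reduce + update. */
final class StatisticalQuerySketch {

  /** Map step: gradient of the squared error for one example (x, y) under weights w. */
  static double[] pointGradient(double[] w, double[] x, double y) {
    double prediction = 0.0;
    for (int i = 0; i < w.length; i++) {
      prediction += w[i] * x[i];
    }
    double error = prediction - y;
    double[] g = new double[w.length];
    for (int i = 0; i < w.length; i++) {
      g[i] = error * x[i];
    }
    return g;
  }

  /** Reduce step: element-wise sum; it never mutates its inputs. */
  static double[] add(double[] a, double[] b) {
    double[] sum = new double[a.length];
    for (int i = 0; i < a.length; i++) {
      sum[i] = a[i] + b[i];
    }
    return sum;
  }

  /** One iteration = one map/reduce pass over the data plus a model update. */
  static double[] fit(double[][] xs, double[] ys, double stepSize, int iterations) {
    double[] w = new double[xs[0].length];
    for (int it = 0; it < iterations; it++) {
      final double[] current = w;
      double[] gradient = IntStream.range(0, xs.length)
          .parallel()
          .mapToObj(n -> pointGradient(current, xs[n], ys[n]))
          .reduce(new double[current.length], StatisticalQuerySketch::add);
      double[] next = new double[w.length];
      for (int i = 0; i < w.length; i++) {
        next[i] = w[i] - stepSize * gradient[i] / xs.length;
      }
      w = next;
    }
    return w;
  }

  public static void main(String[] args) {
    // Toy data generated from y = 1 + 2x; the first feature is a constant bias term.
    double[][] xs = {{1, 0}, {1, 1}, {1, 2}, {1, 3}};
    double[] ys = {1, 3, 5, 7};
    System.out.println(Arrays.toString(fit(xs, ys, 0.1, 2000)));
  }
}
```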

11 Statistics Model Updates

12 Statistics / Updates

14 Rise of the Resource Managers

15 Map Task Reduce Task Map Task Map Task Reduce Task

17 App Master: Resource Allocation = list of (node type, count, resource)
E.g. { (node1, 1, 1GB), (rack-1, 2, 1GB), (*, 1, 2GB) }
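The request list on this slide maps naturally onto a small data structure. The sketch below is a plain-Java model of it and does not use the YARN client API: the class `ResourceRequestSketch` and its `Request` type are hypothetical, and it assumes the third field is memory per container, with 1GB taken as 1024 MB.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of a resource allocation request; not the YARN API. */
final class ResourceRequestSketch {

  /** One entry: where (a node, a rack, or "*" for anywhere), how many, and how big. */
  static final class Request {
    final String location;
    final int count;
    final int memoryMb; // assumed to be memory per container

    Request(String location, int count, int memoryMb) {
      this.location = location;
      this.count = count;
      this.memoryMb = memoryMb;
    }
  }

  /** The example from the slide: { (node1, 1, 1GB), (rack-1, 2, 1GB), (*, 1, 2GB) }. */
  static List<Request> slideExample() {
    return Arrays.asList(
        new Request("node1", 1, 1024),
        new Request("rack-1", 2, 1024),
        new Request("*", 1, 2048));
  }

  static int totalContainers(List<Request> requests) {
    int total = 0;
    for (Request r : requests) {
      total += r.count;
    }
    return total;
  }

  static int totalMemoryMb(List<Request> requests) {
    int total = 0;
    for (Request r : requests) {
      total += r.count * r.memoryMb;
    }
    return total;
  }

  public static void main(String[] args) {
    List<Request> requests = slideExample();
    System.out.println(totalContainers(requests) + " containers, "
        + totalMemoryMb(requests) + " MB in total");
  }
}
```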

18 App Master Container Container Container Container

23 REEF: Retainable Evaluator Execution Framework

25 SQL / Hive Machine Learning YARN / HDFS

26 SQL / Hive Machine Learning REEF YARN / HDFS

27 SQL / Hive Machine Learning Logical Abstraction Physical Data Parallel Operators REEF YARN / HDFS

28 REEF concepts
Job Driver: Control plane implementation. User code executed on YARN's Application Master.
Activity: User code executed within an Evaluator.
Evaluator: Execution environment for Activities. One Evaluator is bound to one YARN Container.
Services: Storage, Network, State Management

31 Client
public class DistributedShell {
  ...
  public static void main(String[] args) {
    ...
    Injector i = new Injector(yarnConfiguration);
    ...
    REEF reef = i.getInstance(REEF.class);
    ...
    reef.submit(driverConf);
  }
}

33 Client
public class DistributedShellJobDriver {
  private final EvaluatorRequestor requestor;
  ...
  public void onNext(StartTime time) {
    requestor.submit(EvaluatorRequest.builder()
        .setSize(SMALL)
        .setNumber(2)
        .build());
  }
  ...
}

34 Client
public class DistributedShellJobDriver {
  private final EvaluatorRequestor requestor;
  ...
  public void onNext(AllocatedEvaluator eval) {
    Configuration contextConf = ...;
    eval.submitContext(contextConf);
  }
  ...
}

35 Client context config +

37 Client
public class DistributedShellJobDriver {
  private final String cmd = "ls";
  [...]
  public void onNext(ActiveContext ctx) {
    final String activityId = [...];
    // activity config
    Configuration activityConf = Activity.CONF
        .set(IDENTIFIER, "ShellActivity")
        .set(ACTIVITY, ShellActivity.class)
        .set(COMMAND, this.cmd)
        .build();
    ctx.submitActivity(activityConf);
    [...]
  }
}

38 Client
class ShellActivity implements Activity {
  private final String command;

  ShellActivity(@Parameter(Command.class) String c) {
    this.command = c;
  }

  private String exec(final String command) { ... }

  public byte[] call(byte[] memento) {
    String s = exec(this.command);
    return s.getBytes();
  }
}
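The body of `exec` is elided on the slide. One plausible way to run a shell command and capture its output with the standard library is sketched below. This is an assumption about what such a helper could look like, not the deck's code: the class `ShellExecSketch` is hypothetical, it is independent of the REEF interfaces, and it assumes a POSIX `sh` is available on the machine.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of a shell-command helper; assumes a POSIX "sh" on the PATH. */
final class ShellExecSketch {

  /** Runs the command through "sh -c" and returns its combined stdout and stderr. */
  static String exec(final String command) {
    try {
      Process process = new ProcessBuilder("sh", "-c", command)
          .redirectErrorStream(true) // merge stderr into stdout
          .start();
      StringBuilder output = new StringBuilder();
      try (BufferedReader reader = new BufferedReader(
          new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
          output.append(line).append('\n');
        }
      }
      process.waitFor();
      return output.toString();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for: " + command, e);
    }
  }

  /** Mirrors the shape of the slide's call(): the command output as bytes. */
  static byte[] call(String command) {
    return exec(command).getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    System.out.print(exec("ls"));
  }
}
```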

40 Client Retains State!
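A toy model may help show why a retained evaluator matters: state loaded by one activity is still there for the next. The sketch below is plain Java and is not the REEF API; `ToyEvaluator`, `ToyActivity`, and the in-memory dataset are hypothetical names invented for the illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical toy model of an evaluator that outlives its activities; not the REEF API. */
final class RetainedEvaluatorSketch {

  /** Stand-in for an activity: runs inside an evaluator and returns bytes. */
  interface ToyActivity {
    byte[] call(ToyEvaluator evaluator);
  }

  /** Stand-in for an evaluator: keeps a cache across the activities it runs. */
  static final class ToyEvaluator {
    private final Map<String, int[]> cache = new HashMap<>();
    int loads = 0; // how many times data actually had to be loaded

    int[] dataset(String name) {
      return cache.computeIfAbsent(name, k -> {
        loads++;
        return new int[] {1, 2, 3, 4}; // pretend this was an expensive read
      });
    }

    byte[] run(ToyActivity activity) {
      return activity.call(this);
    }
  }

  static byte[] bytes(int value) {
    return Integer.toString(value).getBytes(StandardCharsets.UTF_8);
  }

  /** First activity: sums the dataset. */
  static final ToyActivity SUM = evaluator -> {
    int sum = 0;
    for (int v : evaluator.dataset("data")) {
      sum += v;
    }
    return bytes(sum);
  };

  /** Second activity: finds the maximum, reusing the cached dataset. */
  static final ToyActivity MAX = evaluator -> {
    int max = Integer.MIN_VALUE;
    for (int v : evaluator.dataset("data")) {
      max = Math.max(max, v);
    }
    return bytes(max);
  };

  public static void main(String[] args) {
    ToyEvaluator evaluator = new ToyEvaluator();
    System.out.println(new String(evaluator.run(SUM), StandardCharsets.UTF_8));
    System.out.println(new String(evaluator.run(MAX), StandardCharsets.UTF_8));
    System.out.println("loads: " + evaluator.loads);
  }
}
```

In a model where every task gets a fresh container, the second activity would have to reload the data.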

41 Client activity config

46 [Cluster diagram: Client, Name Node, YARN RM, Job Driver (REEF), an Activity, and nodes node1 - node4 running HDFS and NM]

48 [Cluster diagram: Client, Name Node, YARN RM, Job Driver (REEF), an Activity, and nodes node1 - node4 running HDFS and NM; callout: activity config +]

52 Logical Layer: SQM, ML algorithm, Graph Analysis
Physical Layer: Select, Project, Join, Group; MapReduce; MPI

54 SQM / ML algorithm / Graph Analysis -> Logical query over training data -> Query optimizer -> Parallel Recursive Dataflow -> REEF

55 SQM / ML algorithm / Graph Analysis -> Datalog query over training data -> Query optimizer -> Parallel Recursive Dataflow -> REEF
- Recursion is built into the language
- Amenable to optimizations
- Lots of existing work that we can leverage:
  J. Eisner and N. Filardo. Dyna: Extending Datalog for modern AI. In Datalog '10.
  S. Funiak et al. Distributed inference with declarative overlay networks. EECS Tech Report, 2008.
  D. Deutch, C. Koch, T. Milo. On Probabilistic Fixpoint and Markov Chain Query Languages. In PODS '10.
  Y. Bu et al. Scaling Datalog for Machine Learning on Big Data. Tech Report.
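To illustrate what "recursion is built into the language" means operationally, the sketch below evaluates a classic recursive Datalog program, `reach(X,Y) :- edge(X,Y).` and `reach(X,Z) :- reach(X,Y), edge(Y,Z).`, by applying the rules until no new facts appear. This is naive fixpoint evaluation on one machine, chosen for clarity; it is not the optimizer or dataflow from the deck, and the class `DatalogFixpointSketch` and the toy graph are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: naive fixpoint evaluation of a recursive reachability rule. */
final class DatalogFixpointSketch {

  /** Facts are pairs stored as two-element lists, so equals/hashCode work in a Set. */
  static Set<List<String>> reach(Set<List<String>> edges) {
    // Rule 1: reach(X,Y) :- edge(X,Y).
    Set<List<String>> reach = new HashSet<>(edges);
    boolean changed = true;
    while (changed) {
      // Rule 2: reach(X,Z) :- reach(X,Y), edge(Y,Z).
      Set<List<String>> derived = new HashSet<>();
      for (List<String> r : reach) {
        for (List<String> e : edges) {
          if (r.get(1).equals(e.get(0))) {
            derived.add(Arrays.asList(r.get(0), e.get(1)));
          }
        }
      }
      // Fixpoint test: stop once a round adds no new facts.
      changed = reach.addAll(derived);
    }
    return reach;
  }

  public static void main(String[] args) {
    Set<List<String>> edges = new HashSet<>();
    edges.add(Arrays.asList("a", "b"));
    edges.add(Arrays.asList("b", "c"));
    edges.add(Arrays.asList("c", "d"));
    System.out.println(reach(edges));
  }
}
```

A semi-naive evaluator would join only the facts derived in the previous round, which is one of the standard optimizations such a formulation makes available.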

56 Implementation over Hyracks
- Supports both Iterative-MRU and Pregel
- Standard optimizations + some new tricks
[Diagram labels: Programming Models for ML algorithms (Iterative-MRU, Pregel); Datalog queries; Hardcoded optimizations; Hyracks; REEF]
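For readers unfamiliar with the Pregel model named on this slide, here is a minimal single-machine sketch of its vertex-centric supersteps, using connected components by minimum-label propagation as the example algorithm. It is an illustration of the programming model only, not the Hyracks implementation; the class `PregelSketch` is hypothetical, and it assumes an undirected graph in which every vertex appears as a key of the adjacency map.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of Pregel-style supersteps: connected components by min label. */
final class PregelSketch {

  /** Assumes an undirected graph where every vertex is a key of the adjacency map. */
  static Map<Integer, Integer> components(Map<Integer, List<Integer>> neighbors) {
    Map<Integer, Integer> label = new HashMap<>();
    for (Integer v : neighbors.keySet()) {
      label.put(v, v); // every vertex starts in its own component
    }
    Set<Integer> active = new HashSet<>(neighbors.keySet());
    while (!active.isEmpty()) { // one loop iteration = one superstep
      // Message passing: each active vertex sends its label to its neighbors;
      // messages to the same vertex are combined with min.
      Map<Integer, Integer> inbox = new HashMap<>();
      for (Integer v : active) {
        for (Integer n : neighbors.get(v)) {
          inbox.merge(n, label.get(v), Integer::min);
        }
      }
      // Compute: a vertex that receives a smaller label adopts it and stays active;
      // all other vertices vote to halt.
      active = new HashSet<>();
      for (Map.Entry<Integer, Integer> message : inbox.entrySet()) {
        if (message.getValue() < label.get(message.getKey())) {
          label.put(message.getKey(), message.getValue());
          active.add(message.getKey());
        }
      }
    }
    return label;
  }

  public static void main(String[] args) {
    Map<Integer, List<Integer>> graph = new HashMap<>();
    graph.put(1, Arrays.asList(2));
    graph.put(2, Arrays.asList(1, 3));
    graph.put(3, Arrays.asList(2));
    graph.put(4, Arrays.asList(5));
    graph.put(5, Arrays.asList(4));
    System.out.println(components(graph));
  }
}
```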

57 Stack: SQM / ML algorithm / Graph Analysis -> Datalog query over training data -> Query optimizer -> Parallel Recursive Dataflow -> REEF
Callouts:
- Provenance for triage ("My model misbehaves - why?")
- Cost estimation for recursive computation
- Fault-awareness policies
- Incremental learning
- Cost models (time vs money)
- Interactive Query Processing
- Dynamic resources
- Elastic operators
- Storage/Networking services
- State Management
- Caching policies

Scale-out Beyond MapReduce. Raghu Ramakrishnan Cloud Information Services Lab (CISL) Microsoft
