Big Data con MATLAB. Lucas García The MathWorks, Inc. 1

Similar documents
Navigating Big Data with MATLAB

Tackling Big Data Using MATLAB

Parallel and Distributed Computing with MATLAB The MathWorks, Inc. 1

What's New in MATLAB for Engineering Data Analytics?

Parallel and Distributed Computing with MATLAB Gerardo Hernández Manager, Application Engineer

Data Analytics with MATLAB. Tackling the Challenges of Big Data

Analyzing Fleet Data with MATLAB and Spark

Integrating Advanced Analytics with Big Data

MATLAB. Senior Application Engineer The MathWorks Korea The MathWorks, Inc. 2

BIG DATA: Data Analytics with MATLAB Christophe POUILLOT Senior Consultant MathWorks

Data Analytics with MATLAB

Scaling MATLAB. for Your Organisation and Beyond. Rory Adams The MathWorks, Inc. 1

2015 The MathWorks, Inc. 1

Integrate MATLAB Analytics into Enterprise Applications

Integrate MATLAB Analytics into Enterprise Applications

Getting Started with MATLAB Francesca Perino

MATLAB is a multi-paradigm numerical computing environment fourth-generation programming language. A proprietary programming language developed by

What s New in MATLAB May 16, 2017

What s New MATLAB and Simulink

MixApart: Decoupled Analytics for Shared Storage Systems. Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp

Parallel Computing with MATLAB

Working with Large Sets of Images in MATLAB Just Got Easier Avi Nehemiah

Multicore Computer, GPU 및 Cluster 환경에서의 MATLAB Parallel Computing 기능

Process Big Data in MATLAB Using MapReduce

Parallel Computing with MATLAB

Integrate MATLAB Analytics into Enterprise Applications

Dhavide Aruliah Director of Training, Anaconda

Speeding up MATLAB Applications Sean de Wolski Application Engineer

Mit MATLAB auf der Überholspur Methoden zur Beschleunigung von MATLAB Anwendungen

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

INTRODUCTION TO MATLAB PARALLEL COMPUTING TOOLBOX

Best Practices and Performance Tuning on Amazon Elastic MapReduce

Parallel Computing with MATLAB

Specialist ICT Learning

Designing dashboards for performance. Reference deck

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Analytics in the cloud

Fault Detection using Advanced Analytics at CERN's Large Hadron Collider: Too Hot or Too Cold BIWA Summit 2016

DATA SCIENCE USING SPARK: AN INTRODUCTION

Data in the Cloud and Analytics in the Lake

BIG DATA TECHNOLOGIES: WHAT EVERY MANAGER NEEDS TO KNOW ANALYTICS AND FINANCIAL INNOVATION CONFERENCE JUNE 26-29,

Scalable Tools - Part I Introduction to Scalable Tools

Parallel learning of content recommendations using map- reduce

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13

SFO15-TR6: Hadoop on ARM

Resource and Performance Distribution Prediction for Large Scale Analytics Queries

Apache Hadoop 3. Balazs Gaspar Sales Engineer CEE & CIS Cloudera, Inc. All rights reserved.

Caching and Buffering in HDF5

Using Parallel Computing Toolbox to accelerate the Video and Image Processing Speed. Develop parallel code interactively

MapReduce and Friends

Research in Middleware Systems For In-Situ Data Analytics and Instrument Data Analysis

Activator Library. Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success.

ECS289: Scalable Machine Learning

Using Alluxio to Improve the Performance and Consistency of HDFS Clusters

2013 AWS Worldwide Public Sector Summit Washington, D.C.

Oracle Big Data Science

Big Data Hadoop Stack

Speeding up MATLAB Applications The MathWorks, Inc.

Roadmap: Operating Pentaho at Scale. Jens Bleuel Senior Product Manager, Pentaho

MATLAB Distributed Computing Server Release Notes

ECS289: Scalable Machine Learning

Processing of big data with Apache Spark

Scalable Machine Learning in R. with H2O

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here

Package nyctaxi. October 26, 2017

Decision models for the Digital Economy

Big Data is Better on Bare Metal

Analysis of Extended Performance for clustering of Satellite Images Using Bigdata Platform Spark

Machine Learning for Large-Scale Data Analysis and Decision Making A. Distributed Machine Learning Week #9

Oracle Big Data Connectors

Chronix A fast and efficient time series storage based on Apache Solr. Caution: Contains technical content.

Apache HAWQ (incubating)

MATH36032 Problem Solving by Computer. Data Science

Big Data - Some Words BIG DATA 8/31/2017. Introduction

SQL Gone Wild: Taming Bad SQL the Easy Way (or the Hard Way) Sergey Koltakov Product Manager, Database Manageability

Technical Computing with MATLAB

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Behind Today s Trends The Technologies Driving Change. Paul Smith Director Consulting Services

Challenges for Data Driven Systems

SparkBench: A Comprehensive Spark Benchmarking Suite Characterizing In-memory Data Analytics

Evolving To The Big Data Warehouse

Oracle Big Data Science IOUG Collaborate 16

Distributed Machine Learning" on Spark

Analytics Platform for ATLAS Computing Services

Programming Systems for Big Data

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS

An Introduction to Big Data Formats

Parallel MATLAB at VT

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

Distributed Computing with Spark

Deep Learning for Computer Vision with MATLAB By Jon Cherrie

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

Spark, Shark and Spark Streaming Introduction

CompSci 516: Database Systems

Divide & Recombine with Tessera: Analyzing Larger and More Complex Data. tessera.io

Distributed Computation Models

From Single Purpose to Multi Purpose Data Lakes. Thomas Niewel Technical Sales Director DACH Denodo Technologies March, 2019

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University

Advanced Database Technologies NoSQL: Not only SQL

Transcription:

Big Data con MATLAB Lucas García 2015 The MathWorks, Inc. 1

Agenda Introduction Remote Arrays in MATLAB Tall Arrays for Big Data Scaling up Summary 2

Architecture of an analytics system Data from instruments and connected systems Data from business systems Analytics and Machine Learning 3

How big is big? What does Big Data even mean? Any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications. (Wikipedia) Any collection of data sets so large that it becomes difficult to process using traditional MATLAB functions, which assume all of the data is in memory. (MATLAB) 4

How big is big? In 1085 William 1 st commissioned a survey of England ~2 million words and figures collected over two years too big to handle in one piece collected and summarized in regional pieces used to generate revenue (tax), but most of the data then sat unused The Large Hadron Collider reached peak performance on 29 June 2016 2076 bunches of 120 billion protons currently circulating in each direction ~1.6x10 14 collisions per week, >30 petabytes of data per year too big to even store in one place used to explore interesting science, but taking researchers a long time to get through Image courtesy of CERN. Copyright 2011 CERN. 5

How big is big? Most of our data lies somewhere in between the extremes >10GB might be too much for one laptop / desktop ( inconveniently large ) 6

Big problems So what s the big problem? Standard tools won t work Getting the data is hard; processing it is even harder Need to learn new tools and new coding styles Have to rewrite algorithms, often at a lower level of abstraction We want to let you: Prototype algorithms quickly using small data Scale up to huge data-sets running on large clusters Use the same MATLAB code for both 7

New solution starting in R2016b: tall arrays tall array Quick overview (detail later!): Treat data in multiple files as one large table/array Write normal array / table code Behind the scenes operate on pieces 8

Agenda Introduction Remote Arrays in MATLAB Tall Arrays for Big Data Scaling up Summary 9

Remote arrays in MATLAB MATLAB provides array types for data that is not in normal memory distributed array (since R2006b) Data lives in the combined memory of a cluster of computers gpuarray (since R2010b) Data lives in the memory of the GPU card tall array (since R2016b) Data lives on disk, maybe spread across many disks (distributed file-system) 10

Remote arrays in MATLAB Rule: take the calculation to where the data is Normal array calculation happens in main memory: x = rand(...) x_norm = (x mean(x))./ std(x) 11

Remote arrays in MATLAB Rule: take the calculation to where the data is gpuarray all calculation happens on the GPU: x = gpuarray(...) x_norm = (x mean(x))./ std(x) distributed calculation is spread across the cluster: x = distributed(...) x_norm = (x mean(x))./ std(x) tall calculation is performed by stepping through files: x = tall(...) x_norm = (x mean(x))./ std(x) 12

Agenda Introduction Remote Arrays in MATLAB Tall Arrays for Big Data Scaling up Summary 13

tall arrays (new ) MATLAB data-type for data that doesn t fit into memory Ideal for lots of observations, few variables (hence tall ) Looks like a normal MATLAB array Supports numeric types, tables, datetimes, categoricals, strings, etc. Basic maths, stats, indexing, etc. Statistics and Machine Learning Toolbox support (clustering, classification, etc.), Database Toolbox. 14

tall arrays (new ) Single Machine Memory Data is in one or more files Typically tabular data Files stacked vertically Data doesn t fit into memory (even cluster memory) Cluster of Machines Memory 15

tall arrays (new ) Single Machine Memory Use datastore to define file-list ds = datastore('*.csv') Allows access to small pieces of data that fit in memory. while hasdata(ds) piece = read(ds); % Process piece end Cluster of Machines Memory Datastore 16

tall arrays (new ) Single Machine Memory tall array Process Single Machine Memory Create tall table from datastore ds = datastore('*.csv') tt = tall(ds) Datastore Operate on whole tall table just like ordinary table summary(tt) Cluster of Machines Memory max(tt.endtime tt.starttime) Chunk processing is handled automatically 17

tall arrays (new ) Single Machine Memory tall array Process Single Machine Memory With Parallel Computing Toolbox, process several chunks at once Datastore Process Single Machine Memory Can scale up to clusters with MATLAB Distributed Computing Server Cluster of Machines Memory Process Single Machine Memory Process Single Machine Memory 18

Example: Working with Big Data in MATLAB Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: Monthly taxi ride log files The local data set is small (~2 MB) The full data set is big (~25 GB) Approach: Preprocess and explore data Develop and validate predictive model (linear fit) Work with subset of data for prototyping Scale to full data set on HDFS 19

Example: Prototyping Preview Data Description Location: New York City Date(s): (Partial) January 2015 Data size: small data 13,693 rows / ~2 MB >> ds = datastore('taxidatanyc_1_2015.csv'); >> preview(ds) VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_l 2 2015-01-10 02:24:04 2015-01-10 02:36:10 2 2.19-73.999 40.729 1 2015-01-18 21:29:35 2015-01-18 21:34:15 1 1-74.017 40.705 2 2015-01-23 18:23:02 2015-01-23 18:39:32 3 2.22-73.973 40.787 1 2015-01-01 05:29:50 2015-01-01 05:48:55 1 3.6-73.943 40.817 1 2015-01-18 00:06:42 2015-01-18 00:11:43 1 0.8-73.983 40.762 2 2015-01-29 23:56:41 2015-01-30 00:02:49 5 0.87-73.982 40.772 2 2015-01-05 16:58:24 2015-01-05 17:03:33 5 0.78-73.992 40.743 1 2015-01-23 23:49:53 2015-01-23 23:55:42 2 1.4-73.956 40.78 20

Example: Prototyping Create a Tall Array Number of rows is unknown until all the data has been read >> tt = tall(ds) tt = M 19 tall table Input data is tabular result is a tall table VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_l 2 2015-01-10 02:24:04 2015-01-10 02:36:10 2 2.19-73.999 40.729 1 2015-01-18 21:29:35 2015-01-18 21:34:15 1 1-74.017 40.705 2 2015-01-23 18:23:02 2015-01-23 18:39:32 3 2.22-73.973 40.787 1 2015-01-01 05:29:50 2015-01-01 05:48:55 1 3.6-73.943 40.817 1 2015-01-18 00:06:42 2015-01-18 00:11:43 1 0.8-73.983 40.762 2 2015-01-29 23:56:41 2015-01-30 00:02:49 5 0.87-73.982 40.772 2 2015-01-05 16:58:24 Only the first 2015-01-05 few 17:03:33 5 0.78-73.992 40.743 rows are displayed 1 2015-01-23 23:49:53 2015-01-23 23:55:42 2 1.4-73.956 40.78 : : : : : : : : : : : : : : 21

Example: Prototyping Calling Functions with a Tall Array Once the tall table is created, can process much like an ordinary table % Calculate average trip duration mntrip = mean(tt.trip_minutes,'omitnan') mntrip = tall double? Preview deferred. Learn more. % Execute commands and gather results into workspace mn = gather(mntrip) Evaluating tall expression using the Local MATLAB Session: - Pass 1 of 1: Completed in 4 sec Evaluation completed in 4 sec Most results are evaluated only when explicitly requested (e.g., gather) MATLAB automatically optimizes queued calculations to minimize the number of passes through the data mn = 13.2763 22

Example: Prototyping Calling Functions with a Tall Array % Remove some bad data tt.trip_minutes = minutes(tt.tpep_dropoff_datetime - tt.tpep_pickup_datetime); tt.speed_mph = tt.trip_distance./ (tt.trip_minutes./ 60); ignore = tt.trip_minutes <= 1... % really short time tt.trip_minutes >= 60 * 12... % unfeasibly long time tt.trip_distance <= 1... % really short distance tt.trip_distance >= 12 * 55... % unfeasibly far tt.speed_mph > 55... % unfeasibly fast tt.fare_amount < 0... % negative fares?! tt.fare_amount > 10000; % unfeasibly large fares tt(ignore, :) = []; % Credit card payments have the most accurate tip data keep = tt.payment_type == {'Credit card'}; tt = tt(keep,:); Data only read once, % Show tip distributiondespite 21 operations histogram( tt.tip_amount, 0:25 ) Evaluating tall expression using the Local MATLAB Session: - Pass 1 of 1: 0% Completed completein 5 sec Evaluation 0% completed in 5 sec 23

Example: Prototyping Fit predictive model % Fit predictive model model = fitlm(tttrain,'fare_amount ~ 1 + hr_of_day + trip_distance*trip_minutes') Evaluating tall expression using the Local MATLAB Session: - Pass 1 of 1: Completed in 5 sec Evaluation completed in 5 sec model = Compact linear regression model: fare_amount ~ 1 + hr_of_day + trip_distance*trip_minutes Estimated Coefficients: Estimate SE tstat pvalue (Intercept) 2.3432 0.040181 58.318 0 trip_distance 2.5841 0.0063898 404.41 0 hr_of_day -0.0012969 0.0018789-0.69024 0.49005 trip_minutes 0.22098 0.0020412 108.26 0 trip_distance:trip_minutes -0.007857 0.00017539-44.798 0 Number of observations: 42373, Error degrees of freedom: 42368 Root Mean Squared Error: 2.58 R-squared: 0.938, Adjusted R-Squared 0.938 F-statistic vs. constant model: 1.59e+05, p-value = 0 24

Example: Prototyping Predict and validate model % Predict and validate ypred = predict(model,ttvalidation); residuals = ypred - ttvalidation.fare_amount; figure histogram(residuals,'normalization','pdf','binlimits',[-50 50]) Evaluating tall expression using the Local MATLAB Session: - Pass 1 of 2: Completed in 5 sec - Pass 2 of 2: Completed in 4 sec Evaluation completed in 10 sec 25

Agenda Introduction Remote Arrays in MATLAB Tall Arrays for Big Data Scaling up Summary 26

Scale to the Entire Data Set Description Location: New York City Date(s): All of 2015 Data size: Big Data 150,000,000 rows / ~25 GB 27

Example: small data processing vs. Big Data processing % Access the data ds = datastore('taxidatanyc_1_2015.csv'); tt = tall(ds); small data processing % Access the data ds = datastore('taxidata\*.csv'); tt = tall(ds); Big Data processing % Calculate average trip duration mntrip = mean(tt.trip_minutes,'omitnan') % Calculate average trip duration mntrip = mean(tt.trip_minutes,'omitnan') % Execute commands and gather results into workspace mn = gather(mntrip) % Remove some bad data tt.trip_minutes = minutes(tt.tpep_dropoff_datetime - tt.tpep_pickup_datetime); tt.speed_mph = tt.trip_distance./ (tt.trip_minutes./ 60); ignore = tt.trip_minutes <= 1... % really short time tt.trip_minutes >= 60 * 12... % unfeasibly long time tt.trip_distance <= 1... % really short distance tt.trip_distance >= 12 * 55... % unfeasibly far tt.speed_mph > 55... % unfeasibly fast tt.fare_amount < 0... % negative fares?! tt.fare_amount > 10000; % unfeasibly large fares tt(ignore, :) = []; % Execute commands and gather results into workspace mn = gather(mntrip) % Remove some bad data tt.trip_minutes = minutes(tt.tpep_dropoff_datetime - tt.tpep_pickup_datetime); tt.speed_mph = tt.trip_distance./ (tt.trip_minutes./ 60); ignore = tt.trip_minutes <= 1... % really short time tt.trip_minutes >= 60 * 12... % unfeasibly long time tt.trip_distance <= 1... % really short distance tt.trip_distance >= 12 * 55... % unfeasibly far tt.speed_mph > 55... % unfeasibly fast tt.fare_amount < 0... % negative fares?! tt.fare_amount > 10000; % unfeasibly large fares tt(ignore, :) = []; 28

Scaling up If you just have MATLAB: Run through each chunk of data one by one If you also have Parallel Computing Toolbox: Use all local cores to process several chunks at once If you also have a cluster with MATLAB Distributed Computing Server (MDCS): Use the whole cluster to process many chunks at once 29

Scaling up Working with clusters from MATLAB desktop: General purpose MATLAB cluster Can co-exist with other MATLAB workloads (parfor, parfeval, spmd, jobs and tasks, distributed arrays, ) Uses local memory and file caches on workers for efficiency Spark-enabled Hadoop clusters Data in HDFS Calculation is scheduled to be near data Uses Spark s built-in memory and disk caching 30

Example: Running on Spark + Hadoop % Hadoop/Spark Cluster numworkers = 16; setenv('hadoop_home', '/dev_env/cluster/hadoop'); setenv('spark_home', '/dev_env/cluster/spark'); cluster = parallel.cluster.hadoop; cluster.sparkproperties('spark.executor.instances') = num2str(numworkers); mr = mapreducer(cluster); % Access the data ds = datastore('hdfs://hadoop01:54310/datasets/taxidata/*.csv'); tt = tall(ds); 31

Example: Running on Spark + Hadoop 32

Agenda Introduction Remote Arrays in MATLAB Tall Arrays for Big Data Scaling up Summary 33

Summary for tall arrays Local disk, Shared folders, Databases Run on Compute Clusters or Spark + Hadoop (HDFS), for large scale analysis Process out-of-memory data on your Desktop to explore, analyze, gain insights and to develop analytics Use Parallel Computing Toolbox for increased performance MATLAB Distributed Computing Server, Spark+Hadoop 34

Big Data capabilities in MATLAB ACCESS Access data and collections of files that do not fit in memory Datastores Images Spreadsheets Tabular Text Custom Files SQL Hadoop (HDFS) PROCESS AND ANALYZE Purpose-built capabilities for domain experts to work with big data locally Tall Arrays Math Statistics GPU Arrays Matrix Math Deep Learning Image Classification Visualization Machine Learning Image Processing SCALE Scale to compute clusters and Hadoop/Spark for data stored in HDFS Tall Arrays Math, Stats, Machine Learning on Spark Distributed Arrays Matrix Math on Compute Clusters MDCS for EC2 Cloud-based Compute Cluster MapReduce MATLAB API for Spark 35

Summary MATLAB makes it easy, convenient, and scalable to work with big data Access any kind of big data from any file system Use tall arrays to process and analyze that data on your desktop, clusters, or on Hadoop/Spark There s no need to learn big data programming or out-of-memory techniques -- simply use the same code and syntax you're already used to. 36

Questions 37