Data Analytics with MATLAB. Tackling the Challenges of Big Data



Data Analytics with MATLAB: Tackling the Challenges of Big Data
Francesca Perino, Application Engineering Team
© 2014 The MathWorks, Inc.

How big is big? What characterises big data?
"Any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications." (Wikipedia)

MATLAB Application Development Landscape: Prototyping, Programming, Deployment


Data-Driven Decisions and Data-Driven Design
Drivers: Measurement Devices, Big Data, Compute Power

The Need Spans Many Application Areas
Disciplines: System Design, Signal Processing, Image Processing, Model-Based Design, Data Analysis
Examples: hybrid and electric vehicles, sound quality analysis, advanced driver assistance systems, engine calibration, portfolio risk optimization

Data Analytics in MATLAB: Moving up the Information Hierarchy
Levels, from bottom to top: Physical Sensors, Data, Information, Knowledge, Action
Source: Edward Waltz, Information Warfare, 1998

Data Analytics in MATLAB: Moving up the Information Hierarchy
OBSERVATION (Physical Sensors to Data): sensing, collecting, measurement, data acquisition
Data sources: databases, data warehouses, HDFS (Hadoop), flat files, Excel, Web, data acquisition instruments, imaging devices

Data Analytics in MATLAB: Moving up the Information Hierarchy
ORGANIZATION (Data to Information): preprocessing, calibration, filtering, data reduction
Activities: data processing, filtering, exploratory analysis
Level below: sensing, collecting, measurement, data acquisition

Data Analytics in MATLAB: Moving up the Information Hierarchy
UNDERSTANDING (Information to Knowledge): analysis, visualization, modeling, prediction
Machine learning: regression (linear, non-linear, non-parametric), classification, decision trees, ensemble methods, neural networks, support vector machines
Levels below: preprocessing, calibration, filtering, data reduction; sensing, collecting, measurement, data acquisition
[Figure: measured and neural-network (NN) predicted active power (per-unit) against time (secs); prediction MSE against turbine number]

Data Analytics in MATLAB: Moving up the Information Hierarchy
APPLICATION (Knowledge to Action): reporting, apps, scalable deployment, integration
Outputs: reports, MATLAB applications, Excel, integration into existing systems, feedback for design and operations
Levels below: analysis, visualization, modeling, prediction; preprocessing, calibration, filtering, data reduction; sensing, collecting, measurement, data acquisition

Large Data Analytics
Workflow: Prototype, Data, Explore, Share/Deploy, Scale
Work on the desktop; scale capacity as needed

Large Data Analytics on the Desktop
Workflow: Prototype, Access, Explore, Share/Deploy, Scale
Access big data from your desktop:
  Collections of text files: datastore
  Databases: Database Toolbox
  Binary files: memmapfile
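As a rough illustration of the binary-file route listed above, the sketch below writes a small temporary file and then maps it with memmapfile. The file name, the variable names, and the 1000-value size are assumptions made up for the example and do not come from the slides.

```matlab
% Sketch only: the file name and sizes are made up for illustration.
fname = fullfile(tempdir, 'distances_demo.bin');   % hypothetical file

% Write 1000 doubles so there is something to map.
fid = fopen(fname, 'w');
fwrite(fid, (1:1000)', 'double');
fclose(fid);

% Map the file; data is paged in on demand rather than loaded up front.
m = memmapfile(fname, 'Format', 'double');

% Index into the mapped data like an ordinary array.
firstTen = m.Data(1:10);
maxVal   = max(m.Data);

clear m            % release the mapping before deleting the file
delete(fname);
```

On a genuinely large file, only the indexed portions would be brought into memory, which is the point of mapping instead of reading.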

Example: Airline Flight Distance
Data: BTS/RITA Airline On-Time Statistics; 123.5M records, 29 fields; CSV data in 22 files, 12 GB
Task: Find the maximum distance travelled by commercial airlines based upon flight operations performance data

Standard Workflow (up to R2014a)

    % Location
    files = {'1987.csv', '1988.csv', '1989.csv', '1990.csv',...
             '1991.csv', '1992.csv', '1993.csv', '1994.csv', '1995.csv',...
             '1996.csv', '1997.csv', '1998.csv', '1999.csv', '2000.csv',...
             '2001.csv', '2002.csv', '2003.csv', '2004.csv', '2005.csv',...
             '2006.csv', '2007.csv', '2008.csv'};
    numfiles = numel(files);

    % Format: keep only the one numeric field (%f) and skip the other 28
    fmtspec = ['%*q%*q%*q%*q%*q%*q%*q%*q%*q%*q'...
               '%*q%*q%*q%*q%*q%*q%*q%*q%f%*q'...
               '%*q%*q%*q%*q%*q%*q%*q%*q%*q'];

    maxdist = -Inf;
    for i = 1 : numfiles
        % Read data (the remaining textscan options are elided on the slide)
        filei = fopen(files{i});
        data = textscan(filei, fmtspec,...);
        fclose(filei);
        % Compute
        maxi = max(data{:});
        % Combine
        maxdist = max(maxdist,maxi);
    end

New Workflow with datastore (in R2014b)

    % Location
    airdata = datastore('*.csv');

    % Format (property and variable names are case-sensitive)
    airdata.SelectedVariableNames = {'Distance'};
    airdata.SelectedFormats = {'%f'};

    maxdist = -Inf;
    while hasdata(airdata)
        % Read data
        data = read(airdata);
        % Compute
        maxi = max(data.Distance);
        % Combine
        maxdist = max(maxdist, maxi);
    end

datastore
Easily specify a data set:
  Single text file (or collection of text files)
  Database (using Database Toolbox)
Preview data structure and format
Select data to import using column names
Incrementally read subsets of the data

    airdata = datastore('*.csv');
    airdata.SelectedVariableNames = {'Distance', 'ArrDelay'};
    data = read(airdata);
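The preview, column selection, and incremental read steps listed above could be exercised along these lines. This is a minimal sketch: the temporary folder, the file name flights_demo.csv, and the three sample rows are invented so that the example is self-contained, and ReadSize is set to 2 only to force more than one read.

```matlab
% Sketch only: builds a tiny CSV so the example is self-contained.
folder = tempname;
mkdir(folder);
T = table([100; 250; 75], [5; -3; 12], {'UA'; 'PS'; 'DL'}, ...
          'VariableNames', {'Distance', 'ArrDelay', 'UniqueCarrier'});
writetable(T, fullfile(folder, 'flights_demo.csv'));   % hypothetical file

ds = datastore(fullfile(folder, '*.csv'));

% Inspect the first rows and the detected column names without a full read.
firstRows = preview(ds);
names     = ds.VariableNames;

% Import only the columns of interest, a few rows at a time.
ds.SelectedVariableNames = {'Distance', 'ArrDelay'};
ds.ReadSize = 2;                     % rows returned per call to read
total = 0;
while hasdata(ds)
    chunk = read(ds);
    total = total + sum(chunk.Distance);
end

rmdir(folder, 's');                  % remove the temporary data
```

Each call to read returns a table holding only the selected columns, so the loop never needs the whole data set in memory at once.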

Large Data Analytics on the Desktop
Expand workspace:
  64-bit processor support: increased in-memory data set handling
Access portions of data too big to fit into memory:
  Memory mapped variables: huge binary file
  Datastore: huge text file or collections of text files
  Database: query portion of a big database table
Variety of programming constructs:
  System Objects: analyze streaming data
  MapReduce: process text files that won't fit into memory
Increase analysis speed:
  Parallel for loops: use with multicore/multi-process machines
  GPU Arrays
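To make the parallel for loop item concrete, here is a hedged sketch of the compute-then-combine pattern from the flight distance example. Synthetic in-memory chunks stand in for the CSV files, and the variable names are assumptions; without Parallel Computing Toolbox the parfor loop simply runs serially.

```matlab
% Sketch only: synthetic per-file data stands in for the CSV files.
numFiles = 4;
chunks = cell(1, numFiles);
for k = 1:numFiles
    chunks{k} = k * (1:5);            % pretend distances from file k
end

partialMax = zeros(1, numFiles);
parfor k = 1:numFiles                 % iterations are independent
    partialMax(k) = max(chunks{k});   % compute per chunk
end
maxdist = max(partialMax);            % combine the per-chunk results
```

The loop qualifies for parfor because no iteration depends on another: each one reads its own chunk and writes its own slot of partialMax.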

Scaled Large Data Analytics
Workflow: Prototype, Access, Explore, Share/Deploy, Scale
Approaches:
  Load, Analyze, Discard: datastore, parfor
  Distributed Memory: SPMD
Axes: out-of-memory versus in-memory data; complexity from Embarrassingly Parallel to Non-Partitionable

Example: Airline Delay Analysis
Data: BTS/RITA Airline On-Time Statistics; 123.5M records, 29 fields
Tasks: Calculate delay patterns; visualize summaries; estimate & evaluate predictive models
Resources: Amazon S3 data store; Amazon EC2 cluster

Airline Delay Analysis: Framework
The client sends instructions to a cluster/grid/cloud environment, which holds the yearly data (1987 to 1992 shown), and receives reduced data back.

Scaling Big Data Capacity
MATLAB supports a number of programming constructs for use with clusters.
General compute clusters:
  Parallel for loops: embarrassingly parallel algorithms
  SPMD: distributed processing
Hadoop clusters:
  MapReduce: analyze data stored in the Hadoop Distributed File System

Scaled Large Data Analytics
Workflow: Prototype, Access, Explore, Share/Deploy, Scale
Approaches:
  Load, Analyze, Discard: datastore, parfor
  MapReduce
  Distributed Memory: SPMD
Axes: out-of-memory versus in-memory data; complexity from Embarrassingly Parallel to Non-Partitionable

mapreduce (in R2014b): Data Store, Map, Reduce

Data Store (sample records; the second column is the carrier, the last column is the distance):
    1503  UA  LAX   -5  -10  2356
     540  PS  BUR   13    5   186
    1920  DL  BOS   10   32  1876
    1840  DL  SFO    0   13   568
     272  US  BWI    4   -2   359
     784  PS  SEA    7    3   176
     796  PS  LAX   -2    2   237
    1525  UA  SFO    3   -5  1867
     632  US  SJC    2   -4   245
    1610  UA  MIA   60   34  1365
    2032  DL  EWR   10   16   789
    2134  DL  DFW   -2    6   914

Map (carrier and distance pairs, in record order):
    UA 2356, PS 186, DL 1876, DL 568, US 359
    PS 176, PS 237, UA 1867, US 245
    UA 1365, DL 789, DL 914

Reduce (maximum distance per carrier):
    UA 2356, PS 237, DL 1876, US 359

Workflow with mapreduce

Data Store:

    % Specify and format the data
    indata = datastore('*.csv');
    indata.SelectedVariableNames = 'Distance';
    indata.SelectedFormats = '%f';

Map:

    function mapfun(data,~,intermed)
        % Compute and save intermediate result
        maxi = max(data.Distance);
        add(intermed,'maxi',maxi);
    end

Reduce:

    function reducefun(~,intermed,output)
        maxdist = -Inf;
        while hasnext(intermed)
            maxi = getnext(intermed);
            % Combine intermediate results
            maxdist = max(maxdist,maxi);
        end
        add(output,'maxdist',maxdist);
    end

Run:

    outdata = mapreduce(indata,@mapfun,@reducefun)

mapreduce
Use the powerful MapReduce programming technique to analyze big data:
  Multiple items (keys) to organize and process
  Intermediate results do not fit in memory

Progress display:
    ********************************
    *      MAPREDUCE PROGRESS      *
    ********************************
    Map   0% Reduce   0%
    Map  20% Reduce   0%
    Map  40% Reduce   0%
    Map  60% Reduce   0%
    Map  80% Reduce   0%
    Map 100% Reduce  25%
    Map 100% Reduce  50%
    Map 100% Reduce  75%
    Map 100% Reduce 100%

On the desktop:
  Analyze big database tables (Database Toolbox)
  Increase compute capacity (Parallel Computing Toolbox)
  Access data on HDFS to develop algorithms for use on Hadoop
With Hadoop:
  Run on Hadoop using MATLAB Distributed Computing Server
  Deploy applications and libraries for Hadoop using MATLAB Compiler
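The "multiple keys" case mentioned above might look like the following sketch, which computes the maximum distance per carrier instead of a single overall maximum. The function names maxByCarrierMap and maxByCarrierReduce, the temporary file, and the choice of five sample rows are assumptions for illustration. Local functions inside a script need R2016b or later; on older releases each function would go in its own file.

```matlab
% Sketch only: the data file and function names are illustrative.
folder = tempname;
mkdir(folder);
T = table({'UA'; 'PS'; 'DL'; 'UA'; 'PS'}, [2356; 186; 1876; 1867; 237], ...
          'VariableNames', {'UniqueCarrier', 'Distance'});
writetable(T, fullfile(folder, 'flights_demo.csv'));   % hypothetical file

ds = datastore(fullfile(folder, '*.csv'));
ds.SelectedVariableNames = {'UniqueCarrier', 'Distance'};

outds  = mapreduce(ds, @maxByCarrierMap, @maxByCarrierReduce);
result = readall(outds);             % table with Key and Value columns

rmdir(folder, 's');                  % remove the temporary input data

function maxByCarrierMap(data, ~, intermed)
    % Emit one (carrier, max distance) pair per carrier in this chunk.
    [carriers, ~, idx] = unique(data.UniqueCarrier);
    for k = 1:numel(carriers)
        add(intermed, carriers{k}, max(data.Distance(idx == k)));
    end
end

function maxByCarrierReduce(carrier, intermed, output)
    % Combine the per-chunk maxima collected for one carrier.
    best = -Inf;
    while hasnext(intermed)
        best = max(best, getnext(intermed));
    end
    add(output, carrier, best);
end
```

The reduce function is called once per distinct key, so each carrier's values are combined independently of the others.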

Data Analytics Landscape
Algorithm complexity axis:
  SIMPLE: easily partitioned; independent tasks
  COMPLEX: iterative; all data needed in memory at once; more programming effort required
Data size axis: increasing from SMALL
Tools: vectorisation, built-in numerical & statistical algorithms, gpuArray, parfor, spmd, distributed arrays, mapreduce

Strengths of MATLAB for Large Data Analytics

Challenge: Getting started
MATLAB Solution: Easy access to data from your desktop. Tools for accessing typical big data sets consisting of text or binary files, contained in database tables or stored on Hadoop.

Challenge: Rapid data exploration
MATLAB Solution: All the tools to explore and visualize data. Easy to try different methods. Ideal environment for developing your own methods.

Challenge: Development of scalable algorithms
MATLAB Solution: Work on the desktop and scale to clusters. Tools for use in analyzing big data on your desktop, which scale for use on clusters, including Hadoop, if needed.

Challenge: Use within business systems
MATLAB Solution: Ease of deployment and leveraging enterprise. Push-button deployment into production, including support for Hadoop.

Strengths of MATLAB for Large Data Analytics

Challenge: Getting started
MATLAB Solution: Easy access to data from your desktop. Tools for accessing typical data sets consisting of text or binary files, Excel files, or contained in database tables. Data import from instruments.

Challenge: Rapid data exploration
MATLAB Solution: All the tools to explore and visualize data. Easy to try different methods. Ideal environment for developing your own methods.

Challenge: Development of scalable algorithms
MATLAB Solution: Work on the desktop and scale to clusters. Tools for use in analyzing big data on your desktop, which scale for use on clusters, including cloud, if needed.

Challenge: Use within business systems
MATLAB Solution: Ease of deployment and leveraging enterprise. Push-button deployment into production framework.


MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See www.mathworks.com/trademarks for a list of additional trademarks. Other product or brand names may be trademarks or registered trademarks of their respective holders. © 2014 The MathWorks, Inc.