Data Analytics with MATLAB: Tackling the Challenges of Big Data
  How big is big? What characterises big data?
  "Any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications." (Wikipedia)
  Francesca Perino, Application Engineering Team
  © 2014 The MathWorks, Inc.
MATLAB Application Development Landscape
  Prototyping, Programming, Deployment
Data-Driven Decisions and Data-Driven Design
  Measurement Devices, Big Data, Compute Power
Need is Across Many Application Areas
  Areas: System Design, Signal Processing, Image Processing, Model-Based Design, Data Analysis
  Examples: Hybrid and electric vehicles, Sound quality analysis, Advanced driver assistance system, Engine calibration, Portfolio risk optimization
Data Analytics in MATLAB: Moving up the Information Hierarchy
  Hierarchy (top to bottom): Action, Knowledge, Information, Data, Physical Sensors
  Source: Information Warfare, Edward Waltz, 1998
Data Analytics in MATLAB: Moving up the Information Hierarchy
  Hierarchy (top to bottom): Action, Knowledge, Information, Data, Physical Sensors
  OBSERVATION (Data level): Sensing, Collecting, Measurement, Data Acquisition
  Data sources: Databases, Data warehouses, HDFS (Hadoop), Flat files, Excel, Web, Data acquisition, Instruments, Imaging devices
Data Analytics in MATLAB: Moving up the Information Hierarchy
  Hierarchy (top to bottom): Action, Knowledge, Information, Data, Physical Sensors
  ORGANIZATION (Information level): Preprocessing, Calibration, Filtering, Data Reduction
  Data level: Sensing, Collecting, Measurement, Data Acquisition
  Callouts: Exploratory Analysis, Filtering, Data Processing
Data Analytics in MATLAB: Moving up the Information Hierarchy
  Hierarchy (top to bottom): Action, Knowledge, Information, Data, Physical Sensors
  UNDERSTANDING (Knowledge level): Analysis, Visualization, Modeling, Prediction
  Information level: Preprocessing, Calibration, Filtering, Data Reduction
  Data level: Sensing, Collecting, Measurement, Data Acquisition
  Machine Learning: Regression (linear, non-linear, non-parametric), Decision Tree, Classification, Ensemble Method, Neural Network, Support Vector Machine
  [Figures: NN vs. measured active power (per-unit) against time (secs); prediction MSE against turbine number]
Data Analytics in MATLAB: Moving up the Information Hierarchy
  Hierarchy (top to bottom): Action, Knowledge, Information, Data, Physical Sensors
  APPLICATION (Action level): Reporting, Apps, Scalable Deployment, Integration
  Knowledge level: Analysis, Visualization, Modeling, Prediction
  Information level: Preprocessing, Calibration, Filtering, Data Reduction
  Data level: Sensing, Collecting, Measurement, Data Acquisition
  Callouts: Reports, Integration into Existing Systems, MATLAB Applications, Feedback for Design and Operations, Excel
Large Data Analytics
  Workflow: Prototype, Data, Explore, Share/Deploy, Scale
  Work on the desktop; scale capacity as needed
Large Data Analytics on the Desktop
  Workflow: Prototype, Access, Explore, Share/Deploy, Scale
  Access big data from your desktop
    Collections of text files: datastore
    Databases: Database Toolbox
    Binary files: memmapfile
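As a rough illustration of the binary-file route listed above, the following sketch maps a small file into memory with memmapfile and indexes into it like an ordinary array. The file name and the values are made up for the example; a real use case would point at an existing large binary file instead of writing one first.

```matlab
% Write a small binary file of doubles to stand in for a huge binary file.
% The file name and the values are assumptions made for this illustration.
binfile = [tempname '.bin'];
distances = [2356 186 1876 568 359];
fid = fopen(binfile, 'w');
fwrite(fid, distances, 'double');
fclose(fid);

% Map the file into memory; values are paged in only when they are indexed.
m = memmapfile(binfile, 'Format', 'double');

% Index into the mapped data as if it were an ordinary column vector.
firstThree = m.Data(1:3);
maxdist = max(m.Data);

% Release the mapping before deleting the temporary file.
clear m
delete(binfile);
```

Only the portions of the file that are actually indexed need to be resident in memory, which is why this approach suits files larger than the available RAM.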
Example: Airline Flight Distance
  Data: BTS/RITA Airline On-Time Statistics; 123.5M records, 29 fields; CSV data, 22 files, 12 GB
  Task: Find the maximum distance travelled by commercial airlines, based on flight operations performance data
Standard Workflow (up to R2014a)
    % Location: one CSV file per year
    files = {'1987.csv', '1988.csv', '1989.csv', '1990.csv',...
        '1991.csv', '1992.csv', '1993.csv', '1994.csv', '1995.csv',...
        '1996.csv', '1997.csv', '1998.csv', '1999.csv', '2000.csv',...
        '2001.csv', '2002.csv', '2003.csv', '2004.csv', '2005.csv',...
        '2006.csv', '2007.csv', '2008.csv'};

    % Format: skip every field (%*q) except the one read as a number (%f)
    fmtspec = ['%*q%*q%*q%*q%*q%*q%*q%*q%*q%*q'...
        '%*q%*q%*q%*q%*q%*q%*q%*q%f%*q'...
        '%*q%*q%*q%*q%*q%*q%*q%*q%*q'];

    maxdist = -Inf;
    for i = 1 : numel(files)
        % Read data (the remaining textscan options are elided on the slide)
        filei = fopen(files{i});
        data = textscan(filei, fmtspec,...);
        fclose(filei);
        maxi = max(data{:});             % Compute
        maxdist = max(maxdist,maxi);     % Combine
    end
New Workflow with datastore (in R2014b)
    % Location
    airdata = datastore('*.csv');

    % Format
    airdata.SelectedVariableNames = {'Distance'};
    airdata.SelectedFormats = {'%f'};

    maxdist = -Inf;
    while hasdata(airdata)
        data = read(airdata);            % Read data
        maxi = max(data.Distance);       % Compute
        maxdist = max(maxdist, maxi);    % Combine
    end
datastore
  Easily specify the data set
    Single text file (or collection of text files)
    Database (using Database Toolbox)
  Preview data structure and format
  Select data to import using column names
  Incrementally read subsets of the data

    airdata = datastore('*.csv');
    airdata.SelectedVariableNames = {'Distance', 'ArrDelay'};
    data = read(airdata);
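The preview and column-selection steps listed above could look roughly like the following sketch. It writes a tiny CSV file first so that it runs on its own; the file contents are invented, and the column names simply mirror those used elsewhere in this deck.

```matlab
% Create a tiny CSV file (made-up values) to stand in for the airline data.
csvfile = [tempname '.csv'];
T = table([2356; 186; 1876; 568], [-10; 5; 32; 13], {'UA'; 'PS'; 'DL'; 'DL'}, ...
    'VariableNames', {'Distance', 'ArrDelay', 'UniqueCarrier'});
writetable(T, csvfile);

% Point a datastore at the file; no data is imported at this stage.
ds = datastore(csvfile);

% Preview the first few rows to inspect column names and types.
firstRows = preview(ds);

% Import only the columns of interest, selected by name.
ds.SelectedVariableNames = {'Distance', 'ArrDelay'};

% Read one block of rows; for a large file this returns only a subset.
data = read(ds);
maxdist = max(data.Distance);

delete(csvfile);
```

For a file this small a single read returns every row; with a large collection of files the same call would be repeated inside a `while hasdata(ds)` loop.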
Large Data Analytics on the Desktop
  Expand workspace
    64-bit processor support: increased in-memory data set handling
  Access portions of data too big to fit into memory
    Memory-mapped variables: huge binary file
    Datastore: huge text file or collections of text files
    Database: query portion of a big database table
  Variety of programming constructs
    System objects: analyze streaming data
    MapReduce: process text files that won't fit into memory
  Increase analysis speed
    Parallel for-loops: use with multicore/multi-process machines
    GPU arrays
Scaled Large Data Analytics
  Workflow: Prototype, Access, Explore, Share/Deploy, Scale
  [Diagram: approaches placed by memory use (out-of-memory to in-memory) and by complexity (embarrassingly parallel to non-partitionable)]
    Load, Analyze, Discard: datastore, parfor
    Distributed Memory: SPMD
Example: Airline Delay Analysis
  Data: BTS/RITA Airline On-Time Statistics; 123.5M records, 29 fields
  Tasks: Calculate delay patterns; Visualize summaries; Estimate & evaluate predictive models
  Resources: Amazon S3 data store; Amazon EC2 cluster
Airline Delay Analysis: Framework
  [Diagram: a Client and a Cluster/Grid/Cloud environment holding the yearly data (1987 to 1992 shown), linked by arrows labelled "Instructions" and "Reduced Data"]
Scaling Big Data Capacity
  MATLAB supports a number of programming constructs for use with clusters
  General compute clusters
    Parallel for-loops: embarrassingly parallel algorithms
    SPMD: distributed processing
  Hadoop clusters
    MapReduce: analyze data stored in the Hadoop Distributed File System
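To make the "embarrassingly parallel" case concrete, here is a minimal sketch of the maximum-distance computation written as a parfor loop. The chunks are made-up values standing in for the per-file data; the loop is assumed to fall back to ordinary serial execution when no parallel pool is available.

```matlab
% Made-up distance values split into chunks, one chunk per hypothetical file.
chunks = {[2356 186 1876], [568 359 176], [237 1867 245], [1365 789 914]};

maxdist = -Inf;
parfor k = 1:numel(chunks)
    maxi = max(chunks{k});           % Compute: local maximum of this chunk
    maxdist = max(maxdist, maxi);    % Combine: reduction across iterations
end
```

Because each iteration depends only on its own chunk, and `max` is a valid reduction operation, the iterations can run in any order on any worker.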
Scaled Large Data Analytics
  Workflow: Prototype, Access, Explore, Share/Deploy, Scale
  [Diagram: approaches placed by memory use (out-of-memory to in-memory) and by complexity (embarrassingly parallel to non-partitionable)]
    Load, Analyze, Discard: datastore, parfor
    MapReduce
    Distributed Memory: SPMD
mapreduce (in R2014b)
  Stages: Data Store, Map, Reduce
  [Diagram: sample flight records in the data store (for example "1503 UA LAX -5 -10 2356") pass through Map, which emits carrier/value pairs (for example UA 2356, PS 186, DL 1876), and then through Reduce, where the pairs are grouped by carrier (UA, PS, DL, US)]
Workflow with mapreduce
    % Data Store: specify and format the data
    indata = datastore('*.csv');
    indata.SelectedVariableNames = 'Distance';
    indata.SelectedFormats = '%f';

    % Map (in mapfun.m): compute and save the intermediate result
    function mapfun(data,~,intermed)
    maxi = max(data.Distance);
    add(intermed,'maxi',maxi);

    % Reduce (in reducefun.m): combine the intermediate results
    function reducefun(~,intermed,output)
    maxdist = -Inf;
    while hasnext(intermed)
        maxi = getnext(intermed);
        maxdist = max(maxdist,maxi);
    end
    add(output,'maxdist',maxdist);

    % Run the map and reduce functions over the data store
    outdata = mapreduce(indata,@mapfun,@reducefun)
mapreduce
  Use the powerful MapReduce programming technique to analyze big data
    Multiple items (keys) to organize and process
    Intermediate results do not fit in memory
  On the desktop
    Analyze big database tables (Database Toolbox)
    Increase compute capacity (Parallel Computing Toolbox)
    Access data on HDFS to develop algorithms for use on Hadoop
  With Hadoop
    Run on Hadoop using MATLAB Distributed Computing Server
    Deploy applications and libraries for Hadoop using MATLAB Compiler

    ********************************
    *      MAPREDUCE PROGRESS      *
    ********************************
    Map   0% Reduce   0%
    Map  20% Reduce   0%
    Map  40% Reduce   0%
    Map  60% Reduce   0%
    Map  80% Reduce   0%
    Map 100% Reduce  25%
    Map 100% Reduce  50%
    Map 100% Reduce  75%
    Map 100% Reduce 100%
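The "multiple keys" case can be sketched by keying the intermediate results on the carrier, so that the reduce step yields one maximum distance per carrier. This is an illustrative sketch only, not the code from the talk: the function name maxDistanceByCarrierDemo, the sample values and the temporary paths are all assumptions, and the code is meant to be saved as maxDistanceByCarrierDemo.m.

```matlab
function result = maxDistanceByCarrierDemo()
% Hypothetical demo: maximum Distance per UniqueCarrier using mapreduce.
    % Made-up sample data written to a temporary CSV file.
    csvfile = [tempname '.csv'];
    carrier = {'UA'; 'PS'; 'DL'; 'DL'; 'US'; 'PS'; 'UA'};
    distance = [2356; 186; 1876; 568; 359; 237; 1867];
    writetable(table(carrier, distance, ...
        'VariableNames', {'UniqueCarrier', 'Distance'}), csvfile);

    ds = datastore(csvfile);
    ds.SelectedVariableNames = {'UniqueCarrier', 'Distance'};

    % Run serially in the local session and keep output files out of pwd.
    mapreducer(0);
    outfolder = tempname;
    mkdir(outfolder);
    outds = mapreduce(ds, @mapfun, @reducefun, ...
        'OutputFolder', outfolder, 'Display', 'off');

    result = readall(outds);   % table with variables Key and Value
    delete(csvfile);
end

function mapfun(data, ~, intermed)
% Map: emit one (carrier, local maximum) pair per carrier in this block.
    [carriers, ~, idx] = unique(data.UniqueCarrier);
    for k = 1:numel(carriers)
        add(intermed, carriers{k}, max(data.Distance(idx == k)));
    end
end

function reducefun(key, intermed, output)
% Reduce: combine all local maxima that share the same carrier key.
    maxdist = -Inf;
    while hasnext(intermed)
        maxdist = max(maxdist, getnext(intermed));
    end
    add(output, key, maxdist);
end
```

Calling `result = maxDistanceByCarrierDemo()` returns a table with one row per carrier. The reduce function is the same as in the single-key workflow; only the key passed to `add` changes.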
Data Analytics Landscape
  [Diagram: techniques arranged by algorithm complexity (SIMPLE to COMPLEX) and by increasing data size (starting from SMALL)]
  SIMPLE: easily partitioned; independent tasks
  COMPLEX: iterative; all data needed in memory at once; more programming effort required
  Techniques shown: vectorisation, built-in numerical & statistical algorithms, gpuArray, parfor, spmd, distributed arrays, mapreduce
Strengths of MATLAB for Large Data Analytics
  Challenge: Getting started
    MATLAB solution: Easy access to data from your desktop
      Tools for accessing typical big data sets consisting of text or binary files, contained in database tables or stored on Hadoop
  Challenge: Rapid data exploration
    MATLAB solution: All the tools to explore and visualize data
      Easy to try different methods
      Ideal environment for developing your own methods
  Challenge: Development of scalable algorithms
    MATLAB solution: Work on the desktop and scale to clusters
      Tools for use in analyzing big data on your desktop, which scale for use on clusters, including Hadoop, if needed
  Challenge: Use within business systems
    MATLAB solution: Ease of deployment and leveraging enterprise
      Push-button deployment into production including support for Hadoop
Strengths of MATLAB for Large Data Analytics
  Challenge: Getting started
    MATLAB solution: Easy access to data from your desktop
      Tools for accessing typical data sets consisting of text or binary files, Excel files, contained in database tables
      Data import from instruments
  Challenge: Rapid data exploration
    MATLAB solution: All the tools to explore and visualize data
      Easy to try different methods
      Ideal environment for developing your own methods
  Challenge: Development of scalable algorithms
    MATLAB solution: Work on the desktop and scale to clusters
      Tools for use in analyzing big data on your desktop, which scale for use on clusters, including cloud, if needed
  Challenge: Use within business systems
    MATLAB solution: Ease of deployment and leveraging enterprise
      Push-button deployment into production framework
MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See www.mathworks.com/trademarks for a list of additional trademarks. Other product or brand names may be trademarks or registered trademarks of their respective holders. © 2014 The MathWorks, Inc.