Big Data and FrameWorks; Perspectives to Applied Machine Learning


1 Big Data and FrameWorks; Perspectives to Applied Machine Learning
Mehdi Habibzadeh, PhD in Computer Science


2 Outline (Oct 2016):
- Big Data and Challenges
- Review and Trends
- Math and Probability Concepts
- Data Structure and Retrieval Algorithms
- Map-Reduce on Large Clusters
- Hadoop Framework Programming
- Apache Spark Framework
- Big Data and Cloud Computing
- Big Data and NoSQL
- Machine Learning (Conventional and Deep Learning)
- Big Data in the real world
2016 Big Data and Applied Machine Learning 2


3 Big Data and Challenges
- Sources and massive information
- Characteristics and trends
- 2015 saw a big jump in the world of big data.
  » Adoption of technologies associated with unstructured data
  » Ref :

4-7 Big Data and Challenges (Cont.)

8 Big Data: Math Terms
- Understanding and visualization
- Missing values, outlier values, ...
- ML (Maximum Likelihood), EM (Expectation Maximization)
- The interquartile range (IQR)
- Data mining and statistical approaches
- Data dimensionality reduction (PCA, SFS, BFS, ...)
- Relevance and redundancy (Kruskal-Wallis, Kolmogorov-Smirnov)
- Regression modeling (logistic regression, ...)
- Data compression (singular value decomposition)
- Variable selection and ranking (eigenvalues/eigenvectors, HDMR)
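The IQR item above can be illustrated with a minimal sketch of the usual 1.5 x IQR outlier rule. The function name `iqr_outliers`, the factor `k=1.5`, and the sample data are assumptions made for illustration; they do not come from the slides.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the IQR rule."""
    # Quartiles via the standard library; "inclusive" treats the data as the full population.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

# Hypothetical measurements with one obviously extreme value.
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
lower, upper, outliers = iqr_outliers(data)
print(lower, upper, outliers)
```

Values outside the two fences are flagged; whether they are then dropped, capped, or imputed is a separate modeling decision.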

9 Big Data: Math Terms (Cont.)
- Feature selection: reasons and motivation
  » To trace the effectiveness of the aforementioned high-dimensional invariant descriptors on white blood cell classification performance.
  » To provide a smaller effective set than the starting data pool.
  » To avoid redundant or irrelevant features.
- Two approaches (wrapper and filter):
  » Wrapper: an iterative method that scores feature subsets by their predictive performance with a given classifier (pattern recognition algorithm).
  » Filter: the objective function evaluates subsets using statistical dependency, regression, or interclass distance (machine learning).
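A minimal sketch of the filter approach, assuming absolute Pearson correlation with the class label as the statistical-dependency score. The slides do not name a specific criterion, and the feature names and values below are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def filter_rank(features, labels):
    """Rank feature names by |correlation with the labels|, strongest first."""
    scores = {name: abs(pearson(col, labels)) for name, col in features.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores

# Hypothetical feature columns for six samples and their binary labels.
features = {
    "area": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "noise": [3.0, 1.0, 3.0, 1.0, 3.0, 1.0],
    "constant": [7.0, 7.0, 7.0, 7.0, 7.0, 7.0],
}
labels = [0, 0, 0, 1, 1, 1]
ranking, scores = filter_rank(features, labels)
print(ranking)
```

A filter like this never trains the classifier, which keeps it cheap; a wrapper would instead retrain and evaluate the classifier for each candidate subset.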

10 Big Data: Math Terms (Cont.)
- Machine learning and prediction
- Reliability, uncertainty and global sensitivity analysis
- Clustering and classification
- Validation methods (cross-validation, hold-out datasets, ...)
- Graph Laplacian for clustering
- Deterministic methods (NN, SVM, ...)
- Probabilistic methods (Bayes classifier, PAM, ...)
- Deep learning (hierarchical classification)
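The cross-validation item above can be sketched as a plain k-fold index split. This is an illustrative helper (`k_fold_indices` is a made-up name) that does not shuffle; in practice the samples would usually be shuffled or stratified first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

folds = list(k_fold_indices(10, 3))
for train, val in folds:
    print("train:", train, "validate:", val)
```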


11 Big Data Search Algorithms
- Cache-aware and cache-oblivious models
- Cache-oblivious: uses the CPU cache without knowing the size of the cache (any sort of machine)
- Memory performance and improvement
- Adapts to arbitrary memory hierarchies
- Data clustering: locality of memory references is increased
- Applications: matrix multiplication, sorting, matrix transposition
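As a sketch of the cache-oblivious idea for matrix transposition, the recursion below keeps halving the larger dimension until a block is small, without ever being told a cache size. The function names and the `base` cutoff are assumptions for illustration. Python lists will not show the cache benefit; the point is only the access pattern.

```python
def transpose_block(src, dst, r0, r1, c0, c1, base=4):
    """Write the transpose of src[r0:r1][c0:c1] into dst by recursive halving."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= base and cols <= base:
        # Small block: copy directly; at some recursion depth it fits in any cache level.
        for i in range(r0, r1):
            for j in range(c0, c1):
                dst[j][i] = src[i][j]
    elif rows >= cols:
        mid = r0 + rows // 2
        transpose_block(src, dst, r0, mid, c0, c1, base)
        transpose_block(src, dst, mid, r1, c0, c1, base)
    else:
        mid = c0 + cols // 2
        transpose_block(src, dst, r0, r1, c0, mid, base)
        transpose_block(src, dst, r0, r1, mid, c1, base)

def cache_oblivious_transpose(matrix):
    n_rows, n_cols = len(matrix), len(matrix[0])
    out = [[None] * n_rows for _ in range(n_cols)]
    transpose_block(matrix, out, 0, n_rows, 0, n_cols)
    return out

matrix = [[i * 7 + j for j in range(7)] for i in range(10)]
result = cache_oblivious_transpose(matrix)
```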

12 Big Data Retrieval Algorithms
- Streaming
- Online data management
- Adapt to arbitrary and unstructured input data
- Real-Time Analytical Processing (RTAP)

13 Map-Reduce on Large Clusters
- Motivation and demand:
  » MapReduce programs tend to be very short, code-wise
  » They represent a data flow

14-15 Map-Reduce (Cont.)

16 Map-Reduce (Cont.)
- Each step has one Map phase and one Reduce phase
- Convert any computation into the MapReduce pattern
- Great solution for one-pass computations
- Not very efficient for multi-pass computations and algorithms
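The one-Map-phase, one-Reduce-phase pattern can be sketched in plain Python with the classic word count. This runs in a single process and only mimics the data flow; the function names are illustrative and are not Hadoop APIs.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["big data big clusters", "data flows"])
print(counts)
```

A multi-pass algorithm would have to chain several such jobs, writing intermediate results out between them, which is the inefficiency the last bullet refers to.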

17 Hadoop Framework
- Features:
  » Open source framework for processing large data
  » Works on cheap and unreliable clusters
  » Well known among companies that deal with big data applications
  » Compatible with Java, Python and Scala

18 Hadoop Framework (Cont.)
- MapReduce framework
  » Assigns work to different nodes
- Hadoop Distributed File System (HDFS)
  » Primary storage system used by Hadoop applications
  » Copies each piece of data and distributes the copies to individual nodes
  » Name Node (metadata) and Data Nodes (file blocks)
  » Redundant information (replicated three times by default)
  » Machines in a given cluster are cheap and unreliable
  » Decreases the risk of catastrophic failure, even in the event that numerous nodes fail
  » Links together the file systems on different nodes to make one integrated big file system (parallel processing)
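To make the effect of threefold replication concrete, here is a toy sketch that places each block on three distinct nodes and checks which failures still leave every block readable. The round-robin placement and node names are assumptions for illustration only; real HDFS uses a rack-aware placement policy.

```python
from itertools import combinations

def place_blocks(n_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (hypothetical round-robin policy)."""
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(n_blocks)
    }

def survives(placement, failed):
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node not in failed for node in replicas)
        for replicas in placement.values()
    )

nodes = ["node%d" % i for i in range(5)]
placement = place_blocks(8, nodes)

# With three replicas on distinct nodes, no pair of simultaneous failures loses data.
two_failures_ok = all(survives(placement, set(pair)) for pair in combinations(nodes, 2))
print(two_failures_ok)
```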

19 Hadoop Framework (Cont.)
- Hadoop V.2: Hadoop NextGen MapReduce (YARN)


20 Hadoop Framework (Cont.)
- Hadoop programming
  » Java: full control of MapReduce
  » Cascading (open Java library)
  » Python, Scala, Ruby
- Data retrieval / query language
  » Hive: SQL-like language
  » Pig: data flow language (simple, built out of small steps)
- Scalding
  » Scala library built on top of Cascading (elegant model)


21 Big Data Programming
- R, Java, Python and Scala (commonly used)
- Three references (recommended reading):

22 Hadoop Framework (Cont.)


23 Apache Spark Framework
- Spark features (more than distributed processing)
  » Ease of use and sophisticated analytics
  » In-memory data storage and near real-time processing
  » Holds intermediate results in memory
  » Stores as much data as possible in memory and then spills to disk
- Spark vs Hadoop
  » Runs on top of existing HDFS
  » Handles data sets that are diverse in nature (text, videos, ...)
  » Handles variety in the source of data (batch vs. real-time streaming data)
  » Up to 100 times faster than Hadoop MapReduce in memory, and up to 10 times faster when running on disk

24 Apache Spark Framework (Cont.)


25 Apache Spark Framework (Cont.)
- Compatible with Java, Scala and Python
- Performs data analytics and machine learning
- SQL queries, streaming data, machine learning and graph data processing
- Spark MLlib, Spark's machine learning library
- Spark can work with data stored in a Cassandra database

26 Big Data and Cloud
- Cloud computing platforms and services (Cloudera, Hortonworks, MapR, Azure)

27 Big Data and NoSQL
- Key-value stores
  » A unique key and a pointer to a particular item of data
  » Tokyo Cabinet/Tyrant, Redis, Voldemort, Oracle BDB, Amazon SimpleDB, Riak
- Column family stores
  » Very large amounts of data distributed over many machines
  » Cassandra, HBase

28 Big Data and NoSQL (Cont.)
- Document databases
  » Similar to key-value stores
  » Semi-structured documents are stored in formats like JSON
  » Nested values can be associated with each key
  » Document databases support more efficient querying
  » CouchDB, MongoDB


29 Big Data and NoSQL (Cont.)
- Graph databases
  » Flexible graph model instead of tables of rows and columns and the rigid structure of SQL
  » Scale across multiple machines (scale out)
  » Neo4j, InfoGrid, InfiniteGraph, Titan

30 Big Data and NoSQL (Cont.)
JSON Format

RDBMS          NoSQL
Database       Database
Table, View    Collection
Row            Document (JSON, BSON)
Column         Field
Index          Index
Join           Embedded Document
Foreign Key    Reference
Partition      Shard

> db.user.findOne({age: 39})
{
    "_id" : ObjectId("5114e0bd42..."),
    "first" : "John",
    "last" : "Doe",
    "age" : 39,
    "interests" : [ "Reading", "Mountain Biking" ],
    "favorites" : { "color" : "Blue", "sport" : "Soccer" }
}
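The query above can be mimicked in plain Python to show what matching a document by field means. This is a sketch over an in-memory list, not the MongoDB driver; `find_one` is a hypothetical helper that handles only top-level equality, and the second user record is invented for contrast.

```python
def find_one(collection, query):
    """Return the first document whose top-level fields equal every query value, else None."""
    for doc in collection:
        if all(doc.get(field) == value for field, value in query.items()):
            return doc
    return None

# A "collection" is modeled as a list of documents (dicts); nesting is allowed, as in JSON.
users = [
    {"first": "Jane", "last": "Roe", "age": 27, "interests": ["Chess"]},
    {
        "first": "John",
        "last": "Doe",
        "age": 39,
        "interests": ["Reading", "Mountain Biking"],
        "favorites": {"color": "Blue", "sport": "Soccer"},
    },
]

match = find_one(users, {"age": 39})
print(match["first"], match["favorites"]["sport"])
```

Because the embedded `favorites` document travels with the user record, no join is needed to read it, which is the Join to Embedded Document correspondence in the table.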

31 Big Data and NoSQL (Cont.)


32 Machine Learning
- Conventional methods
  » Feature extraction and selection provide the input; the proposed machine acts as the classifier
- Sample ML methods:
  » Support Vector Machines (SVM)
  » Naive Bayes classifier
  » Artificial Neural Network (ANN)

33 Machine Learning (Cont.)
- Support Vector Machine (SVM)
  » Kernel settings (linear, polynomial and Gaussian)
  » Suited to cases where the number of features is large compared to the training sample
  » Less prone to overfitting than the alternative choices
  » Soft margin and hard margin
  » Overfitting is controlled by the soft margin (slack variables ε_i)
  » One-versus-all works well in practice (the class with the highest response wins)
  » K-fold cross-validation (validation data)
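For reference, the soft-margin formulation mentioned above is commonly written as follows, keeping the slide's ε_i for the slack variables (many texts write ξ_i instead). The penalty parameter C, the weight vector w, the bias b, and the training pairs (x_i, y_i) with y_i in {-1, +1} are standard notation that the slides do not introduce.

```latex
\min_{w,\,b,\,\varepsilon}\ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \varepsilon_i
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \varepsilon_i, \qquad \varepsilon_i \ge 0, \qquad i = 1, \dots, n
```

A larger C penalizes margin violations more heavily and approaches the hard-margin case; a smaller C tolerates more violations, which is how the soft margin limits overfitting.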

34 Machine Learning: Deep Learning
- Supervised and unsupervised approaches
- Greedy layer-wise unsupervised pre-training
- Builds a hierarchy of features one level at a time
- Learns a new transformation at each level, composed with the previously learned transformations
- Seeks regularities in order to extract a unique representation
- Higher layers find representations more useful than the original input
- Accurate hierarchical representation of complex data
- Subsequent feature extraction and classification problems (types and classes)

35 Deep Learning (Cont.)
- Earliest concepts of deep learning: Perceptron and neural network structures
- A neural network can technically have more than one hidden layer
- Increasing the number of hidden layers
  » Vanishing gradients, overfitting
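A rough numerical sketch of why adding hidden layers leads to vanishing gradients, assuming sigmoid activations and a simplified chain with one unit per layer. The unit weights and zero pre-activations are illustrative assumptions chosen so that each layer contributes the sigmoid's maximum slope of 0.25.

```python
import math

def sigmoid_derivative(z):
    """Derivative of the logistic sigmoid; its maximum value is 0.25, at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def gradient_scale(depth, z=0.0, weight=1.0):
    """Product of the per-layer factors weight * sigmoid'(z) along a chain of `depth` layers."""
    return (weight * sigmoid_derivative(z)) ** depth

scales = {depth: gradient_scale(depth) for depth in (1, 5, 10, 20)}
for depth, scale in scales.items():
    print(depth, scale)
```

Even in this best case for the sigmoid, the factor reaching the first layer shrinks geometrically with depth, so early layers learn very slowly.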

36 Deep Learning (Cont.)
- Auto-encoders
- Stacked auto-encoders
- Restricted Boltzmann Machines
- The spike and slab Restricted Boltzmann Machine (RBM)
- Deep Belief Networks
- Convolutional Networks


37 Deep: Convolutional Neural Network
- Extracts topologically invariant properties (spatially local connections, i.e. receptive fields) from the gray-scale image
- Especially suited to input that is spatially or temporally distributed
- A CNN is composed of two distinct parts:
  » Several layers perform convolution followed by down-sampling (max pooling)
  » The second part categorizes the pattern into classes (such as RBF)
- A CNN consists of three different kinds of layers: convolution layers (with different feature maps), sub-sampling (max-pooling) layers and an ensemble of fully connected layers
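The convolution and max-pooling stages described above can be sketched in plain Python on a tiny gray-scale image. The 5x5 image and the 2x2 diagonal-difference kernel are made-up inputs, and the helpers are illustrative rather than taken from any framework.

```python
def conv2d_valid(image, kernel):
    """'Valid' 2D cross-correlation, the operation CNN layers usually call convolution."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: keep the largest response in each size x size window."""
    h, w = len(feature_map) // size, len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + a][j * size + b] for a in range(size) for b in range(size))
            for j in range(w)
        ]
        for i in range(h)
    ]

image = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
kernel = [[1, 0], [0, -1]]  # responds to intensity changes along the diagonal

feature_map = conv2d_valid(image, kernel)  # 4x4 map of local responses
pooled = max_pool(feature_map)             # 2x2 map after down-sampling
print(feature_map)
print(pooled)
```

Each output value depends only on a small receptive field of the input, and pooling keeps the strongest response per window, which gives some tolerance to small shifts.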

38 Convolutional Neural Network (Cont.)
- CNN: recognition rate after 105 epochs
- Few samples (28 per class)
- Similarity between Basophil and Lymphocyte

39 Deep Learning in Code!
- Reference :
- Programming languages: Python, Matlab, Java, Lua
- Machine learning in Python: Scikit-learn, Keras, Caffe, Pylearn2, ...
- Machine learning in Matlab
- Torch7 (Lua)
- Machine learning in Java: Deeplearning4j

40 Machine Learning in Python


41 Big Data in the real world
- Climate data
- Large-scale health care
- Complex image processing
- Personalization (Facebook, Telegram, ...)
- Advertising
- Mobile telecommunication networks (e.g., 5G)
- E-commerce and e-banking applications

42 Big Data in the real world (Cont.)
- Deep learning algorithm transcribes house numbers (Google)

43 Big Data in the real world (Cont.)
- Car classification using a deep learning approach


44 Big Data in the real world (Cont.)
- Banking systems: big data and deep learning
  » Banknote authentication and forgery detection
  » Financial fraud detection
  » Bank embezzlement and money laundering
  » Boosting e-commerce sales and reducing losses from disgruntled customers
  » Loan approval prediction


45 Contact Info
Mehdi (Nima) Habibzadeh Motlagh
PhD in Computer Science (Concordia University, Sept 2015)
Email : Nimahm@Gmail.com
Cell phone :
Telegram :


Introduction to NoSQL (MongoDB and Elastic)
By: Mehdi Habibzadeh (@NimaHM1980), Hossein Shemshadi (@HosseinShemshadi)
July 2017
Outline (July 2017):
- Big Data and Challenges
- Review and Trends
- Map-Reduce