Machine Learning Algorithms using Parallel Approach. Shri. Aditya Kumar Sinha, FDP-Programme Head & Principal Technical Officer, ACTS, C-DAC Pune


1 Machine Learning Algorithms using Parallel Approach. Shri. Aditya Kumar Sinha, FDP-Programme Head & Principal Technical Officer, ACTS, C-DAC Pune

2 Presentation Plan: Thinking Parallel; Parallel Computing; The Learning Problem; Machine Learning; Parallelization of Machine Learning

3 Think Parallel: converting an array of strings to upper case; summing 1 to n; splitting a list (a b # c # d e) on the # delimiters
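
A minimal sketch (not from the original slides) of how the first two warm-up tasks decompose into independent subtasks, using Python's standard concurrent.futures module; the function and variable names are illustrative.

from concurrent.futures import ProcessPoolExecutor

words = ["alpha", "beta", "gamma", "delta"]

def to_upper(s: str) -> str:
    return s.upper()

def partial_sum(bounds):
    lo, hi = bounds
    return sum(range(lo, hi))

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        # Each string is converted independently, so the map parallelizes trivially.
        upper = list(pool.map(to_upper, words))

        # Summing 1..n: split the range into chunks, sum the chunks in parallel, combine.
        n, chunks = 1_000_000, 4
        step = n // chunks
        ranges = [(i * step + 1, (i + 1) * step + 1) for i in range(chunks)]
        total = sum(pool.map(partial_sum, ranges))

    print(upper, total)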

4 Parallel Computing Consider the problem of stacking (reshelving) a set of library books. A single worker trying to stack all the books in their proper places cannot accomplish the task faster than a certain rate. We can speed up this process, however, by employing more than one worker.

5 Solution 1: Assume that books are organized into shelves and that the shelves are grouped into bays. One simple way to assign the task to the workers is to divide the books equally among them, with each worker stacking the books one at a time. This division of work may not be the most efficient way to accomplish the task, since the workers must walk all over the library to stack books.

6 Solution 2 (an instance of task partitioning): An alternative way to divide the work is to assign a fixed and disjoint set of bays to each worker. As before, each worker is assigned an equal number of books, arbitrarily. If a worker finds a book that belongs to a bay assigned to him or her, he or she places that book in its assigned spot; otherwise, he or she passes it on to the worker responsible for the bay it belongs to (an instance of task communication). The second approach requires less effort from individual workers.

7 Problems are parallelizable to different degrees. For some problems, assigning partitions to other processors might be more time-consuming than performing the processing locally. Other problems may be completely serial. For example, consider the task of digging a post hole: although one person can dig a hole in a certain amount of time, employing more people does not reduce this time.

8 Sorting in nature

9 Parallel Processing: Several processing elements working to solve a single problem. Primary consideration: elapsed time, NOT throughput, resource sharing, etc. Downside: complexity of system and algorithm design. Elapsed time = computation time + communication time + synchronization time.

10 Design of efficient algorithms: A parallel computer is of little use unless efficient parallel algorithms are available. The issues in designing parallel algorithms are very different from those in designing their sequential counterparts. A significant amount of work is being done to develop efficient parallel algorithms for a variety of parallel architectures.

11 Some Complex Problems N-body simulation Atmospheric simulation Image generation Oil exploration Financial processing Computational biology

12 Some Complex Problems: N-body simulation takes O(n log n) time; for a galaxy-sized number of stars, approx. one year per iteration. Atmospheric simulation uses a 3D grid in which each element interacts with its neighbors; with 1x1x1 mile elements, a 10-day simulation requires approx. 100 days of computation.

13 Some Complex Problems: Image generation (animation, special effects): several minutes of video can take 50 days of rendering. Oil exploration: large amounts of seismic data to be processed; months of sequential computation.

14 Some Complex Problems: Financial processing (market prediction, investing; e.g. Cornell Theory Center, Renaissance Technologies). Computational biology: drug design, gene sequencing (Celera), structure prediction (proteomics).

15 Fundamental Issues Is the problem amenable to parallelization? How to decompose the problem to exploit parallelism? What machine architecture should be used? What parallel resources are available? What kind of speedup is desired?

16 Metrics: A measure of relative performance between a multiprocessor system and a single-processor system is the speed-up S(p), defined as the execution time on a single processor divided by the execution time on a multiprocessor with p processors: S(p) = T_1 / T_p. Efficiency: E(p) = S(p) / p. Cost: C(p) = p * T_p.
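
A small worked example of these metrics (not from the slides); the timing numbers are made up purely for illustration.

def parallel_metrics(t1: float, tp: float, p: int):
    """Compute speed-up, efficiency and cost from single- and p-processor run times."""
    s = t1 / tp          # speed-up S(p) = T_1 / T_p
    e = s / p            # efficiency E(p) = S(p) / p
    cost = p * tp        # cost C(p) = p * T_p
    return s, e, cost

# Hypothetical timings: T_1 = 120 s on one processor, T_p = 20 s on p = 8 processors.
s, e, cost = parallel_metrics(t1=120.0, tp=20.0, p=8)
print(f"S(p) = {s:.1f}, E(p) = {e:.2f}, Cost = {cost:.0f} processor-seconds")
# Prints: S(p) = 6.0, E(p) = 0.75, Cost = 160 processor-seconds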

17 Machine Learning

18 Quick Questionnaire: How many people have heard about machine learning? How many people know about machine learning? How many people are using machine learning?

19 Why Learn? Machine learning is programming computers to optimize a performance criterion using example data or past experience. There is no need to learn to calculate payroll. Learning is used when: human expertise does not exist (navigating on Mars); humans are unable to explain their expertise (speech recognition); the solution changes over time (routing on a computer network); or the solution needs to be adapted to particular cases (user biometrics).

20 About: Machine learning is a subfield of Artificial Intelligence (AI). The name derives from the idea that it deals with the construction and study of systems that can learn from data, and it can be seen as a building block for making computers behave more intelligently. It is a broad concept; there are various techniques with various implementations.

21 In other words: A computer program is said to learn from experience E with respect to some class of tasks T and a performance measure P if its performance at tasks in T, as measured by P, improves with experience E.

22 Terminology. Features: the distinct traits that can be used to describe each item in a quantitative manner. Samples: a sample is an item to process (e.g. classify); it can be a document, a picture, a sound, a video, a row in a database or CSV file, or anything else you can describe with a fixed set of quantitative traits. Feature vector: an n-dimensional vector of numerical features that represents some object. Feature extraction: preparation of the feature vector; it transforms the data from a high-dimensional space to a space of fewer dimensions. Training/Evaluation set: a set of data used to discover potentially predictive relationships.
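
As a toy illustration (not from the slides) of samples and feature vectors, a handful of text documents can be turned into fixed-length numerical vectors with a bag-of-words transform; scikit-learn's CountVectorizer is used here as one common way to do it (the exact helper names may differ slightly between library versions).

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]   # two samples

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)       # one row (feature vector) per sample
print(vectorizer.get_feature_names_out())  # the extracted features (vocabulary)
print(X.toarray())                         # the numerical feature vectors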

23 Categories: Supervised Learning, Unsupervised Learning, Semi-Supervised Learning, Reinforcement Learning

24 Supervised Machine Learning: The majority of practical machine learning uses supervised learning. Supervised learning is where you have input variables (X) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output: Y = f(X). The goal is to approximate the mapping function so well that when you have new input data (X) you can predict the output variable (Y) for that data. It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process. We know the correct answers; the algorithm iteratively makes predictions on the training data and is corrected by the teacher, and learning stops when the algorithm achieves an acceptable level of performance. Supervised learning problems can be further grouped into regression and classification problems. Classification: a classification problem is when the output variable is a category, such as red or blue, or disease and no disease. Regression: a regression problem is when the output variable is a real value, such as dollars or weight.
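
A minimal supervised-learning sketch (not part of the original slides) using scikit-learn; the dataset and model choices are arbitrary illustrations of the Y = f(X) idea.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# X: input variables, y: known output labels (the "teacher's" correct answers).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)   # learns an approximation of f: X -> Y
model.fit(X_train, y_train)

y_pred = model.predict(X_test)              # predictions for new, unseen inputs
print("accuracy:", accuracy_score(y_test, y_pred))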

25 Unsupervised Machine Learning: Unsupervised learning is where you only have input data (X) and no corresponding output variables. The goal of unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data. It is called unsupervised learning because, unlike supervised learning above, there are no correct answers and there is no teacher; algorithms are left to their own devices to discover and present the interesting structure in the data. Unsupervised learning problems can be further grouped into clustering and association problems. Clustering: a clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior. Association: an association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy A also tend to buy B. Some popular examples of unsupervised learning algorithms are k-means for clustering problems and the Apriori algorithm for association rule learning problems.

26 Semi-Supervised Machine Learning: Problems where you have a large amount of input data (X) and only some of the data is labeled (Y) are called semi-supervised learning problems. These problems sit in between supervised and unsupervised learning. A good example is a photo archive where only some of the images are labeled (e.g. dog, cat, person) and the majority are unlabeled. Many real-world machine learning problems fall into this area, because it can be expensive or time-consuming to label data (it may require access to domain experts), whereas unlabeled data is cheap and easy to collect and store. You can use unsupervised learning techniques to discover and learn the structure in the input variables. You can also use supervised learning techniques to make best-guess predictions for the unlabeled data, feed that data back into the supervised learning algorithm as training data, and use the model to make predictions on new, unseen data.

27 Reinforcement Learning allows the machine or software agent to learn its behavior based on feedback from the environment. This behavior can be learnt once and for all, or it can keep adapting as time goes by.

28 Machine Learning Techniques

29 Techniques. Classification: predict a class from observations. Clustering: group observations into meaningful groups. Regression (prediction): predict a value from observations.

30 Classification: classify a document into a predefined category; documents can be text, images, etc. A popular classifier is the Naive Bayes classifier. Steps: Step 1: train the program (build a model) using a training set in which each document has a category, e.g. sports, cricket, news; the classifier computes, for each word, the probability that it makes a document belong to each of the considered categories. Step 2: test a separate data set against this model.
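
A hedged sketch (not from the slides) of those two steps using scikit-learn's CountVectorizer and MultinomialNB; the tiny training set and labels are invented for illustration only.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Step 1: train the program (build a model) from labelled documents.
train_docs = ["the batsman scored a century", "election results announced today",
              "bowler takes five wickets", "government passes new budget"]
train_labels = ["cricket", "news", "cricket", "news"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)      # word-count feature vectors
model = MultinomialNB().fit(X_train, train_labels)  # per-word class probabilities

# Step 2: test unseen documents against the model.
test_docs = ["the captain won the toss", "parliament debates the new bill"]
X_test = vectorizer.transform(test_docs)
print(model.predict(X_test))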

31 Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to objects in other groups; the groups are not predefined. For example, the keywords man's shoe, woman's shoe, woman's t-shirt, man's t-shirt can be clustered into two categories, shoe and t-shirt (or man and woman). Popular algorithms are k-means clustering and hierarchical clustering.

32 K-means Clustering: partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, the mean serving as a prototype of the cluster.
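
For reference (not in the slides), this is what that partitioning looks like with scikit-learn's KMeans on some made-up 2-D points.

import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D observations forming two obvious groups.
points = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                   [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("labels:   ", kmeans.labels_)           # cluster index per observation
print("centroids:", kmeans.cluster_centers_)  # the cluster means (prototypes)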

33 Use-Cases: Spam detection; Machine translation (language translation); Image search (similarity); Clustering (k-means): Amazon recommendations; Classification: Google News (continued)

34 Use-Cases (contd.): Text summarization: Google News; Rating a review/comment: Yelp; Fraud detection: credit card providers; Decision making: e.g. banking/insurance sector; Sentiment analysis; Speech understanding: iPhone with Siri; Face detection: Facebook's photo tagging

35 What is Clustering? Organizing data into classes such that there is high intra-class similarity and low inter-class similarity; finding the class labels and the number of classes directly from the data (in contrast to classification). More informally, finding natural groupings among objects.

36 Example: k-means clustering, an EM-like algorithm. Initialize k cluster centroids. E-step: associate each data instance with the closest centroid (find the expected values of the cluster assignments given the data and the centroids). M-step: recalculate the centroids as the average of the associated data instances (find new centroids that maximize that expectation). Repeat the E- and M-steps until the assignments stop changing.
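
A self-contained sketch (not from the slides) of those two steps in NumPy; the array shapes, the random initialization, and the convergence test are assumptions of this illustration, not the presenter's implementation.

import numpy as np

def kmeans(data, k, iters=100, seed=0):
    """Plain k-means: data is an (n, d) array; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]  # initialize k centroids
    for _ in range(iters):
        # E-step: assign each instance to the closest centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: recompute each centroid as the mean of its assigned instances.
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):   # stop once centroids no longer move
            break
        centroids = new_centroids
    return centroids, labels

On small toy data such as the points shown earlier, this typically recovers the same groups as the library version, up to a permutation of the cluster labels.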

37 The data points

38 Initialization

39 #Runs = 1

40 #Runs = 2

41 #Runs = 3

42 Applications of the K-means Method: optical character recognition, biometrics, diagnostic systems, military applications

43 45 Parallelizing k-means (three figure-only slides)

46 Parallelization: platform choices. Platform (communication scheme, data size): Peer-to-Peer (TCP/IP, petabytes); Virtual Clusters (MapReduce / MPI, terabytes); HPC Clusters (MPI / MapReduce, terabytes); Multicore (multithreading, gigabytes); GPU (CUDA, gigabytes); FPGA (HDL, gigabytes)

47 Peer-to-peer (P2P) systems: millions of machines connected in a network; each machine can only contact its neighbors; each machine stores millions of data instances. Practically unlimited scale, but communication is the bottleneck: aggregation is costly, broadcast is cheaper. Messages are sent over a spanning tree with an arbitrary node as the root.

48 k-means in P2P: uniformly sample k centroids over the P2P network using a random-walk method; broadcast the centroids; run local k-means on each machine; sample n nodes and aggregate the local centroids of those n nodes.

49 Parallelization: platform choices. Platform (communication scheme, data size): Peer-to-Peer (TCP/IP, petabytes); Virtual Clusters (MapReduce / MPI, terabytes); HPC Clusters (MPI / MapReduce, terabytes); Multicore (multithreading, gigabytes); GPU (CUDA, gigabytes); FPGA (HDL, gigabytes)

50 Virtual clusters: datacenter-scale clusters of hundreds of thousands of machines; a distributed file system with data redundancy; the cloud-computing paradigm (virtualization, full fault tolerance, pay-as-you-go). MapReduce is the #1 data-processing scheme.

51 MapReduce: mappers process the input in parallel, a shuffle phase redistributes the intermediate data, and reducers process it in parallel. Mappers output (key, value) records; records with the same key are sent to the same reducer.

52 k-means on MapReduce: mappers read data portions and the current centroids; mappers assign data instances to clusters; mappers compute new local centroids and local cluster sizes; reducers aggregate the local centroids (weighted by local cluster sizes) into new global centroids; reducers write the new centroids.
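
A framework-free sketch of one such iteration expressed as map and reduce functions; this is my own illustration, not code from the slides, and a real job would run these inside Hadoop (or a wrapper) rather than in a single Python process.

import numpy as np
from collections import defaultdict

def kmeans_map(data_chunk, centroids):
    """One mapper: assign a data portion to clusters, emit (cluster_id, (local_sum, local_count))."""
    k, d = centroids.shape
    sums = {j: np.zeros(d) for j in range(k)}
    counts = {j: 0 for j in range(k)}
    for x in data_chunk:
        j = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
        sums[j] += x
        counts[j] += 1
    return [(j, (sums[j], counts[j])) for j in range(k) if counts[j] > 0]

def kmeans_reduce(values):
    """One reducer: combine the local sums/counts of a single cluster into its global centroid."""
    total = sum(s for s, _ in values)
    count = sum(c for _, c in values)
    return total / count

def one_iteration(chunks, centroids):
    """Driver for one iteration; the shuffle is simulated by grouping records on their key."""
    grouped = defaultdict(list)
    for chunk in chunks:                              # "map" phase, would run in parallel
        for key, value in kmeans_map(chunk, centroids):
            grouped[key].append(value)
    new_centroids = centroids.copy()
    for cid, values in grouped.items():               # "reduce" phase, would run in parallel
        new_centroids[cid] = kmeans_reduce(values)
    return new_centroids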

53 Discussion on MapReduce: MapReduce is not designed for iterative processing (mappers read the same data again and again). MapReduce looks too low-level to some people: data analysts are traditionally SQL folks. MapReduce looks too high-level to others: a lot of MapReduce logic is hard to adapt (example: grouping documents by words).

54 MapReduce wrappers: many are available, at different levels of stability. Apache Pig is an SQL-like environment (Group, Join, Filter rows, Filter columns via Foreach), developed at Yahoo! Research. DryadLINQ is a C#-like environment, developed at Microsoft Research.

55 Parallelization: platform choices. Platform (communication scheme, data size): Peer-to-Peer (TCP/IP, petabytes); Virtual Clusters (MapReduce / MPI, terabytes); HPC Clusters (MPI / MapReduce, terabytes); Multicore (multithreading, gigabytes); GPU (CUDA, gigabytes); FPGA (HDL, gigabytes)

56 HPC clusters: High Performance Computing clusters / blades / supercomputers with thousands of cores. Great variety of architectural choices: disk organization, cache, communication, etc. Fault-tolerance mechanisms are not crucial, since hardware failures are rare. The most typical communication protocol is MPI (Message Passing Interface).

57 Message Passing Interface (MPI): a runtime communication library, available for many programming languages. MPI_Bsend(void* buffer, int size, int destid) sends a buffer (serialization is on you); MPI_Recv(void* buffer, int size, int sourceid) waits until the message is received; MPI_Bcast broadcasts a message; MPI_Barrier synchronizes all processes.
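
A hedged sketch of the same primitives from Python via the mpi4py package (not mentioned in the slides); run it with something like mpiexec -n 4 python mpi_demo.py, where the file name is illustrative.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Broadcast: rank 0 defines a message, every other rank receives a copy.
msg = {"centroid_count": 4} if rank == 0 else None
msg = comm.bcast(msg, root=0)

# Point-to-point: rank 0 sends to rank 1 (mpi4py pickles Python objects for us).
if comm.Get_size() > 1:
    if rank == 0:
        comm.send("hello from 0", dest=1, tag=7)
    elif rank == 1:
        greeting = comm.recv(source=0, tag=7)   # blocks until the message arrives
        print(greeting)

comm.Barrier()                                   # synchronize all processes
print(f"rank {rank} got {msg}")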

58 MapReduce vs. MPI: MPI is a generic framework in which processes send messages to other processes, so any computation graph can be built; it is most suitable for the master/slave model.

59 k-means using MPI: slaves read data portions; the master broadcasts the centroids to the slaves; slaves assign data instances to clusters; slaves compute new local centroids and local cluster sizes, then send them to the master; the master aggregates the local centroids, weighted by local cluster sizes, into new global centroids.
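
A sketch of one such iteration with mpi4py (my own illustration, not the presenter's code); it reuses the master/slave shape above, with rank 0 acting as the master and random data standing in for the real portions.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
k, d = 3, 2

# Every rank holds its own data portion (random here, purely for illustration).
local_data = np.random.default_rng(rank).random((100, d))

# Master broadcasts the current centroids to all slaves.
centroids = np.random.default_rng(0).random((k, d)) if rank == 0 else None
centroids = comm.bcast(centroids, root=0)

# Each rank assigns its instances to clusters and computes local sums and counts.
dists = np.linalg.norm(local_data[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)
local_sums = np.array([local_data[labels == j].sum(axis=0) for j in range(k)])
local_counts = np.array([np.sum(labels == j) for j in range(k)])

# Master aggregates the weighted local results into new global centroids.
global_sums = np.zeros_like(local_sums) if rank == 0 else None
global_counts = np.zeros_like(local_counts) if rank == 0 else None
comm.Reduce(local_sums, global_sums, op=MPI.SUM, root=0)
comm.Reduce(local_counts, global_counts, op=MPI.SUM, root=0)
if rank == 0:
    new_centroids = global_sums / np.maximum(global_counts, 1)[:, None]
    print("new centroids:\n", new_centroids)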

60 Two features of MPI parallelization. State-preserving processes: processes can live as long as the system runs, so there is no need to read the same data again and again, and all necessary parameters can be preserved locally. Hierarchical master/slave paradigm: a slave can be a master of other processes, which can be very useful for dynamic resource allocation when a slave recognizes it has too much to process.

61 Takeaways on MPI: old, well established, well debugged; very flexible; perfectly suitable for iterative processing; fault-intolerant; not that widely available anymore. An open-source implementation: OpenMPI. MPI can be deployed on Hadoop.

62 Parallelization: platform choices. Platform (communication scheme, data size): Peer-to-Peer (TCP/IP, petabytes); Virtual Clusters (MapReduce / MPI, terabytes); HPC Clusters (MPI / MapReduce, terabytes); Multicore (multithreading, gigabytes); GPU (CUDA, gigabytes); FPGA (HDL, gigabytes)

63 Multicore: one machine, up to dozens of cores, shared memory, one disk; multithreading as the parallelization scheme. The data might not fit in RAM: use streaming to process the data in portions, in which case disk access may be the bottleneck. If it does fit, RAM access is the bottleneck: use uniform, small-size memory requests.
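
A small sketch (not from the slides) of processing data in portions on a multicore machine; because of the CPython GIL, this illustration uses process-based workers from the standard multiprocessing module rather than threads for the CPU-bound step, and the data and centroids are invented.

import numpy as np
from multiprocessing import Pool

def assign_chunk(args):
    """Assign one streamed-in chunk of points to the nearest centroid."""
    chunk, centroids = args
    d = np.linalg.norm(chunk[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.random((100_000, 2))          # stand-in for data streamed from disk
    centroids = rng.random((5, 2))
    chunks = np.array_split(data, 8)         # process the data in portions

    with Pool(processes=4) as pool:          # one worker per core (4 assumed here)
        labels = np.concatenate(pool.map(assign_chunk, [(c, centroids) for c in chunks]))
    print(labels[:10])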

64 Parallelization: platform choices. Platform (communication scheme, data size): Peer-to-Peer (TCP/IP, petabytes); Virtual Clusters (MapReduce / MPI, terabytes); HPC Clusters (MPI / MapReduce, terabytes); Multicore (multithreading, gigabytes); GPU (CUDA, gigabytes); FPGA (HDL, gigabytes)

65 Graphics Processing Unit (GPU): the GPU has become general-purpose (GP-GPU). CUDA is a GP-GPU programming framework powered by NVIDIA. Each GPU consists of hundreds of multiprocessors, and each multiprocessor consists of a few ALUs; the ALUs execute the same line of code synchronously. When the code branches, some multiprocessors stall, so avoid branching as much as possible.

66 Machine learning with GPUs: to fully utilize a GPU, the data needs to fit in its RAM, which limits the maximal size of the data. GPUs are optimized for speed, making them a good choice for real-time tasks. A typical use case: a model is trained offline and then applied in real time (inference). Machine vision and speech recognition are example domains.

67 k-means clustering on a GPU: cluster-membership assignment is done on the GPU. The centroids are uploaded to every multiprocessor; a multiprocessor works on one data vector at a time, and each ALU works on one data dimension. Centroid recalculation is then done on the CPU. This is most appropriate for processing dense data, since scattered memory access should be avoided; a multiprocessor reads a data vector while its ALUs process the previous vector.
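
A hedged sketch of that split (assignment on the device, centroid update on the host) using the CuPy library, which mirrors the NumPy API; this is my illustration under stated assumptions, not the CUDA implementation benchmarked on the next slide, and it falls back to NumPy when no GPU is present.

import numpy as np
try:
    import cupy as xp          # GPU arrays with a NumPy-compatible API
except ImportError:
    xp = np                    # CPU fallback so the sketch still runs

def assign_on_device(data_np, centroids_np):
    """Assignment step on the device: one cluster label per data vector."""
    data = xp.asarray(data_np)
    centroids = xp.asarray(centroids_np)
    d = xp.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = xp.argmin(d, axis=1)
    return np.asarray(labels.get() if hasattr(labels, "get") else labels)

def update_on_host(data_np, labels, k):
    """Centroid recalculation back on the CPU (empty clusters not handled in this sketch)."""
    return np.array([data_np[labels == j].mean(axis=0) for j in range(k)])

rng = np.random.default_rng(0)
data = rng.random((10_000, 8)).astype(np.float32)
centroids = data[rng.choice(len(data), 4, replace=False)]
labels = assign_on_device(data, centroids)
centroids = update_on_host(data, labels, k=4)
print(centroids)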

68 Performance results: 4 million 8-dimensional vectors, 400 clusters, 50 k-means iterations: 9 seconds!

69 Parallelization: platform choices. Platform (communication scheme, data size): Peer-to-Peer (TCP/IP, petabytes); Virtual Clusters (MapReduce / MPI, terabytes); HPC Clusters (MPI / MapReduce, terabytes); Multicore (multithreading, gigabytes); GPU (CUDA, gigabytes); FPGA (HDL, gigabytes)

70 Field-programmable gate array (FPGA): highly specialized hardware units, programmable in a Hardware Description Language (HDL); applicable to both training and inference.

71 Moving next: the cognitive era. Cognitive computing is a style of advanced analytics that attempts to mimic the way the human brain functions, but at a scale that no single person could achieve. It is designed to adapt to and make sense of the complexity and unpredictability of unstructured information: to read text, see images and hear natural speech; to interpret information, organize it, and offer an explanation of what it means, along with the rationale for its conclusions.

72 Applying Advanced Computing for Human Advancement. Thank you. Aditya Kumar Sinha
