Machine Learning Algorithms using Parallel Approach. Shri. Aditya Kumar Sinha, FDP-Programme Head & Principal Technical Officer, ACTS, C-DAC Pune

1 Machine Learning Algorithms using Parallel Approach. Shri. Aditya Kumar Sinha, FDP-Programme Head & Principal Technical Officer, ACTS, C-DAC Pune

2 Presentation Plan: Thinking Parallel; Parallel Computing; The Learning Problem; Machine Learning; Parallelization of Machine Learning

3 Think Parallel: Convert an array of strings to upper case. Sum 1 to n. Split a list: a b # c # d e
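The first two warm-up tasks on this slide can be sketched as follows. This is a minimal illustration, assuming Python's standard `concurrent.futures` thread pool; the function names `parallel_upper` and `parallel_sum_1_to_n` and the default of four workers are hypothetical and not part of the original slides.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_upper(words, workers=4):
    """Upper-case every string; each element is independent of the others."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(str.upper, words))


def parallel_sum_1_to_n(n, workers=4):
    """Sum 1..n by splitting the range into contiguous chunks, one per task."""
    if n <= 0:
        return 0
    step = -(-n // workers)  # ceiling division so the chunks cover 1..n
    chunks = [range(lo, min(lo + step, n + 1)) for lo in range(1, n + 1, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)  # final reduction of the partial results
```

The point of the sketch is the decomposition (independent elements, then partial results combined by a final reduction) rather than raw speed: in standard CPython, threads do not usually accelerate CPU-bound work like this.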

4 Parallel Computing Consider the problem of stacking (reshelving) a set of library books. A single worker trying to stack all the books in their proper places cannot accomplish the task faster than a certain rate. We can speed up this process, however, by employing more than one worker.

5 Solution 1. Assume that books are organized into shelves and that the shelves are grouped into bays. One simple way to assign the task to the workers is to divide the books equally among them. Each worker stacks the books one at a time. This division of work may not be the most efficient way to accomplish the task, since the workers must walk all over the library to stack books.

6 Solution 2 (an instance of task partitioning). An alternative way to divide the work is to assign a fixed and disjoint set of bays to each worker. As before, each worker is assigned an equal number of books arbitrarily. If the worker finds a book that belongs to a bay assigned to him or her, he or she places that book in its assigned spot. Otherwise, he or she passes it on to the worker responsible for the bay it belongs to (an instance of a communication task). The second approach requires less effort from individual workers.

7 Problems are parallelizable to different degrees. For some problems, assigning partitions to other processors might be more time-consuming than performing the processing locally. Other problems may be completely serial. For example, consider the task of digging a post hole: although one person can dig a hole in a certain amount of time, employing more people does not reduce this time.

8 Sorting in nature

9 Parallel Processing: several processing elements working to solve a single problem. Primary consideration: elapsed time (not throughput, sharing resources, etc.). Downside: complexity of system and algorithm design. Elapsed time = computation time + communication time + synchronization time.

10 Design of efficient algorithms. A parallel computer is of little use unless efficient parallel algorithms are available. The issues in designing parallel algorithms are very different from those in designing their sequential counterparts. A significant amount of work is being done to develop efficient parallel algorithms for a variety of parallel architectures.

11 Some Complex Problems: N-body simulation; atmospheric simulation; image generation; oil exploration; financial processing; computational biology

12 Some Complex Problems. N-body simulation: O(n log n) time; galaxy stars; approx. one year / iteration. Atmospheric simulation: 3D grid, each element interacts with neighbors; 1x1x1 mile elements; a 10 day simulation requires approx. 100 days.

13 Some Complex Problems. Image generation: animation, special effects; several minutes of video take 50 days of rendering. Oil exploration: large amounts of seismic data to be processed; months of sequential exploration.

14 Some Complex Problems. Financial processing: market prediction, investing (Cornell Theory Center, Renaissance Tech.). Computational biology: drug design, gene sequencing (Celera), structure prediction (proteomics).

15 Fundamental Issues Is the problem amenable to parallelization? How to decompose the problem to exploit parallelism? What machine architecture should be used? What parallel resources are available? What kind of speedup is desired?

16 Metrics. A measure of relative performance between a multiprocessor system and a single-processor system is the speed-up S(p), defined as the execution time using a single-processor system divided by the execution time using a multiprocessor with p processors: $S(p) = \frac{T_1}{T_p}$. Efficiency: $\text{Efficiency} = \frac{S(p)}{p}$. Cost: $\text{Cost} = p \times T_p$.
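The three metrics on this slide translate directly into code. The sketch below assumes wall-clock times measured in seconds; the function names and the sample timings are hypothetical, chosen only for illustration.

```python
def speedup(t1, tp):
    """S(p) = T_1 / T_p: single-processor time over p-processor time."""
    return t1 / tp


def efficiency(t1, tp, p):
    """Efficiency = S(p) / p: fraction of ideal linear speed-up achieved."""
    return speedup(t1, tp) / p


def cost(tp, p):
    """Cost = p * T_p: total processor-time consumed by the parallel run."""
    return p * tp


# Hypothetical measurements: 120 s on one processor, 40 s on four.
t1, tp, p = 120.0, 40.0, 4
print(speedup(t1, tp), efficiency(t1, tp, p), cost(tp, p))
```

With these made-up timings the speed-up is 3.0, the efficiency is 0.75 and the cost is 160.0 processor-seconds, which exceeds the 120 s sequential time because of the communication and synchronization overhead mentioned earlier.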

17 Machine Learning

18 Quick Questionnaire. How many people have heard about Machine Learning? How many people know about Machine Learning? How many people are using Machine Learning?

19 Why Learn? Machine learning is programming computers to optimize a performance criterion using example data or past experience. There is no need to learn to calculate payroll. Learning is used when: human expertise does not exist (navigating on Mars); humans are unable to explain their expertise (speech recognition); the solution changes in time (routing on a computer network); the solution needs to be adapted to particular cases (user biometrics).

20 About. Machine learning is a subfield of Artificial Intelligence (AI). The name is derived from the concept that it deals with the construction and study of systems that can learn from data. It can be seen as a set of building blocks to make computers learn to behave more intelligently. It is a theoretical concept; there are various techniques with various implementations.

21 In other words: a computer program is said to learn from experience (E) with respect to some class of tasks (T) and a performance measure (P) if its performance at tasks in T, as measured by P, improves with E.

22 Terminology. Features: the distinct traits that can be used to describe each item in a quantitative manner. Samples: a sample is an item to process (e.g. classify). It can be a document, a picture, a sound, a video, a row in a database or CSV file, or whatever you can describe with a fixed set of quantitative traits. Feature vector: an n-dimensional vector of numerical features that represents some object. Feature extraction: preparation of the feature vector; transforms the data in the high-dimensional space to a space of fewer dimensions. Training/Evolution set: set of data used to discover potentially predictive relationships.

23 Categories: Supervised Learning; Unsupervised Learning; Semi-Supervised Learning; Reinforcement Learning

24 Supervised Machine Learning. The majority of practical machine learning uses supervised learning. Supervised learning is where you have input variables (X) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output: Y = f(X). The goal is to approximate the mapping function so well that when you have new input data (X) you can predict the output variables (Y) for that data. It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process. We know the correct answers; the algorithm iteratively makes predictions on the training data and is corrected by the teacher. Learning stops when the algorithm achieves an acceptable level of performance. Supervised learning problems can be further grouped into regression and classification problems. Classification: a classification problem is when the output variable is a category, such as red or blue, or disease and no disease. Regression: a regression problem is when the output variable is a real value, such as dollars or weight.

25 Unsupervised Machine Learning. Unsupervised learning is where you only have input data (X) and no corresponding output variables. The goal of unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data. It is called unsupervised learning because, unlike supervised learning above, there are no correct answers and there is no teacher. Algorithms are left to their own devices to discover and present the interesting structure in the data. Unsupervised learning problems can be further grouped into clustering and association problems. Clustering: a clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior. Association: an association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy A also tend to buy B. Some popular examples of unsupervised learning algorithms are k-means for clustering problems and the Apriori algorithm for association rule learning problems.

26 Semi-Supervised Machine Learning. Problems where you have a large amount of input data (X) and only some of the data is labeled (Y) are called semi-supervised learning problems. These problems sit in between supervised and unsupervised learning. A good example is a photo archive where only some of the images are labeled (e.g. dog, cat, person) and the majority are unlabeled. Many real-world machine learning problems fall into this area. This is because it can be expensive or time-consuming to label data, as it may require access to domain experts, whereas unlabeled data is cheap and easy to collect and store. You can use unsupervised learning techniques to discover and learn the structure in the input variables. You can also use supervised learning techniques to make best-guess predictions for the unlabeled data, feed that data back into the supervised learning algorithm as training data, and use the model to make predictions on new unseen data.

27 Reinforcement Learning. Reinforcement learning allows the machine or software agent to learn its behavior based on feedback from the environment. This behavior can be learnt once and for all, or keep on adapting as time goes by.

28 Machine Learning Techniques

29 Techniques. Classification: predict class from observations. Clustering: group observations into meaningful groups. Regression (prediction): predict value from observations.

30 Classification: classify a document into a predefined category. Documents can be text or images. A popular method is the Naive Bayes classifier. Steps: Step 1: train the program (build a model) using a training set with categories, e.g. sports, cricket, news. For each word, the classifier computes the probability that it makes a document belong to each of the considered categories. Step 2: test the model against a test data set.
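The two steps on this slide can be sketched as a small multinomial Naive Bayes text classifier. This is a minimal sketch, not the presenter's implementation: whitespace tokenization, add-one (Laplace) smoothing, the use of log-probabilities, and the function names `train` and `classify` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Step 1: build the model from (text, category) pairs."""
    doc_counts = Counter()               # number of documents per category
    word_counts = defaultdict(Counter)   # word frequencies per category
    vocab = set()
    for text, category in labeled_docs:
        words = text.lower().split()
        doc_counts[category] += 1
        word_counts[category].update(words)
        vocab.update(words)
    return doc_counts, word_counts, vocab


def classify(model, text):
    """Step 2: return the category with the highest log-posterior score."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for category in doc_counts:
        score = math.log(doc_counts[category] / total_docs)  # log prior
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one smoothing (an assumption) avoids zero probabilities.
            p = (word_counts[category][word] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best, best_score = category, score
    return best


# Hypothetical toy training set.
model = train([
    ("bat ball wicket run", "cricket"),
    ("wicket over bowler run", "cricket"),
    ("election vote minister", "news"),
    ("minister budget vote parliament", "news"),
])
```

On this toy data, `classify(model, "bowler takes wicket")` picks "cricket" and `classify(model, "budget vote")` picks "news", because the smoothed per-word probabilities are higher in the matching category.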

31 Clustering. Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups. The groups are not predefined. For example, the keywords "man's shoe", "women's shoe", "women's t-shirt" and "man's t-shirt" can be clustered into 2 categories: "shoe" and "t-shirt", or "man" and "women". Popular methods are k-means clustering and hierarchical clustering.

32 K-means Clustering: partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.

33 Use-Cases: Spam Detection; Machine Translation (Language Translation); Image Search (Similarity); Clustering (KMeans): Amazon Recommendations; Classification: Google News (continued)

34 Use-Cases (contd.): Text Summarization: Google News; Rating a Review/Comment: Yelp; Fraud Detection: credit card providers; Decision Making: e.g. bank/insurance sector; Sentiment Analysis; Speech Understanding: iPhone with Siri; Face Detection: Facebook's photo tagging

35 What is Clustering? Organizing data into classes such that there is high intra-class similarity and low inter-class similarity. Finding the class labels and the number of classes directly from the data (in contrast to classification). More informally, finding natural groupings among objects.

36 Example: k-means clustering. An EM-like algorithm: initialize k cluster centroids. E-step: associate each data instance with the closest centroid (find expected values of cluster assignments given the data and centroids). M-step: recalculate centroids as an average of the associated data instances (find new centroids that maximize that expectation).
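The E-step / M-step loop described on this slide can be written as a short sequential baseline. This is a sketch under stated assumptions: points are tuples of numbers, initial centroids are k randomly chosen data points, and a cluster that loses all its members keeps its previous centroid. The slide does not specify any of these choices, and the function names are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(point, centroids[j]))


def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise with k distinct data points
    for _ in range(iterations):
        # E-step: assign each data instance to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # M-step: move each centroid to the mean of its members.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # empty cluster: keep old
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids
```

Calling `kmeans(points, 2)` on two well-separated groups of points returns one centroid near the mean of each group. The parallel variants discussed later distribute the E-step, which is where almost all of the work is.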

37 The data points

38 Initialization

39 #Runs = 1

40 #Runs = 2

41 #Runs = 3

42 Applications of K-means Method: Optical Character Recognition; Biometrics; Diagnostic Systems; Military Applications

43 Parallelizing k-means

44 Parallelizing k-means

45 Parallelizing k-means

46 Parallelization: platform choices

Platform          Communication Scheme   Data size
Peer-to-Peer      TCP/IP                 Petabytes
Virtual Clusters  MapReduce / MPI        Terabytes
HPC Clusters      MPI / MapReduce        Terabytes
Multicore         Multithreading         Gigabytes
GPU               CUDA                   Gigabytes
FPGA              HDL                    Gigabytes

47 Peer-to-peer (P2P) systems. Millions of machines connected in a network; each machine can only contact its neighbors. Each machine stores millions of data instances: practically unlimited scale. Communication is the bottleneck: aggregation is costly, broadcast is cheaper. Messages are sent over a spanning tree, with an arbitrary node being the root.

48 k-means in P2P. Uniformly sample k centroids over P2P, using a random walk method. Broadcast the centroids. Run local k-means on each machine. Sample n nodes. Aggregate the local centroids of those n nodes.

50 Virtual clusters. Datacenter-scale clusters: hundreds of thousands of machines. Distributed file system: data redundancy. Cloud computing paradigm: virtualization, full fault tolerance, pay-as-you-go. MapReduce is the #1 data processing scheme.

51 MapReduce. Mappers process in parallel, a shuffle step follows, then reducers process in parallel. Mappers output (key, value) records. Records with the same key are sent to the same reducer.
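The map, shuffle and reduce phases on this slide can be imitated in a single process to show the data flow. This sketch is not Hadoop or any real MapReduce API: `map_reduce`, `word_mapper` and `count_reducer` are hypothetical names, and everything runs sequentially in one Python process, whereas a real framework would run mappers and reducers on different machines.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    # Map phase: every input record yields zero or more (key, value) pairs.
    mapped = [pair for record in records for pair in mapper(record)]
    # Shuffle phase: records with the same key go to the same reducer.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce phase: each key is reduced independently of the others.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def count_reducer(word, ones):
    """Sum the 1s collected for a single word."""
    return sum(ones)


counts = map_reduce(["a b a", "b c"], word_mapper, count_reducer)
```

Here `counts` maps "a" to 2, "b" to 2 and "c" to 1. Because each mapper call and each reducer call depends only on its own input, both phases could be spread across machines without changing the result.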

52 k-means on MapReduce. Mappers read data portions and centroids. Mappers assign data instances to clusters. Mappers compute new local centroids and local cluster sizes. Reducers aggregate local centroids (weighted by local cluster sizes) into new global centroids. Reducers write the new centroids.
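One k-means iteration in the mapper/reducer form listed on this slide might look like the sketch below. It runs in a single process for illustration only; the names `kmeans_mapper`, `kmeans_reducer` and `kmeans_iteration` are hypothetical, and the rule that a cluster with no points keeps its old centroid is an assumption the slide does not state.

```python
def nearest(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean)."""
    return min(range(len(centroids)),
               key=lambda j: sum((x - c) ** 2
                                 for x, c in zip(point, centroids[j])))


def kmeans_mapper(portion, centroids):
    """Emit (cluster id, (local centroid, local cluster size)) for a portion."""
    members = {}
    for point in portion:
        members.setdefault(nearest(point, centroids), []).append(point)
    for j, pts in members.items():
        local_centroid = tuple(sum(col) / len(pts) for col in zip(*pts))
        yield j, (local_centroid, len(pts))


def kmeans_reducer(values):
    """Combine local centroids, weighted by their local cluster sizes."""
    total = sum(size for _, size in values)
    dim = len(values[0][0])
    return tuple(sum(c[d] * size for c, size in values) / total
                 for d in range(dim))


def kmeans_iteration(portions, centroids):
    grouped = {}
    for portion in portions:  # each portion plays the role of one mapper
        for j, value in kmeans_mapper(portion, centroids):
            grouped.setdefault(j, []).append(value)  # shuffle by cluster id
    # Assumption: clusters that received no points keep their old centroid.
    return [kmeans_reducer(grouped[j]) if j in grouped else centroids[j]
            for j in range(len(centroids))]
```

Weighting by local cluster size is what makes the reducer's result equal to the true mean over all points in the cluster: an unweighted average of local centroids would be wrong whenever portions contribute different numbers of points.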

53 Discussion on MapReduce. MapReduce is not designed for iterative processing: mappers read the same data again and again. MapReduce looks too low-level to some people: data analysts are traditionally SQL folks. MapReduce looks too high-level to others: a lot of MapReduce logic is hard to adapt (example: grouping documents by words).

54 MapReduce wrappers. Many of them are available, at different levels of stability. Apache Pig is an SQL-like environment (Group, Join, Filter rows, Filter columns (Foreach)), developed at Yahoo! Research. DryadLINQ is a C#-like environment, developed at Microsoft Research.

56 HPC clusters. High Performance Computing clusters / blades / supercomputers. Thousands of cores. Great variety of architectural choices: disk organization, cache, communication etc. Fault tolerance mechanisms are not crucial: hardware failures are rare. Most typical communication protocol: MPI (Message Passing Interface).

57 Message Passing Interface (MPI). Runtime communication library, available for many programming languages. MPI_Bsend(void* buffer, int size, int destid): serialization is on you. MPI_Recv(void* buffer, int size, int sourceid): blocks until the message is received. (The signatures shown here are simplified; the actual MPI calls also take a datatype, a tag and a communicator.) MPI_Bcast broadcasts a message. MPI_Barrier synchronizes all processes.

58 MapReduce vs. MPI. MPI is a generic framework: processes send messages to other processes, so any computation graph can be built. Most suitable for the master/slave model.

59 k-means using MPI. Slaves read data portions. Master broadcasts centroids to slaves. Slaves assign data instances to clusters. Slaves compute new local centroids and local cluster sizes, then send them to the master. Master aggregates local centroids, weighted by local cluster sizes, into new global centroids.
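The message flow on this slide can be imitated without an MPI installation. The sketch below does not use MPI: it substitutes Python threads for processes and `queue.Queue` objects for send, receive and broadcast, purely to show the master/worker pattern. All names are hypothetical, and workers send per-cluster coordinate sums and sizes, which is equivalent to sending local centroids weighted by local cluster sizes.

```python
import queue
import threading


def nearest(point, centroids):
    return min(range(len(centroids)),
               key=lambda j: sum((x - c) ** 2
                                 for x, c in zip(point, centroids[j])))


def worker(portion, inbox, outbox):
    """Holds its data portion in memory across all iterations."""
    while True:
        centroids = inbox.get()        # stands in for receiving a broadcast
        if centroids is None:          # None is this sketch's stop signal
            return
        sums = [[0.0] * len(centroids[0]) for _ in centroids]
        sizes = [0] * len(centroids)
        for point in portion:
            j = nearest(point, centroids)
            sizes[j] += 1
            for d, x in enumerate(point):
                sums[j][d] += x
        outbox.put((sums, sizes))      # stands in for a send to the master


def master(portions, centroids, iterations=10):
    outbox = queue.Queue()
    inboxes = [queue.Queue() for _ in portions]
    threads = [threading.Thread(target=worker, args=(p, q, outbox))
               for p, q in zip(portions, inboxes)]
    for t in threads:
        t.start()
    for _ in range(iterations):
        for q in inboxes:              # "broadcast" the current centroids
            q.put(centroids)
        replies = [outbox.get() for _ in portions]
        new_centroids = []
        for j, old in enumerate(centroids):
            size = sum(sizes[j] for _, sizes in replies)
            if size == 0:
                new_centroids.append(old)  # assumption: keep empty clusters
            else:
                new_centroids.append(tuple(
                    sum(sums[j][d] for sums, _ in replies) / size
                    for d in range(len(old))))
        centroids = new_centroids
    for q in inboxes:
        q.put(None)                    # tell every worker to stop
    for t in threads:
        t.join()
    return centroids
```

Unlike the MapReduce formulation, each worker reads its portion once and keeps it for every iteration; only the small centroid and summary messages travel between master and workers.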

60 Two features of MPI parallelization. State-preserving processes: processes can live as long as the system runs; no need to read the same data again and again; all necessary parameters can be preserved locally. Hierarchical master/slave paradigm: a slave can be a master of other processes; this could be very useful in dynamic resource allocation, when a slave recognizes it has too much to process.

61 Takeaways on MPI. Old, well established, well debugged. Very flexible. Perfectly suitable for iterative processing. Fault intolerant. Not that widely available anymore. An open source implementation: OpenMPI. MPI can be deployed on Hadoop.

63 Multicore. One machine, up to dozens of cores. Shared memory, one disk. Multithreading as a parallelization scheme. Data might not fit in RAM: use streaming to process the data in portions; disk access may be the bottleneck. If it does fit, RAM access is the bottleneck: use uniform, small-size memory requests.

65 Graphics Processing Unit (GPU). The GPU has become general-purpose (GP-GPU). CUDA is a GP-GPU programming framework, powered by NVIDIA. Each GPU consists of hundreds of multiprocessors; each multiprocessor consists of a few ALUs. ALUs execute the same line of code synchronously. When code branches, some multiprocessors stall: avoid branching as much as possible.

66 Machine learning with GPUs. To fully utilize a GPU, the data needs to fit in RAM; this limits the maximal size of the data. GPUs are optimized for speed: a good choice for real-time tasks. A typical use case: a model is trained offline and then applied in real time (inference). Machine vision and speech recognition are example domains.

67 k-means clustering on a GPU. Cluster membership assignment is done on the GPU: centroids are uploaded to every multiprocessor; a multiprocessor works on one data vector at a time; each ALU works on one data dimension. Centroid recalculation is then done on the CPU. Most appropriate for processing dense data: scattered memory access should be avoided. A multiprocessor reads a data vector while its ALUs process a previous vector.

68 Performance results: 4 million 8-dimensional vectors, 400 clusters, 50 k-means iterations: 9 seconds!

70 Field-programmable gate array (FPGA). Highly specialized hardware units. Programmable in a Hardware Description Language (HDL). Applicable to training and inference.

71 Moving next: the cognitive era. Cognitive computing: a style of advanced analytics that attempts to mimic the way the human brain functions, but at a scale that no single person could achieve. Designed to adopt and make sense of the complexity and unpredictability of unstructured information. Such systems read text, see images and hear natural speech. They interpret information, organize it and offer an explanation of what it means, along with the rationale for their conclusions.

72 Applying Advanced Computing for Human Advancement Thank you Aditya Kumar Sinha
