Corpus methods in linguistics and NLP Lecture 7: Programming for large-scale data processing
1 Corpus methods in linguistics and NLP
Lecture 7: Programming for large-scale data processing
Richard Johansson
December 1, 2015
2 today's lecture
- as you've seen, processing large corpora can take time!
  - for instance, building the frequency tables in the word sketch assignment
- in this lecture, we'll think of how we can process large volumes of data by parallelizing our programs
- some basic ideas, some techniques, and pointers to software
- we'll just dip our toes, but there will be pointers for further reading
3 overview
- basics of parallel programming
- parallel programming in Python
- architectures for large-scale parallel processing: MapReduce, Hadoop, Spark
4 speeding up by parallelizing
- can we buy a machine that runs our code 10 times faster?
  - I have a 2 GHz CPU: can I get a 20 GHz CPU instead?
- it's probably easier to buy 10 machines, or a machine with 10 CPUs, and then try to make the program parallel
5 Moore's law
- Moore's law goes back to an observation that Gordon Moore made in 1965 and revised in 1975, when he was at Intel
- the number of transistors on a chip doubles about every 2 years, which is often read as: overall processing power for computers doubles every 2 years
- until about 2000, this used to mean that processors got faster
- Moore's law still holds, but its effect nowadays is increased parallelization
  - increased number of CPUs in computers
  - and each CPU can run more than one process at a time
6 Moore's law (Wikipedia)
7 computer clusters
- computations may be distributed over large collections of machines: clusters
- for instance, Sweden has the SNIC infrastructure that connects clusters at different universities
- [picture borrowed from Peter Exner's slides]
8 parallelizing an algorithm
- making an algorithm work in a parallel fashion may involve significant changes
- a parallel algorithm is efficient if $T_{\text{parallel}} \approx T_{\text{sequential}} / \text{number of processors}$
  - for instance, if I can compute a frequency table in a corpus 10 times as fast by using 10 machines
- often, this isn't exactly the case because there is some administrative overhead when we parallelize
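As a rough illustration of the efficiency criterion above, the following sketch computes speedup and efficiency from two running times. The function names and the timings are invented for the example; they are not measurements from the lecture.

```python
def speedup(t_sequential, t_parallel):
    """How many times faster the parallel run is."""
    return t_sequential / t_parallel


def efficiency(t_sequential, t_parallel, n_processors):
    """Speedup per processor; 1.0 means T_parallel = T_sequential / n_processors."""
    return speedup(t_sequential, t_parallel) / n_processors


# Hypothetical timings: 100 s sequentially, 12.5 s on 10 machines.
print(speedup(100.0, 12.5))         # prints 8.0
print(efficiency(100.0, 12.5, 10))  # prints 0.8
```

An efficiency below 1.0, as in this made-up case, is what the administrative overhead typically produces.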
10 processing embarrassingly parallel tasks
- an embarrassingly parallel (or trivially parallel) job
  - can be split into separate pieces with little or no effort,
  - the pieces can be processed independently: when processing one piece, we don't need to care about what other pieces contain
  - and it's easy to collect the result in the end
- how can I process such a task if I have 10 machines (or CPUs)?
  - split the data into 10 pieces (of roughly equal size)
  - assign a piece to each machine
  - run the 10 machines in parallel
  - concatenate the 10 results
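The split, process, and concatenate steps can be sketched in plain Python. This is only an illustration that runs the pieces one after another on a single machine; the helper names `split_into_pieces` and `process_piece` are made up for the example, and lowercasing stands in for the real work.

```python
def split_into_pieces(items, n_pieces):
    """Split a list into n_pieces contiguous pieces of roughly equal size."""
    size, remainder = divmod(len(items), n_pieces)
    pieces = []
    start = 0
    for i in range(n_pieces):
        # the first `remainder` pieces get one extra item
        end = start + size + (1 if i < remainder else 0)
        pieces.append(items[start:end])
        start = end
    return pieces


def process_piece(piece):
    """Stand-in for the real work: lowercase every line in the piece."""
    return [line.lower() for line in piece]


lines = ['Line %d' % i for i in range(25)]
pieces = split_into_pieces(lines, 10)

# On a real system, each of these calls would run on its own machine or CPU.
results = [process_piece(piece) for piece in pieces]

# Concatenate the partial results in the original order.
combined = [line for result in results for line in result]
print([len(piece) for piece in pieces])  # prints [3, 3, 3, 3, 3, 2, 2, 2, 2, 2]
```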
11 what kinds of tasks are embarrassingly parallel?
- lowercasing the 1000 text files in a directory?
- building a frequency table of the words in a corpus?
- PoS tagging? parsing?
- machine learning tasks: Naive Bayes training? perceptron training?
12 embarrassing parallelization on Unix-like systems
- on Unix-like systems (e.g. Mac or Linux), commands such as split can be handy
- for instance, we split a file bigfile into 10 parts

    split -n 10 bigfile smallfile_

  - with GNU split, -n l/10 instead of -n 10 makes 10 parts without cutting any line in half
- then we will get smallfile_aa, smallfile_ab, etc
- if we have more than one CPU on the machine, we can start multiple processes at once:

    python3 do_something.py smallfile_aa &
    python3 do_something.py smallfile_ab &
    ...

- if we have many machines, we may need to copy files; on a computer cluster with many machines, the file system is usually shared between the machines
13 when parallelization is not trivial
- typically, algorithms that work in an incremental fashion are hard to parallelize
  - when the result in the current step depends on what has happened before
- a good example is the perceptron learning algorithm
  - what we do in this step depends on all the errors we made before
- parallelized versions of the perceptron (and related algorithms such as SVM, LR) use mini-batches rather than single instances
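To make the mini-batch idea concrete, here is a minimal sketch of one possible mini-batch perceptron step, not a definitive implementation. Within a batch, every instance is scored against the same fixed weights, so the per-instance computations do not depend on each other and could in principle be handed to different workers. All function names are made up for the example, and the labels are assumed to be +1 or -1.

```python
def predict(weights, x):
    """Return +1 or -1 depending on the sign of the score."""
    score = sum(w * xi for w, xi in zip(weights, x))
    return 1 if score > 0 else -1


def instance_update(weights, x, y):
    """Update caused by one instance, computed against fixed weights."""
    if predict(weights, x) != y:
        return [y * xi for xi in x]
    return [0] * len(x)


def minibatch_step(weights, batch):
    """Apply the summed updates of a whole (non-empty) mini-batch at once."""
    # these computations are independent of each other
    updates = [instance_update(weights, x, y) for x, y in batch]
    total = [sum(column) for column in zip(*updates)]
    return [w + u for w, u in zip(weights, total)]


batch = [([1, 0], 1), ([0, 1], -1)]
weights = minibatch_step([0, 0], batch)
print(weights)  # prints [1, 0]
```

In the single-instance perceptron, by contrast, the weights may change after every instance, so the second instance cannot be processed before the first one is done.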
14 overview
- basics of parallel programming
- parallel programming in Python
- architectures for large-scale parallel processing: MapReduce, Hadoop, Spark
15 simple parallelization in Python
- in programming, we distinguish between two types of parallel activities:
  - threads are parallel activities that share memory (variables, data structures, etc)
  - processes run with separate memory, so they need to communicate over the network or through files
- in Python, using threads is in general less efficient than using separate processes: in the standard CPython interpreter, the global interpreter lock lets only one thread execute Python code at a time
- but threading can be useful for many other purposes, for instance to process events in a server application
  - if you're interested, take a look at the threading library
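The following small sketch illustrates what "sharing memory" means for threads: all threads update one and the same dictionary. The names `shared_counts` and `count_words` are invented for the example, and the lock is there because the threads modify the same data structure.

```python
import threading

shared_counts = {}            # visible to every thread
lock = threading.Lock()       # protects shared_counts


def count_words(lines):
    for line in lines:
        for word in line.split():
            with lock:
                shared_counts[word] = shared_counts.get(word, 0) + 1


chunks = [['a b a'], ['b c'], ['a']]
threads = [threading.Thread(target=count_words, args=[chunk]) for chunk in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(shared_counts.items()))  # prints [('a', 3), ('b', 2), ('c', 1)]
```

With separate processes, each process would work on its own copy of the dictionary, and the partial counts would have to be sent back and merged explicitly.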
16 the multiprocessing library
- the multiprocessing library (included in Python's standard library) contains some functions for managing processes:
  - creating a process
  - waiting for a process to end, or stopping it violently
  - communicating between processes
  - synchronization: making sure that processes don't mess things up for each other
  - managing a group of slave processes: the Pool
17 simple multiprocessing example

    import time
    import multiprocessing as mp
    import random

    def do_something(job_nbr):
        # each worker greets forever, pausing for a random time below one second
        while True:
            print('process {0} says hello!'.format(job_nbr))
            time.sleep(random.random())

    if __name__ == '__main__':
        nbr_workers = 5
        for i in range(nbr_workers):
            worker = mp.Process(target=do_something, args=[i])
            worker.start()
18 master-slave architecture and the Pool
- in many cases, we have a master process (the main program) that creates tasks for a number of slave processes that work in parallel to do the hard work
- with the Pool class from the multiprocessing library, we can simplify the management of slaves:
  - the master process submits tasks to the Pool, which distributes the tasks to the slaves
  - the slaves process the tasks in parallel
  - the master collects the results
19 Pool example

    import multiprocessing as mp

    ### THIS PART IS EXECUTED IN THE SLAVE PROCESS ###
    def compute_square(number):
        return number*number

    ### THIS PART IS EXECUTED IN THE MASTER PROCESS ###
    square_list = []
    def add_square(square):
        square_list.append(square)

    if __name__ == '__main__':
        pool = mp.Pool(processes=4)   # or mp.cpu_count()
        for i in range(10):
            # submit a job
            pool.apply_async(compute_square, args=[i], callback=add_square)
        pool.close()  # tell the pool that we're done
        pool.join()   # wait for all jobs to finish
        print(square_list)
20 word counting example: not parallelized
- now we'll do something more useful: computing frequencies
- we'll start from this non-parallelized example:

    from collections import Counter

    filename = ... something ...
    freqs = Counter()
    with open(filename) as f:
        for l in f:
            freqs.update(l.split())
    print(freqs.most_common(5))

- now, let's divide this into master and slave
21 parallelized word counting example: slave part

    # imports needed by the slave and master parts
    import multiprocessing as mp
    from collections import Counter

    def compute_frequencies(lines):
        # make a frequency table for these lines
        freqs = Counter()
        for l in lines:
            freqs.update(l.split())
        # send the frequency table back to the master
        return freqs
22 parallelized word count example: master part (1)

    if __name__ == '__main__':
        filename = ... something ...
        pool = mp.Pool(processes=mp.cpu_count())
        with open(filename) as f:
            chunk = read_chunk(f, )
            while chunk:
                # submit a job
                pool.apply_async(compute_frequencies, args=[chunk], callback=merge)
                chunk = read_chunk(f, )
        pool.close()  # tell the pool we're done
        pool.join()   # wait for all jobs to finish
        print(total_result.most_common(5))
23 parallelized word count example: master part (2)

    # this is the callback function, called every time we
    # get a partial frequency table from a slave
    total_result = Counter()
    def merge(partial_result):
        total_result.update(partial_result)

    # helper function to read a number of lines that should
    # be sent to a slave
    def read_chunk(f, chunk_size):
        chunk = []
        for line in f:
            chunk.append(line)
            if len(chunk) == chunk_size:
                break
        return chunk
24 word counting example: how much improvement?
- [figure: running time in seconds plotted against the number of processes]
25 why not half the time with twice the number of processes?
- splitting: reading the file, dividing into chunks
- communication overhead: processes don't share memory, so they need to send and receive data
  - inputs (the chunks) and outputs (the partial tables) are pickled and unpickled
  - this becomes even more critical if the processes run on separate machines, because then the data is sent over a network
- assembling the end result: for instance, merging the partial tables
- process administration: starting processes, communicating inputs and outputs
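The pickling step can be made visible by doing it by hand. The sketch below only mimics what happens to one chunk and one partial table when they travel between processes; it does not use multiprocessing itself, and the example lines are invented.

```python
import pickle
from collections import Counter

chunk = ['the cat sat on the mat\n', 'the dog barked\n']

# what is sent to the worker: the pickled chunk
chunk_bytes = pickle.dumps(chunk)

# what the worker does after unpickling its input
received = pickle.loads(chunk_bytes)
freqs = Counter()
for line in received:
    freqs.update(line.split())

# what is sent back: the pickled partial table
result_bytes = pickle.dumps(freqs)
partial_table = pickle.loads(result_bytes)

print(len(chunk_bytes), len(result_bytes))  # sizes in bytes of the two messages
print(partial_table['the'])                 # prints 3
```

Every byte in these two messages has to be produced, transferred, and decoded again, which is work the non-parallelized program never has to do.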
26 overview
- basics of parallel programming
- parallel programming in Python
- architectures for large-scale parallel processing: MapReduce, Hadoop, Spark
27 architectures for large-scale processing
- on a single machine, multiprocessing solutions such as Python's Pool can be useful, although a bit low-level
- we'll have a look at frameworks that can help us program for larger systems that may be distributed on many machines
- [picture borrowed from Peter Exner's slides]
28 connections to functional programming
- some architectures for large-scale processing borrow a few concepts from functional programming
- FP has the following characteristics:
  - data structures are immutable (not modifiable): instead of modifying, they are transformed into new structures
  - many standard operations on data structures (transforming, collecting, filtering, etc) are implemented as higher-order functions: functions that take other functions as input (in Python, list comprehension plays much of the same role)
  - uses small on-the-fly functions a lot: lambda in Python
- FP is attractive for this purpose because it separates the what from the how
  - we want to transform a list, but we don't want to worry about how its parts are distributed to different machines or in which order the parts are processed
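A small illustration of these characteristics in Python, using a tuple as the immutable data structure. The word list is invented for the example.

```python
words = ('Corpus', 'methods', 'in', 'NLP')   # a tuple: cannot be modified

# transforming: a new sequence is built, `words` is left untouched
lowercased = tuple(map(lambda w: w.lower(), words))

# filtering with a higher-order function and a small on-the-fly function
long_words = tuple(filter(lambda w: len(w) > 3, lowercased))

# the same two steps written in comprehension style
long_words_again = tuple(w.lower() for w in words if len(w) > 3)

print(lowercased)   # prints ('corpus', 'methods', 'in', 'nlp')
print(long_words)   # prints ('corpus', 'methods')
```

Nothing in this code says in which order the elements are visited or where the work is done, which is the separation of the what from the how.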
29 a higher-order function in Python: map
- the function map applies a function to all elements in a collection

    def add1(x):
        return x + 1

    print(list(map(add1, [1, 2, 3, 4, 5])))
    # prints [2, 3, 4, 5, 6]

    print(list(map(lambda x: x + 1, [1, 2, 3, 4, 5])))
    # prints [2, 3, 4, 5, 6]

    print(list(map(len, ['a', 'few', 'strings'])))
    # prints [1, 3, 7]
30 another higher-order function: reduce
- the function reduce applies a function to accumulate the elements in a collection
- typical example: summing or multiplying all elements
- reduce lives in the functools library in Python

    from functools import reduce

    def add(x, y):
        return x + y

    print(reduce(add, [1, 2, 3, 4, 5]))
    # prints 15

    print(reduce(lambda x, y: x + y, [1, 2, 3, 4, 5]))
    # prints 15
31 contrived example using map and reduce
- sum the lengths of some words:

    from functools import reduce

    words = ['a', 'few', 'strings']
    print(reduce(lambda x, y: x + y, map(len, words)))
    # prints 11
32 less contrived example
- the parallelized word counting program we wrote before can be thought of as mapping and reducing:
  - map: for each chunk, compute a partial frequency table
  - reduce: combine all partial tables into a complete table
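Written with the built-in map and functools.reduce, the same idea might look as follows. This is a sequential sketch on a few invented chunks, not the parallel program itself; `merge` here returns a new table instead of updating a global one.

```python
from collections import Counter
from functools import reduce


def compute_frequencies(lines):
    """The map step: one partial frequency table per chunk."""
    freqs = Counter()
    for line in lines:
        freqs.update(line.split())
    return freqs


def merge(table1, table2):
    """The reduce step: combine two tables into a new one."""
    return table1 + table2


chunks = [['a b a', 'c'], ['b b b'], ['a c']]
partial_tables = map(compute_frequencies, chunks)
total = reduce(merge, partial_tables, Counter())
print(total.most_common(2))  # prints [('b', 4), ('a', 3)]
```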
33 MapReduce
- MapReduce [Dean and Ghemawat, 2004] is an architecture developed by Google that models large-scale computation tasks in terms of mapping and reducing
  - see also this paper for a popular-scientific introduction
- the user defines the map and reduce tasks to be carried out
- MapReduce was designed to take care of many of the complexities in distributed processing:
  - large files can be distributed across several machines
  - to minimize network traffic, tasks are carried out locally as much as possible: a machine handles the piece it stores
  - sometimes computers break down, so the system may need to reprocess tasks that have disappeared
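To show the division of labour between the user and the framework, here is a toy single-machine imitation of the model. It is not the API of Google's system or of Hadoop; `run_mapreduce`, `map_task`, and `reduce_task` are names made up for this sketch, and distribution and fault tolerance are left out entirely.

```python
from collections import defaultdict


def run_mapreduce(records, map_task, reduce_task):
    """Run a map task and a reduce task on one machine, in sequence."""
    # map phase: every record yields zero or more (key, value) pairs
    groups = defaultdict(list)
    for record in records:
        for key, value in map_task(record):
            # grouping the values by key is the framework's job
            groups[key].append(value)
    # reduce phase: one call per key
    return {key: reduce_task(key, values) for key, values in groups.items()}


# the two user-defined tasks, here for word counting
def map_task(line):
    return [(word, 1) for word in line.split()]


def reduce_task(word, counts):
    return sum(counts)


print(run_mapreduce(['a b a', 'b c'], map_task, reduce_task))
# prints {'a': 2, 'b': 2, 'c': 1}
```

The user only writes the two small task functions; everything inside `run_mapreduce` stands for what a real framework would do across many machines.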
34 Hadoop
- Hadoop is an open-source implementation of an architecture similar to Google's ideas
- its central parts are
  - processing part: Hadoop MapReduce
  - file system: HDFS (Hadoop Distributed File System)
- ... but it also has many other components
35 Spark
- Spark [Zaharia et al., 2012] is a more recent framework that addresses some of the drawbacks of Hadoop
- most importantly, it tries to keep data in memory, rather than in files, which can lead to significant speedups for some tasks
- Spark can be installed not only on a cluster but also on a single machine (standalone mode)
36 word counting example in Spark
- the Spark engine is implemented in the Scala language
  - a fairly new functional programming language that runs on the Java virtual machine
- however, we can write Spark programs not only in Scala or Java but also other languages including Python
- here's a Python example from the Spark web page:
37 intuition of the word counting program
38 Spark's fundamental data structure: the RDD
- Spark works by processing RDDs: Resilient Distributed Datasets
  - Resilient: it recomputes data in case of loss
  - Distributed: may be spread out over different machines
- conceptually, an RDD is similar to a Python list (or more precisely, a generator)
- word counting example:
  1. RDD with lines
  2. RDD with tokens
  3. RDD with (token, 1) pairs
  4. RDD with (token, count) pairs
39 transformations of RDDs
- Spark includes many transformations of RDDs
- many of the transformations are well-known higher-order functions in FP
  - not just map and reduce!
- check the overview here: programming-guide.html
- see a complete list of transformations here: pyspark.html#pyspark.rdd
- let's walk through the steps in the word counting program
40 step 1: reading a text file as lines
- spark.textFile reads a text file and returns an RDD containing the lines

    text_file = spark.textFile(name_of_file)
41 step 2: flatMap; splitting the lines into tokens
- flatMap is a transformation that applies some function to all elements in an RDD
- ... and then flattens the result: removes lists inside the RDD
- we use this to convert the lines into a new RDD with tokens

    step1 = text_file.flatMap(lambda line: line.split())
42 step 3: map
- map is a transformation that applies some function to all elements in an RDD
- this is simpler than flatMap: no flattening involved
- in our case, we make a new RDD consisting of word-count pairs (but all counts are 1 so far)

    step2 = step1.map(lambda word: (word, 1))
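The difference between the two transformations can be tried out without Spark. In the sketch below, ordinary Python lists stand in for RDDs, and `flat_map` is a helper written for this example; it is not part of Spark or of Python's standard library.

```python
from itertools import chain


def flat_map(function, elements):
    """Apply function to every element and flatten the results one level."""
    return list(chain.from_iterable(map(function, elements)))


lines = ['a few', 'strings here']

print(list(map(lambda line: line.split(), lines)))
# prints [['a', 'few'], ['strings', 'here']]

tokens = flat_map(lambda line: line.split(), lines)
print(tokens)
# prints ['a', 'few', 'strings', 'here']

pairs = list(map(lambda word: (word, 1), tokens))
print(pairs)
# prints [('a', 1), ('few', 1), ('strings', 1), ('here', 1)]
```

With plain map, splitting the lines gives a list of lists; the flattening step is what turns it into one list of tokens.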
43 step 4: reduceByKey
- reduceByKey is similar to the reduce that we explained before, but operates on key-value pairs
- and the aggregation operation is applied to the values, separately for each key
- in our case, we sum all the 1s for each word separately

    step3 = step2.reduceByKey(lambda a, b: a+b)
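The per-key aggregation can again be imitated in plain Python. `reduce_by_key` below is a stand-in written for this example that works on a list of pairs; it only illustrates the idea and says nothing about how Spark carries out the operation on a cluster.

```python
def reduce_by_key(function, pairs):
    """Combine the values of each key with a two-argument function."""
    result = {}
    for key, value in pairs:
        if key in result:
            result[key] = function(result[key], value)
        else:
            result[key] = value
    return list(result.items())


pairs = [('a', 1), ('b', 1), ('a', 1), ('c', 1), ('a', 1)]
print(reduce_by_key(lambda a, b: a + b, pairs))
# prints [('a', 3), ('b', 1), ('c', 1)]
```

This local version combines the values strictly from left to right. A distributed system may combine them in a different order, so the function should be one for which the order does not matter, such as addition.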
44 the final VG assignment: using Spark
- do a few small word counting exercises using Spark
- we have installed Spark on the lab machines
  - ... or you may install it on your own
- we don't have a real cluster: you'll have to make believe!
45 references
- Dean, J. and Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. In OSDI'04.
- Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. (2012). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI.
More informationBig Data Infrastructures & Technologies
Big Data Infrastructures & Technologies Spark and MLLIB OVERVIEW OF SPARK What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop Improves efficiency through: In-memory
More information08/04/2018. RDDs. RDDs are the primary abstraction in Spark RDDs are distributed collections of objects spread across the nodes of a clusters
are the primary abstraction in Spark are distributed collections of objects spread across the nodes of a clusters They are split in partitions Each node of the cluster that is running an application contains
More informationApache Spark Internals
Apache Spark Internals Pietro Michiardi Eurecom Pietro Michiardi (Eurecom) Apache Spark Internals 1 / 80 Acknowledgments & Sources Sources Research papers: https://spark.apache.org/research.html Presentations:
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 2: MapReduce Algorithm Design (2/2) January 12, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo
More informationMapReduce: Simplified Data Processing on Large Clusters 유연일민철기
MapReduce: Simplified Data Processing on Large Clusters 유연일민철기 Introduction MapReduce is a programming model and an associated implementation for processing and generating large data set with parallel,
More informationCSC 261/461 Database Systems Lecture 24. Spring 2017 MW 3:25 pm 4:40 pm January 18 May 3 Dewey 1101
CSC 261/461 Database Systems Lecture 24 Spring 2017 MW 3:25 pm 4:40 pm January 18 May 3 Dewey 1101 Announcements Term Paper due on April 20 April 23 Project 1 Milestone 4 is out Due on 05/03 But I would
More informationCS294 Big Data System Course Project Report Gemini: Boosting Spark Performance with GPU Accelerators
Gemini: Boosting Spark Performance with GPU Accelerators Guanhua Wang Zhiyuan Lin Ion Stoica AMPLab EECS AMPLab UC Berkeley UC Berkeley UC Berkeley Abstract Compared with MapReduce, Apache Spark is more
More informationSpark. In- Memory Cluster Computing for Iterative and Interactive Applications
Spark In- Memory Cluster Computing for Iterative and Interactive Applications Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker,
More informationHadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved
Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop
More informationMap-Reduce. Marco Mura 2010 March, 31th
Map-Reduce Marco Mura (mura@di.unipi.it) 2010 March, 31th This paper is a note from the 2009-2010 course Strumenti di programmazione per sistemi paralleli e distribuiti and it s based by the lessons of
More informationCS-2510 COMPUTER OPERATING SYSTEMS
CS-2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application Functional
More informationCDS. André Schaaff1, François-Xavier Pineau1, Gilles Landais1, Laurent Michel2 de Données astronomiques de Strasbourg, 2SSC-XMM-Newton
Docker @ CDS André Schaaff1, François-Xavier Pineau1, Gilles Landais1, Laurent Michel2 1Centre de Données astronomiques de Strasbourg, 2SSC-XMM-Newton Paul Trehiou Université de technologie de Belfort-Montbéliard
More informationBig Data Management and NoSQL Databases
NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model
More informationData-intensive computing systems
Data-intensive computing systems University of Verona Computer Science Department Damiano Carra Acknowledgements q Credits Part of the course material is based on slides provided by the following authors
More informationProcessing 11 billions events a day with Spark. Alexander Krasheninnikov
Processing 11 billions events a day with Spark Alexander Krasheninnikov Badoo facts 46 languages 10M Photos added daily 320M registered users 190 countries 21M daily active users 3000+ servers 2 data-centers
More informationThe MapReduce Abstraction
The MapReduce Abstraction Parallel Computing at Google Leverages multiple technologies to simplify large-scale parallel computations Proprietary computing clusters Map/Reduce software library Lots of other
More informationProgramming Systems for Big Data
Programming Systems for Big Data CS315B Lecture 17 Including material from Kunle Olukotun Prof. Aiken CS 315B Lecture 17 1 Big Data We ve focused on parallel programming for computational science There
More informationNew Developments in Spark
New Developments in Spark And Rethinking APIs for Big Data Matei Zaharia and many others What is Spark? Unified computing engine for big data apps > Batch, streaming and interactive Collection of high-level
More informationCSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA
CSE 3302 Lecture 11: Map/Reduce 7 October 2010 Nate Nystrom UTA 378,000 results in 0.17 seconds including images and video communicates with 1000s of machines web server index servers document servers
More informationL435/L555. Dept. of Linguistics, Indiana University Fall 2016
for : for : L435/L555 Dept. of, Indiana University Fall 2016 1 / 12 What is? for : Decent definition from wikipedia: Computer programming... is a process that leads from an original formulation of a computing
More informationMachine Learning for Large-Scale Data Analysis and Decision Making A. Distributed Machine Learning Week #9
Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Distributed Machine Learning Week #9 Today Distributed computing for machine learning Background MapReduce/Hadoop & Spark Theory
More informationDatabase Applications (15-415)
Database Applications (15-415) Hadoop Lecture 24, April 23, 2014 Mohammad Hammoud Today Last Session: NoSQL databases Today s Session: Hadoop = HDFS + MapReduce Announcements: Final Exam is on Sunday April
More information2/26/2017. RDDs. RDDs are the primary abstraction in Spark RDDs are distributed collections of objects spread across the nodes of a clusters
are the primary abstraction in Spark are distributed collections of objects spread across the nodes of a clusters They are split in partitions Each node of the cluster that is used to run an application
More informationOverview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::
Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized
More informationMapReduce Simplified Data Processing on Large Clusters
MapReduce Simplified Data Processing on Large Clusters Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) MapReduce 1393/8/5 1 /
More informationDiscretized Streams. An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters
Discretized Streams An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica UC BERKELEY Motivation Many important
More informationHow to Implement MapReduce Using. Presented By Jamie Pitts
How to Implement MapReduce Using Presented By Jamie Pitts A Problem Seeking A Solution Given a corpus of html-stripped financial filings: Identify and count unique subjects. Possible Solutions: 1. Use
More informationA very short introduction
A very short introduction General purpose compute engine for clusters batch / interactive / streaming used by and many others History developed in 2009 @ UC Berkeley joined the Apache foundation in 2013
More informationSparkStreaming. Large scale near- realtime stream processing. Tathagata Das (TD) UC Berkeley UC BERKELEY
SparkStreaming Large scale near- realtime stream processing Tathagata Das (TD) UC Berkeley UC BERKELEY Motivation Many important applications must process large data streams at second- scale latencies
More informationPLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
More informationPrincipal Software Engineer Red Hat Emerging Technology June 24, 2015
USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton Principal Software Engineer Red Hat Emerging Technology June 24, 2015 ABOUT ME Distributed systems and data science in Red Hat's Emerging
More informationIntroduction to Apache Spark
1 Introduction to Apache Spark Thomas Ropars thomas.ropars@univ-grenoble-alpes.fr 2017 2 References The content of this lectures is inspired by: The lecture notes of Yann Vernaz. The lecture notes of Vincent
More information