Corpus methods in linguistics and NLP Lecture 7: Programming for large-scale data processing

1 Corpus methods in linguistics and NLP Lecture 7: Programming for large-scale data processing Richard Johansson December 1, 2015

2 today's lecture
- as you've seen, processing large corpora can take time!
  - for instance, building the frequency tables in the word sketch assignment
- in this lecture, we'll think of how we can process large volumes of data by parallelizing our programs
- some basic ideas, some techniques, and pointers to software
- we'll just dip our toes, but there will be pointers for further reading

3 overview
- basics of parallel programming
- parallel programming in Python
- architectures for large-scale parallel processing: MapReduce, Hadoop, Spark

4 speeding up by parallelizing
- can we buy a machine that runs our code 10 times faster?
  - I have a 2 GHz CPU: can I get a 20 GHz CPU instead?
- it's probably easier to buy 10 machines, or a machine with 10 CPUs, and then try to make the program parallel

5 Moore's law
- Moore's law was formulated by Gordon Moore, who later co-founded Intel, in 1965 and revised by him in 1975
- in its popular form: overall processing power for computers doubles every 2 years (Moore's own statement was about the number of transistors on a chip)
- until about 2000, this used to mean that processors got faster
- Moore's law still holds, but its effect nowadays is increased parallelization
  - increased number of CPUs in computers
  - and each CPU can run more than one process at a time

6 Moore's law (Wikipedia)

7 computer clusters
- computations may be distributed over large collections of machines: clusters
- for instance, Sweden has the SNIC infrastructure that connects clusters at different universities
- picture borrowed from Peter Exner's slides

8 parallelizing an algorithm
- making an algorithm work in a parallel fashion may involve significant changes
- a parallel algorithm is efficient if

    $T_{\text{parallel}} \approx \frac{T_{\text{sequential}}}{\text{number of processors}}$

- for instance, if I can compute a frequency table in a corpus 10 times as fast by using 10 machines
- often, this isn't exactly the case because there is some administrative overhead when we parallelize
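To make the efficiency criterion concrete, here is a small sketch that computes the speedup and the efficiency from two running times. The function names and the timings are invented for illustration; they are not measurements from the lecture.

```python
def speedup(t_sequential, t_parallel):
    """How many times faster the parallel run is."""
    return t_sequential / t_parallel

def efficiency(t_sequential, t_parallel, n_processors):
    """Speedup per processor; 1.0 is the ideal case."""
    return speedup(t_sequential, t_parallel) / n_processors

# hypothetical timings in seconds (assumed, not measured)
t_seq = 100.0
t_par = 12.5
n = 10

print(speedup(t_seq, t_par))        # prints 8.0
print(efficiency(t_seq, t_par, n))  # prints 0.8
```

With 10 processors the ideal running time would be 10 seconds; the assumed 12.5 seconds correspond to an efficiency of 0.8, and the missing 0.2 is the overhead.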

10 processing embarrassingly parallel tasks
- an embarrassingly parallel (or trivially parallel) job can be split into separate pieces with little or no effort,
  - where the pieces can be processed independently: when processing one piece, we don't need to care about what other pieces contain
  - and where it's easy to collect the result in the end
- how can I process such a task if I have 10 machines (or CPUs)?
  - split the data into 10 pieces (of roughly equal size)
  - assign a piece to each machine
  - run the 10 machines in parallel
  - concatenate the 10 results
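The four steps can be sketched in a few lines of Python. This is only an illustration under assumptions: a thread pool from the standard library stands in for the separate machines, lowercasing is used as the example task, and the helper names (split_into_pieces, process_piece, process_in_parallel) are made up here.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_pieces(items, n_pieces):
    """Split a list into n_pieces consecutive pieces of roughly equal size."""
    size, extra = divmod(len(items), n_pieces)
    pieces = []
    start = 0
    for i in range(n_pieces):
        # the first `extra` pieces get one additional item
        end = start + size + (1 if i < extra else 0)
        pieces.append(items[start:end])
        start = end
    return pieces

def process_piece(lines):
    """The work done on one piece; it never looks at the other pieces."""
    return [line.lower() for line in lines]

def process_in_parallel(items, n_workers):
    pieces = split_into_pieces(items, n_workers)
    # assumption: each worker thread stands in for one machine (or CPU)
    with ThreadPoolExecutor(max_workers=n_workers) as executor:
        # map returns the results in the same order as the pieces
        results = list(executor.map(process_piece, pieces))
    # concatenate the partial results
    return [line for piece in results for line in piece]

lines = ['The Cat', 'SAT on', 'THE mat', 'Today']
print(process_in_parallel(lines, 3))
# prints ['the cat', 'sat on', 'the mat', 'today']
```

Because no piece depends on any other, the only coordination needed is the split at the start and the concatenation at the end.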

11 what kinds of tasks are embarrassingly parallel?
- lowercasing the 1000 text files in a directory?
- building a frequency table of the words in a corpus?
- PoS tagging? parsing?
- machine learning tasks: Naive Bayes training? perceptron training?

12 embarrassing parallelization on Unix-like systems
- on Unix-like systems (e.g. Mac or Linux), commands such as split can be handy
- for instance, we split a file bigfile into 10 parts:

    split -n 10 bigfile smallfile_

- then we will get smallfile_aa, smallfile_ab, etc.
  - note that -n 10 cuts by size, so a line may end up divided between two parts; with GNU split, -n l/10 keeps lines whole
- if we have more than one CPU on the machine, we can start multiple processes at once:

    python3 do_something.py smallfile_aa &
    python3 do_something.py smallfile_ab &
    ...

- if we have many machines, we may need to copy files; on a computer cluster with many machines, the file system is usually shared between the machines

13 when parallelization is not trivial
- typically, algorithms that work in an incremental fashion are hard to parallelize
  - when the result in the current step depends on what has happened before
- a good example is the perceptron learning algorithm
  - what we do in this step depends on all the errors we made before
- parallelized versions of the perceptron (and related algorithms such as SVMs and logistic regression) use mini-batches rather than single instances
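The mini-batch idea can be sketched as follows. This is an illustration only, not the algorithm of any particular paper or library: the weights are held fixed while a mini-batch is processed, so the updates suggested by the instances in the batch do not depend on each other and could be computed by parallel workers. The function names and the toy data are invented.

```python
def predict(weights, features):
    score = sum(w * x for w, x in zip(weights, features))
    return 1 if score >= 0 else -1

def instance_update(weights, features, label):
    """Update suggested by one instance, given fixed weights.
    Within a mini-batch, these calls are independent of each other."""
    if predict(weights, features) != label:
        return [label * x for x in features]
    return [0.0] * len(features)

def train_minibatch_perceptron(data, n_features, batch_size, n_epochs):
    weights = [0.0] * n_features
    for _ in range(n_epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # this step could be handed out to parallel workers
            updates = [instance_update(weights, x, y) for x, y in batch]
            # sequential step: apply the summed updates once per batch
            for update in updates:
                weights = [w + u for w, u in zip(weights, update)]
    return weights

# invented toy data: (features, label), the first feature is a bias term
data = [([1.0, 2.0, 1.0], 1), ([1.0, 1.5, 2.0], 1),
        ([1.0, -1.0, -1.5], -1), ([1.0, -2.0, -0.5], -1)]
weights = train_minibatch_perceptron(data, n_features=3, batch_size=2, n_epochs=5)
print(all(predict(weights, x) == y for x, y in data))  # prints True
```

The sequential dependence has not disappeared: each batch still has to wait for the previous one. It has only been moved from the level of single instances to the level of batches.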

14 overview
- basics of parallel programming
- parallel programming in Python
- architectures for large-scale parallel processing: MapReduce, Hadoop, Spark

15 simple parallelization in Python
- in programming, we distinguish between two types of parallel activities:
  - threads are parallel activities that share memory (variables, data structures, etc.)
  - processes run with separate memory, so they need to communicate over the network or through files
- in Python, using threads is in general less efficient than using separate processes: in the standard CPython interpreter, the global interpreter lock lets only one thread run Python code at a time
- but threading can be useful for many other purposes, for instance to process events in a server application
  - if you're interested, take a look at the threading library
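As a small illustration of what "sharing memory" means, the sketch below lets three threads update the same dictionary. The data and the names (shared_counts, count_words) are invented for the example; the lock makes sure that only one thread at a time changes the dictionary.

```python
import threading

shared_counts = {}        # one data structure, visible to all threads
lock = threading.Lock()   # protects shared_counts

def count_words(words):
    for word in words:
        with lock:  # only one thread at a time may update the dict
            shared_counts[word] = shared_counts.get(word, 0) + 1

pieces = [['a', 'b', 'a'], ['b', 'c'], ['a']]
threads = [threading.Thread(target=count_words, args=(piece,))
           for piece in pieces]
for t in threads:
    t.start()
for t in threads:
    t.join()   # wait for every thread to finish

print(sorted(shared_counts.items()))
# prints [('a', 3), ('b', 2), ('c', 1)]
```

With separate processes, each process would get its own copy of shared_counts, and the partial results would have to be sent back explicitly.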

16 the multiprocessing library
- the multiprocessing library (included in Python's standard library) contains some functions for managing processes:
  - creating a process
  - waiting for a process to end, or stopping it forcibly
  - communicating between processes
  - synchronization: making sure that processes don't interfere with each other
  - managing a group of slave processes: the Pool

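17 simple multiprocessing example

    import time
    import multiprocessing as mp
    import random

    def do_something(job_nbr):
        # each worker keeps printing a greeting, pausing for a random
        # time of up to one second, until the program is interrupted
        while True:
            print('process {0} says hello!'.format(job_nbr))
            time.sleep(random.random())

    if __name__ == '__main__':
        nbr_workers = 5
        for i in range(nbr_workers):
            # create one process per worker and start it
            worker = mp.Process(target=do_something, args=[i])
            worker.start()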

18 master-slave architecture and the Pool
- in many cases, we have a master process (the main program) that creates tasks for a number of slave processes that work in parallel to do the hard work
- with the Pool class from the multiprocessing library, we can simplify the management of slaves:
  - the master process submits tasks to the Pool, which distributes the tasks to the slaves
  - the slaves process the tasks in parallel
  - the master collects the results

19 Pool example

    import multiprocessing as mp

    ### THIS PART IS EXECUTED IN THE SLAVE PROCESS ###
    def compute_square(number):
        return number*number

    ### THIS PART IS EXECUTED IN THE MASTER PROCESS ###
    square_list = []

    # callback: called in the master each time a slave returns a result,
    # so the squares may arrive in any order
    def add_square(square):
        square_list.append(square)

    if __name__ == '__main__':
        pool = mp.Pool(processes=4)   # or mp.cpu_count()
        for i in range(10):
            # submit a job
            pool.apply_async(compute_square, args=[i], callback=add_square)
        pool.close()   # tell the pool that we're done
        pool.join()    # wait for all jobs to finish
        print(square_list)

20 word counting example: not parallelized
- now we'll do something more useful: computing frequencies
- we'll start from this non-parallelized example:

    from collections import Counter

    filename = ...   # ... something ...

    freqs = Counter()
    with open(filename) as f:
        for l in f:
            # split the line into tokens and count them
            freqs.update(l.split())
    print(freqs.most_common(5))

- now, let's divide this into master and slave

21 parallelized word counting example: slave part

    from collections import Counter

    def compute_frequencies(lines):
        # make a frequency table for these lines
        freqs = Counter()
        for l in lines:
            freqs.update(l.split())
        # send the frequency table back to the master
        return freqs

22 parallelized word count example: master part (1)

    import multiprocessing as mp

    if __name__ == '__main__':
        filename = ...     # ... something ...
        chunk_size = ...   # number of lines per job (the value is missing in the source)
        pool = mp.Pool(processes=mp.cpu_count())
        with open(filename) as f:
            chunk = read_chunk(f, chunk_size)
            while chunk:
                # submit a job
                pool.apply_async(compute_frequencies, args=[chunk], callback=merge)
                chunk = read_chunk(f, chunk_size)
        pool.close()   # tell the pool we're done
        pool.join()    # wait for all jobs to finish
        print(total_result.most_common(5))

23 parallelized word count example: master part (2)

    from collections import Counter

    # this is the callback function, called every time we
    # get a partial frequency table from a slave
    total_result = Counter()
    def merge(partial_result):
        total_result.update(partial_result)

    # helper function to read a number of lines that should
    # be sent to a slave
    def read_chunk(f, chunk_size):
        chunk = []
        for line in f:
            chunk.append(line)
            if len(chunk) == chunk_size:
                break
        return chunk

24 word counting example: how much improvement?
[plot: running time in seconds as a function of the number of processes]

25 why not half the time with twice the number of processes?
- splitting: reading the file, dividing into chunks
- communication overhead: processes don't share memory, so they need to send and receive data
  - inputs (the chunks) and outputs (the partial tables) are pickled and unpickled
  - this becomes even more critical if the processes run on separate machines, because then the data is sent over a network
- assembling the end result: for instance, merging the partial tables
- process administration: starting processes, communicating inputs and outputs
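To get a feeling for the communication overhead, one can look at how many bytes the pickled inputs and outputs take. The sketch below uses the standard pickle module on an invented chunk of 1000 lines; the helper pickled_size is made up for the example, and the exact byte counts depend on the Python version.

```python
import pickle
from collections import Counter

def pickled_size(obj):
    """Number of bytes needed to send obj to another process."""
    return len(pickle.dumps(obj))

# an invented chunk of 1000 different lines
chunk = ['line {0} the cat sat on the mat\n'.format(i) for i in range(1000)]

# the partial frequency table a slave would compute for this chunk
partial_table = Counter()
for line in chunk:
    partial_table.update(line.split())

print(pickled_size(chunk))          # bytes sent from the master to the slave
print(pickled_size(partial_table))  # bytes sent back to the master

# unpickling gives back an equal object
restored = pickle.loads(pickle.dumps(partial_table))
print(restored == partial_table)    # prints True
```

Every job pays this cost twice: once for the chunk going out and once for the partial table coming back.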

26 overview
- basics of parallel programming
- parallel programming in Python
- architectures for large-scale parallel processing: MapReduce, Hadoop, Spark

27 architectures for large-scale processing
- on a single machine, multiprocessing solutions such as Python's Pool can be useful, although a bit low-level
- we'll have a look at frameworks that can help us program for larger systems that may be distributed on many machines
- picture borrowed from Peter Exner's slides

28 connections to functional programming
- some architectures for large-scale processing borrow a few concepts from functional programming
- FP has the following characteristics:
  - data structures are immutable (not modifiable): instead of modifying, they are transformed into new structures
  - many standard operations on data structures (transforming, collecting, filtering, etc.) are implemented as higher-order functions: functions that take other functions as input (in Python, list comprehension plays much of the same role)
  - uses small on-the-fly functions a lot: lambda in Python
- FP is attractive for this purpose because it separates the what from the how
  - we want to transform a list, but we don't want to worry about how its parts are distributed to different machines or in which order the parts are processed
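A few lines of Python can illustrate these characteristics. The word list is invented; filter and sorted are built-in functions that take another function as input, and the tuple is never modified.

```python
words = ('Corpus', 'methods', 'in', 'NLP')   # a tuple is immutable

# filtering with a higher-order function and an on-the-fly lambda
long_words = list(filter(lambda w: len(w) > 2, words))

# the same filter written as a list comprehension
long_words_lc = [w for w in words if len(w) > 2]

# sorted takes a function (here len) that says what to sort by
by_length = sorted(words, key=len)

print(long_words)     # prints ['Corpus', 'methods', 'NLP']
print(long_words_lc)  # prints ['Corpus', 'methods', 'NLP']
print(by_length)      # prints ['in', 'NLP', 'Corpus', 'methods']
print(words)          # prints ('Corpus', 'methods', 'in', 'NLP')
```

In each case we only say what we want (the long words, the words ordered by length) and get a new structure back; how the elements are visited is left to the function we call.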

29 a higher-order function in Python: map
- the function map applies a function to all elements in a collection

    def add1(x):
        return x + 1

    print(list(map(add1, [1, 2, 3, 4, 5])))
    # prints [2, 3, 4, 5, 6]

    print(list(map(lambda x: x + 1, [1, 2, 3, 4, 5])))
    # prints [2, 3, 4, 5, 6]

    print(list(map(len, ['a', 'few', 'strings'])))
    # prints [1, 3, 7]

30 another higher-order function: reduce
- the function reduce applies a function to accumulate the elements in a collection
- typical example: summing or multiplying all elements
- reduce lives in the functools library in Python

    from functools import reduce

    def add(x, y):
        return x + y

    print(reduce(add, [1, 2, 3, 4, 5]))
    # prints 15

    print(reduce(lambda x, y: x + y, [1, 2, 3, 4, 5]))
    # prints 15

31 contrived example using map and reduce
- sum the lengths of some words:

    from functools import reduce

    words = ['a', 'few', 'strings']
    print(reduce(lambda x, y: x + y, map(len, words)))
    # prints 11

32 less contrived example
- the parallelized word counting program we wrote before can be thought of as mapping and reducing:
  - map: for each chunk, compute a partial frequency table
  - reduce: combine all partial tables into a complete table
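Written out with Python's own map and reduce (and without any parallelism), the idea looks roughly like this. compute_frequencies is the slave function from the word counting example; the merge function here returns a new table instead of updating a global one, and the two small chunks are invented.

```python
from collections import Counter
from functools import reduce

def compute_frequencies(lines):
    # map step: make a partial frequency table for one chunk
    freqs = Counter()
    for l in lines:
        freqs.update(l.split())
    return freqs

def merge(table1, table2):
    # reduce step: combine two tables into a new one
    return table1 + table2

# two invented chunks of lines
chunks = [['a rose is a rose', 'is a rose'], ['a rose a']]

partial_tables = map(compute_frequencies, chunks)
total = reduce(merge, partial_tables, Counter())

print(total.most_common(2))
# prints [('a', 5), ('rose', 4)]
```

Since the map step treats every chunk independently, it is the part that a framework can distribute over many processes or machines.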

33 MapReduce
- MapReduce [Dean and Ghemawat, 2004] is an architecture developed by Google that models large-scale computation tasks in terms of mapping and reducing
  - see also this paper for a popular-scientific introduction
- the user defines the map and reduce tasks to be carried out
- MapReduce was designed to take care of many of the complexities in distributed processing:
  - large files can be distributed across several machines
  - to minimize network traffic, tasks are carried out locally as much as possible: a machine handles the piece it stores
  - sometimes computers break down, so the system may need to reprocess tasks that have disappeared

34 Hadoop
- Hadoop is an open-source implementation of an architecture similar to Google's ideas
- its central parts are
  - processing part: Hadoop MapReduce
  - file system: HDFS (Hadoop Distributed File System)
- ... but it also has many other components

35 Spark
- Spark [Zaharia et al., 2012] is a more recent framework that addresses some of the drawbacks of Hadoop
- most importantly, it tries to keep data in memory, rather than in files, which can lead to significant speedups for some tasks
- Spark can be installed not only on a cluster but also on a single machine (standalone mode)
- see

36 word counting example in Spark
- the Spark engine is implemented in the Scala language
  - a fairly new functional programming language that runs on the Java virtual machine
- however, we can write Spark programs not only in Scala or Java but also in other languages, including Python
- here's a Python example from the Spark web page:

37 intuition of the word counting program

38 Spark's fundamental data structure: the RDD
- Spark works by processing RDDs: Resilient Distributed Datasets
  - Resilient: it recomputes data in case of loss
  - Distributed: may be spread out over different machines
- conceptually, an RDD is similar to a Python list (or more precisely, a generator)
- word counting example:
  1. RDD with lines
  2. RDD with tokens
  3. RDD with (token, 1) pairs
  4. RDD with (token, count) pairs

39 transformations of RDDs
- Spark includes many transformations of RDDs
- many of the transformations are well-known higher-order functions in FP
  - not just map and reduce!
- check the overview here: programming-guide.html
- see a complete list of transformations here: pyspark.html#pyspark.rdd
- let's walk through the steps in the word counting program

40 step 1: reading a text file as lines
- spark.textFile reads a text file and returns an RDD containing the lines

    # spark is assumed to be a SparkContext object
    text_file = spark.textFile(name_of_file)

41 step 2: flatMap; splitting the lines into tokens
- flatMap is a transformation that applies some function to all elements in an RDD
  - ... and then flattens the result: removes lists inside the RDD
- we use this to convert the lines into a new RDD with tokens

    step1 = text_file.flatMap(lambda line: line.split())
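The difference between mapping and flat-mapping can be seen on an ordinary Python list. The function flat_map below is a plain-Python stand-in written for this illustration; it is not part of Spark, and the two lines are invented.

```python
from itertools import chain

def flat_map(function, collection):
    """Apply function to every element, then flatten the result one level."""
    return list(chain.from_iterable(map(function, collection)))

lines = ['a rose is', 'a rose']

# map keeps one result per line: a list of lists
print(list(map(lambda line: line.split(), lines)))
# prints [['a', 'rose', 'is'], ['a', 'rose']]

# flat_map removes the inner lists: one long list of tokens
print(flat_map(lambda line: line.split(), lines))
# prints ['a', 'rose', 'is', 'a', 'rose']
```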

42 step 3: map
- map is a transformation that applies some function to all elements in an RDD
  - this is simpler than flatMap: no flattening involved
- in our case, we make a new RDD consisting of (word, count) pairs (but all counts are 1 so far)

    step2 = step1.map(lambda word: (word, 1))

43 step 4: reduceByKey
- reduceByKey is similar to the reduce that we explained before, but operates on key-value pairs
  - and the aggregation operation is applied to the values, separately for each key
- in our case, we sum all the 1s for each word separately

    step3 = step2.reduceByKey(lambda a, b: a+b)
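What reducing by key does can be imitated on a plain Python list of pairs. The function reduce_by_key below is a stand-in written for this illustration and is not Spark code; in particular, it returns the keys in the order they were first seen, which a distributed system may not do.

```python
def reduce_by_key(function, pairs):
    """Combine the values of all pairs that share a key, using function."""
    results = {}
    for key, value in pairs:
        if key in results:
            # key seen before: aggregate the old and the new value
            results[key] = function(results[key], value)
        else:
            # first value for this key
            results[key] = value
    return list(results.items())

# (token, 1) pairs, as produced by the map step
pairs = [('a', 1), ('rose', 1), ('is', 1), ('a', 1), ('rose', 1)]

print(reduce_by_key(lambda a, b: a + b, pairs))
# prints [('a', 2), ('rose', 2), ('is', 1)]
```

Because the aggregation is done separately for each key, different keys can be handled on different machines.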

44 the final VG assignment: using Spark
- do a few small word counting exercises using Spark
- we have installed Spark on the lab machines
  - ... or you may install it on your own
- we don't have a real cluster: you'll have to make believe!

45 references
- Dean, J. and Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. In OSDI'04.
- Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. (2012). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI.
