Database System Architectures: Parallel DBs, MapReduce, Column Stores. CMPSCI 445, Fall 2010. Some slides courtesy of Yanlei Diao, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet
Motivation: Large Scale Data Processing. We want to process lots of data (> 1 TB) and to parallelize across hundreds or thousands of commodity computers. This is the new definition of cluster computing: large numbers of low-end processors working in parallel to solve a computing problem. Parallel DB: a shared-nothing architecture, i.e., many interconnected machines with independent CPUs and disks. We want to make this easy.
Parallel Database Systems. We've studied centralized and client-server systems so far. For massive datasets or extremely large numbers of transactions, parallel systems are often required. Parallel database system: multiple CPUs and disks used in parallel.
Speed-up. Running a given task in less time by increasing the degree of parallelism is called speedup. [Figure: speed vs. resources; linear speedup is a straight line, sublinear speedup falls below it.]
Scale-up. Handling larger tasks by increasing the degree of parallelism is called scaleup. [Figure: time improvement vs. increasing problem size & resources; linear vs. sublinear scaleup.]
Obstacles in parallel systems. Start-up costs: starting each process has a cost; if the start-up time dominates the actual processing time, speedup is adversely affected. Interference: processes executing in parallel may need to access shared resources (system bus, shared disks, locks), hurting both scaleup and speedup. Skew: when we divide a task into many subtasks to execute in parallel, the job completes only when the last subtask completes; if the subtasks are skewed, we are limited by the longest one.
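The skew obstacle can be made concrete with a small sketch (illustrative numbers, not from the slides): the completion time of a parallel job is the time of its longest subtask, so an uneven split wastes most of the added parallelism.

```python
def parallel_time(partitions):
    """Completion time of a parallel job: the longest subtask."""
    return max(partitions)

even   = [25, 25, 25, 25]   # balanced split of 100 work units across 4 workers
skewed = [70, 10, 10, 10]   # same total work, one overloaded worker

print(parallel_time(even))    # 25 units -> full 4x speedup over serial time 100
print(parallel_time(skewed))  # 70 units -> only ~1.4x speedup
```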
Parallel databases. Research over the last 20 years. Major issues: data partitioning; managing skew; intraquery and interquery parallelism; query optimization, which can become very difficult. A number of mature commercial systems: Teradata, Netezza, Greenplum, ... Not many open-source systems.
MapReduce Automatic parallelization & distribution Fault-tolerance Status and monitoring tools Clean abstraction for programmers
Programming Model Borrows from functional programming Users implement an interface of two functions: map (in_key, in_value) -> (out_key, intermediate_value) list reduce (out_key, intermediate_value list) -> out_value list
map. Input key-value pairs: records from the data source, e.g., lines of files (filename, line), rows of a database, etc. map() is applied independently to each line or record. map() produces one or more intermediate values along with an output key from the input. Example with records (line, long, lat, month, day, year, temp): (0057, +51317, +028783, 01, 01, 1950, 39) and (0058, +51321, +018253, 02, 13, 1951, 32) -> map -> (year, temp) pairs: (1950, 39) and (1951, 32)
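The weather example above can be sketched as a map function (the record layout and field names follow the slide; the function name is ours):

```python
def map_weather(record):
    # Record layout from the slide: (line, long, lat, month, day, year, temp).
    # Emit one intermediate (year, temp) pair per input record.
    _line, _lon, _lat, _month, _day, year, temp = record
    return [(year, temp)]

print(map_weather(("0057", "+51317", "+028783", "01", "01", 1950, 39)))
# [(1950, 39)]
```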
reduce. After the map phase is over, all the intermediate values for a given output key are combined together into a list. reduce() combines those intermediate values into one or more final values for that same output key (in practice, usually only one final value per key). Example: (1950, (39, 20, 18, 32)), (1951, (32, 8, 23, 12, 10, 12)), ..., (2010, (31, 29, 27, 27, 18, 12)) -> reduce (max) -> (1950, 39), (1951, 32), ..., (2010, 31)
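The matching reduce step for the weather example is a per-key maximum (function name ours):

```python
def reduce_max(year, temps):
    # Collapse all intermediate temps for one year into a single final value.
    return (year, max(temps))

print(reduce_max(1950, [39, 20, 18, 32]))  # (1950, 39)
print(reduce_max(1951, [32, 8, 23, 12, 10, 12]))  # (1951, 32)
```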
Example: Count Word Occurrences

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));

How do we implement this using a relational DBMS? Customized data loading (data may be used only once), then Group By.
Extract (key, value) using map(); Group By key; apply reduce().
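The word-count pseudocode can be run end-to-end with a minimal single-machine driver. This is only a sketch of the programming model, not the distributed runtime; it emits integer counts directly rather than the pseudocode's string "1" values, and the grouping dict plays the role of the shuffle.

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Emit a (word, 1) intermediate pair for each word in the document.
    return [(w, 1) for w in contents.split()]

def reduce_fn(word, counts):
    # Combine all counts for one word into a single final value.
    return (word, sum(counts))

def mapreduce(docs):
    groups = defaultdict(list)            # the "shuffle": group values by out_key
    for name, text in docs.items():
        for k, v in map_fn(name, text):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

print(mapreduce({"d1": "to be or not to be"}))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```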
Parallelism. map() functions run in parallel, creating different intermediate values from different input data sets. reduce() functions also run in parallel, each working on a different output key. All values are processed independently. Bottleneck: the reduce phase can't start until the map phase is completely finished.
Fault Tolerance Master detects worker failures Re-executes completed & in-progress map() tasks Re-executes in-progress reduce() tasks Master notices particular input key/values cause crashes in map(), and skips those values on re-execution. Effect: Can work around bugs in third-party libraries!
Optimizations No reduce can start until map is complete: A single slow disk controller can rate-limit the whole process Master redundantly executes slow-moving map tasks; uses results of first copy to finish
Optimizations Combiner functions can run on same machine as a mapper Causes a mini-reduce phase to occur before the real reduce phase, to save bandwidth Under what conditions is it sound to use a combiner?
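One common answer to the slide's question (our gloss, not stated on the slide): a combiner is sound when the reduce function is associative and commutative, as summation is for word count. A minimal sketch of the mini-reduce:

```python
from collections import defaultdict

def combine(pairs):
    # Local mini-reduce on the mapper: sum counts per key before
    # they cross the network to the reducers.
    partial = defaultdict(int)
    for k, v in pairs:
        partial[k] += v
    return list(partial.items())

mapper_output = [("the", 1), ("cat", 1), ("the", 1), ("the", 1)]
print(combine(mapper_output))
# [('the', 3), ('cat', 1)] -- 2 records cross the network instead of 4
```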
Some Comments on MapReduce. Strengths: simplicity and low cost; infrastructure support (massive parallelism, fault tolerance) with proven success; complex map functions for parsing and transformation, and complex analytics for data mining, clustering, etc.; storage-system independence, accommodating heterogeneous storage systems. Limitations: performance for structured data analysis suffers (no indexes; often large, repeated scans); the language is not declarative; the data model is limited.
Column Oriented Databases. Can be used in centralized or parallel deployments. Pages are collections of the attribute values of a single column of a table (think extreme decomposition). Clear performance advantages for read-mostly workloads that access a small number of columns. Standard databases can simulate column stores: vertical partitioning and materialized views; indexing of individual columns. Current column-oriented databases improve on this: aggressive use of compression; late joining.
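A hypothetical sketch of the layout difference (table contents are made up): the same table stored row-wise versus as one array per attribute, where a query touching a single column scans one compact array instead of every row.

```python
# Row layout: one tuple per record.
rows = [(1, "alice", 30), (2, "bob", 25), (3, "carol", 41)]

# Column layout ("extreme decomposition"): one array per attribute.
cols = {
    "id":   [1, 2, 3],
    "name": ["alice", "bob", "carol"],
    "age":  [30, 25, 41],
}

# AVG(age): the column store reads only the 'age' array,
# skipping 'id' and 'name' entirely.
print(sum(cols["age"]) / len(cols["age"]))  # 32.0
```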
Articles to Read. MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. OSDI 2004. A Comparison of Approaches to Large-Scale Data Analysis. Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker. SIGMOD 2009. MapReduce and Parallel DBMSs: Friends or Foes? Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, Alexander Rasin. CACM Jan 2010. MapReduce: A Flexible Data Processing Tool. Jeffrey Dean, Sanjay Ghemawat. CACM Jan 2010.