Performance Evaluation of a MongoDB and Hadoop Platform for Scientific Data Analysis


1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific Data Analysis. Elif Dede, Madhusudhan Govindaraju (Department of Computer Science, Binghamton University (SUNY)); Lavanya Ramakrishnan, Dan Gunter, Shane Canon (Lawrence Berkeley National Laboratory)

2 Computation and Data are critical parts of the scientific process. Three Pillars of Science: Theory, Experiment, Computation. Data (Fourth Paradigm). Advanced Light Source data rates: 2009, 65 TB/yr (values for later years missing).

3 Materials Project. [Figure: Brain, schemaless database, manager.x processes.] Source: Michael Kocher, Daniel Gunter

4 Data is Big


5 Processing Big Data: MapReduce. Introduced at OSDI 2004 by Dean and Ghemawat of Google. Programming model for processing large data sets. Exploits a large set of commodity machines. Characteristics of the model: relaxed synchronization constraints, locality optimization, fault tolerance, load balancing.

6 Map and Reduce. Map/Reduce: The map() function is called on every item in the input set and emits a series of intermediate key/value pairs. All values associated with a given intermediate key are grouped together. The reduce() function is called on every unique intermediate key, and its value list, and emits a final output value.
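The map, group and reduce steps described on this slide can be sketched in a few lines of plain Python. This is only an in-memory illustration of the programming model, not Hadoop code; the word-count task and the names `map_fn`, `reduce_fn` and `run_mapreduce` are assumptions chosen for the example.

```python
from collections import defaultdict

def map_fn(record):
    """Called once per input item; emits intermediate (key, value) pairs."""
    for word in record.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    """Called once per unique intermediate key with that key's value list."""
    return key, sum(values)

def run_mapreduce(records, mapper, reducer):
    # Group every value emitted by the mapper under its intermediate key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Apply the reducer to each key and its grouped values.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))

counts = run_mapreduce(
    ["theory experiment", "experiment computation"], map_fn, reduce_fn
)
print(counts)
```

In a real framework the grouping step is the distributed shuffle, and the map and reduce calls run in parallel on different machines.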


7 Apache Hadoop. Open-source MapReduce implementation in Java. Easy scalability. Built-in I/O management. Hadoop Distributed File System (HDFS): data distribution, management and replication. Load balancing. Handles stragglers. Fault tolerance. Commodity hardware. Heartbeats. Speculative execution and data replication. Hadoop Streaming: create and run MapReduce jobs with any executable or script as the mapper and/or the reducer.
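As a rough sketch of what "any executable or script" means for Hadoop Streaming: the mapper and the reducer exchange text lines, conventionally a key and a value separated by a tab, and the reducer sees its input sorted by key. The functions below follow that convention on in-memory lists so they can be run locally. In an actual job each function would sit in its own script that reads `sys.stdin` and prints to standard output. The token-count task and the function names are assumptions made for illustration.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'token<TAB>1' line per whitespace-separated token."""
    for line in lines:
        for token in line.split():
            yield f"{token}\t1"

def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"

# Local stand-in for the framework: run the mapper, sort by key, run the reducer.
mapped = list(mapper(["a b", "b"]))
shuffled = sorted(mapped, key=lambda line: line.split("\t", 1)[0])
result = list(reducer(shuffled))
print(result)
```

The sort in the middle stands in for the shuffle phase that Hadoop performs between the map and reduce stages.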


8 Scientific Computing and Hadoop. Hadoop provides: Data flow parallelism: data goes through different steps of processing. Similar job phases: data preparation, transformation and reduction. MapReduce: maps (transformation) and reduces (reduction). Number of maps >>> number of reduces: data transformation is typically more parallel than data reduction. Fault tolerance and data locality: data-intensive loads, long-running scientific jobs.

9 Scientific Computing and Hadoop (Cont.). Hadoop does not provide: Java implementation: legacy scientific code is mostly not in Java and is hard to rewrite as map and reduce functions; Hadoop Streaming allows other modes. HDFS is a non-POSIX file system: HDFS Java library calls are needed to create, read and write files. HDFS data locality is good but does not handle applications that might have multiple data sets. Scientific data formats do not fit the line/block-oriented inputs of typical Hadoop jobs: scientific applications often work with files where the logical division of work is per file. New file formats require additional Java programming to define the format and the appropriate split for a single map task.

10 Scientific Computing and Hadoop (Cont.). Hadoop does not provide: Maps and reduces are considered identical (executables/arguments). Implementing different tasks requires logic in the tasks that differentiates the functionality. This can cause worker processing times to vary widely and lead to timeouts and restarted tasks due to the speculative execution in Hadoop. No built-in dynamic and iterative application support.

11 New Generation Data. Dynamic data size and content. Structured? Semi-structured, unstructured. Relational? Not always.


12 NoSQL. A broad class of data management systems where the data is partitioned across a set of servers, where no server plays a privileged role. NoSQL has emerged as an alternative model for this new non-relational data model. Addresses the "Big Data" challenge by providing horizontal scalability. Lower maintenance costs and flexibility. There are various data models that are represented under NoSQL, including key-value, column-oriented and document-oriented stores. Each of these models has its own interpretation of data storage and makes different tradeoffs within the Consistency, Availability and Performance.
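To make "partitioned across a set of servers" concrete, here is a minimal sketch of hash-based partitioning, where every server is chosen by the same rule and none is special. It is a generic illustration, not the placement scheme of any particular NoSQL system; the key names, the shard count of 4 and the functions `shard_for` and `partition` are all assumptions.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a record key to a shard index with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def partition(keys, num_shards):
    """Distribute keys over num_shards lists using shard_for."""
    shards = [[] for _ in range(num_shards)]
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards

# Hypothetical record keys, spread over four hypothetical servers.
keys = [f"record-{i}" for i in range(1000)]
shards = partition(keys, 4)
print([len(shard) for shard in shards])
```

Because the shard depends only on the key, any client can compute where a record lives, which is what lets reads and writes scale horizontally as servers are added.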

13 What is MongoDB? Open-source document-oriented database. Data is not in tables with rows and columns. Data is stored as documents, each of which is an associative array of scalar values or nested associative arrays. JavaScript Object Notation (JSON) format, stored as BSON. MongoDB uses sharding to split the data evenly across the cluster to parallelize access. This is done through front-end routing servers and back-end data servers. Provides a built-in MapReduce. Drawbacks: the MapReduce scripts must be written in JavaScript; slow, with poor analytics libraries; the JavaScript implementation used by MongoDB is not thread safe.
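The document model described here (associative arrays of scalars or nested associative arrays, serializable as JSON) maps directly onto Python dictionaries. The sketch below uses only the standard library and does not talk to a MongoDB server. The two records, their field names and values are made-up placeholders, and `get_path` is a hypothetical helper that loosely mimics looking up a nested field by a dotted path.

```python
import json

# Two hypothetical records in one collection; the field sets differ on purpose,
# since no table schema forces them to match.
doc_a = {
    "_id": "sample-001",
    "formula": "placeholder-A",
    "properties": {"energy": 1.0, "units": "arbitrary"},
}
doc_b = {
    "_id": "sample-002",
    "formula": "placeholder-B",
    "properties": {"energy": 2.0, "units": "arbitrary", "tags": ["relaxed"]},
    "provenance": {"code": "hypothetical-solver", "version": "0.0"},
}

def get_path(document, dotted_path, default=None):
    """Follow a dotted path such as 'properties.energy' through nested dicts."""
    current = document
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current

collection = [doc_a, doc_b]
energies = [get_path(doc, "properties.energy") for doc in collection]
codes = [get_path(doc, "provenance.code") for doc in collection]

# Documents survive a JSON round trip unchanged.
round_trip = json.loads(json.dumps(doc_b))
print(energies, codes)
```

A record that lacks a field simply yields the default, which is the practical meaning of "schemaless" for a data store whose attributes keep evolving.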

14 Why MongoDB? Materials Project: A community-accessible data store of calculated materials. Data store is complex, with hundreds of attributes, and constantly evolving. MongoDB provides an appropriate data model and query language. The project also needs to perform complex statistical data mining to discover patterns in materials and validate/verify correctness. These tasks are difficult with MongoDB but natural for MapReduce. [Figure: Brain, schemaless database, manager.x processes.] ALS: the Advanced Light Source's Tomography beamline uses MongoDB to store metadata from experiments (Summer '12, LBNL). Source: Michael Kocher, Daniel Gunter (LBNL)

15 Hadoop-MongoDB Connector. Input splits are retrieved from one or more MongoDB servers. Each mapper can read its splits in parallel. Results are written back to MongoDB by the Hadoop reducer(s). It works with a single MongoDB server or with a sharded setup. The user determines the split size.

16 MongoDB: Overhead of multiple connections. Test of the ability to handle a large number of simultaneous connections. 768 tasks with different checkpoint intervals, compared to when there is no checkpoint overhead. Connections increased from 154 to 768 per second; write volume increased to 768 MB/s.

17 MongoDB: Overhead when using more nodes and tasks. 10 min per task; all tasks run in parallel; 10 sec checkpoint interval. Overhead observed after 1000 parallel tasks. The large number of connections, more than the data volume, is the bottleneck.

18 MongoDB MapReduce vs. Hadoop-MongoDB: Read/Write Performance Comparison. Data is stored on a single MongoDB server. The Hadoop cluster consists of 2 worker nodes. The mongo-hadoop plug-in provides roughly five times better performance.

19 Hadoop-MongoDB: Choosing the Split Size. Processing 9.3 million input records with Hadoop. Each mapper reads an input split from the MongoDB server, does its processing and sends its intermediate output to the reducer. Split size varies: 16, 32, 64, 128, 254 MBs; sweet spot: 128 MB. With the default split size of 8 MB, Hadoop schedules over 500 mappers; by increasing the split size, this number drops to around 40.
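The relation between split size and mapper count on this slide is a ceiling division: one mapper per input split. The sketch below works that arithmetic through. The slide does not state the total input size, so `ASSUMED_TOTAL_MB = 5000` is a hypothetical figure picked only because it is consistent with "over 500 mappers" at 8 MB and "around 40" at 128 MB.

```python
import math

def num_mappers(total_mb: float, split_mb: float) -> int:
    """One mapper is scheduled per input split, so round up."""
    return math.ceil(total_mb / split_mb)

# Hypothetical total input size in MB; not a number taken from the slides.
ASSUMED_TOTAL_MB = 5000

for split_mb in (8, 16, 32, 64, 128):
    print(f"split {split_mb:>3} MB -> {num_mappers(ASSUMED_TOTAL_MB, split_mb)} mappers")
```

Fewer, larger splits mean less per-task scheduling and connection overhead, at the cost of coarser parallelism, which is why a middle value can end up as the sweet spot.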


20 Hadoop-MongoDB: Increasing Data. For 4.6 million input records, Hadoop-HDFS is two times better than MongoDB, and at 37.2 million records it is five times better. At 37.2 million input records, mongo-hadoop is more than three times slower in reading and more than nine times slower in writing than Hadoop-HDFS. In a sharded setup, mongo-hadoop reading times improve considerably. 2-node Hadoop cluster and 2 MongoDB servers.


21 Hadoop-MongoDB: Sharding and processing on local nodes vs. different nodes. The performance slightly worsened compared to running the servers on different machines. MongoDB uses mmap to aggressively cache data from disk into memory. With increasing input size, growing memory and CPU usage is observed on the worker/server nodes. This affects the performance of the MapReduce job. The performance bottleneck is due to memory contention. Locality has minimal effect.


22 Hadoop-MongoDB: Increasing #Workers. The performance over increasing cluster sizes, from 16 to 64 cores. Single to two MongoDB sharded servers. The write time is bound by the reduce phase for this MapReduce job: number of mappers >> number of reducers. The write performance of MongoDB remains a bottleneck, along with the overhead of routing data to be written between sharding servers. Write performance of MongoDB is a bottleneck.


23 Hadoop-MongoDB: Different Setups (given that the data is in MongoDB). Best performance is achieved by reading from MongoDB and writing the output to HDFS. Downloading the data to HDFS before running the analysis is the slowest. Hadoop-HDFS provides the best performance.

24 Hadoop-MongoDB: Different Setups. Increasing cluster size (from 8 cores to 64) for 37.2 million input records. With an increasing number of worker nodes, the concurrency of the map phase increases. The map times get considerably faster.


25 Hadoop-MongoDB: Fault Tolerance. 32-node Hadoop cluster processing ~37 million input records. After eight faulted worker nodes, Hadoop-HDFS loses too many data nodes and fails to complete the MapReduce job. Mongo-hadoop gets the input splits from the MongoDB server, so losing worker nodes does not lead to loss of input data.
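One plausible way to see why losing eight of 32 nodes can be fatal for HDFS input is a back-of-the-envelope calculation of the chance that every replica of some block sat on a failed node. The sketch below rests on several assumptions that the slides do not state: a replication factor of 3 (the HDFS default), replicas placed uniformly at random on distinct nodes, all failures occurring before any re-replication, and blocks treated as independent. The block count of 400 is hypothetical.

```python
from math import comb

def block_loss_probability(total_nodes, failed_nodes, replicas=3):
    """Chance that all replicas of one block landed on failed nodes."""
    if failed_nodes < replicas:
        return 0.0
    return comb(failed_nodes, replicas) / comb(total_nodes, replicas)

def any_block_lost_probability(total_nodes, failed_nodes, num_blocks, replicas=3):
    """Chance that at least one of num_blocks blocks is lost (independence assumed)."""
    p = block_loss_probability(total_nodes, failed_nodes, replicas)
    return 1.0 - (1.0 - p) ** num_blocks

per_block = block_loss_probability(32, 8)            # cluster sizes from the slide
job_level = any_block_lost_probability(32, 8, 400)   # 400 blocks is hypothetical
print(f"per block: {per_block:.4f}, at least one block lost: {job_level:.3f}")
```

Even a small per-block risk compounds over many blocks, whereas input served from a separate MongoDB server is unaffected by worker-node failures.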

26 Conclusions. Sharding helps to improve MongoDB's performance, especially for reads. In a sharded setup, mongo-hadoop reading times improve considerably, as there are multiple servers to respond to parallel worker requests. In cases where data is stored in MongoDB and needs to be analyzed, the mongo-hadoop connector is a convenient way to use Hadoop. Performance improves when output is written to HDFS. MongoDB performance degradation is observed with an increasing number of connections, increasing write requests per second, and an increase in total write volume. The mongo-hadoop plug-in provides roughly five times better performance than MongoDB's native MapReduce implementation. The performance gain from using mongo-hadoop increases linearly with input size.

27 Contact: Madhu Govindaraju, Binghamton University, State University of New York (SUNY); Dan Gunter, Lawrence Berkeley National Laboratory
