Big Data and Scripting mongodb map reduce
- Andrea Harris
Slide 2
- last lecture
  - replication
  - distribution
  - full distributed storage network
- today
  - use the storage network for distributed computations
  - introduce the map/reduce framework
Slide 3: map/reduce - introduction
- aim
  - massive parallelization (arbitrary scaling)
  - heterogeneous environment (different types of nodes)
  - failure tolerance (nodes can fail)
- why not the PRAM model?
  - in a PRAM, many cores access the same memory in parallel
  - arbitrary scaling would require one large machine with many CPUs and much RAM
  - such a machine is very expensive, if realizable at all
  - using many cheap nodes connected via a network scales better
Slide 4: map/reduce - a new model of computation
- simulating a PRAM in a computing network
  - simulation is possible in principle
  - handling of locking, parallel writes and synchronization is complicated
  - PRAM algorithms are not aware of data locations
  - transport of data between nodes reduces simulation speed
  - many parallel processes working on one big memory
- creating a new paradigm for distributed computations
  - split data into many small parts
  - handle each part independently of all the others
  - combine results when the parts are finished
Slide 5: the simplest example
- the task: word frequencies
  - document with words, many repeated
  - find the bag-of-words vector $b$ with entries in $\mathbb{N}$, such that $b_w$ = number of repetitions of word $w$ in the document
- splitting and recombination
  - split documents into tuples $\langle w, 1 \rangle$ (key $w$, value 1)
  - collect all values for each key: $\langle w, (1, 1, \dots, 1) \rangle$
  - combine by adding values: $\langle w, k_w \rangle$
- splitting can be parallelized using chunks of the document
- combination is independent for each word
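The split / collect / combine steps of the word-frequency example can be sketched in plain JavaScript. This is a minimal illustration only: the function names (`mapChunk`, `shuffle`, `reduceCounts`) and the sample chunks are assumptions made for the sketch, and the chunks are processed sequentially here rather than on separate nodes.

```javascript
// Word frequencies via split / collect / combine.
// All names below are illustrative and not part of any library.

// map: split a chunk of the document into <w, 1> tuples
function mapChunk(chunk) {
  return chunk
    .toLowerCase()
    .split(/\W+/)
    .filter((w) => w.length > 0)
    .map((w) => [w, 1]);
}

// shuffle: collect all values for each key, giving <w, (1, 1, ..., 1)>
function shuffle(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return groups;
}

// reduce: combine by adding values, giving <w, k_w>
function reduceCounts(values) {
  return values.reduce((a, b) => a + b, 0);
}

// Each chunk could be mapped on a different node; here they run in sequence.
const chunks = ["the quick brown fox", "jumps over the lazy dog", "the end"];
const pairs = chunks.flatMap(mapChunk);

const wordCounts = new Map();
for (const [word, ones] of shuffle(pairs)) {
  wordCounts.set(word, reduceCounts(ones));
}

console.log(Object.fromEntries(wordCounts));
```

Because `reduceCounts` only ever sees the values of a single word, the combination step could run independently for each key.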
Slide 6: generalizing key/value computations
- input: (input) key/value pairs $\langle k_1, v_1 \rangle, \dots, \langle k_n, v_n \rangle$
- output: (output) key/value pairs $\langle k_i, r_i \rangle$
- map: $\langle k, v \rangle \mapsto \{ \langle k_i, v_i \rangle : i \in 0, \dots, n \}$
  - map each input pair to zero or more output pairs
  - modify or even split/filter the input
- shuffle
  - for each $k_i$, collect all $v_i$ and create a new tuple
  - distribute to reduce instances, e.g. different nodes
- reduce: $\langle k_i, (v_1, \dots, v_k) \rangle \mapsto \langle k_i, r_i \rangle$
  - combine all values for one key into a (partial) result
  - do not change the key
Slide 7: generalizing key/value computations
- mapping
  - is independent for each input key
  - can be applied completely in parallel
  - can be executed on the nodes that store the corresponding data
  - must be implemented without side effects
- shuffling
  - is independent of the algorithm
  - is implemented and optimized once
  - maps computations to the nodes where the data is stored (when possible)
  - can be executed partially in parallel to mapping
  - is finished only when mapping is finished
- reduce step
  - is again parallel for each output key
  - can be started only when mapping is finished
- more complex algorithms can be realized by combining several map/reduce steps
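The three stages (map, shuffle, reduce) can be put together in a small single-process sketch. `runMapReduce` is a hypothetical helper written for this illustration and is not an API of MongoDB or any other framework; a real implementation would distribute the map and reduce calls over nodes.

```javascript
// runMapReduce is a hypothetical helper, not a MongoDB or Hadoop API.
// mapFn(key, value) returns an array of zero or more [key, value] pairs.
// reduceFn(key, values) returns the combined result for one key.
function runMapReduce(inputPairs, mapFn, reduceFn) {
  // map: every input pair is handled independently of all others
  const emitted = [];
  for (const [key, value] of inputPairs) {
    for (const pair of mapFn(key, value)) emitted.push(pair);
  }

  // shuffle: group values by key; this part does not depend on the algorithm
  const groups = new Map();
  for (const [key, value] of emitted) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }

  // reduce: one call per output key, the key itself is not changed
  const output = [];
  for (const [key, values] of groups) {
    output.push([key, reduceFn(key, values)]);
  }
  return output;
}

// Example: drop negative readings in map (filter), keep the maximum per key.
const readings = [["a", 3], ["b", -1], ["a", 7], ["b", 4], ["c", -5]];
const maxima = runMapReduce(
  readings,
  (key, value) => (value >= 0 ? [[key, value]] : []),
  (key, values) => Math.max(...values)
);

console.log(maxima);
```

Key `c` disappears from the output because its only value is filtered out during mapping, which shows that map may emit zero pairs for an input.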
Slide 8: a general interface for map/reduce
- (many) different implementations of map/reduce exist
- all share a common interface:
  - input: set of key/value pairs
  - input: map function
  - input: reduce function
- input flexibility:
  - can be a single element
  - can consist of elements with the same key (e.g. NULL)
- analogous for output
  - all input elements can be mapped to one key
  - single output value
Slide 9: example approach for a complex parallel algorithm
- a first map/reduce step partitions the input into many sub-problems
- intermediate steps solve the sub-problems in parallel and possibly combine them
- a final step maps all elements to a single key and reduces to the final result
Slide 10: example: recursive k-means clustering
- the STREAM k-means clustering can be adapted to the map/reduce framework
- 1. distribution/partitioning
  - map: partition the input set randomly, using a hash function as key
  - reduce: values for one key are clustered into 4k clusters
- 2. repeated recursive reduction
  - map: redistribute weighted cluster representatives, using a new hash function with smaller range
  - reduce: cluster the weighted representatives for each key
- 3. recombination to a single clustering
  - map: map all weighted representatives to one key
  - reduce: combine the representatives into a final clustering of the input set
- in contrast to the stream implementation, the data points can be assigned to the resulting cluster centers
Slide 11: map/reduce in MongoDB
- import an example table with cities and population size [1]:
    mongoimport --db scratch --collection zips --file zips.json
- the imported collection:
    > use scratch
    switched to db scratch
    > db.zips.findOne()
    { "city" : "ACMAR", "loc" : [ , ], "pop" : 6055, "state" : "AL", "_id" : "35004" }
[1] media.mongodb.org/zips.json
Slide 12: map/reduce in MongoDB
- task: create a histogram of city sizes
  - for appropriate k: number of cities with population in [1000k, 1000(k + 1)) (include lower, exclude upper bound)
- expression in map/reduce terms
  - map values to their bucket:
      > map = function(){ var value = Math.round(this.pop/1000); emit(value, 1); }
  - reduce by summing up the values in each bucket:
      > reduce = function(value, counts){ return Array.sum(counts); }
- note: Math.round assigns a population to the nearest multiple of 1000, so bucket k actually covers [1000k - 500, 1000k + 500); Math.floor would match the interval stated above exactly
- note: map and reduce are variables holding functions
  - these will be passed as parameters in the following
Slide 13: defining map and reduce functions
    > map = function(){ var value = Math.round(this.pop/1000); emit(value, 1); }
- map does not need a parameter
- access the input document as this
- arbitrarily many calls to emit(key, value) create the input for the shuffle stage
- key and value can be simple values or documents
    > reduce = function(value, counts){ return Array.sum(counts); }
- reduce has two parameters, named arbitrarily
  - first: the key (here: value), corresponding to the key used in emit
  - second: an array of values; each array element has the type of the second argument of emit
- the implementation in MongoDB expects additional properties (later)
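How a parameterless map function sees its document through `this`, and how `emit` feeds the next stage, can be tried outside the database. The sketch below is plain JavaScript under stated assumptions: `emit` is a local stand-in for the function MongoDB provides inside map functions, and the sample documents and the helper `bucketCounts` are invented for the illustration. It also compares the rounding used in the map function with `Math.floor`, which corresponds to the half-open interval [1000k, 1000(k + 1)).

```javascript
// Sample documents shaped like entries of the zips collection; values are made up.
const docs = [
  { _id: "a", pop: 400 },
  { _id: "b", pop: 1499 },
  { _id: "c", pop: 1500 },
  { _id: "d", pop: 2999 },
];

// Local stand-in for the emit() that MongoDB provides inside map functions.
const emitted = [];
function emit(key, value) {
  emitted.push([key, value]);
}

// Same shape as a MongoDB map function: no parameter, document bound to `this`.
// Regular functions are required here; arrow functions do not rebind `this`.
const mapRound = function () { emit(Math.round(this.pop / 1000), 1); };
const mapFloor = function () { emit(Math.floor(this.pop / 1000), 1); };

// Hypothetical helper: run a map function over all documents and sum per bucket.
function bucketCounts(mapFn) {
  emitted.length = 0;                       // reset the emit buffer
  for (const doc of docs) mapFn.call(doc);  // bind the document to `this`
  const counts = {};
  for (const [key, value] of emitted) {
    counts[key] = (counts[key] || 0) + value;
  }
  return counts;
}

const rounded = bucketCounts(mapRound);
const floored = bucketCounts(mapFloor);
console.log("round:", rounded);
console.log("floor:", floored);
```

With these sample values the two variants disagree for populations 1500 and 2999, which rounding moves into the next higher bucket.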
Slide 14: execution and refinement
- execute and put the result in a new collection hist_1000:
    > db.zips.mapReduce(map, reduce, {out: "hist_1000"})
    { "result" : "hist_1000", "timeMillis" : 531, "counts" : { "input" : 29467, "emit" : 29467, "reduce" : 2135, "output" : 97 }, "ok" : 1 }
- execution delivers a result document with execution parameters:
  - result collection
  - execution time
  - number of input pairs and number of pairs emitted by map()
  - input pairs for reduce() (after shuffling)
  - output pairs
Slide 15: output of map/reduce in MongoDB
    > show collections
    hist_1000
    system.indexes
    zips
    > db.hist_1000.findOne()
    { "_id" : 0, "value" : 4415 }
- the result is stored in a collection
- one document for each key/value pair in the result
- the key is used as _id
- the value is stored as field value
Slide 16: additional requirements for reduce()
- MongoDB's implementation differs from the original definition
  - original definition: reduce() is executed once for each key
  - in MongoDB: possibly multiple times for each key, each time with part of the values
- additional requirements for reduce() functions:
  - type of the return value equals type of the input values (value only, key is not changed)
  - idempotent: reduce(key, [reduce(key, values)]) == reduce(key, values)
  - associative: reduce(key, [a, reduce(key, [b, c])]) == reduce(key, [a, b, c])
  - commutative: reduce(key, [a, b]) == reduce(key, [b, a])
- note: this allows application of the recursive parallelization scheme for associative functions on many arguments
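The three requirements can be spot-checked for a candidate reduce function before handing it to the database. The following plain-JavaScript sketch is an illustration only: `sum` stands in for the mongo shell's `Array.sum`, `checkRequirements` is a hypothetical helper, and a check on one sample is of course not a proof that the properties hold in general.

```javascript
// Plain-JavaScript stand-in for the mongo shell helper Array.sum.
function sum(values) {
  return values.reduce((a, b) => a + b, 0);
}

// A reducer that meets the requirements: adding up the values.
const reduceSum = (key, values) => sum(values);

// A reducer that does not: counting the values it is given.
// Re-reducing a partial result counts it as one value.
const reduceCount = (key, values) => values.length;

// Hypothetical helper: test the three properties on sample values a, b, c.
function checkRequirements(reduceFn, key, a, b, c) {
  const all = reduceFn(key, [a, b, c]);
  return {
    idempotent: reduceFn(key, [all]) === all,
    associative: reduceFn(key, [a, reduceFn(key, [b, c])]) === all,
    commutative: reduceFn(key, [a, b]) === reduceFn(key, [b, a]),
  };
}

const sumCheck = checkRequirements(reduceSum, 0, 1, 1, 1);
const countCheck = checkRequirements(reduceCount, 0, 1, 1, 1);
console.log("sum:", sumCheck);
console.log("count:", countCheck);
```

The counting reducer fails because its output (a count) cannot be fed back in as an input value; emitting 1 per document and summing avoids this.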
Slide 17: additional options to mapReduce()
- mandatory arguments: map, reduce, out
- other parameters are optional:
  - query: limit the input to documents matching the query
  - limit: limit the number of input documents
  - sort: sort the input documents; may be useful for optimization
- others are: finalize, scope
Slide 18: finalization
- the limitations of reduce() may result in an unwanted output format
- the output can be modified in a final step using finalize:
    db.collection.mapReduce(map, reduce, {out: "..", finalize: finfunction})
- function arguments are analogous to reduce(): the key and the reduced value (a single value, not an array)
- the return value ends up as the value field in the result set
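A typical use of finalize is computing a mean: reduce has to return the same type it receives, so it carries a (sum, count) pair, and only the final step turns that pair into one number. The sketch below simulates this in plain JavaScript; the sample data and the names `reduceStats` and `finalizeMean` are assumptions made for the illustration, and reduce is called once per key here, which MongoDB does not guarantee.

```javascript
// Made-up sample data: population per city, grouped by state.
const cities = [
  { state: "AL", pop: 100 },
  { state: "AL", pop: 300 },
  { state: "NY", pop: 50 },
];

// map stage: emit values that already have the shape reduce returns
const emitted = cities.map((c) => [c.state, { sum: c.pop, count: 1 }]);

// reduce: input and output have the same type, so partial results can be re-reduced
function reduceStats(key, values) {
  const out = { sum: 0, count: 0 };
  for (const v of values) {
    out.sum += v.sum;
    out.count += v.count;
  }
  return out;
}

// finalize: receives the key and the single reduced value
function finalizeMean(key, reduced) {
  return reduced.sum / reduced.count;
}

// shuffle stage: group the emitted values by key
const groups = new Map();
for (const [key, value] of emitted) {
  if (!groups.has(key)) groups.set(key, []);
  groups.get(key).push(value);
}

const meanPop = {};
for (const [key, values] of groups) {
  meanPop[key] = finalizeMean(key, reduceStats(key, values));
}
console.log(meanPop);
```

Dividing inside reduce instead would break the requirement that reduce can be applied again to its own output.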
Slide 19: using global variables with scope
- scope makes global variables available in map, reduce and finalize
- specify the set of variables from the JavaScript context:
    scope: ["var1", "var2"]
- these can be read and modified in the map/reduce functions
- example:
    > db.zips.mapReduce(map,
    ... function(v,c){counter++;
    ... return({sum:Array.sum(c), count:counter});},
    ... {out:"hist", limit:100, scope:["counter"]})
    { "result" : "hist",...
    > db.hist.find()
    { "_id" : 0, "value" : 1 }
    { "_id" : 1, "value" : { "sum" : 12, "count" : 56 } }
    { "_id" : 2, "value" : { "sum" : 9, "count" : 57 } }
- there was only one value with key 0, so it did not enter reduce and kept its emitted form
- note: the MongoDB manual documents scope as a document of name/value pairs, e.g. scope: {counter: 0}
Slide 20: output options
- out: "output_collection"
  - default behaviour: store the output key/value pairs in the collection
  - if the collection exists, overwrite it
- alternative: output action
    out: {<action> : <collectionname>, <optionalparameters>}
  - replace: the default behaviour
  - merge: overwrite only documents whose key already exists in the output
  - reduce: merge documents with equal keys using reduce
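The difference between the three output actions can be illustrated on small in-memory key/value maps. This is a sketch under assumptions: `applyOutput` is a hypothetical helper that mimics the described behaviour, not MongoDB code, and the existing collection and the new results are invented sample data.

```javascript
// Hypothetical helper mimicking the output actions on in-memory Maps.
// existing: current content of the output collection (key -> value)
// results:  key/value pairs produced by the current map/reduce run
function applyOutput(action, existing, results, reduceFn) {
  // replace starts from an empty collection, merge and reduce keep old documents
  const out = action === "replace" ? new Map() : new Map(existing);
  for (const [key, value] of results) {
    if (action === "reduce" && out.has(key)) {
      // reduce: combine the old and the new value for an equal key
      out.set(key, reduceFn(key, [out.get(key), value]));
    } else {
      // replace and merge: the new value wins
      out.set(key, value);
    }
  }
  return out;
}

const existing = new Map([[0, 10], [1, 20]]);
const results = new Map([[1, 5], [2, 7]]);
const add = (key, values) => values.reduce((a, b) => a + b, 0);

const replaced = applyOutput("replace", existing, results, add);
const merged = applyOutput("merge", existing, results, add);
const reduced = applyOutput("reduce", existing, results, add);

console.log([...replaced], [...merged], [...reduced]);
```

The reduce action is what makes incremental runs possible: new results are folded into earlier ones, which again relies on reduce accepting its own output as input.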
Slide 21: more output options
- optionalparameters:
  - db: store the result collection in another database; default is the db of the input collection
  - sharded: distribute (shard) the output using _id as key
  - nonAtomic: when merging or reducing the output into an existing collection, prevent locking of the database with nonAtomic: true; default is false, i.e. the database is locked while the output is written
- output mode without storage in a collection:
    out: {inline : 1}
  - intermediate documents are not stored in the database
  - pure in-memory computation
  - the result document has an additional array field results containing the output
Slide 22: using functions via API calls
- functions of the form db.collection.find() are executed in the JavaScript context on the mongo console
- access via APIs provides execution of mongo queries and commands, but not JavaScript
- a minimal interface: the minimal context provided by (almost) all APIs
  - a connection object (or variable)
  - a database object
  - methods to send commands to a MongoDB instance
Slide 23: accessing mongo via API
- access to a subset of MongoDB functions in the language context
- execution of JavaScript commands is often not directly supported
- sending JavaScript code is error-prone and should be avoided
- generic command execution: db.runCommand()
  - executes database commands [2] provided by MongoDB
  - parameter availability can differ from the JavaScript counterpart
  - provides access to functions not directly supported by the API
- example:
    db.runCommand({aggregate: "<collection>", pipeline: [<pipeline>]})
[2] list: docs.mongodb.org/manual/reference/command/
Slide 24: execution of JavaScript
- eval allows execution of JavaScript functions:
    { eval: <function>, args: [ <arg1>, <arg2>,...], nolock: <boolean> }
- example:
    > db.runCommand({eval: function(x){db.tmp.insert({value:x});},
    ... args: ["2"]})
    { "retval" : null, "ok" : 1 }
    > db.tmp.find()
    { "_id" : ObjectId("51c1ee258ab237572ba6ea88"), "value" : "2" }
    > db.runCommand({eval: "db.tmp.insert({value:3})"})
    > db.tmp.find()
    { "_id" : ObjectId("51c1ee258ab237572ba6ea88"), "value" : "2" }
    { "_id" : ObjectId("51c29fc08ab237572ba6ea8c"), "value" : 3 }
- by default, execution blocks the complete database
- prevent this with nolock: true
Slide 25: storing pre-defined functions
- MongoDB provides a special collection for storing functions: db.system.js
- these functions are database specific
  - they can access the database as db
- _id is the function name, value holds the function:
    > db.system.js.insert({_id:"tmpinsert", value: function(x)
    ... {db.tmp.insert({value:x});}})
- load the stored functions:
    > db.loadServerScripts()
    > tmpinsert(3)
    > db.runCommand({eval : tmpinsert, args: [24]})
    { "retval" : null, "ok" : 1 }
    > db.tmp.find()
    { "_id" : ObjectId("51c2a27be806ddd460c2ae90"), "value" : 3 }
    { "_id" : ObjectId("51c2a3698ab237572ba6ea8d"), "value" : 24 }
Slide 26: formalizing map/reduce for automated execution
- aim:
  - create a formalized storage for map/reduce algorithms
  - allow automated execution of stored algorithms
- a single map/reduce operation has 3 functions, one of them optional
  - these could be stored in a mr-step document:
      {map: function()..., reduce: function()..., finalize: function()...}
- a complete algorithm consists of (possibly) multiple steps
  - can be formalized as a mr-algorithm document:
      {_id : "algoname", functions : [ <mr1>, <mr2>,...]}
  - where <mr1> is a document describing a single map/reduce step
Slide 27: automatic execution of map/reduce algorithms
- assume a collection of mr-algorithm documents
- execution of these can be scripted in a function mr-execution
  - input: algorithm name, input collection, output collection
  - get the algorithm document from the collection
  - for the first element in functions: execute mapReduce from the input collection into the output collection
  - for each further element in functions: execute mapReduce from the output collection into the output collection
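The mr-execution idea can be sketched without a database by keeping collections and algorithm documents in memory. Everything named below is an assumption made for the illustration: `collections`, `mapReduceStep`, `mrExecution` and the sample algorithm are hypothetical, map functions receive the document and an emit callback as parameters (instead of `this` and a global `emit` as in MongoDB), and reduce is called exactly once per key.

```javascript
// In-memory stand-in: collection name -> array of {_id, value} documents.
const collections = {
  input: [
    { _id: 1, value: 3 },
    { _id: 2, value: 4 },
    { _id: 3, value: 5 },
    { _id: 4, value: 6 },
  ],
};

// Run one mr-step document {map, reduce, finalize?} from inName into outName.
function mapReduceStep(step, inName, outName) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  for (const doc of collections[inName]) step.map(doc, emit);

  const out = [];
  for (const [key, values] of groups) {
    let value = step.reduce(key, values);
    if (step.finalize) value = step.finalize(key, value); // finalize is optional
    out.push({ _id: key, value: value });
  }
  collections[outName] = out; // overwrite, like the default output behaviour
}

const addUp = (key, values) => values.reduce((a, b) => a + b, 0);

// A collection of mr-algorithm documents with a single two-step algorithm.
const algorithms = [
  {
    _id: "sumOfParitySums",
    functions: [
      // step 1: partition by parity and sum each part
      { map: (doc, emit) => emit(doc.value % 2, doc.value), reduce: addUp },
      // step 2: map all partial sums to a single key and sum again
      { map: (doc, emit) => emit("total", doc.value), reduce: addUp },
    ],
  },
];

// mr-execution: the first step reads the input collection, later steps read the output.
function mrExecution(name, inName, outName) {
  const algo = algorithms.find((a) => a._id === name);
  if (!algo) throw new Error("unknown algorithm: " + name);
  algo.functions.forEach((step, i) => {
    mapReduceStep(step, i === 0 ? inName : outName, outName);
  });
  return collections[outName];
}

const finalDocs = mrExecution("sumOfParitySums", "input", "output");
console.log(finalDocs);
```

The sample algorithm follows the partition, solve, recombine pattern: the first step creates independent sub-problems, the last step maps everything to one key to obtain a single result.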
More informationCS /29/18. Paul Krzyzanowski 1. Question 1 (Bigtable) Distributed Systems 2018 Pre-exam 3 review Selected questions from past exams
Question 1 (Bigtable) What is an SSTable in Bigtable? Distributed Systems 2018 Pre-exam 3 review Selected questions from past exams It is the internal file format used to store Bigtable data. It maps keys
More informationDistributed Systems Pre-exam 3 review Selected questions from past exams. David Domingo Paul Krzyzanowski Rutgers University Fall 2018
Distributed Systems 2018 Pre-exam 3 review Selected questions from past exams David Domingo Paul Krzyzanowski Rutgers University Fall 2018 November 28, 2018 1 Question 1 (Bigtable) What is an SSTable in
More informationTITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP
TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop
More informationHadoop Online Training
Hadoop Online Training IQ training facility offers Hadoop Online Training. Our Hadoop trainers come with vast work experience and teaching skills. Our Hadoop training online is regarded as the one of the
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 2: MapReduce Algorithm Design (2/2) January 14, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo
More informationMongoDB 2.2 and Big Data
MongoDB 2.2 and Big Data Christian Kvalheim Team Lead Engineering, EMEA christkv@10gen.com @christkv christiankvalheim.com From here... http://bit.ly/ot71m4 ...to here... http://bit.ly/oxcsis ...without
More informationDelving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture
Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture Hadoop 1.0 Architecture Introduction to Hadoop & Big Data Hadoop Evolution Hadoop Architecture Networking Concepts Use cases
More informationHadoopDB: An open source hybrid of MapReduce
HadoopDB: An open source hybrid of MapReduce and DBMS technologies Azza Abouzeid, Kamil Bajda-Pawlikowski Daniel J. Abadi, Avi Silberschatz Yale University http://hadoopdb.sourceforge.net October 2, 2009
More informationChapter 17: Parallel Databases
Chapter 17: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery Parallelism Intraoperation Parallelism Interoperation Parallelism Design of Parallel Systems Database Systems
More informationDatabase Applications (15-415)
Database Applications (15-415) DBMS Internals- Part VI Lecture 14, March 12, 2014 Mohammad Hammoud Today Last Session: DBMS Internals- Part V Hash-based indexes (Cont d) and External Sorting Today s Session:
More informationAnnouncements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm
Announcements HW5 is due tomorrow 11pm Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create your AWS account before
More informationCS November 2018
Bigtable Highly available distributed storage Distributed Systems 19. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account
More informationHadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours)
Bigdata Fundamentals Day1: (2hours) 1. Understanding BigData. a. What is Big Data? b. Big-Data characteristics. c. Challenges with the traditional Data Base Systems and Distributed Systems. 2. Distributions:
More informationParallelism with Haskell
Parallelism with Haskell T-106.6200 Special course High-Performance Embedded Computing (HPEC) Autumn 2013 (I-II), Lecture 5 Vesa Hirvisalo & Kevin Hammond (Univ. of St. Andrews) 2013-10-08 V. Hirvisalo
More informationIndex construction CE-324: Modern Information Retrieval Sharif University of Technology
Index construction CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2017 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch.
More informationLecture 4, 04/08/2015. Scribed by Eric Lax, Andreas Santucci, Charles Zheng.
CME 323: Distributed Algorithms and Optimization, Spring 2015 http://stanford.edu/~rezab/dao. Instructor: Reza Zadeh, Databricks and Stanford. Lecture 4, 04/08/2015. Scribed by Eric Lax, Andreas Santucci,
More informationCS 61C: Great Ideas in Computer Architecture. MapReduce
CS 61C: Great Ideas in Computer Architecture MapReduce Guest Lecturer: Justin Hsia 3/06/2013 Spring 2013 Lecture #18 1 Review of Last Lecture Performance latency and throughput Warehouse Scale Computing
More informationCSC 261/461 Database Systems Lecture 19
CSC 261/461 Database Systems Lecture 19 Fall 2017 Announcements CIRC: CIRC is down!!! MongoDB and Spark (mini) projects are at stake. L Project 1 Milestone 4 is out Due date: Last date of class We will
More informationPercona Live Santa Clara, California April 24th 27th, 2017
Percona Live 2017 Santa Clara, California April 24th 27th, 2017 MongoDB Shell: A Primer Rick Golba The Mongo Shell It is a JavaScript interface to MongoDB Part of the standard installation of MongoDB Used
More informationCSE 344 Final Review. August 16 th
CSE 344 Final Review August 16 th Final In class on Friday One sheet of notes, front and back cost formulas also provided Practice exam on web site Good luck! Primary Topics Parallel DBs parallel join
More informationThe amount of data increases every day Some numbers ( 2012):
1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect
More information