Introduction to Map Reduce with Couchbase. Tugdual Grall / @tgrall. NoSQL Matters 13 - Cologne - April 25th 2013
About Me: Tugdual "Tug" Grall. Couchbase: Technical Evangelist; eXo: CTO; Oracle: Developer/Product Manager, mainly Java/SOA; Developer in consulting firms. Web: @tgrall, http://blog.grallandco.com, tgrall; NantesJUG co-founder; Pet Project: http://www.resultri.com
What's the Problem? Lots of Data; Big Data; Big Users; SaaS/Cloud Computing
Solution: Distribute the data; distribute the processing of the data
Map Reduce: MapReduce is a programming model for processing large data sets, and the name of an implementation of the model by Google. MapReduce is typically used to do distributed computing on clusters of computers. http://research.google.com/archive/mapreduce.html
In detail: the developer specifies 2 methods. map (in_key, in_value) -> list(out_key, intermediate_value): processes input data; produces key/value pairs. reduce (out_key, list(intermediate_value)) -> list(out_value): combines all intermediate values for a particular key; produces a set of merged output values.
Execution
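The two signatures above can be illustrated with a small in-memory word count. This is only a sketch of the programming model, not of Google's or Couchbase's implementation; `runMapReduce`, `wordCountMap` and `wordCountReduce` are hypothetical names introduced for the example, and everything runs in a single process.

```javascript
// map(in_key, in_value) -> list of [out_key, intermediate_value]
function wordCountMap(docId, text) {
  return text
    .toLowerCase()
    .split(/\W+/)
    .filter(Boolean)
    .map((word) => [word, 1]);
}

// reduce(out_key, list(intermediate_value)) -> list(out_value)
function wordCountReduce(word, counts) {
  return [counts.reduce((sum, n) => sum + n, 0)];
}

// Hypothetical single-process driver: map, then group by key, then reduce.
function runMapReduce(inputs, mapFn, reduceFn) {
  // Map phase: every input record produces intermediate pairs.
  const intermediate = [];
  for (const [key, value] of Object.entries(inputs)) {
    intermediate.push(...mapFn(key, value));
  }
  // Shuffle phase: collect all intermediate values of one key together.
  const groups = new Map();
  for (const [k, v] of intermediate) {
    if (!groups.has(k)) groups.set(k, []);
    groups.get(k).push(v);
  }
  // Reduce phase: merge the values of each key.
  const output = {};
  for (const [k, values] of groups) {
    output[k] = reduceFn(k, values);
  }
  return output;
}

const wordCounts = runMapReduce(
  { d1: "big data big users", d2: "big cluster" },
  wordCountMap,
  wordCountReduce
);
console.log(wordCounts);
```

In a real cluster the map and reduce calls run on different machines and the grouping step moves data over the network; the contract between the two functions stays the same.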
Most common use case Yahoo inc.
What about Couchbase?
Couchbase Open Source Project: Leading NoSQL database project focused on distributed database technology and surrounding ecosystem; Supports both key-value and document-oriented use cases; All components are available under the Apache License 2.0; Obtained as packaged software in both enterprise and community editions.
Couchbase Server Core Principles. Easy Scalability: grow the cluster without application changes, without downtime, with a single click. Consistent High Performance: consistent sub-millisecond read and write response times with consistent high throughput. Always On 24x365: no downtime for software upgrades, hardware maintenance, etc. Flexible Data Model: JSON document model with no fixed schema.
Additional Couchbase Server Features: Built-in clustering (all nodes equal); Data replication with auto-failover; Zero-downtime maintenance; Built-in managed cache; Append-only storage layer; Online compaction; Monitoring and admin API & UI; SDK for a variety of languages
Couchbase Server 2.0 Architecture (diagram). Data Manager: Query API (port 8092); Memcapable 1.0 (port 11211, via Moxi); Memcapable 2.0 (port 11210); Query Engine; Memcached; Couchbase EP Engine; storage interface; New Persistence Layer. Cluster Manager (Erlang/OTP; some components run on each node, others once per cluster): REST management API/Web UI (HTTP, port 8091); Erlang port mapper (port 4369); Distributed Erlang (ports 21100-21199); Heartbeat; Process monitor; Configuration manager; Global singleton supervisor; Rebalance orchestrator; Node health monitor; vBucket state and replication manager.
Couchbase Server 2.0 Architecture (same diagram, annotated by layer). RAM Cache, Indexing & Persistence Management (C & V8): Query API (port 8092), Memcapable 1.0 (port 11211, via Moxi), Memcapable 2.0 (port 11210), Query Engine, Object-level Cache, Couchbase EP Engine, storage interface, New Persistence Layer (Disk Persistence). Server/Cluster Management & Communication (Erlang/OTP): REST management API/Web UI (HTTP, port 8091), Erlang port mapper (port 4369), Distributed Erlang (ports 21100-21199), Heartbeat, Process monitor, Configuration manager, Global singleton supervisor, Rebalance orchestrator, Node health monitor, vBucket state and replication manager. See also: "The Unreasonable Effectiveness of C" by Damien Katz.
Basic Operation (diagram: App Server 1 and App Server 2, each with the Couchbase client library and a cluster map, read/write/update against a Couchbase Server cluster of three servers; each server holds active docs and replica docs; user-configured replica count = 1). Docs are distributed evenly across servers; Each server stores both active and replica docs (for a given doc, only one server is active at a time); The client library provides the app with a simple interface to the database; The cluster map tells the client which server a doc is on (the app never needs to know); The app reads, writes and updates docs; Multiple app servers can access the same document at the same time.
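The role of the cluster map can be sketched as a two-step lookup: hash the key to a partition, then look the partition up in a table that names the active and replica server. The sketch below is a simplification under stated assumptions: the partition count, the hash function, the round-robin placement and the names `clusterMap`, `partitionFor` and `locate` are all hypothetical and are not the algorithm the Couchbase client library uses.

```javascript
// Assumption: 8 partitions and 3 servers, placed round-robin for illustration.
const NUM_PARTITIONS = 8;
const servers = ["server1", "server2", "server3"];

// Hypothetical cluster map: partition number -> active and replica server.
const clusterMap = Array.from({ length: NUM_PARTITIONS }, (_, p) => ({
  active: servers[p % servers.length],
  replica: servers[(p + 1) % servers.length],
}));

// Simple deterministic string hash (an assumption, chosen for brevity).
function partitionFor(key) {
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    hash = (hash * 31 + key.charCodeAt(i)) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash % NUM_PARTITIONS;
}

// The application only passes a key; the library resolves the server.
function locate(key) {
  return clusterMap[partitionFor(key)];
}

console.log(locate("my-key"));
```

Because every client holds the same map and uses the same hash, two app servers asking for the same key end up talking to the same active server without any coordination between them.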
How to access the data?
Couchbase.get("my-key");
Look at a document. Key -> Value, where the value is a JSON object (the document): { "string": "string", "string": value, "string": { "string": "string", "string": value }, "string": [ array ] }. How to find a document based on its attributes? get employee by email; get products by type; ... You need to look into the document/value.
How to? Create an index!
Create the index (diagram: a stack of beer documents, each with metadata and a JSON body, feeding an index). Metadata: {"id": "110f37fa30", "rev": "1-000000000", "expiration": 0, "flags": 0, "type": "json"}. Document: {"name": "Aventinus", "abv": 8.2, "ibu": 0, "srm": 0, "upc": 0, "type": "beer", "brewery_id": "110f1f2012", "updated": "2010-07-22 20:00:20", "description": "Dark-ruby, ... Weizenbock", "category": "German Ale"}. Resulting index (Key, Value): Aventinus, 8.2; Avenue Ale, 4.1; ...
Concrete Example. The map function receives the document and its metadata; as a developer you just have to emit the key and value (K,V).
Map Function
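The body of the map function did not survive on this slide, so the following is a reconstruction sketch rather than the presenter's code. It assumes the view should produce the beer name as key and the `abv` as value, which matches the index shown on the "Create the index" slide. The `emit` harness, `rows`, `currentMeta` and the second and third sample documents are hypothetical; inside Couchbase Server, `emit` is provided by the view engine.

```javascript
// Hypothetical stand-in for the emit() that the view engine provides.
const rows = [];
let currentMeta = null;
function emit(key, value) {
  rows.push({ key: key, value: value, id: currentMeta.id });
}

// Map function: receives the document and its metadata, emits (name, abv).
const byName = function (doc, meta) {
  if (doc.type === "beer" && doc.name) {
    emit(doc.name, doc.abv);
  }
};

// Sample input: the first document mirrors the slide, the others are made up.
const docs = [
  { meta: { id: "110f37fa30" }, doc: { name: "Aventinus", abv: 8.2, type: "beer" } },
  { meta: { id: "hypothetical-beer-2" }, doc: { name: "Avenue Ale", abv: 4.1, type: "beer" } },
  { meta: { id: "hypothetical-brewery-1" }, doc: { name: "Some Brewery", type: "brewery" } },
];

for (const { doc, meta } of docs) {
  currentMeta = meta;
  byName(doc, meta);
}

// The index is kept sorted by key (plain string comparison in this sketch).
rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
console.log(rows);
```

The `doc.type` check keeps non-beer documents out of the index, so the brewery document produces no row.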
?startkey="b1"&endkey="zn" and ?startkey="bz"&endkey="zz": pulls the index keys within the UTF-8 range specified by the startkey and endkey. Index used on this and the next two slides:
    doc.email                 meta.id
    abba@couchbase.com        u::1
    beta@couchbase.com        u::7
    jasdeep@couchbase.com     u::2
    math@couchbase.com        u::5
    matt@couchbase.com        u::6
    yeti@couchbase.com        u::4
    zorro@couchbase.com       u::3
?key="math@couchbase.com": matches a single index key (same index as above).
?keys=["math@couchbase.com", "yeti@couchbase.com"]: queries multiple keys in the set, using array notation (same index as above).
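The three query styles on these slides can be mimicked against an in-memory copy of the index. This is a sketch only: `queryIndex` is a hypothetical helper, it scans an array instead of walking a B-tree, and it compares keys as plain JavaScript strings, which is an assumption that happens to give the same order for these lowercase ASCII emails.

```javascript
// In-memory copy of the doc.email index from the slides, sorted by key.
const emailIndex = [
  { key: "abba@couchbase.com", id: "u::1" },
  { key: "beta@couchbase.com", id: "u::7" },
  { key: "jasdeep@couchbase.com", id: "u::2" },
  { key: "math@couchbase.com", id: "u::5" },
  { key: "matt@couchbase.com", id: "u::6" },
  { key: "yeti@couchbase.com", id: "u::4" },
  { key: "zorro@couchbase.com", id: "u::3" },
];

// Hypothetical helper: supports key, keys, or an inclusive startkey/endkey range.
function queryIndex(index, { key, keys, startkey, endkey } = {}) {
  if (key !== undefined) {
    return index.filter((row) => row.key === key);
  }
  if (keys !== undefined) {
    return index.filter((row) => keys.includes(row.key));
  }
  return index.filter(
    (row) =>
      (startkey === undefined || row.key >= startkey) &&
      (endkey === undefined || row.key <= endkey)
  );
}

// ?startkey="b1"&endkey="zn"
const rangeRows = queryIndex(emailIndex, { startkey: "b1", endkey: "zn" });
// ?key="math@couchbase.com"
const singleRow = queryIndex(emailIndex, { key: "math@couchbase.com" });
// ?keys=["math@couchbase.com", "yeti@couchbase.com"]
const multiRows = queryIndex(emailIndex, {
  keys: ["math@couchbase.com", "yeti@couchbase.com"],
});

console.log(rangeRows.map((r) => r.id)); // u::7, u::2, u::5, u::6, u::4
console.log(singleRow.map((r) => r.id)); // u::5
console.log(multiRows.map((r) => r.id)); // u::5, u::4
```

Note that the range ends before zorro@couchbase.com, because "zo" sorts after "zn"; moving the window to startkey "bz" and endkey "zz" drops beta@couchbase.com and picks up zorro@couchbase.com instead.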
How does it work?
Indexing and Querying (diagram: App Server 1 and App Server 2, each with the Couchbase client library and a cluster map, send a query to a Couchbase Server cluster of three servers; each server holds active docs and replica docs; user-configured replica count = 1). Indexing work is distributed amongst nodes: large data sets are possible and the effort is parallelized; Each node has the index for the data stored on it; Queries combine the results from the required nodes.
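"Queries combine the results from required nodes" is a scatter/gather pattern: each node answers from its local index, and the partial answers are merged into one sorted result. The sketch below assumes three nodes with made-up index contents; `nodeIndexes` and `scatterGather` are hypothetical names, and the real server performs this merge internally.

```javascript
// Assumption: each node indexes only the documents it stores (sample data).
const nodeIndexes = {
  server1: [
    { key: "abba@couchbase.com", id: "u::1" },
    { key: "math@couchbase.com", id: "u::5" },
  ],
  server2: [
    { key: "beta@couchbase.com", id: "u::7" },
    { key: "yeti@couchbase.com", id: "u::4" },
  ],
  server3: [
    { key: "jasdeep@couchbase.com", id: "u::2" },
    { key: "zorro@couchbase.com", id: "u::3" },
  ],
};

// Scatter the range query to every node, then gather and merge by key.
function scatterGather(nodes, startkey, endkey) {
  const partials = Object.values(nodes).map((localIndex) =>
    localIndex.filter((row) => row.key >= startkey && row.key <= endkey)
  );
  return partials
    .flat()
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

const merged = scatterGather(nodeIndexes, "b", "z");
console.log(merged.map((r) => r.key));
```

Each node only filters its own small index, so the work grows with the data per node rather than with the whole data set.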
Couchbase Server 2.0: Views. Views can cover a few different use cases: Primary index; Simple secondary indexes (the most common); Complex secondary, tertiary and composite indexes; Aggregation functions (reduction), for example counting the number of North American Ales; Organizing related data. Built using Map/Reduce: the map function creates a matrix from document fields; the reduce function summarizes (reduces) information.
Distributed Index Build Phase. Optimized for lookups, in-order access and aggregations; All view reads come from disk (different performance profile); View builds run against every document on every node, which is why you should group views in a design document; Automatically kept up to date; Incremental Map Reduce.
Dynamic Range Queries with Optional Aggregation. Efficiently fetch a row or a group of related rows; Queries use cached values from B-tree inner nodes when possible; Take advantage of in-order tree traversal with group_level queries. Example: ?startkey="J"&endkey="K" returns {"rows":[{"key":"Juneau","value":null}]}. (Diagram: Servers 1 to 3, each holding active docs and replica docs.)
Append Only Index. Disk activity is slow; Updating disk blocks is very slow; Appending new data to the end of the current file is fast; Overhead of reverse reading is small; Because existing blocks are not re-used, this can lead to fragmentation; Couchbase will compact the index automatically. (Diagram: the View Processor writes changed documents to disk; the original file with the appended data.)
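Adding a new document (B-tree diagram with a count reduction stored in each node). Keys: A B C D F G H I K L M N O Q R, where M is the new key. Before: root A-R = 14; inner nodes A-H = 7 and I-R = 7; leaves A-C = 3, D-F = 2, G-H = 2, I-L = 3, N-R = 4. After: the leaf becomes M-R = 5, with new reductions I-R = 8 and a new root A-R = 15.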
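The numbers on this slide can be reproduced with a tiny immutable tree in which every node stores a count. This is a sketch under simplifying assumptions: `leaf`, `inner` and `insertAt` are hypothetical helpers, the path to the target leaf is passed in explicitly instead of being found by key comparison, and node splitting is left out.

```javascript
// A leaf stores keys; its reduction is the number of keys.
function leaf(keys) {
  return { keys: keys, count: keys.length };
}

// An inner node stores children; its reduction is the sum of their counts.
function inner(children) {
  return { children: children, count: children.reduce((sum, c) => sum + c.count, 0) };
}

// Tree from the slide before the insert: root A-R = 14.
const oldRoot = inner([
  inner([leaf(["A", "B", "C"]), leaf(["D", "F"]), leaf(["G", "H"])]),
  inner([leaf(["I", "K", "L"]), leaf(["N", "O", "Q", "R"])]),
]);

// Append-only style insert: build new nodes along the path, reuse the rest.
function insertAt(node, path, key) {
  if (path.length === 0) {
    return leaf([...node.keys, key].sort());
  }
  const [head, ...rest] = path;
  const children = node.children.slice();
  children[head] = insertAt(children[head], rest, key);
  return inner(children);
}

// Insert M into the second leaf of the second inner node.
const newRoot = insertAt(oldRoot, [1, 1], "M");
console.log(oldRoot.count, newRoot.count); // 14 15
console.log(newRoot.children[1].count); // 8
console.log(newRoot.children[0] === oldRoot.children[0]); // true, subtree A-H is reused
```

Only three nodes are rewritten (leaf, inner node, root), and the old root still describes the previous state, which is what makes appending to the end of the file sufficient.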
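What about Reduce? Out-of-the-box functions: _count(), _sum(), _stats(). Create your own if needed:
    function (key, values, rereduce) {
      if (rereduce) {
        // values holds counts already computed by earlier reduce calls: add them up
        var result = 0;
        for (var i = 0; i < values.length; i++) {
          result += values[i];
        }
        return result;
      } else {
        // values holds the raw values emitted by the map function: count them
        return values.length;
      }
    }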
Reduce Function. Receives a key and arrays of values as parameters; Written in JavaScript; Called after the map function; Used to reduce the result of a map to single values; Used with grouping; Can be ignored when querying, so the same index is reused.
Reduce in Action
Map() result:
    Key                       Value
    Belgian-Style Dubbel      1
    Belgian-Style Dubbel      1
    Belgian-Style Dubbel      1
    Belgian-Style Pale Ale    1
    Belgian-Style White       1
    Belgian-Style White       1
    ...                       ...
Reduce() with _count() result:
    Key                       Value
    Belgian-Style Dubbel      3
    Belgian-Style Pale Ale    1
    Belgian-Style White       2
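The grouped count above, including the rereduce step, can be simulated in a few lines. This is a sketch and not the view engine: `groupAndReduce` is a hypothetical driver, the chunk size of 2 is an arbitrary assumption used to force a rereduce, and the key is passed as a plain string, which simplifies what the server actually hands to the reduce function.

```javascript
// Custom count: count raw values first, sum partial counts on rereduce.
function countReduce(key, values, rereduce) {
  if (rereduce) {
    return values.reduce((sum, partial) => sum + partial, 0);
  }
  return values.length;
}

// Rows as emitted by the map function on the slide.
const mapRows = [
  ["Belgian-Style Dubbel", 1],
  ["Belgian-Style Dubbel", 1],
  ["Belgian-Style Dubbel", 1],
  ["Belgian-Style Pale Ale", 1],
  ["Belgian-Style White", 1],
  ["Belgian-Style White", 1],
];

// Hypothetical driver: group by key, reduce in chunks, rereduce the partials.
function groupAndReduce(rowsToReduce, reduceFn, chunkSize) {
  const groups = new Map();
  for (const [key, value] of rowsToReduce) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  const result = {};
  for (const [key, values] of groups) {
    const partials = [];
    for (let i = 0; i < values.length; i += chunkSize) {
      partials.push(reduceFn(key, values.slice(i, i + chunkSize), false));
    }
    result[key] =
      partials.length === 1 ? partials[0] : reduceFn(null, partials, true);
  }
  return result;
}

const counts = groupAndReduce(mapRows, countReduce, 2);
console.log(counts);
```

For "Belgian-Style Dubbel" the three values are reduced as a chunk of two and a chunk of one, giving partial counts 2 and 1, and the rereduce call adds them to 3. Returning `values.length` in the rereduce branch would give 2 instead, which is why the two branches differ.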
How to use it? Use the client SDK to call the view:
    // Classes come from com.couchbase.client.protocol.views (Java SDK 1.x)
    View view = client.getView("beer", "by_name");
    Query query = new Query();
    query.setIncludeDocs(true)
         .setLimit(20)
         .setRangeStart(ComplexKey.of(startKey))
         .setRangeEnd(ComplexKey.of(startKey + "\uefff"));
    ViewResponse result = client.query(view, query);
    for (ViewRow row : result) {
        ...
    }
Demonstration
Hadoop & Couchbase. Hadoop: Deals with Big Data; More is better than faster; Batch oriented; Usually used to extract/transform data; Fully distributed: Map, Shuffle, Reduce. Couchbase: Distributed: executed where the document is; Deals with indexing data; As fast as possible; Used to query the data in the database.
Map Reduce in Couchbase. As in many other NoSQL databases: used for queries! Indexes are distributed on each node of the cluster; Indexes are updated incrementally; Write your Map Reduce in JavaScript.
Thank you! tug@couchbase.com, @tgrall. Get Couchbase Server at http://www.couchbase.com/download