Analytics in MongoDB. Massimo Brignoli Principal Solutions
2 Agenda: Analytics in MongoDB?; Aggregation Framework; Aggregation Pipeline Stages; Aggregation Framework in Action; Joins in MongoDB 3.2; Integrations; Analytical Architectures
3 Relational Expressive Query Language & Secondary Indexes Strong Consistency Enterprise Management & Integrations
4 The World Has Changed Data Time Volume Velocity Variety Iterative Agile Short Cycles Risk Always On Secure Global Cost Open-Source Cloud Commodity
5 NoSQL Expressive Query Language & Secondary Indexes Flexibility Strong Consistency Scalability & Performance Enterprise Management & Integrations Always On, Global Deployments
6 Nexus Architecture Expressive Query Language & Secondary Indexes Flexibility Strong Consistency Scalability & Performance Enterprise Management & Integrations Always On, Global Deployments
7 Some Common MongoDB Use Cases Single View Internet of Things Mobile Real-Time Analytics Catalog Personalization Content Management
8 MongoDB in Research
9 Analytics in MongoDB?
10 Analytics in MongoDB? Operational workloads are CRUD: Create, Read, Update, Delete. Analytics adds: Group, Count, Derive Values, Filter, Average, Sort.
11 Analytics on MongoDB Data. Hadoop Connector (MapReduce & HDFS): extract data from MongoDB and perform complex analytics with Hadoop; batch rather than real-time; extra nodes to manage. Spark Connector: direct access to MongoDB from Spark. MongoDB BI Connector: direct SQL access from BI tools. MongoDB aggregation pipeline: real-time, runs on the live, operational data set; narrower feature set.
12 For Example: US Census Data. Census data from 1990, 2000, 2010. Question: Which US Division has the fastest growing population density? We only want to include states with more than 1M people, and divisions larger than 100K square miles. Division = a group of US States. Population density = # of people / area of division. Data is provided at the state level.
13 US Regions and Divisions
14 How would we solve this in SQL? SELECT GROUP BY HAVING
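A sketch of what that SQL could look like, assuming a hypothetical states table with name, division, aream (area in square miles) and per-census population columns (all names here are illustrative, not from the data set):

```sql
-- Hypothetical relational version of the census question:
-- fastest growing population density per division.
SELECT division,
       SUM(pop2010) / SUM(aream) - SUM(pop1990) / SUM(aream) AS density_delta
FROM states
WHERE pop2010 > 1000000          -- states with more than 1M people
GROUP BY division
HAVING SUM(aream) > 100000       -- divisions larger than 100K sq. miles
ORDER BY density_delta DESC;
```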
15 Aggregation Framework
16 Aggregation Framework
17 What is an Aggregation Pipeline? A series of document transformations: executed in stages; the original input is a collection; output is a cursor or a collection; the output of one stage is the input of the next; operations are executed in sequential order. A rich library of functions: filter, compute, group, and summarize data.
18 Aggregation Pipeline
19-25 Aggregation Pipeline (diagram, built up across these slides: documents flow through $match, then $project, then $lookup, then $group; each stage transforms the documents it receives and passes the results to the next stage)
26 Aggregation Pipeline Stages: $match (filter documents), $geoNear (geospherical query), $project (reshape documents), $lookup (new: left-outer equi joins), $unwind (expand documents), $group (summarize documents), $sample (new: randomly selects a subset of documents), $sort (order documents), $skip (jump over a number of documents), $limit (limit number of documents), $redact (restrict documents), $out (sends results to a new collection)
27 Aggregation Framework in Action (let's play with the census data)
28 MongoDB State Collection Document For Each State Name Region Division Census Data For 1990, 2000, 2010 Population Housing Units Occupied Housing Units Census Data is an array with three subdocuments
29 Document Model
{ "_id" : ObjectId("54e23c7b…0d2…"),
  "name" : "California",
  "region" : "West",
  "data" : [
    { "totalpop" : …, "totalhouse" : …, "occhouse" : …, "year" : 2000 },
    { "totalpop" : …, "totalhouse" : …, "occhouse" : …, "year" : 2010 },
    { "totalpop" : …, "totalhouse" : …, "occhouse" : …, "year" : 1990 }
  ]
}
30 Total US Area
db.cdata.aggregate([
  { "$group" : {
      "_id" : null,
      "totalarea" : { $sum : "$aream" },
      "avgarea" : { $avg : "$aream" }
  }}
])
31 $group: group documents by value (a field reference, object, or constant). Other output fields are computed: $max, $min, $avg, $sum, $addToSet, $push, $first, $last. Processes all data in memory by default.
32 Area By Region
db.cdata.aggregate([
  { "$group" : {
      "_id" : "$region",
      "totalarea" : { $sum : "$aream" },
      "avgarea" : { $avg : "$aream" },
      "numstates" : { $sum : 1 },
      "states" : { $push : "$name" }
  }}
])
33 Calculating Average State Area By Region
{ state: "New York", aream: 218, region: "North East" }
{ state: "New Jersey", aream: 90, region: "North East" }
{ state: "California", aream: 300, region: "West" }

{ $group: { _id: "$region", avgaream: { $avg: "$aream" } } }

{ _id: "North East", avgaream: 154 }
{ _id: "West", avgaream: 300 }
34 Calculating Total Area and State Count
{ state: "New York", aream: 218, region: "North East" }
{ state: "New Jersey", aream: 90, region: "North East" }
{ state: "California", aream: 300, region: "West" }

{ $group: { _id: "$region", totarea: { $sum: "$aream" }, scount: { $sum: 1 } } }

{ _id: "North East", totarea: 308, scount: 2 }
{ _id: "West", totarea: 300, scount: 1 }
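To see what $group is doing under the hood, here is a plain-JavaScript sketch (not MongoDB code) of the grouping above, on the same sample documents; the groupByRegion helper is ours, purely illustrative:

```javascript
// Plain-JavaScript sketch of the $group stage shown above:
// group by region, computing {$sum: "$aream"} and a document count.
const docs = [
  { state: "New York", aream: 218, region: "North East" },
  { state: "New Jersey", aream: 90, region: "North East" },
  { state: "California", aream: 300, region: "West" },
];

function groupByRegion(input) {
  const out = new Map();
  for (const doc of input) {
    const acc = out.get(doc.region) || { _id: doc.region, totarea: 0, scount: 0 };
    acc.totarea += doc.aream; // {$sum: "$aream"}
    acc.scount += 1;          // {$sum: 1}
    out.set(doc.region, acc);
  }
  return [...out.values()];
}

console.log(groupByRegion(docs));
// [ { _id: 'North East', totarea: 308, scount: 2 },
//   { _id: 'West', totarea: 300, scount: 1 } ]
```

Like $group, this is a single pass that keeps one accumulator per key, which is also why $group holds its state in memory.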
35 Total US Population By Year
db.cdata.aggregate([
  { $unwind : "$data" },
  { $group : { "_id" : "$data.year", "totalpop" : { $sum : "$data.totalpop" } } },
  { $sort : { "totalpop" : 1 } }
])
36 $unwind: operates on an array field, creating one document per array element; the array is replaced by the element value. Missing/empty fields produce no output; non-array fields raise an error. Pipe to $group to aggregate.
37 $unwind
{ state: "New York", census: [1990, 2000, 2010] }
{ state: "New Jersey", census: [1990, 2000] }
{ state: "California", census: [1980, 1990, 2000, 2010] }
{ state: "Delaware", census: [1990, 2000] }

{ $unwind: "$census" }

{ state: "New York", census: 1990 }
{ state: "New York", census: 2000 }
{ state: "New York", census: 2010 }
{ state: "New Jersey", census: 1990 }
{ state: "New Jersey", census: 2000 }
...
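The expansion above can be sketched in plain JavaScript (illustrative only, not MongoDB code): one output document per array element, with the array field replaced by the element value.

```javascript
// $unwind sketch: flatMap each document into one document per
// census element; the census array becomes a scalar value.
const docs = [
  { state: "New York", census: [1990, 2000, 2010] },
  { state: "New Jersey", census: [1990, 2000] },
];

const unwound = docs.flatMap(doc =>
  (doc.census || []).map(el => ({ ...doc, census: el }))
);
// A missing or empty census array yields no output documents,
// matching $unwind's default behavior.

console.log(unwound.length); // 5
console.log(unwound[0]);     // { state: 'New York', census: 1990 }
```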
38 Southern State Population By Year
db.cdata.aggregate([
  { $match : { "region" : "South" } },
  { $unwind : "$data" },
  { $group : { "_id" : "$data.year", "totalpop" : { "$sum" : "$data.totalpop" } } }
])
39 $match: filter documents. Uses the existing query syntax, same as .find().
40 $match
{ state: "New York", aream: 218, region: "North East" }
{ state: "Oregon", aream: 245, region: "West" }
{ state: "California", aream: 300, region: "West" }

{ $match: { region: "West" } }

{ state: "Oregon", aream: 245, region: "West" }
{ state: "California", aream: 300, region: "West" }
41 Population Delta By State from 1990 to 2010
db.cdata.aggregate([
  { $unwind : "$data" },
  { $sort : { "data.year" : 1 } },
  { $group : {
      "_id" : "$name",
      "pop1990" : { "$first" : "$data.totalpop" },
      "pop2010" : { "$last" : "$data.totalpop" } } },
  { $project : {
      "_id" : 0,
      "name" : "$_id",
      "delta" : { "$subtract" : ["$pop2010", "$pop1990"] },
      "pop1990" : 1,
      "pop2010" : 1 } }
])
42 $sort, $limit, $skip: sort documents by one or more fields, with the same order syntax as cursors. $sort waits for earlier pipeline operators to return, and sorts in memory unless it comes early and can use an index. $limit and $skip follow cursor behavior.
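A plain-JavaScript sketch (illustrative, not MongoDB code) of chaining the three stages on made-up state areas:

```javascript
// {$sort: {aream: -1}}, {$skip: 1}, {$limit: 2} sketched with arrays.
const docs = [
  { state: "New York", aream: 218 },
  { state: "New Jersey", aream: 90 },
  { state: "California", aream: 300 },
  { state: "Oregon", aream: 245 },
];

const result = [...docs]
  .sort((a, b) => b.aream - a.aream) // {$sort: {aream: -1}}
  .slice(1)                          // {$skip: 1}
  .slice(0, 2);                      // {$limit: 2}

console.log(result.map(d => d.state)); // [ 'Oregon', 'New York' ]
```

Note the sort must see the whole input before skip/limit can run, which mirrors why $sort is blocking in the pipeline.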
43 $first, $last: accumulator operations like $push and $addToSet; must be used in $group. $first and $last are determined by document order, so they are typically used with $sort to ensure the ordering is known.
44 $project Reshape Documents Include, exclude or rename fields Inject computed fields Create sub-document fields
45 Including and Excluding Fields
{ "_id" : "Virginia", "pop1990" : …, "pop2010" : … }
{ "_id" : "South Dakota", "pop1990" : …, "pop2010" : … }

{ $project: { _id : 0, pop1990 : 1, pop2010 : 1 } }

{ "pop1990" : …, "pop2010" : … }
{ "pop1990" : …, "pop2010" : … }
46 Renaming and Computing Fields
{ "_id" : "Virginia", "pop1990" : …, "pop2010" : … }
{ "_id" : "South Dakota", "pop1990" : …, "pop2010" : … }

{ $project: { _id : 0, name : "$_id",
              "delta" : { "$subtract" : ["$pop2010", "$pop1990"] } } }

{ "name" : "Virginia", "delta" : … }
{ "name" : "South Dakota", "delta" : … }
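The rename-and-compute projection can be sketched in plain JavaScript (illustrative only; the population figures here are made up, not census values):

```javascript
// $project sketch: drop _id from the output, surface it as name,
// and compute delta = pop2010 - pop1990.
const docs = [
  { _id: "Virginia", pop1990: 6000000, pop2010: 8000000 },
  { _id: "South Dakota", pop1990: 700000, pop2010: 800000 },
];

const projected = docs.map(d => ({
  name: d._id,                  // name: "$_id"
  delta: d.pop2010 - d.pop1990, // {$subtract: ["$pop2010", "$pop1990"]}
}));

console.log(projected);
// [ { name: 'Virginia', delta: 2000000 },
//   { name: 'South Dakota', delta: 100000 } ]
```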
47 Compare number of people living within 500KM of Memphis, TN in 1990, 2000, 2010
48 Compare number of people living within 500KM of Memphis, TN in 1990, 2000, 2010
db.cdata.aggregate([
  { $geoNear : {
      "near" : { "type" : "Point", "coordinates" : [90, 35] },
      distanceField : "dist.calculated",
      maxDistance : 500000,
      includeLocs : "dist.location",
      spherical : true } },
  { $unwind : "$data" },
  { $group : {
      "_id" : "$data.year",
      "totalpop" : { "$sum" : "$data.totalpop" },
      "states" : { "$addToSet" : "$name" } } },
  { $sort : { "_id" : 1 } }
])
49 $geoNear: order/filter documents by location. Requires a geospatial index; output includes the physical distance; must be the first aggregation stage.
50 $geoNear
{ "_id" : "Virginia", "pop1990" : …, "pop2010" : …,
  center : { type : "Point", coordinates : [78.6, 37.5] } }
{ "_id" : "Tennessee", "pop1990" : …, "pop2010" : …,
  center : { type : "Point", coordinates : [86.6, 37.8] } }

{ $geoNear : { "near" : { "type" : "Point", "coordinates" : [90, 35] },
               maxDistance : 500000, spherical : true } }

{ "_id" : "Tennessee", "pop1990" : …, "pop2010" : …,
  center : { type : "Point", coordinates : [86.6, 37.8] } }
51 What if I want to save the results to a collection?
db.cdata.aggregate([
  { $geoNear : {
      "near" : { "type" : "Point", "coordinates" : [90, 35] },
      distanceField : "dist.calculated",
      maxDistance : 500000,
      includeLocs : "dist.location",
      spherical : true } },
  { $unwind : "$data" },
  { $group : {
      "_id" : "$data.year",
      "totalpop" : { "$sum" : "$data.totalpop" },
      "states" : { "$addToSet" : "$name" } } },
  { $sort : { "_id" : 1 } },
  { $out : "peoplenearmemphis" }
])
52 $out
db.cdata.aggregate([ <pipeline stages>, { $out : "resultscollection" } ])
Saves aggregation results to a new collection. Enables new aggregation uses: transforming documents, ETL.
53 Back To The Original Question Which US Division has the fastest growing population density? We only want to include data states with more than 1M people We only want to include divisions larger than 100K square miles
54 Division with Fastest Growing Pop Density
db.cdata.aggregate([
  { $match : { "data.totalpop" : { "$gt" : 1000000 } } },
  { $unwind : "$data" },
  { $sort : { "data.year" : 1 } },
  { $group : {
      "_id" : "$name",
      "pop1990" : { "$first" : "$data.totalpop" },
      "pop2010" : { "$last" : "$data.totalpop" },
      "aream" : { "$first" : "$aream" },
      "division" : { "$first" : "$division" } } },
  { $group : {
      "_id" : "$division",
      "totalpop1990" : { "$sum" : "$pop1990" },
      "totalpop2010" : { "$sum" : "$pop2010" },
      "totalaream" : { "$sum" : "$aream" } } },
  { $match : { "totalaream" : { "$gt" : 100000 } } },
  { $project : {
      "_id" : 0,
      "division" : "$_id",
      "density1990" : { "$divide" : ["$totalpop1990", "$totalaream"] },
      "density2010" : { "$divide" : ["$totalpop2010", "$totalaream"] },
      "dendelta" : { "$subtract" : [
          { "$divide" : ["$totalpop2010", "$totalaream"] },
          { "$divide" : ["$totalpop1990", "$totalaream"] } ] },
      "totalaream" : 1,
      "totalpop1990" : 1,
      "totalpop2010" : 1 } },
  { $sort : { "dendelta" : -1 } }
])
55 Aggregate Options
56 Aggregate options
db.cdata.aggregate([ <pipeline stages> ],
  { 'explain' : false,
    'allowDiskUse' : true,
    'cursor' : { 'batchSize' : 5 } })
explain: similar to find().explain(). allowDiskUse: enable use of disk to store intermediate results. cursor: specify the size of the initial result batch.
57 Aggregation and Sharding
58 Sharding: the workload is split between shards. Shards execute the pipeline up to a point; the primary shard merges cursors and continues processing*. Use explain to analyze the pipeline split. An early $match can exclude shards. Potential CPU and memory implications for the primary shard host. *Prior to v2.6, second-stage pipeline processing was done by mongos.
59 MongoDB 3.2: Joins and other improvements
60 Existing Alternatives to Joins
orders:
{ "_id": 10000,
  "items": [
    { "productname": "laptop", "unitprice": 1000, "weight": 1.2, "remainingstock": 23 },
    { "productname": "mouse", "unitprice": 20, "weight": 0.2, "remainingstock": 276 } ],
  ... }
Option 1: Include all data for an order in the same document.
Fast reads: one find delivers all the required data, and captures the full description at the time of the event.
Consumes extra space: details of each product are stored in many order documents.
Complex to maintain: a change to any product attribute must be propagated to all affected orders.
61 Existing Alternatives to Joins
orders:
{ "_id": 10000, "items": [ 12345, 54321 ], ... }
products:
{ "_id": 12345, "productname": "laptop", "unitprice": 1000, "weight": 1.2, "remainingstock": 23 }
{ "_id": 54321, "productname": "mouse", "unitprice": 20, "weight": 0.2, "remainingstock": 276 }
Option 2: Order document references product documents.
Slower reads: multiple trips to the database.
Space efficient: product details are stored once.
Loses the point-in-time snapshot of the full record.
Extra application logic: must iterate over product IDs in the order document and find the product documents; an RDBMS would automate this through a JOIN.
62 The Winner? In general, Option 1 wins Performance and containment of everything in same place beats space efficiency of normalization There are exceptions e.g. Comments in a blog post -> unbounded size However, analytics benefit from combining data from multiple collections Keep listening...
63 $lookup Left Collection Right Collection Left-outer join Includes all documents from the left collection For each document in the left collection, find the matching documents from the right collection and embed them
64 $lookup
db.leftCollection.aggregate([
  { $lookup: {
      from: "rightCollection",
      localField: "leftVal",
      foreignField: "rightVal",
      as: "embeddedData" } }
])
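The left-outer-join semantics above can be sketched in plain JavaScript (illustrative only; the lookup helper and the sample data are ours, not a MongoDB API): every left document is kept, and matching right documents are embedded as an array, empty when nothing matches.

```javascript
// $lookup sketch: left-outer equi join, embedding matches as an array.
const left = [
  { _id: 1, leftVal: "a" },
  { _id: 2, leftVal: "z" },
];
const right = [
  { _id: 10, rightVal: "a", payload: "x" },
  { _id: 11, rightVal: "a", payload: "y" },
];

function lookup(leftColl, rightColl, localField, foreignField, as) {
  return leftColl.map(l => ({
    ...l,
    // All right documents whose foreignField equals this localField.
    [as]: rightColl.filter(r => r[foreignField] === l[localField]),
  }));
}

const joined = lookup(left, right, "leftVal", "rightVal", "embeddedData");
console.log(joined[0].embeddedData.length); // 2
console.log(joined[1].embeddedData.length); // 0 (left doc still present)
```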
65 Worked Example Data Set
db.postcodes.findOne()
{ "_id": ObjectId("…e50fa77da54dfc…0d2"),
  "postcode": "SL6 0AA",
  "location": { "type": "Point", "coordinates": [ …, … ] } }
db.homesales.findOne()
{ "_id": ObjectId("56005dd980c3678b19792b7f"),
  "amount": 9000,
  "date": ISODate("…T00:00:00Z"),
  "address": { "nameornumber": 25,
               "street": "NORFOLK PARK COTTAGES",
               "town": "MAIDENHEAD",
               "county": "WINDSOR AND MAIDENHEAD",
               "postcode": "SL6 7DR" } }
66 Reduce Data Set First
db.homesales.aggregate([
  { $match: { amount: { $gte: … } } }
])
{ "_id": ObjectId("56005dda80c3678b19799e52"),
  "amount": …,
  "date": ISODate("…T00:00:00Z"),
  "address": { "nameornumber": "TEMPLE FERRY PLACE",
               "street": "MILL LANE",
               "town": "MAIDENHEAD",
               "county": "WINDSOR AND MAIDENHEAD",
               "postcode": "SL6 5ND" } },
...
67 Join (left-outer-equi) Results With Second Collection
db.homesales.aggregate([
  { $match: { amount: { $gte: … } } },
  { $lookup: {
      from: "postcodes",
      localField: "address.postcode",
      foreignField: "postcode",
      as: "postcode_docs" } }
])
...
  "county": "WINDSOR AND MAIDENHEAD",
  "postcode": "SL6 5ND" },
  "postcode_docs": [
    { "_id": ObjectId("560053e280c3678b1978b293"),
      "postcode": "SL6 5ND",
      "location": { "type": "Point", "coordinates": [ …, … ] } } ] },
...
68 Refactor Each Resulting Document
...,
{ $project: {
    _id: 0,
    saledate: "$date",
    price: "$amount",
    address: 1,
    location: { $arrayElemAt: ["$postcode_docs.location", 0] } } }
])
{ "address": { "nameornumber": "TEMPLE FERRY PLACE",
               "street": "MILL LANE",
               "town": "MAIDENHEAD",
               "county": "WINDSOR AND MAIDENHEAD",
               "postcode": "SL6 5ND" },
  "saledate": ISODate("…T00:00:00Z"),
  "price": …,
  "location": { "type": "Point", "coordinates": [ …, … ] } },
...
69 Sort on Sale Price & Write to Collection
...,
{ $sort: { price: -1 } },
{ $out: "hotspots" }
])
{ "address": { "nameornumber": "2-3",
               "street": "THE SWITCHBACK",
               "town": "MAIDENHEAD",
               "county": "WINDSOR AND MAIDENHEAD",
               "postcode": "SL6 7RJ" },
  "saledate": ISODate("…T00:00:00Z"),
  "price": …,
  "location": { "type": "Point", "coordinates": [ …, … ] } },
...
70 Aggregated Statistics
db.homesales.aggregate([
  { $group: {
      _id: { $year: "$date" },
      highestprice: { $max: "$amount" },
      lowestprice: { $min: "$amount" },
      averageprice: { $avg: "$amount" },
      amountstddev: { $stdDevPop: "$amount" } } }
])
...
{ "_id": 1995, "highestprice": …, "lowestprice": 12000, "averageprice": …, "amountstddev": … },
{ "_id": 1996, "highestprice": …, "lowestprice": 9000, "averageprice": …, "amountstddev": … },
...
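$stdDevPop computes the population standard deviation; a plain-JavaScript sketch of that calculation on made-up sale amounts (not the real data set):

```javascript
// Population standard deviation, as $stdDevPop computes it:
// sqrt(mean of squared deviations from the mean).
const amounts = [9000, 12000, 15000];

const mean = amounts.reduce((s, x) => s + x, 0) / amounts.length;
const variance =
  amounts.reduce((s, x) => s + (x - mean) ** 2, 0) / amounts.length;
const stdDev = Math.sqrt(variance);

console.log(mean); // 12000
console.log(stdDev.toFixed(2)); // population std dev of the three amounts
```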
71 Clean Up Output
...,
{ $project: {
    _id: 0,
    year: "$_id",
    highestprice: 1,
    lowestprice: 1,
    averageprice: { $trunc: "$averageprice" },
    pricestddev: { $trunc: "$amountstddev" } } }
])
...
{ "highestprice": …, "lowestprice": 12000, "averageprice": …, "year": 1995, "pricestddev": … },
{ "highestprice": …, "lowestprice": 10500, "averageprice": …, "year": 2004, "pricestddev": … },
...
72 Integrations
73 Hadoop Connector
74 Mongo-Hadoop Connector: turns MongoDB into a Hadoop-enabled filesystem: use it as the input or output for Hadoop. Works with MongoDB backup files (.bson). (diagram: input data flows from MongoDB, or from .bson backup files, into a Hadoop cluster)
75 Benefits and Features Takes advantage of full multi-core parallelism to process data in Mongo Full integration with Hadoop and JVM ecosystems Can be used with Amazon Elastic MapReduce Can read and write backup files from local filesystem, HDFS, or S3
76 Benefits and Features: vanilla Java MapReduce or, if you don't want to use Java, support for Hadoop Streaming, so MapReduce code can be written in other languages.
77 Benefits and Features: support for Pig, a high-level scripting language for data analysis and building map/reduce workflows. Support for Hive, an SQL-like language for ad-hoc queries and analysis of data sets on Hadoop-compatible file systems.
78 How It Works Adapter examines the MongoDB input collection and calculates a set of splits from the data Each split gets assigned to a node in Hadoop cluster In parallel, Hadoop nodes pull data for splits from MongoDB (or BSON) and process them locally Hadoop merges results and streams output back to MongoDB or BSON
79 BI Connector
80 MongoDB Connector for BI Visualize and explore multi-dimensional documents using SQL-based BI tools. The connector does the following: Provides the BI tool with the schema of the MongoDB collection to be visualized Translates SQL statements issued by the BI tool into equivalent MongoDB queries that are sent to MongoDB for processing Converts the results into the tabular format expected by the BI tool, which can then visualize the data based on user requirements
81 Location & Flow of Data (diagram: application data is stored as documents in MongoDB, e.g. {name: "Andrew", address: {street: …}}; the BI Connector holds the table mapping metadata and sits between MongoDB and the analytics & visualization tools, which see tabular data)
82 Defining Data Mapping
mongodrdl generates a DRDL mapping file from a MongoDB collection; mongobischema imports it into the PostgreSQL-based connector via the MongoDB-specific Foreign Data Wrapper.
mongodrdl --host … --port … -d mydbname -o mydrdlfile.drdl
mongobischema import mycollectionname mydrdlfile.drdl
83 Optionally Manually Edit DRDL File
Redact attributes; use more appropriate types (sampling can get it wrong); rename tables (v1.1+); rename columns (v1.1+); build new views using the MongoDB Aggregation Framework, e.g. $lookup to join 2 tables.
- table: homesales
  collection: homesales
  pipeline: []
  columns:
  - name: _id
    mongotype: bson.ObjectId
    sqlname: _id
    sqltype: varchar
  - name: address.county
    mongotype: string
    sqlname: address_county
    sqltype: varchar
  - name: address.nameornumber
    mongotype: int
    sqlname: address_nameornumber
    sqltype: varchar
84 Summary
85 Analytics in MongoDB? YES! Create, Read, Update, Delete plus Group, Count, Derive Values, Filter, Average, Sort.
86 Aggregation Framework Use Cases: complex aggregation queries, ad-hoc reporting, real-time analytics, visualizing and reshaping data.
87 Questions?
More information1 Dulcian, Inc., 2001 All rights reserved. Oracle9i Data Warehouse Review. Agenda
Agenda Oracle9i Warehouse Review Dulcian, Inc. Oracle9i Server OLAP Server Analytical SQL Mining ETL Infrastructure 9i Warehouse Builder Oracle 9i Server Overview E-Business Intelligence Platform 9i Server:
More information1Z Oracle Big Data 2017 Implementation Essentials Exam Summary Syllabus Questions
1Z0-449 Oracle Big Data 2017 Implementation Essentials Exam Summary Syllabus Questions Table of Contents Introduction to 1Z0-449 Exam on Oracle Big Data 2017 Implementation Essentials... 2 Oracle 1Z0-449
More informationUnifying Big Data Workloads in Apache Spark
Unifying Big Data Workloads in Apache Spark Hossein Falaki @mhfalaki Outline What s Apache Spark Why Unification Evolution of Unification Apache Spark + Databricks Q & A What s Apache Spark What is Apache
More informationTopics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples
Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?
More informationImpala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam
Impala A Modern, Open Source SQL Engine for Hadoop Yogesh Chockalingam Agenda Introduction Architecture Front End Back End Evaluation Comparison with Spark SQL Introduction Why not use Hive or HBase?
More informationQuery Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016
Query Processing Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Slides re-used with some modification from www.db-book.com Reference: Database System Concepts, 6 th Ed. By Silberschatz,
More informationMODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS
MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS SUJEE MANIYAM FOUNDER / PRINCIPAL @ ELEPHANT SCALE www.elephantscale.com sujee@elephantscale.com HI, I M SUJEE MANIYAM Founder / Principal @ ElephantScale
More informationBlended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)
Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance
More informationMongoDB Schema Design
MongoDB Schema Design Demystifying document structures in MongoDB Jon Tobin @jontobs MongoDB Overview NoSQL Document Oriented DB Dynamic Schema HA/Sharding Built In Simple async replication setup Automated
More informationAccelerate Database Performance and Reduce Response Times in MongoDB Humongous Environments with the LSI Nytro MegaRAID Flash Accelerator Card
Accelerate Database Performance and Reduce Response Times in MongoDB Humongous Environments with the LSI Nytro MegaRAID Flash Accelerator Card The Rise of MongoDB Summary One of today s growing database
More informationIt also suggests creating more reduce tasks to deal with problems in keeping reducer inputs in memory.
Reference 1. Processing Theta-Joins using MapReduce, Alper Okcan, Mirek Riedewald Northeastern University 1.1. Video of presentation on above paper 2. Optimizing Joins in map-reduce environment F.N. Afrati,
More informationCSE 190D Spring 2017 Final Exam Answers
CSE 190D Spring 2017 Final Exam Answers Q 1. [20pts] For the following questions, clearly circle True or False. 1. The hash join algorithm always has fewer page I/Os compared to the block nested loop join
More informationIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce Who Am I - Ryan Tabora - Data Developer at Think Big Analytics - Big Data Consulting - Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc. 2 Who Am I -
More informationAzure Data Factory VS. SSIS. Reza Rad, Consultant, RADACAD
Azure Data Factory VS. SSIS Reza Rad, Consultant, RADACAD 2 Please silence cell phones Explore Everything PASS Has to Offer FREE ONLINE WEBINAR EVENTS FREE 1-DAY LOCAL TRAINING EVENTS VOLUNTEERING OPPORTUNITIES
More informationAdvanced Database Systems
Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed
More informationMONGODB INTERVIEW QUESTIONS
MONGODB INTERVIEW QUESTIONS http://www.tutorialspoint.com/mongodb/mongodb_interview_questions.htm Copyright tutorialspoint.com Dear readers, these MongoDB Interview Questions have been designed specially
More informationLambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015
Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL May 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document
More informationApache Ignite TM - In- Memory Data Fabric Fast Data Meets Open Source
Apache Ignite TM - In- Memory Data Fabric Fast Data Meets Open Source DMITRIY SETRAKYAN Founder, PPMC https://ignite.apache.org @apacheignite @dsetrakyan Agenda About In- Memory Computing Apache Ignite
More informationT-SQL Training: T-SQL for SQL Server for Developers
Duration: 3 days T-SQL Training Overview T-SQL for SQL Server for Developers training teaches developers all the Transact-SQL skills they need to develop queries and views, and manipulate data in a SQL
More informationInstructor : Dr. Sunnie Chung. Independent Study Spring Pentaho. 1 P a g e
ABSTRACT Pentaho Business Analytics from different data source, Analytics from csv/sql,create Star Schema Fact & Dimension Tables, kettle transformation for big data integration, MongoDB kettle Transformation,
More informationCERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI)
CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI) The Certificate in Software Development Life Cycle in BIGDATA, Business Intelligence and Tableau program
More informationTHE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES
1 THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES Vincent Garonne, Mario Lassnig, Martin Barisits, Thomas Beermann, Ralph Vigne, Cedric Serfon Vincent.Garonne@cern.ch ph-adp-ddm-lab@cern.ch XLDB
More informationBig Data Infrastructure at Spotify
Big Data Infrastructure at Spotify Wouter de Bie Team Lead Data Infrastructure September 26, 2013 2 Who am I? According to ZDNet: "The work they have done to improve the Apache Hive data warehouse system
More informationIntroduction to BigData, Hadoop:-
Introduction to BigData, Hadoop:- Big Data Introduction: Hadoop Introduction What is Hadoop? Why Hadoop? Hadoop History. Different types of Components in Hadoop? HDFS, MapReduce, PIG, Hive, SQOOP, HBASE,
More informationPart 1: Indexes for Big Data
JethroData Making Interactive BI for Big Data a Reality Technical White Paper This white paper explains how JethroData can help you achieve a truly interactive interactive response time for BI on big data,
More informationCloud Analytics and Business Intelligence on AWS
Cloud Analytics and Business Intelligence on AWS Enterprise Applications Virtual Desktops Sharing & Collaboration Platform Services Analytics Hadoop Real-time Streaming Data Machine Learning Data Warehouse
More informationHadoop, Yarn and Beyond
Hadoop, Yarn and Beyond 1 B. R A M A M U R T H Y Overview We learned about Hadoop1.x or the core. Just like Java evolved, Java core, Java 1.X, Java 2.. So on, software and systems evolve, naturally.. Lets
More informationData Analytics at Logitech Snowflake + Tableau = #Winning
Welcome # T C 1 8 Data Analytics at Logitech Snowflake + Tableau = #Winning Avinash Deshpande I am a futurist, scientist, engineer, designer, data evangelist at heart Find me at Avinash Deshpande Chief
More informationChapter 13: Query Processing
Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing
More informationMaking the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack. Chief Architect RainStor
Making the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack Chief Architect RainStor Agenda Importance of Hadoop + data compression Data compression techniques Compression,
More informationIntroduction to Database Services
Introduction to Database Services Shaun Pearce AWS Solutions Architect 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved Today s agenda Why managed database services? A non-relational
More informationOracle Big Data Connectors
Oracle Big Data Connectors Oracle Big Data Connectors is a software suite that integrates processing in Apache Hadoop distributions with operations in Oracle Database. It enables the use of Hadoop to process
More informationMapReduce programming model
MapReduce programming model technology basics for data scientists Spring - 2014 Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Warning! Slides are only for presenta8on guide We will discuss+debate
More informationCertified Big Data Hadoop and Spark Scala Course Curriculum
Certified Big Data Hadoop and Spark Scala Course Curriculum The Certified Big Data Hadoop and Spark Scala course by DataFlair is a perfect blend of indepth theoretical knowledge and strong practical skills
More informationFluentd + MongoDB + Spark = Awesome Sauce
Fluentd + MongoDB + Spark = Awesome Sauce Nishant Sahay, Sr. Architect, Wipro Limited Bhavani Ananth, Tech Manager, Wipro Limited Your company logo here Wipro Open Source Practice: Vision & Mission Vision
More informationBig Data Analytics using Apache Hadoop and Spark with Scala
Big Data Analytics using Apache Hadoop and Spark with Scala Training Highlights : 80% of the training is with Practical Demo (On Custom Cloudera and Ubuntu Machines) 20% Theory Portion will be important
More informationVOLTDB + HP VERTICA. page
VOLTDB + HP VERTICA ARCHITECTURE FOR FAST AND BIG DATA ARCHITECTURE FOR FAST + BIG DATA FAST DATA Fast Serve Analytics BIG DATA BI Reporting Fast Operational Database Streaming Analytics Columnar Analytics
More informationIntroduction to Big-Data
Introduction to Big-Data Ms.N.D.Sonwane 1, Mr.S.P.Taley 2 1 Assistant Professor, Computer Science & Engineering, DBACER, Maharashtra, India 2 Assistant Professor, Information Technology, DBACER, Maharashtra,
More informationmicrosoft
70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series
More informationScaling Up 1 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. Hadoop, Pig
CSE 6242 / CX 4242 Scaling Up 1 Hadoop, Pig Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Le
More informationDatabases 2 (VU) ( / )
Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:
More informationYour First MongoDB Environment: What You Should Know Before Choosing MongoDB as Your Database
Your First MongoDB Environment: What You Should Know Before Choosing MongoDB as Your Database Me - @adamotonete Adamo Tonete Senior Technical Engineer Brazil Agenda What is MongoDB? The good side of MongoDB
More informationGLADE: A Scalable Framework for Efficient Analytics. Florin Rusu (University of California, Merced) Alin Dobra (University of Florida)
DE: A Scalable Framework for Efficient Analytics Florin Rusu (University of California, Merced) Alin Dobra (University of Florida) Big Data Analytics Big Data Storage is cheap ($100 for 1TB disk) Everything
More informationDocument Databases: MongoDB
NDBI040: Big Data Management and NoSQL Databases hp://www.ksi.mff.cuni.cz/~svoboda/courses/171-ndbi040/ Lecture 9 Document Databases: MongoDB Marn Svoboda svoboda@ksi.mff.cuni.cz 28. 11. 2017 Charles University
More informationCassandra, MongoDB, and HBase. Cassandra, MongoDB, and HBase. I have chosen these three due to their recent
Tanton Jeppson CS 401R Lab 3 Cassandra, MongoDB, and HBase Introduction For my report I have chosen to take a deeper look at 3 NoSQL database systems: Cassandra, MongoDB, and HBase. I have chosen these
More information