Analytics in MongoDB. Massimo Brignoli Principal Solutions
2 Agenda: Analytics in MongoDB?; Aggregation Framework; Aggregation Pipeline Stages; Aggregation Framework in Action; Joins in MongoDB 3.2; Integrations; Analytical Architectures
3 Relational Expressive Query Language & Secondary Indexes Strong Consistency Enterprise Management & Integrations
4 The World Has Changed Data Time Volume Velocity Variety Iterative Agile Short Cycles Risk Always On Secure Global Cost Open-Source Cloud Commodity
5 NoSQL Expressive Query Language & Secondary Indexes Flexibility Strong Consistency Scalability & Performance Enterprise Management & Integrations Always On, Global Deployments
6 Nexus Architecture Expressive Query Language & Secondary Indexes Flexibility Strong Consistency Scalability & Performance Enterprise Management & Integrations Always On, Global Deployments
7 Some Common MongoDB Use Cases Single View Internet of Things Mobile Real-Time Analytics Catalog Personalization Content Management
8 MongoDB in Research
9 Analytics in MongoDB?
10 Analytics in MongoDB? Operational workloads are CRUD: Create, Read, Update, Delete. Analytics adds: Group, Count, Derive Values, Filter, Average, Sort.
11 Analytics on MongoDB Data. Hadoop Connector (MapReduce & HDFS): extract data from MongoDB and perform complex analytics with Hadoop; batch rather than real-time; extra nodes to manage. Spark Connector: direct access to MongoDB from Spark. MongoDB BI Connector: direct SQL access from BI tools. MongoDB aggregation pipeline: real-time, runs on the live, operational data set; narrower feature set.
12 For Example: US Census Data. Census data from 1990, 2000, 2010. Question: Which US Division has the fastest growing population density? We only want to include states with more than 1M people, and divisions larger than 100K square miles. Division = a group of US States. Population density = # of people / area of division. Data is provided at the state level.
13 US Regions and Divisions
14 How would we solve this in SQL? SELECT GROUP BY HAVING
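A sketch of what that SQL could look like, assuming a hypothetical states table with name, division, aream (area in square miles) and per-census population columns (all names here are illustrative, not from the data set):

```sql
-- Hypothetical relational version of the census question:
-- fastest growing population density per division.
SELECT division,
       SUM(pop2010) / SUM(aream) - SUM(pop1990) / SUM(aream) AS density_delta
FROM states
WHERE pop2010 > 1000000          -- states with more than 1M people
GROUP BY division
HAVING SUM(aream) > 100000       -- divisions larger than 100K sq. miles
ORDER BY density_delta DESC;
```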
15 Aggregation Framework
16 Aggregation Framework
17 What is an Aggregation Pipeline? A series of document transformations: executed in stages; the original input is a collection; output is a cursor or a collection; the output of one stage is the input of the next; operations are executed in sequential order. A rich library of functions: filter, compute, group, and summarize data.
18 Aggregation Pipeline
19-25 Aggregation Pipeline (diagram, built up across these slides: documents flow through $match, then $project, then $lookup, then $group; each stage transforms the documents it receives and passes the results to the next stage)
26 Aggregation Pipeline Stages: $match (filter documents), $geoNear (geospherical query), $project (reshape documents), $lookup (new: left-outer equi joins), $unwind (expand documents), $group (summarize documents), $sample (new: randomly selects a subset of documents), $sort (order documents), $skip (jump over a number of documents), $limit (limit number of documents), $redact (restrict documents), $out (sends results to a new collection)
27 Aggregation Framework in Action (let's play with the census data)
28 MongoDB State Collection Document For Each State Name Region Division Census Data For 1990, 2000, 2010 Population Housing Units Occupied Housing Units Census Data is an array with three subdocuments
29 Document Model
{ "_id" : ObjectId("54e23c7b…0d2…"),
  "name" : "California",
  "region" : "West",
  "data" : [
    { "totalpop" : …, "totalhouse" : …, "occhouse" : …, "year" : 2000 },
    { "totalpop" : …, "totalhouse" : …, "occhouse" : …, "year" : 2010 },
    { "totalpop" : …, "totalhouse" : …, "occhouse" : …, "year" : 1990 }
  ]
}
30 Total US Area
db.cdata.aggregate([
  { "$group" : {
      "_id" : null,
      "totalarea" : { $sum : "$aream" },
      "avgarea" : { $avg : "$aream" }
  }}
])
31 $group: group documents by value (a field reference, object, or constant). Other output fields are computed: $max, $min, $avg, $sum, $addToSet, $push, $first, $last. Processes all data in memory by default.
32 Area By Region
db.cdata.aggregate([
  { "$group" : {
      "_id" : "$region",
      "totalarea" : { $sum : "$aream" },
      "avgarea" : { $avg : "$aream" },
      "numstates" : { $sum : 1 },
      "states" : { $push : "$name" }
  }}
])
33 Calculating Average State Area By Region
{ state: "New York", aream: 218, region: "North East" }
{ state: "New Jersey", aream: 90, region: "North East" }
{ state: "California", aream: 300, region: "West" }

{ $group: { _id: "$region", avgaream: { $avg: "$aream" } } }

{ _id: "North East", avgaream: 154 }
{ _id: "West", avgaream: 300 }
34 Calculating Total Area and State Count
{ state: "New York", aream: 218, region: "North East" }
{ state: "New Jersey", aream: 90, region: "North East" }
{ state: "California", aream: 300, region: "West" }

{ $group: { _id: "$region", totarea: { $sum: "$aream" }, scount: { $sum: 1 } } }

{ _id: "North East", totarea: 308, scount: 2 }
{ _id: "West", totarea: 300, scount: 1 }
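To see what $group is doing under the hood, here is a plain-JavaScript sketch (not MongoDB code) of the grouping above, on the same sample documents; the groupByRegion helper is ours, purely illustrative:

```javascript
// Plain-JavaScript sketch of the $group stage shown above:
// group by region, computing {$sum: "$aream"} and a document count.
const docs = [
  { state: "New York", aream: 218, region: "North East" },
  { state: "New Jersey", aream: 90, region: "North East" },
  { state: "California", aream: 300, region: "West" },
];

function groupByRegion(input) {
  const out = new Map();
  for (const doc of input) {
    const acc = out.get(doc.region) || { _id: doc.region, totarea: 0, scount: 0 };
    acc.totarea += doc.aream; // {$sum: "$aream"}
    acc.scount += 1;          // {$sum: 1}
    out.set(doc.region, acc);
  }
  return [...out.values()];
}

console.log(groupByRegion(docs));
// [ { _id: 'North East', totarea: 308, scount: 2 },
//   { _id: 'West', totarea: 300, scount: 1 } ]
```

Like $group, this is a single pass that keeps one accumulator per key, which is also why $group holds its state in memory.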
35 Total US Population By Year
db.cdata.aggregate([
  { $unwind : "$data" },
  { $group : { "_id" : "$data.year", "totalpop" : { $sum : "$data.totalpop" } } },
  { $sort : { "totalpop" : 1 } }
])
36 $unwind: operates on an array field, creating one document per array element; the array is replaced by the element value. Missing/empty fields produce no output; non-array fields raise an error. Pipe to $group to aggregate.
37 $unwind
{ state: "New York", census: [1990, 2000, 2010] }
{ state: "New Jersey", census: [1990, 2000] }
{ state: "California", census: [1980, 1990, 2000, 2010] }
{ state: "Delaware", census: [1990, 2000] }

{ $unwind: "$census" }

{ state: "New York", census: 1990 }
{ state: "New York", census: 2000 }
{ state: "New York", census: 2010 }
{ state: "New Jersey", census: 1990 }
{ state: "New Jersey", census: 2000 }
...
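The expansion above can be sketched in plain JavaScript (illustrative only, not MongoDB code): one output document per array element, with the array field replaced by the element value.

```javascript
// $unwind sketch: flatMap each document into one document per
// census element; the census array becomes a scalar value.
const docs = [
  { state: "New York", census: [1990, 2000, 2010] },
  { state: "New Jersey", census: [1990, 2000] },
];

const unwound = docs.flatMap(doc =>
  (doc.census || []).map(el => ({ ...doc, census: el }))
);
// A missing or empty census array yields no output documents,
// matching $unwind's default behavior.

console.log(unwound.length); // 5
console.log(unwound[0]);     // { state: 'New York', census: 1990 }
```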
38 Southern State Population By Year
db.cdata.aggregate([
  { $match : { "region" : "South" } },
  { $unwind : "$data" },
  { $group : { "_id" : "$data.year", "totalpop" : { "$sum" : "$data.totalpop" } } }
])
39 $match: filter documents. Uses the existing query syntax, same as .find().
40 $match
{ state: "New York", aream: 218, region: "North East" }
{ state: "Oregon", aream: 245, region: "West" }
{ state: "California", aream: 300, region: "West" }

{ $match: { region: "West" } }

{ state: "Oregon", aream: 245, region: "West" }
{ state: "California", aream: 300, region: "West" }
41 Population Delta By State from 1990 to 2010
db.cdata.aggregate([
  { $unwind : "$data" },
  { $sort : { "data.year" : 1 } },
  { $group : {
      "_id" : "$name",
      "pop1990" : { "$first" : "$data.totalpop" },
      "pop2010" : { "$last" : "$data.totalpop" } } },
  { $project : {
      "_id" : 0,
      "name" : "$_id",
      "delta" : { "$subtract" : ["$pop2010", "$pop1990"] },
      "pop1990" : 1,
      "pop2010" : 1 } }
])
42 $sort, $limit, $skip: sort documents by one or more fields, with the same order syntax as cursors. $sort waits for earlier pipeline operators to return, and sorts in memory unless it comes early and can use an index. $limit and $skip follow cursor behavior.
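A plain-JavaScript sketch (illustrative, not MongoDB code) of chaining the three stages on made-up state areas:

```javascript
// {$sort: {aream: -1}}, {$skip: 1}, {$limit: 2} sketched with arrays.
const docs = [
  { state: "New York", aream: 218 },
  { state: "New Jersey", aream: 90 },
  { state: "California", aream: 300 },
  { state: "Oregon", aream: 245 },
];

const result = [...docs]
  .sort((a, b) => b.aream - a.aream) // {$sort: {aream: -1}}
  .slice(1)                          // {$skip: 1}
  .slice(0, 2);                      // {$limit: 2}

console.log(result.map(d => d.state)); // [ 'Oregon', 'New York' ]
```

Note the sort must see the whole input before skip/limit can run, which mirrors why $sort is blocking in the pipeline.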
43 $first, $last: accumulator operations like $push and $addToSet; must be used in $group. $first and $last are determined by document order, so they are typically used with $sort to ensure the ordering is known.
44 $project Reshape Documents Include, exclude or rename fields Inject computed fields Create sub-document fields
45 Including and Excluding Fields
{ "_id" : "Virginia", "pop1990" : …, "pop2010" : … }
{ "_id" : "South Dakota", "pop1990" : …, "pop2010" : … }

{ $project: { _id : 0, pop1990 : 1, pop2010 : 1 } }

{ "pop1990" : …, "pop2010" : … }
{ "pop1990" : …, "pop2010" : … }
46 Renaming and Computing Fields
{ "_id" : "Virginia", "pop1990" : …, "pop2010" : … }
{ "_id" : "South Dakota", "pop1990" : …, "pop2010" : … }

{ $project: { _id : 0, name : "$_id",
              "delta" : { "$subtract" : ["$pop2010", "$pop1990"] } } }

{ "name" : "Virginia", "delta" : … }
{ "name" : "South Dakota", "delta" : … }
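The rename-and-compute projection can be sketched in plain JavaScript (illustrative only; the population figures here are made up, not census values):

```javascript
// $project sketch: drop _id from the output, surface it as name,
// and compute delta = pop2010 - pop1990.
const docs = [
  { _id: "Virginia", pop1990: 6000000, pop2010: 8000000 },
  { _id: "South Dakota", pop1990: 700000, pop2010: 800000 },
];

const projected = docs.map(d => ({
  name: d._id,                  // name: "$_id"
  delta: d.pop2010 - d.pop1990, // {$subtract: ["$pop2010", "$pop1990"]}
}));

console.log(projected);
// [ { name: 'Virginia', delta: 2000000 },
//   { name: 'South Dakota', delta: 100000 } ]
```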
47 Compare number of people living within 500KM of Memphis, TN in 1990, 2000, 2010
48 Compare number of people living within 500KM of Memphis, TN in 1990, 2000, 2010
db.cdata.aggregate([
  { $geoNear : {
      "near" : { "type" : "Point", "coordinates" : [90, 35] },
      distanceField : "dist.calculated",
      maxDistance : 500000,
      includeLocs : "dist.location",
      spherical : true } },
  { $unwind : "$data" },
  { $group : {
      "_id" : "$data.year",
      "totalpop" : { "$sum" : "$data.totalpop" },
      "states" : { "$addToSet" : "$name" } } },
  { $sort : { "_id" : 1 } }
])
49 $geoNear: order/filter documents by location. Requires a geospatial index; output includes the physical distance; must be the first aggregation stage.
50 $geoNear
{ "_id" : "Virginia", "pop1990" : …, "pop2010" : …,
  center : { type : "Point", coordinates : [78.6, 37.5] } }
{ "_id" : "Tennessee", "pop1990" : …, "pop2010" : …,
  center : { type : "Point", coordinates : [86.6, 37.8] } }

{ $geoNear : { "near" : { "type" : "Point", "coordinates" : [90, 35] },
               maxDistance : 500000, spherical : true } }

{ "_id" : "Tennessee", "pop1990" : …, "pop2010" : …,
  center : { type : "Point", coordinates : [86.6, 37.8] } }
51 What if I want to save the results to a collection?
db.cdata.aggregate([
  { $geoNear : {
      "near" : { "type" : "Point", "coordinates" : [90, 35] },
      distanceField : "dist.calculated",
      maxDistance : 500000,
      includeLocs : "dist.location",
      spherical : true } },
  { $unwind : "$data" },
  { $group : {
      "_id" : "$data.year",
      "totalpop" : { "$sum" : "$data.totalpop" },
      "states" : { "$addToSet" : "$name" } } },
  { $sort : { "_id" : 1 } },
  { $out : "peoplenearmemphis" }
])
52 $out
db.cdata.aggregate([ <pipeline stages>, { $out : "resultscollection" } ])
Saves aggregation results to a new collection. Enables new aggregation uses: transforming documents, ETL.
53 Back To The Original Question Which US Division has the fastest growing population density? We only want to include data states with more than 1M people We only want to include divisions larger than 100K square miles
54 Division with Fastest Growing Pop Density
db.cdata.aggregate([
  { $match : { "data.totalpop" : { "$gt" : 1000000 } } },
  { $unwind : "$data" },
  { $sort : { "data.year" : 1 } },
  { $group : {
      "_id" : "$name",
      "pop1990" : { "$first" : "$data.totalpop" },
      "pop2010" : { "$last" : "$data.totalpop" },
      "aream" : { "$first" : "$aream" },
      "division" : { "$first" : "$division" } } },
  { $group : {
      "_id" : "$division",
      "totalpop1990" : { "$sum" : "$pop1990" },
      "totalpop2010" : { "$sum" : "$pop2010" },
      "totalaream" : { "$sum" : "$aream" } } },
  { $match : { "totalaream" : { "$gt" : 100000 } } },
  { $project : {
      "_id" : 0,
      "division" : "$_id",
      "density1990" : { "$divide" : ["$totalpop1990", "$totalaream"] },
      "density2010" : { "$divide" : ["$totalpop2010", "$totalaream"] },
      "dendelta" : { "$subtract" : [
          { "$divide" : ["$totalpop2010", "$totalaream"] },
          { "$divide" : ["$totalpop1990", "$totalaream"] } ] },
      "totalaream" : 1,
      "totalpop1990" : 1,
      "totalpop2010" : 1 } },
  { $sort : { "dendelta" : -1 } }
])
55 Aggregate Options
56 Aggregate options
db.cdata.aggregate([ <pipeline stages> ],
  { 'explain' : false,
    'allowDiskUse' : true,
    'cursor' : { 'batchSize' : 5 } })
explain: similar to find().explain(). allowDiskUse: enable use of disk to store intermediate results. cursor: specify the size of the initial result batch.
57 Aggregation and Sharding
58 Sharding: the workload is split between shards. Shards execute the pipeline up to a point; the primary shard merges cursors and continues processing*. Use explain to analyze the pipeline split. An early $match can exclude shards. Potential CPU and memory implications for the primary shard host. *Prior to v2.6, second-stage pipeline processing was done by mongos.
59 MongoDB 3.2: Joins and other improvements
60 Existing Alternatives to Joins
orders:
{ "_id": 10000,
  "items": [
    { "productname": "laptop", "unitprice": 1000, "weight": 1.2, "remainingstock": 23 },
    { "productname": "mouse", "unitprice": 20, "weight": 0.2, "remainingstock": 276 } ],
  ... }
Option 1: Include all data for an order in the same document.
Fast reads: one find delivers all the required data, and captures the full description at the time of the event.
Consumes extra space: details of each product are stored in many order documents.
Complex to maintain: a change to any product attribute must be propagated to all affected orders.
61 Existing Alternatives to Joins
orders:
{ "_id": 10000, "items": [ 12345, 54321 ], ... }
products:
{ "_id": 12345, "productname": "laptop", "unitprice": 1000, "weight": 1.2, "remainingstock": 23 }
{ "_id": 54321, "productname": "mouse", "unitprice": 20, "weight": 0.2, "remainingstock": 276 }
Option 2: Order document references product documents.
Slower reads: multiple trips to the database.
Space efficient: product details are stored once.
Loses the point-in-time snapshot of the full record.
Extra application logic: must iterate over product IDs in the order document and find the product documents; an RDBMS would automate this through a JOIN.
62 The Winner? In general, Option 1 wins Performance and containment of everything in same place beats space efficiency of normalization There are exceptions e.g. Comments in a blog post -> unbounded size However, analytics benefit from combining data from multiple collections Keep listening...
63 $lookup Left Collection Right Collection Left-outer join Includes all documents from the left collection For each document in the left collection, find the matching documents from the right collection and embed them
64 $lookup
db.leftCollection.aggregate([
  { $lookup: {
      from: "rightCollection",
      localField: "leftVal",
      foreignField: "rightVal",
      as: "embeddedData" } }
])
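The left-outer-join semantics above can be sketched in plain JavaScript (illustrative only; the lookup helper and the sample data are ours, not a MongoDB API): every left document is kept, and matching right documents are embedded as an array, empty when nothing matches.

```javascript
// $lookup sketch: left-outer equi join, embedding matches as an array.
const left = [
  { _id: 1, leftVal: "a" },
  { _id: 2, leftVal: "z" },
];
const right = [
  { _id: 10, rightVal: "a", payload: "x" },
  { _id: 11, rightVal: "a", payload: "y" },
];

function lookup(leftColl, rightColl, localField, foreignField, as) {
  return leftColl.map(l => ({
    ...l,
    // All right documents whose foreignField equals this localField.
    [as]: rightColl.filter(r => r[foreignField] === l[localField]),
  }));
}

const joined = lookup(left, right, "leftVal", "rightVal", "embeddedData");
console.log(joined[0].embeddedData.length); // 2
console.log(joined[1].embeddedData.length); // 0 (left doc still present)
```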
65 Worked Example Data Set
db.postcodes.findOne()
{ "_id": ObjectId("…e50fa77da54dfc…0d2"),
  "postcode": "SL6 0AA",
  "location": { "type": "Point", "coordinates": [ …, … ] } }
db.homesales.findOne()
{ "_id": ObjectId("56005dd980c3678b19792b7f"),
  "amount": 9000,
  "date": ISODate("…T00:00:00Z"),
  "address": { "nameornumber": 25,
               "street": "NORFOLK PARK COTTAGES",
               "town": "MAIDENHEAD",
               "county": "WINDSOR AND MAIDENHEAD",
               "postcode": "SL6 7DR" } }
66 Reduce Data Set First
db.homesales.aggregate([
  { $match: { amount: { $gte: … } } }
])
{ "_id": ObjectId("56005dda80c3678b19799e52"),
  "amount": …,
  "date": ISODate("…T00:00:00Z"),
  "address": { "nameornumber": "TEMPLE FERRY PLACE",
               "street": "MILL LANE",
               "town": "MAIDENHEAD",
               "county": "WINDSOR AND MAIDENHEAD",
               "postcode": "SL6 5ND" } },
...
67 Join (left-outer-equi) Results With Second Collection
db.homesales.aggregate([
  { $match: { amount: { $gte: … } } },
  { $lookup: {
      from: "postcodes",
      localField: "address.postcode",
      foreignField: "postcode",
      as: "postcode_docs" } }
])
...
  "county": "WINDSOR AND MAIDENHEAD",
  "postcode": "SL6 5ND" },
  "postcode_docs": [
    { "_id": ObjectId("560053e280c3678b1978b293"),
      "postcode": "SL6 5ND",
      "location": { "type": "Point", "coordinates": [ …, … ] } } ] },
...
68 Refactor Each Resulting Document
...,
{ $project: {
    _id: 0,
    saledate: "$date",
    price: "$amount",
    address: 1,
    location: { $arrayElemAt: ["$postcode_docs.location", 0] } } }
])
{ "address": { "nameornumber": "TEMPLE FERRY PLACE",
               "street": "MILL LANE",
               "town": "MAIDENHEAD",
               "county": "WINDSOR AND MAIDENHEAD",
               "postcode": "SL6 5ND" },
  "saledate": ISODate("…T00:00:00Z"),
  "price": …,
  "location": { "type": "Point", "coordinates": [ …, … ] } },
...
69 Sort on Sale Price & Write to Collection
...,
{ $sort: { price: -1 } },
{ $out: "hotspots" }
])
{ "address": { "nameornumber": "2-3",
               "street": "THE SWITCHBACK",
               "town": "MAIDENHEAD",
               "county": "WINDSOR AND MAIDENHEAD",
               "postcode": "SL6 7RJ" },
  "saledate": ISODate("…T00:00:00Z"),
  "price": …,
  "location": { "type": "Point", "coordinates": [ …, … ] } },
...
70 Aggregated Statistics
db.homesales.aggregate([
  { $group: {
      _id: { $year: "$date" },
      highestprice: { $max: "$amount" },
      lowestprice: { $min: "$amount" },
      averageprice: { $avg: "$amount" },
      amountstddev: { $stdDevPop: "$amount" } } }
])
...
{ "_id": 1995, "highestprice": …, "lowestprice": 12000, "averageprice": …, "amountstddev": … },
{ "_id": 1996, "highestprice": …, "lowestprice": 9000, "averageprice": …, "amountstddev": … },
...
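$stdDevPop computes the population standard deviation; a plain-JavaScript sketch of that calculation on made-up sale amounts (not the real data set):

```javascript
// Population standard deviation, as $stdDevPop computes it:
// sqrt(mean of squared deviations from the mean).
const amounts = [9000, 12000, 15000];

const mean = amounts.reduce((s, x) => s + x, 0) / amounts.length;
const variance =
  amounts.reduce((s, x) => s + (x - mean) ** 2, 0) / amounts.length;
const stdDev = Math.sqrt(variance);

console.log(mean); // 12000
console.log(stdDev.toFixed(2)); // population std dev of the three amounts
```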
71 Clean Up Output
...,
{ $project: {
    _id: 0,
    year: "$_id",
    highestprice: 1,
    lowestprice: 1,
    averageprice: { $trunc: "$averageprice" },
    pricestddev: { $trunc: "$amountstddev" } } }
])
...
{ "highestprice": …, "lowestprice": 12000, "averageprice": …, "year": 1995, "pricestddev": … },
{ "highestprice": …, "lowestprice": 10500, "averageprice": …, "year": 2004, "pricestddev": … },
...
72 Integrations
73 Hadoop Connector
74 Mongo-Hadoop Connector: turns MongoDB into a Hadoop-enabled filesystem: use it as the input or output for Hadoop. Works with MongoDB backup files (.bson). (diagram: input data flows from MongoDB, or from .bson backup files, into a Hadoop cluster)
75 Benefits and Features Takes advantage of full multi-core parallelism to process data in Mongo Full integration with Hadoop and JVM ecosystems Can be used with Amazon Elastic MapReduce Can read and write backup files from local filesystem, HDFS, or S3
76 Benefits and Features: vanilla Java MapReduce or, if you don't want to use Java, support for Hadoop Streaming, so MapReduce code can be written in other languages.
77 Benefits and Features: support for Pig, a high-level scripting language for data analysis and building map/reduce workflows. Support for Hive, an SQL-like language for ad-hoc queries and analysis of data sets on Hadoop-compatible file systems.
78 How It Works Adapter examines the MongoDB input collection and calculates a set of splits from the data Each split gets assigned to a node in Hadoop cluster In parallel, Hadoop nodes pull data for splits from MongoDB (or BSON) and process them locally Hadoop merges results and streams output back to MongoDB or BSON
79 BI Connector
80 MongoDB Connector for BI Visualize and explore multi-dimensional documents using SQL-based BI tools. The connector does the following: Provides the BI tool with the schema of the MongoDB collection to be visualized Translates SQL statements issued by the BI tool into equivalent MongoDB queries that are sent to MongoDB for processing Converts the results into the tabular format expected by the BI tool, which can then visualize the data based on user requirements
81 Location & Flow of Data (diagram: application data is stored as documents in MongoDB, e.g. {name: "Andrew", address: {street: …}}; the BI Connector holds the table mapping metadata and sits between MongoDB and the analytics & visualization tools, which see tabular data)
82 Defining Data Mapping
mongodrdl generates a DRDL mapping file from a MongoDB collection; mongobischema imports it into the PostgreSQL-based connector via the MongoDB-specific Foreign Data Wrapper.
mongodrdl --host … --port … -d mydbname -o mydrdlfile.drdl
mongobischema import mycollectionname mydrdlfile.drdl
83 Optionally Manually Edit DRDL File
Redact attributes; use more appropriate types (sampling can get it wrong); rename tables (v1.1+); rename columns (v1.1+); build new views using the MongoDB Aggregation Framework, e.g. $lookup to join 2 tables.
- table: homesales
  collection: homesales
  pipeline: []
  columns:
  - name: _id
    mongotype: bson.ObjectId
    sqlname: _id
    sqltype: varchar
  - name: address.county
    mongotype: string
    sqlname: address_county
    sqltype: varchar
  - name: address.nameornumber
    mongotype: int
    sqlname: address_nameornumber
    sqltype: varchar
84 Summary
85 Analytics in MongoDB? YES! Create, Read, Update, Delete plus Group, Count, Derive Values, Filter, Average, Sort.
86 Aggregation Framework Use Cases: complex aggregation queries, ad-hoc reporting, real-time analytics, visualizing and reshaping data.
87 Questions?
More information1 Dulcian, Inc., 2001 All rights reserved. Oracle9i Data Warehouse Review. Agenda
Agenda Oracle9i Warehouse Review Dulcian, Inc. Oracle9i Server OLAP Server Analytical SQL Mining ETL Infrastructure 9i Warehouse Builder Oracle 9i Server Overview E-Business Intelligence Platform 9i Server:
More information1Z Oracle Big Data 2017 Implementation Essentials Exam Summary Syllabus Questions
1Z0-449 Oracle Big Data 2017 Implementation Essentials Exam Summary Syllabus Questions Table of Contents Introduction to 1Z0-449 Exam on Oracle Big Data 2017 Implementation Essentials... 2 Oracle 1Z0-449
More informationUnifying Big Data Workloads in Apache Spark
Unifying Big Data Workloads in Apache Spark Hossein Falaki @mhfalaki Outline What s Apache Spark Why Unification Evolution of Unification Apache Spark + Databricks Q & A What s Apache Spark What is Apache
More informationTopics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples
Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?
More informationImpala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam
Impala A Modern, Open Source SQL Engine for Hadoop Yogesh Chockalingam Agenda Introduction Architecture Front End Back End Evaluation Comparison with Spark SQL Introduction Why not use Hive or HBase?
More informationQuery Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016
Query Processing Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Slides re-used with some modification from www.db-book.com Reference: Database System Concepts, 6 th Ed. By Silberschatz,
More informationMODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS
MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS SUJEE MANIYAM FOUNDER / PRINCIPAL @ ELEPHANT SCALE www.elephantscale.com sujee@elephantscale.com HI, I M SUJEE MANIYAM Founder / Principal @ ElephantScale
More informationBlended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)
Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance
More informationMongoDB Schema Design
MongoDB Schema Design Demystifying document structures in MongoDB Jon Tobin @jontobs MongoDB Overview NoSQL Document Oriented DB Dynamic Schema HA/Sharding Built In Simple async replication setup Automated
More informationAccelerate Database Performance and Reduce Response Times in MongoDB Humongous Environments with the LSI Nytro MegaRAID Flash Accelerator Card
Accelerate Database Performance and Reduce Response Times in MongoDB Humongous Environments with the LSI Nytro MegaRAID Flash Accelerator Card The Rise of MongoDB Summary One of today s growing database
More informationIt also suggests creating more reduce tasks to deal with problems in keeping reducer inputs in memory.
Reference 1. Processing Theta-Joins using MapReduce, Alper Okcan, Mirek Riedewald Northeastern University 1.1. Video of presentation on above paper 2. Optimizing Joins in map-reduce environment F.N. Afrati,
More informationCSE 190D Spring 2017 Final Exam Answers
CSE 190D Spring 2017 Final Exam Answers Q 1. [20pts] For the following questions, clearly circle True or False. 1. The hash join algorithm always has fewer page I/Os compared to the block nested loop join
More informationIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce Who Am I - Ryan Tabora - Data Developer at Think Big Analytics - Big Data Consulting - Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc. 2 Who Am I -
More informationAzure Data Factory VS. SSIS. Reza Rad, Consultant, RADACAD
Azure Data Factory VS. SSIS Reza Rad, Consultant, RADACAD 2 Please silence cell phones Explore Everything PASS Has to Offer FREE ONLINE WEBINAR EVENTS FREE 1-DAY LOCAL TRAINING EVENTS VOLUNTEERING OPPORTUNITIES
More informationAdvanced Database Systems
Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed
More informationMONGODB INTERVIEW QUESTIONS
MONGODB INTERVIEW QUESTIONS http://www.tutorialspoint.com/mongodb/mongodb_interview_questions.htm Copyright tutorialspoint.com Dear readers, these MongoDB Interview Questions have been designed specially
More informationLambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015
Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL May 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document
More informationApache Ignite TM - In- Memory Data Fabric Fast Data Meets Open Source
Apache Ignite TM - In- Memory Data Fabric Fast Data Meets Open Source DMITRIY SETRAKYAN Founder, PPMC https://ignite.apache.org @apacheignite @dsetrakyan Agenda About In- Memory Computing Apache Ignite
More informationT-SQL Training: T-SQL for SQL Server for Developers
Duration: 3 days T-SQL Training Overview T-SQL for SQL Server for Developers training teaches developers all the Transact-SQL skills they need to develop queries and views, and manipulate data in a SQL
More informationInstructor : Dr. Sunnie Chung. Independent Study Spring Pentaho. 1 P a g e
ABSTRACT Pentaho Business Analytics from different data source, Analytics from csv/sql,create Star Schema Fact & Dimension Tables, kettle transformation for big data integration, MongoDB kettle Transformation,
More informationCERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI)
CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI) The Certificate in Software Development Life Cycle in BIGDATA, Business Intelligence and Tableau program
More informationTHE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES
1 THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES Vincent Garonne, Mario Lassnig, Martin Barisits, Thomas Beermann, Ralph Vigne, Cedric Serfon Vincent.Garonne@cern.ch ph-adp-ddm-lab@cern.ch XLDB
More informationBig Data Infrastructure at Spotify
Big Data Infrastructure at Spotify Wouter de Bie Team Lead Data Infrastructure September 26, 2013 2 Who am I? According to ZDNet: "The work they have done to improve the Apache Hive data warehouse system
More informationIntroduction to BigData, Hadoop:-
Introduction to BigData, Hadoop:- Big Data Introduction: Hadoop Introduction What is Hadoop? Why Hadoop? Hadoop History. Different types of Components in Hadoop? HDFS, MapReduce, PIG, Hive, SQOOP, HBASE,
More informationPart 1: Indexes for Big Data
JethroData Making Interactive BI for Big Data a Reality Technical White Paper This white paper explains how JethroData can help you achieve a truly interactive interactive response time for BI on big data,
More informationCloud Analytics and Business Intelligence on AWS
Cloud Analytics and Business Intelligence on AWS Enterprise Applications Virtual Desktops Sharing & Collaboration Platform Services Analytics Hadoop Real-time Streaming Data Machine Learning Data Warehouse
More informationHadoop, Yarn and Beyond
Hadoop, Yarn and Beyond 1 B. R A M A M U R T H Y Overview We learned about Hadoop1.x or the core. Just like Java evolved, Java core, Java 1.X, Java 2.. So on, software and systems evolve, naturally.. Lets
More informationData Analytics at Logitech Snowflake + Tableau = #Winning
Welcome # T C 1 8 Data Analytics at Logitech Snowflake + Tableau = #Winning Avinash Deshpande I am a futurist, scientist, engineer, designer, data evangelist at heart Find me at Avinash Deshpande Chief
More informationChapter 13: Query Processing
Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing
More informationMaking the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack. Chief Architect RainStor
Making the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack Chief Architect RainStor Agenda Importance of Hadoop + data compression Data compression techniques Compression,
More informationIntroduction to Database Services
Introduction to Database Services Shaun Pearce AWS Solutions Architect 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved Today s agenda Why managed database services? A non-relational
More informationOracle Big Data Connectors
Oracle Big Data Connectors Oracle Big Data Connectors is a software suite that integrates processing in Apache Hadoop distributions with operations in Oracle Database. It enables the use of Hadoop to process
More informationMapReduce programming model
MapReduce programming model technology basics for data scientists Spring - 2014 Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Warning! Slides are only for presenta8on guide We will discuss+debate
More informationCertified Big Data Hadoop and Spark Scala Course Curriculum
Certified Big Data Hadoop and Spark Scala Course Curriculum The Certified Big Data Hadoop and Spark Scala course by DataFlair is a perfect blend of indepth theoretical knowledge and strong practical skills
More informationFluentd + MongoDB + Spark = Awesome Sauce
Fluentd + MongoDB + Spark = Awesome Sauce Nishant Sahay, Sr. Architect, Wipro Limited Bhavani Ananth, Tech Manager, Wipro Limited Your company logo here Wipro Open Source Practice: Vision & Mission Vision
More informationBig Data Analytics using Apache Hadoop and Spark with Scala
Big Data Analytics using Apache Hadoop and Spark with Scala Training Highlights : 80% of the training is with Practical Demo (On Custom Cloudera and Ubuntu Machines) 20% Theory Portion will be important
More informationVOLTDB + HP VERTICA. page
VOLTDB + HP VERTICA ARCHITECTURE FOR FAST AND BIG DATA ARCHITECTURE FOR FAST + BIG DATA FAST DATA Fast Serve Analytics BIG DATA BI Reporting Fast Operational Database Streaming Analytics Columnar Analytics
More informationIntroduction to Big-Data
Introduction to Big-Data Ms.N.D.Sonwane 1, Mr.S.P.Taley 2 1 Assistant Professor, Computer Science & Engineering, DBACER, Maharashtra, India 2 Assistant Professor, Information Technology, DBACER, Maharashtra,
More informationmicrosoft
70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series
More informationScaling Up 1 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. Hadoop, Pig
CSE 6242 / CX 4242 Scaling Up 1 Hadoop, Pig Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Le
More informationDatabases 2 (VU) ( / )
Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:
More informationYour First MongoDB Environment: What You Should Know Before Choosing MongoDB as Your Database
Your First MongoDB Environment: What You Should Know Before Choosing MongoDB as Your Database Me - @adamotonete Adamo Tonete Senior Technical Engineer Brazil Agenda What is MongoDB? The good side of MongoDB
More informationGLADE: A Scalable Framework for Efficient Analytics. Florin Rusu (University of California, Merced) Alin Dobra (University of Florida)
DE: A Scalable Framework for Efficient Analytics Florin Rusu (University of California, Merced) Alin Dobra (University of Florida) Big Data Analytics Big Data Storage is cheap ($100 for 1TB disk) Everything
More informationDocument Databases: MongoDB
NDBI040: Big Data Management and NoSQL Databases hp://www.ksi.mff.cuni.cz/~svoboda/courses/171-ndbi040/ Lecture 9 Document Databases: MongoDB Marn Svoboda svoboda@ksi.mff.cuni.cz 28. 11. 2017 Charles University
More informationCassandra, MongoDB, and HBase. Cassandra, MongoDB, and HBase. I have chosen these three due to their recent
Tanton Jeppson CS 401R Lab 3 Cassandra, MongoDB, and HBase Introduction For my report I have chosen to take a deeper look at 3 NoSQL database systems: Cassandra, MongoDB, and HBase. I have chosen these
More information