Analytics in MongoDB. Massimo Brignoli Principal Solutions


1 Analytics in MongoDB Massimo Brignoli Principal Solutions

2 Agenda Analytics in MongoDB? Aggregation Framework Aggregation Pipeline Stages Aggregation Framework in Action Joins in MongoDB 3.2 Integrations Analytical Architectures

3 Relational Expressive Query Language & Secondary Indexes Strong Consistency Enterprise Management & Integrations

4 The World Has Changed Data Time Volume Velocity Variety Iterative Agile Short Cycles Risk Always On Secure Global Cost Open-Source Cloud Commodity

5 NoSQL Expressive Query Language & Secondary Indexes Flexibility Strong Consistency Scalability & Performance Enterprise Management & Integrations Always On, Global Deployments

6 Nexus Architecture Expressive Query Language & Secondary Indexes Flexibility Strong Consistency Scalability & Performance Enterprise Management & Integrations Always On, Global Deployments

7 Some Common MongoDB Use Cases Single View Internet of Things Mobile Real-Time Analytics Catalog Personalization Content Management

8 MongoDB in Research

9 Analytics in MongoDB?

10 Analytics in MongoDB? Analytics? Create Read Update Delete Group Count Derive Values Filter Average Sort

11 Analytics on MongoDB Data
Hadoop Connector (MapReduce & HDFS): extract data from MongoDB and perform complex analytics with Hadoop; batch rather than real-time; extra nodes to manage; direct access to MongoDB from Spark
MongoDB BI Connector (SQL Connector): direct SQL access from BI tools
MongoDB aggregation pipeline: real-time; runs on the live, operational data set; narrower feature set

12 For Example: US Census Data Census data from 1990, 2000, 2010 Question: Which US Division has the fastest growing population density? We only want to include states with more than 1M people We only want to include divisions larger than 100K square miles Division = a group of US States Population density = # of people / area of the division Data is provided at the state level
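To make the density arithmetic concrete, here is a tiny JavaScript sketch with purely invented numbers (the real census figures are not in this transcript):

```javascript
// Purely illustrative numbers for one hypothetical division:
// 150,000 square miles, 3.0M people in 1990, 4.5M people in 2010.
const areaSqMiles = 150000;
const pop1990 = 3000000;
const pop2010 = 4500000;

// Population density = number of people / area of the division.
const density1990 = pop1990 / areaSqMiles;      // 20 people per square mile
const density2010 = pop2010 / areaSqMiles;      // 30 people per square mile

// "Fastest growing" compares the change in density between censuses.
const densityDelta = density2010 - density1990; // 10

console.log({ density1990, density2010, densityDelta });
```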

13 US Regions and Divisions

14 How would we solve this in SQL? SELECT ... GROUP BY ... HAVING ...

15 Aggregation Framework

16 Aggregation Framework

17 What is an Aggregation Pipeline? A Series of Document Transformations Executed in stages Original input is a collection Output as a cursor or a collection Rich Library of Functions Filter, compute, group, and summarize data Output of one stage sent to input of next Operations executed in sequential order
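The execution model can be sketched in a few lines of plain JavaScript: each stage is a function from an array of documents to an array of documents, and the pipeline is just composition applied in order. This is an illustration of the model, not the server implementation; the collection and field names are invented:

```javascript
// Each stage: array of documents in, array of documents out.
const match = pred => docs => docs.filter(pred);
const project = fn => docs => docs.map(fn);
const group = (keyFn, accFn) => docs => {
  const buckets = new Map();
  for (const d of docs) {
    const k = keyFn(d);
    if (!buckets.has(k)) buckets.set(k, []);
    buckets.get(k).push(d);
  }
  return [...buckets].map(([k, ds]) => accFn(k, ds));
};

// A pipeline is function composition: output of one stage feeds the next.
const aggregate = (docs, stages) => stages.reduce((acc, s) => s(acc), docs);

const states = [
  { name: "A", region: "West", pop: 10 },
  { name: "B", region: "West", pop: 20 },
  { name: "C", region: "East", pop: 5 },
];

const result = aggregate(states, [
  match(d => d.pop >= 10),                          // like $match
  project(d => ({ region: d.region, pop: d.pop })), // like $project
  group(d => d.region,                              // like $group with $sum
        (region, ds) => ({ _id: region, totalPop: ds.reduce((s, d) => s + d.pop, 0) })),
]);
// result: [{ _id: "West", totalPop: 30 }]
```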

18 Aggregation Pipeline

19 Aggregation Pipeline (diagram, built up across slides 19-25): documents flow left to right through $match, $project, $lookup, and $group; each stage filters, reshapes, joins, or summarizes the document stream before passing it to the next stage.

26 Aggregation Pipeline Stages
$match — filter documents
$project — reshape documents
$unwind — expand documents (one per array element)
$group — summarize documents
$sort — order documents
$skip — jump over a number of documents
$limit — limit number of documents
$redact — restrict documents
$geoNear — geospherical query
$sample — (new in 3.2) randomly select a subset of documents
$lookup — (new in 3.2) left-outer equi joins
$out — send results to a new collection

27 Aggregation Framework in Action (let s play with the census data)

28 MongoDB State Collection Document For Each State Name Region Division Census Data For 1990, 2000, 2010 Population Housing Units Occupied Housing Units Census Data is an array with three subdocuments

29 Document Model
{
  "_id" : ObjectId("54e23c7b…"),
  "name" : "California",
  "region" : "West",
  "data" : [
    { "totalpop" : …, "totalhouse" : …, "occhouse" : …, "year" : 2000 },
    { "totalpop" : …, "totalhouse" : …, "occhouse" : …, "year" : 2010 },
    { "totalpop" : …, "totalhouse" : …, "occhouse" : …, "year" : 1990 }
  ]
}

30 Total US Area
db.cdata.aggregate([
  {"$group" : {
      "_id" : null,
      "totalarea" : {$sum : "$aream"},
      "avgarea" : {$avg : "$aream"}
  }}
])

31 $group Group documents by value Field reference, object, constant Other output fields are computed: $max, $min, $avg, $sum, $addToSet, $push, $first, $last Processes all data in memory by default
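What the $group accumulators compute can be mimicked in plain JavaScript; the sample documents below are invented, but the areas are chosen to match the figures used in the next slides:

```javascript
// Sample input documents (state areas invented for illustration).
const docs = [
  { region: "North East", aream: 218, name: "New York" },
  { region: "North East", aream: 90,  name: "New Jersey" },
  { region: "West",       aream: 300, name: "California" },
];

// Bucket documents by the group key, like _id: "$region".
const buckets = new Map();
for (const d of docs) {
  if (!buckets.has(d.region)) buckets.set(d.region, []);
  buckets.get(d.region).push(d);
}

// One output document per bucket, like the $group accumulators.
const out = [...buckets].map(([region, ds]) => ({
  _id: region,
  totalarea: ds.reduce((s, d) => s + d.aream, 0),           // {$sum: "$aream"}
  avgarea: ds.reduce((s, d) => s + d.aream, 0) / ds.length, // {$avg: "$aream"}
  numstates: ds.length,                                     // {$sum: 1}
  states: ds.map(d => d.name),                              // {$push: "$name"}
}));
// out[0] → { _id: "North East", totalarea: 308, avgarea: 154, numstates: 2, ... }
```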

32 Area By Region
db.cdata.aggregate([
  {"$group" : {
      "_id" : "$region",
      "totalarea" : {$sum : "$aream"},
      "avgarea" : {$avg : "$aream"},
      "numstates" : {$sum : 1},
      "states" : {$push : "$name"}
  }}
])

33 Calculating Average State Area By Region
{state: "New York", aream: 218, region: "North East"}
{state: "New Jersey", aream: 90, region: "North East"}
{state: "California", aream: 300, region: "West"}

{ $group: { _id: "$region", avgaream: {$avg: "$aream"} }}

{ _id: "North East", avgaream: 154}
{ _id: "West", avgaream: 300}

34 Calculating Total Area and State Count
{state: "New York", aream: 218, region: "North East"}
{state: "New Jersey", aream: 90, region: "North East"}
{state: "California", aream: 300, region: "West"}

{ $group: { _id: "$region", totarea: {$sum: "$aream"}, scount: {$sum: 1} }}

{ _id: "North East", totarea: 308, scount: 2}
{ _id: "West", totarea: 300, scount: 1}

35 Total US Population By Year
db.cdata.aggregate([
  {$unwind : "$data"},
  {$group : { "_id" : "$data.year", "totalpop" : {$sum : "$data.totalpop"}}},
  {$sort : {"totalpop" : 1}}
])

36 $unwind Operate on an array field Create documents from array elements Array replaced by element value Missing/empty fields → no output Non-array fields → error Pipe to $group to aggregate

37 $unwind
{ state: "New York", census: [1990, 2000, 2010]}
{ state: "New Jersey", census: [1990, 2000]}
{ state: "California", census: [1980, 1990, 2000, 2010]}
{ state: "Delaware", census: [1990, 2000]}

{ $unwind: "$census" }

{ state: "New York", census: 1990}
{ state: "New York", census: 2000}
{ state: "New York", census: 2010}
{ state: "New Jersey", census: 1990}
{ state: "New Jersey", census: 2000}
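The unwinding shown above can be reproduced with a flatMap in plain JavaScript. One caveat on this sketch: it silently skips documents where the field is missing or not an array, whereas $unwind (pre-3.2) actually errors on non-array values, as the previous slide notes:

```javascript
const docs = [
  { state: "New York",   census: [1990, 2000, 2010] },
  { state: "New Jersey", census: [1990, 2000] },
  { state: "Guam" },   // hypothetical doc with a missing field → no output
];

// One output document per array element; the array field is replaced
// by the element value, matching the $unwind behaviour shown above.
const unwound = docs.flatMap(d =>
  Array.isArray(d.census)
    ? d.census.map(year => ({ state: d.state, census: year }))
    : []   // missing/empty field → the document contributes nothing
);
// unwound[0] → { state: "New York", census: 1990 }; unwound.length === 5
```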

38 Southern State Population By Year
db.cdata.aggregate([
  {$match : {"region" : "South"}},
  {$unwind : "$data"},
  {$group : { "_id" : "$data.year", "totalpop" : {"$sum" : "$data.totalpop"}}}
])

39 $match Filter documents Uses existing query syntax, same as.find()

40 $match
{state: "New York", aream: 218, region: "North East"}
{state: "Oregon", aream: 245, region: "West"}
{state: "California", aream: 300, region: "West"}

{ $match: { region: "West" } }

{state: "Oregon", aream: 245, region: "West"}
{state: "California", aream: 300, region: "West"}

41 Population Delta By State from 1990 to 2010
db.cdata.aggregate([
  {$unwind : "$data"},
  {$sort : {"data.year" : 1}},
  {$group : { "_id" : "$name",
              "pop1990" : {"$first" : "$data.totalpop"},
              "pop2010" : {"$last" : "$data.totalpop"}}},
  {$project : { "_id" : 0,
                "name" : "$_id",
                "delta" : {"$subtract" : ["$pop2010", "$pop1990"]},
                "pop1990" : 1,
                "pop2010" : 1}}
])

42 $sort, $limit, $skip Sort documents by one or more fields Same order syntax as cursors Waits for earlier pipeline operator to return In-memory unless early and indexed Limit and skip follow cursor behavior

43 $first, $last Accumulator operations like $push and $addToSet Must be used in $group $first and $last determined by document order Typically used with $sort to ensure ordering is known
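The sort-then-group pattern can be sketched in plain JavaScript (population figures invented):

```javascript
// Census rows for one state, deliberately out of year order (numbers invented).
const rows = [
  { name: "Virginia", year: 2010, pop: 8000000 },
  { name: "Virginia", year: 1990, pop: 6200000 },
  { name: "Virginia", year: 2000, pop: 7000000 },
];

// $sort by year first, so that "first" and "last" are meaningful.
rows.sort((a, b) => a.year - b.year);

// $group with $first / $last: keep the earliest and most recent value seen.
const grouped = new Map();
for (const r of rows) {
  if (!grouped.has(r.name)) grouped.set(r.name, { _id: r.name, pop1990: r.pop });
  grouped.get(r.name).pop2010 = r.pop;   // overwritten each time → last wins
}
const out = [...grouped.values()];
// out[0] → { _id: "Virginia", pop1990: 6200000, pop2010: 8000000 }
```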

44 $project Reshape Documents Include, exclude or rename fields Inject computed fields Create sub-document fields

45 Including and Excluding Fields
{ "_id" : "Virginia", "pop1990" : …, "pop2010" : … }
{ "_id" : "South Dakota", "pop1990" : …, "pop2010" : … }

{ $project: { _id: 0, pop1990: 1, pop2010: 1 } }

{ "pop1990" : …, "pop2010" : … }
{ "pop1990" : …, "pop2010" : … }

46 Renaming and Computing Fields
{ "_id" : "Virginia", "pop1990" : …, "pop2010" : … }
{ "_id" : "South Dakota", "pop1990" : …, "pop2010" : … }

{ $project: { _id: 0,
              name: "$_id",
              "delta" : {"$subtract" : ["$pop2010", "$pop1990"]} } }

{ "name" : "Virginia", "delta" : … }
{ "name" : "South Dakota", "delta" : … }
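The rename-and-compute projection in the slide above maps directly onto plain JavaScript (population values invented, so the delta is illustrative):

```javascript
// Input shape from the slide, with invented population values.
const doc = { _id: "Virginia", pop1990: 6200000, pop2010: 8000000 };

// $project analogue: _id is dropped, name is a renamed copy of _id,
// delta is a computed field (like $subtract).
const projected = {
  name: doc._id,
  delta: doc.pop2010 - doc.pop1990,
};
// projected → { name: "Virginia", delta: 1800000 }
```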

47 Compare number of people living within 500KM of Memphis, TN in 1990, 2000, 2010

48 Compare number of people living within 500KM of Memphis, TN in 1990, 2000, 2010
db.cdata.aggregate([
  {$geoNear : {
      "near" : {"type" : "Point", "coordinates" : [90, 35]},
      distanceField : "dist.calculated",
      maxDistance : …,          // metres (500 km)
      includeLocs : "dist.location",
      spherical : true }},
  {$unwind : "$data"},
  {$group : { "_id" : "$data.year",
              "totalpop" : {"$sum" : "$data.totalpop"},
              "states" : {"$addToSet" : "$name"}}},
  {$sort : {"_id" : 1}}
])

49 $geoNear Order/Filter Documents by Location Requires a geospatial index Output includes physical distance Must be first aggregation stage

50 $geoNear
{"_id" : "Virginia", "pop1990" : …, "pop2010" : …, center: { type: "Point", coordinates: [78.6, 37.5]}}
{"_id" : "Tennessee", "pop1990" : …, "pop2010" : …, center: { type: "Point", coordinates: [86.6, 37.8]}}

{$geoNear : { "near" : {"type" : "Point", "coordinates" : [90, 35]}, maxDistance : …, spherical : true }}

{"_id" : "Tennessee", "pop1990" : …, "pop2010" : …, center: { type: "Point", coordinates: [86.6, 37.8]}}

51 What if I want to save the results to a collection?
db.cdata.aggregate([
  {$geoNear : {
      "near" : {"type" : "Point", "coordinates" : [90, 35]},
      distanceField : "dist.calculated",
      maxDistance : …,          // metres (500 km)
      includeLocs : "dist.location",
      spherical : true }},
  {$unwind : "$data"},
  {$group : { "_id" : "$data.year",
              "totalpop" : {"$sum" : "$data.totalpop"},
              "states" : {"$addToSet" : "$name"}}},
  {$sort : {"_id" : 1}},
  {$out : "peoplenearmemphis"}
])

52 $out
db.cdata.aggregate([<pipeline stages>, { $out : "resultscollection" }])
Save aggregation results to a new collection New aggregation uses: transform documents (ETL)

53 Back To The Original Question Which US Division has the fastest growing population density? We only want to include data states with more than 1M people We only want to include divisions larger than 100K square miles

54 Division with Fastest Growing Pop Density
db.cdata.aggregate([
  {$match : {"data.totalpop" : {"$gt" : 1000000}}},   // states with more than 1M people
  {$unwind : "$data"},
  {$sort : {"data.year" : 1}},
  {$group : {"_id" : "$name",
             "pop1990" : {"$first" : "$data.totalpop"},
             "pop2010" : {"$last" : "$data.totalpop"},
             "aream" : {"$first" : "$aream"},
             "division" : {"$first" : "$division"}}},
  {$group : {"_id" : "$division",
             "totalpop1990" : {"$sum" : "$pop1990"},
             "totalpop2010" : {"$sum" : "$pop2010"},
             "totalaream" : {"$sum" : "$aream"}}},
  {$match : {"totalaream" : {"$gt" : 100000}}},       // divisions larger than 100K square miles
  {$project : {"_id" : 0, "division" : "$_id",
               "density1990" : {"$divide" : ["$totalpop1990", "$totalaream"]},
               "density2010" : {"$divide" : ["$totalpop2010", "$totalaream"]},
               "dendelta" : {"$subtract" : [
                   {"$divide" : ["$totalpop2010", "$totalaream"]},
                   {"$divide" : ["$totalpop1990", "$totalaream"]}]},
               "totalaream" : 1, "totalpop1990" : 1, "totalpop2010" : 1}},
  {$sort : {"dendelta" : -1}}
])

55 Aggregate Options

56 Aggregate options
db.cdata.aggregate([<pipeline stages>],
  { 'explain' : false,
    'allowDiskUse' : true,
    'cursor' : {'batchSize' : 5}})
explain — similar to find().explain() allowDiskUse — enable use of disk to store intermediate results cursor — specify the batch size of the initial result

57 Aggregation and Sharding

58 Sharding Workload split between shards Shards execute pipeline up to a point Primary shard merges cursors and continues processing* Use explain to analyze pipeline split Early $match can exclude shards Potential CPU and memory implications for primary shard host *Prior to v2.6 second-stage pipeline processing was done by mongos

59 MongoDB 3.2: Joins and other improvements

60 Existing Alternatives to Joins — Option 1: include all data for an order in the same document (orders collection)
{ "_id": 10000,
  "items": [
    { "productname": "laptop", "unitprice": 1000, "weight": 1.2, "remainingstock": 23},
    { "productname": "mouse", "unitprice": 20, "weight": 0.2, "remainingstock": 276}],
  ... }
Fast reads: one find delivers all the required data; captures full description at the time of the event Consumes extra space: details of each product stored in many order documents Complex to maintain: a change to any product attribute must be propagated to all affected orders

61 Existing Alternatives to Joins — Option 2: order document references product documents (orders and products collections)
{ "_id": 10000, "items": [ 12345, … ], ... }
{ "_id": 12345, "productname": "laptop", "unitprice": 1000, "weight": 1.2, "remainingstock": 23 }
{ "_id": 54321, "productname": "mouse", "unitprice": 20, "weight": 0.2, "remainingstock": 276 }
Slower reads: multiple trips to the database Space efficient: product details stored once Lose point-in-time snapshot of full record Extra application logic: must iterate over product IDs in the order document and find the product documents; an RDBMS would automate this through a JOIN

62 The Winner? In general, Option 1 wins Performance and containment of everything in same place beats space efficiency of normalization There are exceptions e.g. Comments in a blog post -> unbounded size However, analytics benefit from combining data from multiple collections Keep listening...

63 $lookup Left Collection Right Collection Left-outer join Includes all documents from the left collection For each document in the left collection, find the matching documents from the right collection and embed them

64 $lookup Left Collection Right Collection
db.leftcollection.aggregate([{
  $lookup: {
    from: "rightcollection",
    localField: "leftval",
    foreignField: "rightval",
    as: "embeddeddata"
  }
}])
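The from/localField/foreignField/as semantics can be sketched as a left-outer join in plain JavaScript (collections and values here are invented):

```javascript
// Left collection: every document here appears in the output.
const leftcollection = [
  { _id: 1, leftval: "SL6 5ND" },
  { _id: 2, leftval: "ZZ9 9ZZ" },   // no match on the right
];
// Right collection: matching documents get embedded.
const rightcollection = [
  { rightval: "SL6 5ND", location: "point-a" },
];

// $lookup analogue: embed every right document whose foreignField
// equals the left document's localField, as an array under "as".
const joined = leftcollection.map(l => ({
  ...l,
  embeddeddata: rightcollection.filter(r => r.rightval === l.leftval),
}));
// joined[0].embeddeddata has 1 element; joined[1].embeddeddata is [] (left-outer).
```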

65 Worked Example Data Set
db.postcodes.findOne()
{
  "_id": ObjectId("…e50fa77da54dfc0d2"),
  "postcode": "SL6 0AA",
  "location": { "type": "Point", "coordinates": [ …, … ] }
}
db.homesales.findOne()
{
  "_id": ObjectId("56005dd980c3678b19792b7f"),
  "amount": 9000,
  "date": ISODate("…T00:00:00Z"),
  "address": {
    "nameornumber": 25,
    "street": "NORFOLK PARK COTTAGES",
    "town": "MAIDENHEAD",
    "county": "WINDSOR AND MAIDENHEAD",
    "postcode": "SL6 7DR"
  }
}

66 Reduce Data Set First
db.homesales.aggregate([
  {$match: { amount: {$gte: …}} }
])
{
  "_id": ObjectId("56005dda80c3678b19799e52"),
  "amount": …,
  "date": ISODate("…T00:00:00Z"),
  "address": {
    "nameornumber": "TEMPLE FERRY PLACE",
    "street": "MILL LANE",
    "town": "MAIDENHEAD",
    "county": "WINDSOR AND MAIDENHEAD",
    "postcode": "SL6 5ND"
  }
},

67 Join (left-outer-equi) Results With Second Collection
db.homesales.aggregate([
  {$match: { amount: {$gte: …}} },
  {$lookup: {
      from: "postcodes",
      localField: "address.postcode",
      foreignField: "postcode",
      as: "postcode_docs"}
  }
])
...
  "county": "WINDSOR AND MAIDENHEAD",
  "postcode": "SL6 5ND" },
  "postcode_docs": [ {
    "_id": ObjectId("560053e280c3678b1978b293"),
    "postcode": "SL6 5ND",
    "location": { "type": "Point", "coordinates": [ …, … ] }}]},
...

68 Refactor Each Resulting Document
...},
  {$project: {
      _id: 0,
      saledate: "$date",
      price: "$amount",
      address: 1,
      location: {$arrayElemAt: ["$postcode_docs.location", 0]}}
  }
])
{
  "address": { "nameornumber": "TEMPLE FERRY PLACE", "street": "MILL LANE", "town": "MAIDENHEAD", "county": "WINDSOR AND MAIDENHEAD", "postcode": "SL6 5ND" },
  "saledate": ISODate("…T00:00:00Z"),
  "price": …,
  "location": { "type": "Point", "coordinates": [ …, … ]}},
...

69 Sort on Sale Price & Write to Collection
...},
  {$sort: {price: -1}},
  {$out: "hotspots"}
])
{"address": { "nameornumber": "2-3", "street": "THE SWITCHBACK", "town": "MAIDENHEAD", "county": "WINDSOR AND MAIDENHEAD", "postcode": "SL6 7RJ" },
 "saledate": ISODate("…T00:00:00Z"),
 "price": …,
 "location": { "type": "Point", "coordinates": [ …, … ]}},
...

70 Aggregated Statistics
db.homesales.aggregate([
  {$group: {
      _id: {$year: "$date"},
      highestprice: {$max: "$amount"},
      lowestprice: {$min: "$amount"},
      averageprice: {$avg: "$amount"},
      amountstddev: {$stdDevPop: "$amount"}
  }}
])
...
{ "_id": 1995, "highestprice": …, "lowestprice": 12000, "averageprice": …, "amountstddev": … },
{ "_id": 1996, "highestprice": …, "lowestprice": 9000, "averageprice": …, "amountstddev": … },
...

71 Clean Up Output
...,
  {$project: {
      _id: 0,
      year: "$_id",
      highestprice: 1,
      lowestprice: 1,
      averageprice: {$trunc: "$averageprice"},
      pricestddev: {$trunc: "$amountstddev"}
  }}
])
...
{ "highestprice": …, "lowestprice": 12000, "averageprice": …, "year": 1995, "pricestddev": … },
{ "highestprice": …, "lowestprice": 10500, "averageprice": …, "year": 2004, "pricestddev": … },
...

72 Integrations

73 Hadoop Connector

74 Mongo-Hadoop Connector Turn MongoDB into a Hadoop-enabled filesystem: use it as the input or output for a Hadoop cluster Works with MongoDB backup files (.bson)

75 Benefits and Features Takes advantage of full multi-core parallelism to process data in Mongo Full integration with Hadoop and JVM ecosystems Can be used with Amazon Elastic MapReduce Can read and write backup files from local filesystem, HDFS, or S3

76 Benefits and Features Vanilla Java MapReduce If you don't want to use Java, support for Hadoop Streaming lets you write MapReduce code in other languages

77 Benefits and Features Support for Pig: high-level scripting language for data analysis and building map/reduce workflows Support for Hive: SQL-like language for ad-hoc queries and analysis of data sets on Hadoop-compatible file systems

78 How It Works Adapter examines the MongoDB input collection and calculates a set of splits from the data Each split gets assigned to a node in Hadoop cluster In parallel, Hadoop nodes pull data for splits from MongoDB (or BSON) and process them locally Hadoop merges results and streams output back to MongoDB or BSON

79 BI Connector

80 MongoDB Connector for BI Visualize and explore multi-dimensional documents using SQL-based BI tools. The connector does the following: Provides the BI tool with the schema of the MongoDB collection to be visualized Translates SQL statements issued by the BI tool into equivalent MongoDB queries that are sent to MongoDB for processing Converts the results into the tabular format expected by the BI tool, which can then visualize the data based on user requirements

81 Location & Flow of Data (diagram): application data is stored as documents in MongoDB; the BI Connector holds the table-mapping metadata and sits between MongoDB and the analytics & visualization tool, which sees tables.

82 Defining Data Mapping mongodrdl generates a DRDL mapping file from a collection; mongobischema imports it into the PostgreSQL-based, MongoDB-specific Foreign Data Wrapper

mongodrdl --host <host> --port <port> -d mydbname \
    -o mydrdlfile.drdl
mongobischema import mycollectionname mydrdlfile.drdl

83 Optionally Manually Edit DRDL File Redact attributes Use more appropriate types (sampling can get it wrong) Rename tables (v1.1+) Rename columns (v1.1+) Build new views using the MongoDB Aggregation Framework, e.g., $lookup to join 2 tables

- table: homesales
  collection: homesales
  pipeline: []
  columns:
  - name: _id
    mongotype: bson.ObjectId
    sqlname: _id
    sqltype: varchar
  - name: address.county
    mongotype: string
    sqlname: address_county
    sqltype: varchar
  - name: address.nameornumber
    mongotype: int
    sqlname: address_nameornumber
    sqltype: varchar

84 Summary

85 Analytics in MongoDB? YES! Create Read Update Delete Group Count Derive Values Filter Average Sort

86 Aggregation Framework Use Cases Complex aggregation queries Ad-hoc reporting Real-time analytics Visualizing and reshaping data

87 Questions?



Database System Concepts Chapter 13: Query Processing s Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2008/2009 Slides (fortemente) baseados nos slides oficiais do livro c Silberschatz, Korth

More information

Scalable Web Programming. CS193S - Jan Jannink - 2/25/10

Scalable Web Programming. CS193S - Jan Jannink - 2/25/10 Scalable Web Programming CS193S - Jan Jannink - 2/25/10 Weekly Syllabus 1.Scalability: (Jan.) 2.Agile Practices 3.Ecology/Mashups 4.Browser/Client 7.Analytics 8.Cloud/Map-Reduce 9.Published APIs: (Mar.)*

More information

Expert Lecture plan proposal Hadoop& itsapplication

Expert Lecture plan proposal Hadoop& itsapplication Expert Lecture plan proposal Hadoop& itsapplication STARTING UP WITH BIG Introduction to BIG Data Use cases of Big Data The Big data core components Knowing the requirements, knowledge on Analyst job profile

More information

Group13: Siddhant Deshmukh, Sudeep Rege, Sharmila Prakash, Dhanusha Varik

Group13: Siddhant Deshmukh, Sudeep Rege, Sharmila Prakash, Dhanusha Varik Group13: Siddhant Deshmukh, Sudeep Rege, Sharmila Prakash, Dhanusha Varik mongodb (humongous) Introduction What is MongoDB? Why MongoDB? MongoDB Terminology Why Not MongoDB? What is MongoDB? DOCUMENT STORE

More information

NoSQL. Based on slides by Mike Franklin and Jimmy Lin

NoSQL. Based on slides by Mike Franklin and Jimmy Lin NoSQL Based on slides by Mike Franklin and Jimmy Lin 1 Big Data (some old numbers) Facebook: 130TB/day: user logs 200-400TB/day: 83 million pictures Google: > 25 PB/day processed data Gene sequencing:

More information

Talend Big Data Sandbox. Big Data Insights Cookbook

Talend Big Data Sandbox. Big Data Insights Cookbook Overview Pre-requisites Setup & Configuration Hadoop Distribution Download Demo (Scenario) Overview Pre-requisites Setup & Configuration Hadoop Distribution Demo (Scenario) About this cookbook What is

More information

Data pipelines with PostgreSQL & Kafka

Data pipelines with PostgreSQL & Kafka Data pipelines with PostgreSQL & Kafka Oskari Saarenmaa PostgresConf US 2018 - Jersey City Agenda 1. Introduction 2. Data pipelines, old and new 3. Apache Kafka 4. Sample data pipeline with Kafka & PostgreSQL

More information

Query Processing. Introduction to Databases CompSci 316 Fall 2017

Query Processing. Introduction to Databases CompSci 316 Fall 2017 Query Processing Introduction to Databases CompSci 316 Fall 2017 2 Announcements (Tue., Nov. 14) Homework #3 sample solution posted in Sakai Homework #4 assigned today; due on 12/05 Project milestone #2

More information

Copyright 2013, Oracle and/or its affiliates. All rights reserved.

Copyright 2013, Oracle and/or its affiliates. All rights reserved. 1 Oracle NoSQL Database: Release 3.0 What s new and why you care Dave Segleau NoSQL Product Manager The following is intended to outline our general product direction. It is intended for information purposes

More information

Data Partitioning and MapReduce

Data Partitioning and MapReduce Data Partitioning and MapReduce Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies,

More information

Open source, high performance database. July 2012

Open source, high performance database. July 2012 Open source, high performance database July 2012 1 Quick introduction to mongodb Data modeling in mongodb, queries, geospatial, updates and map reduce. Using a location-based app as an example Example

More information

1 Dulcian, Inc., 2001 All rights reserved. Oracle9i Data Warehouse Review. Agenda

1 Dulcian, Inc., 2001 All rights reserved. Oracle9i Data Warehouse Review. Agenda Agenda Oracle9i Warehouse Review Dulcian, Inc. Oracle9i Server OLAP Server Analytical SQL Mining ETL Infrastructure 9i Warehouse Builder Oracle 9i Server Overview E-Business Intelligence Platform 9i Server:

More information

1Z Oracle Big Data 2017 Implementation Essentials Exam Summary Syllabus Questions

1Z Oracle Big Data 2017 Implementation Essentials Exam Summary Syllabus Questions 1Z0-449 Oracle Big Data 2017 Implementation Essentials Exam Summary Syllabus Questions Table of Contents Introduction to 1Z0-449 Exam on Oracle Big Data 2017 Implementation Essentials... 2 Oracle 1Z0-449

More information

Unifying Big Data Workloads in Apache Spark

Unifying Big Data Workloads in Apache Spark Unifying Big Data Workloads in Apache Spark Hossein Falaki @mhfalaki Outline What s Apache Spark Why Unification Evolution of Unification Apache Spark + Databricks Q & A What s Apache Spark What is Apache

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

Impala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam

Impala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam Impala A Modern, Open Source SQL Engine for Hadoop Yogesh Chockalingam Agenda Introduction Architecture Front End Back End Evaluation Comparison with Spark SQL Introduction Why not use Hive or HBase?

More information

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Query Processing Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Slides re-used with some modification from www.db-book.com Reference: Database System Concepts, 6 th Ed. By Silberschatz,

More information

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS SUJEE MANIYAM FOUNDER / PRINCIPAL @ ELEPHANT SCALE www.elephantscale.com sujee@elephantscale.com HI, I M SUJEE MANIYAM Founder / Principal @ ElephantScale

More information

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance

More information

MongoDB Schema Design

MongoDB Schema Design MongoDB Schema Design Demystifying document structures in MongoDB Jon Tobin @jontobs MongoDB Overview NoSQL Document Oriented DB Dynamic Schema HA/Sharding Built In Simple async replication setup Automated

More information

Accelerate Database Performance and Reduce Response Times in MongoDB Humongous Environments with the LSI Nytro MegaRAID Flash Accelerator Card

Accelerate Database Performance and Reduce Response Times in MongoDB Humongous Environments with the LSI Nytro MegaRAID Flash Accelerator Card Accelerate Database Performance and Reduce Response Times in MongoDB Humongous Environments with the LSI Nytro MegaRAID Flash Accelerator Card The Rise of MongoDB Summary One of today s growing database

More information

It also suggests creating more reduce tasks to deal with problems in keeping reducer inputs in memory.

It also suggests creating more reduce tasks to deal with problems in keeping reducer inputs in memory. Reference 1. Processing Theta-Joins using MapReduce, Alper Okcan, Mirek Riedewald Northeastern University 1.1. Video of presentation on above paper 2. Optimizing Joins in map-reduce environment F.N. Afrati,

More information

CSE 190D Spring 2017 Final Exam Answers

CSE 190D Spring 2017 Final Exam Answers CSE 190D Spring 2017 Final Exam Answers Q 1. [20pts] For the following questions, clearly circle True or False. 1. The hash join algorithm always has fewer page I/Os compared to the block nested loop join

More information

Introduction to HDFS and MapReduce

Introduction to HDFS and MapReduce Introduction to HDFS and MapReduce Who Am I - Ryan Tabora - Data Developer at Think Big Analytics - Big Data Consulting - Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc. 2 Who Am I -

More information

Azure Data Factory VS. SSIS. Reza Rad, Consultant, RADACAD

Azure Data Factory VS. SSIS. Reza Rad, Consultant, RADACAD Azure Data Factory VS. SSIS Reza Rad, Consultant, RADACAD 2 Please silence cell phones Explore Everything PASS Has to Offer FREE ONLINE WEBINAR EVENTS FREE 1-DAY LOCAL TRAINING EVENTS VOLUNTEERING OPPORTUNITIES

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

MONGODB INTERVIEW QUESTIONS

MONGODB INTERVIEW QUESTIONS MONGODB INTERVIEW QUESTIONS http://www.tutorialspoint.com/mongodb/mongodb_interview_questions.htm Copyright tutorialspoint.com Dear readers, these MongoDB Interview Questions have been designed specially

More information

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015 Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL May 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document

More information

Apache Ignite TM - In- Memory Data Fabric Fast Data Meets Open Source

Apache Ignite TM - In- Memory Data Fabric Fast Data Meets Open Source Apache Ignite TM - In- Memory Data Fabric Fast Data Meets Open Source DMITRIY SETRAKYAN Founder, PPMC https://ignite.apache.org @apacheignite @dsetrakyan Agenda About In- Memory Computing Apache Ignite

More information

T-SQL Training: T-SQL for SQL Server for Developers

T-SQL Training: T-SQL for SQL Server for Developers Duration: 3 days T-SQL Training Overview T-SQL for SQL Server for Developers training teaches developers all the Transact-SQL skills they need to develop queries and views, and manipulate data in a SQL

More information

Instructor : Dr. Sunnie Chung. Independent Study Spring Pentaho. 1 P a g e

Instructor : Dr. Sunnie Chung. Independent Study Spring Pentaho. 1 P a g e ABSTRACT Pentaho Business Analytics from different data source, Analytics from csv/sql,create Star Schema Fact & Dimension Tables, kettle transformation for big data integration, MongoDB kettle Transformation,

More information

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI)

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI) CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI) The Certificate in Software Development Life Cycle in BIGDATA, Business Intelligence and Tableau program

More information

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES 1 THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES Vincent Garonne, Mario Lassnig, Martin Barisits, Thomas Beermann, Ralph Vigne, Cedric Serfon Vincent.Garonne@cern.ch ph-adp-ddm-lab@cern.ch XLDB

More information

Big Data Infrastructure at Spotify

Big Data Infrastructure at Spotify Big Data Infrastructure at Spotify Wouter de Bie Team Lead Data Infrastructure September 26, 2013 2 Who am I? According to ZDNet: "The work they have done to improve the Apache Hive data warehouse system

More information

Introduction to BigData, Hadoop:-

Introduction to BigData, Hadoop:- Introduction to BigData, Hadoop:- Big Data Introduction: Hadoop Introduction What is Hadoop? Why Hadoop? Hadoop History. Different types of Components in Hadoop? HDFS, MapReduce, PIG, Hive, SQOOP, HBASE,

More information

Part 1: Indexes for Big Data

Part 1: Indexes for Big Data JethroData Making Interactive BI for Big Data a Reality Technical White Paper This white paper explains how JethroData can help you achieve a truly interactive interactive response time for BI on big data,

More information

Cloud Analytics and Business Intelligence on AWS

Cloud Analytics and Business Intelligence on AWS Cloud Analytics and Business Intelligence on AWS Enterprise Applications Virtual Desktops Sharing & Collaboration Platform Services Analytics Hadoop Real-time Streaming Data Machine Learning Data Warehouse

More information

Hadoop, Yarn and Beyond

Hadoop, Yarn and Beyond Hadoop, Yarn and Beyond 1 B. R A M A M U R T H Y Overview We learned about Hadoop1.x or the core. Just like Java evolved, Java core, Java 1.X, Java 2.. So on, software and systems evolve, naturally.. Lets

More information

Data Analytics at Logitech Snowflake + Tableau = #Winning

Data Analytics at Logitech Snowflake + Tableau = #Winning Welcome # T C 1 8 Data Analytics at Logitech Snowflake + Tableau = #Winning Avinash Deshpande I am a futurist, scientist, engineer, designer, data evangelist at heart Find me at Avinash Deshpande Chief

More information

Chapter 13: Query Processing

Chapter 13: Query Processing Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing

More information

Making the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack. Chief Architect RainStor

Making the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack. Chief Architect RainStor Making the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack Chief Architect RainStor Agenda Importance of Hadoop + data compression Data compression techniques Compression,

More information

Introduction to Database Services

Introduction to Database Services Introduction to Database Services Shaun Pearce AWS Solutions Architect 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved Today s agenda Why managed database services? A non-relational

More information

Oracle Big Data Connectors

Oracle Big Data Connectors Oracle Big Data Connectors Oracle Big Data Connectors is a software suite that integrates processing in Apache Hadoop distributions with operations in Oracle Database. It enables the use of Hadoop to process

More information

MapReduce programming model

MapReduce programming model MapReduce programming model technology basics for data scientists Spring - 2014 Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Warning! Slides are only for presenta8on guide We will discuss+debate

More information

Certified Big Data Hadoop and Spark Scala Course Curriculum

Certified Big Data Hadoop and Spark Scala Course Curriculum Certified Big Data Hadoop and Spark Scala Course Curriculum The Certified Big Data Hadoop and Spark Scala course by DataFlair is a perfect blend of indepth theoretical knowledge and strong practical skills

More information

Fluentd + MongoDB + Spark = Awesome Sauce

Fluentd + MongoDB + Spark = Awesome Sauce Fluentd + MongoDB + Spark = Awesome Sauce Nishant Sahay, Sr. Architect, Wipro Limited Bhavani Ananth, Tech Manager, Wipro Limited Your company logo here Wipro Open Source Practice: Vision & Mission Vision

More information

Big Data Analytics using Apache Hadoop and Spark with Scala

Big Data Analytics using Apache Hadoop and Spark with Scala Big Data Analytics using Apache Hadoop and Spark with Scala Training Highlights : 80% of the training is with Practical Demo (On Custom Cloudera and Ubuntu Machines) 20% Theory Portion will be important

More information

VOLTDB + HP VERTICA. page

VOLTDB + HP VERTICA. page VOLTDB + HP VERTICA ARCHITECTURE FOR FAST AND BIG DATA ARCHITECTURE FOR FAST + BIG DATA FAST DATA Fast Serve Analytics BIG DATA BI Reporting Fast Operational Database Streaming Analytics Columnar Analytics

More information

Introduction to Big-Data

Introduction to Big-Data Introduction to Big-Data Ms.N.D.Sonwane 1, Mr.S.P.Taley 2 1 Assistant Professor, Computer Science & Engineering, DBACER, Maharashtra, India 2 Assistant Professor, Information Technology, DBACER, Maharashtra,

More information

microsoft

microsoft 70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series

More information

Scaling Up 1 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. Hadoop, Pig

Scaling Up 1 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. Hadoop, Pig CSE 6242 / CX 4242 Scaling Up 1 Hadoop, Pig Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Le

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

Your First MongoDB Environment: What You Should Know Before Choosing MongoDB as Your Database

Your First MongoDB Environment: What You Should Know Before Choosing MongoDB as Your Database Your First MongoDB Environment: What You Should Know Before Choosing MongoDB as Your Database Me - @adamotonete Adamo Tonete Senior Technical Engineer Brazil Agenda What is MongoDB? The good side of MongoDB

More information

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu (University of California, Merced) Alin Dobra (University of Florida)

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu (University of California, Merced) Alin Dobra (University of Florida) DE: A Scalable Framework for Efficient Analytics Florin Rusu (University of California, Merced) Alin Dobra (University of Florida) Big Data Analytics Big Data Storage is cheap ($100 for 1TB disk) Everything

More information

Document Databases: MongoDB

Document Databases: MongoDB NDBI040: Big Data Management and NoSQL Databases hp://www.ksi.mff.cuni.cz/~svoboda/courses/171-ndbi040/ Lecture 9 Document Databases: MongoDB Marn Svoboda svoboda@ksi.mff.cuni.cz 28. 11. 2017 Charles University

More information

Cassandra, MongoDB, and HBase. Cassandra, MongoDB, and HBase. I have chosen these three due to their recent

Cassandra, MongoDB, and HBase. Cassandra, MongoDB, and HBase. I have chosen these three due to their recent Tanton Jeppson CS 401R Lab 3 Cassandra, MongoDB, and HBase Introduction For my report I have chosen to take a deeper look at 3 NoSQL database systems: Cassandra, MongoDB, and HBase. I have chosen these

More information