Scaling up Data Management: From Data to Big Data

1 Scaling up Data Management: From Data to Big Data

2 Data Management: Evolution
60s
o Access data in files
o Computerized databases started; shared access
o Network model (CODASYL): Integrated Data Store (IDS)
o Hierarchical model (IMS): Information Management System
o SABRE was created to manage airline reservations
70s
o Relational model
o ACM SIGMOD and VLDB started (1975)
o ER model
o System R, Ingres
o SQL
80s
o Databases for PCs
o DB2, Oracle, Sybase, Informix
o SQL standard
o RDBMS became a success
o Expert systems, OODBMS, distributed databases

3 Data Management: Evolution
90s
o Expensive products: databases for the rich
o Internet database connectors; features for spatial, temporal, multimedia data; active and deductive capabilities
o Exploit massively parallel processors
2000s
o Oracle, IBM and Microsoft are the major RDBMS vendors
o Main-memory databases
2010s
o Open source databases for all
o Big Data
o NoSQL: systems that typically do not attempt to provide full atomicity, consistency, isolation and durability (ACID) guarantees
o NewSQL: SQL + NoSQL

4 Data Management Software Revenue Global database market reached over $40 billion in 2015 Business analytics software market 2013: $37 billion

5 Big Data Technology A new forecast from International Data Corporation (IDC) sees the big data technology and services market growing at a compound annual growth rate (CAGR) of 23.1% over the forecast period with annual spending reaching $48.6 billion in

6 Big Data

7 4 Vs of Big Data

8 Big Data: New Applications Google: many billions of pages indexed, products, structured data Facebook: 1.5 billion users using the site each month Twitter: 517 million accounts, 320 million monthly active users, 500 million tweets/day

9 Big Data: New Computing Infrastructure Meet the cloud! [Hardware, Infrastructure, Platform] as a service Utility Computing: pay-as-you-go computing o Illusion of infinite resources o No up-front cost o Fine-grained billing (e.g., hourly)

10 Cloud Computing: Why Now? Experience with very large data centers o Unprecedented economies of scale o Transfer of risk Technology factors o Pervasive broadband Internet o Maturity in virtualization technology Business factors o Minimal capital expenditure o Pay-as-you-go billing model Agrawal et al., VLDB 2010 Tutorial

11 Warehouse Scale Computing Google's data center in Oregon 16 Million Nodes per building Agrawal et al., VLDB 2010 Tutorial

12 Economics of Cloud Users Pay by use instead of provisioning for peak [Figure: two plots of resources over time showing capacity and demand: a static data center, with unused resources, and a data center in the cloud] Agrawal et al., VLDB 2010 Tutorial. Slide Credits: Berkeley RAD Lab

13 Economics of Cloud Users Risk of over-provisioning: underutilization [Figure: resources over time for a static data center; the gap between capacity and demand is unused resources] Agrawal et al., VLDB 2010 Tutorial. Slide Credits: Berkeley RAD Lab

14 Economics of Cloud Users Heavy penalty for under-provisioning [Figure: three plots of resources over time (days) showing capacity and demand, with panels labeled lost revenue and lost users] Agrawal et al., VLDB 2010 Tutorial. Slide Credits: Berkeley RAD Lab
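The over- and under-provisioning trade-off on the last three slides can be made concrete with a little arithmetic. The following is a minimal sketch, not taken from the original deck: the function name `provisioning_report` and the hourly demand numbers are hypothetical and chosen only for illustration.

```python
def provisioning_report(demand, capacity):
    """Compare a fixed-capacity (static) data center against hourly demand.

    Returns (unused, unmet):
      unused = server-hours paid for but left idle (over-provisioning)
      unmet  = server-hours of demand that could not be served (under-provisioning)
    """
    unused = sum(max(capacity - d, 0) for d in demand)
    unmet = sum(max(d - capacity, 0) for d in demand)
    return unused, unmet


# Hypothetical demand: servers needed in each of six hours (illustrative only).
demand = [20, 30, 80, 100, 60, 30]

# Static data center provisioned for the peak: nothing is lost, much is idle.
peak_unused, peak_unmet = provisioning_report(demand, capacity=100)

# Static data center provisioned below the peak: less idle, but demand is lost.
low_unused, low_unmet = provisioning_report(demand, capacity=50)

# Pay-as-you-go: under the idealized assumption of perfect elasticity,
# the user pays only for the server-hours actually used.
pay_per_use = sum(demand)

print("peak provisioning :", peak_unused, "idle,", peak_unmet, "unmet")
print("low provisioning  :", low_unused, "idle,", low_unmet, "unmet")
print("pay-per-use total :", pay_per_use, "server-hours")
```

With these made-up numbers, provisioning for the peak pays for 600 server-hours to serve 320, while provisioning for 50 servers leaves 90 server-hours of demand unserved.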

15 Cloud Computing: Hype or Reality Unlike the earlier attempts: o Distributed Computing o Distributed Databases o Grid Computing Cloud Computing is REAL: o Organic growth: Google, Yahoo, Microsoft, and Amazon o Poised to be an integral aspect of National Infrastructure in US and elsewhere Agrawal et al., VLDB 2010 Tutorial

16 Cloud Computing Modalities Can we outsource our IT software and hardware infrastructure? Hosted applications and services Pay-as-you-go model Scalability, fault-tolerance, elasticity, and self-manageability We have terabytes of click-stream data: what can we do with it? Very large data repositories Complex analysis Distributed and parallel data processing Agrawal et al., VLDB 2010 Tutorial

17 Why Data Analysis? Businesses have been doing this for a long time! Who are my customers and what products are they buying? Which customers are most likely to go to the competition? What is the most effective distribution channel? Who are our lowest/highest margin customers? What product promotions have the biggest impact on revenue? What impact will new products/services have on revenue and margins? Agrawal et al., VLDB 2010 Tutorial

18 Decision Support Data analysis in the enterprise context emerged: o As a tool to build decision support systems o Data-centric decision making instead of using intuition o New term: Business Intelligence Used to manage and control business Data is historical or point-in-time Optimized for inquiry rather than update Use of the system is loosely defined and can be ad-hoc Used by managers and end-users to understand the business and make judgments Agrawal et al., VLDB 2010 Tutorial

19 Decision Support Traditional approach: o Decision makers wait for reports from disparate OLTP systems o Put it all together in a spreadsheet o Manual process There are many commercial systems that support analytics and decision support Agrawal et al., VLDB 2010 Tutorial

21 Analytics in the Big Data Era Lots of open data available on the Web! Data capture at the user interaction level: o In contrast to the client transaction level in the Enterprise context o The amount of data increases significantly o Need to analyze such data to understand user behavior Cannot afford expensive warehouse solutions

22 Why Data Analysis? Now, many more stakeholders want to do this too! What would be the impact of a fare change? Where are our lowest/highest margin passengers? What is the distribution of trip lengths? What is the quickest route from midtown to downtown at 4pm on Monday? What impact will the introduction of additional medallions have? Where should drivers go to get passengers?

23 Data Analytics in the Cloud Scalability to large data volumes: o Scan 100 TB on 1 node @ 50 MB/sec = 23 days o Scan 100 TB on 1000-node cluster = 33 minutes Divide-And-Conquer (i.e., data partitioning) Cost-efficiency: o Commodity nodes (cheap, but unreliable) o Commodity network o Automatic fault-tolerance (fewer admins) o Easy to use (fewer programmers) Agrawal et al., VLDB 2010 Tutorial
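The scan times quoted on this slide follow from simple arithmetic. The sketch below reproduces them under explicitly idealized assumptions that the slide does not spell out: decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), every node reading at the same 50 MB/sec, and perfect parallel speed-up with no coordination cost. The function name `scan_seconds` is hypothetical.

```python
def scan_seconds(data_bytes, nodes, mb_per_sec_per_node):
    """Time to scan data_bytes when it is split evenly across `nodes` readers.

    Assumes perfect parallelism: aggregate throughput is nodes * per-node rate.
    """
    throughput_bytes_per_sec = nodes * mb_per_sec_per_node * 1e6
    return data_bytes / throughput_bytes_per_sec


DATA_BYTES = 100e12  # 100 TB in decimal units

single_node_days = scan_seconds(DATA_BYTES, nodes=1, mb_per_sec_per_node=50) / 86400
cluster_minutes = scan_seconds(DATA_BYTES, nodes=1000, mb_per_sec_per_node=50) / 60

print(f"1 node:     {single_node_days:.1f} days")
print(f"1000 nodes: {cluster_minutes:.1f} minutes")
```

This gives roughly 23.1 days on one node and 33.3 minutes on 1000 nodes, matching the slide. Real clusters fall short of the ideal because of stragglers, network transfer, and scheduling overhead.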

24 Platforms for Large-scale Data Analysis Parallel DBMS technologies o Proposed in the late eighties o Matured over the last two decades o Multi-billion dollar industry: Proprietary DBMS Engines intended as Data Warehousing solutions for very large enterprises MapReduce o Pioneered by Google o Popularized by Yahoo! (open-source Hadoop) Agrawal et al., VLDB 2010 Tutorial

25 Parallel DBMS technologies Popularly used for more than two decades o Research Projects: Gamma, Grace, ... o Commercial: Multi-billion dollar industry but access to only a privileged few Relational Data Model Indexing Familiar SQL interface Advanced query optimization Well understood and studied Very reliable! Agrawal et al., VLDB 2010 Tutorial

26 Parallel Databases DBMS hides the complexity from the client application DBA does most of the work: data partitioning, optimization, etc.
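As an illustration of the data partitioning mentioned here, the following is a minimal sketch of hash partitioning, one common way to spread rows across nodes. It is not the scheme of any particular parallel DBMS; the function `hash_partition`, the `orders` rows, and the column name `cust` are all hypothetical.

```python
import zlib


def hash_partition(rows, key, num_nodes):
    """Assign each row to one of num_nodes partitions by hashing row[key].

    zlib.crc32 is used because it is deterministic across runs, unlike
    Python's built-in hash() for strings.
    """
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        node = zlib.crc32(str(row[key]).encode("utf-8")) % num_nodes
        partitions[node].append(row)
    return partitions


# Hypothetical table of 20 orders placed by 4 customers.
orders = [{"id": i, "cust": f"c{i % 4}"} for i in range(20)]
partitions = hash_partition(orders, key="cust", num_nodes=3)

for node, part in enumerate(partitions):
    print(node, sorted({row["cust"] for row in part}), len(part))
```

Because all rows with the same key land on the same node, a per-customer aggregate or a join on `cust` can be computed locally on each node without moving data.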

27 MapReduce Overview: o Data-parallel programming model o An associated parallel and distributed implementation for commodity clusters Pioneered by Google o Processing 20 PB of data per day (circa 2008) [Dean et al., OSDI 2004, CACM Jan 2008, CACM Jan 2010] Agrawal et al., VLDB 2010 Tutorial

28 Hadoop Open-source implementation of the MapReduce framework; an Apache project Used by Yahoo!, Facebook, Amazon, and the list is growing Key components o MapReduce - distributes applications o Hadoop Distributed File System (HDFS) - distributes data Hadoop Distributed File System (HDFS) o Stores big files across machines o Stores each file as a sequence of blocks o Blocks of a file are replicated for fault tolerance Distributes processing of large data across thousands of commodity machines You have to program your data processing and analysis
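The HDFS storage idea described above (a file is a sequence of blocks, and each block is replicated) can be sketched in a few lines. This is a toy model, not the HDFS implementation or API: the functions `split_into_blocks` and `place_replicas`, the node names, and the round-robin placement are hypothetical, and real HDFS uses a rack-aware placement policy. The 128 MB block size and replication factor of 3 are used here as assumed typical settings.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    num_blocks = (file_size + block_size - 1) // block_size  # ceiling division
    return [min(block_size, file_size - i * block_size) for i in range(num_blocks)]


def place_replicas(num_blocks, nodes, replication):
    """Toy round-robin placement: block b goes to `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


# Hypothetical 300 MB file on a hypothetical 5-node cluster.
blocks = split_into_blocks(300 * MB, block_size=128 * MB)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4", "n5"], replication=3)

for b, size in enumerate(blocks):
    print(f"block {b}: {size // MB} MB on {placement[b]}")
```

With three replicas per block, the loss of any single node still leaves two copies of every block, which is what lets the framework tolerate failures of cheap commodity machines.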

29 Word Count in Python

def word_count_dict(filename):
    """Returns a word/count dict for this filename."""
    # Utility used by count() and Topcount().
    word_count = {}  # Map each word to its count
    input_file = open(filename, 'r')
    for line in input_file:
        words = line.split()
        for word in words:
            word = word.lower()
            # Special case if we're seeing this word for the first time.
            if word not in word_count:
                word_count[word] = 1
            else:
                word_count[word] = word_count[word] + 1
    input_file.close()  # Not strictly required, but good form.
    return word_count

30 MapReduce Programming Model Borrows primitives from functional programming Users should implement two primary methods: o Map: (key1, val1) → [(key2, val2)] o Reduce: (key2, [val, val, val, ...]) → [(key3, val3)] Kyuseok Shim (VLDB 2012 TUTORIAL)

31 Word Counting with MapReduce
Documents:
Doc1: Financial, IMF, Economics, Crisis
Doc2: Financial, IMF, Crisis
Doc3: Economics, Harry
Doc4: Financial, Harry, Potter, Film
Doc5: Crisis, Harry, Potter
Map output (key, value):
M1 (Doc1, Doc2): (Financial, 1), (IMF, 1), (Economics, 1), (Crisis, 1), (Financial, 1), (IMF, 1), (Crisis, 1)
M2 (Doc3, Doc4, Doc5): (Economics, 1), (Harry, 1), (Financial, 1), (Harry, 1), (Potter, 1), (Film, 1), (Crisis, 1), (Harry, 1), (Potter, 1)
Kyuseok Shim (VLDB 2012 TUTORIAL)

32 Word Counting with MapReduce
Before reduce functions are called, for each distinct key, a list of associated values is generated:
Financial: [1, 1, 1]
IMF: [1, 1]
Economics: [1, 1]
Crisis: [1, 1, 1]
Harry: [1, 1, 1]
Film: [1]
Potter: [1, 1]
Reduce output (key, value): (Financial, 3), (IMF, 2), (Economics, 2), (Crisis, 3), (Harry, 3), (Film, 1), (Potter, 2)
Kyuseok Shim (VLDB 2012 TUTORIAL)
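The map, group-by-key, and reduce steps of the word-counting example can be simulated on a single machine. The sketch below is not Hadoop code and uses no Hadoop API: `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names, and the grouping that a real framework performs across machines is done here with an in-memory dictionary. The input is the five documents from the slides.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: (doc_id, text) -> [(word, 1)] for every word in the document."""
    for word in text.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: (word, [1, 1, ...]) -> [(word, total)]."""
    yield word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)  # "shuffle": group values by key
    output = []
    for out_key in sorted(groups):
        output.extend(reduce_fn(out_key, groups[out_key]))
    return output


docs = [
    ("Doc1", "Financial IMF Economics Crisis"),
    ("Doc2", "Financial IMF Crisis"),
    ("Doc3", "Economics Harry"),
    ("Doc4", "Financial Harry Potter Film"),
    ("Doc5", "Crisis Harry Potter"),
]

counts = dict(run_mapreduce(docs, map_fn, reduce_fn))
print(counts)
```

The user writes only `map_fn` and `reduce_fn`; everything inside `run_mapreduce` is what a framework such as Hadoop takes over, including partitioning, scheduling, and recovery from machine failures.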

33 MapReduce Advantages Automatic Parallelization: o Depending on the size of RAW INPUT DATA → instantiate multiple MAP tasks o Similarly, depending upon the number of intermediate <key, value> partitions → instantiate multiple REDUCE tasks Run-time: o Data partitioning o Task scheduling o Handling machine failures o Managing inter-machine communication Completely transparent to the programmer/analyst/user Agrawal et al., VLDB 2010 Tutorial

34 MapReduce Experience Runs on large commodity clusters: o 1000s to 10,000s of machines Processes many terabytes of data Easy to use since run-time complexity hidden from the users 1000s of MR jobs/day at Google (circa 2004) 100s of MR programs implemented (circa 2004) Agrawal et al., VLDB 2010 Tutorial

35 The Need Special-purpose programs to process large amounts of data: crawled documents, Web query logs, etc. At Google and others (Yahoo!, Facebook): o Inverted index o Graph structure of the Web documents or social network o Summaries of #pages/host, set of frequent queries, etc. o Ad optimization o Spam filtering o ... Agrawal et al., VLDB 2010 Tutorial

36 Takeaway MapReduce's data-parallel programming model hides complexity of distribution and fault tolerance Principal philosophies: o Make it scale, so you can throw hardware at problems o Make it cheap, saving hardware, programmer and administration costs (but requiring fault tolerance) MapReduce is not suitable for all problems, but when it works, it may save you a lot of time Agrawal et al., VLDB 2010 Tutorial

37 MapReduce vs Parallel DBMS
Feature                                            Parallel DBMS               MapReduce
Schema Support                                     ✓                           Not out of the box
Indexing                                           ✓                           Not out of the box
Programming Model                                  Declarative (SQL)           Imperative (C/C++, Java, ...); extensions through Pig and Hive
Optimizations (Compression, Query Optimization)    ✓                           Not out of the box
Flexibility                                        Not out of the box          ✓
Fault Tolerance                                    Coarse grained techniques   ✓
[Pavlo et al., SIGMOD 2009; Stonebraker et al., CACM 2010; ...] Agrawal et al., VLDB 2010 Tutorial

38 MapReduce: A step backwards? Don't need 1000 nodes to process petabytes: o Parallel DBs do it in fewer than 100 nodes No support for schema: o Sharing across multiple MR programs is difficult No indexing: o Wasteful access to unnecessary data Non-declarative programming model: o Requires highly-skilled programmers No support for JOINs: o Requires multiple MR phases for the analysis We will study this in more detail! Agrawal et al., VLDB 2010 Tutorial

39 MapReduce and Big Data MapReduce programming model Hadoop infrastructure HDFS, NoSQL stores Data management and query processing in Hadoop environments Spark: processing engine compatible with Hadoop data o Supports streaming data, interactive queries, and machine learning SQL vs. NoSQL: Big Data Hype and Reality [Tutorial by C. Mohan] o Need to look back at the lessons learned in database design o ...

40 Analysis and Mining

41 Data Mining Discovery of patterns and models that are o Valid: applicable to new data with some certainty o Useful o Unexpected o Understandable to people Confluence of different areas: databases, machine learning, visualization, statistics We will study aspects from these areas, but focus on: o Scalability o Algorithms and architectures

42 Data Analysis and Mining Many challenges, even when data is not big Data cleaning and curation: Bad data → bad results o Detection and correction of errors in data, e.g., number of passengers = 255, taxis in the river o Entity resolution and disambiguation, e.g., apple the fruit vs. Apple the company

43 Data Analysis and Mining Many challenges, even when data is not big Data cleaning and curation: Bad data → bad results o Detection and correction of errors in data, e.g., number of passengers = 255, taxis in the river o Entity resolution and disambiguation, e.g., apple the fruit vs. Apple the company Sometimes it can be hard to distinguish between errors and outliers!

44 Data Analysis and Mining Many challenges, even when data is not big Data cleaning and curation: o Detection and correction of errors in data, e.g., number of passengers = 255, taxis in the river o Entity resolution and disambiguation, e.g., apple the fruit vs. Apple the company Sometimes it can be hard to distinguish between errors and outliers! Visualization: Pictures help us to think o Substitute perception for cognition o External memory: free up limited cognitive/memory resources for higher-level problems Mining: Discovery of useful, possibly unexpected, patterns in data

45 Data Analysis and Mining In exploratory tasks, change is the norm! o Data analysis and mining are iterative processes o Many trial-and-error steps [Figure: exploration loop linking Data, Process, Data Product, Perception & Cognition, Knowledge, Specification, Exploration, Data Manipulation, and User; modified from J. van Wijk, IEEE Vis 2005]

46 Data Analysis and Mining In exploratory tasks, change is the norm! o Data analysis and mining are iterative processes o Many trial-and-error steps, easy to get lost Need to manage the data exploration process: o Guide users: support for reflective reasoning o Need provenance for reproducibility [Freire et al., CISE 2008] [Figure: exploration loop linking Data, Process, Data Product, Perception & Cognition, Knowledge, Specification, Exploration, Data Manipulation, and User; modified from J. van Wijk, IEEE Vis 2005]

47 Sharing and Collaboration Result transparency o Show me your work! o Allow results to be verified → trust the results Keep track of what you do and the steps you follow: the provenance of your work Hard data science problems require people with different expertise to collaborate o Need to share work, but this can be challenging o E.g., A sends their analysis script to B, but B cannot run it: missing or incorrect versions of libraries; hard-coded file names such as /home/a/myinputfile.txt Follow best practices for sharing and reproducibility
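Two of the problems named on this slide, hard-coded file names and unknown library versions, have simple remedies that can be sketched as follows. This is one possible approach, not a practice prescribed by the deck: the function names `parse_args`, `environment_summary`, and `count_lines` are hypothetical, and the analysis itself is a trivial stand-in.

```python
import argparse
import platform
import sys


def parse_args(argv=None):
    """Take the input path from the command line instead of hard-coding it."""
    parser = argparse.ArgumentParser(description="Count lines in an input file.")
    parser.add_argument("input_file", help="path to the input data file")
    return parser.parse_args(argv)


def environment_summary():
    """Record basic provenance: which interpreter and platform produced a result."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "command": list(sys.argv),
    }


def count_lines(path):
    """Stand-in for the real analysis step."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)


def main(argv=None):
    args = parse_args(argv)
    print("environment:", environment_summary())
    print("lines:", count_lines(args.input_file))
```

A collaborator would call `main(["their/own/input.txt"])` or wire `main()` to the command line, so the script no longer depends on one person's home directory. Pinning library versions in a requirements file alongside the script addresses the missing-library problem in the same spirit.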

48 Analyzing and Mining Big Data: Issues Scalability for algorithms and computations: need to design/extend algorithms to leverage new computing model o We will cover this in the third module of our course A big data-mining risk is that you will discover patterns that are meaningless: watch out for bogus patterns/events The Bonferroni correction gives a statistically sound way to avoid most of these bogus positive responses

49 Bonferroni's Principle Calculate the expected number of occurrences of the events you are looking for, assuming that the data is random. If this number is significantly larger than the number of real instances you hope to find, then you must expect almost anything you find to be bogus, i.e., a statistical artifact rather than evidence of what you are looking for. Read textbook! o Chapter 1 of Mining of Massive Datasets
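A worked calculation shows how the principle is applied. The sketch below estimates how many "suspicious" coincidences pure chance would produce: pairs of people who were at the same hotel on two different days. The scenario and all of its parameters (one billion people, 1000 days, 100,000 hotels, a 1% chance of a hotel visit per person per day) are assumptions, modeled on the kind of illustrative example used in the textbook chapter, and the variable names are hypothetical.

```python
from math import comb

# Assumed scenario parameters (illustrative, not measured data).
people = 10**9   # population being tracked
days = 1000      # observation period
hotels = 10**5   # number of hotels
p_visit = 0.01   # chance a given person is in some hotel on a given day

# Chance that two specific people are both in a hotel on a given day,
# and that it is the same hotel, assuming independent random behavior.
p_same_hotel_one_day = p_visit * p_visit / hotels

# Chance that this happens on two specific, different days.
p_same_hotel_two_days = p_same_hotel_one_day ** 2

# Expected number of (pair of people, pair of days) coincidences under randomness.
expected_coincidences = comb(people, 2) * comb(days, 2) * p_same_hotel_two_days

print(f"expected random coincidences: {expected_coincidences:,.0f}")
```

Under these assumptions, chance alone produces on the order of a quarter of a million such pairs. If only a handful of real cases are expected, nearly every pair the search flags would be a statistical artifact, which is exactly the situation the principle warns about.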

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

CSE6331: Cloud Computing

CSE6331: Cloud Computing CSE6331: Cloud Computing Leonidas Fegaras University of Texas at Arlington c 2019 by Leonidas Fegaras Cloud Computing Fundamentals Based on: J. Freire s class notes on Big Data http://vgc.poly.edu/~juliana/courses/bigdata2016/

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

Hadoop/MapReduce Computing Paradigm

Hadoop/MapReduce Computing Paradigm Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications

More information

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344 Where We Are Introduction to Data Management CSE 344 Lecture 22: MapReduce We are talking about parallel query processing There exist two main types of engines: Parallel DBMSs (last lecture + quick review)

More information

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

Hadoop vs. Parallel Databases. Juliana Freire!

Hadoop vs. Parallel Databases. Juliana Freire! Hadoop vs. Parallel Databases Juliana Freire! The Debate Starts The Debate Continues A comparison of approaches to large-scale data analysis. Pavlo et al., SIGMOD 2009! o Parallel DBMS beats MapReduce

More information

5 Fundamental Strategies for Building a Data-centered Data Center

5 Fundamental Strategies for Building a Data-centered Data Center 5 Fundamental Strategies for Building a Data-centered Data Center June 3, 2014 Ken Krupa, Chief Field Architect Gary Vidal, Solutions Specialist Last generation Reference Data Unstructured OLTP Warehouse

More information

A Review Paper on Big data & Hadoop

A Review Paper on Big data & Hadoop A Review Paper on Big data & Hadoop Rupali Jagadale MCA Department, Modern College of Engg. Modern College of Engginering Pune,India rupalijagadale02@gmail.com Pratibha Adkar MCA Department, Modern College

More information

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism Big Data and Hadoop with Azure HDInsight Andrew Brust Senior Director, Technical Product Marketing and Evangelism Datameer Level: Intermediate Meet Andrew Senior Director, Technical Product Marketing and

More information

Page 1. Goals for Today" Background of Cloud Computing" Sources Driving Big Data" CS162 Operating Systems and Systems Programming Lecture 24

Page 1. Goals for Today Background of Cloud Computing Sources Driving Big Data CS162 Operating Systems and Systems Programming Lecture 24 Goals for Today" CS162 Operating Systems and Systems Programming Lecture 24 Capstone: Cloud Computing" Distributed systems Cloud Computing programming paradigms Cloud Computing OS December 2, 2013 Anthony

More information

The MapReduce Framework

The MapReduce Framework The MapReduce Framework In Partial fulfilment of the requirements for course CMPT 816 Presented by: Ahmed Abdel Moamen Agents Lab Overview MapReduce was firstly introduced by Google on 2004. MapReduce

More information

Challenges for Data Driven Systems

Challenges for Data Driven Systems Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Data Centric Systems and Networking Emergence of Big Data Shift of Communication Paradigm From end-to-end to data

More information

Mobile Cloud Computing

Mobile Cloud Computing MTAT.03.262 -Mobile Application Development Lecture 8 Mobile Cloud Computing Satish Srirama, Huber Flores satish.srirama@ut.ee Outline Cloud Computing Mobile Cloud Access schemes HomeAssignment3 10/20/2014

More information

Big Data landscape Lecture #2

Big Data landscape Lecture #2 Big Data landscape Lecture #2 Contents 1 1 CORE Technologies 2 3 MapReduce YARN 4 SparK 5 Cassandra Contents 2 16 HBase 72 83 Accumulo memcached 94 Blur 10 5 Sqoop/Flume Contents 3 111 MongoDB 12 2 13

More information

Modern Database Concepts

Modern Database Concepts Modern Database Concepts Introduction to the world of Big Data Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz What is Big Data? buzzword? bubble? gold rush? revolution? Big data is like teenage

More information

Big Data and Cloud Computing

Big Data and Cloud Computing Big Data and Cloud Computing Presented at Faculty of Computer Science University of Murcia Presenter: Muhammad Fahim, PhD Department of Computer Eng. Istanbul S. Zaim University, Istanbul, Turkey About

More information

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel

More information

Cloud Computing & Visualization

Cloud Computing & Visualization Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International

More information

Tutorial Outline. Map/Reduce vs. DBMS. MR vs. DBMS [DeWitt and Stonebraker 2008] Acknowledgements. MR is a step backwards in database access

Tutorial Outline. Map/Reduce vs. DBMS. MR vs. DBMS [DeWitt and Stonebraker 2008] Acknowledgements. MR is a step backwards in database access Map/Reduce vs. DBMS Sharma Chakravarthy Information Technology Laboratory Computer Science and Engineering Department The University of Texas at Arlington, Arlington, TX 76009 Email: sharma@cse.uta.edu

More information

Agenda. AWS Database Services Traditional vs AWS Data services model Amazon RDS Redshift DynamoDB ElastiCache

Agenda. AWS Database Services Traditional vs AWS Data services model Amazon RDS Redshift DynamoDB ElastiCache Databases on AWS 2017 Amazon Web Services, Inc. and its affiliates. All rights served. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon Web Services,

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri and Shamkant B. Navathe CHAPTER 1 Databases and Database Users Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Slide 1-2 OUTLINE Types of Databases and Database Applications

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 26: Parallel Databases and MapReduce CSE 344 - Winter 2013 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Cluster will run in Amazon s cloud (AWS)

More information

Scalable Web Programming. CS193S - Jan Jannink - 2/25/10

Scalable Web Programming. CS193S - Jan Jannink - 2/25/10 Scalable Web Programming CS193S - Jan Jannink - 2/25/10 Weekly Syllabus 1.Scalability: (Jan.) 2.Agile Practices 3.Ecology/Mashups 4.Browser/Client 7.Analytics 8.Cloud/Map-Reduce 9.Published APIs: (Mar.)*

More information

Acknowledgements. Beyond DBMSs. Presentation Outline

Acknowledgements. Beyond DBMSs. Presentation Outline Acknowledgements Beyond RDBMSs These slides are put together from a variety of sources (both papers and slides/tutorials available on the web) Sharma Chakravarthy Information Technology Laboratory Computer

More information

Microsoft Big Data and Hadoop

Microsoft Big Data and Hadoop Microsoft Big Data and Hadoop Lara Rubbelke @sqlgal Cindy Gross @sqlcindy 2 The world of data is changing The 4Vs of Big Data http://nosql.mypopescu.com/post/9621746531/a-definition-of-big-data 3 Common

More information

Webinar Series TMIP VISION

Webinar Series TMIP VISION Webinar Series TMIP VISION TMIP provides technical support and promotes knowledge and information exchange in the transportation planning and modeling community. Today s Goals To Consider: Parallel Processing

More information

Advanced Database Technologies NoSQL: Not only SQL

Advanced Database Technologies NoSQL: Not only SQL Advanced Database Technologies NoSQL: Not only SQL Christian Grün Database & Information Systems Group NoSQL Introduction 30, 40 years history of well-established database technology all in vain? Not at

More information

Acquiring Big Data to Realize Business Value

Acquiring Big Data to Realize Business Value Acquiring Big Data to Realize Business Value Agenda What is Big Data? Common Big Data technologies Use Case Examples Oracle Products in the Big Data space In Summary: Big Data Takeaways

More information

Safe Harbor Statement

Safe Harbor Statement Safe Harbor Statement The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment

More information

DATA SCIENCE USING SPARK: AN INTRODUCTION

DATA SCIENCE USING SPARK: AN INTRODUCTION DATA SCIENCE USING SPARK: AN INTRODUCTION TOPICS COVERED Introduction to Spark Getting Started with Spark Programming in Spark Data Science with Spark What next? 2 DATA SCIENCE PROCESS Exploratory Data

More information

CS 6240: Parallel Data Processing in MapReduce: Module 1. Mirek Riedewald

CS 6240: Parallel Data Processing in MapReduce: Module 1. Mirek Riedewald CS 6240: Parallel Data Processing in MapReduce: Module 1 Mirek Riedewald Why Parallel Processing? Answer 1: Big Data 2 How Much Information? Source: http://www2.sims.berkeley.edu/research/projects/ho w-much-info-2003/execsum.htm

More information

745: Advanced Database Systems

745: Advanced Database Systems 745: Advanced Database Systems Yanlei Diao University of Massachusetts Amherst Outline Overview of course topics Course requirements Database Management Systems 1. Online Analytical Processing (OLAP) vs.

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? Simple to start What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? What is the maximum download speed you get? Simple computation

More information

CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED DATA PLATFORM

CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED DATA PLATFORM CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED PLATFORM Executive Summary Financial institutions have implemented and continue to implement many disparate applications

More information

Big Data Infrastructure at Spotify

Big Data Infrastructure at Spotify Big Data Infrastructure at Spotify Wouter de Bie Team Lead Data Infrastructure September 26, 2013 2 Who am I? According to ZDNet: "The work they have done to improve the Apache Hive data warehouse system

More information

Big Data with Hadoop Ecosystem

Big Data with Hadoop Ecosystem Diógenes Pires Big Data with Hadoop Ecosystem Hands-on (HBase, MySql and Hive + Power BI) Internet Live http://www.internetlivestats.com/ Introduction Business Intelligence Business Intelligence Process

More information

Principles of Data Management. Lecture #16 (MapReduce & DFS for Big Data)
