Scaling up Data Management: From Data to Big Data
- Solomon Watson
- 6 years ago
1 Scaling up Data Management: From Data to Big Data
2 Data Management: Evolution
60s
  o Access data in files
  o Computerized databases started shared access
  o Network model (CODASYL): Integrated Data Store (IDS)
  o Hierarchical model: IMS (Information Management System)
  o SABRE was created to manage airline reservations
70s
  o Relational model
  o ACM SIGMOD and VLDB started (1975)
  o ER model
  o System R, Ingres
  o SQL
80s
  o Databases for PCs
  o DB2, Oracle, Sybase, Informix
  o SQL standard
  o RDBMS became a success
  o Expert systems, OODBMS, distributed databases
3 Data Management: Evolution
90s
  o Expensive products: databases for the rich
  o Internet database connectors; features for spatial, temporal, multimedia data; active and deductive capabilities
  o Exploit massively parallel processors
2000s
  o Oracle, IBM and Microsoft are the major RDBMS vendors
  o Main-memory databases
2010s
  o Open source databases for all
  o Big Data
  o NoSQL: systems that do not attempt to provide atomicity, consistency, isolation and durability
  o NewSQL: SQL + NoSQL
4 Data Management Software Revenue Global database market reached over $40 billion in 2015 Business analytics software market 2013: $37 billion
5 Big Data Technology A new forecast from International Data Corporation (IDC) sees the big data technology and services market growing at a compound annual growth rate (CAGR) of 23.1% over the forecast period with annual spending reaching $48.6 billion in
6 Big Data
7 4 Vs of Big Data
8 Big Data: New Applications Google: many billions of pages indexed, products, structured data Facebook: 1.5 billion users using the site each month Twitter: 517 million accounts, 320 million monthly active users, 500 million tweets/day
9 Big Data: New Computing Infrastructure Meet the cloud! [Hardware, Infrastructure, Platform] as a service Utility Computing: pay-as-you-go computing o Illusion of infinite resources o No up-front cost o Fine-grained billing (e.g., hourly)
10 Cloud Computing: Why Now? Experience with very large data centers o Unprecedented economies of scale o Transfer of risk Technology factors o Pervasive broadband Internet o Maturity in virtualization technology Business factors o Minimal capital expenditure o Pay-as-you-go billing model Agrawal et al., VLDB 2010 Tutorial
11 Warehouse Scale Computing Google's data center in Oregon 16 Million Nodes per building Agrawal et al., VLDB 2010 Tutorial
12 Economics of Cloud Users Pay by use instead of provisioning for peak [Figure: resources vs. time, comparing capacity and demand for a static data center (with unused resources) and for a data center in the cloud] Agrawal et al., VLDB 2010 Tutorial. Slide Credits: Berkeley RAD Lab
13 Economics of Cloud Users Risk of over-provisioning: underutilization [Figure: resources vs. time for a static data center; capacity stays above demand, leaving unused resources] Agrawal et al., VLDB 2010 Tutorial. Slide Credits: Berkeley RAD Lab
14 Economics of Cloud Users Heavy penalty for under-provisioning [Figure: three panels of resources vs. time (days) comparing capacity and demand, annotated with lost revenue and lost users] Agrawal et al., VLDB 2010 Tutorial. Slide Credits: Berkeley RAD Lab
15 Cloud Computing: Hype or Reality Unlike the earlier attempts: o Distributed Computing o Distributed Databases o Grid Computing Cloud Computing is REAL: o Organic growth: Google, Yahoo, Microsoft, and Amazon o Poised to be an integral aspect of National Infrastructure in US and elsewhere Agrawal et al., VLDB 2010 Tutorial
16 Cloud Computing Modalities
Can we outsource our IT software and hardware infrastructure?
  o Hosted applications and services
  o Pay-as-you-go model
  o Scalability, fault-tolerance, elasticity, and self-manageability
We have terabytes of click-stream data: what can we do with it?
  o Very large data repositories
  o Complex analysis
  o Distributed and parallel data processing
Agrawal et al., VLDB 2010 Tutorial
17 Why Data Analysis?
Businesses have been doing this for a long time!
  o What is the most effective distribution channel?
  o Who are our lowest/highest margin customers?
  o Who are my customers and what products are they buying?
  o What product promotions have the biggest impact on revenue?
  o What impact will new products/services have on revenue and margins?
  o Which customers are most likely to go to the competition?
Agrawal et al., VLDB 2010 Tutorial
18 Decision Support Data analysis in the enterprise context emerged: o As a tool to build decision support systems o Data-centric decision making instead of using intuition o New term: Business Intelligence Used to manage and control business Data is historical or point-in-time Optimized for inquiry rather than update Use of the system is loosely defined and can be ad-hoc Used by managers and end-users to understand the business and make judgments Agrawal et al., VLDB 2010 Tutorial
19 Decision Support Traditional approach: o Decision makers wait for reports from disparate OLTP systems o Put it all together in a spreadsheet o Manual process There are many commercial systems that support analytics and decision support Agrawal et al., VLDB 2010 Tutorial
20 Decision Support Traditional approach: o Decision makers wait for reports from disparate OLTP systems o Put it all together in a spreadsheet o Manual process There are many commercial systems that support analytics and decision support Modified from Agrawal et al., VLDB 2010 Tutorial
21 Analytics in the Big Data Era Lots of open data available on the Web! Data capture at the user interaction level: o In contrast to the client transaction level in the Enterprise context o The amount of data increases significantly o Need to analyze such data to understand user behavior Cannot afford expensive warehouse solutions
22 Why Data Analysis?
Now, many more stakeholders want to do this too!
  o What would be the impact of a fare change?
  o Where are our lowest/highest margin passengers?
  o What is the distribution of trip lengths?
  o What is the quickest route from midtown to downtown at 4pm on Monday?
  o What impact will the introduction of additional medallions have?
  o Where should drivers go to get passengers?
23 Data Analytics in the Cloud
Scalability to large data volumes:
  o Scan 100 TB on 1 node @ 50 MB/sec = 23 days
  o Scan 100 TB on 1000-node cluster = 33 minutes
Divide-And-Conquer (i.e., data partitioning)
Cost-efficiency:
  o Commodity nodes (cheap, but unreliable)
  o Commodity network
  o Automatic fault-tolerance (fewer admins)
  o Easy to use (fewer programmers)
Agrawal et al., VLDB 2010 Tutorial
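The scan times on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes), a sustained 50 MB/sec per node, and perfectly even partitioning with no coordination overhead; the function name `scan_seconds` is only illustrative.

```python
def scan_seconds(data_bytes, nodes, mb_per_sec_per_node):
    """Time to scan data_bytes when the data is split evenly across nodes."""
    bytes_per_sec = nodes * mb_per_sec_per_node * 10**6
    return data_bytes / bytes_per_sec


DATA = 100 * 10**12  # 100 TB in decimal units (assumption)

single_node_days = scan_seconds(DATA, nodes=1, mb_per_sec_per_node=50) / 86_400
cluster_minutes = scan_seconds(DATA, nodes=1000, mb_per_sec_per_node=50) / 60

print(f"1 node:     {single_node_days:.1f} days")
print(f"1000 nodes: {cluster_minutes:.1f} minutes")
```

Under these assumptions the results round to the 23 days and 33 minutes quoted on the slide; real clusters would lose some of the ideal 1000x speedup to skew and communication.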
24 Platforms for Large-scale Data Analysis Parallel DBMS technologies o Proposed in the late eighties o Matured over the last two decades o Multi-billion dollar industry: Proprietary DBMS Engines intended as Data Warehousing solutions for very large enterprises Map Reduce o pioneered by Google o popularized by Yahoo! (open-source Hadoop) Agrawal et al., VLDB 2010 Tutorial
25 Parallel DBMS technologies
Popularly used for more than two decades
  o Research Projects: Gamma, Grace, ...
  o Commercial: Multi-billion dollar industry but access to only a privileged few
Relational Data Model
Indexing
Familiar SQL interface
Advanced query optimization
Well understood and studied
Very reliable!
Agrawal et al., VLDB 2010 Tutorial
26 Parallel Databases
DBMS hides the complexity from the client application
DBA does most of the work: data partitioning, optimization, etc.
27 MapReduce Overview: o Data-parallel programming model o An associated parallel and distributed implementation for commodity clusters Pioneered by Google o Processing 20 PB of data per day (circa 2008) [Dean et al., OSDI 2004, CACM Jan 2008, CACM Jan 2010] Agrawal et al., VLDB 2010 Tutorial
28 Hadoop
Open-source implementation of the MapReduce framework, an Apache project
Used by Yahoo!, Facebook, Amazon, and the list is growing
Key components
  o MapReduce: distributes applications
  o Hadoop Distributed File System (HDFS): distributes data
Hadoop Distributed File System (HDFS)
  o Store big files across machines
  o Store each file as a sequence of blocks
  o Blocks of a file are replicated for fault tolerance
Distribute processing of large data across thousands of commodity machines
You have to program your data processing and analysis
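The idea of storing a file as a sequence of replicated blocks can be illustrated with a toy placement function. This is only a sketch: the round-robin policy, the node names, and the block size and replication factor used in the example are assumptions for illustration, and real HDFS placement is more involved (for example, it takes racks into account).

```python
def place_blocks(file_size, block_size, nodes, replication):
    """Return {block_index: [node, ...]} for a file of file_size bytes.

    Toy policy: replicas of block b go to `replication` consecutive nodes,
    starting at node b (round-robin). Not the real HDFS policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = -(-file_size // block_size)  # ceiling division
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


MB = 10**6
# Hypothetical cluster and file: 300 MB file, 128 MB blocks, 3 replicas.
placement = place_blocks(300 * MB, 128 * MB, ["n1", "n2", "n3", "n4"], 3)
for block, replicas in placement.items():
    print(block, replicas)
```

Because every block lives on three different nodes in this example, losing any single node still leaves two copies of each block available.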
29 Word Count in Python
    def word_count_dict(filename):
        """Returns a word/count dict for this filename."""
        # Utility used by count() and Topcount().
        word_count = {}  # Map each word to its count
        input_file = open(filename, 'r')
        for line in input_file:
            words = line.split()
            for word in words:
                word = word.lower()
                # Special case if we're seeing this word for the first time.
                if word not in word_count:
                    word_count[word] = 1
                else:
                    word_count[word] = word_count[word] + 1
        input_file.close()  # Not strictly required, but good form.
        return word_count
30 MapReduce Programming Model
Borrows primitives from functional programming
Users should implement two primary methods:
  o Map: (key1, val1) → [(key2, val2)]
  o Reduce: (key2, [val, val, val, ...]) → [(key3, val3)]
Kyuseok Shim (VLDB 2012 TUTORIAL)
31 Word Counting with MapReduce
[Figure: map phase. Input documents: Doc1 = Financial, IMF, Economics, Crisis; Doc2 = Financial, IMF, Crisis; Doc3 = Economics, Harry; Doc4 = Financial, Harry, Potter, Film; Doc5 = Crisis, Harry, Potter. Two map tasks (M1, M2) emit one (key, value) pair per word occurrence, e.g., (Financial, 1), (IMF, 1), (Economics, 1), (Crisis, 1).]
Kyuseok Shim (VLDB 2012 TUTORIAL)
32 Word Counting with MapReduce
[Figure: reduce phase. The map output pairs are grouped into (key, value list) pairs, e.g., (Financial, [1, 1, 1]), and the reduce tasks output: Financial 3, IMF 2, Economics 2, Crisis 3, Harry 3, Film 1, Potter 2.]
Before reduce functions are called, for each distinct key, a list of associated values is generated
Kyuseok Shim (VLDB 2012 TUTORIAL)
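The word-counting example on the two slides above can be reproduced on a single machine. The sketch below is not Hadoop code: `run_mapreduce` is a hypothetical in-memory driver that mimics the map, group-by-key, and reduce steps, and the document contents are taken from the slide's five example documents.

```python
from collections import defaultdict

DOCS = {
    "Doc1": "Financial IMF Economics Crisis",
    "Doc2": "Financial IMF Crisis",
    "Doc3": "Economics Harry",
    "Doc4": "Financial Harry Potter Film",
    "Doc5": "Crisis Harry Potter",
}


def map_fn(doc_id, text):
    """Map: (doc_id, text) -> [(word, 1)]"""
    for word in text.split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: (word, [1, 1, ...]) -> [(word, total)]"""
    yield word, sum(counts)


def run_mapreduce(inputs, mapper, reducer):
    """Hypothetical single-process driver: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs.items():
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)  # builds the per-key value list
    results = {}
    for k2, values in groups.items():
        for k3, v3 in reducer(k2, values):
            results[k3] = v3
    return results


word_counts = run_mapreduce(DOCS, map_fn, reduce_fn)
print(word_counts)
```

In a real deployment the grouping step is the distributed shuffle performed by the framework; the user still only writes the two functions.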
33 MapReduce Advantages
Automatic Parallelization:
  o Depending on the size of RAW INPUT DATA → instantiate multiple MAP tasks
  o Similarly, depending upon the number of intermediate <key, value> partitions → instantiate multiple REDUCE tasks
Run-time:
  o Data partitioning
  o Task scheduling
  o Handling machine failures
  o Managing inter-machine communication
Completely transparent to the programmer/analyst/user
Agrawal et al., VLDB 2010 Tutorial
34 MapReduce Experience Runs on large commodity clusters: o 1000s to 10,000s of machines Processes many terabytes of data Easy to use since run-time complexity hidden from the users 1000s of MR jobs/day at Google (circa 2004) 100s of MR programs implemented (circa 2004) Agrawal et al., VLDB 2010 Tutorial
35 The Need
Special-purpose programs to process large amounts of data: crawled documents, Web Query Logs, etc.
At Google and others (Yahoo!, Facebook):
  o Inverted index
  o Graph structure of the WEB documents or social network
  o Summaries of #pages/host, set of frequent queries, etc.
  o Ad Optimization
  o Spam filtering
  o ...
Agrawal et al., VLDB 2010 Tutorial
36 Takeaway
MapReduce's data-parallel programming model hides complexity of distribution and fault tolerance
Principal philosophies:
  o Make it scale, so you can throw hardware at problems
  o Make it cheap, saving hardware, programmer and administration costs (but requiring fault tolerance)
MapReduce is not suitable for all problems, but when it works, it may save you a lot of time
Agrawal et al., VLDB 2010 Tutorial
37 Map Reduce vs Parallel DBMS
Feature | Parallel DBMS | MapReduce
Schema Support | ✓ | Not out of the box
Indexing | ✓ | Not out of the box
Programming Model | Declarative (SQL) | Imperative (C/C++, Java, ...); extensions through Pig and Hive
Optimizations (Compression, Query Optimization) | ✓ | Not out of the box
Flexibility | Not out of the box | ✓
Fault Tolerance | Coarse grained techniques | ✓
[Pavlo et al., SIGMOD 2009, Stonebraker et al., CACM 2010, ...]
Agrawal et al., VLDB 2010 Tutorial
38 MapReduce: A step backwards?
Don't need 1000 nodes to process petabytes:
  o Parallel DBs do it in fewer than 100 nodes
No support for schema:
  o Sharing across multiple MR programs is difficult
No indexing:
  o Wasteful access to unnecessary data
Non-declarative programming model:
  o Requires highly-skilled programmers
No support for JOINs:
  o Requires multiple MR phases for the analysis
We will study this in more detail!
Agrawal et al., VLDB 2010 Tutorial
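To make the JOIN criticism concrete, the sketch below hand-codes an inner join in the MapReduce style, using the common approach of tagging each record with its source table and matching them in the reducer. The tables, field names, and the in-memory driver `reduce_side_join` are hypothetical and chosen only for illustration; in SQL the same result would be a one-line declarative query.

```python
from collections import defaultdict

# Hypothetical input tables: (user_id, name) and (user_id, item).
USERS = [(1, "Ana"), (2, "Bo")]
ORDERS = [(1, "book"), (1, "pen"), (2, "lamp"), (3, "cup")]


def map_users(record):
    user_id, name = record
    yield user_id, ("U", name)  # tag the value with its source table


def map_orders(record):
    user_id, item = record
    yield user_id, ("O", item)


def reduce_join(user_id, tagged_values):
    """Pair every user record with every order record sharing the key."""
    names = [v for tag, v in tagged_values if tag == "U"]
    items = [v for tag, v in tagged_values if tag == "O"]
    for name in names:
        for item in items:
            yield user_id, name, item


def reduce_side_join(users, orders):
    """Single-process stand-in for one MapReduce phase."""
    groups = defaultdict(list)
    for mapper, table in ((map_users, users), (map_orders, orders)):
        for record in table:
            for key, value in mapper(record):
                groups[key].append(value)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_join(key, groups[key]))
    return joined


joined = reduce_side_join(USERS, ORDERS)
print(joined)
```

Any further step on the joined rows, such as counting orders per user name, would be a second map/reduce pass over this output, which is the "multiple MR phases" cost the slide refers to.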
39 MapReduce and Big Data MapReduce programming model Hadoop infrastructure HDFS, NoSQL stores Data management and query processing in Hadoop environments Spark: processing engine compatible with Hadoop data o Supports streaming data, interactive queries, and machine learning o SQL vs. NoSQL: Big Data Hype and Reality [Tutorial by C. Mohan] o Need to look back at the lessons learned in database design
40 Analysis and Mining
41 Data Mining Discovery of patterns and models that are o Valid applicable to new data with some certainty o Useful o Unexpected o Understandable to people Confluence of different areas: databases, machine learning, visualization, statistics We will study aspects from these areas, but focus on: o Scalability o Algorithms and architectures
42 Data Analysis and Mining Many challenges, even when data is not big Data cleaning and curation: Bad data → bad results o Detection and correction of errors in data, e.g., number of passengers = 255, taxis in the river. o Entity resolution and disambiguation, e.g., apple the fruit vs. Apple the company
43 Data Analysis and Mining Many challenges, even when data is not big Data cleaning and curation: Bad data → bad results o Detection and correction of errors in data, e.g., number of passengers = 255, taxis in the river. o Entity resolution and disambiguation, e.g., apple the fruit vs. Apple the company Sometimes it can be hard to distinguish between errors and outliers!
44 Data Analysis and Mining Many challenges, even when data is not big Data cleaning and curation: o Detection and correction of errors in data E.g., number of passengers = 255, taxis in the river. o Entity resolution and disambiguation, e.g., apple the fruit vs. Apple the company Sometimes it can be hard to distinguish between errors and outliers! Visualization: Pictures help us to think o Substitute perception for cognition o External memory: free up limited cognitive/memory resources for higher-level problems Mining: Discovery of useful, possibly unexpected, patterns in data
45 Data Analysis and Mining
In exploratory tasks, change is the norm!
  o Data analysis and mining are iterative processes
  o Many trial-and-error steps
[Figure: exploration loop linking Data, Process, Data Product, Perception & Cognition, Knowledge, Specification, Exploration, Data Manipulation, and User. Figure modified from J. van Wijk, IEEE Vis 2005]
46 Data Analysis and Mining
In exploratory tasks, change is the norm!
  o Data analysis and mining are iterative processes
  o Many trial-and-error steps, easy to get lost
Need to manage the data exploration process:
  o Guide users: support for reflective reasoning
  o Need provenance for reproducibility [Freire et al., CISE 2008]
[Figure: exploration loop linking Data, Process, Data Product, Perception & Cognition, Knowledge, Specification, Exploration, Data Manipulation, and User. Figure modified from J. van Wijk, IEEE Vis 2005]
47 Sharing and Collaboration
Result transparency
  o Show me your work!
  o Allow results to be verified → trust the results
Keep track of what you do and the steps you follow: the provenance of your work
Hard data science problems require people with different expertise to collaborate
  o Need to share work, but this can be challenging
  o E.g., A sends their analysis script to B, but B cannot run it: missing or incorrect versions of libraries; hard-coded file names such as /home/a/myinputfile.txt
Follow best practices for sharing and reproducibility
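One simple practice that addresses the hard-coded file name problem is to take the input path as a command-line argument. The sketch below uses Python's standard `argparse` and `pathlib` modules; the line-counting task and the function names are placeholders, not part of the original slides.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Read the input path from the command line instead of hard-coding it."""
    parser = argparse.ArgumentParser(description="Count lines in an input file.")
    parser.add_argument("input", type=Path, help="path to the input file")
    return parser.parse_args(argv)


def count_lines(path):
    """Placeholder analysis step: count the lines in the file."""
    with path.open("r", encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def main(argv=None):
    args = parse_args(argv)
    print(count_lines(args.input))
```

With `main()` called at the bottom of a script, collaborator B can run it as `python script.py their_copy.txt` without editing the code. Recording library versions (for example in a requirements file) covers the other failure mentioned on the slide.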
48 Analyzing and Mining Big Data: Issues
Scalability for algorithms and computations: need to design/extend algorithms to leverage new computing models
  o We will cover this in the third module of our course
A big data-mining risk is that you will discover patterns that are meaningless: watch out for bogus patterns/events
Bonferroni correction gives a statistically sound way to avoid most of these bogus positive responses
49 Bonferroni's Principle
Calculate the expected number of occurrences of the events you are looking for, assuming that data is random
If this number is significantly larger than the number of real instances you hope to find, then you must expect almost anything you find to be bogus, i.e., a statistical artifact rather than evidence of what you are looking for.
Read textbook!
  o Chapter 1 of Mining of Massive Datasets
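A worked calculation shows how quickly random data can produce "suspicious" events. The sketch below counts the expected number of pairs of people who, purely by chance, stay at the same hotel on two different days. Every number in it (population, days observed, visit probability, number of hotels) is an assumption chosen for illustration, and visits are assumed to be independent and uniformly random.

```python
from math import comb


def expected_random_pairs(people, days, p_visit, hotels):
    """Expected number of (pair of people, pair of days) coincidences.

    Assumes each person visits some hotel on a given day with probability
    p_visit, independently, and picks the hotel uniformly at random.
    """
    # Two given people are both in a hotel, and the same one, on a given day.
    p_same_hotel_one_day = (p_visit * p_visit) / hotels
    # The same coincidence happens on two given days.
    p_two_days = p_same_hotel_one_day ** 2
    return comb(people, 2) * comb(days, 2) * p_two_days


# Hypothetical setting: 10^9 people, 1000 days, 1% chance of a hotel stay
# on any day, 10^5 hotels.
expected = expected_random_pairs(people=10**9, days=1000, p_visit=0.01, hotels=10**5)
print(f"{expected:,.0f} coincidental pairs expected")
```

Under these assumptions roughly a quarter of a million pairs look suspicious by chance alone. If only a handful of real cases are expected, nearly every pair flagged by such a search would be a statistical artifact, which is exactly the situation the principle warns about.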
Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples
Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?
More informationCSE6331: Cloud Computing
CSE6331: Cloud Computing Leonidas Fegaras University of Texas at Arlington c 2019 by Leonidas Fegaras Cloud Computing Fundamentals Based on: J. Freire s class notes on Big Data http://vgc.poly.edu/~juliana/courses/bigdata2016/
More informationEmbedded Technosolutions
Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication
More informationHadoop/MapReduce Computing Paradigm
Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications
More informationWhere We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344
Where We Are Introduction to Data Management CSE 344 Lecture 22: MapReduce We are talking about parallel query processing There exist two main types of engines: Parallel DBMSs (last lecture + quick review)
More informationCloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018
Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning
More informationHadoop vs. Parallel Databases. Juliana Freire!
Hadoop vs. Parallel Databases Juliana Freire! The Debate Starts The Debate Continues A comparison of approaches to large-scale data analysis. Pavlo et al., SIGMOD 2009! o Parallel DBMS beats MapReduce
More information5 Fundamental Strategies for Building a Data-centered Data Center
5 Fundamental Strategies for Building a Data-centered Data Center June 3, 2014 Ken Krupa, Chief Field Architect Gary Vidal, Solutions Specialist Last generation Reference Data Unstructured OLTP Warehouse
More informationA Review Paper on Big data & Hadoop
A Review Paper on Big data & Hadoop Rupali Jagadale MCA Department, Modern College of Engg. Modern College of Engginering Pune,India rupalijagadale02@gmail.com Pratibha Adkar MCA Department, Modern College
More informationSQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism
Big Data and Hadoop with Azure HDInsight Andrew Brust Senior Director, Technical Product Marketing and Evangelism Datameer Level: Intermediate Meet Andrew Senior Director, Technical Product Marketing and
More informationPage 1. Goals for Today" Background of Cloud Computing" Sources Driving Big Data" CS162 Operating Systems and Systems Programming Lecture 24
Goals for Today" CS162 Operating Systems and Systems Programming Lecture 24 Capstone: Cloud Computing" Distributed systems Cloud Computing programming paradigms Cloud Computing OS December 2, 2013 Anthony
More informationThe MapReduce Framework
The MapReduce Framework In Partial fulfilment of the requirements for course CMPT 816 Presented by: Ahmed Abdel Moamen Agents Lab Overview MapReduce was firstly introduced by Google on 2004. MapReduce
More informationChallenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Data Centric Systems and Networking Emergence of Big Data Shift of Communication Paradigm From end-to-end to data
More informationMobile Cloud Computing
MTAT.03.262 -Mobile Application Development Lecture 8 Mobile Cloud Computing Satish Srirama, Huber Flores satish.srirama@ut.ee Outline Cloud Computing Mobile Cloud Access schemes HomeAssignment3 10/20/2014
More informationBig Data landscape Lecture #2
Big Data landscape Lecture #2 Contents 1 1 CORE Technologies 2 3 MapReduce YARN 4 SparK 5 Cassandra Contents 2 16 HBase 72 83 Accumulo memcached 94 Blur 10 5 Sqoop/Flume Contents 3 111 MongoDB 12 2 13
More informationModern Database Concepts
Modern Database Concepts Introduction to the world of Big Data Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz What is Big Data? buzzword? bubble? gold rush? revolution? Big data is like teenage
More informationBig Data and Cloud Computing
Big Data and Cloud Computing Presented at Faculty of Computer Science University of Murcia Presenter: Muhammad Fahim, PhD Department of Computer Eng. Istanbul S. Zaim University, Istanbul, Turkey About
More informationCSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark
CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel
More informationCloud Computing & Visualization
Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International
More informationTutorial Outline. Map/Reduce vs. DBMS. MR vs. DBMS [DeWitt and Stonebraker 2008] Acknowledgements. MR is a step backwards in database access
Map/Reduce vs. DBMS Sharma Chakravarthy Information Technology Laboratory Computer Science and Engineering Department The University of Texas at Arlington, Arlington, TX 76009 Email: sharma@cse.uta.edu
More informationAgenda. AWS Database Services Traditional vs AWS Data services model Amazon RDS Redshift DynamoDB ElastiCache
Databases on AWS 2017 Amazon Web Services, Inc. and its affiliates. All rights served. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon Web Services,
More informationCopyright 2016 Ramez Elmasri and Shamkant B. Navathe
Copyright 2016 Ramez Elmasri and Shamkant B. Navathe CHAPTER 1 Databases and Database Users Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Slide 1-2 OUTLINE Types of Databases and Database Applications
More informationIntroduction to Data Management CSE 344
Introduction to Data Management CSE 344 Lecture 26: Parallel Databases and MapReduce CSE 344 - Winter 2013 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Cluster will run in Amazon s cloud (AWS)
More informationScalable Web Programming. CS193S - Jan Jannink - 2/25/10
Scalable Web Programming CS193S - Jan Jannink - 2/25/10 Weekly Syllabus 1.Scalability: (Jan.) 2.Agile Practices 3.Ecology/Mashups 4.Browser/Client 7.Analytics 8.Cloud/Map-Reduce 9.Published APIs: (Mar.)*
More informationAcknowledgements. Beyond DBMSs. Presentation Outline
Acknowledgements Beyond RDBMSs These slides are put together from a variety of sources (both papers and slides/tutorials available on the web) Sharma Chakravarthy Information Technology Laboratory Computer
More informationMicrosoft Big Data and Hadoop
Microsoft Big Data and Hadoop Lara Rubbelke @sqlgal Cindy Gross @sqlcindy 2 The world of data is changing The 4Vs of Big Data http://nosql.mypopescu.com/post/9621746531/a-definition-of-big-data 3 Common
More informationWebinar Series TMIP VISION
Webinar Series TMIP VISION TMIP provides technical support and promotes knowledge and information exchange in the transportation planning and modeling community. Today s Goals To Consider: Parallel Processing
More informationAdvanced Database Technologies NoSQL: Not only SQL
Advanced Database Technologies NoSQL: Not only SQL Christian Grün Database & Information Systems Group NoSQL Introduction 30, 40 years history of well-established database technology all in vain? Not at
More informationAcquiring Big Data to Realize Business Value
Acquiring Big Data to Realize Business Value Agenda What is Big Data? Common Big Data technologies Use Case Examples Oracle Products in the Big Data space In Summary: Big Data Takeaways
More informationSafe Harbor Statement
Safe Harbor Statement The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment
More informationDATA SCIENCE USING SPARK: AN INTRODUCTION
DATA SCIENCE USING SPARK: AN INTRODUCTION TOPICS COVERED Introduction to Spark Getting Started with Spark Programming in Spark Data Science with Spark What next? 2 DATA SCIENCE PROCESS Exploratory Data
More informationCS 6240: Parallel Data Processing in MapReduce: Module 1. Mirek Riedewald
CS 6240: Parallel Data Processing in MapReduce: Module 1 Mirek Riedewald Why Parallel Processing? Answer 1: Big Data 2 How Much Information? Source: http://www2.sims.berkeley.edu/research/projects/ho w-much-info-2003/execsum.htm
More information745: Advanced Database Systems
745: Advanced Database Systems Yanlei Diao University of Massachusetts Amherst Outline Overview of course topics Course requirements Database Management Systems 1. Online Analytical Processing (OLAP) vs.
More informationDatabases 2 (VU) ( / )
Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:
More informationWhat is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?
Simple to start What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? What is the maximum download speed you get? Simple computation
More informationCONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED DATA PLATFORM
CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED PLATFORM Executive Summary Financial institutions have implemented and continue to implement many disparate applications
More informationBig Data Infrastructure at Spotify
Big Data Infrastructure at Spotify Wouter de Bie Team Lead Data Infrastructure September 26, 2013 2 Who am I? According to ZDNet: "The work they have done to improve the Apache Hive data warehouse system
More informationBig Data with Hadoop Ecosystem
Diógenes Pires Big Data with Hadoop Ecosystem Hands-on (HBase, MySql and Hive + Power BI) Internet Live http://www.internetlivestats.com/ Introduction Business Intelligence Business Intelligence Process
More informationPrinciples of Data Management. Lecture #16 (MapReduce & DFS for Big Data)
Principles of Data Management Lecture #16 (MapReduce & DFS for Big Data) Instructor: Mike Carey mjcarey@ics.uci.edu Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Today s News Bulletin
More informationMobile Cloud Computing
MTAT.03.262 Mobile Application Development Mobile Cloud Computing Satish Srirama, Huber Flores satish.srirama@ut.ee Tartu, Estonia, 2013 Outline Cloud Computing Mobile Cloud Access schemas Research challenges
More informationMapReduce, Hadoop and Spark. Bompotas Agorakis
MapReduce, Hadoop and Spark Bompotas Agorakis Big Data Processing Most of the computations are conceptually straightforward on a single machine but the volume of data is HUGE Need to use many (1.000s)
More information2013 AWS Worldwide Public Sector Summit Washington, D.C.
2013 AWS Worldwide Public Sector Summit Washington, D.C. EMR for Fun and for Profit Ben Butler Sr. Manager, Big Data butlerb@amazon.com @bensbutler Overview 1. What is big data? 2. What is AWS Elastic
More informationHierarchy of knowledge BIG DATA 9/7/2017. Architecture
BIG DATA Architecture Hierarchy of knowledge Data: Element (fact, figure, etc.) which is basic information that can be to be based on decisions, reasoning, research and which is treated by the human or
More informationMeaning & Concepts of Databases
27 th August 2015 Unit 1 Objective Meaning & Concepts of Databases Learning outcome Students will appreciate conceptual development of Databases Section 1: What is a Database & Applications Section 2:
More informationDATABASE DESIGN II - 1DL400
DATABASE DESIGN II - 1DL400 Fall 2016 A second course in database systems http://www.it.uu.se/research/group/udbl/kurser/dbii_ht16 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,
More informationCloud Computing. What is cloud computing. CS 537 Fall 2017
Cloud Computing CS 537 Fall 2017 What is cloud computing Illusion of infinite computing resources available on demand Scale-up for most apps Elimination of up-front commitment Small initial investment,
More informationOracle Database Exadata Cloud Service Exadata Performance, Cloud Simplicity DATABASE CLOUD SERVICE
Oracle Database Exadata Exadata Performance, Cloud Simplicity DATABASE CLOUD SERVICE Oracle Database Exadata combines the best database with the best cloud platform. Exadata is the culmination of more
More informationBIG DATA TESTING: A UNIFIED VIEW