Big Data & Hadoop Environment
|
|
- Candace Hamilton
- 5 years ago
- Views:
Transcription
1 Big Data & Hadoop Environment Bu eğitim sunumları İstanbul Kalkınma Ajansı nın 2016 yılı Yenilikçi ve Yaratıcı İstanbul Mali Destek Programı kapsamında yürütülmekte olan TR10/16/YNY/0036 no lu İstanbul Big Data Eğitim ve Araştırma Merkezi Projesi dahilinde gerçekleştirilmiştir. İçerik ile ilgili tek sorumluluk Bahçeşehir Üniversitesi ne ait olup İSTKA veya Kalkınma Bakanlığı nın görüşlerini yansıtmamaktadır.
2 What is big data? Why do we need big data analytics? How to setup Infrastructure for big data?
3 Processes 20 PB a day (2008) Crawls 20B web pages a day (2012) Search index is 100+ PB (5/2014) Bigtable serves 2+ EB, 600M QPS (5/2014) Hadoop: 365 PB, 330K nodes (6/2014) 400B pages, 10+ PB (2/2014) 300 PB data in Hive TB/day (4/2014) Hadoop: 10K nodes, 150K cores, 150 PB (4/2014) LHC: ~15 PB a year 150 PB on 50k+ servers running 15k apps (6/2011) S3: 2T objects, 1.1M request/second (4/2013) 640K ought to be enough for anybody. SKA: EB per year (~2020) LSST: 6-10 PB a year (~2020) How much data?
4 The percentage of all data in the word that has been generated in last 2 years?
5 What are Key Features of Big Data? Volume Petabyte scale Big Data Velocity Social Media Sensor Throughput Variety Structured Semi-structured Unstructured 4 Vs Veracity Unclean Imprecise Unclear
6 GATGCTTACTATGCGGGCCCC CGGTCTAATGCTTACTATGC? GCTTACTATGCGGGCCCCTT AATGCTTACTATGCGGGCCCCTT TAATGCTTACTATGC AATGCTTAGCTATGCGGGC AATGCTTACTATGCGGGCCCCTT AATGCTTACTATGCGGGCCCCTT CGGTCTAGATGCTTACTATGC AATGCTTACTATGCGGGCCCCTT CGGTCTAATGCTTAGCTATGC ATGCTTACTATGCGGGCCCCTT Reads Subject genome Human genome: 3 gbp A few billion short reads (~100 GB compressed data) Sequencer
7 What to do with more data? Answering factoid questions Pattern matching on the Web Works amazingly well Learning relations Who shot Abraham Lincoln? X shot Abraham Lincoln Start with seed instances Search for patterns on the Web Using patterns to find more instances Wolfgang Amadeus Mozart ( ) Einstein was born in 1879 Birthday-of(Mozart, 1756) Birthday-of(Einstein, 1879) PERSON (DATE PERSON was born in DATE (Brill et al., TREC 2001; Lin, ACM TOIS 2007) (Agichtein and Gravano, DL 2000; Ravichandran and Hovy, ACL 2002; )
8
9 No data like more data! s/knowledge/data/g; (Banko and Brill, ACL 2001) (Brants et al., EMNLP 2007)
10 Database Workloads OLTP (online transaction processing) Typical applications: e-commerce, banking, airline reservations User facing: real-time, low latency, highly-concurrent Tasks: relatively small set of standard transactional queries Data access pattern: random reads, updates, writes (involving relatively small amounts of data) OLAP (online analytical processing) Typical applications: business intelligence, data mining Back-end processing: batch workloads, less concurrency Tasks: complex analytical queries, often ad hoc Data access pattern: table scans, large amounts of data per query
11 One Database or Two? Downsides of co-existing OLTP and OLAP workloads Poor memory management Conflicting data access patterns Variable latency Solution: separate databases User-facing OLTP database for high-volume transactions Data warehouse for OLAP workloads How do we connect the two?
12 OLTP/OLAP Architecture OLTP ETL (Extract, Transform, and Load) OLAP
13 Structure of Data Warehouses SELECT P.Brand, S.Country, SUM(F.Units_Sold) FROM Fact_Sales F INNER JOIN Dim_Date D ON F.Date_Id = D.Id INNER JOIN Dim_Store S ON F.Store_Id = S.Id INNER JOIN Dim_Product P ON F.Product_Id = P.Id WHERE D.YEAR = 1997 AND P.Product_Category = 'tv' GROUP BY P.Brand, S.Country; Source: Wikipedia (Star Schema)
14 product OLAP Cubes Common operations slice and dice roll up/drill down pivot store
15 Fast forward
16 ETL Bottleneck ETL is typically a nightly task: What happens if processing 24 hours of data takes longer than 24 hours? Hadoop is perfect: Ingest is limited by speed of HDFS Scales out with more nodes Massively parallel Ability to use any processing tool Cheaper than parallel databases ETL is a batch process anyway!
17 What s changed? Dropping cost of disks Cheaper to store everything than to figure out what to throw away Types of data collected From data that s obviously valuable to data whose value is less apparent Rise of social media and user-generated content Large increase in data volume Growing maturity of data mining techniques Demonstrates value of data analytics
18 Virtuous Product Cycle a useful service $ (hopefully) transform insights into action data products analyze user behavior to extract insights data science
19 OLTP/OLAP/Hadoop Architecture OLTP ETL (Extract, Transform, and Load) Hadoop OLAP
20 ETL: Redux Often, with noisy datasets, ETL is the analysis! Note that ETL necessarily involves brute force data scans L, then E and T?
21 Big Data Ecosystem Source : datafloq.com
22 Cloud Computing Before clouds Grids Connection machine Vector supercomputers Cloud computing means many different things: Big data Rebranding of web 2.0 Utility computing Everything as a service
23 Utility Computing What? Computing resources as a metered service ( pay as you go ) Ability to dynamically provision virtual machines Why? Cost: capital vs. operating expenses Scalability: infinite capacity Elasticity: scale up or down on demand Does it make sense? Benefits to cloud users Business case for cloud providers I think there is a world market for about five computers.
24 Enabling Technology: Virtualization App App App App App App OS OS OS Operating System Hypervisor Hardware Hardware Traditional Stack Virtualized Stack
25 Everything as a Service Utility computing = Infrastructure as a Service (IaaS) Why buy machines when you can rent cycles? Examples: Amazon s EC2, Rackspace Platform as a Service (PaaS) Give me nice API and take care of the maintenance, upgrades, Example: Google App Engine Software as a Service (SaaS) Just run it for me! Example: Gmail, Salesforce
26 Who cares? A source of problems Cloud-based services generate big data Clouds make it easier to start companies that generate big data As well as a solution Ability to provision analytics clusters on-demand in the cloud Commoditization and democratization of big data capabilities
27 Parallelization Challenges How do we assign work units to workers? What if we have more work units than workers? What if workers need to share partial results? How do we aggregate partial results? How do we know all the workers have finished? What if workers die? What s the common theme of all of these problems?
28 Where the rubber meets the road Concurrency is difficult to reason about Concurrency is even more difficult to reason about At the scale of datacenters and across datacenters In the presence of failures In terms of multiple interacting services Not to mention debugging The reality: Lots of one-off solutions, custom code Write you own dedicated library, then program with it Burden on the programmer to explicitly manage everything
29 The datacenter is the computer! Source: Barroso and Urs Hölzle (2009)
30 The datacenter is the computer It s all about the right level of abstraction Moving beyond the von Neumann architecture What s the instruction set of the datacenter computer? Hide system-level details from the developers No more race conditions, lock contention, etc. No need to explicitly worry about reliability, fault tolerance, etc. Separating the what from the how Developer specifies the computation that needs to be performed Execution framework ( runtime ) handles actual execution
31 Scaling up vs. out No single machine is large enough Smaller cluster of large SMP machines vs. larger cluster of commodity machines (e.g., core machines vs core machines) Nodes need to talk to each other! Intra-node latencies: ~100 ns Inter-node latencies: ~100 s Move processing to the data Cluster have limited bandwidth Process data sequentially, avoid random access Seeks are expensive, disk throughput is reasonable Seamless scalability Source: analysis on this an subsequent slides from Barroso and Urs Hölzle (2009)
32 Common Big Data Analytics Use Cases Batch Use case 1 : ETL / Batch query (Single Silo) Use case 2 : distributed log aggregation Batch + real time Use case 3 : real time data store Use case 4 : real time data store + batch analytics Real time / Streaming Use case 5 : Streaming
33 USE CASE 1 : ETL AND BATCH ANALYTICS at SCALE Data collected in various databases Data is scattered across multiple silos! Need a single silo to bring all data together and analyze
34 USE CASE 1 Using core Hadoop components No vendor lock in (works on all Hadoop distributions) Use HDFS (Hadoop File System) for storage Data Ingest with Sqoop Processing done by Map Reduce & Cousins Results are exported back to DB
35 USE CASE 2 : DATA COMING FROM MULTIPLE SOURCES Data coming in from multiple sources. Data is streaming in Capture data in Hadoop Do batch analytics
36 USE CASE 2 Flume To bring in logs from multiple sources Distributed, reliable way to collect and move data If uplinks are dis connected, flume agents will store and forward data HDFS Flume can directly write data to HDFS Files are segmented or rolled by size / time e.g. Data _ log Data _ log Data _ log
37 USE CASE 2 Analytics stack : Pig / Hive / Oozie / Spark (Same as in Use Case 1) Oozie Work flow manager run this work flow every 1 hour run this work flow when data shows up in input directory Can manage complex work flows Send alerts when processes fail..etc
38 USE CASE 3 : REAL TIME DATA STORE Events are coming in Need to store the events Can be billions of events And query them in real time e.g. last 10 events by user
39 USE CASE 3 HDFS is not ideal for updating data in real time and accessing data in random A scalable real time store is needed e.g. Hbase, Cassandra to support real time updates Data comes trickling in (as stream) Saved data becomes queryable immediately Use HBase APIs (Java / REST) to build dashboards
40 USE CASE 4 : REAL TIME + BATCH Building on use case 3 Do extensive analysis on data on HBase E.g. : scoring user models flagging credit card transactions
41 USE CASE 4 HBase is the real time store Analytics is done via Map Reduce stack (Pig / Hive) Can we do them in a single stack? May not be a good idea Don t mix real time and batch analytics Batch Analytics will impede real time performance
42 USE CASE 4 How to replicate data? 1 : periodic synchronization of data between clusters 2 : data goes to both clusters at the same time
43 USE CASE 5 : Streaming Decision times : batch ( hours / days) Use cases: Modeling ETL Reporting
44 MOVING TOWARDS FAST DATA Decision time : (near) real time seconds (or milli seconds) Use Cases Alerts (medical / security) Fraud detection Streaming is becoming more prevalent Connected Devices Internet of Things Beyond Batch We need faster processing / analytics
45 data bucket STREAMING ARCHITECTURE DATA BUCKET Captures incoming data Acts as a buffer smoothes out bursts So even if our processing offline, we won t loose data Data bucket choices * Kafka MQ (RabittMQ..etc) Amazon Kinesis
46 KAFKA ARCHITECTURE Producers write data to brokers Consumers read data from brokers All of this is distributed / parallel Failure tolerant Data is stored as topics sensor_data alerts s
47 STREAMING ARCHITECTURE PROCESSING ENGINE Need to process events with low latency So many to choose from! Choices Storm Spark NiFi Flink
48 STREAMING ARCHITECTURE DATA STORE Where processed data ends up Need to absorb data in real time Usually a NoSQL storage HBase Cassandra Lots of NoSQL stores
49 LAMBDA ARCHITECTURE Each component is scalable Each component is fault tolerant Incorporates best practices All open source!
50 LAMBDA ARCHITECTURE 1. All new data is sent to both batch layer and speed layer 2. Batch layer Holds master data set (immutable, append-only) Answers batch queries 3. Serving layer updates batch views so they can be queried adhoc 4. Speed Layer Handles new data Facilitates fast / real-time queries 5. Query layer Answers queries using batch & real-time views
MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS
MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS SUJEE MANIYAM FOUNDER / PRINCIPAL @ ELEPHANT SCALE www.elephantscale.com sujee@elephantscale.com HI, I M SUJEE MANIYAM Founder / Principal @ ElephantScale
More informationHadoop Distributed File System(HDFS)
Hadoop Distributed File System(HDFS) Bu eğitim sunumları İstanbul Kalkınma Ajansı nın 2016 yılı Yenilikçi ve Yaratıcı İstanbul Mali Destek Programı kapsamında yürütülmekte olan TR10/16/YNY/0036 no lu İstanbul
More informationMapReduce Algorithm Design
MapReduce Algorithm Design Bu eğitim sunumları İstanbul Kalkınma Ajansı nın 2016 yılı Yenilikçi ve Yaratıcı İstanbul Mali Destek Programı kapsamında yürütülmekte olan TR10/16/YNY/0036 no lu İstanbul Big
More informationData-Intensive Distributed Computing
Data-Intensive Distributed Computing CS 451/651 431/631 (Winter 2018) Part 5: Analyzing Relational Data (1/3) February 8, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo
More informationIntroduction to Apache Spark
Introduction to Apache Spark Bu eğitim sunumları İstanbul Kalkınma Ajansı nın 2016 yılı Yenilikçi ve Yaratıcı İstanbul Mali Destek Programı kapsamında yürütülmekte olan TR10/16/YNY/0036 no lu İstanbul
More informationWhat is this course about?
Data-Intensive Information Processing Applications! Session #1 Introduction to MapReduce Jordan Boyd-Graber University of Maryland Thursday, February 3, 2011 This work is licensed under a Creative Commons
More informationBig Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara
Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case
More informationMicrosoft Big Data and Hadoop
Microsoft Big Data and Hadoop Lara Rubbelke @sqlgal Cindy Gross @sqlcindy 2 The world of data is changing The 4Vs of Big Data http://nosql.mypopescu.com/post/9621746531/a-definition-of-big-data 3 Common
More informationThe Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou
The Hadoop Ecosystem EECS 4415 Big Data Systems Tilemachos Pechlivanoglou tipech@eecs.yorku.ca A lot of tools designed to work with Hadoop 2 HDFS, MapReduce Hadoop Distributed File System Core Hadoop component
More informationCloud Computing. Big Data. Dell Zhang Birkbeck, University of London 2017/18
Cloud Computing Big Data Dell Zhang Birkbeck, University of London 2017/18 The Challenges and Opportunities of Big Data Why Big Data? 640K ought to be enough for anybody. Big Data Everywhere! Lots of data
More informationCSE6331: Cloud Computing
CSE6331: Cloud Computing Leonidas Fegaras University of Texas at Arlington c 2019 by Leonidas Fegaras Cloud Computing Fundamentals Based on: J. Freire s class notes on Big Data http://vgc.poly.edu/~juliana/courses/bigdata2016/
More informationWe are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info
We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : PH NO: 9963799240, 040-40025423
More informationBig Data. Big Data Analyst. Big Data Engineer. Big Data Architect
Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION
More informationVOLTDB + HP VERTICA. page
VOLTDB + HP VERTICA ARCHITECTURE FOR FAST AND BIG DATA ARCHITECTURE FOR FAST + BIG DATA FAST DATA Fast Serve Analytics BIG DATA BI Reporting Fast Operational Database Streaming Analytics Columnar Analytics
More informationBig Data Hadoop Stack
Big Data Hadoop Stack Lecture #1 Hadoop Beginnings What is Hadoop? Apache Hadoop is an open source software framework for storage and large scale processing of data-sets on clusters of commodity hardware
More information2013 AWS Worldwide Public Sector Summit Washington, D.C.
2013 AWS Worldwide Public Sector Summit Washington, D.C. EMR for Fun and for Profit Ben Butler Sr. Manager, Big Data butlerb@amazon.com @bensbutler Overview 1. What is big data? 2. What is AWS Elastic
More informationBig Data Architect.
Big Data Architect www.austech.edu.au WHAT IS BIG DATA ARCHITECT? A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional
More informationIntroduction to MapReduce
Introduction to MapReduce Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec Single-node architecture CPU Memory Data Analysis, Machine Learning, Statistics Disk How much
More informationBig Data on AWS. Big Data Agility and Performance Delivered in the Cloud. 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Big Data on AWS Big Data Agility and Performance Delivered in the Cloud 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Big Data Technologies and techniques for working productively
More informationLambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015
Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL May 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document
More informationBig Data Hadoop Course Content
Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux
More informationIntroduction to MapReduce
Introduction to MapReduce Based on slides by Juliana Freire Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec Required Reading! Data-Intensive Text Processing with MapReduce,
More informationHadoop An Overview. - Socrates CCDH
Hadoop An Overview - Socrates CCDH What is Big Data? Volume Not Gigabyte. Terabyte, Petabyte, Exabyte, Zettabyte - Due to handheld gadgets,and HD format images and videos - In total data, 90% of them collected
More informationCloud Computing & Visualization
Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International
More informationDelving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture
Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture Hadoop 1.0 Architecture Introduction to Hadoop & Big Data Hadoop Evolution Hadoop Architecture Networking Concepts Use cases
More informationBIG DATA COURSE CONTENT
BIG DATA COURSE CONTENT [I] Get Started with Big Data Microsoft Professional Orientation: Big Data Duration: 12 hrs Course Content: Introduction Course Introduction Data Fundamentals Introduction to Data
More informationTopics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples
Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?
More informationBig Data on AWS. Peter-Mark Verwoerd Solutions Architect
Big Data on AWS Peter-Mark Verwoerd Solutions Architect What to get out of this talk Non-technical: Big Data processing stages: ingest, store, process, visualize Hot vs. Cold data Low latency processing
More informationModule Day Topic. 1 Definition of Cloud Computing and its Basics
Module Day Topic 1 Definition of Cloud Computing and its Basics 1 2 3 1. How does cloud computing provides on-demand functionality? 2. What is the difference between scalability and elasticity? 3. What
More informationEmbedded Technosolutions
Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication
More informationCloud Analytics and Business Intelligence on AWS
Cloud Analytics and Business Intelligence on AWS Enterprise Applications Virtual Desktops Sharing & Collaboration Platform Services Analytics Hadoop Real-time Streaming Data Machine Learning Data Warehouse
More informationChallenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Data Centric Systems and Networking Emergence of Big Data Shift of Communication Paradigm From end-to-end to data
More informationCISC 7610 Lecture 2b The beginnings of NoSQL
CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone
More informationStages of Data Processing
Data processing can be understood as the conversion of raw data into a meaningful and desired form. Basically, producing information that can be understood by the end user. So then, the question arises,
More informationDATA SCIENCE USING SPARK: AN INTRODUCTION
DATA SCIENCE USING SPARK: AN INTRODUCTION TOPICS COVERED Introduction to Spark Getting Started with Spark Programming in Spark Data Science with Spark What next? 2 DATA SCIENCE PROCESS Exploratory Data
More informationCloud Computing Technologies and Types
Cloud Computing Technologies and Types Jo, Heeseung From Dell Zhang's, Birkbeck, University of London The Technological Underpinnings of Cloud Computing Data centers Virtualization RESTful APIs Cloud storage
More informationCPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University
CPSC 426/526 Cloud Computing Ennan Zhai Computer Science Department Yale University Recall: Lec-7 In the lec-7, I talked about: - P2P vs Enterprise control - Firewall - NATs - Software defined network
More informationHow Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera,
How Apache Hadoop Complements Existing BI Systems Dr. Amr Awadallah Founder, CTO Cloudera, Inc. Twitter: @awadallah, @cloudera 2 The Problems with Current Data Systems BI Reports + Interactive Apps RDBMS
More informationFlash Storage Complementing a Data Lake for Real-Time Insight
Flash Storage Complementing a Data Lake for Real-Time Insight Dr. Sanhita Sarkar Global Director, Analytics Software Development August 7, 2018 Agenda 1 2 3 4 5 Delivering insight along the entire spectrum
More informationThings Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem Zohar Elkayam www.realdbamagic.com Twitter: @realmgic Who am I? Zohar Elkayam, CTO at Brillix Programmer, DBA, team leader, database trainer,
More information5 Fundamental Strategies for Building a Data-centered Data Center
5 Fundamental Strategies for Building a Data-centered Data Center June 3, 2014 Ken Krupa, Chief Field Architect Gary Vidal, Solutions Specialist Last generation Reference Data Unstructured OLTP Warehouse
More informationMicrosoft Exam
Volume: 42 Questions Case Study: 1 Relecloud General Overview Relecloud is a social media company that processes hundreds of millions of social media posts per day and sells advertisements to several hundred
More informationCloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018
Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning
More informationContainer 2.0. Container: check! But what about persistent data, big data or fast data?!
@unterstein @joerg_schad @dcos @jaxdevops Container 2.0 Container: check! But what about persistent data, big data or fast data?! 1 Jörg Schad Distributed Systems Engineer @joerg_schad Johannes Unterstein
More informationIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large
More informationCloud Computing. Technologies and Types
Cloud Computing Cloud Computing Technologies and Types Dell Zhang Birkbeck, University of London 2017/18 The Technological Underpinnings of Cloud Computing Data centres Virtualisation RESTful APIs Cloud
More information@unterstein #bedcon. Operating microservices with Apache Mesos and DC/OS
@unterstein @dcos @bedcon #bedcon Operating microservices with Apache Mesos and DC/OS 1 Johannes Unterstein Software Engineer @Mesosphere @unterstein @unterstein.mesosphere 2017 Mesosphere, Inc. All Rights
More informationIntro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect
Intro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect Igor Roiter Big Data Cloud Solution Architect Working as a Data Specialist for the last 11 years 9 of them as a Consultant specializing
More informationLambda Architecture for Batch and Stream Processing. October 2018
Lambda Architecture for Batch and Stream Processing October 2018 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document is provided for informational purposes only.
More informationDatabases 2 (VU) ( / )
Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:
More informationIntroduction to Big-Data
Introduction to Big-Data Ms.N.D.Sonwane 1, Mr.S.P.Taley 2 1 Assistant Professor, Computer Science & Engineering, DBACER, Maharashtra, India 2 Assistant Professor, Information Technology, DBACER, Maharashtra,
More informationIntroduction to MapReduce
Introduction to MapReduce Based on slides by Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec How much data? Google processes 20 PB a day (2008) Internet Archive Wayback Machine has 3 PB + 100
More informationIncrease Value from Big Data with Real-Time Data Integration and Streaming Analytics
Increase Value from Big Data with Real-Time Data Integration and Streaming Analytics Cy Erbay Senior Director Striim Executive Summary Striim is Uniquely Qualified to Solve the Challenges of Real-Time
More informationWhat is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?
Simple to start What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? What is the maximum download speed you get? Simple computation
More informationThe Technology of the Business Data Lake. Appendix
The Technology of the Business Data Lake Appendix Pivotal data products Term Greenplum Database GemFire Pivotal HD Spring XD Pivotal Data Dispatch Pivotal Analytics Description A massively parallel platform
More informationHadoop Development Introduction
Hadoop Development Introduction What is Bigdata? Evolution of Bigdata Types of Data and their Significance Need for Bigdata Analytics Why Bigdata with Hadoop? History of Hadoop Why Hadoop is in demand
More informationApache Hadoop Goes Realtime at Facebook. Himanshu Sharma
Apache Hadoop Goes Realtime at Facebook Guide - Dr. Sunny S. Chung Presented By- Anand K Singh Himanshu Sharma Index Problem with Current Stack Apache Hadoop and Hbase Zookeeper Applications of HBase at
More informationMap-Reduce. Marco Mura 2010 March, 31th
Map-Reduce Marco Mura (mura@di.unipi.it) 2010 March, 31th This paper is a note from the 2009-2010 course Strumenti di programmazione per sistemi paralleli e distribuiti and it s based by the lessons of
More informationHortonworks and The Internet of Things
Hortonworks and The Internet of Things Dr. Bernhard Walter Solutions Engineer About Hortonworks Customer Momentum ~700 customers (as of November 4, 2015) 152 customers added in Q3 2015 Publicly traded
More informationLecture 11 Hadoop & Spark
Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem
More informationArchitectural challenges for building a low latency, scalable multi-tenant data warehouse
Architectural challenges for building a low latency, scalable multi-tenant data warehouse Mataprasad Agrawal Solutions Architect, Services CTO 2017 Persistent Systems Ltd. All rights reserved. Our analytics
More informationData Architectures in Azure for Analytics & Big Data
Data Architectures in for Analytics & Big Data October 20, 2018 Melissa Coates Solution Architect, BlueGranite Microsoft Data Platform MVP Blog: www.sqlchick.com Twitter: @sqlchick Data Architecture A
More informationMapR Enterprise Hadoop
2014 MapR Technologies 2014 MapR Technologies 1 MapR Enterprise Hadoop Top Ranked Cloud Leaders 500+ Customers 2014 MapR Technologies 2 Key MapR Advantage Partners Business Services APPLICATIONS & OS ANALYTICS
More informationReport on The Infrastructure for Implementing the Mobile Technologies for Data Collection in Egypt
Report on The Infrastructure for Implementing the Mobile Technologies for Data Collection in Egypt Date: 10 Sep, 2017 Draft v 4.0 Table of Contents 1. Introduction... 3 2. Infrastructure Reference Architecture...
More informationSpark Programming Essentials
Spark Programming Essentials Bu eğitim sunumları İstanbul Kalkınma Ajansı nın 2016 yılı Yenilikçi ve Yaratıcı İstanbul Mali Destek Programı kapsamında yürütülmekte olan TR10/16/YNY/0036 no lu İstanbul
More informationAgenda. AWS Database Services Traditional vs AWS Data services model Amazon RDS Redshift DynamoDB ElastiCache
Databases on AWS 2017 Amazon Web Services, Inc. and its affiliates. All rights served. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon Web Services,
More informationChase Wu New Jersey Institute of Technology
CS 644: Introduction to Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Institute of Technology Some of the slides were provided through the courtesy of Dr. Ching-Yung Lin at Columbia
More informationSQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism
Big Data and Hadoop with Azure HDInsight Andrew Brust Senior Director, Technical Product Marketing and Evangelism Datameer Level: Intermediate Meet Andrew Senior Director, Technical Product Marketing and
More informationData contains value and knowledge
Data contains value and knowledge What is the purpose of big data systems? To support analysis and knowledge discovery from very large amounts of data But to extract the knowledge data needs to be Stored
More informationAWS Serverless Architecture Think Big
MAKING BIG DATA COME ALIVE AWS Serverless Architecture Think Big Garrett Holbrook, Data Engineer Feb 1 st, 2017 Agenda What is Think Big? Example Project Walkthrough AWS Serverless 2 Think Big, a Teradata
More informationParallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce
Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The
More informationData Analytics with HPC. Data Streaming
Data Analytics with HPC Data Streaming Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationBig Data with Hadoop Ecosystem
Diógenes Pires Big Data with Hadoop Ecosystem Hands-on (HBase, MySql and Hive + Power BI) Internet Live http://www.internetlivestats.com/ Introduction Business Intelligence Business Intelligence Process
More informationIndex. Raul Estrada and Isaac Ruiz 2016 R. Estrada and I. Ruiz, Big Data SMACK, DOI /
Index A ACID, 251 Actor model Akka installation, 44 Akka logos, 41 OOP vs. actors, 42 43 thread-based concurrency, 42 Agents server, 140, 251 Aggregation techniques materialized views, 216 probabilistic
More informationData Acquisition. The reference Big Data stack
Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica Data Acquisition Corso di Sistemi e Architetture per Big Data A.A. 2016/17 Valeria Cardellini The reference
More informationhttps://www.youtube.com/watch?v=-gj93l2qa6c Topics: Foundation of Data Analytics and Data Mining Data Volume, Velocity, & Variety Harnessing Big Data Enabling technologies: Cloud Computing 2 No single
More informationHDInsight > Hadoop. October 12, 2017
HDInsight > Hadoop October 12, 2017 2 Introduction Mark Hudson >20 years mixing technology with data >10 years with CapTech Microsoft Certified IT Professional Business Intelligence Member of the Richmond
More informationThe SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Dublin Apache Kafka Meetup, 30 August 2017.
Dublin Apache Kafka Meetup, 30 August 2017 The SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Joseph @pleia2 * ASF projects 1 Elizabeth K. Joseph, Developer Advocate Developer Advocate
More informationFLORIDA DEPARTMENT OF TRANSPORTATION PRODUCTION BIG DATA PLATFORM
FLORIDA DEPARTMENT OF TRANSPORTATION PRODUCTION BIG DATA PLATFORM RECOMMENDATION AND JUSTIFACTION Executive Summary: VHB has been tasked by the Florida Department of Transportation District Five to design
More informationInnovatus Technologies
HADOOP 2.X BIGDATA ANALYTICS 1. Java Overview of Java Classes and Objects Garbage Collection and Modifiers Inheritance, Aggregation, Polymorphism Command line argument Abstract class and Interfaces String
More informationBig Data Infrastructure at Spotify
Big Data Infrastructure at Spotify Wouter de Bie Team Lead Data Infrastructure September 26, 2013 2 Who am I? According to ZDNet: "The work they have done to improve the Apache Hive data warehouse system
More informationA Survey on Big Data
A Survey on Big Data D.Prudhvi 1, D.Jaswitha 2, B. Mounika 3, Monika Bagal 4 1 2 3 4 B.Tech Final Year, CSE, Dadi Institute of Engineering & Technology,Andhra Pradesh,INDIA ---------------------------------------------------------------------***---------------------------------------------------------------------
More informationOverview of Data Services and Streaming Data Solution with Azure
Overview of Data Services and Streaming Data Solution with Azure Tara Mason Senior Consultant tmason@impactmakers.com Platform as a Service Offerings SQL Server On Premises vs. Azure SQL Server SQL Server
More informationModern Data Warehouse The New Approach to Azure BI
Modern Data Warehouse The New Approach to Azure BI History On-Premise SQL Server Big Data Solutions Technical Barriers Modern Analytics Platform On-Premise SQL Server Big Data Solutions Modern Analytics
More informationLeverage the Oracle Data Integration Platform Inside Azure and Amazon Cloud
Leverage the Oracle Data Integration Platform Inside Azure and Amazon Cloud WHITE PAPER / AUGUST 8, 2018 DISCLAIMER The following is intended to outline our general product direction. It is intended for
More informationBring Context To Your Machine Data With Hadoop, RDBMS & Splunk
Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk Raanan Dagan and Rohit Pujari September 25, 2017 Washington, DC Forward-Looking Statements During the course of this presentation, we may
More informationBig Data and Cloud Computing
Big Data and Cloud Computing Presented at Faculty of Computer Science University of Murcia Presenter: Muhammad Fahim, PhD Department of Computer Eng. Istanbul S. Zaim University, Istanbul, Turkey About
More informationHadoop. Introduction / Overview
Hadoop Introduction / Overview Preface We will use these PowerPoint slides to guide us through our topic. Expect 15 minute segments of lecture Expect 1-4 hour lab segments Expect minimal pretty pictures
More informationCloud Computing. DB Special Topics Lecture (10/5/2012) Kyle Hale Maciej Swiech
Cloud Computing DB Special Topics Lecture (10/5/2012) Kyle Hale Maciej Swiech Managing servers isn t for everyone What are some prohibitive issues? (we touched on these last time) Cost (initial/operational)
More informationConfiguring and Deploying Hadoop Cluster Deployment Templates
Configuring and Deploying Hadoop Cluster Deployment Templates This chapter contains the following sections: Hadoop Cluster Profile Templates, on page 1 Creating a Hadoop Cluster Profile Template, on page
More informationBig Data Analytics. Description:
Big Data Analytics Description: With the advance of IT storage, pcoressing, computation, and sensing technologies, Big Data has become a novel norm of life. Only until recently, computers are able to capture
More information8/24/2017 Week 1-B Instructor: Sangmi Lee Pallickara
Week 1-B-0 Week 1-B-1 CS535 BIG DATA FAQs Slides are available on the course web Wait list Term project topics PART 0. INTRODUCTION 2. DATA PROCESSING PARADIGMS FOR BIG DATA Sangmi Lee Pallickara Computer
More informationBig Data Analytics. Rasoul Karimi
Big Data Analytics Rasoul Karimi Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Big Data Analytics 1 / 1 Outline
More informationIntroduction to Hadoop. Owen O Malley Yahoo!, Grid Team
Introduction to Hadoop Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since
More informationHadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved
Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop
More informationHadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours)
Bigdata Fundamentals Day1: (2hours) 1. Understanding BigData. a. What is Big Data? b. Big-Data characteristics. c. Challenges with the traditional Data Base Systems and Distributed Systems. 2. Distributions:
More informationCertified Big Data and Hadoop Course Curriculum
Certified Big Data and Hadoop Course Curriculum The Certified Big Data and Hadoop course by DataFlair is a perfect blend of in-depth theoretical knowledge and strong practical skills via implementation
More informationEXTRACT DATA IN LARGE DATABASE WITH HADOOP
International Journal of Advances in Engineering & Scientific Research (IJAESR) ISSN: 2349 3607 (Online), ISSN: 2349 4824 (Print) Download Full paper from : http://www.arseam.com/content/volume-1-issue-7-nov-2014-0
More informationFluentd + MongoDB + Spark = Awesome Sauce
Fluentd + MongoDB + Spark = Awesome Sauce Nishant Sahay, Sr. Architect, Wipro Limited Bhavani Ananth, Tech Manager, Wipro Limited Your company logo here Wipro Open Source Practice: Vision & Mission Vision
More informationIan Choy. Technology Solutions Professional
Ian Choy Technology Solutions Professional XML KPIs SQL Server 2000 Management Studio Mirroring SQL Server 2005 Compression Policy-Based Mgmt Programmability SQL Server 2008 PowerPivot SharePoint Integration
More information