Big Data & Hadoop Environment

1 Big Data & Hadoop Environment These training presentations were produced as part of the Istanbul Big Data Education and Research Center Project (no. TR10/16/YNY/0036), carried out under the Istanbul Development Agency's 2016 Innovative and Creative Istanbul Financial Support Program. Bahçeşehir University bears sole responsibility for the content, which does not reflect the views of İSTKA or the Ministry of Development.

2 What is big data? Why do we need big data analytics? How to setup Infrastructure for big data?

3 How much data?
Processes 20 PB a day (2008)
Crawls 20B web pages a day (2012)
Search index is 100+ PB (5/2014)
Bigtable serves 2+ EB, 600M QPS (5/2014)
Hadoop: 365 PB, 330K nodes (6/2014)
400B pages, 10+ PB (2/2014)
300 PB data in Hive TB/day (4/2014)
Hadoop: 10K nodes, 150K cores, 150 PB (4/2014)
LHC: ~15 PB a year
150 PB on 50k+ servers running 15k apps (6/2011)
S3: 2T objects, 1.1M request/second (4/2013)
SKA: EB per year (~2020)
LSST: 6-10 PB a year (~2020)
"640K ought to be enough for anybody."

4 What percentage of all data in the world has been generated in the last 2 years?

5 What are Key Features of Big Data? The 4 Vs: Volume (petabyte scale); Velocity (social media, sensors, throughput); Variety (structured, semi-structured, unstructured); Veracity (unclean, imprecise, unclear)

6 Sequencer: short reads aligned against a subject genome (figure: overlapping reads such as AATGCTTACTATGCGGGCCCCTT, CGGTCTAATGCTTACTATGC, GCTTACTATGCGGGCCCCTT) Human genome: 3 Gbp A few billion short reads (~100 GB compressed data)

7 What to do with more data? Answering factoid questions: pattern matching on the Web works amazingly well. Who shot Abraham Lincoln? -> X shot Abraham Lincoln. Learning relations: start with seed instances, search for patterns on the Web, use patterns to find more instances. Seed instances: Birthday-of(Mozart, 1756), Birthday-of(Einstein, 1879). Web text: Wolfgang Amadeus Mozart ( ); Einstein was born in 1879. Patterns: PERSON (DATE; PERSON was born in DATE. (Brill et al., TREC 2001; Lin, ACM TOIS 2007) (Agichtein and Gravano, DL 2000; Ravichandran and Hovy, ACL 2002; )
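A minimal sketch of the "PERSON was born in DATE" surface pattern, applied to a few text snippets. The regular expression, the function name, and the sample snippets are illustrative assumptions; the cited systems learn many such patterns from seed instances rather than hand-coding one.

```python
import re

# Hypothetical hand-written version of the pattern "PERSON was born in DATE".
# A capitalized name (one or more words) followed by a four-digit year.
BORN_IN = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*) was born in (?P<year>\d{4})"
)

def extract_birth_years(snippets):
    """Return {person: year} for every match of the pattern in the snippets."""
    facts = {}
    for text in snippets:
        for match in BORN_IN.finditer(text):
            facts[match.group("person")] = int(match.group("year"))
    return facts

# Illustrative snippets standing in for Web search results.
snippets = [
    "Einstein was born in 1879 in Ulm.",
    "Many sources note that Wolfgang Amadeus Mozart was born in 1756.",
]
facts = extract_birth_years(snippets)
print(facts)
```

With more text, even a simple pattern like this tends to find more instances, which is the point of the slide.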

9 No data like more data! s/knowledge/data/g; (Banko and Brill, ACL 2001) (Brants et al., EMNLP 2007)

10 Database Workloads OLTP (online transaction processing) Typical applications: e-commerce, banking, airline reservations User facing: real-time, low latency, highly-concurrent Tasks: relatively small set of standard transactional queries Data access pattern: random reads, updates, writes (involving relatively small amounts of data) OLAP (online analytical processing) Typical applications: business intelligence, data mining Back-end processing: batch workloads, less concurrency Tasks: complex analytical queries, often ad hoc Data access pattern: table scans, large amounts of data per query

11 One Database or Two? Downsides of co-existing OLTP and OLAP workloads Poor memory management Conflicting data access patterns Variable latency Solution: separate databases User-facing OLTP database for high-volume transactions Data warehouse for OLAP workloads How do we connect the two?

12 OLTP/OLAP Architecture OLTP ETL (Extract, Transform, and Load) OLAP
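The ETL step between the two databases can be sketched as three small functions. This is only an illustration under assumed names: the row fields (order_id, country, amount) and the cleaning rules are hypothetical, and a real pipeline would read from and write to actual databases.

```python
# Hypothetical OLTP order rows; field names are illustrative only.
oltp_orders = [
    {"order_id": 1, "country": "tr", "amount": "19.90"},
    {"order_id": 2, "country": "de", "amount": None},  # incomplete row
    {"order_id": 3, "country": "TR", "amount": "5.10"},
]

def extract(source):
    """Extract: read a snapshot of the operational rows."""
    return list(source)

def transform(rows):
    """Transform: drop incomplete rows, normalize country codes and types."""
    return [
        {
            "order_id": r["order_id"],
            "country": r["country"].upper(),
            "amount": float(r["amount"]),
        }
        for r in rows
        if r["amount"] is not None
    ]

def load(rows, warehouse):
    """Load: append the cleaned rows to the warehouse table."""
    warehouse.extend(rows)
    return len(rows)

warehouse_facts = []
loaded = load(transform(extract(oltp_orders)), warehouse_facts)
print(loaded, warehouse_facts)
```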

13 Structure of Data Warehouses
SELECT P.Brand, S.Country, SUM(F.Units_Sold)
FROM Fact_Sales F
INNER JOIN Dim_Date D ON F.Date_Id = D.Id
INNER JOIN Dim_Store S ON F.Store_Id = S.Id
INNER JOIN Dim_Product P ON F.Product_Id = P.Id
WHERE D.YEAR = 1997 AND P.Product_Category = 'tv'
GROUP BY P.Brand, S.Country;
Source: Wikipedia (Star Schema)
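To try the star-schema query, the sketch below builds the fact and dimension tables in an in-memory SQLite database. The table and column names come from the query itself; the sample rows are invented for illustration, and an ORDER BY clause is added only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Dim_Date    (Id INTEGER PRIMARY KEY, Year INTEGER);
CREATE TABLE Dim_Store   (Id INTEGER PRIMARY KEY, Country TEXT);
CREATE TABLE Dim_Product (Id INTEGER PRIMARY KEY, Brand TEXT, Product_Category TEXT);
CREATE TABLE Fact_Sales  (Date_Id INTEGER, Store_Id INTEGER,
                          Product_Id INTEGER, Units_Sold INTEGER);

-- Invented sample data.
INSERT INTO Dim_Date    VALUES (1, 1997), (2, 1998);
INSERT INTO Dim_Store   VALUES (1, 'TR'), (2, 'DE');
INSERT INTO Dim_Product VALUES (1, 'BrandA', 'tv'), (2, 'BrandB', 'radio');
INSERT INTO Fact_Sales  VALUES (1, 1, 1, 10), (1, 1, 1, 5), (1, 2, 1, 7),
                               (2, 1, 1, 100), (1, 1, 2, 50);
""")

# The fact table is joined to each dimension, filtered, then aggregated.
rows = conn.execute("""
SELECT P.Brand, S.Country, SUM(F.Units_Sold)
FROM Fact_Sales F
INNER JOIN Dim_Date D    ON F.Date_Id = D.Id
INNER JOIN Dim_Store S   ON F.Store_Id = S.Id
INNER JOIN Dim_Product P ON F.Product_Id = P.Id
WHERE D.Year = 1997 AND P.Product_Category = 'tv'
GROUP BY P.Brand, S.Country
ORDER BY S.Country
""").fetchall()
print(rows)
```

Only the 1997 TV sales contribute: the 1998 row and the radio row are filtered out before aggregation.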

14 OLAP Cubes (cube axes shown: product, store) Common operations: slice and dice, roll up/drill down, pivot
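The slice and roll-up operations can be illustrated on a tiny cube held in a dictionary. The product and store dimensions appear on the slide; the quarter dimension, the cell values, and the function names are assumptions made for this sketch.

```python
from collections import defaultdict

# Dimension order of each cell key; "quarter" is an assumed third dimension.
DIMS = ("product", "store", "quarter")

# Hypothetical cube cells: (product, store, quarter) -> units sold.
cube = {
    ("tv", "Istanbul", "Q1"): 10,
    ("tv", "Ankara", "Q1"): 4,
    ("tv", "Istanbul", "Q2"): 6,
    ("radio", "Istanbul", "Q1"): 3,
}

def slice_cube(cells, dim, value):
    """Slice: keep only the cells where one dimension has a fixed value."""
    i = DIMS.index(dim)
    return {key: units for key, units in cells.items() if key[i] == value}

def roll_up(cells, keep):
    """Roll up: sum the measure over every dimension not listed in keep."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for key, units in cells.items():
        totals[tuple(key[i] for i in idx)] += units
    return dict(totals)

by_product = roll_up(cube, ("product",))
q1_only = slice_cube(cube, "quarter", "Q1")
print(by_product, q1_only)
```

Drill down is the reverse of roll up: calling roll_up with more dimensions in keep returns finer-grained totals.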

15 Fast forward

16 ETL Bottleneck ETL is typically a nightly task: What happens if processing 24 hours of data takes longer than 24 hours? Hadoop is perfect: Ingest is limited by speed of HDFS Scales out with more nodes Massively parallel Ability to use any processing tool Cheaper than parallel databases ETL is a batch process anyway!

17 What's changed? Dropping cost of disks Cheaper to store everything than to figure out what to throw away Types of data collected From data that's obviously valuable to data whose value is less apparent Rise of social media and user-generated content Large increase in data volume Growing maturity of data mining techniques Demonstrates value of data analytics

18 Virtuous Product Cycle: a useful service -> analyze user behavior to extract insights (data science) -> transform insights into action (data products) -> $ (hopefully)

19 OLTP/OLAP/Hadoop Architecture OLTP ETL (Extract, Transform, and Load) Hadoop OLAP

20 ETL: Redux Often, with noisy datasets, ETL is the analysis! Note that ETL necessarily involves brute force data scans L, then E and T?

21 Big Data Ecosystem Source : datafloq.com

22 Cloud Computing Before clouds Grids Connection machine Vector supercomputers Cloud computing means many different things: Big data Rebranding of web 2.0 Utility computing Everything as a service

23 Utility Computing What? Computing resources as a metered service ("pay as you go") Ability to dynamically provision virtual machines Why? Cost: capital vs. operating expenses Scalability: "infinite" capacity Elasticity: scale up or down on demand Does it make sense? Benefits to cloud users Business case for cloud providers "I think there is a world market for about five computers."

24 Enabling Technology: Virtualization Traditional stack: Apps / Operating System / Hardware Virtualized stack: App + OS (one pair per virtual machine) / Hypervisor / Hardware

25 Everything as a Service Utility computing = Infrastructure as a Service (IaaS) Why buy machines when you can rent cycles? Examples: Amazon's EC2, Rackspace Platform as a Service (PaaS) Give me a nice API and take care of the maintenance, upgrades, etc. Example: Google App Engine Software as a Service (SaaS) Just run it for me! Examples: Gmail, Salesforce

26 Who cares? A source of problems Cloud-based services generate big data Clouds make it easier to start companies that generate big data As well as a solution Ability to provision analytics clusters on-demand in the cloud Commoditization and democratization of big data capabilities

27 Parallelization Challenges How do we assign work units to workers? What if we have more work units than workers? What if workers need to share partial results? How do we aggregate partial results? How do we know all the workers have finished? What if workers die? What's the common theme of all of these problems?
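Several of these questions show up even in a small single-machine example. The sketch below uses Python's standard thread pool to hand more work units than workers to a fixed pool, wait for all of them, and aggregate the partial results. The chunk contents and function names are illustrative; worker failure, the hardest question on the list, is not handled here.

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    """One work unit: count the words in a chunk of text."""
    return len(chunk.split())

def parallel_word_count(chunks, workers=2):
    """Assign chunks to a fixed pool of workers and sum the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() queues every chunk; idle workers pick up the next one, so
        # having more work units than workers is fine. Collecting the
        # results blocks until every worker has finished.
        partials = list(pool.map(count_words, chunks))
    return sum(partials)

# Four work units, two workers.
chunks = ["big data", "hadoop runs on commodity machines", "map reduce", ""]
total = parallel_word_count(chunks)
print(total)
```

The pool hides assignment and completion tracking on one machine; at cluster scale the same coordination has to survive machine failures, which is where an execution framework comes in.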

28 Where the rubber meets the road Concurrency is difficult to reason about Concurrency is even more difficult to reason about At the scale of datacenters and across datacenters In the presence of failures In terms of multiple interacting services Not to mention debugging The reality: Lots of one-off solutions, custom code Write your own dedicated library, then program with it Burden on the programmer to explicitly manage everything

29 The datacenter is the computer! Source: Barroso and Urs Hölzle (2009)

30 The datacenter is the computer It's all about the right level of abstraction Moving beyond the von Neumann architecture What's the "instruction set" of the datacenter computer? Hide system-level details from the developers No more race conditions, lock contention, etc. No need to explicitly worry about reliability, fault tolerance, etc. Separating the what from the how Developer specifies the computation that needs to be performed Execution framework ("runtime") handles actual execution
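One way to picture the split between the what and the how is a MapReduce-style word count, shown here as a single-process sketch. The developer writes only map_fn and reduce_fn; run_job stands in for the execution framework and is a toy assumption, not Hadoop's API. A real runtime would distribute the same two functions across many machines and handle failures.

```python
from collections import defaultdict

def map_fn(line):
    """The 'what', part 1: emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """The 'what', part 2: combine all values emitted for one key."""
    return word, sum(counts)

def run_job(lines, mapper, reducer):
    """The 'how': a toy stand-in for the runtime (map, group by key, reduce)."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

result = run_job(
    ["the datacenter is the computer", "the computer"], map_fn, reduce_fn
)
print(result)
```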

31 Scaling up vs. out No single machine is large enough Smaller cluster of large SMP machines vs. larger cluster of commodity machines (e.g., core machines vs core machines) Nodes need to talk to each other! Intra-node latencies: ~100 ns Inter-node latencies: ~100 µs Move processing to the data Clusters have limited bandwidth Process data sequentially, avoid random access Seeks are expensive, disk throughput is reasonable Seamless scalability Source: analysis on this and subsequent slides from Barroso and Urs Hölzle (2009)
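A back-of-the-envelope calculation shows why sequential processing beats random access on disk. All numbers below are assumptions chosen for illustration (10 ms per seek, 100 MB/s transfer rate, a 1 TB table of 100-byte records); they are not taken from the slides.

```python
# Assumed disk characteristics and data set size (illustrative only).
SEEK_SECONDS = 0.010          # one disk seek
THROUGHPUT_BYTES_S = 100e6    # sequential transfer rate
RECORDS = 10**10              # number of records
RECORD_BYTES = 100            # bytes per record, about 1 TB in total

def random_update_seconds(fraction):
    """Update a fraction of records in place: one seek to read, one to write."""
    return RECORDS * fraction * 2 * SEEK_SECONDS

def full_rewrite_seconds():
    """Stream the whole table once: read everything and write everything."""
    return 2 * RECORDS * RECORD_BYTES / THROUGHPUT_BYTES_S

random_days = random_update_seconds(0.01) / 86400
rewrite_hours = full_rewrite_seconds() / 3600
print(f"update 1% with seeks: {random_days:.1f} days")
print(f"rewrite everything sequentially: {rewrite_hours:.1f} hours")
```

Under these assumptions, updating just 1% of the records with random access takes weeks, while rewriting the entire table sequentially takes a few hours.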

32 Common Big Data Analytics Use Cases Batch Use case 1 : ETL / Batch query (Single Silo) Use case 2 : distributed log aggregation Batch + real time Use case 3 : real time data store Use case 4 : real time data store + batch analytics Real time / Streaming Use case 5 : Streaming

33 USE CASE 1 : ETL AND BATCH ANALYTICS at SCALE Data collected in various databases Data is scattered across multiple silos! Need a single silo to bring all data together and analyze

34 USE CASE 1 Using core Hadoop components No vendor lock in (works on all Hadoop distributions) Use HDFS (Hadoop File System) for storage Data Ingest with Sqoop Processing done by Map Reduce & Cousins Results are exported back to DB

35 USE CASE 2 : DATA COMING FROM MULTIPLE SOURCES Data coming in from multiple sources. Data is streaming in Capture data in Hadoop Do batch analytics

36 USE CASE 2 Flume To bring in logs from multiple sources Distributed, reliable way to collect and move data If uplinks are disconnected, Flume agents will store and forward data HDFS Flume can directly write data to HDFS Files are segmented or rolled by size / time e.g. Data _ log Data _ log Data _ log

37 USE CASE 2 Analytics stack : Pig / Hive / Oozie / Spark (Same as in Use Case 1) Oozie Workflow manager "Run this workflow every 1 hour" "Run this workflow when data shows up in the input directory" Can manage complex workflows Send alerts when processes fail, etc.

38 USE CASE 3 : REAL TIME DATA STORE Events are coming in Need to store the events Can be billions of events And query them in real time e.g. last 10 events by user

39 USE CASE 3 HDFS is not ideal for updating data in real time or for random access A scalable real time store is needed e.g. HBase, Cassandra to support real time updates Data comes trickling in (as stream) Saved data becomes queryable immediately Use HBase APIs (Java / REST) to build dashboards
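The access pattern of this use case (append events as they arrive, then read back the last 10 events for a user) can be sketched in memory as follows. The class and method names are hypothetical and this is not the HBase API; a real store such as HBase or Cassandra adds persistence, distribution, and scale to the same pattern.

```python
from collections import defaultdict, deque

class RecentEventStore:
    """Toy in-memory store that keeps only the newest events per user."""

    def __init__(self, keep=10):
        # A bounded deque silently drops the oldest event once it is full.
        self._events = defaultdict(lambda: deque(maxlen=keep))

    def put(self, user, event):
        """Write path: an event is queryable as soon as it is appended."""
        self._events[user].append(event)

    def last(self, user):
        """Read path: the user's most recent events, newest first."""
        return list(reversed(self._events.get(user, ())))

store = RecentEventStore(keep=10)
for n in range(12):
    store.put("user-1", f"click-{n}")
recent = store.last("user-1")
print(recent)
```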

40 USE CASE 4 : REAL TIME + BATCH Building on use case 3 Do extensive analysis on data on HBase E.g. : scoring user models flagging credit card transactions

41 USE CASE 4 HBase is the real time store Analytics is done via Map Reduce stack (Pig / Hive) Can we do them in a single stack? May not be a good idea Don't mix real time and batch analytics Batch analytics will impede real time performance

42 USE CASE 4 How to replicate data? 1 : periodic synchronization of data between clusters 2 : data goes to both clusters at the same time

43 USE CASE 5 : Streaming Decision times : batch ( hours / days) Use cases: Modeling ETL Reporting

44 MOVING TOWARDS FAST DATA Decision time : (near) real time seconds (or milliseconds) Use Cases Alerts (medical / security) Fraud detection Streaming is becoming more prevalent Connected Devices Internet of Things Beyond Batch We need faster processing / analytics

45 STREAMING ARCHITECTURE DATA BUCKET Captures incoming data Acts as a buffer, smooths out bursts So even if our processing is offline, we won't lose data Data bucket choices: Kafka, MQ (RabbitMQ, etc.), Amazon Kinesis

46 KAFKA ARCHITECTURE Producers write data to brokers Consumers read data from brokers All of this is distributed / parallel Failure tolerant Data is stored as topics sensor_data alerts s
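The producer, broker, consumer, and topic roles can be sketched with a toy in-memory broker. This is not the Kafka client API: the class, its methods, and the consumer name are hypothetical, and a real broker is distributed, replicated, and persistent. The topic names sensor_data and alerts are taken from the slide.

```python
from collections import defaultdict

class ToyBroker:
    """Single-process stand-in for a broker: one append-only log per topic."""

    def __init__(self):
        self._topics = defaultdict(list)   # topic -> list of messages
        self._offsets = defaultdict(int)   # (consumer, topic) -> next offset

    def produce(self, topic, message):
        """Producers append to the end of a topic's log."""
        self._topics[topic].append(message)

    def consume(self, consumer, topic, max_messages=100):
        """Consumers read from their own offset, so nothing is lost while
        a consumer is offline; it simply resumes where it left off."""
        start = self._offsets[(consumer, topic)]
        batch = self._topics[topic][start:start + max_messages]
        self._offsets[(consumer, topic)] = start + len(batch)
        return batch

broker = ToyBroker()
broker.produce("sensor_data", {"temp": 21.5})
broker.produce("sensor_data", {"temp": 22.0})
first = broker.consume("dashboard", "sensor_data")

# The consumer is "offline" while more data arrives; the log buffers it.
broker.produce("sensor_data", {"temp": 35.0})
broker.produce("alerts", "temperature spike")
second = broker.consume("dashboard", "sensor_data")
print(first, second)
```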

47 STREAMING ARCHITECTURE PROCESSING ENGINE Need to process events with low latency So many to choose from! Choices Storm Spark NiFi Flink

48 STREAMING ARCHITECTURE DATA STORE Where processed data ends up Need to absorb data in real time Usually a NoSQL storage HBase Cassandra Lots of NoSQL stores

49 LAMBDA ARCHITECTURE Each component is scalable Each component is fault tolerant Incorporates best practices All open source!

50 LAMBDA ARCHITECTURE 1. All new data is sent to both batch layer and speed layer 2. Batch layer Holds master data set (immutable, append-only) Answers batch queries 3. Serving layer updates batch views so they can be queried ad hoc 4. Speed Layer Handles new data Facilitates fast / real-time queries 5. Query layer Answers queries using batch & real-time views
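The five steps can be sketched for a simple per-key counter. This is a single-process illustration under assumed names (LambdaCounts, ingest, run_batch, query); in a real deployment each layer is a separate distributed system.

```python
from collections import Counter

class LambdaCounts:
    """Toy Lambda architecture that counts events per key."""

    def __init__(self):
        self.master = []             # batch layer: append-only master data set
        self.batch_view = Counter()  # serving layer: view built by the batch job
        self.speed_view = Counter()  # speed layer: data since the last batch run

    def ingest(self, key):
        """Step 1: new data goes to both the batch layer and the speed layer."""
        self.master.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        """Steps 2-3: recompute the batch view from the full master data set,
        then discard the real-time view it now covers."""
        self.batch_view = Counter(self.master)
        self.speed_view.clear()

    def query(self, key):
        """Step 5: answer from the batch view plus the real-time view."""
        return self.batch_view[key] + self.speed_view[key]

system = LambdaCounts()
for key in ["a", "a", "b"]:
    system.ingest(key)
before_batch = system.query("a")   # answered by the speed layer alone
system.run_batch()
system.ingest("a")                 # arrives after the batch run
after_batch = system.query("a")    # batch view + speed view
print(before_batch, after_batch)
```

Because the batch view is always recomputed from the immutable master data set, a bug in the speed layer only affects results until the next batch run.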

MODERN BIG DATA DESIGN PATTERNS: CASE DRIVEN DESIGNS SUJEE MANIYAM FOUNDER / PRINCIPAL @ ELEPHANT SCALE www.elephantscale.com sujee@elephantscale.com HI, I'M SUJEE MANIYAM Founder / Principal @ ElephantScale