Big Data and Cloud Computing


1 Big Data and Cloud Computing
Presented at the Faculty of Computer Science, University of Murcia
Presenter: Muhammad Fahim, PhD
Department of Computer Eng., Istanbul S. Zaim University, Istanbul, Turkey

2 About Our University
The distance from Murcia to Istanbul is approximately 2559 kilometers (km).

3 Self Introduction
BS (Computer Science), Institute of Computing and Information Technology, Gomal University, 2007
MS (Computer Science), Department of Computer Science, National University of Computer and Emerging Sciences, 2009
PhD (Computer Engineering), Department of Computer Engineering, Kyung Hee University, Feb. 2014
Postdoc, Ubiquitous Computing Lab, Kyung Hee University, Aug
Assistant Professor, Department of Computer Engineering, Istanbul S. Zaim University, Sep. 2014 to present

4 Agenda
What is Data?
Context vs Understanding in Data
Big Data Characteristics and Processing
MapReduce and Hadoop
Big Data Solution for Real Scenarios
Cloud Computing Services and Business Model
Cloud Enabling Technologies

5 What is Data?
We live in the data age
Data becomes information, and information becomes knowledge
Knowledge is power
Data -> Information -> Knowledge

6 What is Data?
Ref: http://www.allthingy.

7 Context vs Understanding in Data
Ref:

8 Every Day Data Statistics
Every day, we create 2.5 quintillion* bytes of data**, so much that 90% of the data in the world today has been created in the last two years alone. (IBM)
*2.5 billion gigabytes
**every day in
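The footnote's conversion can be checked with a quick calculation. This is a minimal sketch assuming decimal (SI) units, where one quintillion is 10^18 and one gigabyte is 10^9 bytes; the constant and function names are illustrative only.

```python
# Assumption: decimal (SI) units, 1 GB = 10**9 bytes; one quintillion = 10**18.
QUINTILLION = 10**18
BYTES_PER_GB = 10**9

def bytes_to_gigabytes(n_bytes: int) -> int:
    """Convert a byte count to whole decimal gigabytes."""
    return n_bytes // BYTES_PER_GB

# 2.5 quintillion bytes per day, written with integers to avoid rounding.
daily_bytes = 25 * QUINTILLION // 10
daily_gb = bytes_to_gigabytes(daily_bytes)
print(f"{daily_gb:,} GB per day")
```

Under these assumptions the result is 2.5 billion gigabytes, matching the slide's footnote.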

9 What happens in an Internet minute?

10 Check-point!!
How much data is generated by an Airbus A380 during one flight?

11 Types of Data
Relational Data (Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data (Social Networks)
Streaming Data

12 Big Data
Data comes from everywhere:
  o sensors used to gather climate information,
  o posts to social media sites,
  o digital pictures and videos,
  o purchase transaction records, and
  o cell phone GPS signals, to name a few.
This data is Big Data.

13 What to do with these data?
Aggregation and Statistics
Indexing, Searching, and Querying
  o Keyword based search
  o Pattern matching
Knowledge discovery
  o Data Mining
  o Business Intelligence
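Keyword based search is commonly built on an inverted index, which maps each word to the documents that contain it. The sketch below is a toy, in-memory illustration under that assumption; the sample documents and the names build_index and search are hypothetical and not taken from the talk.

```python
from collections import defaultdict

def build_index(documents: dict[int, str]) -> dict[str, set[int]]:
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Copy the first posting set so intersections never modify the index.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

docs = {1: "big data processing", 2: "cloud computing services", 3: "big cloud"}
index = build_index(docs)
print(search(index, "big"))
print(search(index, "big cloud"))
```

Building the index once makes each query a set lookup and intersection rather than a scan over every document.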

14 Big Data Characteristics
How big is Big Data?
What is big today may not be big tomorrow
Any data that can challenge our current technology in some way
  o Volume
  o Speed of generating
  o Meaningful analysis, etc.


15 Big Data Characteristics
Big Data Vectors (3Vs)
  o High-volume: amount of data
  o High-velocity: speed of collecting, acquiring, generating or processing data
  o High-variety: different data types such as audio, video and image data (mostly unstructured data)

16 Technology Players in Big Data

17 How to Deal with Big Data?
Where do you store a petabyte?
How do you read it?
How do you process it?
What if something goes wrong?
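A back-of-the-envelope calculation shows why reading a petabyte is a distributed problem. The sketch assumes a sustained sequential read rate of 100 MB/s per disk and a hypothetical cluster of 1,000 disks read in parallel; both figures are illustrative assumptions, not numbers from the talk.

```python
PETABYTE = 10**15            # bytes, decimal units
DISK_RATE = 100 * 10**6      # assumed: 100 MB/s sequential read per disk

def read_time_seconds(n_bytes: int, n_disks: int = 1) -> float:
    """Time to scan n_bytes when the data is spread evenly over n_disks."""
    return n_bytes / (DISK_RATE * n_disks)

single = read_time_seconds(PETABYTE)
parallel = read_time_seconds(PETABYTE, n_disks=1000)
print(f"1 disk:     {single / 86400:.1f} days")
print(f"1000 disks: {parallel / 3600:.1f} hours")
```

Under these assumptions a single disk needs months for one pass over the data, while a thousand disks working in parallel need a few hours. That gap is what motivates spreading both storage and computation across many machines.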

18 Some Challenges in Big Data
Big Data integration is multidisciplinary
Less than 10% of the Big Data world is genuinely relational
Meaningful data integration in real scenarios is messy, schema-less and complex

19 How to Process Big Data?
How to process big data with reasonable cost and time?

20 How to Process Big Data?
Parallel DBMS Tech.
  o Relational Data Model
  o Indexing
  o Familiar SQL interface
  o Advanced query optimization
  o Well understood and studied
  o Proprietary DBMS engines intended as Data Warehousing solutions for very large enterprises
MapReduce
  o Data-parallel programming model
  o A distributed implementation over commodity clusters
  o Pioneered by Google
  o Popularized by Yahoo!

21 MapReduce Advantages
Automatic Parallelization
Run-time
  o Data partitioning
  o Handling machine failures
  o Managing inter-machine communication
Completely transparent to the programmer/analyst/user

22 How to use MapReduce?
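The programming model can be illustrated with the classic word-count example. This is a single-process sketch in plain Python, not Hadoop code: the function names (map_phase, shuffle, reduce_phase) are illustrative, and a real framework would run the map and reduce calls in parallel across machines while handling partitioning and failures itself.

```python
from collections import defaultdict

def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data is big", "cloud data"])
print(counts)
```

The user writes only the map and reduce functions; everything in shuffle and word_count stands in for work the run-time does on the user's behalf.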

23 MapReduce and Hadoop
Google invented a new style of data processing known as MapReduce
A year later, Google published a white paper describing the MapReduce framework
Doug Cutting and Mike Cafarella, inspired by the white paper, created Hadoop to apply these concepts in an open-source software framework supporting distribution for the Nutch search engine project

24 Some Hadoop Milestones
Hadoop wins the Terabyte Sort Benchmark (sorted 1 terabyte of data in 209 seconds, compared to the previous record of 297 seconds)
Hadoop's HBase, Hive and Pig subprojects completed, adding more computational power to the Hadoop framework
ZooKeeper completed
Hadoop and Hadoop alpha and so on -- Latest one so far Hadoop and released on 11 Feb,

25 Real Scenarios
Large dataset computing
Facebook Messages


26 Hadoop Solutions
Non-real time large dataset computing:
  o NY Times was dynamically generating PDFs of articles from
  o Wanted to pre-generate and statically serve articles to improve performance
  o Using Hadoop + MapReduce running on EC2 / S3, converted 4TB of TIFFs into 11 million PDF articles in 24 hrs
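The figures on this slide imply an aggregate processing rate that is easy to work out. The sketch below assumes decimal terabytes (1 TB = 10^12 bytes) and that the job ran evenly over the full 24 hours; it says nothing about how many machines were used.

```python
TIFF_BYTES = 4 * 10**12      # 4 TB of input, assuming decimal terabytes
PDF_COUNT = 11_000_000       # articles produced
SECONDS = 24 * 3600          # the 24-hour run

# Average rates across the whole cluster, not per machine.
mb_per_second = TIFF_BYTES / SECONDS / 10**6
pdfs_per_second = PDF_COUNT / SECONDS
print(f"{mb_per_second:.1f} MB/s read, {pdfs_per_second:.0f} PDFs/s written")
```

Under these assumptions the cluster as a whole averaged roughly 46 MB/s of input and about 127 PDFs per second.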

27 Facebook Messages
Design Requirements
  o Integrate display of email, SMS and chat messages between pairs and groups of users
  o Suited for production use between 500 million people immediately after launch
  o Stringent latency and uptime requirements

28 Hadoop in the Wild: Facebook Messages
System Requirements
  o High write throughput
  o Cheap and elastic storage
  o Low latency
  o High consistency
  o Disk-efficient sequential and random read performance

29 Hadoop in the Wild: Facebook Messages
Classic Solution
  o These requirements are typically met using a large MySQL cluster and caching tiers using Memcached
  o Content on HDFS could be loaded into MySQL or Memcached if needed by the web tier

30 Hadoop in the Wild: Facebook Messages
Problems with Classic Solution
  o MySQL has low random write throughput: a BIG problem for messaging!
  o Difficult to scale MySQL clusters rapidly while maintaining performance
  o MySQL clusters have high management overhead

31 Hadoop in the Wild: Facebook Messages
Facebook's Solution

32 Hadoop in the Wild: Facebook Messages
Facebook's Solution
  o Improve and adapt HDFS and HBase to scale to FB's workload and operational considerations
  o Major concern was availability: the NameNode is a single point of failure (SPOF) and failover times are at least 20 minutes
  o Proprietary "AvatarNode": eliminates the SPOF, makes HDFS safe to deploy even with a 24/7 uptime requirement
  o Performance improvements for real-time workload

33 Hadoop Related Subprojects
Pig
  o High-level language for data analysis
HBase
  o Table storage for semi-structured data
ZooKeeper
  o Coordinating distributed applications
Hive
  o SQL-like query language and Metastore
Storm
  o Storm provides realtime computation
Mahout
  o Machine learning

34 Break Time
After the break we will talk about the Cloud Computing paradigm


35 Existing Computing Paradigms
Personal Computing
Reconfigurable Computing
Parallel Computing
Ubiquitous Computing
Super Computing
Grid Computing
Cluster Computing
Distributed Computing
Autonomic Computing
Mobile Computing
Utility Computing
Cloud Computing
Pervasive Computing

36 Cloud Computing
Cloud Computing is a general term used to describe a new class of network-based computing that takes place over the Internet
Basically a step on from Utility Computing
A collection/group of integrated and networked hardware, software and Internet infrastructure

37 Cloud Computing
On-demand network access to a shared pool of configurable computing resources
  o For example: networks, servers, storage, applications, and services
Revolutionizing health care, financial systems, scientific research, and society

38 Cloud Computing
And many, many more

39 What is Cloud?
A single-site cloud (also known as a datacenter) consists of
  o Compute nodes grouped into racks
  o Switches connecting the racks
  o A network topology
  o Storage nodes
  o Front-end for submitting jobs and receiving client requests

40 Four Features of Cloud Computing
1. Massive Scale
2. Data intensive nature
3. On demand access
4. New cloud programming paradigms

41 1. Massive Scale
MBs, TBs, PBs, XBs
Facebook [2012]
  o 30,000 machines in 2009
  o 60,000 machines in 2010
  o 180,000 machines in 2012
Yahoo! [2009]
  o 100,000 machines
  o Split into clusters of 4000
eBay [2012]
  o 50,000 machines
Google: A lot

42 2. Data Intensive Nature
In data intensive computing, the focus shifts from computation to data.
CPU utilization is no longer the most important resource metric; I/O is.


43 3. On Demand Access IaaS
It is also known as Cloud Computing Services
It consists of the following kinds of access
IaaS: Infrastructure as a Service
  o Delivery of computer infrastructure as a service
  o Buy resources
  o Servers
  o Software
  o Data center space
  o Network equipment as fully outsourced services
Example: Amazon Web Services

44 3. On Demand Access PaaS
PaaS: Platform as a Service
  o It provides flexible computing and storage infrastructure, coupled with a software platform
  o For example: Google App Engine

45 3. On Demand Access SaaS
SaaS: Software as a Service
  o It provides software services. Often said to subsume SOA (Service Oriented Architecture)
  o For example: Google Docs

46 Cloud Computing Business Model

47 Cloud Computing Business Model

48 Cloud Enabling Technology
1. Virtualization
2. Web 2.0
3. Distributed Storage
4. Distributed Computing
5. Utility Computing
6. Network Bandwidth & Latency
7. Fault Tolerant Systems

49 1. Virtualization Technology
Virtualization technology is a major enabler of cloud computing
It's a path to share IT resource pools: Web servers, storage, data, network, software and databases
Higher utilization rates

50 2. Web 2.0
Web 2.0 describes World Wide Web sites that emphasize user-generated content, usability, and interoperability.

51 3. Distributed Storage
A distributed data store is a computer network where information is stored on more than one node, often in a replicated fashion.
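Replicated placement can be sketched in a few lines. The example below hashes a key to choose a starting node and stores copies on the next nodes in the list; the node names, the replication factor of 3, and the function name replica_nodes are all illustrative assumptions. It is a simplified scheme for explanation and is not how HDFS or any particular store places replicas.

```python
import hashlib

def replica_nodes(key: str, nodes: list[str], replicas: int = 3) -> list[str]:
    """Pick `replicas` distinct nodes for a key, starting at a hashed position."""
    if replicas > len(nodes):
        raise ValueError("not enough nodes for the requested replication factor")
    # A stable hash gives the same placement on every machine and every run.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
placement = replica_nodes("message:42", nodes)
print(placement)
```

Because every copy lands on a different node, the data stays readable when any single node is lost.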

52 4. Distributed Computing
Using distributed systems to solve large problems
Distributed system: multiple autonomous computers connected through a communication network
Information is exchanged using communication models, e.g. MPI (Message Passing Interface)

53 5. Utility Computing
Water, gas, and electricity are provided to every home and business as commodity services
You get connected to the utility companies' public infrastructure
You get these utility services on demand and you pay as you use
Utility Computing does the same for computing resources

54 6. Network Bandwidth & Latency
Latency is delay: the amount of time it takes a packet to travel from source to destination
Latency is normally expressed in milliseconds
Bandwidth is the amount of data that can be transferred during a second
Bandwidth is normally expressed in bits per second
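The two quantities combine into a simple estimate of transfer time: the one-way latency plus the payload size divided by the bandwidth. The sketch below uses that simplified model and ignores protocol overhead, congestion and round trips; the 100 Mbit/s link and 50 ms latency are assumed values chosen only for illustration.

```python
def transfer_time_seconds(size_bytes: int, bandwidth_bps: float, latency_ms: float) -> float:
    """One-way latency plus the time to push every bit through the link."""
    return latency_ms / 1000 + (size_bytes * 8) / bandwidth_bps

# Assumed link: 100 Mbit/s bandwidth, 50 ms latency.
small = transfer_time_seconds(1_000, 100e6, 50)          # 1 KB request
large = transfer_time_seconds(1_000_000_000, 100e6, 50)  # 1 GB file
print(f"1 KB: {small * 1000:.2f} ms, 1 GB: {large:.2f} s")
```

In this model the small request is dominated by latency, while the large file is dominated by bandwidth.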

55 7. Fault Tolerant Systems
Fault-tolerant describes a computer system or component designed so that, in the event that a component fails, a backup component or procedure can immediately take its place with no loss of service
Fault tolerance can be provided with software, embedded in hardware, or provided by some combination of the two
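In software, the "backup takes its place" idea often appears as failover: try the primary component and fall back to a backup if it fails. The sketch below is a minimal illustration with hypothetical primary and backup functions; a production system would add health checks, timeouts and retries.

```python
def call_with_failover(components, request):
    """Try each component in order; return the first successful response."""
    errors = []
    for component in components:
        try:
            return component(request)
        except Exception as exc:  # a failed component triggers failover
            errors.append(exc)
    raise RuntimeError(f"all {len(components)} components failed: {errors}")

def primary(request):
    """Hypothetical primary component, simulated as being down."""
    raise ConnectionError("primary is down")

def backup(request):
    """Hypothetical backup component that serves the request."""
    return f"handled {request!r} on backup"

result = call_with_failover([primary, backup], "read message 42")
print(result)
```

The caller never sees the primary's failure; it only sees an error if every component in the list fails.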

56 Why Cloud Computing?
Large Scale Data Intensive Applications
Flexibility
Scalability
Customization
No Maintenance
Effect:
  o Reduce Cost
  o Reduce Maintenance
  o High Utilization
  o High Availability

57 Why Cloud Computing? Flexibility
Software: any software platform
Access: access resources from any machine connected to the Internet
Availability: deploy infrastructure from anywhere at any time
Control: software controls infrastructure

58 Why Cloud Computing? Scalability
Instant control via software
  o Add/cancel/rebuild resources instantly
Start small, then scale your resources up/down as you need
Illusion of infinite resources available on demand

59 Why Cloud Computing? Customization
Everything on your wish list
  o Software platforms
  o Storage
  o Network bandwidth
  o Speed

60 Why Cloud Computing? Maintenance
Reduce the size of the IT department
Maintenance is the responsibility of the cloud vendor
This includes:
  o Software updates
  o Security patches
  o Monitoring the system's health
  o System backup, etc.

61 Drawbacks
Security and Privacy
Vendor lock-in
Network dependent
Migration

62 Reference
This presentation material is based on tutorials by Matei Zaharia and Dr. Bina Ramamurthy.
Resources: Hadoop; Pig; Hive; video tutorials; Amazon Web Services; Amazon Elastic MapReduce guide; Apache Hadoop Tutorial; Cloudera videos by Aaron Kimball

63 Thank You

Simple to start
What is the maximum file size you have dealt with so far?
  o Movies/Files/Streaming video that you have used?
  o What have you observed?
What is the maximum download speed you get?
Simple computation