A BigData Tour HDFS, Ceph and MapReduce


1 A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Dursi - SciNet Toronto Hadoop Tutorial; Amir Payberah - Course in Data Intensive Computing, SICS; Yahoo! Developer Network MapReduce Tutorial

2 Data Management and Processing Data intensive computing is concerned with the production, manipulation and analysis of data ranging from hundreds of megabytes (MB) to petabytes (PB) and beyond It draws on a range of supporting parallel and distributed computing technologies to deal with the challenges of data representation, reliable shared storage, efficient algorithms and scalable infrastructure for performing analysis

3 Challenges Ahead Challenges with data intensive computing Scalable algorithms that can search and process massive datasets New metadata management technologies that can scale to handle complex, heterogeneous and distributed data sources Support for accessing in-memory multi-terabyte data structures High performance, highly reliable petascale distributed file systems Techniques for data reduction and rapid processing Software mobility to move computation to where the data is located Hybrid interconnects with support for multi-gigabyte data streams Flexible and high performance software integration techniques Hadoop A family of related projects, best known for MapReduce and the Hadoop Distributed File System (HDFS)

4 Data Intensive Computing Data volumes increasing massively! Clusters, storage capacity increasing massively! Disk speeds are not keeping pace.! Seek speeds even worse than read/write! [Chart: disk bandwidth (MB/s) and CPU speed (MIPS) over time; chart labels include Mahout, data mining, 1000x]

5 Scale-Out Disk streaming speed ~ 50MB/s! 3TB =17.5 hrs! 1PB = 8 months! Scale-out (weak scaling) - filesystem distributes data on ingest Jonathan Dursi

6 Seeking too slow! ~10ms for a seek! Enough time to read half a megabyte! Batch processing! Go through entire data set in one (or small number) of passes Scale-Out Jonathan Dursi

7 Combining results Each node preprocesses its local data! Shuffles its data to a small number of other nodes! Final processing, output is done there Jonathan Dursi

8 Fault Tolerance Data also replicated upon ingest! Runtime watches for dead tasks, restarts them on live nodes! Re-replicates Jonathan Dursi

9 Why Hadoop Drivers 500M+ unique users per month Billions of interesting events per day Data analysis is key Need massive scalability PBs of storage, millions of files, 1000s of nodes Need to do this cost effectively Use commodity hardware Share resources among multiple projects Provide scale when needed Need reliable infrastructure Must be able to deal with failures hardware, software, networking Failure is expected rather than exceptional Must be transparent to applications, since it is very expensive to build reliability into each application The Hadoop infrastructure provides these capabilities

10 Introduction to Hadoop Apache Hadoop Based on 2004 Google MapReduce Paper Originally composed of HDFS (distributed F/S), a core-runtime and an implementation of Map-Reduce Open Source Apache Foundation project Yahoo! is Apache Platinum Sponsor History Started in 2005 by Doug Cutting Yahoo! became the primary contributor in 2006 Yahoo! scaled it from 20 node clusters to 4000 node clusters today Portable Written in Java Runs on commodity hardware Linux, Mac OS/X, Windows, and Solaris

11 Jonathan Dursi

12 HPC vs Hadoop HPC attitude The problem of disk-limited, loosely-coupled data analysis was solved by throwing more disks at it and using weak scaling Flip-side: A single novice developer can write real, scalable, multi-node data-processing tasks in Hadoop-family tools in an afternoon MPI... less so Jonathan Dursi

13 Jonathan Dursi

14 Jonathan Dursi

15 Data Distribution: Disk Hadoop and similar architectures handle the hardest part of parallelism for you - data distribution.! On disk: HDFS distributes, replicates data as it comes in! Keeps track; computations local to data Jonathan Dursi

16 Data Distribution: Network On network: MapReduce (eg) works in terms of key-value pairs.! Preprocessing (map) phase ingests data, emits (k,v) pairs! Shuffle phase assigns reducers, gets all pairs with same key onto that reducer.! Programmer does not have to design communication patterns [Diagram: map output (key1,17) (key5,23) (key1,99) (key2,12) (key1,83) (key2,9) is grouped by the shuffle into (key1,[17,99,83]) (key5,[23]) (key2,[12,9])] Jonathan Dursi
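A minimal Python sketch of what the shuffle does with the pairs in the diagram above; the pair values come from the slide, while the function names and the use of a plain dictionary are just illustration, not Hadoop code:

    from collections import defaultdict

    # Output of two hypothetical map tasks: lists of (key, value) pairs,
    # matching the pairs shown in the diagram above.
    map_outputs = [
        [("key1", 17), ("key5", 23), ("key1", 99)],
        [("key2", 12), ("key1", 83), ("key2", 9)],
    ]

    def shuffle(map_outputs):
        """Group every emitted value by its key, as the shuffle phase does."""
        grouped = defaultdict(list)
        for partition in map_outputs:
            for key, value in partition:
                grouped[key].append(value)
        return grouped

    print(dict(shuffle(map_outputs)))
    # {'key1': [17, 99, 83], 'key5': [23], 'key2': [12, 9]}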

17 Jonathan Dursi

18 Built a reusable substrate The filesystem (HDFS) and the MapReduce layer were very well architected.! Enables many higher-level tools! Data analysis, machine learning, NoSQL DBs,...! Extremely productive environment! And Hadoop 2.x (YARN) is now much much more than just MapReduce Image from Jonathan Dursi

19 and Hadoop vs HPC Not either-or anyway! Use HPC to generate big / many simulations, Hadoop to analyze results! Use Hadoop to preprocess huge input data sets (ETL), and HPC to do the tightly coupled computation afterwards.! Besides,... Jonathan Dursi

20 Everything is converging 1/2 Jonathan Dursi

21 Everything is converging 2/2 Jonathan Dursi

22 Big Data Analytics Stack Amir Payberah

23 Big Data Storage (sans POSIX) I Traditional filesystems are not well-designed for large-scale data processing systems. I Efficiency has a higher priority than other features, e.g., directory service. I The massive size of the data means it tends to be stored across multiple machines in a distributed way. I HDFS, Amazon S3,... Amir Payberah

24 Big Data - Databases I Relational Database Management Systems (RDBMS) were not designed to be distributed. I NoSQL databases relax one or more of the ACID properties: BASE I Different data models: key/value, column-family, graph, document. I Dynamo, Scalaris, BigTable, HBase, Cassandra, MongoDB, Voldemort, Riak, Neo4J,... Amir Payberah

25 Big Data Resource Management I Different frameworks require different computing resources. I Large organizations need the ability to share data and resources between multiple frameworks. I Resource management: share resources in a cluster between multiple frameworks while providing resource isolation. I Mesos, YARN, Quincy,... Amir Payberah

26 YARN 1/3 To address Hadoop v1 deficiencies with scalability, memory usage and synchronization, the Yet Another Resource Negotiator (YARN) Apache sub-project was started Previously a single JobTracker service managed the whole cluster. Its roles were then split into separate daemons for Resource management Job scheduling/monitoring Hortonworks

27 YARN 2/3 YARN splits the JobTracker's responsibilities into Resource management the global Resource Manager daemon A per-application Application Master The resource manager and per-node slave Node Managers allow generic node management The resource manager has a pluggable scheduler Hortonworks

28 Hortonworks YARN 3/3 The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so based on the abstract notion of a Resource Container which incorporates resource elements such as memory, cpu, disk, network The NodeManager is the per-machine slave, which is responsible for launching the applications containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager. The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress. From the system perspective, the ApplicationMaster itself runs as a normal container.

29 Big Data Execution Engine I Scalable and fault-tolerant parallel data processing on clusters of unreliable machines. I Data-parallel programming model for clusters of commodity machines. I MapReduce, Spark, Stratosphere, Dryad, Hyracks,... Amir Payberah

30 Big Data Query/Scripting Languages I Low-level programming of execution engines, e.g., MapReduce, is not easy for end users. I Need high-level language to improve the query capabilities of execution engines. I It translates user-defined functions to low-level API of the execution engines. I Pig, Hive, Shark, Meteor, DryadLINQ, SCOPE,... Amir Payberah

31 Big Data Stream Processing I Providing users with fresh and low latency results. I Database Management Systems (DBMS) vs. Stream Processing Systems (SPS) I Storm, S4, SEEP, D-Stream, Naiad,... Amir Payberah

32 Big Data Graph Processing I Many problems are expressed using graphs: sparse computational dependencies, and multiple iterations to converge. I Data-parallel frameworks, such as MapReduce, are not ideal for these problems: slow I Graph processing frameworks are optimized for graph-based problems. I Pregel, Giraph, GraphX, GraphLab, PowerGraph, GraphChi,... Amir Payberah

33 Big Data Machine Learning I Implementing and consuming machine learning techniques at scale are difficult tasks for developers and end users. I There exist platforms that address it by providing scalable machine-learning and data mining libraries. I Mahout, MLBase, SystemML, Ricardo, Presto,... Amir Payberah

34 Hadoop Big Data Analytics Stack Amir Payberah

35 Spark Big Data Analytics Stack Amir Payberah

36 Hadoop Ecosystem Hortonworks

37 Hadoop Ecosystem From 2008 onwards usage exploded, leading to the creation of many tools on top of the Hadoop infrastructure

38 The Need For Filesystems I Controls how data is stored in and retrieved from disk. Amir Payberah

39 Distributed Filesystems I When data outgrows the storage capacity of a single machine: partition it across a number of separate machines. I Distributed filesystems: manage the storage across a network of machines. Amir Payberah

40

41 Hadoop Distributed File System (HDFS) A distributed file system designed to run on commodity hardware HDFS was originally built as infrastructure for the Apache Nutch web search engine project, with the aim of achieving fault tolerance, running on low-cost hardware and handling large datasets It is now an Apache Hadoop subproject Shares similarities with existing distributed file systems and supports traditional hierarchical file organization Reliable data replication and accessible via Web interface and Shell commands Benefits: Fault tolerance, high throughput, streaming data access, robustness and handling of large data sets HDFS is not a general purpose F/S

42 Assumptions and Goals Hardware failures Detection of faults, quick and automatic recovery Streaming data access Designed for batch processing rather than interactive use by users Large data sets Applications that run on HDFS have large data sets, typically in gigabytes to terabytes in size Optimized for batch reads rather than random reads Simple coherency model Applications need a write-once, read-many times access model for files Computation migration Computation is moved closer to where data is located Portability Easily portable between heterogeneous hardware and software platforms

43 What HDFS is not good for I Low-latency reads High-throughput rather than low latency for small chunks of data. HBase addresses this issue. I Large amount of small files Better for millions of large files instead of billions of small files. I Multiple writers Single writer per file. Writes only at the end of file, no support for arbitrary offset. Amir Payberah

44 HDFS Architecture The Hadoop Distributed File System (HDFS) Offers a way to store large files across multiple machines, rather than requiring a single machine to have disk capacity equal to/greater than the summed total size of the files HDFS is designed to be fault-tolerant using data replication and distribution of data When a file is loaded into HDFS, it is replicated and broken up into "blocks" of data These blocks are stored across the cluster nodes designated for storage, a.k.a. DataNodes.

45 Files and Blocks 1/3 I Files are split into blocks. I Blocks Single unit of storage: a contiguous piece of information on a disk. Transparent to user. Managed by the Namenode, stored by Datanodes. Blocks are traditionally either 64MB or 128MB: default is 64MB. Amir Payberah

46 Files and Blocks 2/3 I Why is a block in HDFS so large? To minimize the cost of seeks. I Time to read a block = seek time + transfer time. I Keeping the ratio seek time / transfer time small: we are reading data from the disk almost as fast as the physical limit imposed by the disk. I Example: if seek time is 10ms and the transfer rate is 100MB/s, to make the seek time 1% of the transfer time, we need to make the block size around 100MB. Amir Payberah
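The block-size arithmetic above can be checked with a few lines of Python (the numbers are the slide's assumptions, not measurements):

    # Pick a block size so that seek time is only a small fraction of transfer time.
    seek_time_s = 0.010          # 10 ms average seek (slide's assumption)
    transfer_rate_Bps = 100e6    # 100 MB/s sustained transfer (slide's assumption)
    target_ratio = 0.01          # want seek time to be ~1% of transfer time

    block_size_bytes = transfer_rate_Bps * seek_time_s / target_ratio
    print(f"block size = {block_size_bytes / 1e6:.0f} MB")   # 100 MB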

47 Files and Blocks 3/3 I Same block is replicated on multiple machines: default is 3 Replica placements are rack aware. 1st replica on the writer's node. 2nd replica on a node in a different (remote) rack. 3rd replica on a different node in that same remote rack. I Namenode determines replica placement. Amir Payberah

48 HDFS Daemons An HDFS cluster is managed by three types of processes Namenode Manages the filesystem, e.g., namespace, meta-data, and file blocks Metadata is stored in memory Datanode Stores and retrieves data blocks Reports to Namenode Runs on many machines Secondary Namenode Only for checkpointing. Not a backup for Namenode Amir Payberah

49 Hadoop Server Roles

50 NameNode 1/3 The HDFS namespace is a hierarchy of files and directories These are represented in the NameNode using inodes Inodes record attributes such as permissions, modification and access times, and namespace and disk space quotas The file content is split into large blocks (typically 128 megabytes, but user selectable file-by-file), and each block of the file is independently replicated at multiple DataNodes (typically three, but user selectable file-by-file) The NameNode maintains the namespace tree and the mapping of blocks to DataNodes A Hadoop cluster can have thousands of DataNodes and tens of thousands of HDFS clients per cluster, as each DataNode may execute multiple application tasks concurrently
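The two mappings described above can be pictured with a small, purely illustrative Python sketch (the path, block IDs and node names are made up; this is not HDFS code):

    # Namespace: path -> inode attributes plus the ordered list of block IDs.
    namespace = {
        "/user/demo/bigdata.dat": {
            "permissions": "rw-r--r--",
            "modified": "2017-11-15T12:00:00",
            "blocks": ["blk_1", "blk_2", "blk_3"],
        },
    }

    # Block map: block ID -> DataNodes currently holding a replica.
    block_map = {
        "blk_1": ["datanode1", "datanode2", "datanode5"],
        "blk_2": ["datanode2", "datanode3", "datanode6"],
        "blk_3": ["datanode1", "datanode4", "datanode6"],
    }

    def locate(path):
        """Answer a client request: which DataNodes hold each block of a file?"""
        return [(blk, block_map[blk]) for blk in namespace[path]["blocks"]]

    print(locate("/user/demo/bigdata.dat"))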

51 NameNode 2/3 The inodes and the list of blocks that define the metadata of the name system are called the image (the FsImage) NameNode keeps the entire namespace image in RAM Each client-initiated transaction is recorded in the journal, and the journal file is flushed and synced before the acknowledgment is sent to the client The NameNode is a multithreaded system and processes requests simultaneously from multiple clients.

52 NameNode 3/3 HDFS requires A NameNode process to run on one node in the cluster A DataNode service to run on each "slave" node that will be processing data When data is loaded into HDFS Data is replicated and split into blocks that are distributed across the DataNodes The NameNode is responsible for storage and management of metadata, so that when MapReduce or another execution framework calls for the data, the NameNode informs it where the needed data resides.

53 Where to Replicate? Tradeoff to choosing replication locations! Close: faster updates, less network bandwidth! Further: better failure tolerance! Default strategy: first copy at a different location on the same node, second on a different rack (switch), third on the same rack as the second, different node.! Strategy configurable.! Need to configure the Hadoop file system to know the location of nodes [Diagram: two racks, rack1 and rack2, each behind its own switch (switch 1, switch 2)] Jonathan Dursi

54 DataNode 1/3 Each block replica on a DataNode is represented by two files in the local native filesystem. The first file contains the data itself and the second file records the block's metadata including checksums for the data and the generation stamp. At startup each DataNode connects to a NameNode and performs a handshake. The handshake verifies that the DataNode belongs to the NameNode's namespace and runs the same version of the software A DataNode identifies block replicas in its possession to the NameNode by sending a block report. A block report contains the block ID, the generation stamp and the length for each block replica the server hosts The first block report is sent immediately after the DataNode registration Subsequent block reports are sent every hour and provide the NameNode with an up-to-date view of where block replicas are located on the cluster.

55 DataNode 2/3 During normal operation DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and the block replicas it hosts are available If the NameNode does not receive a heartbeat from a DataNode in ten minutes, it considers the DataNode to be out of service and the block replicas hosted by that DataNode to be unavailable The NameNode then schedules creation of new replicas of those blocks on other DataNodes. Heartbeats from a DataNode also carry information about total storage capacity, fraction of storage in use, and the number of data transfers currently in progress. These statistics are used for the NameNode's block allocation and load balancing decisions.

56 DataNode 3/3 The NameNode does not directly send requests to DataNodes. It uses replies to heartbeats to send instructions to the DataNodes The instructions include commands to replicate blocks to other nodes, remove local block replicas, re-register and send an immediate block report, and shut down the node These commands are important for maintaining the overall system integrity and therefore it is critical to keep heartbeats frequent even on big clusters. The NameNode can process thousands of heartbeats per second without affecting other NameNode operations.

57 HDFS Client 1/3 User applications access the filesystem using the HDFS client, a library that exports the HDFS filesystem interface The user is oblivious to backend implementation details, e.g. the number of replicas and which servers hold the appropriate blocks

58 HDFS Client 2/3 When an application reads a file, the HDFS client first asks the NameNode for the list of DataNodes that host replicas of the blocks of the file The list is sorted by the network topology distance from the client The client contacts a DataNode directly and requests the transfer of the desired block.

59-61 Reading a file Reading a file is the shorter path: get block locations, then read from a replica. Client:! Read lines from bigdata.dat 1. Open the file 2. Get block locations from the Namenode 3. Read blocks from a replica on a datanode [Diagram: Namenode holding the namespace /user/ljdursi/diffuse with bigdata.dat; datanode1, datanode2, datanode3] Jonathan Dursi

62 HDFS Client 3/3 When a client writes, it first asks the NameNode to choose DataNodes to host replicas of the first block of the file The client organizes a pipeline from node-to-node and sends the data When the first block is filled, the client requests new DataNodes to be chosen to host replicas of the next block A new pipeline is organized, and the client sends the further bytes of the file Choice of DataNodes for each block is likely to be different

63-68 Writing a file Writing a file is a multiple stage process! Create file! Get nodes for blocks! Start writing! Data nodes coordinate replication! Get acks back (while writing)! Complete Client:! Write newdata.dat Steps: 1. create 2. get nodes 3. start writing 4. replicate 5. ack 6. complete [Diagram: Namenode holding the namespace /user/ljdursi/diffuse; datanode1, datanode2, datanode3; existing file bigdata.dat] Jonathan Dursi

69 HDFS Federation I Hadoop 2+ I Each Namenode hosts part of the namespace and its blocks. I A Block Pool is a set of blocks that belong to a single namespace. I Enables support for very large clusters. Amir Payberah

70 File I/O and Leases in HDFS An application Adds data to HDFS by creating a new file and writing data to it Once the file is closed, the written bytes cannot be altered or removed; new data can only be appended HDFS implements a single-writer, multiple-reader model Leases are granted by the NameNode to HDFS clients Writer clients need to periodically renew the lease via a heartbeat to the NameNode On file close, the lease is revoked There are soft and hard limits for leases (the hard limit being an hour) A write lease does not prevent multiple readers from reading the file

71 Data Pipelining for Writing Blocks 1/2 An HDFS file consists of blocks When there is a need for a new block, the NameNode allocates a block with a unique block ID and determines a list of DataNodes to host replicas of the block The DataNodes form a pipeline, the order of which minimizes the total network distance from the client to the last DataNode

72 Data Pipelining for Writing Blocks 2/2 Bytes are pushed to the pipeline as a sequence of packets. The bytes that an application writes are first buffered at the client side After a packet buffer is filled (typically 64 KB), the data are pushed to the pipeline The next packet can be pushed to the pipeline before receiving the acknowledgment for the previous packets The number of outstanding packets is limited by the outstanding packets window size of the client.

73 HDFS Interfaces There are many interfaces to interact with HDFS The simplest way of interacting with HDFS is the command line Two properties are set in the HDFS configuration Default Hadoop filesystem fs.default.name: hdfs://localhost/ Used to determine the host (localhost) and port (8020) for the HDFS NameNode Replication factor dfs.replication Default is 3, disable replication by setting it to 1 (single datanode) Other HDFS interfaces HTTP: a read only interface for retrieving directory listings and data over HTTP FTP: permits the use of the FTP protocol to interact with HDFS
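As a sketch of the command-line interface mentioned above, the standard hdfs dfs sub-commands can be driven from Python via subprocess (this assumes a running HDFS cluster and the hdfs client on the PATH; the paths used are purely illustrative):

    import subprocess

    def hdfs(*args):
        """Run an `hdfs dfs` command and return its standard output."""
        result = subprocess.run(["hdfs", "dfs", *args],
                                capture_output=True, text=True, check=True)
        return result.stdout

    print(hdfs("-ls", "/"))                           # list the root directory
    hdfs("-mkdir", "-p", "/user/demo")                # create a directory
    hdfs("-put", "local_file.txt", "/user/demo/")     # copy a local file into HDFS
    print(hdfs("-cat", "/user/demo/local_file.txt"))  # read it back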

74 Replication in HDFS Replica placement Critical to improve data reliability, availability and network bandwidth utilization Rack-aware policy, as rack failure is far less likely than node failure With the default replication factor (3), one replica is put on one node in the local rack, another on a node in a different (remote) rack, and the last on a different node in the same remote rack One third of the replicas are on one node, two thirds of the replicas are on one rack, and the other third are evenly distributed across the remaining racks The benefit is reduced inter-rack write traffic Replica selection A read request is satisfied from a replica that is near the application Minimizes global bandwidth consumption and read latency If HDFS spans multiple data centers, a replica in the local data center is preferred over any remote replica (see the placement sketch below)
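A purely illustrative sketch of the default placement policy described above (not Hadoop's actual BlockPlacementPolicy; the cluster layout and node names are made up):

    import random

    def place_replicas(writer_node, writer_rack, cluster, replication=3):
        """cluster: dict mapping rack name -> list of node names."""
        replicas = [writer_node]                    # 1st replica: the writer's node
        remote_rack = random.choice([r for r in cluster if r != writer_rack])
        replicas += random.sample(cluster[remote_rack], 2)   # 2nd and 3rd replicas:
        return replicas[:replication]               # two different nodes on one remote rack

    cluster = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}
    print(place_replicas("n1", "rack1", cluster))   # e.g. ['n1', 'n5', 'n4']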

75 Communication Protocol All HDFS communication protocols are layered on top of the TCP/IP protocol A client establishes a connection to a configurable TCP port on the NameNode machine and uses the ClientProtocol DataNodes talk to the NameNode using the DataNode protocol A Remote Procedure Call (RPC) abstraction wraps both the ClientProtocol and the DataNode protocol The NameNode never initiates an RPC; instead it only responds to RPC requests issued by DataNodes or clients

76 Robustness The primary objective of HDFS is to store data reliably even during failures Three common types of failures: NameNode, DataNode and network partitions Data disk failure Heartbeat messages to track the health of DataNodes The NameNode performs the necessary re-replication on DataNode unavailability, replica corruption or disk fault Cluster rebalancing Automatically move data between DataNodes if the free space on a DataNode falls below a threshold or during sudden high demand Data integrity Checksum checking on HDFS files, during file creation and retrieval Metadata disk failure Manual intervention: no automatic recovery, restart or failover

77 Ceph An Alternative to HDFS in One Slide [Diagram: apps, hosts/VMs and clients access Ceph through RadosGW (S3/Swift), RBD and CephFS, all built on LibRados/RADOS; MON and MDS daemons manage cluster nodes running OSDs; pools are divided into placement groups (PGs), mapped onto OSDs by the CRUSH map] (source: activities/tf-storage/ws16/slides/low_cost_storage_cephopenstack_swift.pdf)

78 MAP-REDUCE

79 What is it? I A programming model: to batch process large data sets (inspired by functional programming). I An execution framework: to run parallel algorithms on clusters of commodity hardware. I Don't worry about parallelization, fault tolerance, data distribution, and load balancing (MapReduce takes care of these). I Hide system-level details from programmers. Amir Payberah

80 MapReduce Basics A programming model and its associated implementation for parallel processing of large data sets It was developed within Google as a mechanism for processing large amounts of raw data, e.g. crawled documents or web request logs. Capable of efficiently distributing the processing of TBs of data over 1000s of processing nodes This distribution implies parallel computing since the same computations are performed on each CPU, but with a different dataset (or different segment of a large dataset) The implementation's run-time system library takes care of parallelism, fault tolerance, data distribution, load balancing etc Complementary to RDBMS, but differs in many ways (data size, access, update, structure, integrity and scale) Features: fault tolerance, locality, task granularity, backup tasks, skipping bad records and so on

81 MapReduce Simple Dataflow I map function: processes data and generates a set of intermediate key/value pairs. I reduce function: merges all intermediate values associated with the same intermediate key. Amir Payberah

82 MapReduce Functional Programming Concepts MapReduce programs are designed to compute large volumes of data in a parallel fashion This model would not scale to large clusters (hundreds or thousands of nodes) if the components were allowed to share data arbitrarily The communication overhead required to keep the data on the nodes synchronized at all times would prevent the system from performing reliably or efficiently at large scale Instead, all data elements in MapReduce are immutable, meaning that they cannot be updated If in a mapping task you change an input (key, value) pair, it does not get reflected back in the input files; communication occurs only by generating new output (key, value) pairs which are then forwarded by the Hadoop system into the next phase of execution

83 MapReduce List Processing Conceptually, MapReduce programs transform lists of input data elements into lists of output data elements A MapReduce program will do this twice, using two different list processing idioms: map, and reduce These terms are taken from several list processing languages such as LISP, Scheme, or ML

84 MapReduce Mapping Lists The first phase of a MapReduce program is called mapping A list of data elements is provided, one at a time, to a function called the Mapper, which transforms each element individually to an output data element. Say there is a toupper(str) function, which returns an uppercase version of the input string. The Map would then turn input strings into a list of uppercase strings. Note: the input has not been modified; a new string has been returned

85 MapReduce Reducing Lists Reducing lets you aggregate values together A reducer function receives an iterator of input values from an input list. It then combines these values together, returning a single output value. Reducing is often used to produce "summary" data, turning a large volume of data into a smaller summary of itself. For example, "+" can be used as a reducing function, to return the sum of a list of input values.

86 Putting Map and Reduce together A MapReduce program has two components: one that implements the mapper, and another that implements the reducer Keys and values: In MapReduce, no value stands on its own. Every value has a key associated with it. Keys identify related values. Eg. the list below is for flight departures and the number of passengers that failed to board EK123 65, 12:00pm BA789 50, 12:02pm EK123 40, 12:05pm QF456 25, 12:15pm... The mapping and reducing functions receive not just values, but (key, value) pairs. The output of each of these functions is the same: both a key and a value must be emitted to the next list in the data flow.

87 MapReduce - Keys In MapReduce, an arbitrary number of values can be output from each phase; a mapper may map one input into zero, one, or one hundred outputs. A reducer may compute over an input list and emit one or a dozen different outputs Keys divide the reduce space: A reducing function turns a large list of values into one (or a few) output values Different colors represent different keys. All values with the same key are presented to a single reduce task. In MapReduce, all of the output values are not usually reduced together All of the values with the same key are presented to a single reducer together

88 Word Count Was used as an example in the original MapReduce paper! Now basically the hello world of map reduce! Do a count of words of some set of documents.! A simple model of many actual web analytics problems [Example: file01 contains "Hello World / Bye World", file02 contains "Hello Hadoop / Goodbye Hadoop"; output/part contains Hello 2, World 2, Bye 1, Hadoop 2, Goodbye 1] Jonathan Dursi

89 High-Level Structure of a MR Program 1/2

    mapper (filename, file-contents):
      for each word in file-contents:
        emit (word, 1)

    reducer (word, values):
      sum = 0
      for each value in values:
        sum = sum + value
      emit (word, sum)

90 High-Level Structure of a MR Program 2/2 (same mapper/reducer pseudocode as above) Several instances of the mapper function are created on the different machines in a Hadoop cluster Each instance receives a different input file (it is assumed that there are many such files) The mappers output (word, 1) pairs which are then forwarded to the reducers Several instances of the reducer method are also instantiated on the different machines Each reducer is responsible for processing the list of values associated with a different word The list of values will be a list of 1's; the reducer sums up those ones into a final count associated with a single word. The reducer then emits the final (word, count) output which is written to an output file.
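A minimal, runnable Python rendering of the pseudocode above (plain Python simulating map, shuffle and reduce in one process, not the Hadoop API), using the file01/file02 contents from the word count example:

    from collections import defaultdict

    files = {
        "file01": "Hello World Bye World",
        "file02": "Hello Hadoop Goodbye Hadoop",
    }

    def mapper(filename, contents):
        for word in contents.split():
            yield word, 1                      # emit (word, 1)

    def reducer(word, values):
        yield word, sum(values)                # emit (word, total count)

    # "Shuffle": group every value emitted by the mappers by its key.
    grouped = defaultdict(list)
    for filename, contents in files.items():
        for word, count in mapper(filename, contents):
            grouped[word].append(count)

    for word, values in sorted(grouped.items()):
        for key, total in reducer(word, values):
            print(key, total)
    # Bye 1, Goodbye 1, Hadoop 2, Hello 2, World 2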

91 Word Count How would you do this with a huge document?! Each time you see a word: if it is already in your list, add a tick mark beside it, otherwise add the new word with a single tick!...but hard to parallelize (updating the list) [Same file01/file02 example and expected output as above] Jonathan Dursi

92 Word Count MapReduce way - all the hard work is done by the shuffle, automatically.! Map: just emit a 1 for each word you see! file01 "Hello World / Bye World" maps to (Hello,1) (World,1) (Bye,1) (World,1); file02 "Hello Hadoop / Goodbye Hadoop" maps to (Hello,1) (Hadoop,1) (Goodbye,1) (Hadoop,1) Jonathan Dursi

93 Word Count Shuffle assigns keys (words) to each reducer, sends (k,v) pairs to the appropriate reducer! Reducer just has to sum up the ones I The shuffle phase between the map and reduce phases creates a list of values associated with each key. [Diagram: reducer 1 receives (Hello,[1,1]) (World,[1,1]) (Bye,[1]) and emits Hello 2, World 2, Bye 1; reducer 2 receives (Hadoop,[1,1]) (Goodbye,[1]) and emits Hadoop 2, Goodbye 1] Jonathan Dursi

94 MapReduce Data Flow 1/4

95 MapReduce Data Flow 2/4 MapReduce inputs typically come from input files loaded onto the Hadoop cluster's HDFS These files are evenly distributed across all nodes Running a MapReduce program involves running mapping tasks on many or all of the nodes in the Hadoop cluster Each of these mapping tasks is equivalent: no mappers have particular "identities" associated with them Thus, any mapper can process any input file. Each mapper loads the set of files local to that machine and processes them

96 MapReduce Data Flow 3/4 When the mapping phase has completed, the intermediate (key, value) pairs must be exchanged between machines All values with the same key are sent to a single reducer

97 MapReduce Data Flow 4/4 The reduce tasks are spread across the same nodes in the cluster as the mappers. This is the only communication step in MapReduce The user never explicitly marshals information from one machine to another. All data transfer is handled by the Hadoop MapReduce platform runtime, guided implicitly by the different keys associated with values. This is a fundamental element of Hadoop MapReduce's reliability. If nodes in the cluster fail, tasks must be able to be restarted. If they have been performing side-effects, e.g., communicating with the outside world, then the shared state must be restored in a restarted task. By eliminating communication and side-effects, restarts can be handled more gracefully.

98 In Depth View of Map-Reduce

99

100 InputFormat 1/2 Input files reside on HDFS and can be of an arbitrary format Deciding how to split up and read these files is decided by the InputFormat class. It: Selects the files or other objects that should be used for input Defines the InputSplits that break a file into tasks Provides a factory for RecordReader objects that read the file

101 InputFormat 2/2 An InputSplit is the unit of work which comprises a single map task By default this is a 64MB chunk. As various blocks make up a file, it is possible to run parallel Map tasks on these chunks

102 RecordReader and Mapper An InputSplit defines a unit of work The RecordReader class defines how to load the data and convert it into (key, value) pairs that the Map phase can use The Mapper function does the Map, emitting (key, value) pairs for use by the Reduce phase

103 Partition and Shuffle On completion of the first batch of Map tasks, nodes begin exchanging intermediate outputs with the Reducers; this is called the Shuffle phase Each reducer is given a different subset of the key space (called Partitions) by the Partitioner class These (key,value) pairs are then inputs for the Reduce phase

104 Sort, Reduce and Output Intermediate (key, value) pairs from the Shuffle process are then sorted as input to the Reducer The Reducers iterate over all their values and produce an output The outputs are then written back to HDFS

105 Handling Failure Worker failure To detect failure, the master pings every worker periodically If no response is received from a worker in a certain amount of time, the master marks the worker as failed Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers Completed map tasks are re-executed as their output is stored on the local disk(s) of the failed machine and is therefore inaccessible Completed reduce tasks do not need to be re-executed since their output is stored in a global file system Master failure Periodic checkpoints are written to handle master failure If the master task dies, a new copy can be started from the last checkpoint state

106 Data Locality Network bandwidth is a valuable, scarce resource and it should be consumed wisely The distributed file system replicates data across different nodes The Master takes these locations into account when scheduling Map tasks, trying to place them with the data Failing that, Map tasks are scheduled to reside near a replica of the data (e.g., on a worker machine that is on the same network switch) When running large MapReduce operations, most input data is read locally and consumes no network bandwidth Data locality works well with a Hadoop-specific distributed file system Integration of a Cloud-based file system incurs extra cost and loses data locality

107 Task Granularity Finely granular tasks: many more map tasks than machines Better dynamic load balancing Minimizes time for fault recovery Can pipeline the shuffling/grouping while maps are still running Typically 200k Map tasks, 5k Reduce tasks for 2k hosts For M map tasks and R reduce tasks there are O(M+R) scheduling decisions and O(M*R) states

108 Load Balancing Built-in dynamic load balancing One other problem that can slow calculations is the existence of stragglers; machines suffering from either hardware defects, contention for resources with other applications etc. When an overall MapReduce operation passes some point deemed to be nearly complete, the Master schedules backup tasks for all of the currently in-progress tasks When a particular task is completed, whether it be original or back-up, its value is used This strategy costs little more overall, but can result in big performance gains

109 Refinements Partitioning function MapReduce users specify the number of reduce tasks/output files (R) Data gets partitioned across these tasks using a partitioning function on the intermediate key Default is hash(key) mod R, resulting in well balanced partitions Special partitioning function can also be used, such as hash(hostname(urlkey)) to combine all URLs (output keys) from the same host to the same output file Ordering guarantees Within a given partition, the intermediate key/value pairs are processed in increasing key order This ordering guarantee makes it easy to generate a sorted output file per partition Allows users to have sorted output and efficient access lookups by key
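A small sketch of the default partitioning rule described above, partition = hash(key) mod R. A stable hash (md5) is used here so the assignment is repeatable across runs; R is the number of reduce tasks (this is plain Python, not Hadoop's Partitioner class):

    import hashlib

    def partition(key, R):
        """Assign a key to one of R reduce tasks."""
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % R

    R = 4
    for key in ["Hello", "World", "Bye", "Hadoop", "Goodbye"]:
        print(key, "-> reducer", partition(key, R))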

110 Refinements (Cont'd) Combiner function There can be significant repetition in the intermediate keys produced by each map task, and the reduce function is associative While one reduce task can perform the aggregation, an on-processor combiner function can be used to perform partial merging of Map output locally before sending it over the network The combiner function is executed on each machine that performs a map task The program logic for the combiner function and the reduce task is potentially the same, except for how the output is handled, i.e. writing output to an intermediate file or to the final output file Input/Output types Multiple input/output formats are supported The user can also add support for a new input/output type by providing an implementation of the reader/writer interface
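As a sketch of the combiner idea above, a word-count combiner applies the same summation as the reducer, but locally to a single map task's output, so far fewer pairs cross the network (illustrative Python, not the Hadoop Combiner API):

    from collections import defaultdict

    def combine(map_output):
        """Partially merge one map task's (word, count) pairs before the shuffle."""
        partial = defaultdict(int)
        for word, count in map_output:
            partial[word] += count
        return list(partial.items())

    map_output = [("the", 1), ("cat", 1), ("the", 1), ("the", 1), ("sat", 1)]
    print(combine(map_output))    # [('the', 3), ('cat', 1), ('sat', 1)]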

111 Refinements (Cont'd) Skipping bad records MapReduce provides a mode for skipping records that are diagnosed to cause Map() crashes Each worker process installs a signal handler that catches segmentation violations and bus errors, tracked by the master When the master notices more than one failure on a particular record, it indicates that the record should be skipped during re-execution Local execution/debugging Not straightforward due to the distributed computation of MapReduce An alternative implementation of the MapReduce library executes sequentially on one node (the local machine) Users can use any debugging or testing tools they find useful

112 Refinements (Cont'd) Status information The master contains an internal http server to produce status pages with information on how many tasks have been completed, how many are in progress, bytes of input, bytes of intermediate data, bytes of output, and processing rates. The status page contains links to the standard error and standard output files generated by each task A user can monitor progress, predict computation time and accelerate it by adding more hosts Counters A facility to count occurrences of various events To use this facility, user code creates a named counter object and then increments the counter appropriately in the Map and/or Reduce function

113 MapReduce Applications Applications Text tokenization (alert systems), indexing, and search Data mining, statistical modeling, and machine learning Healthcare parse, clean and reconcile extremely large amounts of data Biosciences drug discovery, meta-genomics, bioassay activities Cost-effective mash-ups retrieving and analyzing biomedical knowledge Computational biology parallelize bioinformatics algorithms for SNP discovery, genotyping and personal genomics, e.g. CloudBurst Emergency response real-time monitoring/forecasting for operational decision support and so on MapReduce inapplicability Database management does not provide traditional DBMS features Database implementation lack of schema, low data integrity Normalization poses problems for MapReduce, due to non-local reading Applications that need to read and write data many times are a poor fit

114 How Hadoop Runs a MapReduce job Client submits MapReduce job JobTracker coordinates job run TaskTracker runs split tasks HDFS is used for file storage Hadoop: The Definitive Guide, O'Reilly

115 Streaming and Pipes Hadoop Streaming is an API to MapReduce for writing non-Java map and reduce functions Hadoop and the user program communicate using standard I/O streams Hadoop Pipes is the C++ interface to MapReduce It uses a socket as the channel to communicate with the process running the C++ Map or Reduce function Hadoop: The Definitive Guide, O'Reilly
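As a sketch of the Streaming model above, a word-count mapper and reducer in Python: the mapper reads raw lines on stdin and writes tab-separated key/value pairs, and the reducer receives those pairs sorted by key and sums the counts (the script name and the map/reduce switch are illustrative; submission via the hadoop-streaming jar is omitted):

    #!/usr/bin/env python3
    import sys

    def do_map():
        for line in sys.stdin:
            for word in line.split():
                print(f"{word}\t1")            # emit (word, 1) as "word<TAB>1"

    def do_reduce():
        current, total = None, 0
        for line in sys.stdin:                 # input arrives sorted by key
            word, count = line.rstrip("\n").split("\t")
            if word != current:
                if current is not None:
                    print(f"{current}\t{total}")
                current, total = word, 0
            total += int(count)
        if current is not None:
            print(f"{current}\t{total}")

    if __name__ == "__main__":
        do_map() if sys.argv[1:] == ["map"] else do_reduce()

Locally the pipeline can be approximated by piping the input files through the map mode, then a shell sort, then the reduce mode.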

116 Progress and Status Updates Operations constituting progress Reading an input record Writing an output record Setting a status description Incrementing a counter Calling the progress() method Hadoop: The Definitive Guide, O'Reilly

117 Hadoop Failures Task failure A map or reduce task throws a runtime exception For streaming tasks, streaming processes exiting with a non-zero exit code are considered failed A task can also be killed and re-scheduled Tasktracker failure A crash or slow execution can cause heartbeats to the jobtracker to become infrequent (or stop) A tasktracker can also be blacklisted by the jobtracker if it fails a significant number of tasks, higher than the average task failure rate Jobtracker failure Single point of failure - no mechanism to deal with it One solution is to run multiple jobtrackers or have a backup jobtracker

118 Checkpointing in Hadoop Hadoop: The Definitive Guide, O'Reilly

119 Job Scheduling in Hadoop Started with FIFO scheduling and now comes with a choice of schedulers The fair scheduler Aims to give every user a fair share of the cluster capacity over time Jobs are placed in pools and by default each user gets their own pool Supports preemption: capacity can be reclaimed from an over-capacity pool and given to an under-capacity pool The capacity scheduler A slightly different approach to multi-user scheduling A cluster is made up of a number of queues, which may be hierarchical, and each queue has an allocated capacity Within each queue jobs are scheduled using FIFO scheduling, with priorities

120 MapReduce and HDFS Amir Payberah

121 Fault Tolerance I On worker failure: Detect failure via periodic heartbeats. Re-execute in-progress map and reduce tasks. Re-execute completed map tasks: their output is stored on the local disk of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global filesystem. I On master failure: State is periodically checkpointed: a new copy of master starts from the last checkpoint state. Amir Payberah

122 EXTRA MATERIAL

123 CEPH A HDFS REPLACEMENT

124 What is Ceph? Ceph is a distributed, highly available, unified object, block and file storage system with no SPOF, running on commodity hardware [Diagram: applications, hosts/VMs and clients all consume Ceph storage]

125 Ceph Architecture Host Level At the host level we have Object Storage Devices (OSDs) and Monitors Monitors keep track of the components of the Ceph cluster (i.e. where the OSDs are) The device, host, rack, row, and room are stored by the Monitors and used to compute a failure domain OSDs store the Ceph data objects A host can run multiple OSDs, but it needs to be appropriately provisioned [OSD backing filesystems: btrfs, xfs, ext4]

126 Ceph Architecture Block Level At the block device level... An Object Storage Device (OSD) can be an entire drive, a partition, or a folder OSDs must be formatted in ext4, XFS, or btrfs (experimental). [Diagram: pools sit on top of OSDs; each OSD sits on a local filesystem on a drive or partition]

127 Ceph Architecture Data Organization Level At the data organization level Data are partitioned into pools Pools contain a number of Placement Groups (PGs) Ceph data objects map to PGs (via a modulo of a hash of the name) PGs then map to multiple OSDs [Diagram: pool "mydata" containing objects grouped into PG #1 and PG #2, each stored on several OSDs]
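A purely illustrative Python sketch of the two-step mapping just described: object name to placement group by hashing, then placement group to OSDs. The PG-to-OSD step is shown here as a static table; in Ceph it is computed by CRUSH, which is far more sophisticated (the pool name, PG count and OSD names are made up):

    import hashlib

    PG_COUNT = 4
    pg_to_osds = {0: ["osd.1", "osd.4", "osd.7"],
                  1: ["osd.2", "osd.5", "osd.8"],
                  2: ["osd.3", "osd.6", "osd.9"],
                  3: ["osd.1", "osd.5", "osd.9"]}

    def locate(pool, object_name):
        digest = hashlib.md5(f"{pool}/{object_name}".encode()).hexdigest()
        pg = int(digest, 16) % PG_COUNT        # object name -> placement group
        return pg, pg_to_osds[pg]              # placement group -> OSDs (CRUSH in real Ceph)

    print(locate("mydata", "obj-0001"))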

128 Ceph Placement Groups Ceph shards a pool into placement groups distributed evenly and pseudo-randomly across the cluster The CRUSH algorithm dynamically assigns each object to a placement group and each placement group to a set of Ceph OSDs, creating a layer of indirection between the Ceph client and the OSDs storing the copies of an object This layer of indirection allows the Ceph storage cluster to re-balance dynamically when new Ceph OSDs come online or when Ceph OSDs fail RedHat Ceph Architecture v1.2.3

129 Ceph Architecture Overall View [Diagram: apps, hosts/VMs and clients access Ceph through RadosGW (S3/Swift), RBD and CephFS, all built on LibRados/RADOS; MON and MDS daemons manage cluster nodes running OSDs; pools are divided into placement groups, mapped onto OSDs by the CRUSH map] (source: activities/tf-storage/ws16/slides/low_cost_storage_cephopenstack_swift.pdf)

130 Ceph Architecture RADOS An Application interacts with a RADOS cluster RADOS (Reliable Autonomic Distributed Object Store) is a distributed object service that manages the distribution, replication, and migration of objects On top of that reliable storage abstraction Ceph builds a range of services, including a block storage abstraction (RBD, or Rados Block Device) and a cache-coherent distributed file system (CephFS). RADOS CLUSTER

131 Ceph Architecture RADOS Components OSDs:! 10s to 10000s in a cluster! One per disk (or one per SSD, RAID group )! Serve stored objects to clients! Intelligently peer for replication & recovery Monitors:! Maintain cluster membership and state! Provide consensus for distributed decision-making! Small, odd number! These do not serve stored objects to clients

132 Ceph Architecture Where Do Objects Live???

133 Ceph Architecture Where Do Objects Live? Option 1: contact a metadata server?

134 Ceph Architecture Where Do Objects Live? Option 2: or calculate the placement via a static mapping (e.g. buckets A-G, H-N, O-T, U-Z)?

135 Ceph Architecture CRUSH Maps RADOS CLUSTER

136 Ceph Architecture CRUSH Maps RADOS CLUSTER Data objects are distributed across Object Storage Devices (OSD), which refers to either physical or logical storage units, using CRUSH (Controlled Replication Under Scalable Hashing) CRUSH is a deterministic hashing function that allows administrators to define flexible placement policies over a hierarchical cluster structure (e.g., disks, hosts, racks, rows, datacenters) The location of objects can be calculated based on the object identifier and cluster layout (similar to consistent hashing), thus there is no need for a metadata index or server for the RADOS object store

137 Ceph Architecture CRUSH 1/2 RADOS CLUSTER

138 Ceph Architecture CRUSH 2/2 CRUSH:! Pseudo-random placement algorithm! Fast calculation, no lookup! Repeatable, deterministic! Statistically uniform distribution! Stable mapping! Limited data migration on change! Rule-based configuration! Infrastructure topology aware! Adjustable replication! Weighting

139 Ceph Architecture librados LIBRADOS:! Direct access to RADOS for applications! C, C++, Python, PHP, Java, Erlang! Direct access to storage nodes! No HTTP overhead (applications talk to the RADOS cluster over a direct socket connection) (source: storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf)

140 Ceph Architecture RADOS Gateway RADOSGW:! REST-based object storage proxy! Uses RADOS to store objects! API supports buckets, accounts! Usage accounting for billing! Compatible with S3 and Swift applications [Diagram: clients speak REST to RadosGW, which talks to the RADOS cluster over a socket] (source: storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf)

141 Ceph Architecture RADOS Block Device (RBD) 1/3 RADOS BLOCK DEVICE:! Storage of disk images in RADOS! Decouples VMs from host! Images are striped across the cluster (pool)! Snapshots! Copy-on-write clones! Support in:! Mainline Linux Kernel ( )! Qemu/KVM, native Xen coming soon! OpenStack, CloudStack, Nebula, Proxmox

142 Ceph Architecture RADOS Block Device (RBD) 2/3 Virtual Machine storage using RDB RADOS CLUSTER Live Migration using RBD RADOS CLUSTER

143 Ceph Architecture RADOS Block Device (RBD) 3/3 Direct host access from Linux RADOS CLUSTER

144 Ceph Architecture CephFS POSIX F/S METADATA SERVER! Manages metadata for a POSIX-compliant shared filesystem! Directory hierarchy! File metadata (owner, timestamps, mode, etc.)! Stores metadata in RADOS! Does not serve file data to clients! Only required for the shared filesystem [Diagram: clients send metadata operations to the MDS and data operations directly to the RADOS cluster]

145 Ceph Read/Write Flows (source: software.intel.com/en-us/blogs/2015/04/06/ceph-erasure-coding-introduction)

146 Ceph Replicated I/O RedHat Ceph Architecture v1.2.3

147 Ceph Erasure Coding 1/5 Erasure coding is a theory that started in the 1960s. The most famous algorithm is Reed-Solomon. Many variations came out, like Fountain Codes, Pyramid Codes and Locally Repairable Codes. An erasure code usually defines the number of total disks (N) and the number of data disks (K), and it can tolerate N-K failures with an overhead of N/K E.g. a typical Reed-Solomon scheme: (8, 5), where 8 is the total number of disks and 5 is the number of data disks. [Figure: layout of data and coding chunks across the 8 disks] RS (8, 5) can tolerate 3 arbitrary failures. If some data chunks are missing, the remaining available chunks can be used to restore the original content.
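The bookkeeping for an (N, K) scheme in the slide can be spelled out in a few lines of Python (the replication comparison is added here only for context):

    # K data chunks plus N-K coding chunks: tolerates N-K lost chunks,
    # at a raw-storage overhead of N/K.
    def erasure_profile(n_total, k_data):
        return {"data chunks": k_data,
                "coding chunks": n_total - k_data,
                "failures tolerated": n_total - k_data,
                "storage overhead": n_total / k_data}

    print(erasure_profile(8, 5))   # RS(8, 5): 3 failures tolerated, 1.6x overhead
    # Compare with 3-way replication: tolerates 2 failures at 3.0x overhead.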


More information

CISC 7610 Lecture 2b The beginnings of NoSQL

CISC 7610 Lecture 2b The beginnings of NoSQL CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone

More information

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team Introduction to Hadoop Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since

More information

Introduction to MapReduce. Instructor: Dr. Weikuan Yu Computer Sci. & Software Eng.

Introduction to MapReduce. Instructor: Dr. Weikuan Yu Computer Sci. & Software Eng. Introduction to MapReduce Instructor: Dr. Weikuan Yu Computer Sci. & Software Eng. Before MapReduce Large scale data processing was difficult! Managing hundreds or thousands of processors Managing parallelization

More information

Map-Reduce. Marco Mura 2010 March, 31th

Map-Reduce. Marco Mura 2010 March, 31th Map-Reduce Marco Mura (mura@di.unipi.it) 2010 March, 31th This paper is a note from the 2009-2010 course Strumenti di programmazione per sistemi paralleli e distribuiti and it s based by the lessons of

More information

The Google File System

The Google File System October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single

More information

The Google File System. Alexandru Costan

The Google File System. Alexandru Costan 1 The Google File System Alexandru Costan Actions on Big Data 2 Storage Analysis Acquisition Handling the data stream Data structured unstructured semi-structured Results Transactions Outline File systems

More information

50 Must Read Hadoop Interview Questions & Answers

50 Must Read Hadoop Interview Questions & Answers 50 Must Read Hadoop Interview Questions & Answers Whizlabs Dec 29th, 2017 Big Data Are you planning to land a job with big data and data analytics? Are you worried about cracking the Hadoop job interview?

More information

Introduction to MapReduce (cont.)

Introduction to MapReduce (cont.) Introduction to MapReduce (cont.) Rafael Ferreira da Silva rafsilva@isi.edu http://rafaelsilva.com USC INF 553 Foundations and Applications of Data Mining (Fall 2018) 2 MapReduce: Summary USC INF 553 Foundations

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) Hadoop Lecture 24, April 23, 2014 Mohammad Hammoud Today Last Session: NoSQL databases Today s Session: Hadoop = HDFS + MapReduce Announcements: Final Exam is on Sunday April

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Google MapReduce

More information

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung December 2003 ACM symposium on Operating systems principles Publisher: ACM Nov. 26, 2008 OUTLINE INTRODUCTION DESIGN OVERVIEW

More information

MapReduce Simplified Data Processing on Large Clusters

MapReduce Simplified Data Processing on Large Clusters MapReduce Simplified Data Processing on Large Clusters Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) MapReduce 1393/8/5 1 /

More information

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

Hadoop/MapReduce Computing Paradigm

Hadoop/MapReduce Computing Paradigm Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications

More information

A Glimpse of the Hadoop Echosystem

A Glimpse of the Hadoop Echosystem A Glimpse of the Hadoop Echosystem 1 Hadoop Echosystem A cluster is shared among several users in an organization Different services HDFS and MapReduce provide the lower layers of the infrastructures Other

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

Chapter 5. The MapReduce Programming Model and Implementation

Chapter 5. The MapReduce Programming Model and Implementation Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing

More information

18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E.

18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E. 18-hdfs-gfs.txt Thu Oct 27 10:05:07 2011 1 Notes on Parallel File Systems: HDFS & GFS 15-440, Fall 2011 Carnegie Mellon University Randal E. Bryant References: Ghemawat, Gobioff, Leung, "The Google File

More information

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework Scientific Journal of Impact Factor (SJIF): e-issn (O): 2348- International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 A Study: Hadoop Framework Devateja

More information

CCA-410. Cloudera. Cloudera Certified Administrator for Apache Hadoop (CCAH)

CCA-410. Cloudera. Cloudera Certified Administrator for Apache Hadoop (CCAH) Cloudera CCA-410 Cloudera Certified Administrator for Apache Hadoop (CCAH) Download Full Version : http://killexams.com/pass4sure/exam-detail/cca-410 Reference: CONFIGURATION PARAMETERS DFS.BLOCK.SIZE

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Clustering Lecture 8: MapReduce

Clustering Lecture 8: MapReduce Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data

More information

CS60021: Scalable Data Mining. Sourangshu Bhattacharya

CS60021: Scalable Data Mining. Sourangshu Bhattacharya CS60021: Scalable Data Mining Sourangshu Bhattacharya In this Lecture: Outline: HDFS Motivation HDFS User commands HDFS System architecture HDFS Implementation details Sourangshu Bhattacharya Computer

More information

CS427 Multicore Architecture and Parallel Computing

CS427 Multicore Architecture and Parallel Computing CS427 Multicore Architecture and Parallel Computing Lecture 9 MapReduce Prof. Li Jiang 2014/11/19 1 What is MapReduce Origin from Google, [OSDI 04] A simple programming model Functional model For large-scale

More information

CS November 2017

CS November 2017 Bigtable Highly available distributed storage Distributed Systems 18. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account

More information

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2017

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2017 Distributed Systems 15. Distributed File Systems Paul Krzyzanowski Rutgers University Fall 2017 1 Google Chubby ( Apache Zookeeper) 2 Chubby Distributed lock service + simple fault-tolerant file system

More information

UNIT-IV HDFS. Ms. Selva Mary. G

UNIT-IV HDFS. Ms. Selva Mary. G UNIT-IV HDFS HDFS ARCHITECTURE Dataset partition across a number of separate machines Hadoop Distributed File system The Design of HDFS HDFS is a file system designed for storing very large files with

More information

Map Reduce. Yerevan.

Map Reduce. Yerevan. Map Reduce Erasmus+ @ Yerevan dacosta@irit.fr Divide and conquer at PaaS 100 % // Typical problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 1: Distributed File Systems GFS (The Google File System) 1 Filesystems

More information

Introduction to Map Reduce

Introduction to Map Reduce Introduction to Map Reduce 1 Map Reduce: Motivation We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate

More information

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University CPSC 426/526 Cloud Computing Ennan Zhai Computer Science Department Yale University Recall: Lec-7 In the lec-7, I talked about: - P2P vs Enterprise control - Firewall - NATs - Software defined network

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2017 Lecture 26 File Systems Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 FAQ Cylinders: all the platters?

More information

Map Reduce Group Meeting

Map Reduce Group Meeting Map Reduce Group Meeting Yasmine Badr 10/07/2014 A lot of material in this presenta0on has been adopted from the original MapReduce paper in OSDI 2004 What is Map Reduce? Programming paradigm/model for

More information

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Programming Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Computing Only required amount of CPU and storage can be used anytime from anywhere via network Availability, throughput, reliability

More information

CS /30/17. Paul Krzyzanowski 1. Google Chubby ( Apache Zookeeper) Distributed Systems. Chubby. Chubby Deployment.

CS /30/17. Paul Krzyzanowski 1. Google Chubby ( Apache Zookeeper) Distributed Systems. Chubby. Chubby Deployment. Distributed Systems 15. Distributed File Systems Google ( Apache Zookeeper) Paul Krzyzanowski Rutgers University Fall 2017 1 2 Distributed lock service + simple fault-tolerant file system Deployment Client

More information

GFS: The Google File System

GFS: The Google File System GFS: The Google File System Brad Karp UCL Computer Science CS GZ03 / M030 24 th October 2014 Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one

More information

CA485 Ray Walshe Google File System

CA485 Ray Walshe Google File System Google File System Overview Google File System is scalable, distributed file system on inexpensive commodity hardware that provides: Fault Tolerance File system runs on hundreds or thousands of storage

More information

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab CS6030 Cloud Computing Ajay Gupta B239, CEAS Computer Science Department Western Michigan University ajay.gupta@wmich.edu 276-3104 1 Acknowledgements I have liberally borrowed these slides and material

More information

Hadoop Distributed File System(HDFS)

Hadoop Distributed File System(HDFS) Hadoop Distributed File System(HDFS) Bu eğitim sunumları İstanbul Kalkınma Ajansı nın 2016 yılı Yenilikçi ve Yaratıcı İstanbul Mali Destek Programı kapsamında yürütülmekte olan TR10/16/YNY/0036 no lu İstanbul

More information

Google File System. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google fall DIP Heerak lim, Donghun Koo

Google File System. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google fall DIP Heerak lim, Donghun Koo Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google 2017 fall DIP Heerak lim, Donghun Koo 1 Agenda Introduction Design overview Systems interactions Master operation Fault tolerance

More information

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2016

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2016 Distributed Systems 15. Distributed File Systems Paul Krzyzanowski Rutgers University Fall 2016 1 Google Chubby 2 Chubby Distributed lock service + simple fault-tolerant file system Interfaces File access

More information

4/9/2018 Week 13-A Sangmi Lee Pallickara. CS435 Introduction to Big Data Spring 2018 Colorado State University. FAQs. Architecture of GFS

4/9/2018 Week 13-A Sangmi Lee Pallickara. CS435 Introduction to Big Data Spring 2018 Colorado State University. FAQs. Architecture of GFS W13.A.0.0 CS435 Introduction to Big Data W13.A.1 FAQs Programming Assignment 3 has been posted PART 2. LARGE SCALE DATA STORAGE SYSTEMS DISTRIBUTED FILE SYSTEMS Recitations Apache Spark tutorial 1 and

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Spring 2018 Lecture 24 Mass Storage, HDFS/Hadoop Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 FAQ What 2

More information

MapReduce, Hadoop and Spark. Bompotas Agorakis

MapReduce, Hadoop and Spark. Bompotas Agorakis MapReduce, Hadoop and Spark Bompotas Agorakis Big Data Processing Most of the computations are conceptually straightforward on a single machine but the volume of data is HUGE Need to use many (1.000s)

More information

18-hdfs-gfs.txt Thu Nov 01 09:53: Notes on Parallel File Systems: HDFS & GFS , Fall 2012 Carnegie Mellon University Randal E.

18-hdfs-gfs.txt Thu Nov 01 09:53: Notes on Parallel File Systems: HDFS & GFS , Fall 2012 Carnegie Mellon University Randal E. 18-hdfs-gfs.txt Thu Nov 01 09:53:32 2012 1 Notes on Parallel File Systems: HDFS & GFS 15-440, Fall 2012 Carnegie Mellon University Randal E. Bryant References: Ghemawat, Gobioff, Leung, "The Google File

More information

The amount of data increases every day Some numbers ( 2012):

The amount of data increases every day Some numbers ( 2012): 1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect

More information

2/26/2017. The amount of data increases every day Some numbers ( 2012):

2/26/2017. The amount of data increases every day Some numbers ( 2012): The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect to

More information

CSE 124: Networked Services Fall 2009 Lecture-19

CSE 124: Networked Services Fall 2009 Lecture-19 CSE 124: Networked Services Fall 2009 Lecture-19 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa09/cse124 Some of these slides are adapted from various sources/individuals including but

More information

Informa)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies

Informa)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies Informa)on Retrieval and Map- Reduce Implementa)ons Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies mas4108@louisiana.edu Map-Reduce: Why? Need to process 100TB datasets On 1 node:

More information

The Google File System

The Google File System The Google File System By Ghemawat, Gobioff and Leung Outline Overview Assumption Design of GFS System Interactions Master Operations Fault Tolerance Measurements Overview GFS: Scalable distributed file

More information

Distributed Systems CS6421

Distributed Systems CS6421 Distributed Systems CS6421 Intro to Distributed Systems and the Cloud Prof. Tim Wood v I teach: Software Engineering, Operating Systems, Sr. Design I like: distributed systems, networks, building cool

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model

More information

Google File System and BigTable. and tiny bits of HDFS (Hadoop File System) and Chubby. Not in textbook; additional information

Google File System and BigTable. and tiny bits of HDFS (Hadoop File System) and Chubby. Not in textbook; additional information Subject 10 Fall 2015 Google File System and BigTable and tiny bits of HDFS (Hadoop File System) and Chubby Not in textbook; additional information Disclaimer: These abbreviated notes DO NOT substitute

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

Actual4Dumps. Provide you with the latest actual exam dumps, and help you succeed

Actual4Dumps.   Provide you with the latest actual exam dumps, and help you succeed Actual4Dumps http://www.actual4dumps.com Provide you with the latest actual exam dumps, and help you succeed Exam : HDPCD Title : Hortonworks Data Platform Certified Developer Vendor : Hortonworks Version

More information

CS November 2018

CS November 2018 Bigtable Highly available distributed storage Distributed Systems 19. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account

More information

The Hadoop Distributed File System Konstantin Shvachko Hairong Kuang Sanjay Radia Robert Chansler

The Hadoop Distributed File System Konstantin Shvachko Hairong Kuang Sanjay Radia Robert Chansler The Hadoop Distributed File System Konstantin Shvachko Hairong Kuang Sanjay Radia Robert Chansler MSST 10 Hadoop in Perspective Hadoop scales computation capacity, storage capacity, and I/O bandwidth by

More information

CS 61C: Great Ideas in Computer Architecture. MapReduce

CS 61C: Great Ideas in Computer Architecture. MapReduce CS 61C: Great Ideas in Computer Architecture MapReduce Guest Lecturer: Justin Hsia 3/06/2013 Spring 2013 Lecture #18 1 Review of Last Lecture Performance latency and throughput Warehouse Scale Computing

More information

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? Simple to start What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? What is the maximum download speed you get? Simple computation

More information

What Is Datacenter (Warehouse) Computing. Distributed and Parallel Technology. Datacenter Computing Architecture

What Is Datacenter (Warehouse) Computing. Distributed and Parallel Technology. Datacenter Computing Architecture What Is Datacenter (Warehouse) Computing Distributed and Parallel Technology Datacenter, Warehouse and Cloud Computing Hans-Wolfgang Loidl School of Mathematical and Computer Sciences Heriot-Watt University,

More information

Certified Big Data and Hadoop Course Curriculum

Certified Big Data and Hadoop Course Curriculum Certified Big Data and Hadoop Course Curriculum The Certified Big Data and Hadoop course by DataFlair is a perfect blend of in-depth theoretical knowledge and strong practical skills via implementation

More information

L5-6:Runtime Platforms Hadoop and HDFS

L5-6:Runtime Platforms Hadoop and HDFS Indian Institute of Science Bangalore, India भ रत य व ज ञ न स स थ न ब गल र, भ रत Department of Computational and Data Sciences SE256:Jan16 (2:1) L5-6:Runtime Platforms Hadoop and HDFS Yogesh Simmhan 03/

More information

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA CSE 3302 Lecture 11: Map/Reduce 7 October 2010 Nate Nystrom UTA 378,000 results in 0.17 seconds including images and video communicates with 1000s of machines web server index servers document servers

More information

CSE 124: Networked Services Lecture-16

CSE 124: Networked Services Lecture-16 Fall 2010 CSE 124: Networked Services Lecture-16 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa10/cse124 11/23/2010 CSE 124 Networked Services Fall 2010 1 Updates PlanetLab experiments

More information

Page 1. Goals for Today" Background of Cloud Computing" Sources Driving Big Data" CS162 Operating Systems and Systems Programming Lecture 24

Page 1. Goals for Today Background of Cloud Computing Sources Driving Big Data CS162 Operating Systems and Systems Programming Lecture 24 Goals for Today" CS162 Operating Systems and Systems Programming Lecture 24 Capstone: Cloud Computing" Distributed systems Cloud Computing programming paradigms Cloud Computing OS December 2, 2013 Anthony

More information

A brief history on Hadoop

A brief history on Hadoop Hadoop Basics A brief history on Hadoop 2003 - Google launches project Nutch to handle billions of searches and indexing millions of web pages. Oct 2003 - Google releases papers with GFS (Google File System)

More information

BigTable: A Distributed Storage System for Structured Data

BigTable: A Distributed Storage System for Structured Data BigTable: A Distributed Storage System for Structured Data Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) BigTable 1393/7/26

More information